# Brainmaker

Nanos gigantium humeris insidentes!

## How to Use Ratings

• http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
• http://www.evanmiller.org/bayesian-average-ratings.html
• http://www.evanmiller.org/ranking-items-with-star-ratings.html

Popularity by itself is meaningless; the wisdom of the average crowd is informative only for entertainment. Personalization based purely on personal interest converges to a small, concentrated subset: if you keep narrowing the scope, you eventually end up with the top-10 best sellers of that genre.

What makes a new generation of recommendation systems?

1. Some level of diversity: people grow aesthetically tired of seeing the same things.
2. High-quality guiding sources: just as when you follow a few hot shots on Twitter, you get information from them, not from average people.
3. "Editor picks for you" requires a human editor: the editor need not pick everything, but should provide sources and probably quality assurance.
4. 80/20: some portion of human work beats a pure machine. That 20% need not be repetitive labor; it could be collaboration, such as providing sources.
5. Design a new strategy: do not optimize over crap; think from scratch.
6. Cocktail strategy: mirror how people explore new things with a little bit of everything (best sellers, trending items, personal interests, people who share similar taste, something different).

## If a person is born deaf, which language do they think in?

http://www.quora.com/Human-Brain/If-a-person-is-born-deaf-which-language-do-they-think-in

## Learning Topics

Resampling

• bootstrapping test
• jackknifing test
• permutation test
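Of the three resampling methods listed, the permutation test is the simplest to sketch from scratch. Below is a minimal two-sample version on the difference of means; the function name and sample values are illustrative, not from the source.

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sample permutation test on the difference of means.

    Returns an approximate two-sided p-value: the fraction of label
    shufflings whose mean difference is at least as extreme as observed.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Two clearly separated samples should yield a small p-value.
p = permutation_test([5.1, 5.3, 4.9, 5.2], [6.8, 7.1, 6.9, 7.0])
```

The bootstrap and jackknife follow the same resample-then-recompute pattern, but resample with replacement or leave one observation out, respectively.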

Learning Models

• Regression
  • OLS
  • GLM
    • linear component
    • error structures
• Classification
  • SVM
  • Ordinal Classification

Goodness of Fit

• Pearson Chi-square test
• Root mean square error
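Root mean square error is simple enough to compute directly; a small sketch (the prediction and observation values are made up for illustration):

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predictions and observations."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

error = rmse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0])
```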

Lab

• t-test
• Chi-squared test

## Four Assumptions Of Multiple Regression That Researchers Should Always Test

From: http://pareonline.net/getvn.asp?n=2&v=8

1. Variables are normally distributed.
2. There is a linear relationship between the independent and dependent variable(s).
3. Variables are measured without error (reliably).
4. Homoscedasticity.
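A quick way to probe the normality and homoscedasticity assumptions on a fitted model is to inspect the residuals. The sketch below generates synthetic data that satisfies the assumptions, fits OLS, and runs crude diagnostics; all names and numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data satisfying the assumptions: linear signal, normal
# homoscedastic errors (constant variance), no measurement error in x.
n = 200
x = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, n)

# OLS fit via least squares on the design matrix [1, x].
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

# Crude diagnostics: with an intercept, residuals are centred at zero;
# |residuals| should show no trend against the fitted values (a strong
# trend would suggest heteroscedasticity, violating assumption 4).
trend = np.corrcoef(fitted, np.abs(resid))[0, 1]
```

In practice one would also look at a residual-vs-fitted plot and a normal Q-Q plot of the residuals rather than relying on a single correlation number.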

## Recommendation System

The distinction between the physical and on-line worlds has been called the long tail phenomenon, and it is suggested in Fig. 9.2. The vertical axis represents popularity (the number of times an item is chosen). The items are ordered on the horizontal axis according to their popularity. Physical institutions provide only the most popular items to the left of the vertical line, while the corresponding on-line institutions provide the entire range of items: the tail as well as the popular items.

The long tail: physical institutions can only provide what is popular, while on-line institutions can make everything available.

There are two basic architectures for a recommendation system:

1. Content-Based systems focus on properties of items. Similarity of items is determined by measuring the similarity in their properties.

2. Collaborative-Filtering systems focus on the relationship between users and items. Similarity of items is determined by the similarity of the ratings given to those items by the users who rated them; similarity of users is determined by the items they rate.

1. Content-based

Item Profile: In a content-based system, we must construct for each item a profile, which is a record or collection of records representing important characteristics of that item.

User Profile:

We not only need to create vectors describing items; we also need to create vectors with the same components that describe the user's preferences.

With profile vectors for both users and items, we can estimate the degree to which a user would prefer an item by computing the cosine distance between the user’s and item’s vectors.
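The cosine computation above can be sketched directly; the three-component profiles below (e.g. weights over genres) are invented for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two profile vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical profiles over three features, e.g. (action, romance, sci-fi).
user = [0.9, 0.1, 0.8]
item = [1.0, 0.0, 1.0]
score = cosine_similarity(user, item)
```

A score near 1 means the vectors point in nearly the same direction, i.e. the item matches the user's preferences; a score near 0 means they are unrelated.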

2. Collaborative Filtering

Measure Similarity of Users

Cluster users or items
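One simple way to measure the similarity of users, when all we have is which items each user has rated (or liked), is the Jaccard similarity of their item sets. A minimal sketch; the user names and item identifiers are hypothetical.

```python
def jaccard_similarity(items_a, items_b):
    """Jaccard similarity between two users' sets of rated items:
    |intersection| / |union|, ranging from 0 (disjoint) to 1 (identical)."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

alice = {"item1", "item2", "item3"}
bob = {"item2", "item3", "item4"}
sim = jaccard_similarity(alice, bob)  # 2 shared items out of 4 distinct
```

When numeric ratings are available, cosine similarity or Pearson correlation over the rating vectors is the more common choice.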

## Naive Bayes Classifier: Probabilistic Model

Abstractly, the probability model for a classifier is a conditional model.
$p(C \mid F_1, \dots, F_n),$
over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. The problem is that if the number of features n is large or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.
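The standard reformulation applies Bayes' theorem and then the "naive" assumption that the features are conditionally independent given the class:

```latex
p(C \mid F_1, \dots, F_n)
  = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}
  \;\propto\; p(C) \prod_{i=1}^{n} p(F_i \mid C)
```

The denominator does not depend on $C$, so classification only needs the class prior $p(C)$ and the $n$ per-feature tables $p(F_i \mid C)$, which is tractable even for large $n$.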

## Probit Models — An Application Example

http://www.sts.uzh.ch/past/hs09/em/topic9a_p.pdf