Brainmaker

Nanos gigantum humeris insidentes!

classifier model references

  • July 19, 2010 11:09 pm

naive bayes classifier

http://www.statsoft.com/textbook/naive-bayes-classifier/

http://en.wikipedia.org/wiki/Naive_Bayes_classifier#The_naive_Bayes_probabilistic_model
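
For quick reference, the decision rule both links describe, in plain notation (assuming features f1…fm are conditionally independent given the class c):

    c* = argmax over c of P(c) · P(f1|c) · P(f2|c) · … · P(fm|c)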

support vector machine

http://www.statsoft.com/textbook/support-vector-machines/

Vladimir N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.

A few papers: check whether they are helpful

  • July 19, 2010 1:37 am

***Feature Subsumption for Opinion Analysis. Proceedings of EMNLP, 2006.
Semantic role extraction.
Has a fairly concrete implementation; print it out.
One of the implementations uses a program whose source code is not public.

Extracting Appraisal Expressions

Looks useful

Sentiment analysis: a combined approach

Contains a fairly detailed table comparing the results of several previous studies; very useful as a reference

Automated learning of appraisal extraction patterns.

Cannot download it

Sentiment analysis: a new approach for effective use of linguistic knowledge and exploiting similarities in a set of documents to be classified

Sentiment Classification using Word Sub-sequences and Dependency Sub-trees

Cannot download it

Assessing Sentiment of Text by Semantic Dependency and Contextual Valence Analysis

Cannot download it

Notes on Lexical Filtering and Overall Opinion Polarity Identification

  • July 18, 2010 4:56 pm

F. Salvetti, S. Lewis, C. Reichenbach. Impact of Lexical Filtering on Overall Opinion Polarity Identification

Flow

HTML documents were converted to plain text, tagged using the Brill tagger, and fed into filters and classifiers.

Basic Assumptions and Points:

Related Research

Research has demonstrated that there is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion (Wiebe, Bruce, & O’Hara 1999).

Hatzivassiloglou & McKeown (1997) combined a log-linear statistical model that examined conjunctions between adjectives (such as “and”, “but”, “or”) with a clustering algorithm that grouped the adjectives into two sets, which were then labelled positive and negative.

Turney extracted n-grams based on adjectives (Turney 2002). To determine whether an adjective had positive or negative polarity, he used AltaVista and its NEAR operator: he compared the number of co-occurrences of the adjective under investigation NEAR the adjective ‘excellent’ against those NEAR ‘poor’, on the assumption that frequent occurrence NEAR ‘excellent’ implies positive polarity.
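
A rough sketch of Turney's scoring (the AltaVista NEAR operator is long gone, so the hit counts are assumed to be given; the 0.01 smoothing term follows Turney 2002):

    import math

    def semantic_orientation(hits_near_excellent, hits_near_poor,
                             hits_excellent, hits_poor, eps=0.01):
        # SO > 0 suggests positive polarity, SO < 0 negative;
        # eps smooths away zero co-occurrence counts
        return math.log2(((hits_near_excellent + eps) * hits_poor) /
                         ((hits_near_poor + eps) * hits_excellent))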

Corpus

The Cornell data consists of 27,000 movie reviews in HTML form, using 35 different rating scales (such as A…F or 1…10) in addition to the common 5-star system. We divided them into two classes (positive and negative) and took 100 reviews from each class as the test set.

Methodology

Features for analysis

Three basic approaches for handling this kind of data pre-processing come to mind:

  • Leave the data as-is: each word is represented by itself
  • Part-of-speech tagging: each word is enriched with a POS tag, as determined by a standard tagging technique (such as the Brill tagger (Brill 1995))
  • Perform POS tagging and full parsing (using e.g. the Penn Treebank (Marcus, Santorini, & Marcinkiewicz 1994)); this has severe performance issues

We thus focus our analysis in this paper on POS-tagged data (sentences consisting of words enriched with information about their parts of speech).

We thus make the following assumptions about our test and training data:

  1. All words are transformed into upper case,
  2. All words are stemmed,
  3. All words are transformed into (word, POS) tuples by POS tagging (notation: word/POS).

All of these are computationally easy to achieve (with a reasonable amount of accuracy) using the Brill Tagger.
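
A minimal sketch of this preprocessing in Python, assuming NLTK (whose default tagger stands in here for the Brill tagger; the required NLTK models must be downloaded first):

    import nltk
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def preprocess(sentence):
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)   # tag before case-folding: taggers use case cues
        # stem and upper-case each word, pairing it with its POS tag (word/POS)
        return [(stemmer.stem(word).upper(), pos) for word, pos in tagged]

    print(preprocess("The acting was surprisingly good"))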

Experiments

Setting

  • Data: Cornell movie review data
  • Part-of-speech tagger: Brill tagger (Brill 1995)
  • WordNet: 1.7.1

Part of Speech Filters

Any portion that does not contribute to the overall opinion polarity (OvOP) is noise. To reduce noise, filters were developed that use POS tags to do the following:

  1. Introduce custom parts of speech when the tagger does not provide the desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review (determiners, prepositions, etc.)
  3. Reduce parts of speech that introduce unnecessary variance to their POS tags only

The POS filters are not designed to reduce the effects of conflicting polarity. They are only designed to reduce the effect of lack of polarity.

One design principle of the filter rules is that they filter out parts of speech that do not contribute to the semantic orientation and keep the parts of speech that do contribute such meaning. Based on analysis of movie review texts, we devised “filter rules” that take Brill-tagged text as input and return less noisy, more concentrated sentences that have a combination of words and word/POS-tag pairs removed from the original. A summary of the filter rules defined in this experiment is shown in Table 2.

Table 2: Summary of POS filter rules

POS   r1  r2  r3  r4  r5
JJ    K   K   K   K   K
RB    D   K   K   K   K
VBG   K   K   K   K   D
VBN   K   K   K   K   D
NN    G   G   G   G   G
VBZ   D   D   K   K   D
CC    D   D   D   K   K
COP   K   K   K   K   K

K: keep    D: drop    G: generalize

Wiebe et al., as well as other researchers, showed that subjectivity is especially concentrated in adjectives (Wiebe, Bruce, & O’Hara 1999; Turney & Littman 2003). Therefore, no adjectives or their tags were removed, nor were copula verbs or negative markers. However, noisy information such as determiners, foreign words, prepositions, modal verbs, possessives, particles, interjections, etc. was removed from the text stream. Other parts of speech, such as nouns and verbs, were removed but their POS tags were retained.
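
A sketch of one filter rule from Table 2 (r1), assuming the input is a list of (word, POS) pairs; treating unlisted tags as "keep" is my assumption, not the paper's:

    # Rule r1 from Table 2: K = keep, D = drop, G = generalize
    R1 = {'JJ': 'K', 'RB': 'D', 'VBG': 'K', 'VBN': 'K',
          'NN': 'G', 'VBZ': 'D', 'CC': 'D', 'COP': 'K'}

    def apply_filter(tagged, rule=R1, default='K'):
        filtered = []
        for word, pos in tagged:
            action = rule.get(pos, default)
            if action == 'K':              # keep the word/POS pair
                filtered.append((word, pos))
            elif action == 'G':            # generalize: keep only the POS tag
                filtered.append((pos, pos))
            # 'D': drop the pair entirely
        return filtered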


WordNet filtering

generalization

===============Summary by me===============

There is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion. (paper to read)

Turney extracted n-grams based on adjectives (Turney 2002). To determine whether an adjective had positive or negative polarity, he used AltaVista and its NEAR operator. (paper to read)

plain text ==(Brill tagger)==> Result 1 ==(POS filter)==> Result 2 ==(classifiers)==> Result 3

About the POS filter

  1. Introduce custom parts of speech when the tagger does not provide the desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review (determiners, prepositions, etc.)
  3. Reduce parts of speech that introduce unnecessary variance to their POS tags only

Precision and Recall

  • July 18, 2010 1:44 pm

http://en.wikipedia.org/wiki/Precision_and_recall

In an information retrieval scenario, Precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search, and Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved).
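
Equivalently, in set notation:

    Precision = |relevant ∩ retrieved| / |retrieved|
    Recall    = |relevant ∩ retrieved| / |relevant|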

Paper Structure

  • July 17, 2010 7:27 pm

General Structure

  • Introduction
  • Related work
  • Experimental (our) methodology
  • Evaluation
    • experimental setup
    • results
  • Discussion/conclusion
  • References

For My Paper

  • Introduction
  • Related Work
  • Methodology
    • corpus extension
    • feature selection
    • document vector generation
      • method 1
      • method 2
    • Classifiers
      • Bayes
      • SVM
  • Evaluation
    • Setup
    • Results
  • Discussion
  • References

Paper Progress

  • July 17, 2010 7:12 pm

By the evening of the 17th: finish the replication, and write up in full how the synopsis information is generated

By the afternoon of the 18th: finish the two basic schemes for document vector generation (one based on morpheme adjacency, one based on co-occurrence); both must be mature, proven schemes

(Also read those three papers)

Start writing the paper on the evening of the 18th; the main body of the paper must be completed

Polish it on the 19th; if possible, actually implement the system; the paper is finished

Submit on the 20th

After the 20th: the system must be actually implemented

The Replication of B. Pang’s Work on Sentiment Classification

  • July 17, 2010 4:03 pm

B. Pang, L. Lee. Thumbs up? Sentiment Classification using Machine Learning Techniques

Source

“Our  data source was the Internet Movie Database (IMDb) archive of the rec.arts.movies.reviews newsgroup(http://reviews.imdb.com/reviews). We selected only reviews where the author rating was expressed either with stars or some numerical value. Ratings were automatically extracted and converted into one of the three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment”

Corpus

“To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 reviews per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented.”

“To prepare the documents, we automatically removed the rating indicators and extracted the textual information from the original HTML document format, treating punctuation as separate lexical items. No stemming or stoplists were used”

“To create a data set with uniform class distribution (studying the effect of skewed class distributions was out of the scope of this study), we randomly selected 700 positive-sentiment and 700 negative-sentiment documents. We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold. “

Feature Selection

  • bigrams with frequency ≥ 7

“For this study, we focused on features based on unigrams (with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at least four times in our 1400 document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams, since we consider bigrams (and n-grams in general) to be an orthogonal way to incorporate context.”
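
A minimal sketch of that bigram cutoff, assuming the corpus is given as a list of token lists:

    from collections import Counter

    def frequent_bigrams(corpus, min_count=7):
        counts = Counter()
        for tokens in corpus:
            counts.update(zip(tokens, tokens[1:]))   # count adjacent pairs
        return {bg for bg, n in counts.items() if n >= min_count}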

Document Vector Generation

  • Frequency: ni(d) is the number of times fi occurs in document d, and d = (n1(d), n2(d), …, nm(d))

“To implement these machine learning algorithms on our document data, we used the following standard bag-of-features framework. Let {f1,…,fm} be a predefined set of m features that can appear in a document; examples include the word “still” or the bigram “really stinks”. Let ni(d) be the number of times fi occurs in document d. Then, each document d is represented by the document vector d:=(n1(d),n2(d),…,nm(d))”

  • Presence: ni(d) is either 1 or 0

“However, the definition of the MaxEnt feature/class functions Fi,c only reflects the presence or absence of a feature, rather than directly incorporating feature frequency. In order to investigate whether reliance on frequency information could account for the higher accuracies of Naive Bayes and SVMs, we binarized the document vectors, setting ni(d) to 1 if and only if feature fi appears in d, and reran Naive Bayes and SVMlight on these new vectors”
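
A sketch of both representations under the quoted framework (the toy document is made up for illustration; the features are the quote's own examples):

    from collections import Counter

    def doc_vector(tokens, features, presence=False):
        counts = Counter(tokens)                                # unigram counts
        counts.update(map(' '.join, zip(tokens, tokens[1:])))   # bigram counts
        if presence:
            return [1 if counts[f] else 0 for f in features]
        return [counts[f] for f in features]

    features = ['still', 'really stinks']
    tokens = ['it', 'really', 'stinks', 'but', 'still', 'still']
    print(doc_vector(tokens, features))                  # frequency: [2, 1]
    print(doc_vector(tokens, features, presence=True))   # presence:  [1, 1]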

SVM: SVMlight with default settings

“We used Joachims’ SVMlight package for training and testing, with all parameters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly)”
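
A sketch of the length normalization plus a linear SVM with default settings; scikit-learn's LinearSVC stands in for SVMlight here (my assumption, not the paper's setup), and the data are random placeholders:

    import numpy as np
    from sklearn.svm import LinearSVC

    def length_normalize(X):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        norms[norms == 0] = 1.0        # leave all-zero vectors untouched
        return X / norms

    rng = np.random.default_rng(0)
    X = length_normalize(rng.poisson(1.0, size=(40, 100)).astype(float))  # toy count vectors
    y = rng.integers(0, 2, size=40)                                       # toy pos/neg labels
    clf = LinearSVC().fit(X, y)        # default parameters
    print(clf.score(X, y))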

Step-by-step Manual

  1. 700 / 700 pos/neg randomly: really random?
  2. Divide into 3 equal-sized folds: how is that possible? (see the sketch below)
  3. Bigrams with frequency >= 7
  4. Generate the document vectors (frequency/presence): write the program
  5. SVM: how to do the train/test; see the ML book for details
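
On step 2: 1400 documents do not divide evenly by 3, so the folds can only be approximately equal (467/467/466). A sketch using scikit-learn's StratifiedKFold, which also keeps the class distribution balanced per fold:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    y = np.array([1] * 700 + [0] * 700)      # 700 positive, 700 negative labels
    X = np.arange(len(y)).reshape(-1, 1)     # stand-in document indices
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        print(len(train_idx), len(test_idx))  # fold sizes: ~933 train / ~467 test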

Changes to the corpus

  • July 17, 2010 3:34 pm

The corpus is here: http://www.imdb.com/reviews/index.html

Clicking through leads to pages such as http://www.imdb.com/reviews/00/0088.html, which include a synopsis or plot summary

Final Idea for the Paper

  • July 15, 2010 10:36 pm

Idea 1: with regard to the background

Many papers note that a key step in feature handling is removing irrelevant features while keeping the useful ones. My method is essentially a way of keeping the useful features.

  • First run unigram extraction over the movie's background synopsis to obtain feature set F1
  • Preprocessing
    • Can coreference resolution be added?
    • Wiebe's work on adjectives
  • Two schemes for extracting the final feature set F2 (see the sketch after this list)
    • Use 2-grams to extract the features F2 that appear together with the features in F1
    • Use co-occurrence to extract the features F2 that appear together with the features in F1
  • Two document representation schemes
    • frequency
    • presence
  • ML
    • Supervised
      • SVM: claimed to be best for movie-review data
      • Bayes: seems better suited to my model
    • Semi-supervised
      • I may have to use semi-supervised learning in each of my problems
  • Comparing results with Pang's paper is enough, so record all of Pang's experimental settings
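
A hedged sketch of the co-occurrence variant of the F2 extraction; the window size and count threshold are my placeholders, not settled choices:

    from collections import Counter

    def cooccurring_features(docs, f1, window=5, min_count=2):
        counts = Counter()
        for tokens in docs:                       # docs: list of token lists
            for i, tok in enumerate(tokens):
                if tok in f1:
                    start = max(0, i - window)
                    for other in tokens[start:i + window + 1]:
                        if other != tok and other not in f1:
                            counts[other] += 1
        return {t for t, n in counts.items() if n >= min_count}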

movie review project

  • July 15, 2010 11:34 am
  1. paper reading in detail on 15 Jul.
  2. Replicate the basic ML version on the problem on 15 Jul.
  3. Read the extending papers to get ideas for improvement on 15 Jul.
  4. Decide how to improve the method
  5. Come up with the abstract