Brainmaker

Nanos gigantium humeris insidentes!

Papers related to movie reviews

  • July 15, 2010 12:49 am
  1. Basic papers
  2. Papers on feature selection
  3. Papers on rating classification
  4. Papers on local vs. global analysis
  5. Papers on neutrality detection


Paper: Automatic Sentiment Analysis in On-line Text

  • July 14, 2010 8:12 pm

Erik Boiy et al.

We will give an overview of various techniques used to tackle the problems in the domain of sentiment analysis, and add some of our own results.

2.2 Emotions in Written Text

Appraisal

Many linguistic scholars agree on the three dimensions of Osgood et al. [1], who investigated how the meaning of words can be mapped into a semantic space.

(1) Evaluation (positive/negative)

(2) Potency (powerful/unpowerful)

(2.1) Proximity (near/far)

(2.2) Specificity (clear/vague)

(2.3) Certainty (confident/doubtful)

(3) Intensifiers (more/less)

3. Methodology

There are two main techniques for sentiment classification: symbolic techniques and machine learning techniques. The symbolic approach uses manually crafted rules and lexicons, whereas the machine learning approach uses unsupervised, weakly supervised, or fully supervised learning to construct a model from a large training corpus.

3.2 Machine Learning Techniques

3.2.1 Feature Selection

The most important decision to make when classifying documents is the choice of the feature set. Several features are commonly used, like unigrams or part-of-speech data. Features and their values are commonly stored in a feature vector.

Unigrams

This is the classic approach to feature selection, in which each document is represented as a feature vector whose elements indicate the presence of a word (keyword) in the document.
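As a minimal sketch of this representation (the vocabulary and document below are made-up examples, not from the paper):

```python
# Build a binary unigram presence vector over a fixed vocabulary.
def unigram_vector(document, vocabulary):
    """1 if the vocabulary word occurs in the document, else 0."""
    words = set(document.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

vocabulary = ["good", "bad", "movie", "plot"]
doc = "A good movie with a good plot"
print(unigram_vector(doc, vocabulary))  # [1, 0, 1, 1]
```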

N-grams

A word N-gram is a subsequence of N words from a given sequence. This means that the features in the document representation are not single words but pairs (bigrams), triples (trigrams), or even larger tuples of words.
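A short sketch of word N-gram extraction (the function name and example sentence are mine):

```python
# Slide a window of size n over the token list to collect word N-grams.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie was not good".split()
print(ngrams(tokens, 2))
# [('this', 'movie'), ('movie', 'was'), ('was', 'not'), ('not', 'good')]
```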

Lemmas

basic dictionary form

Negation

A solution for handling negation is to tag each word after the negation word until the first punctuation mark.
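The heuristic can be sketched as follows; the `NOT_` prefix and the negation word list are my assumptions, not the paper's exact scheme:

```python
import re

NEGATIONS = {"not", "no", "never"}  # assumed negation list

def tag_negation(text):
    """Prefix every word after a negation with NOT_ until punctuation."""
    tagged, negating = [], False
    for token in re.findall(r"[\w']+|[.,!?;]", text.lower()):
        if token in NEGATIONS:
            negating = True
            tagged.append(token)
        elif token in ".,!?;":
            negating = False
            tagged.append(token)
        else:
            tagged.append("NOT_" + token if negating else token)
    return tagged

print(tag_negation("I did not like this movie, but the music was good."))
```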

Opinion words

Adjectives

Wiebe noted in [15] that adjectives are good indicators of subjectivity in a document. Salvetti used WordNet to enrich the adjective-only feature vectors.

3.2.2 Machine Learning Techniques

Supervised Methods

The method that most often yields the highest accuracy in the literature is the Support Vector Machine (SVM) classifier.

(1) SVM

SVMs operate by constructing a hyperplane with maximal Euclidean distance to the closest training examples.
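A toy geometric illustration of the maximum-margin idea (not a full SVM solver): with one training example per class, the maximum-margin hyperplane is the perpendicular bisector of the segment joining them.

```python
# One positive and one negative training point (made-up coordinates).
pos, neg = (2.0, 2.0), (0.0, 0.0)

# The weight vector points from neg to pos; the bias centers the
# boundary at the midpoint, giving equal margin to both points.
w = (pos[0] - neg[0], pos[1] - neg[1])
mid = ((pos[0] + neg[0]) / 2, (pos[1] + neg[1]) / 2)
b = -(w[0] * mid[0] + w[1] * mid[1])

def decision(x):
    """> 0 means the positive side of the hyperplane."""
    return w[0] * x[0] + w[1] * x[1] + b

print(decision((3, 3)) > 0)   # positive side
print(decision((-1, 0)) > 0)  # negative side
```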

(2) Naive Bayes Multinomial

A naive Bayes classifier uses Bayes' rule (which states how to update or revise beliefs in the light of new evidence) as its main equation, under the naive assumption of conditional independence.
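A hand-computed sketch of Bayes' rule under the independence assumption; all probabilities here are made up for illustration:

```python
from math import prod

# P(word | class) for a two-word vocabulary, plus class priors.
likelihood = {"pos": {"good": 0.8, "bad": 0.2},
              "neg": {"good": 0.3, "bad": 0.7}}
prior = {"pos": 0.5, "neg": 0.5}

def posterior(words, cls):
    """Unnormalized posterior: P(cls) * prod of P(w | cls)."""
    return prior[cls] * prod(likelihood[cls][w] for w in words)

doc = ["good", "good", "bad"]
scores = {c: posterior(doc, c) for c in prior}
best = max(scores, key=scores.get)
print(best)  # the class with the higher unnormalized posterior
```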

(3) Maximum Entropy (Maxent)

The approach tries to preserve as much uncertainty as possible. A number of models are computed, where each feature corresponds to a constraint on the model. The model with the most entropy among all models that satisfy these constraints is selected for classification.

Unsupervised and Weakly-supervised Methods

Unsupervised methods can label a corpus that is later used for supervised learning (semantic orientation is especially helpful for this).

4. Challenges

4.3 Cross-domain Classification

How can we learn classifiers on one domain and use them on another? One possible approach is to train the classifier on a domain-mixed data set instead of training it on one specific domain.

5. Results

5.3.2 Our Experiments

SVM -> SVM light

naive Bayes multinomial -> Weka

Maximum Entropy -> Maxent from OpenNLP

6. Discussion

The advantages of unigrams and bigrams over the other features are that they are faster to extract and require no extra resources, while e.g. adjectives require a POS tagger to be run on the data first, and subjectivity analysis requires an additional classifier.

NBM is considerably faster.

=================My Summary===============

How to read the related papers?

  • Problem: is it a binary problem, e.g., supportive / unsupportive, or a rating problem?
  • Do they do syntactic preprocessing?
    • part-of-speech parser
    • WordNet for synonyms
  • What is the dataset
    • well-formed corpus
    • corpus collected directly from the internet
  • What is the machine learning methodology
    • Feature selection
      • Unigrams: each document is represented as a feature vector
      • N-grams: a subsequence of N words from a given sequence
      • Lemmas: basic dictionary form
      • Negation
      • Opinion words
      • Adjectives
    • Techniques (Classifiers)
      • Supervised
        • SVM
        • Naive Bayes Multinomial
        • Maximum Entropy
      • Unsupervised
      • Semi-supervised
    • Evaluation
      • How many folds
  • Other Statistical Methods
    • Markov Model
    • Conditional Random Field
    • N-grams model
    • Semantic Orientation
  • What is the setup: the details

corpus

  • July 13, 2010 3:02 pm

http://en.wikipedia.org/wiki/Text_corpus

free

http://www.anc.org/

Repository: the Open Directory Project (ODP, http://www.dmoz.org), the largest Web directory to date, where concepts correspond to categories of the directory.

cornell NLP group

http://www.cs.cornell.edu/Info/Projects/NLP/data.html

http://www.cs.cornell.edu/people/pabo/movie-review-data/


nlp tool kit

  • July 12, 2010 3:47 pm

http://opennlp.sourceforge.net/projects.html

Wiki project

  • July 12, 2010 3:31 pm

pay attention to Wikipedia’s sister projects

http://en.wikipedia.org/wiki/Main_Page

And each article is categorized in a certain way; that is also useful for named entity recognition.

Papers worth referencing for Project 2

  • July 12, 2010 2:36 am

Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia.pdf

Relation Extraction from Wikipedia Using Subtree Mining.pdf

Using Wikipedia for AutomaticWord Sense Disambiguation.pdf

2009-Semi-supervised Semantic Role Labeling.pdf

2007-Semi-Supervised Learning for Semantic Parsing using support vector machine.pdf

Word Sense Disambiguation with Semi-Supervised Learning

A particular method for verifying results

  • July 12, 2010 1:42 am

Use Google search result counts to judge whether a sentence makes sense:

the dog barks            About 1,010,000 results

the flower barks         1 result
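The check above can be sketched with the counts from the post hard-coded (no live API call; the function name is mine):

```python
# Hit counts taken from the example above, not fetched live.
hit_counts = {"the dog barks": 1_010_000, "the flower barks": 1}

def more_plausible(phrase_a, phrase_b, counts):
    """Pick the phrase with the larger search hit count."""
    return phrase_a if counts[phrase_a] >= counts[phrase_b] else phrase_b

print(more_plausible("the dog barks", "the flower barks", hit_counts))
```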

detail on semi supervised project

  • July 11, 2010 11:41 pm

Plan

  • Apply semi-supervised learning to a certain corpus and compare the results – basically similar to what we did in the ML course
  • Improve others’ supervised learning to semi-supervised
    • 2007-Semi-Supervised Learning for Semantic Parsing using support vector machine.pdf
  • Apply the same semi-supervised method from the general wiki to the simple wiki

To organize

  • Organize the corpora
  • Organize the semi-supervised methods

Good books

Semi-supervised learning

Problems

We cannot find a suitable free corpus. If we want to attend that conference, we cannot use a corpus; we have to use noisy text.

If we decide not to use wiki, we can organize the papers we found and see which corpora they use.

http://www.d.umn.edu/~tpederse/data.html

Ideas

Named Entity Recognition (NER) is a relatively easy direction to work on.


computational linguistics

  • July 11, 2010 11:39 pm

Logical Natural language processing and syntax-based computational linguistics

Statistical natural language processing and corpus-based computational linguistics

http://nlp.stanford.edu/links/statnlp.html

Papers to read for the two candidate projects

  • July 11, 2010 4:10 pm

P1:

  • Introduction to FOPC in the Chinese version of Wikipedia
    • a discussion on problem of FOPC
  • Toward the expressive power of natural language, in Principles of Semantic Networks
    • Topic on expressive power
  • Natural Language, Knowledge Representation, and Logical Form
    • has a discussion on expressive power
  • Language, Proof and Logic BC61 .B38 2002
    • Other expressive limitations of first-order logic

P2: