Nanos gigantium humeris insidentes!

Automatic Sentiment Analysis in On-line Text

  • July 29, 2010 6:07 pm

Erik Boiy el.

We will give an overview of various techniques used to tackle the problems in the domain of sentiment analysis, and add some of our own results.

2.2 Emothions in Written Text


A lot of linguistic scholars agree on the three dimensions of Osgood and al. [1] who investigated how the meaning of words can be mapped in a semantic space.

(1)Evaluation( positive/ negative)

(2) Potency (powerful/(unpowerful)

(2.1) Proximity (near/far)

(2.2) Specificity(clear/vague)

(2.3) Certainty (confident/doubtful)

(3) Intensifiers (more/less)

3. Methodology

There are two main techniques for sentiment classification: symbolic techniques and machine learning techniques. The symbolic approach uses manually crafted rules and lexicons, where the machine learning approach uses unsupervised, weakly supervised or fully supervised learning to construct a model from a large training corpus.

3.2 Machine Learning Techniques

3.2.1 Feature Selection

The most important decision to make when classfiying documents , is the choice of the feature set. Several features are commonly used, like unigrams or part-of-speech data. Features and their values are commonly stored in a feature vector.


This is the classic approach to feature selection, in which each document is represented as a feature vector, where the elements indicated the presenece of a word in the document. (keywords)


A word N-grams is a subsequence of N words from a given sequence. This means that the features in the document representation are not single words, but pairs(bigrams), triples(trigrams) or even bigger tuples of words.


basic dictionary form


A solution for this is to tag each word after the negation until the first punctuation.

Opinion words


Wiebe noted in [15] that adjectives are good indicators for subjectivity in a document. Salvetti used wordnet to enrich the only-adjective feature vectors.

3.2.2 machine Learning Techniques

Supervised Methods

The method that in the literature often yields the highest accuracy regards a Support vector machine classifier.

(1) SVM

SVM operate by constructing a hyplerplane with maximal Euclidean distance to the closest training expamples.

(2) Naive Bayes Multinomial

A naive Bayes classifier uses Bayes rule (which states how to update or revise believe in the light of new evidence) as its main equation, under the naive assumption of conditional independence.

(3) Maximum Entropy (Maxent)

The approach tries to preserve as much uncertainty as possible. A number of models are computed, where each feature corresponds to a constraint on the model. The model with most entropy over all models that satisfy these constraints is selected for classification.

unsupervised and weakly-supervised methods

Unsupervised methods can label a corpus,that is later used for supervised learning ( especially semantic orientation is helpful for this).

4. Challenges

4.3 Cross-domain Classification

How can we learn classifiers on one domain and use them on another domain. One possible approach is to train the classifier on a domain-mixed set of data instead of training it on one specific domain.

5. Results

5.3.2 Our Experiments

SVM -> SVM light

naive Bayes multinomial -> Weka

Maximum Entropy -> Maxent from OpenNLP

6. Dicussion

The advantages of unigrams and bigrams over the other features are that they are faster to extract, and requried no extra resources to use, while e.g. adjectives requrire a POS tagger to be run on the data frist, and subjectivity analysis requires an additional classifier to be used.

NBM is considerably faster.

=================My Summary===============

How to read the related papers?

  • Problem: is it a binary problem, e.g, supportive / in supportive , or a rating problem?
  • Do they do the syntactic preprocessing? 
    • part-of-speech parser
    • wordnet for synonym
  • What is the dataset
    • well-formed corpus
    • corpus collected directly from the internet
  • What is the machine learning mathodology
    • Feature selection
      • Unigrams: each documents is represented as a feature vector
      • N-grams: a subsequence of N words from a given sequence
      • Lemmas: basic dictionary form
      • Negation
      • Opinion words
      • Adjectives
    • Techniques(Classifier)
      • Supervised
        • SVM
        • Naive Bayes Multinomial
        • Maximum Entropy
      • Unsupervised
      • Semi-supervised
    • Evaluation
      • How many folds
  • Other Statistic Method
    • Markov Model
    • Conditional Random Field
    • N-grams model
    • semantic Orientation
  • What is the setup– the detail

Summary of NLP

  • July 29, 2010 4:01 pm


  • Stemming
  • Named Entity Recognization
  • Coreference
  • Part of speech labelling
  • Semantic Role Annotation


  • Logic-based
  • Statistic based
    • Markov Model
    • Conditional Random field
    • Naive Bayes
    • Machine Learning


Interested Field

  • Classification
  • Relation Extraction


  • July 29, 2010 2:40 pm


  • 1996-Building Probabilistic Models for natural Language
  • 2009-Statistical Language Models for Information Retrieval

Machine Learning(Book)

  • Pattern Recognition and Machine Learning
  • Semi-Supervised Learning

Relation Extraction(paper)

  • Extracting relations from Text: From Word Sequences to dependency Paths
  • Relation Extraction from wikipedia using subtree mining
  • Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia
  • Semi-supervised Semantic Role Labeling Using the Latent Words Language Model
  • Kernel Methods for Relation Extraction


  • July 27, 2010 11:30 am

sudo apt-get remove ubuntu-desktop

sudo apt-get install ubuntu-desktop 可就解决更新后变慢的问题

wordspace visualization

  • July 26, 2010 3:29 pm

the wordspace visualization can extend the application from qwtplot3d

qwt might be also needed

some lib might be also needed

another mapping function which might be useful to him


export LD_LIBRARY_PATH=”/home/zekai/addonlib/qwtplot3d/lib/”

a simple comprising method is:
write a class as the ripple mesh, inherited from the Function class
build a matrix according to the reduced dimensional matrix, set z  for no-exist point as 0, and then applying

dimensional reduction library

convert to svm light format

mapping function

compiled and runable

but the way they represent matrix is really weired, they kinda convert a matrix to a one long one dimensional array

add the function to label the points, but doesn’t work and i dont know why

about the labeling problem:

use opengl extension


使用sammon projection,然后把Eclucic distance改成edit distance for string,这样就能显示两个word的距离

我使用两维vector 的方式不正确,不能用push_back的方法直接把对象压入,因为对象已经不存在

The Reduplication of B.Pang’s work on Sentiment Classification

  • July 25, 2010 6:24 pm

B.Pang, L.Lee Thumbs up? Sentiment Classification using Machine Learning Techniques


“Our data source was the Internet Movie Database (IMDb) archive of the newsgroup( We selected only reviews where the author rating was expressed either with stars or some numerical value. Ratings were automatically extracted and converted into one of the three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment”


“To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 review per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented.”

“To prepare the documents, we automatically removed the rating indicators and extracted the textual information from the original HTML document format, treating punctuation as separate lexical items. No stemming or stoplists were used”

“To create a data set with uniform class distribution (studying the effect of skewed class distributions was out of the scope of this study), we randomly selected 700 positive-sentiment and 700 negative-sentiment documents. We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold. “

Feature Selection

  • bigrams with frequency at 7

“For this study, we focused on features based on unigrams(with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at lest four times in our 1400 document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams,since we consider bigrams ( and n-grams in general) to be an orthogonal way to incorporate context.”

Document Vector Generation

  • Frequency: the # of times fi occurs in document, and d = (n1(d),n2(d),…,nm(d))

“To implement these machine learning algorithms on our document data, we used the following standard bag-of-features framework. Let {f1,…,fm} be a predefined set of m features that can appear in a document; examples include the word “still” or the bigram “really stinks”. Let ni(d) be the number of times fi occurs in document d. Then, each document d is represented by the document vector d:=(n1(d),n2(d),…,nm(d))

  • Presence: ni(d) either is 1 or 0

“However, the definition of the MaxEnt feature/ class functions Fi,c only reflects the presence or absence of a feature, rather than directly incorporating feature frequency. In order to investigate whether reliance on frequency information could account for the higher accuracies of Naive Bayes and SVMs, we binarized the document vectors, setting ni(d) to 1 if and only feature fi appears in d, and reran Naive Bayes and SVMlight on these new vectors”

SVM: SVMlight with default setting

“We used Joachim’s SVMlight package for training and testing,with all paramenters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly)”

Hand-by-hand Manual

  1. 700 / 700 pos neg randomly: really randomly?
  2. divide into 3 equal-size folds: how is that possible?
  3. bigrams with frequency >= 7
  4. Generate the document vector frequency/presence: write the program
  5. SVM : how to do the train / test see the ML book for detail

Note of Exploiting Syntactic and semantic Information

  • July 25, 2010 3:27 pm

Paper: Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia

The Semantic Web is based on RDF[3], a representation language using Notation 3 or N3[4]. We follow the formalism of Semantic Web, specifically N3, in which we structure Wikipedia’s content as a collection of statements. Each statement consists of a subject, a predicate and an object. … The statements with the use of a domain-specific ontology can then be straightforwardly transformed into RDF format that in turn serves as machine-processable knowledge base.

Our method, unlike other works, mines the key patterns from syntactic and semantic structure to measure similarity between entity pairs rather using only lexical information as in [5-8] or hard matching of dependency paths as in [9].

In details, we attempt to integrate syntactic and semantic information of text to form an unified structure. We then decompose the structure into subsequences and mine the frequent ones with the aim to capture the key patterns for each relationship.

2. Problem Statement

We aim at extracting binary relations between entities from English version of Wikipedia articles. A 2-tuple (ep, es) and a triple (ep,rel, es) denote an entity pair and a binary relation respectively,
where ep and es are entities which may be PERSON, ORGANIZATION, LOCATION, TIME OR ARTIFACT and
rel denotes the directed relationship between ep and es, which may be one of following 13 relations: CEO, FOUNDER, CHAIRMAN, COO, PRESIDENT, DIRECTOR, VICE CHAIRMAN, SPOUSE, BIRTH DATA, BIRTH PLACE, FOUNDATION, PRODUCT  and LOCATION.

We follow [5] to define the entity mainly discussed in an article as principal entity, and other mentioned entities in the same article as secondary entities. We assume that interested entities in this problem should have a descriptive article in Wikipedia. Thus, no entity disambiguation and entity recognition is required in our system. The identifier of an entity is defined as the URL address to its appropriate article.

Our system predicts only the relations between the principal entity and each mentionded secondary entity in an article. As one more assumption, the relationship between an entity pair can be completely expressed in one sentence.

So that ,for an article, only the sentences that contain an principal entity and a secondary entity are necessarily to be analyzed.

4. Extract Relations from Wikipedia

4.1 Relation Extraction Framework

  1. articles should be processed to remove the HTML tags, extract hyperlinks which point to other wikipedia’s articles.passed to pre-processor: Sentence Splitter, Tokenizer and Phrase Chunker
  2. Parallelly processed to anchor all occurrences of principal entities and secondary entities. The Secondary Entity detector simply labels appropriate surface text of the hyperlinks as secondary entities.
  3. Sentence Selector chooses only sentences which contain the principal entity and at least one secondary entity. Each of such pairs becomes a relation candidate.
  4. The trainer receives articles with HTML tags to identify summary sections and extract ground truce relations annotated by human editors.
  5. Previously selected sentences that contain entity pairs from ground true relations are identified as training data.
  6. The trainer will learn the key patterns with respect to each relation.
  7. During testing, for each sentence and an entity pair on it, the Relation Extractor will identify the descriptive label and then outputs the final results.

Principal Entity Detector

  • Most of the pronouns in an article refer to the principal entity.
  • The first sentence of the article is often used to briefly define the principal entity.

We use rules to identify a set of referents to the principal entity, including three types[10]:

  • pronoun (“he”,”him”,”they”…)
  • proper noun (e.g., Bill Gates, William Henry Gates, Microsoft, …)
  • common nouns ( the company, the software, …)

Supported by the nature of Wikipedia, our technique performs better than those of the co-reference tools in LingPipe library and in OpenNLP tool set. All the occurrences of the collected referents are labeled as principal entity.

Training Data Builder

examine whether the pair is in ground true relation set or not. if yes, it attaches the relation label to the pair and create a new training sentence for the relation. For a relation r, the purpose of building training data is to collect the sentences that exactly express r. To reduce noise in training data, it is necessary to eliminate the pairs from the ground truce set which hold more than one relation.

4.2 Learning Patterns with Dependency Path

In this section, we will explain our first method to extracting relation using syntactic information.

Follow the idea in [9] we assume that the shortest dependency path tracing from a principal entity through the dependency tree to a secondary entity gives a concrete syntactic structure expressing relation between the pair.

Key patterns learning from the dependency paths for each relationship.

  1. derive dependency trees of the training sentences by Minipar parser and extract paths between entity pairs
  2. transform the paths into sequences which are in turn decomposed into subsequences
  3. From the subsequence collections of a relation r, we can identify the frequent subsequences for r.
  4. During testing, dependency path between an entity pair in an novel sentence is also converted into sequence and match with the previously mined subsequences.

Sequential Representation of Dependency Path

A word together with its Part-Of-Speech tag will be an element of the sequence.


Learning Key Patterns as Mining Frequent Sequence

PrefixSpan, which is introduced in [12], is known as an efficient method to mining sequential patterns. A sequence s=s1s2…sn;, where si is an itemset, is called subsequence of a sequence p =p1p2…pm; if there exists integers 1 <= j12<…n<=m such that s1 ⊆ pj1, …, sn ⊆ pjn.

In this research, we use the implementation tool of PrefixSpan developed by Taku kudo.

From here, sequence database denotes the set of sequences converted from dependency paths with respect to a relation.

Weighting The Patterns

It is necessary for each mined pattern to be assigned a weight with respect to a relation for estimating the relevance. Factors:

  • Length of the pattern: if two paths share a long common subpattern, it is more likely that the paths express the same relationship.
  • Support of the pattern: is the number of sequences that contain the pattern. It is more likely that a pattern with high support should be a key pattern
  • Amount of lexical information: although  the sequences contain both words and dependency relations from the original dependency path, we found that wordbased items are more important.
  • Number of sequence databases in which the pattern appear: if the pattern can be found in various sequence databases, it is more likely that the pattern is common and it should not be a key pattern of any relation.

wr(p) = ir f(p) x supportDr(p) x l(p) x elex(p)/|Dr|

Relation Selection

Given a novel sentence and the anchors of an entity pair in it, we will predict the appropriate relation of the pair. We extract the dependency path P, transform P into sequential pattern and then accumulate the scores of its subsequences for each relation r:

L_r(P)= displaystylesum_{p in S(p)} omega_{r (P)}

  • Lr(P) likelihood score to say that P expresses relation r
  • S(P) set of all subsequences of the sequential representation of P
  • The appropriate relation should  be the one giving highest score to P:

R= argdisplaystylemax_r L_r(P)

4.3 Learning Patterns with Dependency Path and Semantic Role

We use the SNoW-based Semantic role Labeler [16], a state-of-art in SRL task which conforms the defintion of PropBank and CoNLL-2005 shared task on SRL. Since the SRL task just labels roles to constituents or phrases without indicateing which primitive concept playing the role, we still use dependency parsing information to further analyze the phrases.

[5] Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and patterns in Text.

[9]Extracting Relations from Text

[10]Correference for nlp applications.

[12] Mining Sequential patterns by Pattern-Growth: The prefixspan approach.


  • July 24, 2010 6:16 pm

Definitely, I can use ResearchCyc to do the Named Entity Recognition.


  • July 20, 2010 2:03 am


A survey on sentiment detection of reviews


Sentiment Analysis: A Combined Approach

Useful Survey

  • July 20, 2010 2:03 am

A survey on sentiment detection of reviews

Sentiment Analysis: A Combined Approach