Nanos gigantium humeris insidentes!

Notes on “Impact of Lexical Filtering on Overall Opinion Polarity Identification”

  • August 1, 2010 10:20 am

F. Salvetti, S. Lewis, C. Reichenbach. Impact of Lexical Filtering on Overall Opinion Polarity Identification


HTML documents were converted to plain text, tagged using the Brill tagger, and fed into filters and classifiers.

Basic Assumption or Points:

Related Research

Research has demonstrated that there is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion (Wiebe, Bruce, & O’Hara 1999).

Hatzivassiloglou & McKeown (1997) combined a log-linear statistical model that examined the conjunctions between adjectives (such as “and”, “but”, “or”) with a clustering algorithm that grouped the adjectives into two sets, which were then labelled positive and negative.

Turney extracted n-grams based on adjectives (Turney 2002). To determine whether an adjective had positive or negative polarity, he used AltaVista and its NEAR operator. He compared the number of co-occurrences of the adjective under investigation NEAR the word ‘excellent’ with the number NEAR the word ‘poor’, on the assumption that frequent occurrence NEAR ‘excellent’ implies positive polarity.
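The comparison of NEAR counts can be written as a log-ratio score. The following is a minimal sketch, assuming the four hit counts have already been fetched from the search engine; the function name, the example numbers, and the small smoothing constant `eps` are illustrative assumptions, not values from these notes.

```python
import math

def semantic_orientation(near_excellent, near_poor, hits_excellent, hits_poor, eps=0.01):
    """Log-ratio score: positive when the phrase co-occurs more with 'excellent'.

    eps is a small smoothing constant (an assumption here) that avoids
    division by zero when a co-occurrence count is 0.
    """
    numerator = (near_excellent + eps) * hits_poor
    denominator = (near_poor + eps) * hits_excellent
    return math.log2(numerator / denominator)

# Hypothetical hit counts, for illustration only.
score = semantic_orientation(near_excellent=120, near_poor=30,
                             hits_excellent=1_000_000, hits_poor=1_000_000)
print("positive" if score > 0 else "negative")
```

Normalizing by the overall hit counts of ‘excellent’ and ‘poor’ keeps a generally more frequent reference word from dominating the score.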


The Cornell data consists of 27,000 movie reviews in HTML form, using 35 different rating scales such as A…F or 1…10 in addition to the common 5-star system. We divided them into two classes (positive and negative) and took 100 reviews from each class as the test set.


Features for analysis

Three basic approaches for handling this kind of data pre-processing come to mind:

  • Leave the data as-is: each word is represented by itself
  • Part-of-speech tagging: each word is enriched with a POS tag, as determined by a standard tagging technique (such as the Brill tagger (Brill 1995))
  • POS tagging plus parsing (using e.g. the Penn Treebank (Marcus, Santorini, & Marcinkiewicz 1994)): severe performance issues

We thus focus our analysis in this paper on POS-tagged data (sentences consisting of words enriched with information about their parts of speech).

We thus make the following assumptions about our test and training data:

  1. All words are transformed into upper case,
  2. All words are stemmed,
  3. All words are transformed into (word, POS) tuples by POS tagging (notation: word/POS).

All of these are computationally easy to achieve (with a reasonable amount of accuracy) using the Brill tagger.
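To make the word/POS notation concrete, here is a minimal sketch of the three steps. The tiny tag lexicon and the suffix-stripping stemmer are toy stand-ins (assumptions for illustration only), not the Brill tagger or a real stemmer, and the sketch tags before stemming, which is also an assumption about ordering.

```python
# Toy stand-ins: a real pipeline would use a trained tagger and a real stemmer.
TOY_TAGS = {"the": "DT", "movie": "NN", "was": "VBD", "amazing": "JJ"}
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip one common suffix; a crude placeholder for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    """Return (WORD, POS) tuples: tag first, then stem and upper-case."""
    pairs = []
    for token in tokens:
        lowered = token.lower()
        tag = TOY_TAGS.get(lowered, "NN")  # unknown words default to noun
        pairs.append((toy_stem(lowered).upper(), tag))
    return pairs

pairs = preprocess(["The", "movie", "was", "amazing"])
print(" ".join(f"{word}/{tag}" for word, tag in pairs))
```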



  • Data: Cornell
  • Part-of-speech tagger: Brill tagger (Brill 1995)
  • WordNet: 1.7.13

Part of Speech Filters

Any portion of a review that does not contribute to the overall opinion polarity (OvOP) is noise. To reduce noise, filters were developed that use POS tags to do the following.

  1. Introduce custom parts of speech when the tagger does not provide the desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review (determiners, prepositions, etc.)
  3. Reduce parts of speech that introduce unnecessary variance to their POS tag only

The POS filters are not designed to reduce the effects of conflicting polarity. They are only designed to reduce the effect of lack of polarity.

One design principle of the filter rules is that they filter out parts of speech that do not contribute to the semantic orientation and keep the parts of speech that do contribute such meaning. Based on analysis of movie review texts, we devised “filter rules” that take Brill-tagged text as input and return less noisy, more concentrated sentences that have a combination of words and word/POS-tag pairs removed from the original. A summary of the filter rules defined in this experiment is shown in Table 2.

Table 2: Summary of POS filter rules
POS r1 r2 r3 r4 r5

K: Keep, D: Drop, G: Generalize

Wiebe et al., as well as other researchers, showed that subjectivity is especially concentrated in adjectives (Wiebe, Bruce, & O’Hara 1999; Turney & Littman 2003). Therefore, no adjectives or their tags were removed, nor were copula verbs or negative markers. However, noisy information such as determiners, foreign words, prepositions, modal verbs, possessives, particles, and interjections was removed from the text stream. Other parts of speech, such as nouns and verbs, were removed but their POS tags were retained.
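The keep / drop / generalize idea can be sketched as a lookup from POS tag to action. This is only an illustration under stated assumptions: the rule table below uses Penn Treebank tags and is not one of the paper's rules r1 to r5, and the custom tag names `NEG` and `COP` and the word lists are hypothetical.

```python
KEEP, DROP, GENERALIZE = "K", "D", "G"

# Hypothetical word lists for the custom parts of speech.
NEGATIONS = {"NOT", "N'T", "NEVER", "NO"}
COPULAS = {"IS", "ARE", "WAS", "WERE", "BE", "BEEN"}

# Illustrative rule table; tags not listed here are kept by default.
RULES = {
    "JJ": KEEP, "JJR": KEEP, "JJS": KEEP,                    # adjectives
    "DT": DROP, "IN": DROP, "MD": DROP, "FW": DROP,          # determiners, prepositions, modals, foreign words
    "POS": DROP, "RP": DROP, "UH": DROP,                     # possessives, particles, interjections
    "NN": GENERALIZE, "NNS": GENERALIZE,                     # nouns: keep the tag only
    "VB": GENERALIZE, "VBD": GENERALIZE, "VBZ": GENERALIZE,  # verbs: keep the tag only
}

def retag(word, tag):
    """Step 1: introduce custom tags for negation and copula."""
    if word in NEGATIONS:
        return "NEG"
    if word in COPULAS:
        return "COP"
    return tag

def apply_filter(tagged, rules=RULES, default=KEEP):
    """Steps 2 and 3: drop noisy words, reduce others to their POS tag."""
    out = []
    for word, tag in tagged:
        tag = retag(word, tag)
        action = rules.get(tag, default)
        if action == KEEP:
            out.append(f"{word}/{tag}")
        elif action == GENERALIZE:
            out.append(tag)
        # DROP: emit nothing
    return out

sentence = [("THE", "DT"), ("MOVIE", "NN"), ("WAS", "VBD"), ("NOT", "RB"), ("GOOD", "JJ")]
print(" ".join(apply_filter(sentence)))
```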

WordNet filtering


===============Summary by me===============

There is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion. (paper to read)

Turney extracted n-grams based on adjectives (Turney 2002). To determine whether an adjective had positive or negative polarity, he used AltaVista and its NEAR operator. (paper to read)

plain text ==[Brill tagger]==> Result 1 ==[POS filter]==> Result 2 ==[Classifiers]==> Result 3

About the POS filter

  1. Introduce custom parts of speech when the tagger does not provide the desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review (determiners, prepositions, etc.)
  3. Reduce parts of speech that introduce unnecessary variance to their POS tag only

Automatic Sentiment Analysis in On-line Text

  • July 29, 2010 6:07 pm

Erik Boiy et al.

We will give an overview of various techniques used to tackle the problems in the domain of sentiment analysis, and add some of our own results.

2.2 Emotions in Written Text


Many linguistic scholars agree on the three dimensions of Osgood et al. [1], who investigated how the meaning of words can be mapped in a semantic space.

(1) Evaluation (positive/negative)

(2) Potency (powerful/unpowerful)

(2.1) Proximity (near/far)

(2.2) Specificity(clear/vague)

(2.3) Certainty (confident/doubtful)

(3) Intensifiers (more/less)

3. Methodology

There are two main techniques for sentiment classification: symbolic techniques and machine learning techniques. The symbolic approach uses manually crafted rules and lexicons, whereas the machine learning approach uses unsupervised, weakly supervised or fully supervised learning to construct a model from a large training corpus.

3.2 Machine Learning Techniques

3.2.1 Feature Selection

The most important decision to make when classifying documents is the choice of the feature set. Several features are commonly used, such as unigrams or part-of-speech data. Features and their values are commonly stored in a feature vector.


Unigrams

This is the classic approach to feature selection, in which each document is represented as a feature vector whose elements indicate the presence of a word in the document. (keywords)


N-grams

A word N-gram is a subsequence of N words from a given sequence. This means that the features in the document representation are not single words, but pairs (bigrams), triples (trigrams) or even bigger tuples of words.
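A short sketch of extracting word N-grams from a token list follows; the function name `word_ngrams` is only illustrative.

```python
def word_ngrams(tokens, n):
    """Return all contiguous subsequences of n words, in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["this", "movie", "really", "stinks"]
print(word_ngrams(tokens, 2))  # bigrams
print(word_ngrams(tokens, 3))  # trigrams
```

A sequence shorter than n simply yields no N-grams.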


Lemmas

A lemma is the basic dictionary form of a word.


Negation

A solution for this is to tag each word after the negation, up to the first punctuation mark.
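Tagging the words that follow a negation might look like the sketch below. The `NOT_` prefix, the negation word list, and the punctuation set are assumptions chosen for illustration.

```python
NEGATIONS = {"not", "no", "never", "isn't", "didn't", "don't"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def tag_negation(tokens, prefix="NOT_"):
    """Prefix every token between a negation word and the next punctuation mark."""
    tagged, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False          # punctuation ends the negated span
            tagged.append(token)
        elif negating:
            tagged.append(prefix + token)
        else:
            tagged.append(token)
            if token.lower() in NEGATIONS:
                negating = True       # start tagging from the next token
    return tagged

print(tag_negation("i did not like this movie , but the music was fine".split()))
```

With this scheme "like" and "NOT_like" become two distinct features, so a classifier can learn opposite weights for them.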

Opinion words


Adjectives

Wiebe noted in [15] that adjectives are good indicators of subjectivity in a document. Salvetti used WordNet to enrich the adjective-only feature vectors.

3.2.2 Machine Learning Techniques

Supervised Methods

The method that most often yields the highest accuracy in the literature is the support vector machine (SVM) classifier.

(1) SVM

SVMs operate by constructing a hyperplane with maximal Euclidean distance to the closest training examples.

(2) Naive Bayes Multinomial

A naive Bayes classifier uses Bayes' rule (which states how to update or revise beliefs in the light of new evidence) as its main equation, under the naive assumption of conditional independence.
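As a rough sketch of how Bayes' rule and the independence assumption turn into a classifier, the code below trains a multinomial naive Bayes model on token lists. Add-one smoothing and the function names are assumptions for illustration; this is not the Weka implementation used in the paper's experiments.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns (log_prior, log_likelihood, vocab)."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like, vocab

def classify_nb(tokens, model):
    """Pick the class maximizing log P(c) + sum of log P(w | c) over known words."""
    log_prior, log_like, vocab = model
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in vocab)
        for c in log_prior
    }
    return max(scores, key=scores.get)

model = train_nb([("great great fun".split(), "pos"), ("boring bad".split(), "neg")])
print(classify_nb(["great"], model))
```

Summing log probabilities instead of multiplying probabilities avoids numerical underflow on long documents.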

(3) Maximum Entropy (Maxent)

The approach tries to preserve as much uncertainty as possible. A number of models are computed, where each feature corresponds to a constraint on the model. Among all models that satisfy these constraints, the one with the highest entropy is selected for classification.

Unsupervised and Weakly-supervised Methods

Unsupervised methods can label a corpus that is later used for supervised learning (semantic orientation is especially helpful for this).

4. Challenges

4.3 Cross-domain Classification

How can we learn classifiers on one domain and use them on another domain? One possible approach is to train the classifier on a domain-mixed set of data instead of training it on one specific domain.

5. Results

5.3.2 Our Experiments

SVM -> SVMlight

Naive Bayes multinomial -> Weka

Maximum Entropy -> Maxent from OpenNLP

6. Discussion

The advantages of unigrams and bigrams over the other features are that they are faster to extract and require no extra resources, whereas, for example, adjectives require a POS tagger to be run on the data first, and subjectivity analysis requires an additional classifier.

NBM (naive Bayes multinomial) is considerably faster.

=================My Summary===============

How to read the related papers?

  • Problem: is it a binary problem, e.g., supportive / unsupportive, or a rating problem?
  • Do they do the syntactic preprocessing? 
    • part-of-speech parser
    • WordNet for synonyms
  • What is the dataset?
    • well-formed corpus
    • corpus collected directly from the internet
  • What is the machine learning methodology?
    • Feature selection
      • Unigrams: each document is represented as a feature vector
      • N-grams: a subsequence of N words from a given sequence
      • Lemmas: basic dictionary form
      • Negation
      • Opinion words
      • Adjectives
    • Techniques(Classifier)
      • Supervised
        • SVM
        • Naive Bayes Multinomial
        • Maximum Entropy
      • Unsupervised
      • Semi-supervised
    • Evaluation
      • How many folds
  • Other statistical methods
    • Markov model
    • Conditional random field
    • N-gram model
    • Semantic orientation
  • What is the setup (the details)?

Reproducing B. Pang’s Work on Sentiment Classification

  • July 25, 2010 6:24 pm

B. Pang, L. Lee, S. Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques


“Our data source was the Internet Movie Database (IMDb) archive of the newsgroup. We selected only reviews where the author rating was expressed either with stars or some numerical value. Ratings were automatically extracted and converted into one of the three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment.”


“To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 reviews per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented.”

“To prepare the documents, we automatically removed the rating indicators and extracted the textual information from the original HTML document format, treating punctuation as separate lexical items. No stemming or stoplists were used”

“To create a data set with uniform class distribution (studying the effect of skewed class distributions was out of the scope of this study), we randomly selected 700 positive-sentiment and 700 negative-sentiment documents. We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold.”
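Since 700 is not divisible by 3, the folds can only be approximately equal in size. One possible reading is sketched below: shuffle each class separately and deal the documents round-robin into the folds, so that every fold stays balanced between the classes. The function name and the fixed seed are assumptions, and this is not necessarily how the authors built their folds.

```python
import random

def balanced_folds(pos_docs, neg_docs, k=3, seed=0):
    """Split two classes into k folds with the same class ratio in each fold."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    folds = [[] for _ in range(k)]
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        shuffled = list(docs)
        rng.shuffle(shuffled)
        for i, doc in enumerate(shuffled):
            folds[i % k].append((doc, label))  # deal round-robin
    return folds

# Document ids stand in for the actual review texts.
folds = balanced_folds(list(range(700)), list(range(700, 1400)))
print([len(fold) for fold in folds])
```

For three-fold cross-validation, each fold is used once as the test set while the other two are merged for training.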

Feature Selection

  • bigrams with frequency of at least 7

“For this study, we focused on features based on unigrams (with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at least four times in our 1400 document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams, since we consider bigrams (and n-grams in general) to be an orthogonal way to incorporate context.”
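The two selection criteria in the quote (a count cutoff for unigrams, a top-k cut for bigrams) can be sketched with `collections.Counter`. The function and parameter names are illustrative assumptions; documents are assumed to be already tokenized.

```python
from collections import Counter

def select_features(documents, min_unigram_count=4, num_bigrams=16165):
    """Keep unigrams seen at least min_unigram_count times and the
    num_bigrams most frequent bigrams, counted over the whole corpus."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for tokens in documents:
        unigram_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    unigrams = [w for w, c in unigram_counts.items() if c >= min_unigram_count]
    bigrams = [b for b, _ in bigram_counts.most_common(num_bigrams)]
    return unigrams, bigrams

docs = [["a", "b", "a", "b"], ["a", "b", "c"]]
print(select_features(docs, min_unigram_count=3, num_bigrams=1))
```

The frequency of the last selected bigram then gives the effective cutoff, which is how a "top 16165" rule can end up equivalent to "at least seven times".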

Document Vector Generation

  • Frequency: ni(d) is the number of times fi occurs in document d, and d = (n1(d), n2(d), …, nm(d))

“To implement these machine learning algorithms on our document data, we used the following standard bag-of-features framework. Let {f1,…,fm} be a predefined set of m features that can appear in a document; examples include the word “still” or the bigram “really stinks”. Let ni(d) be the number of times fi occurs in document d. Then, each document d is represented by the document vector d:=(n1(d),n2(d),…,nm(d))”

  • Presence: ni(d) either is 1 or 0

“However, the definition of the MaxEnt feature/class functions Fi,c only reflects the presence or absence of a feature, rather than directly incorporating feature frequency. In order to investigate whether reliance on frequency information could account for the higher accuracies of Naive Bayes and SVMs, we binarized the document vectors, setting ni(d) to 1 if and only if feature fi appears in d, and reran Naive Bayes and SVMlight on these new vectors”
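Both document vector variants can be built from the same counts. The sketch below handles unigram features only (bigram features would be counted over word pairs in the same way); the function names are illustrative.

```python
from collections import Counter

def frequency_vector(tokens, features):
    """n_i(d): how many times each feature f_i occurs in the document."""
    counts = Counter(tokens)
    return [counts[f] for f in features]

def presence_vector(tokens, features):
    """Binarized version: 1 if the feature occurs at all, otherwise 0."""
    return [1 if n > 0 else 0 for n in frequency_vector(tokens, features)]

features = ["still", "bad", "good"]
doc = "still good still".split()
print(frequency_vector(doc, features))
print(presence_vector(doc, features))
```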

SVM: SVMlight with default setting

“We used Joachim’s SVMlight package for training and testing, with all parameters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly)”
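A sketch of preparing input for SVMlight follows, assuming that "length-normalizing" means dividing each vector by its Euclidean (L2) norm. The helper names are hypothetical. The output line follows SVMlight's sparse text format: a target of +1 or -1, then `index:value` pairs with indices starting at 1 and in increasing order.

```python
import math

def length_normalize(vector):
    """Scale the vector to unit Euclidean length (all-zero vectors are left as is)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

def to_svmlight_line(label, vector):
    """Format one example; label is +1 or -1, zero-valued features are omitted."""
    feats = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(vector, start=1) if v != 0)
    return f"{label:+d} {feats}"

vec = length_normalize([3, 0, 4])
print(to_svmlight_line(+1, vec))
```

One such line per review, written to a training file and a test file, is the kind of input that SVMlight's `svm_learn` and `svm_classify` programs read.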

Step-by-step Manual

  1. Select 700 positive and 700 negative reviews at random: really randomly?
  2. Divide into 3 equal-sized folds: how is that possible?
  3. Bigrams with frequency >= 7
  4. Generate the document vectors (frequency/presence): write the program
  5. SVM: how to do the training / testing; see the ML book for details