Brainmaker

Nanos gigantium humeris insidentes!

Note of Lexical Filtering on Overall Opinion Polarity Identification

  • August 1, 2010 10:20 am

F. Salvetti, S.Lewis, C.Reichenbach.Impact of Lexical Filtering on Overall Opinion Polarity Identification

Flow

HTML documents were converted to plain text, tagged using the Brill tagger, and fed into filters and classifiers.

Basic Assumption or Points:

Related Research

Research has demonstrated that there is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion (Wiebe, Bruce, & O’ Hara 1999).

Hatzivassiloglou & McKeown 1997), combined a log-linear statistical model that examined the conjunctions between adjectives,(such as “and”, “but”, “or”), with a clustering algorithm that grouped the adjectives into two sets which were then labelled positive and negative.

Turney extracted n-grams based on adjectives( Turney 2002). In order to determine if an adjective had a positive /negative polarity he used AltaVista and its function NEAR. He combined the number of co-occurrences of the adjective under investigation NEAR the adjective ‘excellent’ and NEAR the ‘poor’, thinking that high occurrence NEAR ‘excellent’ implies positive polarity.

Corpus

The cornell data consists of 27,000 movie reviews in HTML form, using 35 different rating scales such as A…F or 1…10 in addition to the common 5 star system. We divided them into two classes (positive and negative) and took 100 reviews from each class as the test set.

Methodology

Features for analysis

Three basic approaches for handling this kind of data pre-processing come to mind:

  • Leave the data as-is : Each word will be represented by itself
  • Parts-of-speech tagging: Each word is enriched by a POS tag, as determined by a standard tagging technique (such as the Brill Tagger(Brill 1995))
  • Perform POS taggin and parser (Using e.g. the Penn Tree-bank (Marcus, Santorini, & marcinkiewicz 1994))—severe performance issues

We thus focus our analysis in this paper on POS-tagged data (sentences consisting of words enriched with information about their parts of speech).

We Thus make the following assumptions about our test and training data:

  1. All words are transformed into upper case,
  2. All words are stemmed,
  3. All words are transformed into (word,POS) tuples by POS tagging (notation word/ POS).

All of these are computationally easy to achieve ( with a reasonable amount of accuracy ) using the Brill Tagger.

Experiments

Setting

  • Data: cornell
  • Part-of-speech tagger: Brill tagger (Brill 1995)
  • wordnet: 1.7.13

Part of Speech Filters

Any portion that does not contribute to the OvOP is noise. To reduce noise, filters were developed that use POS tags to do the following.

  1. Introduce custom parts of speech when the tagger does not provide desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review(determiner, preposition, etc)
  3. Reduce parts of speech that introduce unneccessary variance to POS only

The POS filters are not designed to reduce the effects of conflicting polarity. They are only designed to reduce the effect of lack of polarity.

One design principle of the filter rules is that they filter out parts of speech that do not contribute to the semantic orientation and keep the parts of speech that do contribute such meaning. Based on analysis of movie review texts, we devised “filter rules” that take Brill-tagged text as input and return less noisy, more concentrated sentences that have a combination of words and word/POS-tag pairs removed from the original. A summary of the filter rules defined in this experiment is shown in Table 2.

Table 2: Summary of POS filter rules
POS r1 r2 r3 r4 r5
JJ K K K K K
RB D K K K K
VBG K K K K D
VBN K K K K D
NN G G G G G
VBZ D D K K D
CC D D D K K
COP K K K K K

K: keep D: Drop G:Generalize

Wiebe et al., as well as other researchers, showed that subjectivity is especially concentrated in adjectives ( Wiebe, Bruce, & O’ Hara 1999; Turney & Littman 2003). Therefore, no adjectives or their tags were removed, nor were copula verbs or negative markers. However, noisy information such as determiners, foreign words, prepositions, modal verbs, possesives, particles, interjections, etc. were removed from the text stream. Other parts of speech, such as nouns and verbs, were removed but their POS-tags were retained.


WordNet filtering

generalization

===============Summary by me===============

There is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion. (paper to read)

Turney extracted n-grams based on adjectives( Turney 2002). In order to determine if an adjective had a positive /negative polarity he used AltaVista and its function NEAR. (paper to read)

___________Brill tagger POS filter Classifiers
plain text========>Result 1======>Result2=========>Result3.

About the POS filter

  1. Introduce custom parts of speech when the tagger does not provide desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review(determiner, preposition, etc)
  3. Reduce parts of speech that introduce unneccessary variance to POS only
Print Friendly