Nanos gigantium humeris insidentes!
You are currently browsing the Paper category

Reduplication of Nguyen’s RE 2007 — Tool

  • July 31, 2010 2:28 pm
  • Entity Dector
    • use: technique based on the wikipedia’s nature (algorithm given)
    • option: co-reference tools in LingPipe library and in OpenNLP tool set
  • Derive dependency trees
    • Minipar parser
  • Mining sequential patterns
  • PA structure
    • use: SNoW-based Semantic role Labeler — here

  1. articles should be processed to 
    1. remove the HTML tags,
    2. extract hyperlinks which point to other wikipedia’s articles.
  2. Parallelly processed to anchor all occurrences of
    1. principal entities —- Algorithm provided or OpenNLP
    2. secondary entities —- Simple
  3. Sentence Selector 
    1. chooses sentences which contain the principal entity and at least one secondary entity–simple
    2. Each of such pairs becomes a relation candidate.
  4. The trainer receives articles with HTML tags to
    1. identify summary sections and–simple
    2. extract ground truce relations annotated by human editors.  —seems difficult
  5. Previously selected sentences that contain entity pairs from ground true relations are
    1. identified as training data.
  6. The trainer will 
    1. learn the key patterns with respect to each relation.  —  based on Bunescu’s work or minipar, prefixSpan, SNow
  7. During testing, for each sentence and an entity pair on it, the Relation Extractor will
    1. identify the descriptive label and then
    2. outputs the final results.

Reduplication of Nguyen’s RE 2007 — Procedure

  • July 31, 2010 11:11 am

Paper: Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia

For background and Assumption , see the note  here.

This paper has two versions, and the reduplication is based on the second version

Abstract: Mined frequent subsequences from the path between an entity pair in the syntactic and semantic structure in order to explore key patterns reflecting the relationship between the pair. The relations between entities are treated as multiclass labels, and the entities pair and labels can be passed to SVM for training.

4. Extract Relations from Wikipedia

4.1 Relation Extraction Framework

  1. articles should be processed to remove the HTML tags, extract hyperlinks which point to other wikipedia’s articles.passed to pre-processor: Sentence Splitter, Tokenizer and Phrase Chunker
  2. Parallelly processed to anchor all occurrences of principal entities and secondary entities. The Secondary Entity detector simply labels appropriate surface text of the hyperlinks as secondary entities.
  3. Sentence Selector chooses only sentences which contain the principal entity and at least one secondary entity. Each of such pairs becomes a relation candidate.
  4. The trainer receives articles with HTML tags to identify summary sections and extract ground truce relations annotated by human editors.
  5. Previously selected sentences that contain entity pairs from ground true relations are identified as training data.
  6. The trainer will learn the key patterns with respect to each relation.
  7. During testing, for each sentence and an entity pair on it, the Relation Extractor will identify the descriptive label and then outputs the final results.

Principal Entity Detector

  • Most of the pronouns in an article refer to the principal entity.
  • The first sentence of the article is often used to briefly define the principal entity.

We use rules to identify a set of referents to the principal entity, including three types[10]:

  • pronoun (“he”,”him”,”they”…)
  • proper noun (e.g., Bill Gates, William Henry Gates, Microsoft, …)
  • common nouns ( the company, the software, …)

Steps to collect the referents for D include:

  1. Start with D = {}
  2. Select the first two chunks: the proper chunk of the article title and the first proper chunk in the first sentence of the article if any. We define a proper chunk as a chunk that contains at least a proper noun, word with NNP or NNPS tag. We call the first proper chunks selected so far (up to two) in D as names of the principal entity.
  3. If D contains no items, return D and stop the algorithm. Otherwise, continue.
  4. Get a list of all other proper chunks of the articles. For each proper chunk p, if p is derived from any of the names, then D← p. We call a proper chunk p1 is derived from a proper chunk p2 if the set of all proper nouns of p1 is a sub set of all proper nouns of p2.
  5. From the article, select c as the most frequent pronouns, find c’ as its equivalent pronoun and add them to D. For example, ‘he’ is equivalent to ‘him’, ‘she’ is equivalent to ‘her’ …We call c and c’ pronoun referents.
  6. Select all the chunks with the pattern [DT N1 … Nk] where DT is a determiner and Nk is a common noun. Only those chunks which appear more frequently than the pronoun referents are added into D.

Supported by the nature of Wikipedia, our technique performs better than those of the co-reference tools in LingPipe library and in OpenNLP tool set. All the occurrences of the collected referents are labeled as principal entity.

Training Data Builder

Examining whether the pair is in ground true relation set or not. if yes, it attaches the relation label to the pair and create a new training sentence for the relation.

We define a training sentence as a six-tuple: (s, l1, r1, l2, r2, rel)  where s is the sentence itself, (l1, r1) and (l2, r2) indicate the token-based boundaries of the  principal entity and secondary entity, and rel indicates the relation label between the entity  pair. With a sentence s selected from the above procedure, we receive the identifiers of the  entities and search for a ground true relation r such that r.ep is the identifier of the principal  entity and is the identifier of the secondary entity. Then, we create a new training sentence  from s and r.

For a relation r, the purpose of building training data is to collect the sentences that exactly express r. To reduce noise in training data, it is necessary to eliminate the pairs from the ground truce set which hold more than one relation.

4.2 Learning Patterns with Dependency Path

In this section, we will explain our first method to extracting relation using syntactic information.

Follow the idea in [9] we assume that the shortest dependency path tracing from a principal entity through the dependency tree to a secondary entity gives a concrete syntactic structure expressing relation between the pair.

Key patterns learning from the dependency paths for each relationship.

  1. derive dependency trees of the training sentences by Minipar parser [11] and extract paths between entity pairs
  2. transform the paths into sequences which are in turn decomposed into subsequences
  3. From the subsequence collections of a relation r, we can identify the frequent subsequences for r.
  4. During testing, dependency path between an entity pair in an novel sentence is also converted into sequence and match with the previously mined subsequences.

Sequential Representation of Dependency Path

A word together with its Part-Of-Speech tag will be an element of the sequence.


  1. For each word excluding the first and the last ones in the dependency path, we consider  the combination of its base form and its Part-Of-Speech (POS) as an element of the sequence.
  2. For the first and the last word, only the POS is concerned as a sequence element.
  3. For a dependency relation, a pair of direction and relation label is considered as an  element of the sequence.

Learning Key Patterns as Mining Frequent Sequence

PrefixSpan, which is introduced in [12], is known as an efficient method to mining sequential patterns. A sequence s=<s1s2…sn>;, where si is an itemset, is called subsequence of a sequence p =<p1p2…pm>; if there exists integers 1 <= j12<…n<=m such that s1 ⊆ pj1, …, sn ⊆ pjn.

In this research, we use the implementation tool of PrefixSpan developed by Taku kudo.

From here, sequence database denotes the set of sequences converted from dependency paths with respect to a relation.

Weighting The Patterns

It is necessary for each mined pattern to be assigned a weight with respect to a relation for estimating the relevance. Factors:

  • Length of the pattern: if two paths share a long common subpattern, it is more likely that the paths express the same relationship.
  • Support of the pattern: is the number of sequences that contain the pattern. It is more likely that a pattern with high support should be a key pattern
  • Amount of lexical information: although the sequences contain both words and dependency relations from the original dependency path, we found that wordbased items are more important.
  • Number of sequence databases in which the pattern appear: if the pattern can be found in various sequence databases, it is more likely that the pattern is common and it should not be a key pattern of any relation.


Therefore, weight of a pattern with respect to a relation r is calculated as:

w_r(p) = \frac{\irf(p)\times support_{D_r}(p)\times l(p) \times e^{lex(p)}}{\mid D_r \mid}

  • Dr is the sequence database of r, supportDr(p) is the support of p in Dr.
  • irf(p) is Inverted Relation Frequency of p calculated by log(frac{|R|}{|M(p)|}), where R is set of relations and M(p) is set of sequence databases in which p occurs.
  • l(p) is length of p, lex(p) is the number of word-based items in p.

Relation Selection

Given a novel sentence and the anchors of an entity pair in it, we will predict the appropriate relation of the pair. We extract the dependency path P, transform P into sequential pattern and then accumulate the scores of its subsequences for each relation r:

L_r(P)= \displaystyle\sum_{p in S(p)} \omega_{r (P)}

  • Lr(P) likelihood score to say that P expresses relation r
  • S(P) set of all subsequences of the sequential representation of P
  • The appropriate relation should be the one giving highest score to P:

R= \arg\displaystyle\max_r L_r(P)

4.3 Learning Patterns with Dependency Path and Semantic Role

In this method, the only additional step is to augment dependency trees with PA structure, all the other steps are the same to those of method in Section 4.2.

Frame Semantics theory: A frame defines relationships between a predicate and its participants in a context, which form Predicate-Argument(PA) structure[13].

We use the SNoW-based Semantic role Labeler [16], a state-of-art in SRL task which conforms the definition of PropBank and CoNLL-2005 shared task on SRL. Since the SRL task just labels roles to constituents or phrases without indicateing which primitive concept playing the role, we still use dependency parsing information to further analyze the phrases. We combine the two infomation sources by integrating semantic role information into dependency parse tree of a sentence as follows:

  • For each predicate P and its role r, identify headwords of the two phrases.
  • Place the semantic relation between teh headwords into dependency tree. The relation is directed, receiving the headword of P as its head, head word of R as its tail and R as its label.

5. Experimental Setting

5.1 Data

  • real Wiki dumped on Aug 10,2006
  • articles including the summary sections  only

summary sections have templates: Infobox_Company, Infobox_Senetor, Infoxbox_Celebrity

[5] Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and patterns in Text.

[9]Extracting Relations from Text

[10]Correference for nlp applications.

[11] Dependency-Based Evaluation of Minipar

[12] Mining Sequential patterns by Pattern-Growth: The prefixspan approach.

[13] The proposition Bank: An Annotated Corpus of Semantic Roles

[16] Generalized Inference with Multiple Semantic Role Labeling Systems

Automatic Sentiment Analysis in On-line Text

  • July 29, 2010 6:07 pm

Erik Boiy el.

We will give an overview of various techniques used to tackle the problems in the domain of sentiment analysis, and add some of our own results.

2.2 Emothions in Written Text


A lot of linguistic scholars agree on the three dimensions of Osgood and al. [1] who investigated how the meaning of words can be mapped in a semantic space.

(1)Evaluation( positive/ negative)

(2) Potency (powerful/(unpowerful)

(2.1) Proximity (near/far)

(2.2) Specificity(clear/vague)

(2.3) Certainty (confident/doubtful)

(3) Intensifiers (more/less)

3. Methodology

There are two main techniques for sentiment classification: symbolic techniques and machine learning techniques. The symbolic approach uses manually crafted rules and lexicons, where the machine learning approach uses unsupervised, weakly supervised or fully supervised learning to construct a model from a large training corpus.

3.2 Machine Learning Techniques

3.2.1 Feature Selection

The most important decision to make when classfiying documents , is the choice of the feature set. Several features are commonly used, like unigrams or part-of-speech data. Features and their values are commonly stored in a feature vector.


This is the classic approach to feature selection, in which each document is represented as a feature vector, where the elements indicated the presenece of a word in the document. (keywords)


A word N-grams is a subsequence of N words from a given sequence. This means that the features in the document representation are not single words, but pairs(bigrams), triples(trigrams) or even bigger tuples of words.


basic dictionary form


A solution for this is to tag each word after the negation until the first punctuation.

Opinion words


Wiebe noted in [15] that adjectives are good indicators for subjectivity in a document. Salvetti used wordnet to enrich the only-adjective feature vectors.

3.2.2 machine Learning Techniques

Supervised Methods

The method that in the literature often yields the highest accuracy regards a Support vector machine classifier.

(1) SVM

SVM operate by constructing a hyplerplane with maximal Euclidean distance to the closest training expamples.

(2) Naive Bayes Multinomial

A naive Bayes classifier uses Bayes rule (which states how to update or revise believe in the light of new evidence) as its main equation, under the naive assumption of conditional independence.

(3) Maximum Entropy (Maxent)

The approach tries to preserve as much uncertainty as possible. A number of models are computed, where each feature corresponds to a constraint on the model. The model with most entropy over all models that satisfy these constraints is selected for classification.

unsupervised and weakly-supervised methods

Unsupervised methods can label a corpus,that is later used for supervised learning ( especially semantic orientation is helpful for this).

4. Challenges

4.3 Cross-domain Classification

How can we learn classifiers on one domain and use them on another domain. One possible approach is to train the classifier on a domain-mixed set of data instead of training it on one specific domain.

5. Results

5.3.2 Our Experiments

SVM -> SVM light

naive Bayes multinomial -> Weka

Maximum Entropy -> Maxent from OpenNLP

6. Dicussion

The advantages of unigrams and bigrams over the other features are that they are faster to extract, and requried no extra resources to use, while e.g. adjectives requrire a POS tagger to be run on the data frist, and subjectivity analysis requires an additional classifier to be used.

NBM is considerably faster.

=================My Summary===============

How to read the related papers?

  • Problem: is it a binary problem, e.g, supportive / in supportive , or a rating problem?
  • Do they do the syntactic preprocessing? 
    • part-of-speech parser
    • wordnet for synonym
  • What is the dataset
    • well-formed corpus
    • corpus collected directly from the internet
  • What is the machine learning mathodology
    • Feature selection
      • Unigrams: each documents is represented as a feature vector
      • N-grams: a subsequence of N words from a given sequence
      • Lemmas: basic dictionary form
      • Negation
      • Opinion words
      • Adjectives
    • Techniques(Classifier)
      • Supervised
        • SVM
        • Naive Bayes Multinomial
        • Maximum Entropy
      • Unsupervised
      • Semi-supervised
    • Evaluation
      • How many folds
  • Other Statistic Method
    • Markov Model
    • Conditional Random Field
    • N-grams model
    • semantic Orientation
  • What is the setup– the detail

The Reduplication of B.Pang’s work on Sentiment Classification

  • July 25, 2010 6:24 pm

B.Pang, L.Lee Thumbs up? Sentiment Classification using Machine Learning Techniques


“Our data source was the Internet Movie Database (IMDb) archive of the newsgroup( We selected only reviews where the author rating was expressed either with stars or some numerical value. Ratings were automatically extracted and converted into one of the three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment”


“To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 review per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented.”

“To prepare the documents, we automatically removed the rating indicators and extracted the textual information from the original HTML document format, treating punctuation as separate lexical items. No stemming or stoplists were used”

“To create a data set with uniform class distribution (studying the effect of skewed class distributions was out of the scope of this study), we randomly selected 700 positive-sentiment and 700 negative-sentiment documents. We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold. “

Feature Selection

  • bigrams with frequency at 7

“For this study, we focused on features based on unigrams(with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at lest four times in our 1400 document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams,since we consider bigrams ( and n-grams in general) to be an orthogonal way to incorporate context.”

Document Vector Generation

  • Frequency: the # of times fi occurs in document, and d = (n1(d),n2(d),…,nm(d))

“To implement these machine learning algorithms on our document data, we used the following standard bag-of-features framework. Let {f1,…,fm} be a predefined set of m features that can appear in a document; examples include the word “still” or the bigram “really stinks”. Let ni(d) be the number of times fi occurs in document d. Then, each document d is represented by the document vector d:=(n1(d),n2(d),…,nm(d))

  • Presence: ni(d) either is 1 or 0

“However, the definition of the MaxEnt feature/ class functions Fi,c only reflects the presence or absence of a feature, rather than directly incorporating feature frequency. In order to investigate whether reliance on frequency information could account for the higher accuracies of Naive Bayes and SVMs, we binarized the document vectors, setting ni(d) to 1 if and only feature fi appears in d, and reran Naive Bayes and SVMlight on these new vectors”

SVM: SVMlight with default setting

“We used Joachim’s SVMlight package for training and testing,with all paramenters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly)”

Hand-by-hand Manual

  1. 700 / 700 pos neg randomly: really randomly?
  2. divide into 3 equal-size folds: how is that possible?
  3. bigrams with frequency >= 7
  4. Generate the document vector frequency/presence: write the program
  5. SVM : how to do the train / test see the ML book for detail
    Page 2 of 212