Nanos gigantium humeris insidentes!
You are currently browsing the References category


  • July 29, 2010 2:40 pm


  • 1996-Building Probabilistic Models for natural Language
  • 2009-Statistical Language Models for Information Retrieval

Machine Learning(Book)

  • Pattern Recognition and Machine Learning
  • Semi-Supervised Learning

Relation Extraction(paper)

  • Extracting relations from Text: From Word Sequences to dependency Paths
  • Relation Extraction from wikipedia using subtree mining
  • Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia
  • Semi-supervised Semantic Role Labeling Using the Latent Words Language Model
  • Kernel Methods for Relation Extraction

Note of Exploiting Syntactic and semantic Information

  • July 25, 2010 3:27 pm

Paper: Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia

The Semantic Web is based on RDF[3], a representation language using Notation 3 or N3[4]. We follow the formalism of Semantic Web, specifically N3, in which we structure Wikipedia’s content as a collection of statements. Each statement consists of a subject, a predicate and an object. … The statements with the use of a domain-specific ontology can then be straightforwardly transformed into RDF format that in turn serves as machine-processable knowledge base.

Our method, unlike other works, mines the key patterns from syntactic and semantic structure to measure similarity between entity pairs rather using only lexical information as in [5-8] or hard matching of dependency paths as in [9].

In details, we attempt to integrate syntactic and semantic information of text to form an unified structure. We then decompose the structure into subsequences and mine the frequent ones with the aim to capture the key patterns for each relationship.

2. Problem Statement

We aim at extracting binary relations between entities from English version of Wikipedia articles. A 2-tuple (ep, es) and a triple (ep,rel, es) denote an entity pair and a binary relation respectively,
where ep and es are entities which may be PERSON, ORGANIZATION, LOCATION, TIME OR ARTIFACT and
rel denotes the directed relationship between ep and es, which may be one of following 13 relations: CEO, FOUNDER, CHAIRMAN, COO, PRESIDENT, DIRECTOR, VICE CHAIRMAN, SPOUSE, BIRTH DATA, BIRTH PLACE, FOUNDATION, PRODUCT  and LOCATION.

We follow [5] to define the entity mainly discussed in an article as principal entity, and other mentioned entities in the same article as secondary entities. We assume that interested entities in this problem should have a descriptive article in Wikipedia. Thus, no entity disambiguation and entity recognition is required in our system. The identifier of an entity is defined as the URL address to its appropriate article.

Our system predicts only the relations between the principal entity and each mentionded secondary entity in an article. As one more assumption, the relationship between an entity pair can be completely expressed in one sentence.

So that ,for an article, only the sentences that contain an principal entity and a secondary entity are necessarily to be analyzed.

4. Extract Relations from Wikipedia

4.1 Relation Extraction Framework

  1. articles should be processed to remove the HTML tags, extract hyperlinks which point to other wikipedia’s articles.passed to pre-processor: Sentence Splitter, Tokenizer and Phrase Chunker
  2. Parallelly processed to anchor all occurrences of principal entities and secondary entities. The Secondary Entity detector simply labels appropriate surface text of the hyperlinks as secondary entities.
  3. Sentence Selector chooses only sentences which contain the principal entity and at least one secondary entity. Each of such pairs becomes a relation candidate.
  4. The trainer receives articles with HTML tags to identify summary sections and extract ground truce relations annotated by human editors.
  5. Previously selected sentences that contain entity pairs from ground true relations are identified as training data.
  6. The trainer will learn the key patterns with respect to each relation.
  7. During testing, for each sentence and an entity pair on it, the Relation Extractor will identify the descriptive label and then outputs the final results.

Principal Entity Detector

  • Most of the pronouns in an article refer to the principal entity.
  • The first sentence of the article is often used to briefly define the principal entity.

We use rules to identify a set of referents to the principal entity, including three types[10]:

  • pronoun (“he”,”him”,”they”…)
  • proper noun (e.g., Bill Gates, William Henry Gates, Microsoft, …)
  • common nouns ( the company, the software, …)

Supported by the nature of Wikipedia, our technique performs better than those of the co-reference tools in LingPipe library and in OpenNLP tool set. All the occurrences of the collected referents are labeled as principal entity.

Training Data Builder

examine whether the pair is in ground true relation set or not. if yes, it attaches the relation label to the pair and create a new training sentence for the relation. For a relation r, the purpose of building training data is to collect the sentences that exactly express r. To reduce noise in training data, it is necessary to eliminate the pairs from the ground truce set which hold more than one relation.

4.2 Learning Patterns with Dependency Path

In this section, we will explain our first method to extracting relation using syntactic information.

Follow the idea in [9] we assume that the shortest dependency path tracing from a principal entity through the dependency tree to a secondary entity gives a concrete syntactic structure expressing relation between the pair.

Key patterns learning from the dependency paths for each relationship.

  1. derive dependency trees of the training sentences by Minipar parser and extract paths between entity pairs
  2. transform the paths into sequences which are in turn decomposed into subsequences
  3. From the subsequence collections of a relation r, we can identify the frequent subsequences for r.
  4. During testing, dependency path between an entity pair in an novel sentence is also converted into sequence and match with the previously mined subsequences.

Sequential Representation of Dependency Path

A word together with its Part-Of-Speech tag will be an element of the sequence.


Learning Key Patterns as Mining Frequent Sequence

PrefixSpan, which is introduced in [12], is known as an efficient method to mining sequential patterns. A sequence s=s1s2…sn;, where si is an itemset, is called subsequence of a sequence p =p1p2…pm; if there exists integers 1 <= j12<…n<=m such that s1 ⊆ pj1, …, sn ⊆ pjn.

In this research, we use the implementation tool of PrefixSpan developed by Taku kudo.

From here, sequence database denotes the set of sequences converted from dependency paths with respect to a relation.

Weighting The Patterns

It is necessary for each mined pattern to be assigned a weight with respect to a relation for estimating the relevance. Factors:

  • Length of the pattern: if two paths share a long common subpattern, it is more likely that the paths express the same relationship.
  • Support of the pattern: is the number of sequences that contain the pattern. It is more likely that a pattern with high support should be a key pattern
  • Amount of lexical information: although  the sequences contain both words and dependency relations from the original dependency path, we found that wordbased items are more important.
  • Number of sequence databases in which the pattern appear: if the pattern can be found in various sequence databases, it is more likely that the pattern is common and it should not be a key pattern of any relation.

wr(p) = ir f(p) x supportDr(p) x l(p) x elex(p)/|Dr|

Relation Selection

Given a novel sentence and the anchors of an entity pair in it, we will predict the appropriate relation of the pair. We extract the dependency path P, transform P into sequential pattern and then accumulate the scores of its subsequences for each relation r:

L_r(P)= displaystylesum_{p in S(p)} omega_{r (P)}

  • Lr(P) likelihood score to say that P expresses relation r
  • S(P) set of all subsequences of the sequential representation of P
  • The appropriate relation should  be the one giving highest score to P:

R= argdisplaystylemax_r L_r(P)

4.3 Learning Patterns with Dependency Path and Semantic Role

We use the SNoW-based Semantic role Labeler [16], a state-of-art in SRL task which conforms the defintion of PropBank and CoNLL-2005 shared task on SRL. Since the SRL task just labels roles to constituents or phrases without indicateing which primitive concept playing the role, we still use dependency parsing information to further analyze the phrases.

[5] Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and patterns in Text.

[9]Extracting Relations from Text

[10]Correference for nlp applications.

[12] Mining Sequential patterns by Pattern-Growth: The prefixspan approach.

Useful Survey

  • July 20, 2010 2:03 am

A survey on sentiment detection of reviews

Sentiment Analysis: A Combined Approach


  • July 20, 2010 2:03 am


A survey on sentiment detection of reviews


Sentiment Analysis: A Combined Approach

classifiers model reference

  • July 19, 2010 11:09 pm

naive bayes classifer

support vector machine

Vladimir N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.


  • July 19, 2010 1:37 am

***Feature subsumption of opinion analysis. Proceedings of EMNLP, 2006.
Semantic role extracting

Extracting Appraisal Expressions


Sentiment analysis: a combined approach


Automated learning of appraisal extraction patterns§.


Sentiment analysis: a new approach for effective use of linguistic knowledge and exploiting similarities in a set of documents to be classified

Sentiment Classification using Word Sub-Sequences and Dependency Sub-Tree


Assessing Sentiment of Text by Semantic Dependency and Contextual Valence Analysis


Note of Lexical Filtering on Overall Opinion Polarity Identification

  • July 18, 2010 4:56 pm

F. Salvetti, S.Lewis, C.Reichenbach.Impact of Lexical Filtering on Overall Opinion Polarity Identification


HTML documents were converted to plain text, tagged using the Brill tagger, and fed into filters and classifiers.

Basic Assumption or Points:

Related Research

Research has demonstrated that there is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion (Wiebe, Bruce, & O’ Hara 1999).

Hatzivassiloglou & McKeown 1997), combined a log-linear statistical model that examined the conjunctions between adjectives,(such as “and”, “but”, “or”), with a clustering algorithm that grouped the adjectives into two sets which were then labelled positive and negative.

Turney extracted n-grams based on adjectives( Turney 2002). In order to determine if an adjective had a positive /negative polarity he used AltaVista and its function NEAR. He combined the number of co-occurrences of the adjective under investigation NEAR the adjective ‘excellent’ and NEAR the ‘poor’, thinking that high occurrence NEAR ‘excellent’ implies positive polarity.


The cornell data consists of 27,000 movie reviews in HTML form, using 35 different rating scales such as A…F or 1…10 in addition to the common 5 star system. We divided them into two classes (positive and negative) and took 100 reviews from each class as the test set.


Features for analysis

Three basic approaches for handling this kind of data pre-processing come to mind:

  • Leave the data as-is : Each word will be represented by itself
  • Parts-of-speech tagging: Each word is enriched by a POS tag, as determined by a standard tagging technique (such as the Brill Tagger(Brill 1995))
  • Perform POS taggin and parser (Using e.g. the Penn Tree-bank (Marcus, Santorini, & marcinkiewicz 1994))—severe performance issues

We thus focus our analysis in this paper on POS-tagged data (sentences consisting of words enriched with information about their parts of speech).

We Thus make the following assumptions about our test and training data:

  1. All words are transformed into upper case,
  2. All words are stemmed,
  3. All words are transformed into (word,POS) tuples by POS tagging (notation word/ POS).

All of these are computationally easy to achieve ( with a reasonable amount of accuracy ) using the Brill Tagger.



  • Data: cornell
  • Part-of-speech tagger: Brill tagger (Brill 1995)
  • wordnet: 1.7.13

Part of Speech Filters

Any portion that does not contribute to the OvOP is noise. To reduce noise, filters were developed that use POS tags to do the following.

  1. Introduce custom parts of speech when the tagger does not provide desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review(determiner, preposition, etc)
  3. Reduce parts of speech that introduce unneccessary variance to POS only

The POS filters are not designed to reduce the effects of conflicting polarity. They are only designed to reduce the effect of lack of polarity.

One design principle of the filter rules is that they filter out parts of speech that do not contribute to the semantic orientation and keep the parts of speech that do contribute such meaning. Based on analysis of movie review texts, we  devised “filter rules” that take Brill-tagged text as input and return less noisy, more concentrated sentences that have a combination of words and word/POS-tag pairs removed from the original.  A summary of the filter rules defined in this experiment is shown in Table 2.

Table 2: Summary of POS filter rules
POS r1 r2 r3 r4 r5

K: keep                   D: Drop               G:Generalize

Wiebe et al., as well as other researchers, showed that subjectivity is especially concentrated in adjectives ( Wiebe, Bruce, & O’ Hara 1999; Turney & Littman 2003). Therefore, no adjectives or their tags were removed, nor were copula verbs or negative markers. However, noisy information such as determiners, foreign words, prepositions, modal verbs, possesives, particles, interjections, etc. were removed from the text stream. Other parts of speech, such as nouns and verbs, were removed but their POS-tags were retained.

WordNet filtering


===============Summary by me===============

There is a strong positive correlation between the presence of adjectives in a sentence and the presence of opinion. (paper to read)

Turney extracted n-grams based on adjectives( Turney 2002). In order to determine if an adjective had a positive /negative polarity he used AltaVista and its function NEAR. (paper to read)

___________Brill tagger                    POS filter                     Classifiers
plain text========>Result 1======>Result2=========>Result3.

About the POS filter

  1. Introduce custom parts of speech when the tagger does not provide desired specificity (negation and copula)
  2. Remove the words that are least likely to contribute to the polarity of a review(determiner, preposition, etc)
  3. Reduce parts of speech that introduce unneccessary variance to POS only

The Reduplication of B.Pang’s work on Sentiment Classification

  • July 17, 2010 4:03 pm

B.Pang, L.Lee  Thumbs up? Sentiment Classificaion using Machine Learning Techniques


“Our  data source was the Internet Movie Database (IMDb) archive of the newsgroup( We selected only reviews where the author rating was expressed either with stars or some numerical value. Ratings were automatically extracted and converted into one of the three categories: positive, negative, or neutral. For the work described in this paper, we concentrated only on discriminating between positive and negative sentiment”


“To avoid domination of the corpus by a small number of prolific reviewers, we imposed a limit of fewer than 20 review per author per sentiment category, yielding a corpus of 752 negative and 1301 positive reviews, with a total of 144 reviewers represented.”

“To prepare the documents, we automatically removed the rating indicators and extracted the textual information from the original HTML document format, treating punctuation as separate lexical items. No stemming or stoplists were used”

“To create a data set with uniform class distribution (studying the effect of skewed class distributions was out of the scope of this study), we randomly selected 700 positive-sentiment and 700 negative-sentiment documents. We then divided this data into three equal-sized folds, maintaining balanced class distributions in each fold. “

Feature Selection

  • bigrams with frequency at 7

“For this study, we focused on features based on unigrams(with negation tagging) and bigrams. Because training MaxEnt is expensive in the number of features, we limited consideration to (1) the 16165 unigrams appearing at lest four times in our 1400 document corpus (lower count cutoffs did not yield significantly different results), and (2) the 16165 bigrams occurring most often in the same data (the selected bigrams all occurred at least seven times). Note that we did not add negation tags to the bigrams,since we consider bigrams ( and n-grams in general) to be an orthogonal way to incorporate context.”

Document Vector Generation

  • Frequency: the # of times fi occurs in document, and d = (n1(d),n2(d),…,nm(d))

“To implement these machine learning algorithms on our document data, we used the following standard bag-of-features framework. Let {f1,…,fm} be a predefined set of m features that can appear in a document; examples include the word “still” or the bigram “really stinks”. Let ni(d) be the number of times fi occurs in document d. Then, each document d is represented by the document vector d:=(n1(d),n2(d),…,nm(d))

  • Presence: ni(d) either is 1 or 0

“However, the definition of the MaxEnt feature/ class functions Fi,c only reflects the presence or absence of a feature, rather than directly incorporating feature frequency. In order to investigate whether reliance on frequency information could account for the higher accuracies of Naive Bayes and SVMs, we binarized the document vectors, setting ni(d) to 1 if and only feature fi appears in d, and reran Naive Bayes and SVMlight on these new vectors”

SVM:  SVMlight with default setting

“We used Joachim’s SVMlight package for training and testing,with all paramenters set to their default values, after first length-normalizing the document vectors, as is standard (neglecting to normalize generally hurt performance slightly)”

Hand-by-hand Manual

  1. 700 / 700 pos neg randomly: really randomly?
  2. divide into 3 equal-size folds: how is that possible?
  3. bigrams with frequency >= 7
  4. Generate the document vector frequency/presence: write the program
  5. SVM : how to do the train / test see the ML book for detail


  • July 12, 2010 2:36 am

Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia.pdf

Relation Extraction from Wikipedia Using Subtree Mining.pdf

Using Wikipedia for AutomaticWord Sense Disambiguation.pdf

2009-Semi-supervised Semantic Role Labeling.pdf

2007-Semi-Supervised Learning for Semantic Parsing using support vector machine.pdf

Word Sense Disambiguation with Semi-Supervised Learning

Paper to read for the two candidative projects

  • July 11, 2010 4:10 pm


  • Introduction to FOPC in Chinese version of Wikipedia
    • a discussion on problem of FOPC
  • Toward the expressive power of natural language  in  Principles of Semantic Networks
    • Topic on expressive power
  • Natural Language, Knowledge Representation, and Logical Form
    • has a discussion on expressive power
  • Language, Proof and Logic BC61 .B38 2002
    • Other expressive limitations of first-order logic