Nanos gigantium humeris insidentes!


  • October 12, 2010 1:01 am


Nguyen Bach & Sameer Badaskar
Language Technologies Institute
Carnegie Mellon University

Original Source:

Supervised Relation Extraction

  • Formulate relation extraction as a supervised classification task.
  • Focused on feature-based and kernel methods


  • Structuring the information on the web
  • Involves annotating the unstructured text with
    • Entities
    • Relations between entities
  • Extracting semantic relations between entities in text


  • Example 1: “Bill Gates works at Microsoft Inc.”
    • Person-Affiliation(Bill Gates, Microsoft Inc)
  • Example 2: Located-In(CMU, Pittsburgh)
  • Higher order relations
    • Protein-Organism-Location
  • Entity tuple: entities are bound in a relation
    • r(e1,e2,…,en)


  • Question Answering: Ravichandran & Hovy (2002)
    • Extracting entities and relational patterns for answering factoid questions (Example: “When was Gandhi born ?” amounts to looking for Born-In(Gandhi, ??) in the relational database)
  • Mining bio-medical texts
    • Protein binding relations useful for drug discovery
    • Detection of cancerous genes (“Gene X with mutation Y leads to malignancy Z”)


  • Datasets
    • Automatic Content Extraction (ACE)
    • Message Understanding Conference (MUC-7)
  • Supervised Approaches
    • Relation extraction as a classification task.
    • Precision, Recall and F1
  • Semi-supervised Approaches
    • Bootstrapping-based approaches result in the discovery of a large number of patterns and relations.
    • An approximate value of precision is computed by drawing a random sample and manually checking for actual relations


  • Supervised approaches
    • Feature based
    • Kernel based
    • Concerns
  • Semi-supervised approaches
    • Bootstrapping
    • DIPRE, Snowball, KnowItAll, TextRunner
  • Higher-order relation extraction

Supervised Approaches (1)

  • Formulate the problem as a classification problem (in a discriminative framework)
  • Given a set of +ve and –ve training examples
  • Sentence: S = w1 w2 … e1 … wi … e2 … wn-1 wn

f_R(T(S)) = \begin{cases} +1, & \text{if } e_1 \text{ and } e_2 \text{ are related by } R \\ -1, & \text{otherwise} \end{cases}

Supervised Approaches (2)

  • fR(.) can be a discriminative classifier
    • SVM, Voted Perceptron, Log-linear model …
    • Can also be a multiclass classifier!
  • T(S) can be
    • A set of features extracted from the sentence
    • A structured representation of the sentence (labeled sequence, parse trees)

Supervised Approaches (3)

  • Features
    • Define the feature set
    • Similarity metrics like cosine distance can be used
  • Structured Representations
    • Need to define the similarity metric (Kernel)
    • Kernel similarity is integral to classifiers like SVMs.

Supervised Approaches (4)


  • Khambhatla (2004), Zhou et al. (2005)
  • Given a sentence
    • 1. Perform textual analysis (POS tagging, parsing, NER)
    • 2. Extract
      • Words between and including the entities
      • Types of the entities (person, location, etc.)
      • Number of entities between the two entities; whether both entities belong to the same chunk
      • Number of words separating the two entities
      • Path between the two entities in a parse tree
  • Textual Analysis involves POS tagging, dependency parsing etc.
  • What forms a good set of features ?
  • Choice of features guided by intuition and experiments.
  • The alternative is to use structured representations and define an appropriate similarity metric for the classifier!
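As a rough sketch of the word-level features listed above (function and feature names are hypothetical, not from the papers):

```python
# Illustrative sketch of Khambhatla (2004)-style lexical features for a
# tokenized sentence; entity spans are (start, end) token indices.

def extract_features(tokens, e1, e2):
    """Return a feature dict for the entity pair (names are illustrative)."""
    (s1, t1), (s2, t2) = sorted([e1, e2])
    between = tokens[t1:s2]  # words strictly between the two entities
    return {
        "words_including": tuple(tokens[s1:t2]),  # words between and including entities
        "words_between": tuple(between),
        "num_words_between": len(between),        # "# words separating the two entities"
    }

tokens = "Bill Gates works at Microsoft Inc".split()
feats = extract_features(tokens, (0, 2), (4, 6))
# feats["words_between"] == ("works", "at")
```

Parse-tree and entity-type features would be added to the same dict once a parser and NER tagger have run.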


  • Kernel K(x,y) defines similarity between objects x and y implicitly in a higher dimensional space
  • (x,y) can be strings: similarity = number of common substrings (or subsequences) of x and y
    • Example: sim(cat,cant) > sim(cat,contact)
    • Excellent introduction to string kernels in Lodhi et al. (2002)
  • Extend string kernels to word sequences and parse trees for relation extraction
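A toy illustration of the substring-counting idea: unweighted distinct common substrings, not the gap-weighted subsequence kernel of Lodhi et al., but enough to reproduce the ordering in the example above.

```python
def common_substrings(x, y):
    """Count distinct substrings shared by x and y -- a crude, unweighted
    stand-in for the gap-weighted subsequence kernel of Lodhi et al. (2002)."""
    subs = lambda s: {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
    return len(subs(x) & subs(y))

# "cat" shares {c, a, t, ca} with "cant" but only {c, a, t} with "contact",
# so sim(cat, cant) > sim(cat, contact).
```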

Kernels (Word Subsequences)

Word context around entities can be an indicator of a relation (Bunescu & Mooney, 2005a)

Each word is augmented with its POS, Generalized POS, chunk tag (NP, VP, etc), entity type (Person, Organization, none)

Kernels (Trees)

  • Tree kernels differ over types of trees used and attributes of nodes
  • Zelenko et al. (2003)
    • Use shallow parse trees. Each node contains
      • Entity-Role (Person, Organization, Location, None)
      • Text it subsumes
      • Chunk tag (NP, VP etc)
    • Tasks: organization-location, person-affiliation detection
    • Tree kernel with SVM improves over feature-based SVM for both tasks (F1 improvements of 7% and 3%, respectively)
  • Culotta & Sorensen (2004)
    • Use dependency trees. Node attributes are
      • Word, POS, Generalized POS, Chunk tag, Entity type, Entity-level, Relation argument
  • These tree kernels are rigid: attributes of nodes must match exactly!


  • Bunescu & Mooney (2005b)
    • Sufficient to use only the shortest path between entities in a dependency tree.
    • Each word in shortest path augmented with POS, Generalized POS, Entity type etc…
    • Structure of the dependency path is also encoded
    • Performs the best among all kernels

Kernels vs. Features

Supervised Approaches (Concerns)

  • Perform well but difficult to extend to new relation-types for want of labeled data
  • Difficult to extend to higher order relations
  • Textual analysis like POS tagging, shallow parsing, dependency parsing is a pre-requisite. This stage is prone to errors.

Rationales in Relation Extraction


  • Supervised approaches
    • Feature-based and kernel methods
  • Semi-supervised approaches
    • Bootstrapping
  • Higher-order relation extraction
  • Applications
    • Question-Answering
    • Mining biomedical text
  • Evaluation

A Shortest Path Dependency Kernel for Relation Extraction

  • August 23, 2010 8:53 pm

1. Introduction

Information Extraction (IE) is traditionally divided into three sub-problems: co-reference resolution, named entity recognition, and relation extraction.

In this paper we focus exclusively on extracting relations between predefined types of entities in the ACE corpus.

The use of automatically derived syntactic information can lead to significant improvements in extraction accuracy.

The amount of syntactic knowledge used in IE systems varies from part-of-speech tags only (Ray and Craven, 2001), to chunking (Ray and Craven, 2001), to shallow parse trees (Zelenko et al., 2003), to dependency trees derived from full parse trees (Culotta and Sorensen, 2004).

The performance, however, depends not only on the amount of syntactic knowledge used, but also on the details of the exact models using this information.

2. Dependency Graphs

Let e1 and e2 be two entities mentioned in the same sentence such that they are observed to be in a relationship R, i.e. R(e1,e2) = 1.

We assume that a relation is to be extracted only between entities mentioned in the same sentence, and that the presence or absence of a relation is independent of the text preceding or following the sentence.

A subset of these word-word dependencies capture the predicate-argument relations present in the sentence. Arguments  are connected to their target predicates either directly through an arc pointing to the predicate, or indirectly through a preposition or infinitive particle.

Word-word dependencies are typically categorized in two classes as follows:

  • Local Dependencies
  • Non-local Dependencies

A Context-Free Grammar (CFG) parser can be used to extract local dependencies, which for each sentence form a dependency tree. Mildly context-sensitive formalisms such as Combinatory Categorial Grammar (CCG) (Steedman, 2000) model word-word dependencies more directly and can be used to extract both local and long-distance dependencies, giving rise to a directed acyclic graph.

3 The Shortest Path Hypothesis

The contribution of the sentence dependency graph to establishing the relationship R(e1,e2) is almost exclusively concentrated in the shortest path between e1 and e2 in the undirected version of the dependency graph.

If entities e1 and e2 are arguments of the same predicate, then the shortest path between them will pass through the predicate, which may be connected directly to the two entities, or indirectly through prepositions. If e1 and e2 belong to different predicate-argument structures that share a common argument, then the shortest path will pass through this argument.
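The hypothesis is easy to operationalize: treat the dependency arcs as undirected edges and run a breadth-first search. A minimal sketch (names and example arcs are illustrative):

```python
from collections import deque

def shortest_path(edges, src, dst):
    """Shortest path between two words in the undirected version of a
    dependency graph; `edges` are (head, dependent) arcs, direction ignored."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:                 # reconstruct the path back to src
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, ()):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

# Case 1: both entities are arguments of the same predicate ("seized").
deps = [("seized", "protesters"), ("seized", "stations")]
# Case 2: two predicate-argument structures sharing the argument "troops".
deps2 = [("warning", "ministers"), ("warning", "troops"),
         ("raided", "troops"), ("raided", "churches")]
```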

For the rest, check another paper from the same authors.

Note of Discovering Relations among named Entities from Large Corpora

  • August 23, 2010 3:55 pm

Paper:  Discovering Relations among named Entities from Large Corpora

1. Introduction

Our method does not need the richly annotated corpora required for supervised learning — corpora which take great time and effort to prepare. It also does not need any instances of relations as initial seeds for weakly supervised learning.

Instead, we only need a named entity (NE) tagger to focus on the named entities which should be the arguments of relations. Recently developed named entity taggers work quite well and are able to extract named entities from text at a practically useful level.

3. Relation Discovery

3.2 Named entity tagging

Sekine proposed 150 types of named entities (Sekine et al., 2002). We use an extended named entity tagger (Sekine, 2001) in order to detect useful relations between extended named entities.

3.3 NE pairs and context

We define the co-occurrence of NE pairs as follows: two named entities are considered to co-occur if they appear within the same sentence and are separated by at most N intervening words. We also set a frequency threshold to remove infrequently co-occurring pairs.
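A minimal sketch of this pair-collection step (parameter names and threshold values are illustrative, not the paper's):

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(sentences, max_gap, min_freq):
    """Collect NE pairs co-occurring within `max_gap` intervening words,
    keeping only pairs above a frequency threshold."""
    counts = Counter()
    for tokens, entities in sentences:  # entities: {token_index: NE string}
        for (i, a), (j, b) in combinations(sorted(entities.items()), 2):
            if j - i - 1 <= max_gap:    # number of intervening words
                counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= min_freq}

sentences = [
    ("IBM hired Watson".split(), {0: "IBM", 2: "Watson"}),
    ("IBM praised Watson".split(), {0: "IBM", 2: "Watson"}),
    ("IBM announced yesterday a deal with Oracle".split(), {0: "IBM", 6: "Oracle"}),
]
pairs = cooccurring_pairs(sentences, max_gap=3, min_freq=2)
# only ("IBM", "Watson") survives both the gap and frequency filters
```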

3.4 Context similarity among NE pairs

We adopt a vector space model and cosine similarity in order to calculate the similarities between the set of contexts of NE pairs.

The cosine similarity cosine(θ) between context vectors α and β is calculated by the following formula.

 cosine(\theta) = \frac{\alpha \cdot \beta }{|\alpha||\beta|}

Cosine similarity varies from 1 to -1. A cosine similarity of 1 would mean these NE pairs have exactly the same context words with the NEs appearing predominantly in the same order, and a cosine similarity of -1 would mean these NE pairs have exactly the same context words with the NEs appearing predominantly in reverse order.
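The formula above can be sketched over sparse context vectors; here the sign of each count encodes NE order, which is one way (an assumption, not necessarily the paper's exact encoding) to make similarity reach -1 for reversed pairs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two context vectors given as
    {word: signed count} dicts; the sign encodes NE order."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Identical contexts in the same order give 1.0; identical contexts with the NEs reversed (negated counts) give -1.0.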

3.5 Clustering NE pairs

We make clusters of NE pairs based on the similarity.

3.6 Labeling clusters

We simply count the frequency of the common words in all combinations of the NE pairs in the same cluster.
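A minimal reading of this labeling step, counting each context word once per NE pair and taking the most frequent as the cluster label (a sketch, not the paper's exact procedure):

```python
from collections import Counter

def label_cluster(contexts):
    """Label a cluster of NE pairs by the most frequent words shared
    across their contexts; `contexts` is one word list per NE pair."""
    counts = Counter()
    for words in contexts:
        counts.update(set(words))   # count each word once per NE pair
    return counts.most_common(3)
```

For a cluster of company pairs whose contexts are [["acquired", "by"], ["was", "acquired"], ["acquired"]], the top label is "acquired".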

Note of An Integrated Approach for Relation Extraction

  • August 22, 2010 7:36 pm

From: An Integrated Approach for Relation Extraction from Wikipedia Texts


two kinds of patterns:

  1. dependency patterns from dependency analysis of texts in Wikipedia;
  2. surface patterns generated from highly redundant information from the web corpus.

1. Introduction

Currently the leading methods in unsupervised information extraction are based on collecting redundancy information from a local corpus or on using the Web as a corpus.

One of the main challenges and research interest for frequent pattern mining is how to abstract away from different surface realizations of semantic relations to discover discriminative patterns efficiently.

The common process is to generate linguistic patterns based on analysis of the syntactic, dependency or shallow semantic structure of text, then train to identify entity pairs which assume a relationship and classify them into pre-defined relationships. The advantage of these methods is using linguistic technologies to learn semantic information from different surface expressions.

2. Related Work

Discovering relations by clustering pairs of co-occurring entities represented as vectors of context features was introduced by (Hasegawa et al., 2004).

The field of Unsupervised Relation Identification (URI), the task of automatically discovering interesting relations between entities in large text corpora, was introduced by Hasegawa, Sekine et al. (2006): discovering relations by clustering pairs of co-occurring entities represented as vectors of context features.

Dependency patterns have the properties of being more accurate, less spam-prone and much less redundant than surface patterns from the web corpus.

4. Pattern combination approach for relation extraction

4.1 Overview of the approach

Our approach takes a set of Wikipedia articles as input and, for each article, outputs a list of concept pairs with relation labels.

There are four primary modules of our approach:

  • Text Preprocessor and Relation Candidate Generation
    Preprocesses Wikipedia articles to split text and filter sentences. For each article, it collects a set of concept pairs as relation candidates.
  • Web Context Collector
    Collects context information from the web corpus. For each relation candidate, it generates a set of ranked relational terms and a set of surface patterns.
  • Dependency Pattern Modeling
    Generates dependency patterns for each relation candidate from the corresponding sentences in Wikipedia articles.
  • Linear Clustering Algorithm
    Clusters relation candidates based on their context. Two sub-modules:

    • local clustering
    • global clustering

4.2 Text Preprocessor and relation candidate generation

  1. consider all hyper-linked concepts in the article as related concepts, which may share a semantic relationship with the entitled concept.
  2. we use a linguistic parser to split article text into sentences, and finally retain sentences which contain one of the references of the entitled concept and one of the linked texts for the dependency pattern modeling module.

4.3 Web Context Collection

4.4 Dependency pattern modeling


  1. shortest path dependency kernels outperform dependency tree kernels by offering a very condensed representation of the information needed to assess their relationship (Bunescu and Mooney, 2005)
  2. embedded structures of the linguistic representation are important for obtaining a good coverage of the pattern acquisition (Culotta and Sorensen, 2004; Zhang et al., 2006)

We generate dependency patterns as sub-paths from the shortest dependency path for each concept pair. The process of inducing dependency patterns is in two steps:

  1. Shortest dependency path inducing. From the dependency tree structure, we first induce the shortest dependency path between the entitled concept and the related concept.
  2. Dependency pattern generation. We then use a frequent tree-mining algorithm (Zaki et al., 2002) to generate dependency patterns from the shortest dependency path for relation clustering.

Note of Relation Extraction from wikipedia using subtree mining

  • August 8, 2010 10:56 am
  • Paper: Relation Extraction from Wikipedia Using Subtree Mining
  • Keywords that provide clues for each relation label are identified by a Keyword Extractor.
  • An Entity Classifier module classifies the entities into types to limit the available relations for entity pairs. The Relation Extractor extracts a subtree feature from a pair of the principal entity and a mention of a secondary entity.
  • It then incorporates the subtree feature together with entity type feature into a feature vector and classifies relations of the entity pairs using SVM-based classifiers.

Principal Entity Detector

We propose a simple but efficient technique to identify a set of referring expressions, denoted as F, which provides better results than those produced by the above coreference tools. We adopt (Morton, 2000) to classify the expressions in F into three types: (1) personal pronouns, (2) proper nouns, and (3) common nouns. Based on chunking information, the technique is as follows:

  1. Start with F = {}
  2. Select the first two chunks for F: the proper chunk of the article title and the first proper chunk in the first sentence of the article, if any. These are the first two names of the principal entity. If F is still empty, stop.
  3. For each remaining proper chunk p in the article, if p is derived from any expressions selected in (2), then F ← p. Proper chunk p1 is derived from proper chunk p2 if all its proper nouns appear in p2. These proper chunks are various identifiers of the principal entity.
  4. In the article, select c as the most frequent subjective pronoun, find c’ as its equivalent objective pronoun, and add both to F.
  5. For each chunk p with the pattern [DT N1 … Nk], where DT is a determiner and the Ni are common nouns: if p appears more frequently than all the selected pronouns in (4), then F ← p.

Entity Classifier

The entity type is very useful for relation extraction.

We first identify year, month and date by directly examining their surface text. Types of other entities, including principal entities and secondary entities, are identified by classifying their corresponding articles.

We develop an SVM-based classifier for each remaining type using a one-against-all strategy.

We represent an article in the form of a feature vector and use the following features:

  • category feature: categories collected when tracing from the article up to k levels of its category structure
  • pronoun feature: the most frequent subjective pronoun in the article
  • singular noun feature: singular nouns of the first sentence of the article

Keyword Extractor

Our hypothesis in this research is that there exist some keywords that provide clues to the relationship between a pair of entities.

  1. we map entities in such relations to those in sentences to collect sample sentences for each relationship
  2. The tf-idf model is exploited to measure the relevance of words to each relationship, for words on the dependency path between the entity pair
  3. we choose the keywords manually from lists of candidates ranked by relevance score with respect to each relation.
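Step 2 can be sketched with a plain tf-idf computation, treating each relationship's sample sentences as one document (an illustrative approximation, not the paper's exact weighting):

```python
import math
from collections import Counter

def tfidf_rank(docs):
    """Score words by tf-idf; each relationship's sample sentences form one
    'document'. Words common to every relationship score zero."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return scores

scores = tfidf_rank([["founded", "company"], ["born", "in", "company"]])
# "company" occurs for both relationships, so its score is 0; "founded" is
# specific to the first relationship and scores positive.
```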

Subtree Feature from the Dependency Path

read paper Bunescu Extracting relations from text: From word sequences to dependency paths.

Supervised Learning for Relation Extraction

We formulate our problem of relation classification as a multiclass, multi-label problem in which one SVM-based classifier is dedicated to each relation.

We represent each mention of a secondary entity in a sentence with respect to a relation r as a feature vector receiving values 0 and 1. Feature vectors are created from the type of principal entity, type of secondary entity, and the mined subtree of the sentence. The number of slots for a subtree feature depends on the relation. The principal entity might be absent in a sentence and its type is unchanged for all sentences in an article.

Morton, T. 2000. Coreference for NLP applications

Note Of Universal Networking Language-based Relation Extraction

  • August 6, 2010 4:01 pm

Paper: A Statistical Approach for Universal Networking Language-based Relation Extraction

Abstract: With the assumption that the positions of phrases in a sentence between which there exists a relation have been identified, we focus on the problem of classifying the relation between the given phrases.

The UNL relation classifier was developed by applying statistical techniques to several lexical and syntactic features. In addition to the commonly used features, we also propose a new feature that reflects the actual semantic relation of two phrases, independent of the words in between.

1. Introduction

Universal Networking Language (UNL) was defined as an artificial language that is able to represent information and knowledge described in natural languages [13].

  • Universal Words (UWs): UNL’s vocabulary.
  • Relations: define relationships between pairs of UWs.
  • Attributes: describe the subjectivity of sentences, including
    • time with respect to the speaker,
    • speaker’s view of aspect,
    • speaker’s view of reference,
    • speaker’s focus,
    • speaker’s attitudes and
    • speaker’s view point.

One of the important tasks in creating UNL representations from natural language text is to extract relations between pairs of UWs. The relation extraction task can be divided into two subtasks:

  1. identifying pairs of UWs between which a relation is likely to exist
  2. identifying the relation label for such pairs.

In this paper, we focus on the second problem. That means, we develop a label classifier given pairs of UWs between which there exists a relation.

The problem: given an English text and a pair of phrases in the text between which there exists a relation, identify the type of relation for the pair.

In this work, we apply statistical techniques to train our classifier.

  1. a number of features of the two phrases will be extracted from the text to create some feature vectors.
  2. count the occurrences of the relations for each feature vector
  3. estimate the probability of each relation type given a feature vector

Dataset: Universal networking Digital Language foundation(UNDL)

3. Method

source phrase and destination phrase

For each pair of source and destination phrases in a relation of the training set, we extracted linguistic features and created a feature vector.

  1. The number of occurrences of each relation label for each feature vector will be counted in the whole training set.
  2. In testing time, the probability for each relation label will be estimated given the feature vector associated with a pair of phrases.
  3. The relation label with highest probability will be chosen and assigned to the phrases.

A. Feature Selection

Phrase type, head word, voice, dependency path (from the Minipar parser), syntactic cross path

detail for Syntactic Cross Path

The string representing a path from the source position through the syntactic parse tree to the destination position. Procedure:

  • For the source and destination phrases, specify the highest nodes which still receive the head words from the phrases (H1, H2).
  • Specify the lowest common node that covers both of the phrases (C).
  • Assuming that the source phrase and the destination phrase are separate, there are three cases:
    • The node C is higher than the nodes H1 and H2: trace upwards from H1 to C, then trace downwards to H2.
    • The node H1 is higher than or equal to the node C: trace downwards from C to H2.
    • The node H2 is higher than or equal to the node C: trace upwards from H1 to C.

The advantage of the syntactic cross path feature: it reflects the syntactic structure of the two concepts represented by the two phrases, independent of the words lying between the phrases.
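The three cases above reduce to finding the lowest common node C and concatenating phrase labels along the way. A sketch under the assumption of a single "v" separator for every step (the paper's exact path notation may differ):

```python
def cross_path(parent, label, h1, h2):
    """Syntactic cross path between nodes h1 and h2 of a parse tree.
    `parent` maps node -> parent node; `label` maps node -> phrase label.
    Climbs from h1 to the lowest common node C, then descends to h2."""
    def ancestors(n):
        chain = [n]
        while n in parent:
            n = parent[n]
            chain.append(n)
        return chain
    up, anc2 = ancestors(h1), ancestors(h2)
    c = next(n for n in up if n in anc2)            # lowest common node C
    up_labels = [label[n] for n in up[:up.index(c) + 1]]
    down_labels = [label[n] for n in reversed(anc2[:anc2.index(c)])]
    return " v ".join(up_labels + down_labels)

# Toy tree: S dominates NP1 and VP; VP dominates NP2 (node names are made up).
parent = {"NP1": "S", "VP": "S", "NP2": "VP"}
label = {"NP1": "NP", "S": "S", "VP": "VP", "NP2": "NP"}
```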



Example fragment: “Adopted a recommendation to this effect in 1964”

Phrase type: VBD (POS tag of “adopted”), NN (POS tag of “effect”)
Head word: adopt, effect
Voice: Active, Unspecified
Dependency path: (V)obj v(N)mod v(Prep)pcomp-n v(N)
Syntactic cross path: VP v NP v PP(to) NP


B. Probability Prediction

We count the number of relations for each feature vector in the training data and estimate the probability for each relation given a feature vector at testing time.

The conditional probability can be estimated as follows:

P(R|F) = \frac{\#(R,F)}{\#(F)}


  • R: relation label
  • F: feature vector
  • #(R,F): number of relations with label R counted for feature vector F in the training.
  • #(F): number of relations which receive F as a feature vector counted in the training.
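The counting estimate above can be sketched directly, with feature vectors as hashable tuples (names are illustrative):

```python
from collections import Counter

def train(examples):
    """Estimate P(R|F) = #(R,F) / #(F) by counting over
    (feature_vector, relation_label) training pairs."""
    joint, marginal = Counter(), Counter()
    for f, r in examples:
        joint[(r, f)] += 1
        marginal[f] += 1
    def prob(r, f):
        return joint[(r, f)] / marginal[f] if marginal[f] else 0.0
    return prob

# Three training relations share the same (toy) feature vector.
p = train([(("VBD", "NN"), "obj"), (("VBD", "NN"), "obj"), (("VBD", "NN"), "tim")])
# p("obj", ("VBD", "NN")) == 2/3
```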

Some feature vectors seen at testing time never occur in the training data. To estimate the probability in such cases, we relax the condition in the probability formula by dividing the general feature vector into smaller vectors. In other words, we reduce the dimension of the feature vector and estimate the conditional probability using the linear interpolation method proposed in [1].


  • Input: a training set including full information of relations
  • Output: estimated conditional probabilities P1, …, P8
  1. For each training fragment S with source phrase, destination phrase, and relation label R, do:
    1. Receive syntactic parse tree T from Charniak parser and dependency tree D from Minipar parser from the fragment S.
    2. Based on the trees T and D, extract the features for the two phrases: pt1, pt2, hw1, hw2, voice1, voice2,depPath, SynCrossPath
    3. Increase the following counters by 1:
      1. c1(R, pt1), c2(R, pt2), …, c8(R, pt1, pt2, hw1, hw2, voice1, voice2, depPath, synCrossPath)
      2. t1(pt1), t2(pt2), …, t8(pt1, pt2, hw1, hw2, voice1, voice2, depPath, synCrossPath)
        where ci(R,Fi) = #(R,Fi) and ti(Fi)=#(Fi)
  2. calculate all the conditional probabilities
    P_1(R|pt_1)=\frac{c_1(R,pt_1)}{t_1(pt_1)}, ...,
    P_8(R|pt_1, pt_2, hw_1, hw_2, voice_1, voice_2, depPath, synCrossPath) = \frac{c_8(R, pt_1, pt_2, hw_1, hw_2, voice_1, voice_2, depPath, synCrossPath)}{t_8(pt_1, pt_2, hw_1, hw_2, voice_1, voice_2, depPath, synCrossPath)}


  • Input: a fragment S; source phrase and destination phrase; all the conditional probabilities P1, …, P8
  • Output: relation label R
  1. Receive syntactic parse tree T from the Charniak parser and dependency tree D from the Minipar parser for the fragment S.
  2. Based on the trees T and D, extract the features for the two phrases: pt1, pt2, hw1, hw2, voice1, voice2, depPath, synCrossPath
  3. Estimate probabilities for  all relation labels by using linear interpolation as listed above.
  4. Choose the relation label R for the two phrases:
    R = \arg\displaystyle\max_{r} P(r \mid \text{the two phrases})

a novel idea on relation extraction

  • August 6, 2010 3:22 pm

treat the relation as a label and then use an SVM to train a classifier


  • August 6, 2010 12:06 pm
  • make the reddim a shared library
  • find more mapping functions; these libraries don’t look good. Search keyword: dimensional reduction
  • label the points

Note of Extracting relations from Text: From Word Sequences to dependency Paths

  • August 3, 2010 9:00 pm

Extracting relations from Text: From Word Sequences to dependency Paths


  1. each potential relation is represented implicitly as a vector of features, where each feature corresponds to a word sequence anchored at the two entities forming the relationship. A relation extraction system is trained based on the subsequence kernel from [2]
  2. the representation is centered on the shortest dependency path between the two entities in the dependency graph of the sentence.

Entity recognition, a prerequisite for relation extraction, is usually cast as a sequence tagging problem, in which words are tagged as being outside any entity, or inside a particular type of entity.

Most approaches to entity tagging are therefore based on probabilistic models for labeling sequences, such as Hidden Markov Models [4], Maximum Entropy Markov Models [5], or Conditional Random Fields [6], and obtain a reasonably high accuracy.

3. A Dependency-Path Kernel for Relation Extraction

In the ACE (Automated Content Extraction) newspaper corpus [12], sentences are represented as dependency graphs in which words are nodes and word-word dependencies are directed edges. A subset of these word-word dependencies capture the predicate-argument relations present in the sentence.

  • Arguments are connected to their target predicates
    • either directly through an arc pointing to the predicate, (troops→raided )
    • or indirectly through a preposition or infinitive particle.  (warning←to←stop)
  • Other types of word-word dependencies account for modifier-head relationships present in
    • adjective-noun compounds, (several→stations)
    • noun-noun compounds, (pumping→stations)
    • or adverb-verb constructions.(recently→raided)

Relation Instance | Shortest Path in Undirected Dependency Graph
S1: protesters AT stations | protesters→seized←stations
S1: workers AT stations | workers→holding←protesters→seized←stations
S2: troops AT churches | troops→raided←churches
S2: ministers AT churches | ministers→warning←troops→raided←churches

Word-word dependencies are typically categorized in two classes as follows:

  • [Local Dependencies] correspond to local predicate-argument constructions such as ‘troops→raided’, or ‘pumping → stations’
  • [Non-local Dependencies] Long-distance dependencies arise due to various linguistic constructions such as coordination, extraction, raising and control.


  • Context-Free Grammar (CFG) parser: extracts local dependencies, which for each sentence form a dependency tree.
  • Combinatory Categorial Grammar (CCG) [13]: extracts both local and long-distance dependencies.

3.1 The shortest path Hypothesis

If e1 and e2 are two entities mentioned in the same sentence such that they are observed to be in a relationship R, then the contribution of the sentence dependency graph to establishing the relationship R(e1,e2) is almost exclusively concentrated in the shortest path between e1 and e2 in the undirected version of the dependency graph.

  1. If entities e1 and e2 are arguments of the same predicate, then the shortest path between them will pass through the predicate, which may be connected directly to the two entities, or indirectly through prepositions.
  2. If e1 and e2 belong to different predicate-argument structures that share a common argument, then the shortest path will pass through this argument.

Both paths support the LOCATED relationship.

  1. if a PERSON entity is doing some action to a FACILITY entity, then the PERSON entity is LOCATED at that FACILITY entity.
  2. the same PERSON entity is doing two actions, one action to a PERSON entity, and the other action to a FACILITY entity.

In the case where e1 and e2 belong to predicate-argument structures that have no argument in common, we shall find a shortest sequence of predicate-argument structures with target predicates P1, P2, …, Pn such that e1 is an argument of P1, e2 is an argument of Pn, and any two consecutive predicates Pi and Pi+1 share a common argument.

3.2 Learning with Dependency paths

The set of features can then be defined as a Cartesian product over words and word classes. In this representation, sparse or contiguous subsequences of nodes along the lexicalized dependency path are included as features simply by replacing the rest of the nodes with their corresponding generalizations.

For verbs and nouns occurring along a dependency path we also use an additional suffix (“-”) to indicate a negative polarity item.

\begin{bmatrix}protesters\\NNS\\Noun\\PERSON\end{bmatrix}\times [\rightarrow] \times \begin{bmatrix}seized\\VBD\\Verb \end{bmatrix} \times [\leftarrow] \times \begin{bmatrix}stations\\NNS\\Noun\\FACILITY \end{bmatrix}

Feature generation from dependency path
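The bracketed expression above expands into the Cartesian product over each node's word-class set; a sketch (ASCII "->"/"<-" stand in for the dependency arrows):

```python
from itertools import product

def path_features(path):
    """Enumerate dependency-path features as the Cartesian product over
    each node's set of word classes."""
    return [" ".join(combo) for combo in product(*path)]

path = [["protesters", "NNS", "Noun", "PERSON"], ["->"],
        ["seized", "VBD", "Verb"], ["<-"],
        ["stations", "NNS", "Noun", "FACILITY"]]
feats = path_features(path)
# 4 * 1 * 3 * 1 * 4 = 48 features, e.g. "PERSON -> Verb <- FACILITY"
```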

We use kernel SVMs in order to avoid working explicitly with high-dimensional dependency path feature vectors. Computing the dot product (i.e., the kernel) between two relation examples amounts to calculating the number of common features (paths) between the two examples. If x = x1x2…xm and y = y1y2…yn are two relation examples, where xi denotes the set of word classes corresponding to position i, then the number of common features between x and y is computed as:

K(x,y)= 1(m=n) \cdot \displaystyle\prod_{i=1}^{n}c(x_i,y_i)

where c(xi,yi) = |xi ∩ yi| is the number of common word classes between xi and yi.

1. ‘his actions in Brcko‘ (his → actions ← in ← Brcko)
2. ‘his arrival in Beijing‘ (his → arrival ← in ← Beijing)

1. x = [x1 x2 x3 x4 x5 x6 x7], where x1 = {his,PRP, PERSON}, x2={→}, x3 = {actions,NNS,Noun}, x4={←},x5={in,IN},x6={←},x7={Brcko,NNP,Noun,LOCATION}
2. y = [y1 y2 y3 y4 y5 y6 y7], where y1= {his, PRP, PERSON},y2={→}, y3 = {arrival,NN,Noun}, y4={←},y5={in,IN},y6={←},y7={Beijing,NNP,Noun,LOCATION}

Based on the formula from Equation 4, the kernel is computed as K(x,y) = 3×1×1×1×2×1×3 = 18.
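The computation can be checked mechanically: a product of per-position set intersections, with zero for paths of different length.

```python
def kernel(x, y):
    """Equation 4: K(x,y) = 1(m=n) * prod_i |x_i ∩ y_i|, where each x_i is
    the set of word classes at position i of the dependency path."""
    if len(x) != len(y):
        return 0
    k = 1
    for xi, yi in zip(x, y):
        k *= len(xi & yi)   # common word classes at this position
    return k

x = [{"his", "PRP", "PERSON"}, {"->"}, {"actions", "NNS", "Noun"}, {"<-"},
     {"in", "IN"}, {"<-"}, {"Brcko", "NNP", "Noun", "LOCATION"}]
y = [{"his", "PRP", "PERSON"}, {"->"}, {"arrival", "NN", "Noun"}, {"<-"},
     {"in", "IN"}, {"<-"}, {"Beijing", "NNP", "Noun", "LOCATION"}]
# kernel(x, y) == 3*1*1*1*2*1*3 == 18, matching the worked example.
```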

4. Experimental Evaluation

We assume that the entities and their labels are known. All preprocessing steps (sentence segmentation, tokenization, POS tagging and chunking) were performed using the OpenNLP package. If a sentence contains n entities (n≥2), it is replicated into \dbinom{n}{2} sentences, each containing only two entities. If the two entities are known to be in a relationship, then the replicated sentence is added to the set of corresponding positive sentences; otherwise it is added to the set of negative sentences. During testing, a sentence having n entities (n≥2) is again replicated into \dbinom{n}{2} sentences in a similar way.
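The replication step is a straightforward use of 2-combinations (the positive pairs below are illustrative, taken from the S1 example earlier):

```python
from itertools import combinations

def replicate(sentence_id, entities, positive_pairs):
    """Split a sentence with n entities into C(n, 2) two-entity examples,
    positive when the pair is known to be related."""
    pos, neg = [], []
    for pair in combinations(entities, 2):
        (pos if set(pair) in positive_pairs else neg).append((sentence_id, pair))
    return pos, neg

# 3 entities -> C(3, 2) = 3 examples: 2 positive, 1 negative.
pos, neg = replicate("S1", ["protesters", "workers", "stations"],
                     [{"protesters", "stations"}, {"workers", "stations"}])
```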

The dependency graph that is input to the shortest path dependency kernel is obtained from two different parsers:

  • The CCG parser introduced in [14] outputs a list of functor-argument dependencies, from which head-modifier dependencies are obtained using a straightforward procedure [15].
  • Head-modifier dependencies can be easily extracted from the full parse output of Collins’s CFG parser [16], in which every non-terminal node is annotated with head information.

We modified the LibSVM package by plugging in the kernels described above. The factor λ in the subsequence kernel is set to 0.75. Performance is measured using precision, recall and F-measure (the harmonic mean of precision and recall).

[2] Text Classification Using String Kernels.

[4] A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.

[5] Maximum Entropy Markov Models for Information Extraction and Segmentation.

[6] Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.

[12] Automatic Content Extraction (ACE).

[13] M. Steedman, The Syntactic Process, MIT Press, Cambridge, MA, 2000.

[14] Generative Models for Statistical Parsing with Combinatory Categorial Grammar.

[15] A Shortest Path Dependency Kernel for Relation Extraction.

[16] Three Generative, Lexicalised Models for Statistical Parsing.

Notes on Exploiting Syntactic and Semantic Information

  • August 2, 2010 8:57 pm

Paper: Exploiting Syntactic and Semantic Information for Relation Extraction from Wikipedia

This paper has two versions; the notes here are taken from the first version.

The Semantic Web is based on RDF[3], a representation language using Notation 3 or N3[4]. We follow the formalism of Semantic Web, specifically N3, in which we structure Wikipedia’s content as a collection of statements. Each statement consists of a subject, a predicate and an object. … The statements with the use of a domain-specific ontology can then be straightforwardly transformed into RDF format that in turn serves as machine-processable knowledge base.

Our method, unlike other works, mines the key patterns from syntactic and semantic structure to measure similarity between entity pairs, rather than using only lexical information as in [5-8] or hard matching of dependency paths as in [9].

In detail, we attempt to integrate the syntactic and semantic information of text into a unified structure. We then decompose the structure into subsequences and mine the frequent ones, aiming to capture the key patterns for each relationship.

2. Problem Statement

We aim at extracting binary relations between entities from articles in the English version of Wikipedia. A 2-tuple (ep, es) and a triple (ep, rel, es) denote an entity pair and a binary relation respectively,
where ep and es are entities, which may be PERSON, ORGANIZATION, LOCATION, TIME or ARTIFACT, and
rel denotes the directed relationship between ep and es, which may be one of the following 13 relations: CEO, FOUNDER, CHAIRMAN, COO, PRESIDENT, DIRECTOR, VICE CHAIRMAN, SPOUSE, BIRTH DATE, BIRTH PLACE, FOUNDATION, PRODUCT and LOCATION.

We follow [5] in defining the entity mainly discussed in an article as the principal entity, and the other entities mentioned in the same article as secondary entities. We assume that the entities of interest have a descriptive article in Wikipedia, so no entity disambiguation or entity recognition is required in our system. The identifier of an entity is defined as the URL of its article.

Our system predicts only the relations between the principal entity and each mentioned secondary entity in an article. As a further assumption, the relationship between an entity pair can be completely expressed in one sentence.

Thus, for an article, only the sentences that contain the principal entity and a secondary entity need to be analyzed.

4. Extract Relations from Wikipedia

4.1 Relation Extraction Framework

  1. Articles are processed to remove HTML tags and extract the hyperlinks that point to other Wikipedia articles, then passed to the pre-processors: Sentence Splitter, Tokenizer and Phrase Chunker.
  2. In parallel, articles are processed to anchor all occurrences of the principal entity and the secondary entities. The Secondary Entity Detector simply labels the appropriate surface text of the hyperlinks as secondary entities.
  3. The Sentence Selector chooses only sentences that contain the principal entity and at least one secondary entity. Each such pair becomes a relation candidate.
  4. The Trainer receives articles with HTML tags to identify summary sections and extract ground-truth relations annotated by human editors.
  5. Previously selected sentences that contain entity pairs from ground-truth relations are identified as training data.
  6. The Trainer learns the key patterns for each relation.
  7. During testing, for each sentence and an entity pair in it, the Relation Extractor identifies the descriptive label and outputs the final results.

Principal Entity Detector

  • Most of the pronouns in an article refer to the principal entity.
  • The first sentence of the article is often used to briefly define the principal entity.

We use rules to identify a set of referents to the principal entity, of three types [10]:

  • pronouns (“he”, “him”, “they”, …)
  • proper nouns (e.g., Bill Gates, William Henry Gates, Microsoft, …)
  • common nouns (the company, the software, …)

Supported by the nature of Wikipedia, our technique performs better than the co-reference tools in the LingPipe library and the OpenNLP tool set. All occurrences of the collected referents are labeled as the principal entity.

Training Data Builder

For each entity pair, the builder examines whether the pair is in the ground-truth relation set. If so, it attaches the relation label to the pair and creates a new training sentence for the relation. For a relation r, the purpose of building training data is to collect the sentences that exactly express r. To reduce noise in the training data, it is necessary to eliminate from the ground-truth set the pairs that hold more than one relation.

4.2 Learning Patterns with Dependency Path

In this section, we explain our first method for extracting relations using syntactic information.

Following the idea in [9], we assume that the shortest dependency path from a principal entity through the dependency tree to a secondary entity gives a concrete syntactic structure expressing the relation between the pair.

Key patterns are learned from the dependency paths for each relationship:

  1. Derive dependency trees of the training sentences with the Minipar parser and extract the paths between entity pairs.
  2. Transform the paths into sequences, which are in turn decomposed into subsequences.
  3. From the subsequence collection of a relation r, identify the frequent subsequences for r.
  4. During testing, the dependency path between an entity pair in a novel sentence is also converted into a sequence and matched against the previously mined subsequences.

Sequential Representation of Dependency Path

A word together with its Part-Of-Speech tag will be an element of the sequence.


Learning Key Patterns as Mining Frequent Sequence

PrefixSpan, introduced in [12], is an efficient method for mining sequential patterns. A sequence s = s1s2…sn, where each si is an itemset, is called a subsequence of a sequence p = p1p2…pm if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that s1 ⊆ pj1, …, sn ⊆ pjn.
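The subsequence relation just defined can be checked with a greedy scan; a minimal sketch, representing itemsets as Python sets (the function name is ours, not PrefixSpan's):

```python
def is_subsequence(s, p):
    """True if s = s1...sn is a subsequence of p = p1...pm, i.e. there
    exist indices 1 <= j1 < j2 < ... < jn <= m with each s_i ⊆ p_{j_i}."""
    j = 0
    for si in s:
        # greedily advance to the next itemset of p that contains si
        while j < len(p) and not si <= p[j]:
            j += 1
        if j == len(p):
            return False
        j += 1
    return True

print(is_subsequence([{"a"}, {"c"}], [{"a", "b"}, {"b"}, {"c", "d"}]))  # → True
print(is_subsequence([{"c"}, {"a"}], [{"a", "b"}, {"b"}, {"c", "d"}]))  # → False
```

Greedy matching is sufficient here because taking the earliest possible position for each s_i never rules out a valid placement of the remaining itemsets.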

In this research, we use the implementation of PrefixSpan developed by Taku Kudo.

From here on, sequence database denotes the set of sequences converted from the dependency paths with respect to a relation.

Weighting The Patterns

Each mined pattern must be assigned a weight with respect to a relation to estimate its relevance. Factors:

  • Length of the pattern: if two paths share a long common subpattern, it is more likely that the paths express the same relationship.
  • Support of the pattern: the number of sequences that contain the pattern. A pattern with high support is more likely to be a key pattern.
  • Amount of lexical information: although the sequences contain both words and dependency relations from the original dependency path, we found that word-based items are more important.
  • Number of sequence databases in which the pattern appears: if the pattern can be found in various sequence databases, it is more likely that the pattern is common and should not be a key pattern of any relation.

w_r(p) = \mathit{irf}(p) \times \dfrac{\mathit{support}_{D_r}(p)}{|D_r|} \times l(p) \times e^{\mathit{lex}(p)}
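A direct transcription of the weighting formula as a sketch; irf(p), the support, the pattern length l(p), the lexical-item count lex(p) and the database size |Dr| are all assumed to be computed elsewhere, and the numbers below are made up for illustration:

```python
import math

# w_r(p) = irf(p) * support_{D_r}(p) / |D_r| * l(p) * e^{lex(p)}
def pattern_weight(irf, support, length, lex, db_size):
    return irf * (support / db_size) * length * math.exp(lex)

# e.g. a length-4 pattern with 2 word-based items, seen in 5 of 40 sequences
w = pattern_weight(irf=1.2, support=5, length=4, lex=2, db_size=40)
```

The exponential on lex(p) reflects the third factor above: patterns carrying more word-based items are rewarded much more strongly than merely longer ones.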

Relation Selection

Given a novel sentence and the anchors of an entity pair in it, we will predict the appropriate relation of the pair. We extract the dependency path P, transform P into sequential pattern and then accumulate the scores of its subsequences for each relation r:

L_r(P)= \displaystyle\sum_{p \in S(P)} w_r(p)

  • Lr(P) likelihood score to say that P expresses relation r
  • S(P) set of all subsequences of the sequential representation of P
  • The appropriate relation should be the one giving highest score to P:

R= \arg\displaystyle\max_r L_r(P)
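The selection step above can be sketched end to end; the pattern weights, the sequence encoding (tuples of frozensets) and all names here are illustrative placeholders, and the naive enumeration of S(P) is only workable for short paths:

```python
from itertools import combinations

def all_subsequences(seq):
    """Enumerate S(P): every non-empty subsequence of the sequence."""
    for k in range(1, len(seq) + 1):
        for idx in combinations(range(len(seq)), k):
            yield tuple(seq[i] for i in idx)

def select_relation(P, weights):
    """Accumulate L_r(P) = sum of w_r(p) over p in S(P), then argmax over r."""
    scores = {r: sum(w.get(s, 0.0) for s in all_subsequences(P))
              for r, w in weights.items()}
    return max(scores, key=scores.get)

P = [frozenset({"Gates", "NNP"}), frozenset({"founded", "VBD"})]
weights = {"FOUNDER": {(frozenset({"founded", "VBD"}),): 2.0},
           "CEO":     {(frozenset({"Gates", "NNP"}),): 0.5}}
print(select_relation(P, weights))  # → FOUNDER
```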

4.3 Learning Patterns with Dependency Path and Semantic Role

We use the SNoW-based Semantic Role Labeler [16], a state-of-the-art SRL system that conforms to the definitions of PropBank and the CoNLL-2005 shared task on SRL. Since the SRL task only labels roles on constituents or phrases, without indicating which primitive concept plays the role, we still use dependency parsing information to further analyze the phrases.

[5] Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text.

[9] Extracting Relations from Text.

[10] Coreference for NLP Applications.

[12] Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach.