Nanos gigantium humeris insidentes!

Notes on Extracting Relations from Text: From Word Sequences to Dependency Paths

  • August 3, 2010 9:00 pm



  1. Each potential relation is represented implicitly as a vector of features, where each feature corresponds to a word sequence anchored at the two entities forming the relationship. A relation extraction system is trained based on the subsequence kernel from [2].
  2. The representation is centered on the shortest dependency path between the two entities in the dependency graph of the sentence.

Entity recognition, a prerequisite for relation extraction, is usually cast as a sequence tagging problem, in which words are tagged as being either outside any entity, or inside a particular type of entity.

Most approaches to entity tagging are therefore based on probabilistic models for labeling sequences, such as Hidden Markov Models [4], Maximum Entropy Markov Models [5], or Conditional Random Fields [6], and obtain reasonably high accuracy.
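The "outside / inside an entity" tagging scheme described above can be made concrete with a minimal sketch using the common B/I/O label convention (the toy sentence, labels, and helper function are illustrative assumptions, not taken from the corpus):

```python
# Minimal sketch of entity tagging as sequence labeling, using the common
# B/I/O convention: B-X begins an entity of type X, I-X continues it,
# and O marks tokens outside any entity.
tokens = ["protesters", "seized", "several", "pumping", "stations"]
tags   = ["B-PERSON",   "O",      "O",       "B-FACILITY", "I-FACILITY"]

def extract_entities(tokens, tags):
    """Recover (entity_text, entity_type) spans from a B/I/O tagging."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # an "O" tag closes any open entity span
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(tokens, tags))
# [('protesters', 'PERSON'), ('pumping stations', 'FACILITY')]
```

A sequence model (HMM, MEMM, or CRF) would predict the tag sequence; the decoding into spans above is the same regardless of the model.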

3. A Dependency-Path Kernel for Relation Extraction

Sentences from the ACE (Automated Content Extraction) newspaper corpus [12] are represented as dependency graphs, in which words are represented as nodes and word-word dependencies are represented as directed edges. A subset of these word-word dependencies captures the predicate-argument relations present in the sentence.

  • Arguments are connected to their target predicates
    • either directly through an arc pointing to the predicate (troops → raided),
    • or indirectly through a preposition or infinitive particle (warning ← to ← stop).
  • Other types of word-word dependencies account for modifier-head relationships present in
    • adjective-noun compounds (several → stations),
    • noun-noun compounds (pumping → stations),
    • or adverb-verb constructions (recently → raided).

Relation Instance              Shortest Path in Undirected Dependency Graph
S1: protesters AT stations     protesters → seized ← stations
S1: workers AT stations        workers → holding ← protesters → seized ← stations
S2: troops AT churches         troops → raided ← churches
S2: ministers AT churches      ministers → warning ← troops → raided ← churches
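The paths in the table above can be recovered with a plain breadth-first search over the undirected dependency graph. The edge list below is a hand-written approximation of the dependencies for sentence S1 (an assumption for illustration), with ASCII node names standing in for the → / ← notation:

```python
from collections import deque

# Assumed dependency edges for sentence S1; each pair is a head-modifier
# or predicate-argument dependency, treated here as undirected.
edges = [("protesters", "seized"), ("stations", "seized"),
         ("holding", "protesters"), ("workers", "holding"),
         ("pumping", "stations"), ("several", "stations")]

graph = {}
for a, b in edges:
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)

def shortest_path(graph, start, goal):
    """Breadth-first search for a shortest undirected path."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(graph, "workers", "stations"))
# ['workers', 'holding', 'protesters', 'seized', 'stations']
```

This reproduces both S1 rows of the table: the direct predicate path for protesters, and the longer path through the shared argument for workers.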

Word-word dependencies are typically categorized in two classes as follows:

  • [Local Dependencies] correspond to local predicate-argument constructions such as ‘troops → raided’ or ‘pumping → stations’.
  • [Non-local Dependencies] Long-distance dependencies arise due to various linguistic constructions such as coordination, extraction, raising and control.


  • Context Free Grammar (CFG) parser: used to extract local dependencies, which form a dependency tree for each sentence.
  • Combinatory Categorial Grammar (CCG) parser [13]: used to extract both local and long-distance dependencies.

3.1 The Shortest Path Hypothesis

If e1 and e2 are two entities mentioned in the same sentence such that they are observed to be in a relationship R, then the contribution of the sentence dependency graph to establishing the relationship R(e1, e2) is almost exclusively concentrated in the shortest path between e1 and e2 in the undirected version of the dependency graph.

  1. If entities e1 and e2 are arguments of the same predicate, then the shortest path between them will pass through the predicate, which may be connected directly to the two entities, or indirectly through prepositions.
  2. If e1 and e2 belong to different predicate-argument structures that share a common argument, then the shortest path will pass through this argument.

Both paths support the LOCATED relationship.

  1. If a PERSON entity is doing some action to a FACILITY entity, then the PERSON entity is LOCATED at that FACILITY entity.
  2. The same PERSON entity is doing two actions, one action to a PERSON entity, and the other action to a FACILITY entity.

In the case where e1 and e2 belong to predicate-argument structures that have no argument in common, we can find a shortest sequence of predicate-argument structures with target predicates P1, P2, …, Pn such that e1 is an argument of P1, e2 is an argument of Pn, and any two consecutive predicates Pi and Pi+1 share a common argument.

3.2 Learning with Dependency Paths

The set of features can then be defined as a Cartesian product over words and word classes. In this representation, sparse or contiguous subsequences of nodes along the lexicalized dependency path are included as features simply by replacing the rest of the nodes with their corresponding generalizations.

For verbs and nouns occurring along a dependency path, we also use an additional suffix (“-”) to indicate a negative polarity item.

\begin{bmatrix}protesters\\NNS\\Noun\\PERSON\end{bmatrix}\times [\rightarrow] \times \begin{bmatrix}seized\\VBD\\Verb \end{bmatrix} \times [\leftarrow] \times \begin{bmatrix}stations\\NNS\\Noun\\FACILITY \end{bmatrix}

Feature generation from dependency path
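The Cartesian product in the expression above can be enumerated directly: picking one word class per position along the path yields the full feature set. A small sketch for the ‘protesters AT stations’ path, with the position sets copied from the bracketed matrix above and ASCII arrows standing in for → / ←:

```python
from itertools import product

# Word-class sets at each position of the path protesters -> seized <- stations,
# taken from the feature matrix above (lists keep the order deterministic).
path = [
    ["protesters", "NNS", "Noun", "PERSON"],
    ["->"],
    ["seized", "VBD", "Verb"],
    ["<-"],
    ["stations", "NNS", "Noun", "FACILITY"],
]

# Every combination of one class per position is a distinct feature.
features = [" ".join(combo) for combo in product(*path)]
print(len(features))  # 4 * 1 * 3 * 1 * 4 = 48 features
```

Generalizations such as “PERSON -> Verb <- FACILITY” thus appear as features alongside the fully lexicalized path, without any explicit subsequence enumeration.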

We use kernel SVMs in order to avoid working explicitly with high-dimensional dependency path feature vectors. Computing the dot product (i.e., the kernel) between two relation examples amounts to calculating the number of common features (paths) between the two examples. If x = x1x2…xm and y = y1y2…yn are two relation examples, where xi denotes the set of word classes corresponding to position i, then the number of common features between x and y is computed as:

K(x,y)= 1(m=n) \cdot \displaystyle\prod_{i=1}^{n}c(x_i,y_i)

where c(xi,yi) = |xi ∩ yi| is the number of common word classes between xi and yi.

1. ‘his actions in Brcko’ (his → actions ← in ← Brcko)
2. ‘his arrival in Beijing’ (his → arrival ← in ← Beijing)

1. x = [x1 x2 x3 x4 x5 x6 x7], where x1 = {his, PRP, PERSON}, x2 = {→}, x3 = {actions, NNS, Noun}, x4 = {←}, x5 = {in, IN}, x6 = {←}, x7 = {Brcko, NNP, Noun, LOCATION}
2. y = [y1 y2 y3 y4 y5 y6 y7], where y1 = {his, PRP, PERSON}, y2 = {→}, y3 = {arrival, NN, Noun}, y4 = {←}, y5 = {in, IN}, y6 = {←}, y7 = {Beijing, NNP, Noun, LOCATION}

Based on the formula above, the kernel is computed as K(x, y) = 3 × 1 × 1 × 1 × 2 × 1 × 3 = 18.
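The value 18 can be checked directly from the kernel definition; the word-class sets below are copied from the two examples above, with "->" and "<-" standing in for the arrow symbols:

```python
def path_kernel(x, y):
    """Kernel from the formula above: the product of per-position
    overlaps c(xi, yi) = |xi ∩ yi|, and zero when the two paths
    have different lengths (the 1(m = n) factor)."""
    if len(x) != len(y):
        return 0
    k = 1
    for xi, yi in zip(x, y):
        k *= len(set(xi) & set(yi))
    return k

x = [{"his", "PRP", "PERSON"}, {"->"}, {"actions", "NNS", "Noun"},
     {"<-"}, {"in", "IN"}, {"<-"}, {"Brcko", "NNP", "Noun", "LOCATION"}]
y = [{"his", "PRP", "PERSON"}, {"->"}, {"arrival", "NN", "Noun"},
     {"<-"}, {"in", "IN"}, {"<-"}, {"Beijing", "NNP", "Noun", "LOCATION"}]

print(path_kernel(x, y))  # 3 * 1 * 1 * 1 * 2 * 1 * 3 = 18
```

Note that any single zero overlap (e.g. paths of the same length but incompatible arrow directions) zeroes out the whole kernel.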

4. Experimental Evaluation

We assume that the entities and their labels are known. All preprocessing steps (sentence segmentation, tokenization, POS tagging and chunking) were performed using the OpenNLP package. If a sentence contains n entities (n ≥ 2), it is replicated into \dbinom{n}{2} sentences, each containing only two entities. If the two entities are known to be in a relationship, the replicated sentence is added to the set of positive sentences; otherwise it is added to the set of negative sentences. During testing, a sentence having n entities (n ≥ 2) is again replicated into \dbinom{n}{2} sentences in the same way.
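The replication step is just an enumeration of unordered entity pairs, which can be sketched with itertools.combinations (the entity names are illustrative):

```python
from itertools import combinations

# A sentence with n entities is replicated into C(n, 2) copies,
# one per unordered entity pair.
entities = ["protesters", "workers", "stations"]
pairs = list(combinations(entities, 2))
print(pairs)
# [('protesters', 'workers'), ('protesters', 'stations'), ('workers', 'stations')]
```

Each pair then becomes one positive or negative training example depending on whether the two entities are annotated as related.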

The dependency graph that is input to the shortest path dependency kernel is obtained from two different parsers:

  • The CCG parser introduced in [14] outputs a list of functor-argument dependencies, from which head-modifier dependencies are obtained using a straightforward procedure [15].
  • Head-modifier dependencies can be easily extracted from the full parse output of Collins’s CFG parser [16], in which every non-terminal node is annotated with head information.

We modified the LibSVM package by plugging in the kernels described above. The factor λ in the subsequence kernel is set to 0.75. Performance is measured using precision, recall, and F-measure (the harmonic mean of precision and recall).
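For reference, the F-measure used here is simply the harmonic mean of the two quantities; a minimal sketch:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.7, 0.7))  # 0.7 when precision and recall agree
```

The harmonic mean penalizes imbalance: a system with perfect precision but zero recall still scores 0.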

[2] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, Journal of Machine Learning Research 2 (2002).

[4] L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77(2) (1989).

[5] A. McCallum, D. Freitag, F. Pereira, Maximum entropy Markov models for information extraction and segmentation, ICML 2000.

[6] J. Lafferty, A. McCallum, F. Pereira, Conditional random fields: Probabilistic models for segmenting and labeling sequence data, ICML 2001.

[12] NIST, ACE – Automated Content Extraction.

[13] M. Steedman, The Syntactic Process, MIT Press, Cambridge, MA, 2000.

[14] J. Hockenmaier, M. Steedman, Generative models for statistical parsing with combinatory categorial grammar, ACL 2002.

[15] R. C. Bunescu, R. J. Mooney, A shortest path dependency kernel for relation extraction, HLT/EMNLP 2005.

[16] M. Collins, Three generative, lexicalised models for statistical parsing, ACL 1997.
