Nanos gigantium humeris insidentes!

Notes on Relation Extraction from Wikipedia Using Subtree Mining

  • August 8, 2010 10:56 am
  • Relation Extraction from Wikipedia Using Subtree Mining. A Keyword Extractor identifies keywords that provide clues for each relation label.
  • An Entity Classifier module classifies the entities into types to limit the available relations for entity pairs. The Relation Extractor extracts a subtree feature from each pair of the principal entity and a mention of a secondary entity.
  • It then combines the subtree feature with the entity type feature into a feature vector and classifies the relations of the entity pairs using SVM-based classifiers.

Principal Entity Detector

We propose a simple but efficient technique to identify a set of referring expressions, denoted as F, which provides better results than those produced by the coreference tools discussed in the paper. Following (Morton 2000), we classify the expressions in F into three types: (1) personal pronouns, (2) proper nouns, and (3) common nouns. Based on chunking information, the technique is as follows:

  1. Start with F = {}
  2. Select the first two chunks for F: the proper chunk of the article title and the first proper chunk in the first sentence of the article, if any. These are the first two names of the principal entity. If F is still empty, stop.
  3. For each remaining proper chunk p in the article, if p is derived from any expression selected in (2), then F ← p. Proper chunk p1 is derived from proper chunk p2 if all of p1's proper nouns appear in p2. These proper chunks are various identifiers of the principal entity.
  4. In the article, select c as the most frequent subjective pronoun, find c’ as its equivalent objective pronoun, and add both to F.
  5. For each chunk p with the pattern [DT N1 … Nk], where DT is a determiner and the Ni are common nouns, if p appears more frequently than all the pronouns selected in (4), then F ← p.
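
The five steps above can be sketched in Python. This is only an illustration under assumptions that are not in the paper: chunking has already been done, proper chunks are tuples of proper nouns, pronouns and [DT N1 … Nk] chunks are given as one string per occurrence, and the `OBJECTIVE` pronoun map and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical subjective -> objective pronoun map (an assumption, not from the paper).
OBJECTIVE = {"he": "him", "she": "her", "it": "it", "they": "them"}


def derived_from(p1, p2):
    """p1 is derived from p2 if every proper noun in p1 also appears in p2."""
    return set(p1) <= set(p2)


def referring_expressions(title_chunk, first_sentence_chunk, proper_chunks,
                          pronouns, common_chunks):
    """Collect the set F of expressions that refer to the principal entity.

    title_chunk, first_sentence_chunk: tuples of proper nouns, or None.
    proper_chunks: the remaining proper chunks in the article.
    pronouns: lowercase pronouns, one entry per occurrence.
    common_chunks: [DT N1 ... Nk] chunks as strings, one entry per occurrence.
    """
    # Steps 1-2: start from the title chunk and the first proper chunk.
    seeds = [c for c in (title_chunk, first_sentence_chunk) if c]
    F = set(seeds)
    if not F:
        return F

    # Step 3: add every proper chunk derived from one of the seeds.
    for p in proper_chunks:
        if any(derived_from(p, s) for s in seeds):
            F.add(p)

    # Step 4: most frequent subjective pronoun c and its objective form c'.
    counts = Counter(pronouns)
    subjective = {p: n for p, n in counts.items() if p in OBJECTIVE}
    threshold = 0
    if subjective:
        c = max(subjective, key=subjective.get)
        c_obj = OBJECTIVE[c]
        F.update({c, c_obj})
        threshold = max(counts[c], counts[c_obj])

    # Step 5: common-noun chunks more frequent than the selected pronouns.
    for chunk, n in Counter(common_chunks).items():
        if n > threshold:
            F.add(chunk)
    return F
```

For an article titled "Albert Einstein", a later chunk ("Einstein",) is added in step 3 because all its proper nouns appear in the title chunk, while ("Max", "Planck") is not.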

Entity Classifier

The entity type is very useful for relation extraction.

We first identify year, month and date by directly examining their surface text. Types of other entities, including principal entities and secondary entities, are identified by classifying their corresponding articles.

We develop an SVM-based classifier for each remaining type using a one-against-all strategy.

We represent an article in the form of a feature vector and use the following features:

  • category feature: categories collected when tracing from the article up to k levels of its category structure
  • pronoun feature: the most frequent subjective pronoun in the article
  • singular noun feature: singular nouns of the first sentence of the article
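
A minimal sketch of how the three features might be turned into a binary vector for the per-type classifiers. The category graph (`parents`), the feature name prefixes, the fixed `vocabulary`, and the choice to count the article's own categories as level 1 are all assumptions made for illustration, not details from the paper.

```python
def categories_up_to(direct_categories, parents, k):
    """Collect categories within k levels; level 1 is the article's own categories."""
    seen = set()
    frontier = set(direct_categories)
    for _ in range(k):
        frontier -= seen          # do not revisit categories (the graph may have cycles)
        if not frontier:
            break
        seen |= frontier
        frontier = {p for c in frontier for p in parents.get(c, ())}
    return seen


def article_features(direct_categories, parents, k, pronoun, singular_nouns):
    """Return the set of active feature names for one article."""
    feats = {f"cat={c}" for c in categories_up_to(direct_categories, parents, k)}
    if pronoun:
        feats.add(f"pron={pronoun}")      # most frequent subjective pronoun
    feats |= {f"noun={n}" for n in singular_nouns}  # nouns of the first sentence
    return feats


def to_vector(feats, vocabulary):
    """Map active feature names onto a fixed, ordered vocabulary as 0/1 values."""
    return [1 if name in feats else 0 for name in vocabulary]
```

Each resulting vector could then be passed to one SVM per entity type, trained with that type as the positive class and all other types as negatives.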

Keyword Extractor

Our hypothesis in this research is that there exist some keywords that provide clues to the relationship between a pair of entities.

  1. We map entities in such relations to those in sentences to collect sample sentences for each relationship.
  2. The TF-IDF model is used to measure the relevance of words on the dependency path between the entity pair to each relationship.
  3. We choose the keywords manually from lists of candidates ranked by relevance score with respect to each relation.
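
The ranking in step 2 could look roughly like the sketch below. It treats all dependency-path words collected for one relation as a single "document" and uses one common TF-IDF variant (relative term frequency times log inverse document frequency). The notes do not say which weighting the paper uses, so the formula and the function name `rank_keywords` are assumptions.

```python
import math
from collections import Counter


def rank_keywords(path_words_by_relation):
    """Rank candidate keywords for each relation by a plain TF-IDF score.

    path_words_by_relation: relation label -> list of words found on the
    dependency paths between entity pairs in that relation's sample sentences.
    """
    n_relations = len(path_words_by_relation)

    # Document frequency: in how many relations does each word occur?
    df = Counter()
    for words in path_words_by_relation.values():
        df.update(set(words))

    ranked = {}
    for relation, words in path_words_by_relation.items():
        tf = Counter(words)
        total = len(words)
        scores = {
            w: (n / total) * math.log(n_relations / df[w])
            for w, n in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked[relation] = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked
```

A word such as "by" that appears on paths for every relation gets a score of zero and sinks to the bottom of each list, which is the behaviour one wants before picking keywords by hand in step 3.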

Subtree Feature from the Dependency Path

To read: Bunescu, "Extracting relations from text: From word sequences to dependency paths."

Supervised Learning for Relation Extraction

We formulate relation classification as a multiclass, multi-label problem in which one SVM-based classifier is dedicated to each relation.

We represent each mention of a secondary entity in a sentence with respect to a relation r as a binary feature vector (values 0 and 1). Feature vectors are created from the type of the principal entity, the type of the secondary entity, and the mined subtrees of the sentence. The number of slots for the subtree feature depends on the relation. The principal entity might be absent from a sentence, and its type is unchanged for all sentences in an article.
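
As an illustration of this layout, the sketch below builds one 0/1 vector per mention and relation. The entity type list, the string encoding of subtrees, and the function name `mention_vector` are hypothetical. The notes only state that the vector combines the two entity types with relation-specific subtree slots.

```python
def mention_vector(principal_type, secondary_type, mined_subtrees,
                   entity_types, relation_subtrees):
    """Build a 0/1 vector: [principal type | secondary type | subtree slots].

    entity_types: ordered list of all entity types (assumed fixed).
    relation_subtrees: ordered list of subtrees mined for relation r, so the
    number of subtree slots differs from relation to relation.
    mined_subtrees: set of subtrees found in this mention's sentence.
    """
    principal = [int(t == principal_type) for t in entity_types]
    secondary = [int(t == secondary_type) for t in entity_types]
    subtrees = [int(s in mined_subtrees) for s in relation_subtrees]
    return principal + secondary + subtrees
```

Because the principal entity's type is fixed per article, the first block of slots is identical for every mention taken from the same article. Only the secondary type and subtree slots vary.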

Morton, T. 2000. Coreference for NLP Applications.
