Brainmaker

Nanos gigantium humeris insidentes!

Note of An Integrated Approach for Relation Extraction

  • August 22, 2010 7:36 pm

From : An Integrated Approach for relation Extraction from wikipedia texts

Abstract

two kinds of patterns:

  1. dependency patterns from dependency analysis of texts in Wikipedia;
  2. surface patterns generated from high redundant information from the web corpus.

1. Introduction

Currently the leading methods in unsupervised information extraction are based on collecting redundancy information from a local corpus or use the Web as corpus.

One of the main challenges and research interest for frequent pattern mining is how to abstract away from different surface realizations of semantic relations to discover discriminative patterns efficiently.

The common process is to generate linguistic patterns based on analysis of the syntactic, dependency or shallow semantic structure of text, then train to identify entity pairs which assume a relationship and classify them into pre-defined relationships. The advantage of these methods is using linguistic technologies to learn semantic information from different surface expressions.

2. Related Work

Discovering relations by clustering pairs of co-occurring entities represented as vectors of context features was introduced by (Hasgawa, et al., 2004)

The field of Unsupervised Relation Identification (URI)- the task of automatically discovering intersting relations between entities in a large text corpora is introduced by Hasegawa, Sekine et al.(2006) Discovering relations by clustering pairs of co-occurring entities represented as vectors of context features.

Dependency patterns have the properties of being more accurate, less spam-prone and much less redundant than surface patterns from the web corpus.

4. Pattern combination approach for relation extraction

4.1 Overview of the approach

Our approach requires a set of wikipedia articles as input; for each article, outputs a list of concept pairs with relation labels.

There are four primary modules of our approach:

  • Text Pre-processor and Relation candidate Generation
    preprocess wikipedia articles to split text and filter sentences. For each article, it collects a set of concept pairs as relation candidates.
  • Web Context Collector
    collect context information from the web corpus. For each relation candidate, it generates a set of ranked relational terms and set of surface patterns.
  • Dependency Pattern Modeling
    generate dependency patterns for each relation candidates from corresponding sentences in Wikipedia articles.
  • Linear Clustering Algorithm
    cluster relation candidates based on their context. 2 sub-modules:


    • local clustering
    • global clustering

4.2 Text Preprocessor and relation candidate generation

  1. consider all hyper-linked concepts in the article as related concepts, which may share a semantic relationship with the entitled concept.
  2. we use a linguistic parser to split article text into sentences, and finally refine sentences which contain one of the references of entitled concept and one of the lined texts for the dependency pattern modeling module.

4.3 Web Context Collection

4.4 Dependency pattern modeling

As

  1. shortest path dependency kernels outperform dependency tree kernels by offering a very condensed representation of the information needed to assess their relationship (Bunescu and Mooney, 2005)
  2. embedded structures of the linguistic representation are important for obtaining a good coverage of the pattern acquisition (Culotta and Sorensen, 2004; Zhang et al., 2006)

We generate dependency patterns as sub-paths from the shortest dependency path for each concept pair. The process of inducing dependency patterns is in two steps:

  1. shortest dependency tree inducing. From the dependency tree structure, we firstly induce the shortest dependency path with the entitled concept and related concept.
  2. Dependency pattern generation. Then we use a frequent tree-mining algorithm(Zaki et al.2002) to generate dependency patterns from the shortest dependency path for relation clustering.
Print Friendly