Information extraction

Introduction
Modern information extraction is generally credited to MUC, which was established by DARPA. The field has evolved considerably, and present-day evaluations are done within the context of the ACE (Automatic Content Extraction) evaluation program run by NIST.

(a) Information Extraction, Andrew McCallum, University of Massachusetts, ACM Queue November 2005. A non-technical overview of information extraction that presents a five-step high-level pipeline:

1. Segmentation: essentially tokenization of text
2. Classification: classify each segmented piece as one of several classes (Person, Organization, etc.)
3. Association: essentially relationship detection
4. Normalization: different surface forms are normalized to a canonical one (e.g., 3-3:30 and 1500-1530 possibly mapped to the same ISO-standard representation)
5. Deduplication: essentially coreference resolution
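The first and fourth steps are easy to make concrete. Below is a minimal sketch (my own toy code, not McCallum's) of segmentation plus a normalization pass that maps clock times like "3:30" to a zero-padded HH:MM form; real normalizers of course handle far more surface variation.

```python
import re

def segment(text):
    # Step 1 (segmentation): keep clock times whole, then words, then punctuation
    return re.findall(r"\d{1,2}:\d{2}|\w+|[^\w\s]", text)

def normalize_time(tok):
    # Step 4 (normalization): canonicalize "3:30" -> "03:30"; pass other tokens through
    m = re.fullmatch(r"(\d{1,2}):(\d{2})", tok)
    if not m:
        return tok
    return f"{int(m.group(1)):02d}:{m.group(2)}"

tokens = [normalize_time(t) for t in segment("Meeting moved to 3:30.")]
print(tokens)  # -> ['Meeting', 'moved', 'to', '03:30', '.']
```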

Talks about the different higher-level approaches to information extraction and where each applies:
1. Simple regular expressions: for simple extraction tasks
2. Rules: for more complex extraction tasks whose semantics are still clearly definable
3. Machine learning algorithms: for very complex tasks whose rules are too subtle to write by hand
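The first approach is worth a concrete illustration: for well-delimited targets such as email addresses, a single regular expression is often all the "extractor" you need. The pattern below is a deliberately simple sketch, not an RFC-compliant address grammar.

```python
import re

# Deliberately simple email pattern -- fine for a demo, not RFC-compliant
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

text = "Contact alice@example.com or bob.smith@dept.example.org for details."
print(EMAIL.findall(text))
# -> ['alice@example.com', 'bob.smith@dept.example.org']
```

Once the target loses this kind of rigid surface structure (person names, events), regex recall collapses, which is exactly the point where the rule-based and machine-learning approaches take over.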

Uncertainty is an integral part of information extraction and needs to be managed appropriately. Training also needs to become easier (producing large numbers of labeled examples is not easy), which motivates semi-supervised methods and interactive extraction.

Note Very lightweight tutorial and a good light reading.

(b) Automatic Information Extraction, Hamish Cunningham, University of Sheffield. An extensive overview of different IE tasks along with nice examples. Starts with the claim that IE tasks face a specificity-complexity tradeoff; i.e., the more complex the IE task, the more specific the domain from which the information is extracted must be. Several applications are listed, such as marketing, PR, and media analysis. IE tasks are broadly divided into 5 categories:
1. Named Entity
2. Coreference Resolution
3. Template Element Construction: constructs templates by adding descriptions to extracted information (primarily using coreference resolution)
4. Template Relation Construction: essentially relationship identification
5. Scenario Extraction: ties together the elements and relations into a single complex event, such as "Person A was replaced by Person B on Date C at Organization Y"
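The output of scenario extraction is essentially a filled template. As a hypothetical illustration (the field names and event type below are my own, not from the tutorial), the succession example above might come out as:

```python
# Hypothetical filled template for the succession scenario above;
# slot names and the event-type string are illustrative, not from the tutorial.
event = {
    "type": "MANAGEMENT_SUCCESSION",
    "outgoing": "Person A",
    "incoming": "Person B",
    "date": "Date C",
    "organization": "Organization Y",
}
print(event["type"])
```

Tasks 1-4 each contribute slots: named-entity recognition finds the fillers, coreference and template-element construction describe them, and relation construction links them before the scenario ties everything into one event.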

Information extraction (in its modern form) pretty much started with MUC (Message Understanding Conference), and the aforementioned tasks were the basis for that conference. The newer information extraction evaluation is known as ACE (Automatic Content Extraction) and is significantly harder, with tasks 1 & 2 merged into a single task, 3 & 4 merged into another, and 5 kept as a separate task.

Note Fairly vanilla tutorial that has basically followed the MUC and ACE tasks in most of the description.

(c) Introduction to Information Extraction, Doug Appelt and David Israel, SRI, Tutorial at IJCAI 1999. A very detailed introduction to information extraction from a rule-based linguist's perspective. After some introduction, the tutorial covers the two main approaches to building extraction systems, (a) the knowledge engineering approach and (b) automatically trainable systems, and discusses several examples of the pros and cons of each. The components of an information extraction system are described as:
1. Tokenization: straightforward.
2. Morphological processing: (a) identifying inflectional variants; (b) lexical lookup of tokens; (c) part-of-speech tagging; (d) names and structured items: identification of structured items such as dates, times, telephone numbers, and proper names. There is a somewhat detailed discussion of both knowledge-based and machine learning approaches to named-entity extraction, a generic recipe for building rule-based named-entity recognizers, a discussion of trainable named-entity taggers using HMMs, and pointers to several tools for building named-entity taggers.
3. Syntactic analysis: shallow parsing and full parsing; both knowledge-based and trainable parsers are discussed.
4. Domain analysis: (a) coreference analysis, with a detailed description of a coreference algorithm; (b) merging of partial results.

Note Incomplete. This is a very comprehensive tutorial but not very well organized. Towards the end (Domain Analysis) it becomes fairly opaque, and several issues are mixed up.

(d) Empirical Methods in Information Extraction, Claire Cardie, AI Magazine. Another overview of several of the MUC tasks.

Machine Learning
(a) Relational Markov Networks for Collective Information Extraction, Bunescu and Mooney, ACL 2004

(b) Statistical Information Extraction, a course offered by Andrew McCallum at UMass. I will look through it, collect appropriate papers, and summarize them here.

In general, machine learning approaches to information extraction are becoming more popular. Several points to note (I will expand on them later):

1. Mainly supervised techniques. Unfortunately, labeled data is very sparse (ACE, for example, has a labeled corpus totaling only about 300,000 for training across all tasks).

2. Modeling the sequence is important (as can be expected).

3. Semi-supervised methods are becoming more popular. A recent innovation is to use other, related problems to help the problem at hand.

4. CRFs (conditional random fields, a very popular recent development) address points 1 and 2 in a single step.

Named-Entity Recognition

Until recently, the basic recipe for named-entity recognition followed these two steps:

(a) Train classifiers (from labeled data) to learn three classes: start of a named entity, inside a named entity, and other.

(b) Use the resulting classifier outputs to form a huge lattice of possible label sequences, prune it, and then find the sequence with the highest probability.
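Step (b) is typically a Viterbi decode over the lattice. Here is a toy sketch: the per-token label probabilities (pretend classifier outputs from step (a)) and the uniform transition table are made up for illustration; with uniform transitions Viterbi reduces to the per-token argmax, but the same machinery handles informative transition scores.

```python
import math

LABELS = ["START-NE", "IN-NE", "OTHER"]

def viterbi(obs_probs, trans_probs):
    """obs_probs: list of {label: P(label | token)}, one dict per token.
    trans_probs: {(prev_label, cur_label): P(cur | prev)}."""
    best = {lab: math.log(p) for lab, p in obs_probs[0].items()}
    back = []
    for probs in obs_probs[1:]:
        nxt, ptr = {}, {}
        for cur, p in probs.items():
            # Best previous label for reaching `cur` at this position
            prev, score = max(
                ((pl, best[pl] + math.log(trans_probs[(pl, cur)])) for pl in best),
                key=lambda x: x[1],
            )
            nxt[cur] = score + math.log(p)
            ptr[cur] = prev
        best, back = nxt, back + [ptr]
    # Trace back the highest-probability path
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Made-up classifier outputs for a three-token sentence
obs = [
    {"START-NE": 0.7, "IN-NE": 0.1, "OTHER": 0.2},
    {"START-NE": 0.2, "IN-NE": 0.6, "OTHER": 0.2},
    {"START-NE": 0.1, "IN-NE": 0.2, "OTHER": 0.7},
]
trans = {(a, b): 1 / 3 for a in LABELS for b in LABELS}  # uniform transitions
print(viterbi(obs, trans))  # -> ['START-NE', 'IN-NE', 'OTHER']
```

Pruning in the real systems simply drops low-scoring lattice entries before this dynamic program runs, to keep decoding tractable.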

A Statistical Model for Multilingual Entity Detection and Tracking. This paper follows the basic recipe above and compares the performance of two algorithms for step (a), namely Robust Risk Minimization (Text Chunking based on a Generalization of Winnow) and Maximum Entropy (A Maximum Entropy Approach to Natural Language Processing). A more readable paper on the application of maximum entropy to text classification is Using Maximum Entropy for Text Classification.
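For intuition about the maximum-entropy side: with just two classes and indicator features, maximum entropy reduces to logistic regression. The sketch below trains such a classifier by plain gradient ascent on the conditional log-likelihood; the tiny dataset, feature scheme, and hyperparameters are all made up for illustration.

```python
import math

def featurize(text):
    # Binary word-indicator features, the simplest maxent feature set
    return set(text.lower().split())

def train(examples, epochs=200, lr=0.5):
    w = {}  # weight per feature for class 1 vs. class 0
    for _ in range(epochs):
        for feats, y in examples:
            z = sum(w.get(f, 0.0) for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))          # P(class=1 | features)
            for f in feats:
                w[f] = w.get(f, 0.0) + lr * (y - p)  # log-likelihood gradient step
    return w

def predict(w, feats):
    return 1 if sum(w.get(f, 0.0) for f in feats) > 0 else 0

data = [
    (featurize("shares rose sharply today"), 1),   # finance
    (featurize("stock prices rose again"), 1),
    (featurize("the team won the match"), 0),      # sport
    (featurize("match ended in a draw"), 0),
]
w = train(data)
print(predict(w, featurize("prices rose")))  # -> 1
```

The NER systems above use the multi-class version of this model with richer features (word shape, context windows, gazetteers), but the training principle is the same.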

Resources
1. Infrastructure

(a) UIMA

(b) GATE

2. IE Tools

(a) Mallet

(b) MinorThird