[With contributions rom W. Cohen's IE tutorial, Shivakumar Vaithyanathan, and DBLife]

Tutorials / Overviews

  • J. Cowie, W. Lehnert, Information Extraction, Communications of the ACM, 39 (1), 1996
  • Andrew McCallum's recent survey in ACM Queue.
  • Ion Muslea, "Extraction Patterns for Information Extraction Tasks: A Survey", The AAAI-99 Workshop on Machine Learning for Information Extraction.
  • D. Appelt et. al, "Introduction to Information Extraction Technology", IJCAI-99 Tutorial
  • Line Eikvil, "Information Extraction from World Wide Web A Survey“, Report No. 945, July, 1999. ISBN 82-539-0429-0

Data Sources / Tools

  • MinorThird [7]


  • [Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP’97, p194-201.
  • [Califf & Mooney 1999], Califf, M.E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).
  • [Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002)
  • [Cohen, Kautz, McAllester 2000] Cohen, W; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).
  • [Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.
  • [Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).
  • [Cohen, 2000b] Cohen, W. Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeeth International Conference (ML-2000).
  • [Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
  • [De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Larence Erlbaum, 1982, 149-176.
  • [Freitag 98] Freitag, D: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).
  • [Freitag, 1999], Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.
  • [Freitag 2000], Freitag, D: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).
  • Freitag & Kushmerick, 1999] Freitag, D; Kushmerick, D.: Boosted Wrapper Induction. Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99)
  • [Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinakge. In Proceedings AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.
  • [Kushmerick, 2000] Kushmerick, N: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118(pp 15-68).
  • [Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, In Proceedings of ICML-2001.
  • [Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master’s thesis. UC San Diego.
  • [McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira. F., Maximum entropy Markov models for information extraction and segmentation, In Proceedings of ICML-2000
  • [Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R. A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), p. 226 - 233.
  • [Muslea et al, 1999] Muslea, I.; Minton, S.; Knoblock, C. A.: A Hierarchical Approach to Wrapper Induction. Proceedings of Autonomous Agents-99.
  • [Muslea et al, 2000] Musclea, I.; Minton, S.; and Knoblock, C. Hierarhical wrapper induction for semistructured information sources. Journal of Autonomous Agents and Multi-Agent Systems.
  • [Nahm & Mooney, 2000] Nahm, Y.; and Mooney, R. A mutually beneficial integration of data mining and information extraction. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pages 627--632, Austin, TX.
  • [Punyakanok & Roth 2001] Punyakanok, V.; and Roth, D. The use of classifiers in sequential inference. Advances in Neural Information Processing Systems 13.
  • [Ratnaparkhi 1996] Ratnaparkhi, A., A maximum entropy part-of-speech tagger, in Proc. Empirical Methods in Natural Language Processing Conference, p133-141.
  • [Ray & Craven 2001] Ray, S.; and Craven, Ml. Representing Sentence Structure in Hidden Markov Models for Information Extraction. Proceedings of the 17th International Joint Conference on Artificial Intelligence, Seattle, WA. Morgan Kaufmann.
  • [Soderland 1997]: Soderland, S.: Learning to Extract Text-Based Information from the World Wide Web. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97).
  • [Soderland 1999] Soderland, S. Learning information extraction rules for semi-structured and free text. Machine Learning, 34(1/3):233-277.
  • Crawling the Hidden Web. Sriram Raghavan, Hector Garcia-Molina. VLDB 2001. (regular paper, data extraction from the web)
  • RoadRunner: Towards Automatic Data Extraction from Large Web Sites. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. VLDB 2001. (regular paper, data extraction from the web)
  • Visual Web Information Extraction with Lixto. Robert Baumgartner, Sergio Flesca, Georg Gottlob. VLDB 2001. (regular paper, data extraction from the web)

Additional Notes

Community content is available under CC-BY-SA unless otherwise noted.