Department Colloquium
Doug Downey, Assistant Professor, Northwestern University
November 17, 2009 - 3:30 Reception - 4:00 Presentation
Title: Autonomous Web-scale Information Extraction
Abstract:
Search engines are extremely useful tools for answering simple questions. However, for more complex questions -- e.g., "which nanotechnology companies are hiring on the West Coast?" -- existing search engines are less effective, because the answers are not contained on a single page. Answering these questions requires extracting and synthesizing information across multiple documents. Currently, this is a tedious and error-prone manual process.
In this talk, I will describe my research aimed at automating the extraction of information from the Web. I will present a model of the redundancy inherent in the Web, and show that the model can be used to identify correct extractions autonomously, without the manually labeled examples typically assumed in previous information extraction research. Further, while the redundancy-based model alone is ineffective for the "long tail" of infrequently mentioned facts, I will illustrate how unsupervised language models can be leveraged to overcome this limitation.
Bio:
Doug Downey is an assistant professor in the EECS Department of Northwestern University, which he joined in the fall of 2008. He obtained his PhD from the University of Washington, where he was advised by Oren Etzioni. His research interests are in the areas of natural language processing, machine learning, and artificial intelligence. At UW, he was part of the KnowItAll project, which aimed to use the Web to extract large knowledge bases autonomously. Doug's primary research results concern probabilistic models of the redundancy inherent in large corpora, along with associated techniques that allow systems like KnowItAll to extract data autonomously at high precision.

