Thinking outside the search box
4:00pm 17 September 2010
Ian H. Witten and David Milne
Professor and PhD Candidate, University of Waikato
There are many opportunities to improve the interactivity of information retrieval beyond the ubiquitous search box.
One idea is to use knowledge bases (e.g. controlled vocabularies, classification schemes, thesauri, ontologies) to organize, describe and navigate information spaces. These resources are popular in libraries and specialist collections, but have proven too expensive to build and too narrow in coverage to be applied to everyday web-scale search.
Wikipedia has the potential to change all that. This online, collaboratively generated encyclopedia is easily the largest, fastest-growing and most consulted reference work in existence. It is far broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. What if search engines could consult this resource as easily as we do, to understand more about the documents they encounter and help us explore them more effectively?
This is not a far-fetched idea. While clearly intended for human readers, the raw structure of Wikipedia bears a striking resemblance to traditional knowledge bases and provides many footholds for algorithms to extract machine-readable knowledge. It sits somewhere between the chaotic and (at least to machines) incomprehensible web, and the exhaustively formalized knowledge required for artificial intelligence.
This talk describes ongoing research into using Wikipedia to make other textual information sources easier to search and navigate. We break this down into three key problems:
- Extracting structured knowledge from Wikipedia;
- Connecting it to textual documents; and
- Allowing people to easily, effectively and intuitively tap into it while searching and browsing.
There will be more than just talk. For each of the three problems described above we will provide live demonstrations of the systems we have developed to address them.
For the extraction problem, we present an extremely large thesaurus-like structure that has been automatically generated from Wikipedia, and show how machines can reason over it. For the connection task, we demonstrate an algorithm that can automatically detect and disambiguate Wikipedia topics when they are mentioned in any textual document, and intelligently predict those that are most likely to be of interest to the reader. For the final problem, we present several end-user applications that combine the work described above with slick visualization techniques to provide enhanced browsing and searching experiences.
All of the presented systems are open source and publicly available on the web.
About David Milne
David Milne has recently completed his PhD thesis on mining Wikipedia for information retrieval.
He has presented papers at numerous international conferences, and won the “Best Paper” award at the 2008 Conference on Information and Knowledge Management.
About Prof Ian H. Witten
Ian H. Witten, David’s supervisor, is a Professor of Computer Science at the University of Waikato in New Zealand, where he directs the New Zealand Digital Library research project. He has published widely in the fields of data mining and information retrieval.