Introduction

Focused crawling is a relatively new, promising approach to improving the recall of expert search on the Web. Typically, a search engine returns some potentially relevant documents, and the user could find better results in the neighbourhood of these pages (viewing the Web as a graph). However, manually surfing hundreds or thousands of Web pages is out of the question for time and cost reasons (i.e., the cost of the “intellectual cycles” spent by the human user). Often the best results can be obtained from portals like www.yahoo.com or www.invisibleweb.com, where documents are preclassified by human editors into a hierarchy of topics, also known as an ontology.

Unfortunately, maintaining such an ontology with human experts (or cheap students) as classifiers is barely feasible in the long term. This is where focused crawling comes in: it starts from a user- or community-specific tree of topics along with a few training documents for each tree node, and then crawls the Web with a focus on these topics of interest. This process can either build a personalized, hierarchical ontology whose tree nodes are populated with relevant high-quality documents, or it can be initiated to process a single expert query such as the ones above (i.e., viewing the query terms as an initial training document). The key components of a focused crawler are a document classifier, which tests whether a visited document fits into one of the specified topics of interest, and a distiller, which identifies the best URLs for the crawl frontier (i.e., those hyperlinks in already visited documents that, when traversed, promise the best results in the continuation of the crawl). The distiller should be aware of the specified topics, too, in order to keep the crawl focused. For both components, therefore, the quality of the training data is the most critical issue and a potential bottleneck for the effectiveness and scale of a focused crawler.
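The interplay of classifier, distiller, and crawl frontier can be illustrated with a minimal Python sketch. It is not the implementation of any particular system: the in-memory graph TOY_WEB, the keyword-overlap function classify, and the rule of ranking a link by the score of the page that contains it are all assumptions made for illustration. A real crawler would fetch pages over HTTP, use a trained classifier, and use a more elaborate distiller.

```python
import heapq

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a": ("database systems and query optimization", ["b", "c"]),
    "b": ("query processing in database engines", ["d"]),
    "c": ("football results and sports news", ["e"]),
    "d": ("transaction recovery in database systems", []),
    "e": ("more sports news", []),
}

def classify(text, topic_terms):
    """Stand-in classifier (assumption): fraction of topic terms in the text."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)

def focused_crawl(seeds, topic_terms, threshold=0.3, max_pages=10):
    """Crawl TOY_WEB, expanding only pages the classifier accepts."""
    # heapq is a min-heap, so priorities are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    visited, accepted = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited or url not in TOY_WEB:
            continue
        visited.add(url)
        text, links = TOY_WEB[url]
        score = classify(text, topic_terms)
        if score < threshold:
            continue  # off-topic page: neither accepted nor expanded
        accepted.append(url)
        # Simplistic distiller (assumption): a link inherits its parent's score.
        for link in links:
            if link not in visited:
                heapq.heappush(frontier, (-score, link))
    return accepted

result = focused_crawl(["a"], {"database", "query", "systems"})
```

In this toy run the sports page "c" is fetched but rejected, so the page "e" that is reachable only through it never enters the frontier. This is the sense in which the crawl stays focused.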

The BINGO! system implements an approach to focused crawling that aims to overcome the limitations of the initial training data. To this end, BINGO! identifies, among the crawled and positively classified documents of a topic, characteristic “archetypes” and uses them for periodically re-training the classifier; this way the crawler is dynamically adapted based on the most significant documents seen so far. Two kinds of archetypes are considered: good authorities as determined by employing Kleinberg’s link analysis algorithm, and documents that have been automatically classified with high confidence using a linear SVM classifier.
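The selection of archetypes can be sketched as follows. This is a simplified illustration under stated assumptions, not the BINGO! implementation: hits_authorities is a plain power-iteration version of Kleinberg's algorithm, the confidence dictionary stands in for the decision values an SVM would assign to positively classified documents, and both the parameter k and the rule of taking the union of the two top-k lists are hypothetical.

```python
def hits_authorities(links, iterations=50):
    """Power iteration for Kleinberg-style authority scores on a link graph."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 0.0 for n in nodes}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {n: sum(auth[d] for d in links.get(n, [])) for n in nodes}
        # Normalize both vectors to unit length to keep the values bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return auth

def select_archetypes(links, confidence, k=2):
    """Union of the top-k authorities and the top-k most confident documents.

    `confidence` maps each positively classified document to a hypothetical
    classifier confidence (e.g. an SVM decision value); only these documents
    are eligible as archetypes.
    """
    auth = hits_authorities(links)
    top_auth = sorted(confidence, key=lambda d: auth.get(d, 0.0), reverse=True)[:k]
    top_conf = sorted(confidence, key=confidence.get, reverse=True)[:k]
    return set(top_auth) | set(top_conf)

# Hypothetical example: p3 is linked from three pages, p5 has high confidence.
links = {"p1": ["p3"], "p2": ["p3"], "p4": ["p3", "p5"]}
confidence = {"p1": 0.2, "p3": 0.5, "p5": 1.7}
archetypes = select_archetypes(links, confidence, k=1)
```

The documents returned by select_archetypes would then be added to the training set of the topic before the classifier is re-trained. How many archetypes to admit per re-training step is a tuning decision that this sketch leaves open.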