Crawler

The initial seeds for the crawler are the URLs of the documents referenced from the various folders of the user's bookmark file; these URLs are placed in the URL queue. The crawler then processes the links in the URL queue using multiple threads. It downloads new HTML documents and stores them in the local database, which serves as a buffer for the subsequent stages of the analysis and classification pipeline. Once a crawled document has been classified, the BINGO! engine extracts all links from that document and adds them to the URL queue for further crawling. BINGO! supports several strategies for prioritizing the URLs to be visited; the simplest is depth-first traversal with a limit on the number and depth of URLs fetched from the same site.
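As an illustration of this queueing scheme (a minimal sketch, not the actual BINGO! implementation; the class name, limits, and priority rule are assumptions), the following Python fragment seeds a URL queue from the bookmark URLs and enforces a per-site limit on the number and depth of fetched URLs, popping deeper URLs first in a depth-first style:

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass, field
from urllib.parse import urlparse

MAX_URLS_PER_SITE = 50    # hypothetical per-site fetch limit
MAX_DEPTH_PER_SITE = 5    # hypothetical per-site depth limit

@dataclass(order=True)
class QueueEntry:
    priority: int                       # lower value is popped earlier
    url: str = field(compare=False)
    depth: int = field(compare=False)

class URLQueue:
    """Priority queue of URLs with per-site bookkeeping (illustrative only)."""

    def __init__(self, bookmark_urls):
        self.heap = []
        self.seen = set()
        self.per_site_count = defaultdict(int)
        # Seed with the documents referenced from the user's bookmark folders.
        for url in bookmark_urls:
            self.push(url, depth=0)

    def push(self, url, depth):
        if url in self.seen:
            return
        site = urlparse(url).netloc
        if depth > MAX_DEPTH_PER_SITE or self.per_site_count[site] >= MAX_URLS_PER_SITE:
            return
        self.seen.add(url)
        self.per_site_count[site] += 1
        # Negative depth as priority: deeper URLs come out first (depth-first flavor).
        heapq.heappush(self.heap, QueueEntry(-depth, url, depth))

    def pop(self):
        return heapq.heappop(self.heap) if self.heap else None
```

In a multi-threaded crawler, access to such a queue would additionally be guarded by a lock; this is omitted here for brevity.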

Links from rejected documents (i.e., documents that did not pass the classification test for a given topic) are considered for further crawling as well; however, we restrict the depth of links additionally traversed from such documents to two. The rationale behind this threshold is that one often has to "tunnel" through topic-unspecific welcome or table-of-contents pages before again reaching a thematically relevant document. Once a document is reached that passes the classification test, the crawling-depth limit along this path is lifted.
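To make the tunneling rule concrete, the following sketch (function and parameter names are hypothetical) tracks the number of consecutive rejected documents along a path: links from a rejected document inherit the parent's tunneling depth plus one and are discarded beyond the limit of two, while an accepted document resets the counter so that its links are crawled without restriction:

```python
TUNNEL_LIMIT = 2  # max. consecutive rejected documents along a path

def enqueue_outlinks(outlinks, accepted, tunnel_depth, queue):
    """Decide which outgoing links of a crawled document to follow.

    accepted     -- True if the document passed the classification test
    tunnel_depth -- number of consecutive rejected ancestors on this path
    queue        -- object with a push(url, depth) method (see sketch above)
    """
    if accepted:
        # Thematically relevant document: reset the tunneling counter.
        next_depth = 0
    else:
        next_depth = tunnel_depth + 1
        if next_depth > TUNNEL_LIMIT:
            # Too many topic-unspecific pages in a row; abandon this path.
            return
    for link in outlinks:
        queue.push(link, next_depth)
```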

The BINGO! crawler imposes no global limit on crawling depth. Rather, it uses the filling degrees of the ontology's topics as a stopping criterion. When a topic holds a certain number of successfully classified documents (say, 200), BINGO! suspends crawling. At this point, link analysis and re-training are performed for all topics, and then crawling is resumed.
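The following driver loop sketches this stopping criterion (the threshold of 200 documents per topic is taken from the text; the function names and data structures are assumptions, not the BINGO! code):

```python
TOPIC_CAPACITY = 200  # illustrative filling degree per topic

def crawl(queue, topics, fetch, classify, link_analysis, retrain):
    """Crawl and pause for re-training whenever a topic's filling degree
    reaches the threshold.  topics maps a topic name to its accepted documents."""
    while True:
        entry = queue.pop()
        if entry is None:
            break                        # no more URLs to visit
        doc = fetch(entry.url)
        topic = classify(doc)            # topic name, or None if rejected
        if topic is not None:
            topics[topic].append(doc)
            if len(topics[topic]) >= TOPIC_CAPACITY:
                # A topic is "full": suspend crawling, run link analysis and
                # re-training for all topics, then resume with the updated models.
                link_analysis(topics)
                retrain(topics)
```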