Engine Parameters
BINGO! initiated re-training whenever one topic reached
Nmax=200 new documents and every topic contained at
least Nmin=10 documents; we repeated ten such
re-training periods. During this process, which ran for several hours,
approximately 16,000 Web pages from about 1,000 different hosts were
crawled.
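The re-training trigger described above can be sketched as a simple predicate over per-topic document counts. This is an illustrative sketch only; the function and variable names are hypothetical and not taken from the BINGO! implementation.

```python
N_MAX = 200  # new documents in a single topic that trigger re-training
N_MIN = 10   # minimum documents required in every topic

def should_retrain(new_docs_per_topic, total_docs_per_topic):
    """Return True when re-training should start: some topic has
    accumulated N_MAX new documents AND every topic already holds
    at least N_MIN documents."""
    any_topic_full = any(n >= N_MAX for n in new_docs_per_topic.values())
    all_topics_seeded = all(n >= N_MIN for n in total_docs_per_topic.values())
    return any_topic_full and all_topics_seeded
```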
The crawling depth for URL tunnelling (i.e., paths that
are pursued from documents that did not pass the classification test)
was restricted to 2. Feature selection always selected the 100 top
terms in the MI ranking. The parameters for link analysis and archetype
selection were set to Nhub=100, Nauth=50,
and Nconf=50, and the eigenvector computation of the link
analysis always performed 30 iterations.
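The MI-based feature selection keeps, for each topic, the 100 terms with the highest mutual information between term occurrence and topic membership. A minimal sketch, assuming per-term 2x2 contingency tables of document counts (all names are illustrative):

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    """MI between term occurrence and topic membership, computed from a
    2x2 contingency table: n11 = docs in topic containing the term,
    n10 = docs outside the topic containing it, n01 = topic docs
    without it, n00 = non-topic docs without it."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # sum over the four cells: P(t,c) * log2( P(t,c) / (P(t)*P(c)) )
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:
            mi += (n_tc / n) * log2(n * n_tc / (n_t * n_c))
    return mi

def top_terms(tables, k=100):
    """Rank terms by MI and keep the k best (k=100 in the experiments)."""
    ranked = sorted(tables, key=lambda t: mutual_information(*tables[t]),
                    reverse=True)
    return ranked[:k]
```

A term that occurs in exactly the topic's documents gets maximal MI, while a term spread evenly across all documents scores zero and is discarded.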
We compared five different techniques:
- "Crawler" as the vanilla version did not perform any
feature selection or dynamic training and merely used the
initial bookmarks for training and as the crawl's seed.
- "Crawler+SVM" employed periodic re-training based on the
highest-confidence archetypes but did not exploit authorities
or hubs and did not use feature selection either.
- "Crawler+SVM+MI" additionally performed feature
selection.
- "Crawler+SVM+HITS" did not use feature selection
but exploited both high-confidence archetypes and authorities for
periodid re-training and also hubs for the crawl frontier.
- "Crawler+SVM+HITS+MI" finally was the full
variant that exploited all capabilities of the BINGO! system.
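The HITS-based variants rely on the iterative hub/authority computation with the fixed parameters stated above (30 iterations, then cut-offs Nhub=100 and Nauth=50). A hedged sketch of such a power iteration over an adjacency-list link graph; the code is illustrative and not the BINGO! implementation:

```python
from math import sqrt

def hits(out_links, iterations=30):
    """HITS-style power iteration. out_links maps each page to the
    list of pages it links to. Returns (authority, hub) score dicts."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to it
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # hub score: sum of authority scores of the pages it links to
        new_hub = {n: sum(new_auth[t] for t in out_links.get(n, ()))
                   for n in nodes}
        # normalize to unit length to keep the iteration stable
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub

def top_k(scores, k):
    """Cut-off step, e.g. k=50 for authorities, k=100 for hubs."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```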
After each crawl-and-training period, we manually inspected the
population of selected topics and determined "true positives" versus
"false positives". This way we were able to calculate the classifiers' precision
(ratio of topic-relevant documents to all documents in a node of the
ontology tree) and also the overall crawler precision (ratio of
documents relevant for a topic to the total number of documents visited
during the entire crawl by following a path from one of the topic's
training documents).
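The two precision figures reduce to simple ratios over the manually determined true/false positive counts. A minimal sketch with illustrative names:

```python
def classifier_precision(true_pos, false_pos):
    """Fraction of topic-relevant documents among all documents
    filed into a node of the ontology tree."""
    accepted = true_pos + false_pos
    return true_pos / accepted if accepted else 0.0

def crawler_precision(relevant_docs, visited_docs):
    """Fraction of topic-relevant documents among all documents visited
    during the crawl on paths from the topic's training documents."""
    return relevant_docs / visited_docs if visited_docs else 0.0
```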
For lack of time, we did not (yet) compute any recall figures,
both because of the much higher human overhead and because of the
inherent difficulty of identifying "false negatives" (i.e., relevant but
misclassified or missed documents).