Engine Parameters

BINGO! initiated re-training whenever one topic reached Nmax=200 new documents and every topic contained at least Nmin=10 documents; we repeated ten such re-training periods. During this process, which ran for several hours, approximately 16,000 Web pages from about 1,000 different hosts were crawled.
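The re-training trigger can be sketched as a simple predicate over per-topic document counts. This is an illustrative reconstruction, not BINGO!'s actual code; the function and variable names are hypothetical.

```python
N_MAX = 200  # re-train once any topic has accumulated this many new documents
N_MIN = 10   # ...but only if every topic holds at least this many documents

def should_retrain(new_docs_per_topic, total_docs_per_topic,
                   n_max=N_MAX, n_min=N_MIN):
    """Return True when one topic has reached n_max new documents
    and every topic contains at least n_min documents overall."""
    return (max(new_docs_per_topic.values()) >= n_max
            and min(total_docs_per_topic.values()) >= n_min)

# Illustrative counts: "math" hit the Nmax threshold and no topic
# is below Nmin, so a re-training period would start.
new_docs = {"math": 200, "physics": 35}
total_docs = {"math": 240, "physics": 12}
print(should_retrain(new_docs, total_docs))  # True
```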

The crawling depth for URL tunnelling (i.e., paths pursued from documents that did not pass the classification test) was restricted to 2. Feature selection always chose the top 100 terms in the MI ranking. The parameters for link analysis and archetype selection were set to Nhub=100, Nauth=50, and Nconf=50, and the eigenvector computation of the link analysis always performed 30 iterations.
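For concreteness, the 30-iteration eigenvector computation can be sketched as a HITS-style power iteration over the crawled link graph; the top Nauth authorities and Nhub hubs would then be read off the resulting scores. This is a generic sketch under the assumption of a HITS-like method, not BINGO!'s exact implementation.

```python
def hits(links, iterations=30):
    """HITS-style hub/authority scores via power iteration.
    `links` maps each page to the list of pages it points to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to it
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # hub score: sum of authority scores of pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # normalize both vectors so the iteration stays bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return auth, hub

# Tiny illustrative graph: two pages both point at "c",
# so "c" emerges as the authority and "a", "b" as hubs.
auth, hub = hits({"a": ["c"], "b": ["c"]})
```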

We compared five different techniques:

After each crawl-and-training period, we manually inspected the populations of the selected topics and determined "true positives" versus "false positives". This allowed us to calculate the classifier's precision (the ratio of topic-relevant documents to all documents in a node of the ontology tree) as well as the overall crawler precision (the ratio of documents relevant to a topic to the total number of documents visited during the entire crawl along paths starting from one of the topic's training documents).
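Both precision figures reduce to the same ratio over the manually labeled counts. A minimal sketch, with purely illustrative counts (the numbers below are not results from the paper):

```python
def precision(true_positives, false_positives):
    """Precision = true positives / all documents counted."""
    total = true_positives + false_positives
    return true_positives / total if total else 0.0

# Classifier precision for one ontology node: relevant vs. all
# documents the classifier placed in that node (illustrative counts).
print(precision(80, 20))  # 0.8

# Overall crawler precision: topic-relevant documents vs. all
# documents visited on crawl paths from the topic's training data
# (again purely illustrative counts).
print(precision(4000, 12000))  # 0.25
```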

For lack of time, we did not (yet) compute any recall figures, because of the obviously much higher human overhead and the inherent difficulty of identifying "false negatives" (i.e., relevant but misclassified or missed documents).