Engine Parameters
BINGO! initiated re-training whenever one topic reached
Nmax=200 new documents and every topic contained at
least Nmin=10 documents; we repeated ten such
re-training periods. During this process, which ran for several hours,
approximately 16,000 Web pages from about 1,000 different hosts were
crawled.
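The re-training trigger described above can be sketched as a simple predicate over per-topic document counts. This is an illustrative sketch only; the function and variable names are hypothetical and not taken from the BINGO! implementation.

```python
N_MAX = 200  # new documents in a single topic that trigger re-training
N_MIN = 10   # minimum documents required in every topic

def should_retrain(new_docs_per_topic, total_docs_per_topic):
    """Return True when re-training should start: some topic has
    accumulated N_MAX new documents AND every topic already holds
    at least N_MIN documents."""
    any_topic_full = any(n >= N_MAX for n in new_docs_per_topic.values())
    all_topics_seeded = all(n >= N_MIN for n in total_docs_per_topic.values())
    return any_topic_full and all_topics_seeded
```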
The crawling depth for URL tunnelling (i.e., paths that
are pursued from documents that did not pass the classification test)
was restricted to 2. Feature selection always selected the 100 top
terms in the MI ranking. The parameters for link analysis and archetype
selection were set to Nhub=100, Nauth=50,
and Nconf=50, and the eigenvector computation of the link
analysis always performed 30 iterations.
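The MI-based feature selection keeps, for each topic, the 100 terms with the highest mutual information between term occurrence and topic membership. A minimal sketch, assuming per-term 2x2 contingency tables of document counts (all names are illustrative):

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    """MI between term occurrence and topic membership, computed from a
    2x2 contingency table: n11 = docs in topic containing the term,
    n10 = docs outside the topic containing it, n01 = topic docs
    without it, n00 = non-topic docs without it."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # sum over the four cells: P(t,c) * log2( P(t,c) / (P(t)*P(c)) )
    for n_tc, n_t, n_c in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        if n_tc > 0:
            mi += (n_tc / n) * log2(n * n_tc / (n_t * n_c))
    return mi

def top_terms(tables, k=100):
    """Rank terms by MI and keep the k best (k=100 in the experiments)."""
    ranked = sorted(tables, key=lambda t: mutual_information(*tables[t]),
                    reverse=True)
    return ranked[:k]
```

A term that occurs in exactly the topic's documents gets maximal MI, while a term spread evenly across all documents scores zero and is discarded.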
We compared five different techniques:
- "Crawler" as the vanilla version did not perform any
feature selection or dynamic training and merely used the
initial bookmarks for training and as the crawl's seed.
- "Crawler+SVM" employed periodic re-training based on the
highest-confidence archetypes but did not exploit authorities
or hubs and did not use feature selection either.
- "Crawler+SVM+MI" additionally performed feature
selection.
- "Crawler+SVM+HITS" did not use feature selection
but exploited both high-confidence archetypes and authorities for
periodid re-training and also hubs for the crawl frontier.
- "Crawler+SVM+HITS+MI" finally was the full
variant that exploited all capabilities of the BINGO! system.
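The HITS-based variants rely on the iterative hub/authority computation with the fixed parameters stated above (30 iterations, then cut-offs Nhub=100 and Nauth=50). A hedged sketch of such a power iteration over an adjacency-list link graph; the code is illustrative and not the BINGO! implementation:

```python
from math import sqrt

def hits(out_links, iterations=30):
    """HITS-style power iteration. out_links maps each page to the
    list of pages it links to. Returns (authority, hub) score dicts."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages linking to it
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for t in targets:
                new_auth[t] += hub[src]
        # hub score: sum of authority scores of the pages it links to
        new_hub = {n: sum(new_auth[t] for t in out_links.get(n, ()))
                   for n in nodes}
        # normalize to unit length to keep the iteration stable
        a_norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {n: v / a_norm for n, v in new_auth.items()}
        hub = {n: v / h_norm for n, v in new_hub.items()}
    return auth, hub

def top_k(scores, k):
    """Cut-off step, e.g. k=50 for authorities, k=100 for hubs."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```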
After each crawl-and-training period, we manually inspected the
population of selected topics and determined "true positives" versus
"false positives". This way we were able to calculate the classifiers' precision
(ratio of topic-relevant documents to all documents in a node of the
ontology tree) and also the overall crawler precision (ratio of
documents relevant for a topic to the total number of documents visited
during the entire crawl by following a path from one of the topic's
training documents).
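The two precision figures reduce to simple ratios over the manually determined true/false positive counts. A minimal sketch with illustrative names:

```python
def classifier_precision(true_pos, false_pos):
    """Fraction of topic-relevant documents among all documents
    filed into a node of the ontology tree."""
    accepted = true_pos + false_pos
    return true_pos / accepted if accepted else 0.0

def crawler_precision(relevant_docs, visited_docs):
    """Fraction of topic-relevant documents among all documents visited
    during the crawl on paths from the topic's training documents."""
    return relevant_docs / visited_docs if visited_docs else 0.0
```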
For lack of time, we did not (yet) compute any recall figures,
both because of the much higher human overhead and because of the
inherent difficulty of identifying "false negatives" (i.e., relevant but
misclassified or missed documents).