To ensure sufficient crawling progress for all topics, we additionally require that every node of the ontology tree holds at least X positively classified documents. If this condition is not met for some topic, we give that topic higher crawling priority by preferring hyperlinks in the URL queue that were extracted from, or transitively reached from, positive documents of that topic.
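The prioritization step can be sketched as a priority queue whose ordering boosts links that originate from positive documents of under-represented topics. This is a minimal illustrative sketch, not BINGO!'s actual implementation; the class name, the `MIN_DOCS` constant (standing in for the threshold X), and the linear boost function are all assumptions.

```python
import heapq
from collections import defaultdict

MIN_DOCS = 10  # hypothetical stand-in for the threshold X from the text

class TopicAwareQueue:
    """URL queue that favors links reached from positive documents of
    topics that still lack enough positively classified documents."""

    def __init__(self):
        self._heap = []      # entries: (negated priority, insertion order, url)
        self._counter = 0
        # topic -> number of positively classified documents so far
        self.positive_counts = defaultdict(int)

    def priority(self, topic):
        # Topics below the MIN_DOCS threshold get a boost proportional
        # to their deficit; satisfied topics keep the base priority 1.0.
        deficit = max(0, MIN_DOCS - self.positive_counts[topic])
        return 1.0 + deficit

    def push(self, url, source_topic):
        # heapq is a min-heap, so store the negated priority.
        heapq.heappush(self._heap,
                       (-self.priority(source_topic), self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

With this ordering, a link extracted from a positive document of a topic holding only 2 of the required 10 documents outranks a link from an already saturated topic.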
The purpose of the re-training procedure is to identify new training documents that promise to be better than the original ones taken from the bookmark file. Here, "better" means more characteristic, in the sense that the features of the new training data capture the topic-specific terminology and concepts and are more discriminative with regard to competing topics. An obvious option is to ask the human user for suggestions of characteristic documents, and we do support such interaction as an option. At the same time, however, we strive to provide automated support as well, for scalability and versatility. Our approach is thus to identify the most characteristic "archetypes" among the documents that have been positively classified into a given topic. We aim to find at least as many good archetypes as the topic initially had bookmarks; ideally we can identify an order of magnitude more training documents of very high relevance.
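The confidence-based part of archetype selection can be sketched as follows: rank the positively classified documents of a topic by classifier confidence (for an SVM, the distance from the separating hyperplane), keep the top N, and merge them with the original bookmarks. This is a hypothetical sketch; the function name, the `(doc_id, confidence)` input format, and the duplicate-preserving merge are illustrative choices, not BINGO!'s actual code.

```python
def select_archetypes(positive_docs, bookmarks, n_conf):
    """Pick archetypes for one topic.

    positive_docs: list of (doc_id, confidence) pairs for documents
                   positively classified into the topic.
    bookmarks:     the topic's original bookmark documents.
    n_conf:        how many top-confidence documents to keep.
    """
    # Rank positive documents by descending classifier confidence.
    ranked = sorted(positive_docs, key=lambda dc: dc[1], reverse=True)
    top = [doc for doc, _ in ranked[:n_conf]]
    # Union with the bookmarks, preserving order and dropping duplicates.
    seen, archetypes = set(), []
    for doc in top + list(bookmarks):
        if doc not in seen:
            seen.add(doc)
            archetypes.append(doc)
    return archetypes
```

Keeping the bookmarks in the result reflects the union in the re-training pseudocode: the original human-selected training data is never discarded, only augmented.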
BINGO! draws on the following two sources of potential archetypes: the documents with the highest classification confidence (i.e., the largest distance from the SVM's separating hyperplane) and the authorities identified by link analysis of the topic's positively classified documents.
if (at least one topic has more than Nmax positive documents
    and all topics have at least Nmin positive documents) {
    for each topic Vi {
        invoke link analysis using all documents of Vi as base set;
        hubs(Vi) = top Nhub documents from hub score ranking;
        authorities(Vi) = top Nauth documents from authority score ranking;
        sort documents of Vi in descending order of confidence
            for positive classification;
        archetypes(Vi) = top Nconf documents from confidence ranking
            ∪ authorities(Vi) ∪ bookmarks(Vi)
    };
    for each topic Vi {
        perform feature selection based on archetypes(Vi);
        re-compute SVM decision model for Vi
            using archetypes(Vi) as training data;
        re-initialize URL queue for the crawl;
        add URLs from hubs(Vi) to URL queue
            (in descending order of hub scores,
             with round-robin selection across topics)
    }
}
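The link-analysis step invoked above can be illustrated with a minimal HITS-style iteration over the topic's documents, computing mutually reinforcing hub and authority scores. This is a simplified sketch under assumed conventions (dictionary adjacency lists, L2 normalization, a fixed iteration count); BINGO!'s actual base-set construction and normalization may differ.

```python
def hits(links, iterations=50):
    """Minimal HITS sketch over a directed link graph given as
    {page: [pages it links to]}. Returns (hub_scores, auth_scores)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the page.
        auth = {p: sum(hub[q] for q in links if p in links.get(q, []))
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub score: sum of authority scores of pages the page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth
```

Sorting the results descending by score and truncating yields the `hubs(Vi)` and `authorities(Vi)` lists used in the re-training loop: authorities enter the archetype set as training data, while hubs seed the re-initialized URL queue.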