Re-Training based on Archetypes

Whenever the filling degree of a topic exceeds a pre-specified level (for example, when the topic is populated with a few hundred positively classified documents), the re-training procedure is invoked. For the sake of simplicity, we start re-training for all topics at this point, although we could also invoke it for individual topics only.

To ensure that we have indeed made sufficient crawling progress for all topics, we additionally require that every node of the ontology tree holds at least Nmin positively classified documents. If this latter condition is not met for some topic, we can give that topic higher crawling priority by prioritizing hyperlinks in the URL queue that have been extracted from, or transitively reached from, its positive documents.
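As a minimal sketch of this trigger, assuming a hypothetical counter dictionary, thresholds N_MAX and N_MIN, and a URL queue with a boost_links_from method (none of which are part of BINGO! itself):

N_MAX = 300   # filling level that triggers re-training (hypothetical value)
N_MIN = 30    # minimum every topic must reach first (hypothetical value)

def maybe_trigger_retraining(positive_counts, url_queue):
    # positive_counts: dict mapping topic -> number of positive documents
    if not any(n > N_MAX for n in positive_counts.values()):
        return False                        # no topic is full enough yet
    lagging = [t for t, n in positive_counts.items() if n < N_MIN]
    if lagging:
        # Give lagging topics a head start: raise the queue priority of
        # hyperlinks extracted (directly or transitively) from their
        # positive documents.
        for topic in lagging:
            url_queue.boost_links_from(topic)   # hypothetical queue method
        return False
    return True                             # re-train all topics now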

The purpose of the re-training procedure is to identify new training documents that promise to be better than the original ones taken from the bookmark file. Here "better" means more characteristic, in the sense that the features of the new training data capture the topic-specific terminology and concepts and are more discriminative with regard to competing topics. It is obvious that we should consider asking the human user for suggestions of characteristic documents, and we do indeed support such human interaction as an option. At the same time, however, we should also strive to provide automated support, for scalability and versatility. Our approach is thus to identify the most characteristic "archetypes" among the documents that have been positively classified into a given topic. We aim to find at least as many good archetypes as the topic initially had bookmarks; ideally, we can identify an order of magnitude more training documents of very high relevance.

BINGO! draws on the following two sources of potential archetypes:

  1. The link analysis provides us with good authorities for the given topic. We simply use the current set of positively classified documents as the input to Kleinberg's HITS algorithm. The result is a ranked list of documents in descending order of authority scores; the top-ranked documents are considered archetypes. The HITS algorithm also yields a ranked list of hubs, from which we take the top candidates for the crawl frontier; these are placed into the URL queue of the crawler, together with the hubs of all other topics, once the re-training procedure is completed.
  2. We exploit the fact that the SVM classifier yields a measure of its confidence in a positive classification, namely, the distance of the document's feature vector from the separating hyperplane. We can thus sort the documents of a topic in descending order of confidence and select the top documents as archetypes (both sources are illustrated in the sketch after this list).
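To make the two sources concrete, the following sketch selects archetypes from a topic's positively classified documents. The link matrix, the fitted SVM, and the cut-off parameters are hypothetical stand-ins; svm is assumed to behave like a scikit-learn LinearSVC, whose decision_function returns the signed distance from the separating hyperplane.

import numpy as np

def hits(adjacency, iterations=50):
    # Kleinberg's HITS by power iteration; adjacency[i, j] == 1
    # iff document i links to document j within the base set.
    n = adjacency.shape[0]
    hubs = np.ones(n)
    for _ in range(iterations):
        auths = adjacency.T @ hubs
        auths /= np.linalg.norm(auths)
        hubs = adjacency @ auths
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

def select_archetypes(svm, X, adjacency, n_auth=10, n_conf=10):
    # Union of the top authorities (source 1) and the documents the
    # SVM classifies positively with the largest margin (source 2).
    _, auths = hits(adjacency)
    top_auth = set(np.argsort(-auths)[:n_auth])
    margins = svm.decision_function(X)      # signed distance from hyperplane
    top_conf = set(np.argsort(-margins)[:n_conf])
    return top_auth | top_conf              # indices into the document set

The union deliberately over-approximates: a document that is either a strong authority or a high-confidence positive qualifies as an archetype, matching the set union in the pseudo-code below.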
Once the archetypes of a topic have been selected, the classifier for that topic is re-trained using the archetypes plus the original bookmarks as training data. This step in turn requires invoking feature selection first. The effect of re-training is thus twofold:
  1. if the archetypes capture the terminology of the topic better than the original bookmarks (which is our basic premise), then the feature selection procedure can extract better, more discriminative features for driving the classifier (a sketch of this step follows the list), and
  2. the accuracy of the classifier's test of whether a new, previously unseen document belongs to a topic is improved by using richer (e.g., longer but concise) and more characteristic training documents for building its decision model.
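For effect (1), the re-training of a single topic might look as follows. The tf-idf representation, the chi-squared selection criterion, and the scikit-learn API are illustrative stand-ins for whatever discriminative measure is actually used; labels is assumed to mark the topic's archetypes as positive and documents of competing topics as negative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

def retrain_topic(archetype_texts, labels, k=500):
    # Re-run feature selection over the archetypes, then refit the
    # topic's SVM on the reduced feature space.
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(archetype_texts)
    selector = SelectKBest(chi2, k=min(k, X.shape[1]))
    X_sel = selector.fit_transform(X, labels)
    svm = LinearSVC().fit(X_sel, labels)
    return vectorizer, selector, svm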
The next figure summarizes all steps of the re-training procedure in pseudo-code form.

if {at least one topic has more than Nmax positive documents and
    all topics have at least Nmin positive documents} {
  for each topic Vi {
    invoke link analysis using all documents of Vi as base set;
    hubs(Vi) = top Nhub documents from hub score ranking;
    authorities(Vi) = top Nauth documents from authority score ranking;
    sort documents of Vi in descending order of
      confidence for positive classification;
    archetypes(Vi) = top Nconf documents from confidence ranking
                     ∪ authorities(Vi) ∪ bookmarks(Vi) };
  for each topic Vi {
    perform feature selection based on archetypes(Vi);
    re-compute SVM decision model for Vi
      using archetypes(Vi) as training data;
    re-initialize URL queue for the crawl;
    add URLs from hubs(Vi) to URL queue
      (in descending order of hub scores,
       with round-robin selection across topics) }
}
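For concreteness, the figure could be rendered in Python roughly as follows, composing the sketches given earlier (hits, select_archetypes, retrain_topic); the attributes of topic and url_queue are hypothetical names, not part of BINGO! itself.

import itertools
import numpy as np

N_HUB = 20  # hypothetical cut-off for the hub ranking (Nhub above)

def retrain_all(topics, url_queue):
    # Phase 1: link analysis and archetype selection per topic.
    hub_urls = {}
    for topic in topics:
        hub_scores, _ = hits(topic.adjacency)
        hub_urls[topic.name] = [topic.docs[i].url
                                for i in np.argsort(-hub_scores)[:N_HUB]]
        picked = select_archetypes(topic.svm, topic.X, topic.adjacency)
        topic.archetypes = [topic.docs[i] for i in picked] + topic.bookmarks

    # Phase 2: feature selection and SVM re-training per topic.
    for topic in topics:
        texts = [d.text for d in topic.archetypes]
        labels = [d.label for d in topic.archetypes]
        topic.vectorizer, topic.selector, topic.svm = retrain_topic(texts, labels)

    # Phase 3: restart the crawl from the hubs, round-robin across
    # topics, each topic's hubs kept in descending order of hub score.
    url_queue.clear()
    for batch in itertools.zip_longest(*hub_urls.values()):
        for url in batch:
            if url is not None:
                url_queue.push(url)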