Classifier

Document classification consists of a training phase for building a mathematical decision model based on intellectually preclassified documents, and a decision phase for classifying new, previously unseen, documents fetched by the crawler.

In the training phase, BINGO! builds a topic-specific classifier for each node of the ontology tree. Initially, the bookmarked documents of the topic serve as training data; these are periodically augmented by "archetypes" of the topic as the crawl proceeds. For non-leaf nodes of the ontology tree the training data is the union of the training data of all subtopics and the topic itself.
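The union rule for non-leaf nodes can be illustrated with a minimal sketch. The `TopicNode` class, its field names, and the sample topics and document IDs are hypothetical and chosen for illustration only; they are not taken from the BINGO! code base.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class TopicNode:
    """Hypothetical ontology node; names are illustrative, not BINGO!'s."""
    name: str
    # Documents attached directly to this topic (bookmarks plus archetypes).
    own_docs: Set[str] = field(default_factory=set)
    children: List["TopicNode"] = field(default_factory=list)

    def training_data(self) -> Set[str]:
        """Return this topic's documents plus those of all subtopics."""
        docs = set(self.own_docs)
        for child in self.children:
            docs |= child.training_data()
        return docs


# Made-up example tree with made-up document IDs.
root = TopicNode("root", children=[
    TopicNode("databases", {"d1", "d2"}, [TopicNode("data mining", {"m1"})]),
    TopicNode("networks", {"n1"}),
])
```

For a leaf the result is just its own documents, so the recursion needs no special case for leaves.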

In the decision phase, BINGO! tests a new document against all topics in a top-down manner. Starting with the root, which corresponds to the union of the user's topics of interest, we invoke the classifiers for all topics with the same parent; we refer to these as "competing" topics as the document will eventually be placed in at most one of them. Each of the topic-specific classifiers returns a yes-or-no decision and also a measure of confidence for this decision. We assign the document to the tree node with the highest confidence in a positive decision. For example, if we used a Bayesian classifier, this would be the tree node with the highest likelihood that the document was "generated" from this topic given the features of the topic's training data. If none of the topics with the same parent returns yes, we place the document into a special tree node "others" under the same parent.
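The top-down decision procedure can be sketched as follows. This is an illustration under stated assumptions, not the system's actual code: the function `classify_top_down`, the dictionary-based tree, and the toy keyword classifiers are all hypothetical, and the sketch assumes that the descent continues into the winning topic until a leaf is reached.

```python
from typing import Callable, Dict, List, Tuple

# A topic classifier returns (yes/no decision, confidence in that decision).
Classifier = Callable[[str], Tuple[bool, float]]


def classify_top_down(doc: str, root: str,
                      children: Dict[str, List[str]],
                      classifiers: Dict[str, Classifier]) -> str:
    """Walk down the topic tree, choosing the most confident 'yes' per level."""
    node = root
    while children.get(node):
        best_topic, best_conf = None, float("-inf")
        for topic in children[node]:          # competing topics
            accepted, conf = classifiers[topic](doc)
            if accepted and conf > best_conf:
                best_topic, best_conf = topic, conf
        if best_topic is None:                # no competing topic said yes
            return node + "/others"
        node = best_topic
    return node


def keyword_classifier(keyword: str) -> Classifier:
    """Toy stand-in for a trained classifier (assumption, for the demo only)."""
    def classify(doc: str) -> Tuple[bool, float]:
        count = doc.lower().split().count(keyword)
        return count > 0, float(count)
    return classify


children = {"root": ["sports", "science"], "science": ["physics", "biology"]}
classifiers = {t: keyword_classifier(t)
               for t in ["sports", "science", "physics", "biology"]}
```

With these toy classifiers, a document mentioning both "science" and "physics" ends up in `physics`, while one that mentions only "science" is placed in `science/others`.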

BINGO! uses support vector machines (SVMs) as topic-specific classifiers. This method has been shown to be both efficient and very effective for text classification in general. We use the linear form of SVM (i.e., with a trivial kernel function), where training amounts to finding a hyperplane in the m-dimensional feature vector space that separates a set of positive training examples for a topic from a set of negative examples (those of all competing topics with the same parent) with maximum margin.
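To make the idea of a separating hyperplane w·x + b = 0 concrete, here is a minimal sketch of linear SVM training. It is not the solver used by BINGO!: it swaps in plain batch subgradient descent on the soft-margin (hinge-loss) objective, which approximates the maximum-margin hyperplane on separable data. The function names, the hyperparameters `lam`, `lr`, and `epochs`, and the toy data are assumptions. Using the distance from the hyperplane as the confidence measure is likewise an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(xs: List[Sequence[float]], ys: List[int],
                     lam: float = 0.01, lr: float = 0.01,
                     epochs: int = 5000) -> Tuple[List[float], float]:
    """Minimise lam/2*|w|^2 + mean hinge loss by batch subgradient descent.

    ys must contain +1 (positive examples) and -1 (negative examples).
    """
    n, m = len(xs), len(xs[0])
    w, b = [0.0] * m, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1.0:     # point is inside the margin
                for j in range(m):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def decide(w: Sequence[float], b: float,
           x: Sequence[float]) -> Tuple[bool, float]:
    """Return (yes/no, confidence); confidence = distance to the hyperplane."""
    score = dot(w, x) + b
    norm = sum(wj * wj for wj in w) ** 0.5
    return score > 0, (abs(score) / norm if norm else 0.0)


# Toy 2-dimensional feature vectors: positives for one topic, negatives
# standing in for the documents of its competing topics.
xs = [[2, 2], [3, 3], [2, 3], [0, 0], [1, 0], [0, 1]]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
```

A production system would normally rely on a dedicated quadratic-programming or SMO-style solver rather than this loop; the sketch only shows what "finding a separating hyperplane with a large margin" means operationally.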

In BINGO! we use an existing open-source SVM implementation that is part of the BioJava package. Our system also allows the user to interactively inspect and possibly override the classifier's decisions. So even if the classifier rejects a document for a given topic, the user may intellectually assign it to this topic (and may analogously drop documents that were accepted by the classifier); these documents are then treated as if they were among the initial bookmarks (i.e., intellectually classified data with 100 percent confidence). This kind of interactive user control is optional; BINGO! can also run in fully automated mode, relying on SVM classification only.
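The interplay between user overrides and automatic classification can be sketched as below. The function `final_decision`, the `overrides` dictionary keyed by document and topic, and the `"user"`/`"svm"` source labels are hypothetical; the sketch only assumes that a manual judgment, when present, replaces the classifier's output and carries full confidence.

```python
from typing import Dict, Tuple

# Manual judgments: (document ID, topic) -> True (assign) or False (drop).
Overrides = Dict[Tuple[str, str], bool]


def final_decision(doc_id: str, topic: str,
                   svm_decision: Tuple[bool, float],
                   overrides: Overrides) -> Tuple[bool, float, str]:
    """Return (accepted, confidence, source) for a document and topic."""
    manual = overrides.get((doc_id, topic))
    if manual is not None:
        # A user judgment is treated like an initial bookmark: confidence 1.0.
        return manual, 1.0, "user"
    accepted, confidence = svm_decision
    return accepted, confidence, "svm"


# Made-up example: the user accepts doc7 and drops doc9 for "databases".
overrides = {("doc7", "databases"): True, ("doc9", "databases"): False}
```

In fully automated mode the `overrides` dictionary is simply empty, so every call falls through to the SVM result.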