The feature selection algorithm provides the BINGO! engine with the most characteristic features for a given topic; these are exactly the features that the classifier uses to test new documents. A good feature for this purpose discriminates between competing topics, i.e., topics that have the same parent in the ontology tree. Feature selection therefore has to be topic-specific: it is invoked individually for every topic in the tree. As an example, consider an ontology with topics mathematics, agriculture, and arts, where mathematics has the subcategories algebra and stochastics. The term "theorem" is clearly very characteristic for mathematics documents and thus an excellent discriminator between mathematics, agriculture, and arts; however, it is of no use at all for discriminating algebra versus stochastics. A term such as "field", on the other hand, is a good indicator for the topic algebra when the only competing topic is stochastics, but it is useless for a classifier that tests mathematics versus agriculture.
We use the Mutual Information (MI) measure for topic-specific feature selection. This technique, a special case of the cross-entropy or Kullback-Leibler divergence, is known to be among the most effective feature selection methods.
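One common way to instantiate this measure for binary term occurrence (a sketch; the exact estimator used by the system may differ) scores a term $t$ for a topic $c$ as

$$MI(t, c) = P(t \wedge c) \cdot \log \frac{P(t \wedge c)}{P(t) \cdot P(c)}$$

where the probabilities are estimated over the training documents of $c$ and its competing sibling topics: $P(t \wedge c)$ is the fraction of documents that belong to $c$ and contain $t$, $P(t)$ the fraction that contain $t$, and $P(c)$ the fraction that belong to $c$. A term that occurs equally often across all siblings, such as "theorem" for algebra versus stochastics, yields a score near zero.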
Feature selection is a crucial building block for a focused crawler. This is illustrated by the following table, which shows the top features for the sample topic ROOT/Job/Java with regard to tf*idf scores and the MI measure (note that the terms are word stems). The tf*idf-based top features are clearly much less characteristic for Java documents (here, versus the competing topics Perl and XML) than the features based on MI:
tf*idf score          MI measure
new       1.4927      virtual   0.0620
world     1.2778      rmi       0.0585
object    1.2446      jdk       0.0585
link      1.0406      java      0.0525
web       0.9491      runtim    0.0511
class     0.8613      client    0.0481
inform    0.8567      class     0.0478
comput    0.8112      remot     0.0446
resourc   0.7764      form      0.0438
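To make the topic-specific selection concrete, the following sketch computes MI scores for the terms of one topic against its sibling topics, using the formulation given above. The input format (each document represented as a set of word stems, grouped by topic name) and the function name are illustrative assumptions, not the actual BINGO! implementation:

```python
import math
from collections import Counter

def mi_scores(docs_by_topic, topic):
    """Score the terms of `topic` by mutual information against its
    sibling topics (hypothetical input: each document is a set of
    word stems, grouped under its topic name)."""
    n_total = sum(len(docs) for docs in docs_by_topic.values())
    p_c = len(docs_by_topic[topic]) / n_total  # P(document belongs to c)

    # Document frequencies: over all competing topics, and within c.
    df_all, df_topic = Counter(), Counter()
    for name, docs in docs_by_topic.items():
        for doc in docs:
            for term in doc:
                df_all[term] += 1
                if name == topic:
                    df_topic[term] += 1

    scores = {}
    for term, df_tc in df_topic.items():
        p_t = df_all[term] / n_total   # P(term t occurs)
        p_tc = df_tc / n_total         # P(t occurs AND document in c)
        scores[term] = p_tc * math.log(p_tc / (p_t * p_c))
    return scores

# Toy data mirroring the ontology example above: "field" discriminates
# algebra from stochastics, while "theorem" scores zero because it is
# equally frequent in both competing topics.
docs = {
    "algebra":     [{"field", "group", "theorem"}, {"ring", "field"}],
    "stochastics": [{"probabl", "theorem"}, {"varianc", "field"}],
}
top_features = sorted(mi_scores(docs, "algebra").items(),
                      key=lambda kv: kv[1], reverse=True)
```

On this toy data the ranking puts "field" above "theorem", which is exactly the behavior motivated at the beginning of this section.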