Feature Selection

The feature selection algorithm provides the BINGO! engine with the most characteristic features for a given topic; these are exactly the features that are used by the classifier for testing new documents. A good feature for this purpose discriminates competing topics from each other, i.e., topics that have the same parent in the ontology tree. Therefore, feature selection has to be topic-specific; it is invoked for every topic in the tree individually. As an example, consider an ontology with topics mathematics, agriculture, and arts, where mathematics has subcategories algebra and stochastics. Obviously, the term "theorem" is very characteristic for math documents and thus an excellent discriminator between mathematics, agriculture, and arts. However, it is of no use at all to discriminate algebra versus stochastics. A term such as "field", on the other hand, is a good indicator for the topic algebra when the only competing topic is stochastics; however, it is useless for a classifier that tests mathematics versus agriculture.

We use the Mutual Information (MI) measure for topic-specific feature selection. This technique, which is a specialized case of the notions of cross-entropy or Kullback-Leibler divergence, is known as one of the most effective methods.

Feature selection is a very important building block for a focused crawler. This is illustrated by the following figure, which shows the top features for the sample topic ROOT/Job/Java with regard to tf*idf scores and the MI measure (note that the terms are only word stems). It is obvious that the tf*idf-based top features are much less characteristic for Java documents (in this case, versus the topics Perl and XML) than the features based on MI:


                    tf*idf score                         MI measure



      new             1.4927           virtual             0.0620  

      world           1.2778           rmi                 0.0585

      object          1.2446           jdk                 0.0585

      link            1.0406           java                0.0525

      web             0.9491           runtim              0.0511

      class           0.8613           client              0.0481

      inform          0.8567           class               0.0478

      comput          0.8112           remot               0.0446

      resourc         0.7764           form                0.0438