First results

The following figure shows the classifier's precision for the sample topic "ROOT/Job/Java" at the end of each crawl-and-retraining period, for all techniques except the vanilla "Crawler" method. The "Crawler" method is omitted because its curve turned out to be a more or less flat line starting at the initial precision of approximately 0.4. This also shows that the initial bookmarks are at best mediocre training data, at least when no feature selection is performed.

Note that we could probably have chosen better, richer and more characteristic training data by investing more time into careful inspection of candidates, but we did not "optimize" this initial step as we wanted to stress-test the focused crawler.

Among the other four techniques, the full "Crawler+SVM+MI+HITS" algorithm clearly outperformed the other three variants and gradually improved the classification precision to about 0.9. The best precision achieved by the three less competitive techniques was about 0.8, reached by the "Crawler+SVM+MI" variant. These results indicate that feature selection is crucial for the classifier, and that the adaptive re-training of our approach also gradually improved the selection of discriminative features. Note that, although the most significant improvement in precision was reached after one or two crawl-and-retraining periods, precision continued to improve with further iterations.
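To make the role of the "MI" component more concrete, the following is a minimal sketch of mutual-information feature selection over binary term occurrences. It assumes the textbook definition of mutual information between a term and the topic label; the function names, the document representation (sets of terms), and the toy data are illustrative assumptions and not the system's actual implementation.

```python
import math
from collections import Counter

def mutual_information(docs, labels):
    """Return {term: MI(term, topic)} for binary term presence.

    docs   -- list of sets of terms, one set per document (assumed format)
    labels -- list of booleans, True if the document is on-topic
    """
    n = len(docs)
    n_pos = sum(labels)
    df = Counter()      # number of documents containing the term
    df_pos = Counter()  # number of on-topic documents containing the term
    for terms, is_pos in zip(docs, labels):
        for t in terms:
            df[t] += 1
            if is_pos:
                df_pos[t] += 1

    scores = {}
    for t in df:
        mi = 0.0
        # Sum over the four cells of the term/topic contingency table.
        for has_term in (True, False):
            for is_pos in (True, False):
                n_class = n_pos if is_pos else n - n_pos
                with_term = df_pos[t] if is_pos else df[t] - df_pos[t]
                joint = with_term if has_term else n_class - with_term
                n_term = df[t] if has_term else n - df[t]
                if joint == 0 or n_term == 0 or n_class == 0:
                    continue  # empty cells contribute nothing
                mi += (joint / n) * math.log2(joint * n / (n_term * n_class))
        scores[t] = mi
    return scores

def select_features(docs, labels, k):
    """Return the k terms with the highest MI (ties broken alphabetically)."""
    scores = mutual_information(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy example with hypothetical documents.
docs = [{"java", "jdbc"}, {"java", "applet"},
        {"soccer", "goal"}, {"soccer", "news"}]
labels = [True, True, False, False]
top_terms = select_features(docs, labels, 2)
```

In this toy data "java" and "soccer" separate the two classes perfectly and are therefore ranked first; terms that occur in only one document receive lower scores. Re-running such a selection after each retraining period is one way the set of discriminative features could change as the training data grows.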

The next figure shows the crawler precision for the topic "ROOT/Job/Java". As expected, this measure decreases steadily for all techniques as the crawl proceeds and visits more and more documents, only a minority of which qualify for the given topic. The chart does not reveal any significant differences between the techniques and is shown merely for completeness.
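The decreasing trend can be illustrated with a small sketch. It assumes that crawler precision is the fraction of all documents visited so far that the classifier accepted for the topic; this definition, the function name, and the sample data are assumptions made for illustration.

```python
def cumulative_precision(accepted_flags):
    """Return the running fraction of accepted documents after each visit.

    accepted_flags -- iterable of booleans in visit order, True if the
                      visited document was positively classified
    """
    precisions = []
    accepted = 0
    for visited, flag in enumerate(accepted_flags, start=1):
        accepted += bool(flag)
        precisions.append(accepted / visited)
    return precisions

# Hypothetical crawl: on-topic pages near the seeds, fewer later on.
curve = cumulative_precision([True, True, False, False])
```

Under this definition the curve starts high while the crawler is still close to the seed documents and falls as the share of off-topic visits grows.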

To give a more concrete impression of what the BINGO! system achieved in this preliminary and still small-scale experiment, the next figure shows some of the original bookmarks for the Java topic and some of the documents that were found and positively classified during the crawl (ordered by Kleinberg's authority score; newly encountered authorities underlined):


Initial bookmarks:

       http://developer.java.sun.com/developer/TechTips/
       http://javaboutique.internet.com/
       http://www.nikos.com/javatoys/
       http://www.jguru.com/
       http://java.sun.com/products/jdbc/
       http://www.javaworld.com/javaworld/
       ......

Focused crawling results:                                  Authority score

       http://java.sun.com/products/jdk/1.2/docs/api/           0.5899
       http://www.java.sun.com/docs/books/tutorial/index.html   0.5686
       http://developer.java.sun.com/developer/infodocs/        0.2768
       http://java.sun.com/products/jdbc/                       0.1325
       http://www.apl.jhu.edu/~hall/java/                       0.1142
       http://java.sun.com/products/OV_stdExt.html              0.0966
       http://java.apache.org/index.html                        0.0822
       ......
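
For readers unfamiliar with the ranking used above, the following is a minimal sketch of how authority scores can be computed with Kleinberg's HITS iteration on a link graph. The graph, the page names, the iteration count, and the Euclidean normalization are assumptions chosen for illustration; the sketch does not reproduce the scores listed above.

```python
import math

def normalize(scores):
    """Scale scores to unit Euclidean length (no-op for an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}

def hits(links, iterations=50):
    """Return (authority, hub) score dicts for a directed link graph.

    links -- dict mapping each page to the pages it links to
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking to a node.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: sum of the new authority scores of the linked pages.
        new_hub = {n: sum(new_auth[d] for d in links.get(n, ()))
                   for n in nodes}
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub

# Hypothetical graph: pages "a" and "b" link to "c", and "b" also links to "d".
links = {"a": ["c"], "b": ["c", "d"], "c": [], "d": []}
auth, hub = hits(links)
ranking = sorted(auth, key=lambda n: (-auth[n], n))
```

In this toy graph "c" receives links from both hubs and therefore ranks first, followed by "d"; pages without incoming links have an authority score of zero.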