Link Analysis

The link structure between documents in each topic is an additional source of information about how well they capture the topic. We apply Kleinberg's link analysis method, known as HITS, to each topic of the ontology. This method aims to identify a set of authorities, which should be the Web pages with the most significant and/or comprehensive information on the topic, and a set of hubs, which should be the best link collections with pointers to good authorities.

The algorithm considers a small part of the hyperlink-induced Web graph, on the order of a few hundred or a few thousand documents. The node set of the graph is constructed in two steps:

  1. we include all documents that have been positively classified into the topic under consideration, which form the "root set" in Kleinberg's terminology, and
  2. we add all successors of these documents (i.e., documents that can be reached along one outgoing edge) and a reasonably sized subset of predecessors (i.e., documents that have a direct hyperlink to a document in the root set).
The predecessors are determined by asking a Web search engine that internally maintains a large fraction of the full Web graph and can thus answer "link: <...>" queries; the BINGO! system queries Google for this purpose. Unlike the set of successors, the set of predecessors of the root set can be extremely large (namely, when the root set already contains an excellent authority or hub that many personal home pages link to); so we artificially limit the number of added documents to 1000 by random selection.
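A minimal sketch of this node-set construction in Python could look as follows; the helper functions successors_of and predecessors_of (wrapping the crawler's link extraction and the search engine's "link:" queries) and the name build_node_set are illustrative assumptions, not part of BINGO!:

```python
import random

def build_node_set(classified_docs, successors_of, predecessors_of,
                   max_predecessors=1000):
    """Construct the node set S for the topic-specific link analysis.

    classified_docs: documents positively classified into the topic (root set)
    successors_of(d): documents reachable from d along one outgoing hyperlink
    predecessors_of(d): documents linking to d, e.g. via search-engine queries
    """
    nodes = set(classified_docs)

    # Step 2a: add all successors of the root-set documents.
    for doc in classified_docs:
        nodes.update(successors_of(doc))

    # Step 2b: add predecessors, capping their number by random selection,
    # since the full predecessor set can be extremely large.
    predecessors = set()
    for doc in classified_docs:
        predecessors.update(predecessors_of(doc))
    predecessors -= nodes
    if len(predecessors) > max_predecessors:
        predecessors = set(random.sample(sorted(predecessors), max_predecessors))
    return nodes | predecessors
```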

Finally, the edge set E for the entire set of nodes is constructed by fetching all documents and analyzing their hyperlinks. The actual link analysis computes an authority score x_p and a hub score y_p for each document p in the node set S, based on the following mutual reinforcement relationship:

x_p = \sum_{q: (q,p) \in E} y_q   and   y_p = \sum_{q: (p,q) \in E} x_q   for all p \in S.

The intuition behind these mutually recursive definitions is that a good hub is a page that points to many good authorities, and a good authority is a page that is pointed to by many good hubs. The recursion in the above formulas is resolved by computing the x and y vectors iteratively: we start by assigning uniform initial hub and authority scores, and then compute x(i) and y(i) from the previous iteration's values x(i-1) and y(i-1); after each iteration the two vectors are normalized to unit length (i.e., L2 norm 1).
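In the standard matrix formulation of HITS (with A denoting the adjacency matrix of the graph on S, where A_pq = 1 iff p links to q; the symbol A is introduced here only for illustration), one iteration reads

x(i) = A^T y(i-1)   and   y(i) = A x(i-1),

followed by rescaling both vectors to unit L2 length.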

Since we are not really interested in computing the absolute values of the authority and hub scores of all documents in S but merely want to identify the top N authorities and hubs (with N on the order of 100), we interpret the notion of convergence in a somewhat loose manner and simply stop after a fixed number of iterations (the current setup of BINGO! uses 30 iterations).
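A compact sketch of this iterative computation in Python, assuming the node and edge sets from above are given as plain collections (the function name hits and all variable names are illustrative, not BINGO!'s actual code):

```python
from collections import defaultdict
from math import sqrt

def hits(nodes, edges, iterations=30, top_n=100):
    """Iterative HITS computation with a fixed number of iterations.

    nodes: the node set S (document identifiers)
    edges: (source, target) hyperlink pairs between documents in S
    Returns the top_n documents ranked by authority score and by hub score.
    """
    nodes = set(nodes)
    out_links = defaultdict(list)   # p -> documents that p points to
    in_links = defaultdict(list)    # p -> documents that point to p
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    # Uniform initial scores.
    x = {p: 1.0 for p in nodes}     # authority scores
    y = {p: 1.0 for p in nodes}     # hub scores

    for _ in range(iterations):
        # Authority score: sum of hub scores of all predecessors.
        x_new = {p: sum(y[q] for q in in_links[p] if q in y) for p in nodes}
        # Hub score: sum of authority scores of all successors.
        y_new = {p: sum(x[q] for q in out_links[p] if q in x) for p in nodes}
        # Normalize both vectors to unit L2 length.
        x_norm = sqrt(sum(v * v for v in x_new.values())) or 1.0
        y_norm = sqrt(sum(v * v for v in y_new.values())) or 1.0
        x = {p: v / x_norm for p, v in x_new.items()}
        y = {p: v / y_norm for p, v in y_new.items()}

    authorities = sorted(nodes, key=lambda p: x[p], reverse=True)[:top_n]
    hubs = sorted(nodes, key=lambda p: y[p], reverse=True)[:top_n]
    return authorities, hubs
```

Since only the ranking of the top entries matters here, a fixed iteration count in place of a strict convergence test is usually sufficient, which matches the loose convergence criterion described above.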