Document Analyzer

Once a document has been fetched, it is stored in the database for further processing. An HTML document is parsed and analyzed for two purposes: to identify its outgoing hyperlinks, which are then added to the crawler's URL queue, and to produce a bag of the words that occur in the document along with their frequencies. Very common and thus insignificant words, so-called stop words, are eliminated from the bag; these include articles ("the", "a", "an"), pronouns ("we", "you", etc.), and "universal" verbs and auxiliary verbs ("be", "have", "take", "may", etc.) along with their inflections.
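To make this step concrete, the following Python sketch extracts the outgoing hyperlinks and builds the filtered bag of words for a single HTML document. The STOP_WORDS set shown here is a small illustrative sample rather than a complete stop-word list, and the function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative sample only; a real analyzer would use a much larger list
# covering all inflections of the stop words.
STOP_WORDS = {"the", "a", "an", "we", "you", "be", "is", "are",
              "have", "has", "take", "may"}

class LinkAndTextExtractor(HTMLParser):
    """Collects outgoing hyperlinks and visible text from an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Every <a href="..."> contributes one outgoing hyperlink.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)

def analyze(html: str):
    """Return the document's outgoing links and its stop-word-filtered bag of words."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    bag = Counter(w for w in words if w not in STOP_WORDS)
    return parser.links, bag
```

Given a fetched page, analyze(html) would yield the links to enqueue and the bag of words that the next step operates on.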

Each of the remaining words is reduced to its word stem using the Porter algorithm, so that words with the same stem map to the same feature. We refer to the resulting list of word-stem frequencies in the document, normalized by dividing all values by the maximum frequency in the document, as the document's feature vector or vector of term frequencies (tf values).

Optionally, we can replace the tf values by tf*idf values, where idf is the so-called inverse document frequency of a term: the total number of documents divided by the number of documents that contain the term. To keep the idf values from dominating the tf*idf product, we dampen them by using log10(idf) instead of idf. Because idf values are a global measure and change as the crawl progresses, they are re-computed lazily whenever a certain number of new documents have been crawled.
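The sketch below continues the example. It stems the bag using NLTK's PorterStemmer (an external dependency standing in for whichever Porter implementation the analyzer uses), normalizes the stem frequencies into a tf vector, and keeps lazily recomputed idf values; the LazyIdf class and its batch size are illustrative assumptions, not the system's actual design.

```python
import math
from collections import Counter
from nltk.stem import PorterStemmer  # assumed available; any Porter implementation works

stemmer = PorterStemmer()

def tf_vector(bag: Counter) -> dict:
    """Map each word to its Porter stem and normalize by the document's maximum frequency."""
    stems = Counter()
    for word, freq in bag.items():
        stems[stemmer.stem(word)] += freq
    max_freq = max(stems.values())
    return {stem: freq / max_freq for stem, freq in stems.items()}

class LazyIdf:
    """Tracks document frequencies and recomputes idf only every `batch` documents."""
    def __init__(self, batch: int = 1000):  # batch size is an illustrative choice
        self.batch = batch
        self.df = Counter()   # number of crawled documents containing each stem
        self.num_docs = 0
        self.idf = {}         # cached, log10-dampened idf values

    def add_document(self, tf: dict) -> None:
        self.df.update(tf.keys())  # each stem counted once per document
        self.num_docs += 1
        if self.num_docs % self.batch == 0:  # lazy recomputation
            self.idf = {s: math.log10(self.num_docs / n)
                        for s, n in self.df.items()}

    def tfidf_vector(self, tf: dict) -> dict:
        """Optionally replace tf values by tf*idf using the cached idf values."""
        return {s: w * self.idf.get(s, 0.0) for s, w in tf.items()}
```

Since idf is at least 1 for any term that occurs in the corpus, the log10 dampening keeps every weight non-negative while shrinking the gap between rare and common terms.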