Document Analyzer
Once a document has been fetched, it is stored in the database for
further processing: an HTML document is parsed and analyzed to
identify its outgoing hyperlinks, which are then added to the crawler's
URL queue, and to produce a bag of the words that occur in the document
along with their frequencies. Very common and therefore insignificant
words, so-called stop words such as articles ("the", "a", "an"),
pronouns ("we", "you", etc.), and "universal" and auxiliary verbs
("be", "have", "take", "may", etc.), together with their inflections,
are eliminated from the bag.
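The following sketch illustrates this analysis step under the assumption of a Python setting with only the standard library; the stop-word list is a small illustrative sample rather than the crawler's actual list, and all names (LinkAndTextExtractor, analyze) are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tiny illustrative sample; the real stop-word list is much larger.
STOP_WORDS = {
    "the", "a", "an",                      # articles
    "we", "you", "he", "she", "it",        # pronouns
    "be", "is", "are", "was", "been",      # "universal"/auxiliary verbs
    "have", "has", "had", "take", "may",   # ... and their inflections
}

class LinkAndTextExtractor(HTMLParser):
    """Collects outgoing hyperlinks and the visible text of an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links before handing them to the URL queue.
                    self.links.append(urljoin(self.base_url, value))

    def handle_data(self, data):
        self.text_parts.append(data)

def analyze(html_doc, base_url):
    """Returns the outgoing links and the stop-word-free bag of words."""
    parser = LinkAndTextExtractor(base_url)
    parser.feed(html_doc)
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    bag = Counter(w for w in words if w not in STOP_WORDS)
    return parser.links, bag
```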
Each of the remaining words is reduced to its word stem using the
Porter algorithm, thus mapping words with the same stem to the same
feature. We refer to the resulting list of word-stem frequencies in
the document, normalized by dividing all values by the maximum
frequency in the document, as the document's feature vector or vector
of term frequencies (tf values).
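A minimal sketch of the feature-vector construction, continuing from the bag of words above; NLTK's PorterStemmer is used here as one available implementation of the Porter algorithm, not necessarily the one used by the system.

```python
from nltk.stem import PorterStemmer

def feature_vector(bag):
    """Maps a bag of words to a max-normalized vector of tf values."""
    stemmer = PorterStemmer()
    stem_freq = {}
    for word, freq in bag.items():
        # Words sharing a stem are collapsed into the same feature.
        stem = stemmer.stem(word)
        stem_freq[stem] = stem_freq.get(stem, 0) + freq
    if not stem_freq:
        return {}
    # Normalize by the maximum frequency occurring in the document.
    max_freq = max(stem_freq.values())
    return {stem: freq / max_freq for stem, freq in stem_freq.items()}
```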
As an option, the tf values can be replaced by tf*idf values, where
idf is the so-called inverse document frequency of a term, i.e., the
reciprocal of the number of documents that contain the term. To
prevent the idf values from dominating the tf*idf product, we dampen
them by using log10(idf) instead of idf. Since idf values are a
global measure that changes as the crawl progresses, they are
re-computed lazily whenever a certain number of new documents has
been crawled.
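A sketch of the optional tf*idf weighting with lazy idf recomputation. Note one assumption beyond the text: the dampened value log10(idf) stays non-negative only if idf is taken as the ratio of the total number of documents to the number containing the term (the common convention), so this sketch uses that variant; the recomputation interval is likewise an assumed parameter.

```python
import math
from collections import Counter

class IdfTable:
    """Maintains document frequencies and a lazily refreshed idf table."""

    RECOMPUTE_EVERY = 1000  # assumed interval, not specified in the text

    def __init__(self):
        self.doc_freq = Counter()   # term -> number of documents containing it
        self.num_docs = 0
        self.new_since_update = 0
        self.idf = {}

    def add_document(self, tf_vector):
        """Registers the feature vector of a newly crawled document."""
        self.doc_freq.update(tf_vector.keys())
        self.num_docs += 1
        self.new_since_update += 1
        # idf is a global measure, so it is only recomputed lazily.
        if self.new_since_update >= self.RECOMPUTE_EVERY:
            self.idf = {t: math.log10(self.num_docs / df)
                        for t, df in self.doc_freq.items()}
            self.new_since_update = 0

    def weight(self, tf_vector):
        """Replaces tf values by dampened tf*idf values; terms not yet in
        the idf table (seen only since the last refresh) get weight 0."""
        return {t: tf * self.idf.get(t, 0.0) for t, tf in tf_vector.items()}
```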