====================================================
Enhancements and changes in the current BINGO! release
====================================================
This document summarizes some of the significant enhancements
and changes introduced in BINGO! 3.4.1 since version 3.3.14.
For further information, see the following documents:
- License.txt - the license agreement for use of this software
- INSTALL.txt - instructions for installation and troubleshooting
- Customize.txt - recent customizations and modifications of BINGO! components
THIS INFORMATION IS PROVIDED WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES.
IN NO EVENT SHALL THE DATABASE AND INFORMATION SYSTEMS RESEARCH
GROUP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THESE
RECOMMENDATIONS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
NOTE: If you use BINGO! in your scientific work, please cite as:
Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer,
Martin Theobald, Gerhard Weikum, Patrick Zimmer:
The BINGO! System for Information Portal Generation and Expert Web Search.
The 1st Semiannual Conference on Innovative Data Systems Research (CIDR),
Asilomar, CA, 2003
available at
http://www-db.cs.wisc.edu/cidr2003/program/p7.pdf
==================================================
1) DB Schema Manager
The Schema Manager no longer stores duplicate database descriptors. When the database schema is re-created,
the new connection string replaces the old one in the connection list. This reduces the number of similar
entries in the "quick access" pulldown menu of the Login Dialog.
The "quick access" menu has been added to the Schema Manager dialog. This makes it possible to overwrite existing
database accounts without entering all connection details by hand. The initial value of the Administrator
password is set to "sys" by default for all available Oracle connections.
2) Document annotations
The new annotation algorithm uses the Kullback-Leibler divergence (computed on features of particular
document sentences) to create meaningful topic-relevant previews for HTML and PDF documents. Customizable
weighting options are included for sentences with special markup (e.g. ..) and for the initial sentences
of each paragraph.
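The divergence measure behind this annotation step can be sketched in a few lines. This is a minimal, illustrative computation of the Kullback-Leibler divergence between two term distributions (e.g. a sentence's term distribution p against the document's term distribution q); how BINGO! weights and ranks the sentences with it is not reproduced here.

```java
// Illustrative sketch, not the actual BINGO! annotation code.
// p and q are term probability distributions over the same vocabulary;
// q is assumed to be smoothed so that q[i] > 0 wherever p[i] > 0.
class KLDivergence {
    static double kl(double[] p, double[] q) {
        double d = 0;
        for (int i = 0; i < p.length; i++)
            if (p[i] > 0)                       // 0 * log(0/x) contributes 0
                d += p[i] * Math.log(p[i] / q[i]);
        return d;
    }
}
```

The divergence is 0 when the sentence distribution matches the document distribution exactly and grows as the two distributions diverge.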
3) PDF Parsing
The PDF processing routine has been improved to capture different encoding options for German 'Umlaut' characters.
4) Stopwords
The language of the current document is recognized by counting extracted stopwords. The algorithm counts
stopwords for the given locale (stemmer language) and for all other available languages separately. The classifier
can reject the document when count(locale) - count(other languages) is less than the specified threshold. This
routine is also useful for excluding pages without meaningful text (e.g. pages based on images and icons) from
classification. Furthermore, it prevents the focused crawler from drifting into non-relevant foreign domains.
A new French stemmer and an appropriate French stopword list have been added.
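The stopword-based check can be sketched as follows. The class name and the tiny stopword lists are illustrative only; the real stopword lists are much larger.

```java
import java.util.*;

// Illustrative sketch of the stopword-based language check,
// not the actual BINGO! classes or word lists.
class LanguageCheck {
    // Minimal stopword lists for illustration only.
    static final Map<String, Set<String>> STOPWORDS = Map.of(
        "en", Set.of("the", "and", "of", "is", "a"),
        "de", Set.of("der", "die", "und", "ist", "das"),
        "fr", Set.of("le", "la", "et", "est", "dans"));

    // Count stopword hits for one language over the extracted terms.
    static int count(String lang, List<String> terms) {
        Set<String> sw = STOPWORDS.get(lang);
        int n = 0;
        for (String t : terms)
            if (sw.contains(t.toLowerCase())) n++;
        return n;
    }

    // Accept the document only if the stopword count for the locale
    // exceeds the combined count of all other languages by the threshold.
    static boolean accept(String locale, List<String> terms, int threshold) {
        int own = count(locale, terms), other = 0;
        for (String lang : STOPWORDS.keySet())
            if (!lang.equals(locale)) other += count(lang, terms);
        return own - other >= threshold;
    }
}
```

A page dominated by images and icons yields near-zero stopword counts in every language, so it falls below any positive threshold and is rejected.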
5) The crawler focusing options are slightly modified. The ordering of URLs on the crawl frontier can be influenced
by changing the following parameters:
Focusing:
- soft: links from all positively classified documents are used to continue the crawl (plus a tunneling option for
negative results, when no positively classified documents are available)
- strong:
OLD: links from a newly positively classified document are used to continue the crawl only when its topic is the same as
(or below) the topic of its predecessor.
NEW: links from a newly positively classified document are used to continue the crawl only when its topic is the same as
(or below) the topic of the initial bookmark from which it was reached. This strategy is slightly less restrictive than
the old version.
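The "same as or below" topic test used by strong focusing amounts to a walk up the topic hierarchy. The sketch below uses an illustrative child-to-parent map, not the actual BINGO! topic tree representation.

```java
import java.util.*;

// Illustrative sketch of the "same as or below" topic test.
class TopicTree {
    // child topic -> parent topic; the root has no entry.
    final Map<String, String> parent = new HashMap<>();

    void add(String child, String par) { parent.put(child, par); }

    // True if 'topic' equals 'ancestor' or lies below it in the hierarchy.
    boolean sameOrBelow(String topic, String ancestor) {
        for (String t = topic; t != null; t = parent.get(t))
            if (t.equals(ancestor)) return true;
        return false;
    }
}
```

Under the new strategy, the second argument is the topic of the initial bookmark rather than the topic of the immediate predecessor document.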
Ordering of links on the crawl frontier
(scores in descending order):
- SVM + depth-first: score = SVMScore * Document.depth
- SVM + breadth-first: score = SVMScore / (Document.depth+1)
- SVM: score = SVMScore
- depth-first: score = Document.depth
- breadth-first: score = 1/(1+Document.depth)
- FIFO: score = 1.0, queue maintains the FIFO order
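The score formulas above can be collected into one function; the mode names here are illustrative shorthands for the strategies listed.

```java
// Illustrative sketch of the frontier scoring formulas listed above.
// svmScore stands for the document's SVM classification score,
// depth for Document.depth.
class FrontierScore {
    static double score(String mode, double svmScore, int depth) {
        switch (mode) {
            case "svm+depth":   return svmScore * depth;        // SVM + depth-first
            case "svm+breadth": return svmScore / (depth + 1);  // SVM + breadth-first
            case "svm":         return svmScore;                // SVM only
            case "depth":       return depth;                   // depth-first
            case "breadth":     return 1.0 / (1 + depth);       // breadth-first
            default:            return 1.0;                     // FIFO: constant score
        }
    }
}
```

With scores sorted in descending order, the depth-first variants favor deep documents while the breadth-first variants favor shallow ones; FIFO assigns the same score everywhere so the queue keeps insertion order.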
6) Web Services
- a new Google Web Services library that can be used to initialize the crawl (in connection with the option "Links from Database").
It uses the native Google Web Services (libraries provided by Google). The number of queries per day is limited to 1000. The
number of matches on each result page is limited to 10; multiple pages for the same query can be retrieved one by one.
- a new Amazon Web Services interface that can be used to initialize the crawl (in connection with the option "Links from Database").
It uses the native Amazon Web Services. The number of queries is not limited. Each result page contains 10 matches ordered by
relevance; multiple pages for one query can be accessed one by one.
7) Processing HTML Frames
The processing of HTML frames has been improved. The Crawler now rejects nested frames (frames within frames) to avoid endless
loops on incorrect HTML inputs.
8) Limitation of #documents per host
To avoid endless loops within manipulated Web sources with thousands of (faked) links, a limit on the total number of documents
per host was introduced. On average, a value of 100 to 300 is sufficient to crawl meaningful resources completely or to build a
representative 'cutout' of their contents.
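A per-host limit of this kind reduces to a counter per hostname, checked before each fetch. The class below is an illustrative sketch, not the actual crawler code (a real multi-threaded crawler would need a concurrent map).

```java
import java.util.*;

// Illustrative sketch of a per-host document counter.
class HostLimiter {
    private final int maxPerHost;
    private final Map<String, Integer> counts = new HashMap<>();

    HostLimiter(int maxPerHost) { this.maxPerHost = maxPerHost; }

    // Returns true while the host is still under its document limit;
    // each call counts one attempted download from that host.
    boolean tryAcquire(String host) {
        int n = counts.merge(host, 1, Integer::sum);
        return n <= maxPerHost;
    }
}
```

Once a manipulated site with thousands of faked links exhausts its quota (e.g. 100 to 300 documents), every further URL from that host is skipped.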
9) Following extracted links
For each HTML document, the number of accepted links for the URL queue can be limited. In the prior version, the first
extracted links were always used. The selection routine is now randomized: the required number of links is randomly chosen
from the complete set.
This option does not influence the storage of extracted links (for link analysis purposes, all extracted links are completely
stored in the database).
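The randomized selection can be sketched with a shuffle-and-take approach (an illustrative implementation; the actual routine may differ):

```java
import java.util.*;

// Illustrative sketch: pick k links uniformly at random from the
// complete set of extracted links.
class LinkSampler {
    static <T> List<T> sample(List<T> links, int k, Random rnd) {
        if (links.size() <= k)
            return new ArrayList<>(links);     // fewer links than the limit: take all
        List<T> copy = new ArrayList<>(links); // do not disturb the stored full set
        Collections.shuffle(copy, rnd);        // uniform random permutation
        return new ArrayList<>(copy.subList(0, k));
    }
}
```

Sampling from a copy keeps the complete link set intact, matching the note above that all extracted links are still stored in the database for link analysis.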
10) Memory consumption issues
To reduce memory consumption, the links extracted from HTML documents are now maintained as URL objects (rather than
BingoDocument objects). The conversion of URL objects into BingoDocument objects is initiated only for selected links that
must be added to the URL queue.
Further possible improvements that are not (yet) implemented in this release:
1) On-the-fly conversion of strings
To avoid high memory consumption by Java String objects and associated character arrays, extracted terms and features can be
mapped onto numeric keys (e.g. using hashing or MD5 signatures) just within the parsing routine.
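One way to realize this mapping is to take the first bytes of an MD5 signature of the term as a fixed-size numeric key. The scheme below is an illustrative sketch of the idea, not an implemented BINGO! component; like any hashing scheme it trades a small collision probability for a large memory saving over retained String objects.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch: map a term onto a 64-bit numeric key by taking
// the first 8 bytes of its MD5 signature.
class TermKey {
    static long key(String term) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(term.getBytes(StandardCharsets.UTF_8));
            long k = 0;
            for (int i = 0; i < 8; i++)
                k = (k << 8) | (d[i] & 0xFF);  // pack first 8 digest bytes
            return k;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available in the JDK
        }
    }
}
```

Applied directly inside the parsing routine, this lets the parser emit compact long keys instead of keeping String objects and their character arrays alive.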
2) Network activity
A comprehensive thread analysis over multiple runs (using the JProfiler monitoring software) shows that crawler threads do not
exploit the full available bandwidth. Usually, network activity (data downloads) and parsing (String and char[] manipulations)
account for only 5% to 10% of thread runtime. For the remaining time, crawler threads stay in the 'blocking' state; the blocking
monitor is owned by the system object sun.net.www.protocol.http.Handler, which contains the built-in Java routines for managing
HTTP connections: DNS lookups, opening socket connections, and processing HTTP headers. Domain-specific customization of JVM
parameters (parallel use of multiple DNS servers, aggressive caching of lookup results, etc.) may be crucial for crawler performance.
3) The current version allows a simple limit on the number of extracted links that are added to the URL queue on a
per-document basis. This focusing option could be re-implemented in a more flexible manner: the number of followed links could
depend on the document classification score, the average score of previously classified documents from the same host, or other
similar criteria. The same also holds for the maximum allowed number of documents that can be downloaded from a given host or domain.