====================================================
Enhancements and changes in the current BINGO! release
====================================================

This document summarizes some of the significant enhancements and changes
introduced in BINGO! 3.4.1 since version 3.3.14.

For further information, see the following documents:

- License.txt   - the license agreement for use of this software
- INSTALL.txt   - instructions for installation and troubleshooting
- Customize.txt - recent customizations and modifications of BINGO! components

THIS INFORMATION IS PROVIDED WITHOUT ANY EXPRESSED OR IMPLIED WARRANTIES.
IN NO EVENT SHALL THE DATABASE AND INFORMATION SYSTEMS RESEARCH GROUP BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THESE RECOMMENDATIONS, EVEN IF ADVISED
OF THE POSSIBILITY OF SUCH DAMAGE.

NOTE: If you use BINGO! in your scientific work, please cite as:

Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer,
Martin Theobald, Gerhard Weikum, Patrick Zimmer:
The BINGO! System for Information Portal Generation and Expert Web Search.
The 1st Semiannual Conference on Innovative Data Systems Research (CIDR),
Asilomar (CA), 2003
available at http://www-db.cs.wisc.edu/cidr2003/program/p7.pdf

==================================================

1) DB Schema Manager

The Schema Manager no longer stores duplicate database descriptors. When a
database schema is re-created, the new connection string replaces the old
one in the connection list. This reduces the number of similar entries in
the "quick access" pulldown menu of the Login Dialog.

The "quick access" menu has been added to the Schema Manager dialog. This
makes it possible to "re-write" existing database accounts without entering
all connection details by hand.

The initial value of the Administrator password is set to "sys" by default
for all available Oracle connections.

2) Document annotations

The new annotation algorithm uses the Kullback-Leibler divergence (computed
on the features of individual document sentences) to create meaningful,
topic-relevant previews for HTML and PDF documents. Customizable weighting
options are included for sentences with special HTML markup and for the
initial sentences of each paragraph.
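
The following Java sketch illustrates the kind of sentence scoring described
above: each sentence is scored by the Kullback-Leibler divergence between its
term distribution and that of the whole document. The class and method names,
the smoothing constant, and the choice of the document as reference
distribution are illustrative assumptions and do not reflect the actual
BINGO! implementation.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only, not the BINGO! API.
    public class SentenceScorer {

        // D_KL(P || Q) = sum over terms t of P(t) * log(P(t) / Q(t)).
        public static double klDivergence(Map<String, Double> p, Map<String, Double> q) {
            double kl = 0.0;
            for (Map.Entry<String, Double> e : p.entrySet()) {
                double pt = e.getValue();
                // Tiny constant smooths terms that do not occur in the
                // reference distribution (the smoothing scheme is an assumption).
                double qt = q.getOrDefault(e.getKey(), 1e-9);
                kl += pt * Math.log(pt / qt);
            }
            return kl;
        }

        // Relative term frequencies of a tokenized sentence or document.
        public static Map<String, Double> distribution(String[] tokens) {
            Map<String, Double> freq = new HashMap<>();
            if (tokens.length == 0) {
                return freq;
            }
            for (String t : tokens) {
                freq.merge(t, 1.0, Double::sum);
            }
            for (Map.Entry<String, Double> e : freq.entrySet()) {
                e.setValue(e.getValue() / tokens.length);
            }
            return freq;
        }
    }

The customizable weights for marked-up and paragraph-initial sentences would
then be applied as additional factors on top of such a base score.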
3) PDF Parsing

The PDF processing routine has been improved to handle different encoding
options for German 'Umlaut' characters.

4) Stopwords

The language of the current document is recognized by counting extracted
stopwords. The algorithm counts stopwords for the given locale (the stemmer
language) and for all other available languages separately. The classifier
can reject the document when count(locale) - count(other languages) is less
than the specified threshold. This routine is also useful for excluding
pages without meaningful text (e.g. pages based on images and icons) from
classification. Furthermore, it prevents the focused crawler from drifting
into non-relevant foreign domains.

A new French stemmer and a corresponding French stopword list have been
added.
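
The following sketch illustrates the language check described above. All
class and method names are illustrative; in particular, summing the stopword
counts of all other languages into a single value is an assumption, since
these notes do not specify how count(other languages) is aggregated.

    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch only, not the BINGO! API.
    public class LanguageCheck {

        // stopwordLists maps a language code (e.g. "en", "de", "fr") to its
        // stopword set; locale is the configured stemmer language.
        public static boolean acceptDocument(String[] tokens,
                                             String locale,
                                             Map<String, Set<String>> stopwordLists,
                                             int threshold) {
            int localeCount = 0;
            int otherCount = 0;
            for (String token : tokens) {
                String t = token.toLowerCase();
                for (Map.Entry<String, Set<String>> list : stopwordLists.entrySet()) {
                    if (list.getValue().contains(t)) {
                        if (list.getKey().equals(locale)) {
                            localeCount++;
                        } else {
                            otherCount++;
                        }
                    }
                }
            }
            // Reject when the margin falls below the threshold; pages without
            // meaningful text yield few stopwords and are rejected as well.
            return (localeCount - otherCount) >= threshold;
        }
    }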
5) Crawler focusing options

The crawler focusing options have been slightly modified. The ordering of
URLs on the crawl frontier can be influenced by changing the following
parameters:

Focusing:

- soft:   links from all positively classified documents are used to
          continue the crawl (plus a tunneling option for negative results
          when no positively classified documents are available)
- strong: OLD: links from a newly positively classified document are used
          to continue the crawl only when its topic is the same as (or
          below) the topic of its predecessor.
          NEW: links from a newly positively classified document are used
          to continue the crawl only when its topic is the same as (or
          below) the topic of the initial bookmark from which it was
          reached. This strategy is slightly less restrictive than the old
          version.

Ordering of links on the crawl frontier (scores in descending order):

- SVM + depth-first:   score = SVMScore * Document.depth
- SVM + breadth-first: score = SVMScore / (Document.depth + 1)
- SVM:                 score = SVMScore
- depth-first:         score = Document.depth
- breadth-first:       score = 1 / (1 + Document.depth)
- FIFO:                score = 1.0, the queue maintains the FIFO order

6) Web Services

- A new Google Web Services library can be used to initialize the crawl (in
  connection with the option "Links from Database"). It uses the native
  Google Web Services (libraries provided by Google). The number of queries
  per day is limited to 1000. The number of matches per result page is
  limited to 10; multiple pages for the same query can be retrieved one by
  one.

- A new Amazon Web Services interface can be used to initialize the crawl
  (in connection with the option "Links from Database"). It uses the native
  Amazon Web Services. The number of queries is not limited. Each result
  page contains 10 matches ordered by relevance; multiple pages for one
  query can be accessed one by one.

7) Processing HTML Frames

The processing of HTML frames has been improved. The crawler now rejects
nested frames (frames within frames) to avoid endless loops on incorrect
HTML input.

8) Limitation of #documents per host

To avoid endless loops within manipulated Web sources with thousands of
(faked) links, a limit on the total number of documents per host has been
introduced. On average, a value of 100 to 300 is sufficient to crawl
meaningful resources completely or to build a representative 'cutout' of
their contents.

9) Following extracted links

For each HTML document, the number of accepted links for the URL queue can
be limited. In the prior version, the first extracted links were always
used. The selection routine is now randomized: the required number of links
is chosen at random from the complete set. This option does not influence
the storage of extracted links; for link analysis purposes, all extracted
links are still stored completely in the database.
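
A minimal sketch of the randomized selection described in 9), using the
standard Java library; the class and method names are illustrative only:

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Illustrative sketch only, not the BINGO! API.
    public class LinkSampler {

        // Returns at most maxLinks links, chosen at random from the
        // complete set of links extracted from one document.
        public static List<URL> sample(List<URL> extractedLinks, int maxLinks) {
            if (extractedLinks.size() <= maxLinks) {
                return extractedLinks;
            }
            List<URL> shuffled = new ArrayList<>(extractedLinks);
            Collections.shuffle(shuffled);
            return shuffled.subList(0, maxLinks);
        }
    }

Collections.shuffle produces a uniform permutation, so every extracted link
has the same chance of being followed regardless of its position in the
document.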
10) Memory consumption issues

To reduce memory consumption, the links extracted from HTML documents are
now maintained as URL objects (rather than BingoDocument objects). The
conversion of URL objects into BingoDocument objects is performed only for
the selected links that are added to the URL queue.

==================================================

Further possible improvements that are not (yet) implemented in this
release:

1) On-the-fly conversion of strings

To avoid high memory consumption by Java String objects and their
associated character arrays, extracted terms and features could be mapped
onto numeric keys (e.g. using hashing or MD5 signatures) directly within
the parsing routine (a possible mapping is sketched at the end of this
document).

2) Network activity

A comprehensive thread analysis over multiple runs (using the JProfiler
monitoring software) shows that the crawler threads do not exploit the full
available bandwidth. Usually, network activity (data downloads) and parsing
(String and char[] manipulations) account for only 5% to 10% of the thread
runtime. For the remaining time, crawler threads stay in the 'blocking'
state; the blocking monitor is owned by the system object
sun.net.www.protocol.http.Handler, which contains the built-in Java
routines for managing HTTP connections: DNS lookups, opening socket
connections, and processing HTTP headers. Domain-specific customization of
JVM parameters (parallel use of multiple DNS servers, aggressive caching of
lookup results, etc.) may be crucial for crawler performance.

3) More flexible focusing limits

The current version allows a simple limit on the number of extracted links
that are added to the URL queue on a per-document basis. This focusing
option could be re-implemented in a more flexible manner: the number of
followed links could depend on the document classification score, the
average score of previously classified documents from the same host, or
other similar criteria. The same also holds for the maximum allowed number
of documents that can be downloaded from a given host or domain.
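
The following sketch shows one possible realization of the string-to-key
mapping proposed in 1) above, folding the first 8 bytes of a term's MD5
signature into a 64-bit key. The class name, the key width, and the lack of
collision handling are illustrative assumptions, not a description of a
planned implementation.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Illustrative sketch only, not the BINGO! API.
    public class TermKeyMapper {

        // Maps an extracted term onto a compact numeric key directly in the
        // parsing routine, so the String object can be discarded early.
        public static long termKey(String term) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] digest = md5.digest(term.getBytes(StandardCharsets.UTF_8));
                long key = 0L;
                // Fold the first 8 digest bytes into a 64-bit key.
                for (int i = 0; i < 8; i++) {
                    key = (key << 8) | (digest[i] & 0xFF);
                }
                return key;
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }
    }

With such a mapping, term statistics could be maintained in structures keyed
by primitive long values instead of String-keyed hash maps, at the price of
(improbable) key collisions.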