==================================================
CUSTOMIZATION RECOMMENDATIONS FOR BINGO! Framework
==================================================

This document contains recommendations on possible customizations of the BINGO! framework. It is NOT designed as a step-by-step guide for trivial modifications; rather, it is a collection of conceptual recommendations for experienced Java developers.

THESE RECOMMENDATIONS ARE PROVIDED WITHOUT ANY EXPRESSED OR IMPLIED WARRANTIES. IN NO EVENT SHALL THE DATABASE AND INFORMATION SYSTEMS RESEARCH GROUP BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THESE RECOMMENDATIONS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

IMPORTANT: We strongly recommend that you create ***BACKUPS*** of all BINGO! sources before applying any changes.

NOTE: If you use BINGO! in your scientific work, please cite as:

Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer, Martin Theobald, Gerhard Weikum, Patrick Zimmer: The BINGO! System for Information Portal Generation and Expert Web Search. The 1st Semiannual Conference on Innovative Data Systems Research (CIDR), Asilomar (CA), 2003
available at http://www-db.cs.wisc.edu/cidr2003/program/p7.pdf

==================================================

1) Running BINGO!
   on non-Win32 platforms
2) Using the Oracle database via Oracle JDBC
3) Using other databases
4) Adding support for special document types
5) Customizations of the classifier
6) Crawler monitoring: Crawler Plug-Ins
7) Modifications of the database schema
8) Menu language customization
9) Stemmer language customization
10) Tokenizer customization
11) Performance tuning
12) Java virtual machine: memory, internal params
13) Adding support for non-http protocols
14) Customization of the Crawler queue
15) Pattern-based excluding of URLs from crawl
16) Important BINGO! options
17) Adding new global options
18) Adding language support for new GUI menus
19) Customizations of feature spaces

==========================================================================

1) Running BINGO! on non-Win32 platforms

The current BINGO! release was tested on the Win32 platform. With the minor changes described below, the framework can be adapted to run on other operating systems (e.g. Linux).

- the core BINGO! implementation is 100% Java and runs without any changes on most JVMs that are compatible with Sun Java 1.4.* and higher.
- verify that execution rights for all shell scripts located in the root directory of your BINGO! installation are properly set. On Unix systems, you can use the command 'chmod u+x filename' to enable execution of these files.
- the classifier of the engine uses the SVM modelling tool SVM*Light provided by Thorsten Joachims. The engine calls SVM*Light to build the classification model, which is then imported into BINGO! from the SVM*Light output file. You can obtain the latest build for your OS as well as the sources of SVM*Light from http://svmlight.joachims.org.
- compile and build the program 'svm_learn' from the SVM*Light package and place it into the /temp directory of your BINGO!
  installation
- modify the shell scripts '/temp/clean.bat' and '/temp/train.bat' that reside in the same directory to execute the program 'svm_learn'
- verify that execution rights for both shell scripts and the program 'svm_learn' are properly set.
- the engine uses the Adobe PDF Filter to parse PDF files. The shell script /bingo/crawler/handler/FiltDump.bat is used to call the external program /bingo/crawler/handler/FiltDump.exe. You can replace 'FiltDump.exe' by any platform-specific PDF parser that can perform PDF filtering and produce plain ASCII output. FiltDump.bat redirects this output into ASCII temp files that are used for further content processing. Finally, you will need to modify FiltDump.bat and ensure that execution rights are properly set. If there is no appropriate PDF parser (or PDF processing is not intended), you can delete the files FiltDump.exe and FiltDump.bat. This will automatically disable PDF processing.

2) Accessing the Oracle database via JDBC

The BINGO! framework was tested with both MySQL and Oracle databases. The initial configuration is set up to run with a MySQL instance. If you want to run BINGO! with an Oracle database, you can reconfigure the framework:

- download the Oracle JDBC driver package and put the library into the /lib directory of your BINGO! installation
- include this file (usually classes12.zip, oracle.jar or something similar) in the CLASSPATH setting of the shell script files r.bat, schema.bat, mini.bat and rebuild.bat that are located in the root directory of your BINGO! installation.
- create a new Oracle database instance (e.g. using Oracle management tools) and note its access parameters (hostname, port, service name, root username and password)
- verify the SQL scripts /schema/schema_speed.sql and /schema/user.sql and adapt them to your database parameters (e.g. tablespace names), if necessary.
- run the script 'schema.bat' to create a new BINGO! user.
- edit the file /conf/config.xml and replace the setting mysql by oracle.
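
Before running 'schema.bat', it can help to see how the noted access parameters (hostname, port, service name) end up in a JDBC URL. The following is a minimal sketch, not BINGO! code: the class OracleConnectSketch and the values "dbhost", 1521 and "bingo" are placeholders chosen for illustration. Newer Oracle thin drivers accept the service-name form; older driver packages may expect the SID form instead.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Hypothetical helper, not part of BINGO!: shows how the noted access
// parameters map onto an Oracle thin-driver JDBC URL.
class OracleConnectSketch {

    // Service-name form, accepted by newer Oracle thin drivers.
    static String serviceUrl(String host, int port, String serviceName) {
        return "jdbc:oracle:thin:@//" + host + ":" + port + "/" + serviceName;
    }

    // SID form, which older driver packages may expect instead.
    static String sidUrl(String host, int port, String sid) {
        return "jdbc:oracle:thin:@" + host + ":" + port + ":" + sid;
    }

    // Opens a connection. Throws SQLException if the Oracle driver library
    // is not on the CLASSPATH or the database cannot be reached. Older
    // (pre-JDBC 4) drivers additionally need the driver class to be loaded
    // before this call, e.g. Class.forName("oracle.jdbc.driver.OracleDriver").
    static Connection open(String url, String user, String password) throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }

    public static void main(String[] args) {
        // Placeholder values; replace them with the parameters of your instance.
        System.out.println(serviceUrl("dbhost", 1521, "bingo"));
        System.out.println(sidUrl("dbhost", 1521, "BINGO"));
    }
}
```

If open() fails with a "no suitable driver" error, the driver library is most likely missing from the CLASSPATH setting of the shell scripts.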

3) Using other database systems

Most components of BINGO! use SQL'92-compliant queries and should work with any SQL'92-compatible database system (e.g. DB2 or SQL Server). However, there are some system-dependent differences in LOB management, database infrastructure (e.g. views), establishing JDBC connections, etc. that need to be considered individually. In general, the following modifications are required:

- download the appropriate JDBC driver package and put the library into the /lib directory of your BINGO! installation
- include these file(s) in the CLASSPATH setting of the shell script files r.bat, schema.bat, mini.bat and rebuild.bat that are located in the root directory of your BINGO! installation.
- create a new database instance (e.g. using vendor management tools) and note its access parameters (hostname, port, service name, root username and password)
- create a new database user with properly set access rights, and log in using this account.
- execute the queries from /schema/schema_speed.sql to create the BINGO! schema for this user
- edit the file /conf/config.xml and replace the setting *** by my-database-name.
- create a new class "bingo.db.MyDatabaseInterface extends bingo.db.DBInterface" that overrides all methods of the framework that contain non-SQL'92 queries. You can use the existing classes bingo.db.MySQLInterface and bingo.db.OracleInterface as prototypes.
- edit the file /bingo/util/SessionBuffer.java. You need to modify the method getDBConnection() and add a new driver-specific method that opens a new connection to your database. You can use the existing methods getMySqlConnection() and getOracleConnection() as prototypes.
- rebuild the framework using the shell script rebuild.bat

4) Using additional handlers for special document types

BINGO! is equipped with so-called 'handler' classes that can process documents of particular MIME types. Currently the engine supports the document types text/html, text/plain, application/pdf, and application/xml.
If you want to add routines for further MIME types (e.g. PostScript or WinWord), the following steps are required:

- carefully study the class "bingo.crawler.handler.CrawlHandler" to understand the basic handler architecture
- create a new class "bingo.crawler.handler.MyHandler extends CrawlHandler" with the desired functionality. You can use the existing classes bingo.crawler.handler.HtmlHandler and bingo.crawler.handler.PDFHandler as prototypes.
- modify the class "bingo.crawler.handler.HandlerManager" and include the handling routine for your MIME type.
- modify the file /bingo/data/allowed_mimes.dat and add the standard name of your MIME type and the maximum allowed download size.
- rebuild the framework using the shell script rebuild.bat

5) Classifier customizations

The BINGO! framework uses a linear SVM model to classify crawled documents. Each topic of the ontology has its own linear SVM classifier that is based on training samples from that node and its children (positive examples) and from all other nodes on the same level (negative examples). Furthermore, each node may contain additional manually selected negative examples (documents with status 'J'=Junk) that are used to improve the classifier quality. To add another type of classifier (e.g. Naive Bayes), perform the following steps:

- carefully study the classes "bingo.svmlight.ModelBuilder" and "bingo.svmlight.NodeModel" to understand the basic classifier architecture.
- create new classes "bingo.svmlight.MyModelBuilder" and "bingo.svmlight.MyNodeModel" with the desired functionality.
- modify the class "bingo.util.BingoTreeNode" (which represents a topic of the BINGO! ontology) and carefully replace the current SVMClassifierModel by MyClassifierModel.
- please note that the classification scores returned by the classifier are also used to order links on the crawl frontier (higher scores = higher priority).
  Improperly set classification scores may thus impair the thematic focus of the crawler.
- rebuild the framework using the shell script rebuild.bat

If you intend to tune the default SVM classifier of BINGO! rather than to implement a new one, the following steps might be useful:

- verify the script "/temp/train.bat" that is used to execute the external SVM*Light training routine. For instance, you can add or modify SVM*Light flags to influence the training procedure:

  Learning options:
    -z {c,r,p}  -> select between classification (c), regression (r),
                   and preference ranking (p) (default classification)
    -c float    -> C: trade-off between training error and margin
                   (default [avg. x*x]^-1)
    -w [0..]    -> epsilon width of tube for regression (default 0.1)
    -j float    -> Cost: cost-factor, by which training errors on
                   positive examples outweigh errors on negative
                   examples (default 1) (see [4])
    -b [0,1]    -> use biased hyperplane (i.e. x*w+b>0) instead of
                   unbiased hyperplane (i.e. x*w>0) (default 1)
    -i [0,1]    -> remove inconsistent training examples and retrain
                   (default 0)

  Kernel options:
    -t int      -> type of kernel function:
                   0: linear (default)
                   1: polynomial (s a*b+c)^d
                   2: radial basis function exp(-gamma ||a-b||^2)
                   3: sigmoid tanh(s a*b + c)
                   4: user defined kernel from kernel.h
    -d int      -> parameter d in polynomial kernel
    -g float    -> parameter gamma in rbf kernel
    -s float    -> parameter s in sigmoid/poly kernel
    -r float    -> parameter c in sigmoid/poly kernel
    -u string   -> parameter of user defined kernel

  Optimization options (see [1]):
    -q [2..]    -> maximum size of QP-subproblems (default 10)
    -n [2..q]   -> number of new variables entering the working set in
                   each iteration (default n = q). Set n < q to prevent
                   zig-zagging.
    -m [5..]    -> size of cache for kernel evaluations in MB
                   (default 40). The larger the faster...
    -e float    -> eps: Allow that error for termination criterion
                   [y [w*x+b] - 1] >= eps (default 0.001)
    -h [5..]    -> number of iterations a variable needs to be optimal
                   before considered for shrinking (default 100)
    -f [0,1]    -> do final optimality check for variables removed by
                   shrinking. Although this test is usually positive,
                   there is no guarantee that the optimum was found if
                   the test is omitted. (default 1)

- in addition, you can verify the parametrization of the resulting linear SVM classifier in the class "bingo.util.BingoTreeNode", method "calcAvgSVMScore()". After a new SVM model has been created, the system classifies its training data using the new classifier. The scores of new incoming documents are normalized by the average score of the training data (to make the answers of the individual node classifiers of the tree comparable). To make classifiers more restrictive, a document is considered positively classified only if its score is higher than a threshold that is currently defined as

    minSVMTreshold = avgSVMScore * 0.05;

  You may want to adapt this setting according to your needs.

6) Crawler monitoring: Crawler Plug-Ins

The crawler of the BINGO! framework supports the registration of so-called callback objects that are notified about its state changes and particular crawling events (download, classification, failures, etc.). This mechanism is used to monitor the crawl in BINGO! GUI applications and for some other purposes (e.g. import of new training data). The following steps are required to implement new 'Listener' objects:

- create a new class "bingo/crawler/MyListener implements LinkListener" or "bingo/crawler/MyCrawlListener implements CrawlListener".
- use the crawler methods (class "/bingo/crawler/BINGOCrawler") addCrawlListener(), addLinkListener(), removeCrawlListener() and removeLinkListener() to register and unregister new listeners.
- when the state of the crawler changes, the listener is automatically notified about this event through the callback routines declared in the respective interface definition.
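
The registration steps above can be sketched as follows. This is only an illustration of the callback pattern: SketchCrawlListener, SketchCrawler and the callback documentClassified() are hypothetical stand-ins, because the real callback names are declared in the BINGO! interfaces CrawlListener and LinkListener. The sketch targets a current JDK (it uses generics, which Java 1.4 does not offer).

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for a BINGO! listener interface. The callback name and its
// parameters are hypothetical; the real ones are declared in CrawlListener.
interface SketchCrawlListener {
    void documentClassified(String url, double score);
}

// Stand-in for the crawler side: keeps registered listeners and notifies them.
class SketchCrawler {
    private final List<SketchCrawlListener> listeners = new ArrayList<>();

    void addCrawlListener(SketchCrawlListener l) { listeners.add(l); }

    void removeCrawlListener(SketchCrawlListener l) { listeners.remove(l); }

    // Would be called by the crawl loop whenever a document was classified.
    void fireDocumentClassified(String url, double score) {
        // Iterate over a copy so that listeners may unregister while notified.
        for (SketchCrawlListener l : new ArrayList<>(listeners)) {
            l.documentClassified(url, score);
        }
    }
}

// Example listener: counts positively classified documents.
class PositiveCounter implements SketchCrawlListener {
    private int positives = 0;

    @Override
    public void documentClassified(String url, double score) {
        if (score > 0.0) {
            positives++;
        }
    }

    int getPositives() { return positives; }
}

class ListenerDemo {
    public static void main(String[] args) {
        SketchCrawler crawler = new SketchCrawler();
        PositiveCounter counter = new PositiveCounter();

        crawler.addCrawlListener(counter);
        crawler.fireDocumentClassified("http://example.org/a", 0.8);
        crawler.fireDocumentClassified("http://example.org/b", -0.3);

        crawler.removeCrawlListener(counter);
        // Not counted any more, because the listener was unregistered.
        crawler.fireDocumentClassified("http://example.org/c", 0.5);

        System.out.println(counter.getPositives());
    }
}
```

A real listener would implement the BINGO! interface instead of the stand-in and would be registered at the BINGOCrawler instance.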

7) Modifications to database schema and its parameters

You can modify the BINGO! database by editing the SQL scripts in the "/schema" directory. These scripts are used by the automated routines for MySQL and Oracle databases to create a new user. If you intend to use another database, you can add classes for a new automated schema generator to the package "schema" (directory "/schema"), using the existing routines for MySQL and Oracle as prototypes. You can also create the schema "by hand" by executing the queries from the SQL scripts one by one.

- to modify the database schema, edit the SQL scripts "/schema/schema_speed_mysql.sql" (MySQL) or "/schema/schema_speed.sql" (Oracle), or create your own new script using these files as prototypes.
- to modify the indexes of the schema, edit the SQL scripts "/schema/index_create.sql" and "/schema/index_drop.sql".
- to modify the quick-start account information in the initial LoginDialog, edit the file "/bingo/data/accounts.dat". This text file contains stored connection parameters for BINGO! users that were created using the automated BINGO! schema tools. Connections are stored line by line in the following format:

    "user;password;hostname;database_name" (MySQL)
    "user;password;hostname;service_name" (Oracle)

8) Language customization

The GUI of the BINGO! framework supports English and German as interface languages. The stemmer of the BINGO! engine provides stopword lists and stemming routines for English and German. To add support for an additional GUI language, perform the following steps:

- create a new file "/conf/mylanguage.xml" with formatted GUI messages translated into the language of your choice. You can use the existing files "/conf/english.xml" and "/conf/german.xml" as prototypes.
- modify the file "/conf/config.xml" and replace "***" by "mylanguage".
- modify the "cDLanguageSelect" element in the files "/conf/german.xml", "/conf/english.xml", and "/conf/mylanguage.xml" and add the name of the new language: deutsch;english;mylanguage
- recompilation is not required

9) Stemmer language customization

The current version of BINGO! comes with stemming algorithms for English and German. Internally, we use Snowball stemmers (written by Martin Porter, author of the well-known Porter stemmer) to process tokens. Thus, the simplest way to add stemming for a new language is to call the desired language-specific Snowball stemmer. The Snowball package included in our release supports the following languages: danish, dutch, english, finnish, french, german, italian, norwegian, portuguese, russian, spanish, swedish.

- the documentation and the newest versions of the Snowball stemmers can be found at http://snowball.tartarus.org/
- create a new directory /bingo/data/mylanguage. The name "mylanguage" MUST exactly match the name of the corresponding Snowball stemmer. See the included library file "/lib/stemmer.jar" and its package "net.sf.snowball.ext" for details.
- create a plain text file "/bingo/data/mylanguage/stopwords.txt" that contains the stopwords of the desired language (you can get stopword files for the supported languages from the Snowball homepage).
- modify the BINGO! configuration file "/conf/config.xml": replace ***** by mylanguage
- modify the language configuration files "/conf/english.xml" and "/conf/german.xml". Replace the option german;english by german;english;mylanguage
- recompilation is not required

10) Tokenizer customization

The tokenizer splits the text of the current document (represented by a character buffer) into individual tokens. The current tokenization procedure of BINGO! is quite simple and straightforward:

- Normalization. German-specific characters (umlauts and eszett, the sharp s) are transcribed as ae, oe, ue, and ss.
- Tokenization.
  Characters a..z, A..Z are treated as symbols, all others as delimiters. Since the current version of the Java "StreamTokenizer" class unfortunately has small hidden bugs (it is NOT possible to treat some special characters as delimiters), we use a simple self-implemented tokenization routine.
- Stopword elimination. The system-wide stopword list is used to remove language-specific non-relevant words.
- Stemming. The current language-specific stemmer is applied.

To modify the tokenization algorithm (you may want to add support for numbers, which are currently treated as whitespace, apply stopword elimination to tokens before or after stemming, or transcribe language-specific special characters), the following steps are required:

- carefully study and modify the class "/bingo/crawler/handler/parser/StemmerDriver" according to your requirements.
- verify the method "/bingo/util/SessionBuffer.setStopwords()". Uncomment the statement "token=stem(token)" if stopword elimination should be done on word stems rather than on words (this change leads to stronger filtering of potential stopwords).
- rebuild the framework using the shell script rebuild.bat
- NOTE: String replacements use Java regular expressions that are applied to each(!) extracted word. A large number of replacement patterns may cause performance problems.

11) Performance tuning

To optimize crawler performance, we recommend verifying and setting the crawling parameters according to your current demands. The most important settings that directly influence engine performance are:

- the amount of heap memory available to the Java virtual machine for the BINGO! application. See section 12) to learn more about JVM parameter settings.
- the number of crawling threads. A higher number of crawler threads may help to increase the overall crawling speed.
  Please keep in mind that each thread maintains its own database connection, so the database must be set up to allow the expected number of simultaneous connections. Furthermore, parallel in-memory processing of multiple documents may rapidly increase the overall memory consumption of the framework. You may use the BINGO! GUI (section "Options") or directly modify the configuration file "conf/config.xml", option "**", to change this parameter.
- the number of allowed parallel connections to the same host. Currently, the BINGO! crawler is configured to allow only a limited number of parallel connections to the same host. The default value is 4, to avoid denial-of-service problems on particular Web servers. However, if you intend to completely scan one large-scale Web service with guaranteed performance (e.g. Amazon or DBLP), this parameter can be increased for higher processing speed. The corresponding locking mechanism resides in the class "bingo.crawler.frontier.URL_Queue"; modify its attribute "max_locks" to change this setting.
- the number of links from each document to follow. This option helps to avoid queue overflow caused by banner sites and fake hubs with thousands of (mostly useless) links. The default value is 5. You may use the BINGO! GUI (section "Options") or directly modify the configuration file "conf/config.xml", option "*", to change this parameter.
- the maximum allowed crawling depth. For interconnected communities (e.g. Computer Science), the characteristic path length is usually small (in the order of 10). In some cases, it might be useful to increase this value, for instance when the topic of interest is widely spread over the Internet and individual pages are usually not directly connected to each other. The default value of this parameter is 10. You may use the BINGO!
  GUI (section "Options") or directly modify the configuration file "conf/config.xml", option "10", to change this parameter.
- BINGO! GUI components. In large-scale crawl experiments, running some GUI components (in particular, the animated link structure and the overview of crawled documents) may slow down crawling. These components are designed primarily for short demos; we recommend switching off animations in long-term crawling sessions.
- data to be stored. By default, BINGO! stores short document descriptions, extracted links and features in the database. In addition, you can enable the storage of "raw" document sources (as BLOBs). Although this option is useful for some applications, it increases the database size and reduces crawling speed. By default, LOB storage is set to OFF. You can enable this option using the BINGO! GUI (section "Options") or by directly modifying the configuration file "conf/config.xml", setting "true".

12) Java virtual machine: memory, internal params

- We recommend running the BINGO! framework with a sufficient amount of reserved JVM heap memory to avoid allocation problems at runtime. For large-scale experiments, it is recommended to provide at least 500 MB of heap memory at startup. With the Sun JVM, you can force this allocation using the JVM flags -Xmx and -Xms:

    Example: java -Xmx500M -Xms500M myClass

- BINGO! uses advanced parameters of the Sun JVM to optimize DNS lookups and HTTP network connections.
  The following entries in the class "bingo.crawler.BINGOCrawler" can be modified:

    java.security.Security.setProperty("networkaddress.cache.ttl", "300");
    java.security.Security.setProperty("networkaddress.cache.negative.ttl", "60");
    System.setProperty("networkaddress.cache.ttl", "300");
    System.setProperty("networkaddress.cache.negative.ttl", "60");
    System.setProperty("http.agent", "Mozilla/4.0 (compatible; MSIE 6.01; Windows NT 5.0)");
    System.setProperty("http.keepAlive", "true");
    System.setProperty("http.maxConnections", "4");
    System.setProperty("sun.net.client.defaultConnectTimeout", "10000");
    System.setProperty("sun.net.client.defaultReadTimeout", "60000");
    System.setProperty("sun.net.inetaddr.ttl", "300");
    System.setProperty("sun.net.inetaddr.negative.ttl", "60");

  With other JVMs, changing these parameters has no effect. Please refer to the vendor's documentation for analogous parameter settings on your system.

13) Adding support for non-http protocols

The crawler of the BINGO! framework is currently set up to handle URLs via the "http" protocol. URLs that require other protocols (e.g. "ftp://myhost.net/file.pdf") are rejected. You can add support for additional protocols (e.g. "file"):

- carefully study the class "bingo/crawler/frontier/url2resolve".
- modify its constructor "url2resolve()" to add the desired protocol to the list of allowed protocols "url2resolve.allowedProtocols".
- modify its method "url2resolve.getConnection()" to properly handle the new protocol.
- rebuild the framework using the shell script rebuild.bat

14) Customization of the Crawler queue

The queue is an important component of the focused crawler; it is responsible for the proper ordering of links on the crawl frontier. You may want to adapt the following queue parameters:

- the queue size. The size of the crawler's sorted queue is limited to avoid memory overflows.
  When the maximum allowed queue size is reached, a new link is still accepted if it can replace a lower-rated candidate at the bottom of the queue; otherwise, it is ignored. BINGO! maintains separate queues for the individual ontology topics, so the selected value applies to every topic queue. Small values (in the order of 1000 or less) may cause premature loss of focus. High values (1,000,000 and above) may cause system overload. This tuning parameter can be modified in the configuration file "/conf/config.xml": myvalue as well as from the BINGO! GUI (section "Options").
- the URL ordering. Basically, URLs in the queue (objects "bingo.util.BingoDocument") are ordered by their priority attribute, which can be accessed using the methods BingoDocument.getPriority() and BingoDocument.setPriority(). The queue sorts links in descending order: a greater priority value means higher priority. Currently, the priority value of each new link is assigned within routines of the class "bingo.crawler.handler.LinkHandler" according to its normalized SVM score. For rejected documents, the priority is set to half of the predecessor's priority (to enable tunneling). You can modify this policy by appropriate changes within the class "bingo.crawler.handler.LinkHandler".
- implementation of the sorted queue. The current URL queue for each class is implemented on top of a Java "ArrayList" object, backed by an array. You can replace the queue backbone (e.g. by a linked list or a TreeMap), according to your expected read/update pattern. In this case, you will need to modify the class "bingo.crawler.frontier.ClassQueue" and carefully adapt its insert/remove/lookup routines.
  NOTE: the built-in Java container with sorted access (TreeMap) may cause problems on systems with high load using Sun JVM 1.4.2. We observed that removed objects (TreeMap.remove()) are sometimes NOT immediately deleted from the TreeMap and can be retrieved twice or even multiple times.
  This would cause violations of DB integrity constraints (document IDs have to be unique).

15) Pattern-based excluding of URLs from crawl

In some cases it is useful to "lock" (exclude) particular domains or URL patterns from the current crawl: banner servers, irrelevant portals or private homepages are potential candidates.

- carefully study the file "bingo/crawler/handler/UrlVerifier".
- add or remove the desired "bad" patterns in the array "UrlVerifier.forbiddenMatches".
- NOTE: Pattern-based locking uses String-based Java regular expressions that are applied to each(!) target URL. A large number of complex patterns may cause performance problems.
- IP-based locking of particular hosts/domains is currently NOT supported. You can modify the classes "bingo/crawler/handler/UrlVerifier" and "bingo/crawler/frontier/url2resolve" to add the desired functionality.
- rebuild BINGO!

16) Important BINGO! options

17) Adding new global options

In some cases, you will need to extend BINGO! with new global options. In general, the following steps are required to integrate a new global option into BINGO!:

- edit the two XML files "/conf/config.xml" **and** "/bingo/data/default.dat" (the latter replaces the customized config.xml file when the user hits the button "RestoreDefaults" in the BINGO! Options) and add tags for the new global parameter
- edit the class "bingo.util.SessionBuffer" and add the new static parameter variable and appropriate get/set access methods
- modify its methods "initSettings()" and "saveConfig" and add read/store support for the new parameter
- modify the class "/bingo/crawler/ControlDialog" (BINGO! Options GUI) and add GUI support for the new parameter. Don't forget to add action listeners for the new GUI elements. Modify the methods "initStandard()" and "commitsettings()" for proper initialization and postprocessing of the new GUI elements.

18) Adding language support for new GUI menus

The current BINGO! release supports the menu languages English and German.
To add support for a new language, you need to create a new language schema for all system messages. Carefully study the language files "conf/german.xml" and "conf/english.xml". Copy one of them and translate all messages into the desired language. Store the new file as "conf/mylanguage.xml". Edit all "conf/language.xml" files (including the new one) and add the name of the new language to the attribute 'cDLanguageSelect': deutsch;english;mylanguage

19) Customizations of feature spaces

To verify and customize the feature spaces of each topic, you can use the collection of JSP servlets 'Bingo Reviser'. The simplest way to install 'BingoReviser' is to copy the JSP files and Java classes of its distribution into the appropriate directories of an existing JSP repository (e.g., 'jsp-examples'). The root page of the Reviser is called 'bingo_feed_start.htm' and can be accessed in our example via http://hostname:8080/jsp-examples/bingo/bingo_feed_start.htm (depending on your custom settings, the port and the directory of this location may differ). The 'Administration' page of the Reviser contains the link to the 'Feature Reviser' routines. You can process the feature spaces of particular topics using a set of filtering rules and store the verified results back into the database. The 'positively' marked features are used by BINGO! for restrictive filtering of new documents: every candidate must contain at least a specified number of 'good' topic-specific features to qualify for this topic. The settings for the restrictivity of this term-based filter can be accessed via the BINGO! GUI (menu 'Global Settings->Settings->Crawler') or by editing the configuration file 'conf/config.xml', options and .

================================================================================
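
The restrictive filtering described in section 19) can be pictured with the following minimal sketch. It is not BINGO! code: the class FeatureFilterSketch, the method qualifies() and the sample terms are hypothetical, and the sketch assumes that each 'good' feature is counted once per document, however often it occurs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not BINGO! code: a candidate document
// qualifies for a topic only if it contains at least minGoodFeatures
// distinct features from the topic's set of 'good' features.
class FeatureFilterSketch {

    static boolean qualifies(List<String> documentFeatures,
                             Set<String> goodFeatures,
                             int minGoodFeatures) {
        // Assumption: a matching feature counts once, however often it occurs.
        Set<String> matched = new HashSet<>(documentFeatures);
        matched.retainAll(goodFeatures);
        return matched.size() >= minGoodFeatures;
    }

    public static void main(String[] args) {
        // Sample word stems, made up for this example.
        Set<String> good = new HashSet<>(Arrays.asList("crawler", "classifi", "svm"));
        List<String> doc = Arrays.asList("focus", "crawler", "svm", "crawler");

        System.out.println(qualifies(doc, good, 2)); // two distinct good features
        System.out.println(qualifies(doc, good, 3)); // stricter threshold
    }
}
```

Raising the threshold makes the filter more restrictive: fewer documents qualify for the topic, at the risk of rejecting relevant ones that use different vocabulary.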