==================================================
First steps with BINGO!
==================================================

This document explains the most important BINGO! routines and algorithms.
For each step, it also gives an overview of the important packages, classes
and methods involved, as an aid for troubleshooting and quick customizations.

This introduction is NOT designed as a detailed step-by-step guide to all
BINGO! routines; it is rather a collection of conceptual recommendations for
experienced Java developers.

THESE RECOMMENDATIONS ARE PROVIDED WITHOUT ANY EXPRESSED OR IMPLIED WARRANTIES.
IN NO EVENT SHALL THE DATABASE AND INFORMATION SYSTEMS RESEARCH GROUP BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THESE RECOMMENDATIONS, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

IMPORTANT: We highly recommend that you create ***BACKUPS*** of all BINGO!
sources before applying any changes.

NOTE: If you use BINGO! in your scientific work, please cite as:

Sergej Sizov, Michael Biwer, Jens Graupmann, Stefan Siersdorfer,
Martin Theobald, Gerhard Weikum, Patrick Zimmer:
The BINGO! System for Information Portal Generation and Expert Web Search.
The First Biennial Conference on Innovative Data Systems Research (CIDR),
Asilomar (CA), 2003
available at http://www-db.cs.wisc.edu/cidr2003/program/p7.pdf

==================================================

1) Installation
2) Creating initial ontology
3) Monitoring feature spaces
4) Customize Training base & retrain
5) Customize starting points and start the crawl
6) Link analysis
7) Evaluation of results

==========================================================================

1) Installation

Please refer to INSTALL.txt for detailed installation instructions and
troubleshooting. In general, the following simple steps are required:

- download and uncompress the BINGO! release into a new folder
- run the batch file schema.bat. This will create a new database user with
  the BINGO! database schema.
- run r.bat and select the new user from the dropdown list of available
  database connections

Important packages and files:

file schema.bat: batch file to create a new BINGO! user automatically
file r.bat: starts the BINGO! engine
class schema.SchemaManager: automatically creates the new BINGO! database user
file schema/user.sql: contains SQL instructions to create a new BINGO! user
file schema/schema_speed.sql: contains SQL instructions to create the BINGO! schema
class bingo.main.BINGO3: start class of the BINGO! framework

================================================================================

2) Creating initial ontology

Open the Settings window ('Global Settings' -> 'Options' -> 'General') and
verify the general BINGO! settings:

- Stemmer Language ('Stemmer Language')
- Log Level ('Log Level'); higher values mean more details about the crawl
  (default 1)
- Ignore null vectors: false
- Links from database: false
- Show stack Trace: yes (useful for debugging)

2.1 In the BINGO! main window, select 'BINGO!' -> 'Batch on Bookmarks'.
The batch routine performs the following steps:

a) Import a new bookmark file (in Netscape Bookmarks format)
b) Create a new ontology tree according to the topic structure of this
   bookmark file
c) Retrieve the documents referenced by the bookmarks
d) Store the retrieved documents in the database
e) Apply DF-based feature pre-selection
f) Apply MI-based feature selection and initialize feature spaces for each
   topic
g) Build a new SVM model using the downloaded documents as training samples
h) Store the new SVM model in the database

2.2 Open the main crawler window ('Crawler' -> 'Crawler-Interface') and then
select 'Crawler' -> 'Start'. This starts the crawler on links extracted from
the retrieved training documents. Before starting the crawl, you can customize
the feature spaces, the SVM model, and the entry points of the crawler (i.e.
Web addresses where the crawl should be started for better recall).

Important packages and files:

class bingo.util.BINGODesktop: main GUI window of the engine
class bingo.util.CommonTasks: batch routines for common sequences of BINGO!
  algorithms (like the ontology creation described above)
class bingo.base.FeatureDialog: main class for import of referenced documents
  from bookmark file
class bingo.crawler.BingoCrawler: the main crawler class (used to download
  bookmarked documents)
class stored.berechne_df: preselection of features (DF-based)
class stored.mi.MiManager: main class for MI-based selection of features
class bingo.svmlight.SVMModelBuilder: builds a new linear SVM model on current
  training documents
class bingo.db.DBInterface: database access routines
class bingo.util.BingoTreeNode: the node of the BINGO! ontology tree. Contains
  training documents, topic-specific features, and the topic-specific SVM
  classifier.
class bingo.util.BingoDocument: represents a Web document (e.g. an HTML or PDF
  file) in the BINGO! data model
class bingo.crawler.frontier.url2resolve: contains connection-specific
  information (e.g. IP address of the target URL) for BingoDocument.

3) Customization of feature spaces

- Install the BingoReviser engine for Apache Tomcat. See the installation
  instructions for BingoReviser for technical details of this installation.
  In general, it is sufficient to copy the contents of the BingoReviser package
  into the appropriate folders of the existing Apache Tomcat JSP repository
  'jsp-examples'. The root page of BingoReviser is called
  'bingo_feed_start.htm' and can be accessed in our example via
  http://hostname:8080/jsp-examples/bingo/bingo_feed_start.htm
  (depending on your custom settings, the port and the directory of this
  location may differ).
- open the main page of BingoReviser (bingo_feed_start.htm)
- open the 'Admin' page and type in the details of your database connection
- on the following Admin page, click the link 'Feature Reviser' at the bottom
- verify the feature spaces for particular topics and store the results.
- In the BINGO! framework, enable the use of 'indicators' for classification:
  'Global Settings' -> 'Options' -> 'Crawler' -> 'Use Indicators', and set the
  minimum number of positive features (as selected before) that is required to
  positively classify a document.

Important packages and classes:

JSP+Class bingo.bingo_feed_admin: the admin page
JSP+Class bingo.feature_auswahl: generate preview for selected features

4) Customizations of the training base & retraining

- You can add new training documents to the ontology directly from the crawler
  window ('Crawler' -> 'Crawler Interface') and the Database window
  ('Database' -> 'Database Interface') using the '=> Training' button.
- In the BINGO! framework, repeat the feature selection routines
  ('Feature Selection' -> 'DF' and 'Feature Selection' -> 'MI') with customized
  numbers of features for each step.
- Create a new SVM classifier ('SVM' -> 'SVM Modelling' -> 'Start'). After
  retraining, the new model replaces the old one. Additionally, it is
  automatically serialized and stored in the current database.
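
The two feature selection steps repeated above (DF-based pre-selection followed
by MI-based ranking) can be pictured with the following minimal sketch. It is
NOT the BINGO! implementation (stored.berechne_df, stored.mi.MiManager); the
class FeatureSelectionSketch, its method names, and the parameters minDf and
topK are hypothetical and chosen for illustration only. Documents are assumed
to be already tokenized and stemmed into sets of terms, and mutual information
is computed with the usual textbook formula over term presence and topic
membership.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: not the BINGO! classes stored.berechne_df / stored.mi.MiManager.
class FeatureSelectionSketch {

    // Helper to build a document as a set of (already stemmed) terms.
    static Set<String> doc(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    // Document frequency: in how many documents does each term occur?
    static Map<String, Integer> documentFrequency(List<Set<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> d : docs) {
            for (String term : d) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // One summand of the MI formula; empty cells contribute 0 by convention.
    private static double cell(double joint, double termTotal, double classTotal, double n) {
        if (joint == 0) {
            return 0.0;
        }
        return (joint / n) * Math.log(n * joint / (termTotal * classTotal)) / Math.log(2);
    }

    // Mutual information (in bits) between term presence and topic membership.
    // n11: topic docs with term, n10: other docs with term,
    // n01: topic docs without term, n00: other docs without term.
    static double mutualInformation(int n11, int n10, int n01, int n00) {
        double n = n11 + n10 + n01 + n00;
        return cell(n11, n11 + n10, n11 + n01, n)
             + cell(n10, n11 + n10, n10 + n00, n)
             + cell(n01, n01 + n00, n11 + n01, n)
             + cell(n00, n01 + n00, n10 + n00, n);
    }

    // Step 1: drop terms with document frequency below minDf.
    // Step 2: rank the remaining terms by MI and keep the topK best.
    static List<String> selectFeatures(List<Set<String>> topicDocs,
                                       List<Set<String>> otherDocs,
                                       int minDf, int topK) {
        List<Set<String>> all = new ArrayList<>(topicDocs);
        all.addAll(otherDocs);
        Map<String, Integer> dfAll = documentFrequency(all);
        Map<String, Integer> dfTopic = documentFrequency(topicDocs);

        final Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : dfAll.entrySet()) {
            if (e.getValue() < minDf) {
                continue; // DF-based pre-selection
            }
            int n11 = dfTopic.getOrDefault(e.getKey(), 0);
            int n10 = e.getValue() - n11;
            score.put(e.getKey(), mutualInformation(
                    n11, n10, topicDocs.size() - n11, otherDocs.size() - n10));
        }

        List<String> ranked = new ArrayList<>(score.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(score.get(b), score.get(a)); // descending MI
            return byScore != 0 ? byScore : a.compareTo(b);           // stable tie-break
        });
        return new ArrayList<>(ranked.subList(0, Math.min(topK, ranked.size())));
    }

    public static void main(String[] args) {
        List<Set<String>> topic = Arrays.asList(doc("svm", "kernel", "the"), doc("svm", "the"));
        List<Set<String>> other = Arrays.asList(doc("soccer", "the"), doc("goal", "the", "kernel"));
        System.out.println(selectFeatures(topic, other, 2, 2));
    }
}
```

In this toy input, a term that occurs in every document (such as 'the') passes
the DF threshold but receives an MI score of 0, which is why the MI step is
applied after the DF step rather than instead of it.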

5) Customizations of starting points

You can start the crawl on links from manually preselected training documents:

- Open the training base view ('Training Base' -> 'Modify training Base') or
  the Database view ('Database' -> 'Database Interface')
- Select the desired documents and send them to the crawler
  (button '=> Crawler')

You can also start the crawler on manually preselected links. These links can
be completely independent of the current training base and should be stored in
the BINGO! database table 'start_urls' as absolute HREF strings
(e.g. 'http://www.mpi-sb.mpg.de'). The table 'start_urls' can be filled
'by hand' (e.g. using the BINGO! SQL MiniClient: 'Database' -> 'SQL MiniClient')
or using automated routines:

> class WebAPI.StaticSubmitter contains an example with a static list of
  pre-selected links.
  (Run batch script 'StaticSubmitter.bat' in folder 'WebAPI' for automated
  sample execution)
> class WebAPI.Google.GoogleTest contains an example with Google lookup using
  the Google API (requires registration)
  (Run batch script 'GoogleTest.bat' in folder 'WebAPI.Google' for automated
  sample execution)
> class WebAPI.Amazon.AmazonTest contains an example with Amazon lookup using
  Amazon Web Services (requires registration)
  (Run batch script 'AmazonTest.bat' in folder 'WebAPI.Amazon' for automated
  sample execution)

- In the BINGO! engine, open the Settings window
  ('Global Settings' -> 'Options' -> 'General') and activate the checkbox
  'Links from database'.
- Open the Crawler window ('Crawler' -> 'Crawler Interface') and start the
  crawler ('Crawler' -> 'Start').
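
As a rough illustration of an automated routine that fills 'start_urls', the
sketch below filters a list of candidate links down to distinct absolute
http(s) URLs and then inserts them through plain JDBC. It is not the code of
WebAPI.StaticSubmitter. The class name StartUrlSketch and, in particular, the
column name 'url' are assumptions made for this example; please check
schema/schema_speed.sql for the actual layout of the table before reusing the
INSERT statement.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: not the BINGO! class WebAPI.StaticSubmitter.
class StartUrlSketch {

    // ASSUMPTION: the column name 'url' is hypothetical; verify it against the schema.
    static final String INSERT_SQL = "INSERT INTO start_urls (url) VALUES (?)";

    // True if s is an absolute http or https URL with a host part.
    static boolean isAbsoluteHttpUrl(String s) {
        try {
            URI uri = new URI(s.trim());
            String scheme = uri.getScheme();
            return uri.isAbsolute()
                    && uri.getHost() != null
                    && ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme));
        } catch (URISyntaxException e) {
            return false; // not parseable, so not usable as a starting point
        }
    }

    // Keeps only absolute http(s) URLs, trimmed, without duplicates, in input order.
    static List<String> prepare(List<String> candidates) {
        Set<String> seen = new LinkedHashSet<>();
        for (String c : candidates) {
            if (c != null && isAbsoluteHttpUrl(c)) {
                seen.add(c.trim());
            }
        }
        return new ArrayList<>(seen);
    }

    // Inserts the prepared URLs as one JDBC batch; returns the number of statements sent.
    static int store(Connection con, List<String> urls) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(INSERT_SQL)) {
            for (String url : urls) {
                ps.setString(1, url);
                ps.addBatch();
            }
            return ps.executeBatch().length;
        }
    }
}
```

A caller would obtain a java.sql.Connection for the BINGO! database user,
call StartUrlSketch.store(con, StartUrlSketch.prepare(links)), and then
activate 'Links from database' as described above.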

Important packages and files:

class bingo.util.BINGODesktop: main GUI window of the engine
class bingo.base.FeatureDialog: main class for import of referenced documents
  from bookmark file
class stored.berechne_df: preselection of features (DF-based)
class stored.mi.MiManager: main class for MI-based selection of features
class bingo.svmlight.SVMModelBuilder: builds a new linear SVM model on current
  training documents
class bingo.db.DBInterface: database access routines

6) Link analysis

BINGO! provides routines for link-based authority ranking of crawl results.

a) Kleinberg's HITS algorithm
   From the main window, select 'Link Analysis' -> 'Hits' and specify the
   number of iterations for the computation. The results are automatically
   stored in the database.

b) PageRank
   From the main window, select 'Link Analysis' -> 'PageRank' (with/without
   pruning) and specify the number of iterations for the computation. The
   results are automatically stored in the database. The version 'PageRank
   without pruning' performs the standard PageRank analysis. The modified
   version 'PageRank with pruning' recursively removes all pages with
   outdegree 0 from the link graph model.

Important packages and files:

class stored.linkanal.WebRanker: contains all routines for link-based
  authority scores

7) Evaluation of results

To evaluate crawl results interactively, you can use the JSP engine
BingoReviser. After installation:

- open the main window
  (e.g. http://hostname:8080/jsp-examples/bingo/bingo_feed_start.htm) and
  proceed to the 'Admin' window. Here you can add the database connection
  details. On the following Admin page, customize all relevant details for
  your evaluation (e.g. include/exclude OTHERS topics, number of documents per
  topic to be shown, etc.) and store these preferences.
- return to the main page and open the 'Evaluation' link. Verify and store the
  displayed crawl results.
- return to the main page and follow the 'Scores' link.
  BingoReviser provides summarized information about precision, accuracy and
  recall for each particular topic, as well as micro/macro average values for
  the whole taxonomy.

Important packages and files:

JSP+Class bingo.bingo_feed_admin: the admin page
JSP+Class bingo.bingo_feed_check: enter/verify database connection details
JSP+Class bingo.bingo_feed_check_2: store updated results
JSP+Class bingo.bingo_feed_summary: show summary of the evaluation
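
The difference between the micro and macro averages reported on the 'Scores'
page can be illustrated with a small sketch. This document does not state the
exact formulas used by BingoReviser, so the code below assumes the standard
definitions: the macro average is the mean of the per-topic values, while the
micro average pools the counts of all topics before dividing. The class
EvaluationSketch and its members are hypothetical names, and accuracy is left
out because it would additionally require true-negative counts.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: not the code behind bingo.bingo_feed_summary.
class EvaluationSketch {

    // Per-topic counts: true positives, false positives, false negatives.
    static final class TopicCounts {
        final int tp;
        final int fp;
        final int fn;

        TopicCounts(int tp, int fp, int fn) {
            this.tp = tp;
            this.fp = fp;
            this.fn = fn;
        }
    }

    // Safe division; an empty denominator is reported as 0.
    static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    // Macro average: every topic has the same weight.
    static double macroPrecision(List<TopicCounts> topics) {
        double sum = 0.0;
        for (TopicCounts t : topics) {
            sum += ratio(t.tp, t.tp + t.fp);
        }
        return topics.isEmpty() ? 0.0 : sum / topics.size();
    }

    static double macroRecall(List<TopicCounts> topics) {
        double sum = 0.0;
        for (TopicCounts t : topics) {
            sum += ratio(t.tp, t.tp + t.fn);
        }
        return topics.isEmpty() ? 0.0 : sum / topics.size();
    }

    // Micro average: counts are pooled, so large topics dominate.
    static double microPrecision(List<TopicCounts> topics) {
        int tp = 0;
        int fp = 0;
        for (TopicCounts t : topics) {
            tp += t.tp;
            fp += t.fp;
        }
        return ratio(tp, tp + fp);
    }

    static double microRecall(List<TopicCounts> topics) {
        int tp = 0;
        int fn = 0;
        for (TopicCounts t : topics) {
            tp += t.tp;
            fn += t.fn;
        }
        return ratio(tp, tp + fn);
    }

    public static void main(String[] args) {
        List<TopicCounts> topics = Arrays.asList(
                new TopicCounts(8, 2, 2),   // a large, well-classified topic
                new TopicCounts(1, 3, 1));  // a small, noisy topic
        System.out.println("macro precision = " + macroPrecision(topics));
        System.out.println("micro precision = " + microPrecision(topics));
        System.out.println("macro recall    = " + macroRecall(topics));
        System.out.println("micro recall    = " + microRecall(topics));
    }
}
```

For a taxonomy with topics of very different sizes, the two averages can
differ noticeably, so it is worth looking at both before comparing crawls.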