Course "Selected Topics in Web Information Retrieval and Mining"

 

In-depth Course
Winter Semester 2003/04

Prof. Dr.-Ing. Gerhard Weikum
Databases and Information Systems (AG5, MPI)


 
   Organization

 

   Contents of the Course

The course covers selected publications from recent conferences and journals in the area of Web information retrieval and mining (SIGIR, TOIS, WWW, CIKM, Machine Learning, KDD, SIGMOD, VLDB, TODS, etc.). The material is organized into four main topics, each with 2-4 lectures: Personalization of Information Search, Efficient Ranking, System Architecture, and Statistical Language Models.

 

  Requirements for the Certificate (6 Credit Points)
Each student who participates in the course must present her/his solutions to assignments at least once during the semester. Each additional presentation, on a voluntary basis, earns the student one bonus point; up to 3 bonus points are possible.

Grades will be based on oral exams (of 20-30 minutes per student) at the end of the semester. Each bonus point that was earned for the presentation of good solutions to assignments improves the exam's grade by 1/3 according to the German grading system (e.g., an oral exam grade of 2.0 would be improved to 1.7 with one bonus point, 1.3 with two points, and 1.0 with three points).
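The bonus rule above can be illustrated with a short sketch. It assumes the usual German passing-grade scale (1.0, 1.3, 1.7, ..., 4.0), that each bonus point moves the grade one step up this scale, and that the grade cannot improve beyond 1.0. The names GRADE_STEPS and apply_bonus are hypothetical and serve only as illustration; this is not an official grading tool.

```python
# Assumed passing grades of the German grading scale, best to worst.
GRADE_STEPS = [1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0, 3.3, 3.7, 4.0]
MAX_BONUS_POINTS = 3  # upper limit stated in the course requirements


def apply_bonus(exam_grade: float, bonus_points: int) -> float:
    """Improve an oral exam grade by one scale step (1/3) per bonus point."""
    if exam_grade not in GRADE_STEPS:
        raise ValueError(f"{exam_grade} is not a grade on the assumed scale")
    if bonus_points < 0:
        raise ValueError("bonus_points must not be negative")
    # At most three bonus points count, and 1.0 is the best possible grade.
    steps = min(bonus_points, MAX_BONUS_POINTS)
    index = GRADE_STEPS.index(exam_grade)
    return GRADE_STEPS[max(index - steps, 0)]
```

For the example given above, apply_bonus(2.0, 1) yields 1.7, apply_bonus(2.0, 2) yields 1.3, and apply_bonus(2.0, 3) yields 1.0.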

 

   General Background Literature
 

Soumen Chakrabarti: Mining the Web, Morgan Kaufmann, 2003.

http://http.cs.berkeley.edu/~soumen/mining-the-web/

 

Gerhard Weikum (Editor): IEEE CS Data Engineering Bulletin Vol. 25 No. 1, March 2002,

Special Issue on Organizing and Discovering the Semantic Web.

ftp://ftp.research.microsoft.com/pub/debull/A02MAR-CD.pdf

 

   Tentative Topics and Dates
 
Part 1: Personalized Information Search

1) Thu, Oct 30 - Topic-specific and Personalized PageRank 1
Taher Haveliwala:
Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search
IEEE Transactions on Knowledge and Data Engineering, to appear in 2003.
http://www.stanford.edu/~taherh/papers/topic-sensitive-pagerank-tkde.pdf (No. 0101)
 
Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and Gene H. Golub:
Exploiting the Block Structure of the Web for Computing PageRank
Technical Report, Stanford University, 2003.
http://www.stanford.edu/~taherh/papers/blockrank.pdf (No. 0102)
 
Chris Ding, Xiaofeng He, Parry Husbands, Hongyuan Zha, Horst Simon:
PageRank, HITS, and a Unified Framework for Link Analysis
SIAM International Conference on Data Mining, 2003.
http://www.nersc.gov/research/SCG/cding/papers_ps/sigpage6b.ps (No. 0103)

2) Thu, Nov 6 - Topic-specific and Personalized PageRank 2
Glen Jeh, Jennifer Widom: 
Scaling personalized web search
WWW Conference, 2003. 
http://www2003.org/cdrom/papers/refereed/p185/html/p185-jeh.html
http://citeseer.nj.nec.com/jeh02scaling.html (No. 0201)
 
Serge Abiteboul, Mihai Preda, Gregory Cobena: 
Adaptive on-line page importance computation
WWW Conference, 2003.
http://www2003.org/cdrom/papers/refereed/p007/p7-abiteboul.html (No. 0202)
 

3) Thu, Nov 13 - Query-Log-based Personalization
Ji-Rong Wen, Jian-Yun Nie, and Hong-Jiang Zhang:
Query Clustering using User Logs 
ACM Transactions on Information Systems (ACM TOIS), 20(1), 59-81, January, 2002
http://doi.acm.org/10.1145/503104.503108 (No. 0301)
 
Hang Cui, Ji-Rong Wen, Jian-Yun Nie and Wei-Ying Ma:
Query Expansion by Mining User Logs
IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, July/August 2003.
http://csdl.computer.org/comp/trans/tk/2003/04/k4toc.htm (No. 0302)
or get the hard copy from the library
or an alternative version from http://research.microsoft.com/asia/dload_files/group/mediasearching/2002p/QE-TKDE.pdf
 
Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang, Chao-Jun Lu:
Implicit Link Analysis for Small Web Search
SIGIR Conference, 2003.
http://research.microsoft.com/~i-hjzeng/p31261-xue.pdf (No. 0303)
 

4) Thu, Nov 20 - Relevance Feedback
Michael Ortega-Binderberger, Kaushik Chakrabarti, Sharad Mehrotra:
An Approach to Integrating Query Refinement in SQL
EDBT Conference, 2002.
http://www-db.ics.uci.edu/pages/publications/2002/tr-db-02-03-full.pdf (No. 0401)
 
Kaushik Chakrabarti, Michael Ortega-Binderberger, Sharad Mehrotra, Kriengkrai Porkaew: 
Evaluating Refined Queries in Top-k Retrieval Systems
TKDE Vol.15 No.5, 2003.
http://www-db.ics.uci.edu/pages/publications/2003/tr-db-03-05.pdf (No. 0402)
          
 
Michael Ortega-Binderberger and Sharad Mehrotra:
Relevance Feedback in Multimedia Databases
In Borko Furht and Oge Marques (Eds.),
Handbook of Video Databases: Design and Applications, CRC Press, 2003.
http://www-db.ics.uci.edu/pages/publications/2003/Chapter23_book.pdf (No. 0403)
          
 
 

Part 2: Efficient Ranking

5) Thu, Nov 27 - Index Pruning
Xiaohui Long, Torsten Suel: 
Optimized Query Execution in Large Search Engines with Global Page Ordering
VLDB 2003.
http://cis.poly.edu/suel/papers/order.pdf (No. 0501)

Alistair Moffat, Justin Zobel: 
Self-Indexing Inverted Files for Fast Text Retrieval
TOIS 14(4), 1996.
http://doi.acm.org/10.1145/237496.237497 (No. 0502)
 
Aya Soffer, David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoëlle S. Maarek: 
Static Index Pruning for Information Retrieval Systems
SIGIR Conference, 2001.
http://www.almaden.ibm.com/cs/people/fagin/sigir01.pdf (No. 0503)
 

6) Thu, Dec 4 - Rank Aggregation
Ronald Fagin, Ravi Kumar, D. Sivakumar: 
Efficient similarity search and classification via rank aggregation
SIGMOD Conference, 2003.
http://www.almaden.ibm.com/cs/people/fagin/sigmod03.pdf (No. 0601)
 
Ronald Fagin, Amnon Lotem, Moni Naor: 
Optimal aggregation algorithms for middleware
Journal of Computer and System Sciences Vol. 66 No. 4, 2003.
http://dx.doi.org/10.1016/S0022-0000(03)00026-6 (No. 0602)
 
Ronald Fagin, Ravi Kumar, and D. Sivakumar:
Comparing Top k Lists
To appear in SIAM Journal on Discrete Mathematics.
http://www.almaden.ibm.com/cs/people/fagin/topk.pdf (No. 0603)


7) Thu, Dec 11 - Top k Queries on Structured Data and Deep Web Sources
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, and Aristides Gionis:
Ranking on structured data: Automated Ranking of Database Query Results
CIDR Conference, 2003.
http://www-db.cs.wisc.edu/cidr/program/p9.pdf (No. 0701)
 
Nicolas Bruno, Luis Gravano, Amélie Marian: 
Evaluating Top-k Queries over Web-Accessible Databases
ICDE Conference, 2002.
http://www1.cs.columbia.edu/~amelie/papers/combining.pdf (No. 0702)

Panayiotis Tsaparas, Themistoklis Palpanas, Yannis Kotidis, Nick Koudas, Divesh Srivastava:
Ranked Join Indices
ICDE Conference, 2003.
http://www.research.att.com/~divesh/papers/tpk+2003-rji..ps.gz (No. 0703)


Part 3: System Architecture

8) Thu, Dec 18 - Topic Distillation
Soumen Chakrabarti, Mukul Joshi, Vivek Tawde:
Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks
SIGIR Conference, 2001.
http://citeseer.nj.nec.com/chakrabarti01enhanced.html (No. 0801)
 
Soumen Chakrabarti, Kunal Punera, Mallela Subramanya:
Accelerated focused crawling through online relevance feedback
WWW Conference, 2002.
http://www2002.org/CDROM/refereed/336/ (No. 0802)
 

9) Thu, Jan 8 - XML Search Engines
Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram: 
XRANK: Ranked Keyword Search over XML Documents
SIGMOD Conference, 2003.
http://www.cs.cornell.edu/People/jai/papers/XRank.pdf (No. 0901)
 
Sihem Amer-Yahia, SungRan Cho, Divesh Srivastava:
Tree Pattern Relaxation
EDBT Conference, 2002.
http://www.research.att.com/~divesh/papers/acs2002-relax.ps (No. 0902)
 

10) Thu, Jan 15 - Peer-to-Peer Search
Edith Cohen, Amos Fiat, Haim Kaplan:
Associative Search in Peer to Peer Networks: Harnessing Latent Semantics
IEEE INFOCOM Conference, 2003.
http://www.research.att.com/~edith/Papers/infocom03.ps (No. 1001)
 
Jie Lu, Jamie Callan:
Content-Based Retrieval in Hybrid Peer-to-Peer Networks
CIKM Conference, 2003.
http://www-2.cs.cmu.edu/~jielu/Papers/cikm03_jielu_irp2p.pdf (No. 1002)
 
Chiasen Chung, Charles L.A. Clarke:
Topic-Oriented Collaborative Crawling
CIKM Conference, 2002.
http://citeseer.nj.nec.com/538331.html (No. 1003)
 

Part 4: Statistical Language Models and Topic Spaces

11) Thu, Jan 22 - Latent Semantic Spaces
Thomas Hofmann: 
Unsupervised Learning by Probabilistic Latent Semantic Analysis
Machine Learning Vol.42 No.1/2, 2001.
http://www.kluweronline.com/oasis.htm/279327 (No. 1101)
 
D.M. Blei, A.Y. Ng, M.I. Jordan:
Latent Dirichlet Allocation
Journal of Machine Learning Research Vol.3, 2003.
http://www.cs.berkeley.edu/~jordan/papers/blei03a.ps.gz (No. 1102)
 

12) Thu, Jan 29 - Unified Link and Content Models
L. Getoor, N. Friedman, D. Koller, B. Taskar:
Learning Probabilistic Models of Link Structure
Journal of Machine Learning Research, 2002.
http://www.cs.umd.edu/~getoor/Publications/jmlr02.pdf (No. 1201)
 

Dimitris Achlioptas, Amos Fiat, Anna R. Karlin, Frank McSherry:
Web Search via Hub Synthesis
FOCS Conference, 2001.
http://citeseer.nj.nec.com/achlioptas01web.html (No. 1202)

 
Matthew Richardson, Pedro Domingos:
The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank
NIPS Conference, 2001.
http://www.cs.washington.edu/homes/pedrod/papers/nips01b.pdf (No. 1203)
          
 

13) Thu, Feb 5 - Topic Mappings
Sunita Sarawagi, Soumen Chakrabarti, Shantanu Godbole:
Cross-Training: Learning Probabilistic Mappings Between Topics
http://http.cs.berkeley.edu/~soumen/doc/sigkdd2003 (No. 1301)
          
 
AnHai Doan, Jayant Madhavan, Pedro Domingos, Alon Halevy:
Ontology Matching: A Machine Learning Approach
In S. Staab and R. Studer (eds.), Handbook on Ontologies in Information Systems, 
Springer, 2003. 
http://www.cs.washington.edu/homes/pedrod/papers/hois.pdf (No. 1302)
          
 

Last upload: Hanglin Pan, Nov 19, 2003