max planck institut
Practical Projects in Web Dynamics
As part of the exercises on Web Dynamics, everybody is required to do a practical project. The projects can be done in groups of up to three students.
There are three topics to choose from: Web Structure, Web Size Estimation and Web Crawling. The first two projects use the ClueWeb09 dataset, while the third focuses on the MPII web site. At the end, summarize your results, describe the techniques and algorithms you used, and justify your conclusions.
The ClueWeb09 dataset consists of about 1 billion web pages,
collected in January and February 2009.
Access to the dataset:
- 1,040,809,705 web pages
- 4,780,950,903 unique urls
- 7,944,351,835 outlinks
The services are accessible from the internal MPI-INF network only.
- Keyword Search
- results in JSON:
- results in XML:
- results in HTML:
- Existence of a page, returns
Web Structure
Analyze the ClueWeb09 dataset according to the model of the Web graph introduced by
Broder et al. Implement algorithms for finding the strongly and weakly connected components of a
graph. Apply them to identify the URLs that belong to the strongly connected component
(SCC), to the IN and OUT subgraphs, to the tendrils, and to the tubes. Compute the diameters of the
whole graph and of the SCC.
We recommend using JUNG
for graph processing and analysis.
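The decomposition described above can be sketched independently of JUNG. The following Python sketch is an illustration only, not the required implementation: it assumes the link graph fits in memory as a dictionary mapping each URL to a list of outlinked URLs (at the scale of ClueWeb09 this will not hold without partitioning or an external-memory representation), and the names `strongly_connected_components`, `reachable` and `bow_tie` are made up for this example. It finds SCCs with Kosaraju's two-pass algorithm, takes the largest one as the core, and derives IN and OUT by forward and backward reachability. Tendrils, tubes and diameters are not covered.

```python
from collections import defaultdict, deque


def _reverse(graph):
    """Return the graph with every edge flipped."""
    reverse = defaultdict(list)
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)
    return reverse


def strongly_connected_components(graph):
    """Kosaraju's algorithm, written iteratively to avoid recursion limits."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    reverse = _reverse(graph)

    # Pass 1: record nodes in order of DFS finishing time.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finishing time.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = {start}, [start]
        while todo:
            node = todo.pop()
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    component.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


def reachable(graph, sources):
    """Breadth-first search: all nodes reachable from any source node."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(graph):
    """Return (SCC core, IN, OUT) for the largest strongly connected component."""
    core = max(strongly_connected_components(graph), key=len)
    out_part = reachable(graph, core) - core
    in_part = reachable(_reverse(graph), core) - core
    return core, in_part, out_part


# Toy graph (hypothetical data): e -> a -> b <-> c -> d
toy = {"a": ["b"], "b": ["c"], "c": ["b", "d"], "d": [], "e": ["a"]}
core, in_part, out_part = bow_tie(toy)
```

On the toy graph the core is {b, c}, IN is {a, e} and OUT is {d}. Weakly connected components can be obtained with the same `reachable` helper by running it on the union of the graph and its reverse.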
A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. Wiener: Graph
structure in the Web, Computer Networks, Vol. 33, No. 1. (June 2000), pp. 309-320.
Web Size Estimation
Use the ClueWeb09 dataset to estimate the size of the Web with the method developed by Bharat and
Broder. Implement a sampling procedure for picking pages uniformly at random from a search
engine and from the dataset. Implement a checking procedure for determining whether a particular page
is indexed by the search engine or is part of the dataset. Analyze the results and, using the figures about
ClueWeb09, estimate the size of the Web as indexed by the search engine.
Use the public search API to communicate with a search engine of your choice.
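The arithmetic behind the overlap-based estimate can be sketched as follows. If D is the dataset and E the engine index, then |D ∩ E| is roughly |D| times the fraction of D-samples found in E, and also roughly |E| times the fraction of E-samples found in D; equating the two gives |E| ≈ |D| · (fraction of D in E) / (fraction of E in D). The Python sketch below only illustrates this ratio. The function names and the membership callbacks `in_engine` and `in_dataset` are hypothetical placeholders for your own sampling and checking procedures, and the sketch says nothing about how to obtain uniform samples, which is the hard part of the project.

```python
def overlap_fraction(sample, is_member):
    """Fraction of sampled URLs for which the membership check succeeds."""
    sample = list(sample)
    if not sample:
        raise ValueError("sample must not be empty")
    return sum(1 for url in sample if is_member(url)) / len(sample)


def estimate_engine_size(dataset_size, dataset_sample, engine_sample,
                         in_engine, in_dataset):
    """Estimate |E| from |D| and the two sampled overlap fractions."""
    frac_dataset_in_engine = overlap_fraction(dataset_sample, in_engine)
    frac_engine_in_dataset = overlap_fraction(engine_sample, in_dataset)
    if frac_engine_in_dataset == 0:
        raise ValueError("no sampled engine page was found in the dataset")
    return dataset_size * frac_dataset_in_engine / frac_engine_in_dataset


# Hypothetical toy data: sets stand in for the real index and dataset lookups.
engine_index = {"u1", "u2", "e1", "e2", "e3"}
dataset_pages = {"u1", "u2", "d1", "d2"}

estimate = estimate_engine_size(
    dataset_size=1000,
    dataset_sample=["u1", "u2", "d1", "d2"],   # 2 of 4 are in the engine
    engine_sample=["u1", "e1", "e2", "e3"],    # 1 of 4 is in the dataset
    in_engine=lambda url: url in engine_index,
    in_dataset=lambda url: url in dataset_pages,
)
```

With these toy numbers the overlap fractions are 0.5 and 0.25, so the estimate is twice the dataset size. In practice both fractions carry sampling error, so it is worth reporting the sample sizes alongside the estimate.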
K. Bharat, A. Broder: A technique for measuring the relative size and overlap of public Web search
engines, Computer Networks, Vol. 30, No. 1-7. (1998), pp. 379-388.
Web Crawling
Report how many
changed pages you have detected. Crawl the web site regularly for several weeks before the experiment in
order to gain insight into the behaviour of the different pages. Try to estimate the change rates of the
pages with the approach developed by Cho and Garcia-Molina. Use the change rate information to design
an optimal crawling strategy for the experiment.
We recommend using the open source web crawler Heritrix.
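As an illustration of change rate estimation, the sketch below assumes that a page is visited n times at a fixed interval and that each visit only reveals whether the page changed since the previous visit. It uses a bias-reduced estimator of the form -log((n - X + 0.5) / (n + 0.5)) divided by the visit interval, where X is the number of visits that detected a change. This form is given from memory of the Cho and Garcia-Molina paper, so verify it against the paper before relying on it. The function name and parameters are made up for this example.

```python
import math


def estimate_change_rate(num_visits, num_changes_detected, visit_interval):
    """Estimated changes per time unit for a page visited at regular intervals.

    num_visits            -- number of visits n (each compared with the previous copy)
    num_changes_detected  -- number of visits X on which a change was detected
    visit_interval        -- time between consecutive visits, e.g. in days
    """
    if num_visits <= 0 or visit_interval <= 0:
        raise ValueError("num_visits and visit_interval must be positive")
    if not 0 <= num_changes_detected <= num_visits:
        raise ValueError("num_changes_detected must lie between 0 and num_visits")
    unchanged = num_visits - num_changes_detected
    # The +0.5 terms keep the logarithm finite when every visit saw a change.
    ratio = -math.log((unchanged + 0.5) / (num_visits + 0.5))
    return ratio / visit_interval


# Hypothetical history: 20 daily visits, 6 of which detected a change.
rate_per_day = estimate_change_rate(20, 6, 1.0)
```

The naive estimate X / (n · interval) undercounts pages that change more than once between visits; the logarithmic form compensates for that under a Poisson change model. Pages with a higher estimated rate are natural candidates for more frequent visits during the experiment.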
- Crawl the pages of the MPII site at least once between 14.07.2009 and 17.07.2009 (~10300).
- Use 100 additional queries to find changes.
- Maximize the detected changes from different web pages.
- All pages must match the regular expression
- All pages must not be of the form
J. Cho, H. Garcia-Molina: Estimating frequency of change, ACM Transactions on Internet Technology,
Vol. 3, No. 3 (2003), pp. 256-290.
J. Cho, A. Ntoulas: Effective change detection using sampling, Proceedings of the 28th International
Conference on Very Large Data Bases, Hong Kong, China (2002), pp. 514-525.