INEX 2009 collection

Introduction

The INEX initiative has used an XML collection based on Wikipedia articles since 2006. While that collection was a great improvement over the older IEEE articles in terms of topical diversity and size, it proved difficult to find queries with meaningful constraints on the structure of results. Starting in 2009, INEX uses a new set of Wikipedia XML articles that additionally provides semantic markup of articles and outgoing links. The markup is based on the semantic knowledge base YAGO and explicitly labels more than 5,800 classes of entities such as persons, movies, and cities. Because the new collection was created from a recent Wikipedia dump, it contains approximately four times as many articles as the 2006 collection and is approximately ten times larger in size.
This collection was created from the October 8, 2008 dump of the English Wikipedia articles and incorporates semantic annotations from the 2008-w40-2 version of YAGO.
For a more technical description of a preliminary version of this collection, see: Ralf Schenkel, Fabian M. Suchanek, and Gjergji Kasneci: YAWN: A semantically annotated Wikipedia XML corpus, 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW 2007), Aachen, Germany, March 2007.

Each of the collections comes in four parts, each containing approximately 660,000 articles; you need to download all four archives to get the full collection.
Official INEX 2009 collection:
This collection includes semantic annotations for articles and outgoing links, based on the WordNet concepts YAGO assigns to Wikipedia articles. Any runs submitted to the INEX benchmark should be evaluated on this collection (there may be some differences in text positions in the other collections).
- part 1 (1.355 GiB)
- part 2 (1.356 GiB)
- part 3 (1.358 GiB)
- part 4 (1.352 GiB)
- DTD (defines only XML entities, not document structure)
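The note that the DTD defines only XML entities has a practical consequence: a parser needs the entity declarations to resolve named entities, but cannot use the DTD to validate document structure. The following is a minimal Python sketch of that behaviour, not a description of the actual DTD. The element name `article` and the entity `nbsp` are illustrative assumptions, and the declaration is inlined in the document because Python's `xml.etree.ElementTree` does not load external DTD files.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document: the element name and the "nbsp" entity
# are illustrative assumptions, not taken from the real collection or DTD.
WITH_DECLARATION = (
    '<!DOCTYPE article [<!ENTITY nbsp "&#160;">]>'
    "<article>INEX&nbsp;2009</article>"
)

# The same document without any entity declaration.
WITHOUT_DECLARATION = "<article>INEX&nbsp;2009</article>"


def article_text(xml_text: str) -> str:
    """Parse one document and return the text of its root element."""
    root = ET.fromstring(xml_text)
    return root.text or ""


def parses(xml_text: str) -> bool:
    """Return True if the document is well-formed for the parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # With the declaration, the entity is expanded to a no-break space.
    print(repr(article_text(WITH_DECLARATION)))
    # Without it, the parser rejects the undefined entity.
    print(parses(WITHOUT_DECLARATION))
```

For the real files, a parser that resolves external DTDs, or a preprocessing step that supplies the entity declarations, would be needed; which one fits best depends on the toolchain.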
INEX 2009 collection without annotation tags (unofficial):
This collection, with an uncompressed size of about 30 GiB, does not include semantic annotations.
- part 1 (1.103 GiB)
- part 2 (1.104 GiB)
- part 3 (1.105 GiB)
- part 4 (1.101 GiB)
- DTD (defines only XML entities, not document structure)
INEX 2009 collection as plain text files (unofficial):
This collection, with an uncompressed size of about 12 GiB, contains only the plain textual content without any markup (not yet available).
- part 1 (0 GiB)
- part 2 (0 GiB)
- part 3 (0 GiB)
- part 4 (0 GiB)
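As a rough planning aid, the compressed archive sizes listed above can be totalled to estimate the download volume of each variant. The short Python sketch below only restates the figures from the lists; the variable and function names are illustrative, and the plain-text variant is omitted because its archives are not yet available.

```python
# Compressed archive sizes in GiB, copied from the download lists above.
OFFICIAL_PARTS = [1.355, 1.356, 1.358, 1.352]
UNANNOTATED_PARTS = [1.103, 1.104, 1.105, 1.101]

# Approximate number of articles per part, as stated in the introduction.
ARTICLES_PER_PART = 660_000


def total_gib(parts):
    """Sum the part sizes, rounded to the precision used in the lists."""
    return round(sum(parts), 3)


def approx_articles(n_parts=4, per_part=ARTICLES_PER_PART):
    """Approximate article count of a full collection."""
    return n_parts * per_part


if __name__ == "__main__":
    print("official:", total_gib(OFFICIAL_PARTS), "GiB")        # 5.421 GiB
    print("unannotated:", total_gib(UNANNOTATED_PARTS), "GiB")  # 4.413 GiB
    print("articles (approx.):", approx_articles())             # 2640000
```

These totals cover the compressed archives only; the uncompressed collections are considerably larger (about 30 GiB for the variant without annotation tags).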

Corresponding YAGO Release:
The annotations in the official INEX 2009 collection were created using YAGO version 2008-w40-2, available here (0.994 GiB).