The INEX initiative has used an XML collection based on Wikipedia articles since 2006. While that collection was a great improvement over the old IEEE articles in terms of topical diversity and size, it proved difficult to find queries with meaningful constraints on the structure of results. Starting in 2009, INEX uses a new set of Wikipedia XML articles that additionally provides semantic markup of articles and outgoing links. The markup is based on the semantic knowledge base YAGO and explicitly labels more than 5,800 classes of entities such as persons, movies, and cities. Because this new collection was created from a more recent Wikipedia dump, it contains approximately four times as many articles as the 2006 collection and is approximately ten times larger in size.
This collection was created from the October 8, 2008 dump of the English Wikipedia articles and incorporates semantic annotations from the 2008-w40-2 version of YAGO.
For a more technical description of a preliminary version of this collection, see: Ralf Schenkel, Fabian M. Suchanek, and Gjergji Kasneci: YAWN: A semantically annotated Wikipedia XML corpus, 12. GI-Fachtagung für Datenbanksysteme in Business, Technologie und Web (BTW 2007), Aachen, Germany, March 2007.
Each of the collections comes in four parts, each containing approximately 660,000 articles (roughly 2.6 million articles in total); you need to download all four archives to get the full collection.
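Since the full collection is only complete once all four parts are present, it can be useful to check the article count after downloading. The following is a minimal sketch, not an official tool: it assumes the parts are tar archives (optionally compressed) and that each article is stored as a separate file ending in `.xml`. Neither assumption is stated on this page, and the function names `iter_article_names` and `count_articles` are made up for illustration.

```python
import tarfile
from pathlib import Path
from typing import Iterable, Iterator


def iter_article_names(archive_paths: Iterable[Path]) -> Iterator[str]:
    """Yield the member name of every .xml file in the given tar archives.

    Assumption: one article per .xml file; adjust the suffix check if the
    archives are laid out differently.
    """
    for path in archive_paths:
        # "r:*" lets tarfile detect gzip/bzip2/xz compression automatically.
        with tarfile.open(path, "r:*") as archive:
            for member in archive:
                if member.isfile() and member.name.endswith(".xml"):
                    yield member.name


def count_articles(archive_paths: Iterable[Path]) -> int:
    """Count article files across all parts of the collection."""
    return sum(1 for _ in iter_article_names(archive_paths))
```

Calling `count_articles` with the paths of the four downloaded archives should give a total of roughly four times 660,000 if all parts are complete; a much smaller number suggests a missing or truncated part.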
Official INEX 2009 collection:
This collection includes semantic annotations for articles and outgoing links, based on the WordNet concepts YAGO assigns to Wikipedia articles.
Any runs submitted to the INEX benchmark should be evaluated on this collection (there may be some differences in text positions in the other collections).
DTD (defines only XML entities, not document structure)
INEX 2009 collection without annotation tags: (unofficial)
This collection, with an uncompressed size of about 30 GiB, does not include semantic annotations.
DTD (defines only XML entities, not document structure)
INEX 2009 collection as plain text files: (unofficial)
This collection, with an uncompressed size of about 12 GiB, does not include any markup, only the plain textual content. (not yet available)