AIDA: Accurate Online Disambiguation of Named Entities in Text and Tables


http://www.mpi-inf.mpg.de/yago-naga/aida/downloads.html

AIDA Downloads

The AIDA source code is available on github.com/yago-naga/aida. For AIDA to work, you need to download our YAGO-based entity repository and import it into a PostgreSQL server. Further installation instructions are included in the source release.

Downloadable files:

AIDA source code
AIDA-light source code - an alternative to AIDA that provides high output quality and fast run-time with a Java-native API.
AIDA_entity_repository_2010-08-17v5-1.sql.bz2 (21 GB) - the repository used in the original EMNLP 2011 publication, and the reference for comparison.
AIDA_entity_repository_2012-11-01v5-1.sql.bz2 (31 GB) - a repository built from a more recent Wikipedia dump. Use this if you want to disambiguate to more recent entities and do not need results that are comparable with the published experiments.
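
The installation instructions in the source release are authoritative for the import step. As a rough illustration only, the sketch below streams a compressed dump into a command's standard input without unpacking it to disk first. It assumes that the dump is plain SQL that psql can read from stdin, that the psql client is installed, and that the target database already exists; the database name `aida` and both function names are hypothetical.

```python
import bz2
import shutil
import subprocess


def stream_bz2_to_command(dump_path, command):
    """Decompress dump_path on the fly and pipe the bytes to command's stdin.

    Returns the exit code of the command.
    """
    with bz2.open(dump_path, "rb") as dump:
        proc = subprocess.Popen(command, stdin=subprocess.PIPE)
        try:
            # Copy in 1 MiB chunks so the dump is never held in memory as a whole
            shutil.copyfileobj(dump, proc.stdin, 1024 * 1024)
        finally:
            proc.stdin.close()
        return proc.wait()


def import_repository(dump_path, dbname="aida"):
    """Feed the decompressed dump to psql (dbname is an assumed, hypothetical name)."""
    # ON_ERROR_STOP makes psql abort at the first failing statement
    psql = ["psql", "-d", dbname, "-v", "ON_ERROR_STOP=1"]
    return stream_bz2_to_command(dump_path, psql)
```

Calling `import_repository("AIDA_entity_repository_2012-11-01v5-1.sql.bz2")` would then run the import; an exit code of 0 indicates that psql finished without error. Expect the import to take a long time and to need considerably more disk space than the compressed file.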

AIDA CoNLL-YAGO Dataset Download

The dataset used in the experiments in our EMNLP 2011 paper, Robust Disambiguation of Named Entities in Text, can be downloaded here:

aida-yago2-dataset.zip (419 KB)

The dataset was updated on 2013-11-21 to add Wikipedia IDs and Freebase MIDs (all but 7 of the MIDs are included).

It contains assignments of entities to the mentions of named entities annotated for the original CoNLL 2003 entity recognition task. The entities are identified by YAGO2 entity name, by Wikipedia URL, or by Freebase MID (thanks to Massimiliano Ciaramita from Google Zürich for creating the Wikipedia/Freebase mapping and making it available to us). The zip contains a README.txt with details about the format, as well as instructions on how to create the dataset from the original CoNLL 2003 dataset, which is required and not included in the download.

We also provide the mention-entity candidate mapping used in our experiments in Robust Disambiguation of Named Entities in Text. It is an extension of the YAGO2 means relation:

aida_means.tsv.bz2 (156 MB)

This file contains two tab-separated columns. The first column is a quoted string denoting a potential mention that can be recognized in the input text, and the second column is one entity candidate for this mention. Both columns are encoded in the YAGO2 format; see the YAGO2 downloads for decoding utilities.
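
To illustrate the two-column layout described above, here is a minimal Python sketch that reads the compressed file and groups the entity candidates by mention. It is not part of the AIDA distribution: the function name `load_means` and the `limit` parameter are hypothetical, UTF-8 is an assumed file encoding, and the sketch leaves the YAGO2 encoding of both columns untouched.

```python
import bz2
from collections import defaultdict


def load_means(path, limit=None):
    """Map each mention string to the set of its entity candidates.

    path  -- location of aida_means.tsv.bz2
    limit -- optional maximum number of rows to read (useful for a quick look)
    """
    candidates = defaultdict(set)
    # Assumption: the file can be read as UTF-8 text
    with bz2.open(path, "rt", encoding="utf-8") as rows:
        for i, row in enumerate(rows):
            if limit is not None and i >= limit:
                break
            mention, tab, entity = row.rstrip("\n").partition("\t")
            if not tab:
                continue  # skip rows that do not have two columns
            # The first column is a quoted string; drop the enclosing quotes
            if len(mention) >= 2 and mention[0] == '"' and mention[-1] == '"':
                mention = mention[1:-1]
            # Values stay YAGO2-encoded; decoding is left to the YAGO2 utilities
            candidates[mention].add(entity)
    return candidates
```

Because each row holds exactly one candidate, a mention with several candidates appears on several rows, which is why the sketch collects the candidates into a set per mention. Reading the whole file this way keeps the entire mapping in memory; passing a small `limit` is a cheap way to inspect the format first.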

AIDA-EE Dataset Download

The dataset used in the experiments in our WWW 2014 paper, Discovering Emerging Entities with Ambiguous Names, can be downloaded here:

AIDA-EE.tar.gz (119 KB)

The AIDA-EE Dataset contains 300 documents with 9,976 entity names linked to Wikipedia (2010-08-17 dump). The documents themselves are taken from the APW part of the GIGAWORD5 dataset, with 150 documents from 2010-10-01 (development data) and 150 documents from 2010-11-01 (test data). Due to licensing issues, we do not provide the document content, just the offsets with the entity annotations.

KORE Datasets Download

The datasets used in the experiments in our CIKM 2012 paper, KORE: Keyphrase Overlap Relatedness for Entity Disambiguation, can be downloaded here:

KORE_entity_relatedness.tar.gz (5 KB)
KORE50.tar.gz (5 KB)

License

All datasets are licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.