Data Release: Large-scale cross-document event alignment data set IndIA

The ICL Computational Linguistics Group releases a large-scale cross-document alignment data set derived from the Gigaword corpus. The resource builds on the research described in Roth and Frank 2012a,b, Roth and Frank 2013, and the PhD thesis Roth 2014.

The data contains four resources:

  • Comparable news texts extracted from the Gigaword corpus, identified by document ID
  • Gold-standard alignments between predicate-argument structures across these texts
  • Automatically computed high-precision alignments for the full pairwise corpus
  • Induced instances of implicit arguments and their discourse antecedents

The resources can be found in the download section. See also the Data and Resources page of the Computational Linguistics Group.

