Ruprecht-Karls-Universität Heidelberg

A Textual Entailment Search Task Dataset for German



Textual Entailment (TE) is the task of detecting semantic inference. Given a Text T and a Hypothesis H, textual entailment holds if a human reading of T would infer that H is most likely true.
The Pascal Recognising Textual Entailment (RTE) challenges are the most important forum for research on Textual Entailment. A number of datasets have been created for these challenges, each incorporating the properties of a particular task, such as semantic search in RTE-5 or novelty detection in RTE-7.

However, RTE has focused almost exclusively on English as a target language, and is based on clean data, e.g. from newspapers or Wikipedia. Therefore, it is hard to evaluate the performance of entailment algorithms in terms of both language and genre independence.

We created a German dataset for Textual Entailment that is derived from social media data. We concentrate on a search task in a user forum that deals with computer problems: given a problem statement formulated by a user, identify all relevant forum threads that describe this problem.

For details please refer to this paper. If you use the resource, please cite the paper as shown below.

This dataset was created in the context of the EC-funded project EXCITEMENT (EXploring Customer Interactions through Textual EntailMENT). Please also refer to the official project website.


Dataset

The dataset follows the format of the RTE data: it consists of pairs of a Text and a Hypothesis. The Texts are forum entries collected from the internet. The Hypotheses, which represent search queries for the forum threads, were created by crowdsourcing. In total, the resource contains more than 3,000 Text/Hypothesis pairs, split into a development set and a test set of equal size. Each set contains 86 positive and 1,421 negative entailment pairs (1,507 pairs per set).

The dataset can be downloaded in RTE3 format: germanSocialMedia_rte3.zip
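
To illustrate the pair format, here is a minimal Python sketch for reading RTE-3-style XML and counting labels. It assumes the usual RTE-3 layout (`pair` elements with an `entailment` attribute and `t`/`h` children). The sample pairs, the `YES`/`NO` label spellings, and the function names `read_pairs` and `label_counts` are invented for illustration; the actual files in the archive may use different attribute values, so check them before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-pair sample in RTE-3 layout. The texts, ids, and label
# spellings are made up for illustration and are NOT taken from the dataset.
SAMPLE = """<entailment-corpus>
  <pair id="1" entailment="YES" task="IR">
    <t>Mein Drucker druckt nur noch leere Seiten, obwohl die Patrone neu ist.</t>
    <h>Drucker druckt leere Seiten</h>
  </pair>
  <pair id="2" entailment="NO" task="IR">
    <t>Nach dem Update startet der Rechner nicht mehr.</t>
    <h>Drucker druckt leere Seiten</h>
  </pair>
</entailment-corpus>"""


def read_pairs(xml_text):
    """Parse RTE-3-style XML into a list of dicts, one per Text/Hypothesis pair."""
    root = ET.fromstring(xml_text)
    pairs = []
    for pair in root.iter("pair"):
        pairs.append({
            "id": pair.get("id"),
            "label": pair.get("entailment"),
            "text": (pair.findtext("t") or "").strip(),
            "hypothesis": (pair.findtext("h") or "").strip(),
        })
    return pairs


def label_counts(pairs):
    """Count how often each entailment label occurs."""
    return Counter(p["label"] for p in pairs)


pairs = read_pairs(SAMPLE)
counts = label_counts(pairs)
print(dict(counts))

# Class balance implied by the figures stated above:
# 86 positive and 1421 negative pairs per set.
positive_share = 86 / (86 + 1421)
print(f"positive share per set: {positive_share:.3f}")
```

With the stated figures, positive pairs make up about 5.7% of each set, so plain accuracy is a weak measure here; a system that always answers "no entailment" would already be right on most pairs.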


License

This dataset is made available under the Creative Commons license CC BY-SA 3.0. By downloading the dataset, you accept the terms and conditions of the CC BY-SA license.


References

@inproceedings{zellerPado13:gerDataset,
  author = {Zeller, Britta and Pad{\'o}, Sebastian},
  title = {{A Search Task Dataset for German Textual Entailment}},
  booktitle = {{Proceedings of the 10th International Conference on Computational Semantics (IWCS)}},
  year = {2013},
  pages = {288--299},
  address = {Potsdam},
  url = {http://www.cl.uni-heidelberg.de/~zeller/publications/iwcs2013.pdf}
}