README for creating the SR3de "Semantic Role Triple German Dataset"

Short description:
------------------
The SR3de parallel dataset contains semantic role (SR) annotations for approximately 3000 predicates in the German data of the CoNLL 2009 Shared Task on "Syntactic and Semantic Dependencies in Multiple Languages" [https://ufal.mff.cuni.cz/conll2009-st/].
The annotations cover three semantic role labeling frameworks:
- PropBank-style labeling (PB) [https://ufal.mff.cuni.cz/conll2009-st/]
- FrameNet-style labeling (FN) [http://www.coli.uni-saarland.de/projects/salsa/]
- VerbNet-style labeling (VN) [http://projects.cl.uni-heidelberg.de/GNVN_semanno/]
All three SR frameworks are adapted to German data.

Requirements:
-------------
- Python 3.4 or later (tested with 3.6.1 on Mac OS 10.11.6)
- Licence and dataset for
  * PB: CoNLL 2009 Shared Task German Data [https://catalog.ldc.upenn.edu/LDC2012T03] (CoNLL2009)
  * FN: SALSA (at least 1.0) [http://www.coli.uni-saarland.de/projects/salsa/] (SALSA)
  * VN: SR3de_VN-onlySR (provided within this package)

Use:
----
To create the parallel data set, run the provided Python script SR3dePy/main_create_SR3de.py.
This script reads the provided VerbNet-like annotation (SR3de_VN-onlySR, also separately downloadable under [http://projects.cl.uni-heidelberg.de/SR3de/material/SR3de_VN-onlySR.zip]) as well as the CoNLL2009 and SALSA annotations, and creates the parallel data set in CoNLL2009 file format.
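
For readers who have not worked with the CoNLL2009 file format, the following is a minimal sketch of how predicates and their arguments can be read from such a file. It is not part of the SR3dePy package. It assumes the standard CoNLL 2009 Shared Task column layout (tab-separated, one token per line, a blank line between sentences); the function names and the sample sentence are made up for illustration, and whether the generated files fill every column exactly like this is an assumption.

```python
# Column indices (0-based) of the CoNLL 2009 Shared Task format:
# ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL
# FILLPRED PRED APRED1 ... APREDn
FORM, PRED, FIRST_APRED = 1, 13, 14


def read_sentences(lines):
    """Yield each sentence as a list of token rows (lists of column strings)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line.split("\t"))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


def predicate_frames(sentence):
    """Return one (predicate form, sense label, [(role, head form), ...]) per predicate."""
    predicates = [row for row in sentence if len(row) > PRED and row[PRED] != "_"]
    frames = []
    for k, pred_row in enumerate(predicates):
        column = FIRST_APRED + k  # the k-th predicate owns the k-th APRED column
        args = [(row[column], row[FORM]) for row in sentence
                if len(row) > column and row[column] != "_"]
        frames.append((pred_row[FORM], pred_row[PRED], args))
    return frames


# Made-up sample sentence with PB-style labels (illustration only, not taken
# from the dataset). Columns are written with spaces and joined with tabs.
_ROWS = [
    "1 Peter Peter Peter NE NE _ _ 2 2 SB SB _ _ A0",
    "2 kauft kaufen kaufen VVFIN VVFIN _ _ 0 0 ROOT ROOT Y kaufen.1 _",
    "3 ein ein ein ART ART _ _ 4 4 NK NK _ _ _",
    "4 Buch Buch Buch NN NN _ _ 2 2 OA OA _ _ A1",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in _ROWS) + "\n"

for sent in read_sentences(SAMPLE.splitlines()):
    for form, sense, args in predicate_frames(sent):
        print(form, sense, args)
```

To apply the sketch to a generated file, an open file handle can be passed to read_sentences instead of the sample lines, for example open("SR3de/PB/sr3de_dev.conll", encoding="utf-8"), with the path taken relative to the chosen output path.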

Steps:
1) unzip SR3de.zip (the unzipped folder already contains the VN-style predicate-argument structure annotation)
2) start main_create_SR3de.py from the command line like this:
   python main_create_SR3de.py -pb -fn -o
   additional optional parameter: -vn

Output:
In the given output path, the following folder and file structure will be generated:
+- SR3de
   +- FN
      +- sr3de_dev.conll
      +- sr3de_test.conll
      +- sr3de_train.conll
   +- PB
      +- sr3de_dev.conll
      +- sr3de_test.conll
      +- sr3de_train.conll
   +- VN
      +- sr3de_dev.conll
      +- sr3de_test.conll
      +- sr3de_train.conll

Cautions:
- In the CoNLL file format, each annotation is placed on the head token of a predicate or an argument.
- Only one annotation is allowed per head, so where the SALSA or the original VN-style dataset has multiple annotations on one head, only one of them is kept.
- The CoNLL format does not allow annotating an argument beyond the current sentence boundary, so such cross-sentence dependencies are not included in the parallel data.

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/.

If using the dataset, please cite the following publication:
Silvana Hartmann, Éva Mújdricza-Maydt, Ilia Kuznetsov, Iryna Gurevych and Anette Frank (2017): Assessing SRL Frameworks with Automatic Training Data Expansion. Proceedings of the 11th Linguistic Annotation Workshop (LAW-XI 2017).

Website: http://projects.cl.uni-heidelberg.de/SR3de/
Contact: http://projects.cl.uni-heidelberg.de/SR3de/contact.shtml (Eva Mujdricza-Maydt, mujdricza@cl.uni-heidelberg.de)

Versions:
---------
20170307 - readme augmented
20170305 - SALSA data extraction completed
20170403 - initial version online