GUTENBERG data:
---------------

German-English parallel corpus containing 114 novels (111 training + 3 test file pairs) from the 19th century.

This corpus is a slightly differing variant of the "Bilingual Formal / Informal Address Corpus" by Manaal Faruqui and Sebastian Padó, provided on the website http://www.nlpado.de/~sebastian/data/tv_data.shtml.

The differences between the two corpora:

- Our data contains 114 novel pairs (see NOTE at the end of this readme file):
    * 111 file pairs belong to the training set; the remaining three file pairs form the test set. We did not use a development set.
    * The assignment of files to the training and test sets differs from that in the "Bilingual Formal / Informal Address Corpus"; see the comparison below.
- We split the files into three folders in both the training and the test set:
    * de/: all German texts, with file names ending in .txt
    * en/: all English texts, with file names ending in .txt
    * align/: sentence-wise alignments from the German to the English texts, produced by Gargantua, with file names ending in _alignInfo.txt. For the test set, we additionally provide manual alignments, with file names ending in .goldAlign
- The text files in our data contain only the tokenized text, one sentence per line.
- The alignment files in our data hold the Gargantua output, with the sentence-wise alignments and the word-wise alignments from GIZA++. (We only use the sentence-wise alignments.)

The novel list compared to the "Bilingual Formal / Informal Address Corpus":

The first column contains "tr", "d" or "te" if the novel belongs to the training set, development set or test set of the "Bilingual Formal / Informal Address Corpus", respectively. A "-" means that the text is not part of that corpus. In this case, we give the title, author, and original language of the novel in brackets after the file name.

TRAINING SET:

tr  2staedte.txt
te  80tage.txt
-   abt.txt (The Abbot, Walter Scott, English)
tr  altertu.txt
tr  annakare.txt
tr  auferste.txt
d   ballon.txt
te  ball.txt
tr  baske.txt
tr  beast.txt
te  belami.txt
tr  bernac.txt
tr  bleakhau.txt
tr  bovary.txt
-   bravo.txt (The Bravo, James Fenimore Cooper, English)
tr  bubbl.txt
tr  chabert.txt
tr  chagrinl.txt
tr  city.txt
tr  clerg.txt
tr  copperf.txt
d   corinna.txt
tr  crusoe.txt
tr  dbuch.txt
tr  denkwuer.txt
tr  dombey.txt
tr  donnerst.txt
te  dorfpfar.txt
tr  dorrit.txt
tr  einfalt.txt
tr  eugenie.txt
tr  evastoch.txt
tr  facino.txt
tr  fifi.txt
tr  finan.txt
tr  fire.txt
tr  frauvon.txt
tr  fromont.txt
tr  geist001.txt
tr  georg.txt
tr  grendier.txt
tr  hanspete.txt
tr  head.txt
te  idiot.txt
tr  illusion.txt
tr  immensee.txt
te  imray.txt
-   ivanhoe.txt (Ivanhoe, Walter Scott, English)
tr  jungfer.txt
tr  kandergr.txt
te  karamaso.txt
-   kenilwo.txt (Kenilworth, Walter Scott, English)
te  king.txt
te  k-medici.txt
d   kornelli.txt
tr  kreutzer.txt
tr  kriegfri.txt
tr  kurtisan.txt
tr  landarzt.txt
tr  lebewohl.txt
d   lord.txt
d   lourdes.txt
tr  marigold.txt
tr  masketo.txt
tr  menschki.txt
te  menschle.txt
te  mittelpu.txt
d   mohikan.txt
tr  monikins.txt
d   moni.txt
tr  moti.txt
te  muehle.txt
tr  nabob.txt
tr  nabot.txt
d   nickleby.txt
d   onkeltom.txt
d   pambe.txt
d   passa.txt
d   pension.txt
d   pferdege.txt
d   pickwick.txt
d   pique-da.txt
-   quentind.txt (Quentin Durward, Walter Scott, English)
te  refugies.txt
d   ricks.txt
d   rotkr.txt
tr  salambo.txt
d   satansto.txt
tr  schlemil.txt
tr  schuldsu.txt
tr  schwzeit.txt
tr  shush.txt
tr  silveste.txt
-   simpl.txt (Simplicius Simplicissimus, Hans Jakob Christoffel von Grimmelshausen, German)
tr  starktod.txt
-   talisman.txt (The Talisman, Walter Scott, English)
tr  tartalp.txt
tr  tartarin.txt
tr  tobra.txt
tr  toteseel.txt
tr  twist.txt
tr  unsterbl.txt
tr  vendetta.txt
-   verlobt.txt (The Betrothed, Walter Scott, English)
d   vettpons.txt
tr  wasserni.txt
tr  waverley.txt
tr  weihlied.txt
tr  werther.txt
tr  zeichen.txt
tr  zwerg.txt

TEST SET:

tr  ehestand.txt
tr  morro.txt
tr  rotewirt.txt

NOTE that for our experiments, we additionally used the following novel in the training set, which is not part of the "Bilingual Formal / Informal Address Corpus":

-   moby1001.txt (Moby Dick, or, The Whale, Herman Melville, English)

We cannot provide this novel because its German version is still subject to copyright.

Contact:
Eva Mujdricza-Maydt  mujdricz@cl.uni-heidelberg.de
Huiqin Qu  huiqin@cl.uni-heidelberg.de
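
Example: collecting the file triples of one set

The folder layout described in this readme (de/, en/, align/) can be walked with a short script. The following is only a sketch under assumptions, not part of the original tooling: it assumes that the German text, the English text and the alignment file of a novel share the same base name (for example werther.txt, werther.txt and werther_alignInfo.txt). The function names list_novel_triples and count_sentences are hypothetical and chosen for illustration. The format of the Gargantua output itself is not parsed here.

```python
from pathlib import Path


def list_novel_triples(split_dir):
    """Collect (German text, English text, alignment file) paths for one set.

    Assumption: de/<name>.txt, en/<name>.txt and align/<name>_alignInfo.txt
    share the same base name <name>. Novels missing any of the three files
    are skipped.
    """
    split = Path(split_dir)
    triples = []
    for de_file in sorted((split / "de").glob("*.txt")):
        name = de_file.stem
        en_file = split / "en" / f"{name}.txt"
        align_file = split / "align" / f"{name}_alignInfo.txt"
        if en_file.is_file() and align_file.is_file():
            triples.append((de_file, en_file, align_file))
    return triples


def count_sentences(text_file):
    """Count sentences in a text file that holds one sentence per line.

    The file is read in binary mode so that no text encoding is assumed.
    """
    with open(text_file, "rb") as handle:
        return sum(1 for _ in handle)
```

A call such as list_novel_triples("train") would return one triple per novel, where "train" is a placeholder for wherever the training set folder is stored; this readme does not name the top-level folders. Comparing count_sentences for the German and the English file of a pair gives a quick check of how far the two texts differ in length before any alignment is used.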