Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik

Information Extraction and Applications

Module Description

Course              Module Abbreviation  Credit Points
BA-2010[100%|75%]   CS-CL                6 LP
BA-2010[50%]        BS-CL                6 LP
BA-2010[25%]        BS-AC                4 LP
BA-2010             AS-CL                8 LP
Master              SS-CL, SS-TAC        8 LP

Lecturer            Daniel Dahlmeier
Module Type         Proseminar / Hauptseminar
Language            English
First Session       19.04.2024
Time and Place      Friday, 08:15–09:45, online
                    In June: 15:15–16:45, online
Commitment Period   tbd.


Open to all advanced Bachelor students and all Master students. Students of Computer Science, Mathematics, or Scientific Computing with Computational Linguistics as their area of application (Anwendungsgebiet) are welcome.

Prerequisite for Participation

  • Mathematical Foundations of CL (or a comparable introductory class in linear algebra and probability theory)
  • Statistical Methods for CL (or a comparable introductory class in machine learning)


  • Regular attendance and active participation in the seminar (40%)
  • Independent study of the assigned scientific papers; clarity of report and presentation (60%)


This seminar focuses on information extraction (IE) and its applications to business documents. After an overview of traditional IE methods, we will discuss recent research on IE from form-like business documents, such as invoices and purchase orders.

Students will be assigned research papers to study and present in the seminar.

Module Overview




  • Daniel Jurafsky, James H. Martin. 2020. Speech and Language Processing
  • ACL 2020 tutorial.
  • Ibrahim, Yusra, Mirek Riedewald, Gerhard Weikum and Demetrios Zeinalipour-Yazti. Bridging Quantities in Tables and Text. ICDE (2019): 1010-1021.
  • Katti, Anoop R., Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne and Jean Baptiste Faddoul. Chargrid: Towards Understanding 2D Documents. EMNLP (2018).
  • Qian, Yujie, Enrico Santus, Zhijing Jin, Jiang Guo and Regina Barzilay. GraphIE: A Graph-Based Framework for Information Extraction. NAACL-HLT (2019).
  • Ratner, Alexander, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu and Christopher Ré. Snorkel: Rapid Training Data Creation with Weak Supervision. PVLDB 11(3) (2017): 269-282.
  • Wu, Sen, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis and Christopher Ré. Fonduer: Knowledge Base Construction from Richly Formatted Data. SIGMOD (2018): 1301-1316.
  • Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork. Representation Learning for Information Extraction from Form-like Documents
  • Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents. NAACL 2020
  • Timo I. Denk and Christian Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding.
  • Sebastian Riedel, Limin Yao, Andrew McCallum, Benjamin M. Marlin. Relation Extraction with Matrix Factorization and Universal Schemas
  • Milan Aggarwal, Hiresh Gupta, Mausoom Sarkar, Balaji Krishnamurthy. Form2Seq: A Framework for Higher-Order Form Structure Extraction
  • Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. LayoutLM: Pre-training of Text and Layout for Document Image Understanding
  • Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, Julian Martin Eisenschlos. TAPAS: Weakly Supervised Table Parsing via Pre-training
  • Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, Ming Zhou. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis
  • Müller M, Ghosh S, Rey M, Wittig U, Müller W, Strube M. 2020. Reconstructing Manual Information Extraction with DB-to-Document Backprojection: Experiments in the Life Science Domain. Proceedings of the First Workshop on Scholarly Document Processing, pages 81-90
  • Xavier Holt and Andrew Chisholm. 2018. Extracting structured data from invoices. In Proceedings of Australasian Language Technology Association Workshop
  • Clement Sage, Alex Aussem, Veronique Eglin, Haytham Elghazel, Jeremy Espinas. 2020. End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks. Proceedings of 4th Workshop on Structured Prediction for NLP
  • Chuwei Luo, Yongpan Wang, Qi Zheng, Liangcheng Li, Feiyu Gao, Shiyu Zhang. 2020. Merge and Recognize: A Geometry and 2D Context Aware Graph Model for Named Entity Recognition from Visual Documents. Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
  • Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan - A configuration-free invoice analysis system using recurrent neural networks. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 406-413. IEEE.
  • Rasmus Berg Palm, Florian Laws, and Ole Winther. 2018. Attend, Copy, Parse - End-to-end information extraction from documents. arXiv preprint arXiv:1812.07248.
