Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik


Information Extraction and Applications

Module Description

Course               Module Abbreviation   Credit Points
BA-2010[100%|75%]    CS-CL                 6 LP
BA-2010[50%]         BS-CL                 6 LP
BA-2010              AS-CL                 8 LP
Master               SS-CL, SS-TAC         8 LP
Lecturer Daniel Dahlmeier
Module Type Proseminar / Hauptseminar
Language English
First Session 16.04.2021
Time and Place Friday, 09:15-10:45, Online
Commitment Period tbd.

Prerequisite for Participation

  • Mathematical Foundations of CL (or a comparable introductory class to linear algebra and theory of probability)
  • Statistical Methods for CL (or a comparable introductory class to machine learning)


Assessment

  • Regular and active participation in the seminar (40%)
  • Independent study of the assigned scientific papers; clarity of the report and presentation (60%)


This seminar focuses on information extraction (IE) and its applications to business documents. After an overview of traditional IE methods, we will discuss recent research focusing on IE from form-like business documents, such as invoices or purchase orders.

Students will be assigned research papers for them to study and present in the seminar.
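To make the task concrete before the first session: a minimal, hypothetical sketch of rule-based field extraction from invoice-like text. All field names and patterns below are illustrative assumptions; the papers covered in the seminar replace such hand-written rules with learned models.

```python
import re

# Hypothetical field patterns for invoice-like text (illustration only).
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*(?:No\.?|#)\s*:?\s*(\S+)", re.I),
    "date": re.compile(r"Date\s*:?\s*(\d{2}\.\d{2}\.\d{4})", re.I),
    "total": re.compile(r"Total\s*:?\s*([\d.,]+\s*(?:EUR|USD))", re.I),
}

def extract_fields(text: str) -> dict:
    """Return the first match for each field pattern found in the text."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            fields[name] = match.group(1)
    return fields

doc = "Invoice No: INV-4711\nDate: 16.04.2021\nTotal: 1,250.00 EUR"
print(extract_fields(doc))
# → {'invoice_number': 'INV-4711', 'date': '16.04.2021', 'total': '1,250.00 EUR'}
```

Such pattern-based baselines break down as soon as the layout varies, which is exactly the motivation for the learning-based approaches on the reading list.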

Module Overview


Date Session Materials
  • Overview of information extraction and motivation for why IE is important for business documents.
  • Seminar logistics: the schedule for the upcoming weeks and distribution of the papers to read and present.
Lecture slides will be made available on Moodle before the seminar.
(The agenda will be updated during the course.)


  • Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing.
  • Tjong Kim Sang and De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition.
  • Florian et al. 2003. Named Entity Recognition through Classifier Combination.
  • Huang et al. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging.
  • Lample et al. 2016. Neural Architectures for Named Entity Recognition.
  • Akbik et al. 2018. Contextual String Embeddings for Sequence Labeling.
  • Yamada et al. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention.
  • Zhou et al. 2005. Exploring Various Knowledge in Relation Extraction. ACL.
  • Snow et al. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. NIPS.
  • Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling. TAC.
  • Riedel et al. 2013. Relation Extraction with Matrix Factorization and Universal Schemas.
  • Zhang et al. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. EMNLP.
  • Joshi et al. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans.
  • Qian et al. 2019. GraphIE: A Graph-Based Framework for Information Extraction.
  • Katti et al. 2018. Chargrid: Towards Understanding 2D Documents.
  • Liu et al. 2020. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents.
  • Denk and Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding.
  • Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents.
  • Xu et al. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding.
  • Li et al. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis.
  • Aggarwal et al. 2020. Form2Seq: A Framework for Higher-Order Form Structure Extraction.
  • Herzig et al. 2020. TAPAS: Weakly Supervised Table Parsing via Pre-training.

