Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik


Information Extraction and Applications

Module Description

Course                Module Abbreviation   Credit Points
BA-2010 [100%|75%]    CS-CL                 6 LP
BA-2010 [50%]         BS-CL                 6 LP
BA-2010               AS-CL                 8 LP
Master                SS-CL, SS-TAC         8 LP
Lecturer Daniel Dahlmeier
Module Type Proseminar / Hauptseminar
Language English
First Session 16.04.2021
Time and Place Friday, 09:15-10:45, Online
Commitment Period tbd.

Prerequisite for Participation

  • Mathematical Foundations of CL (or a comparable introductory class on linear algebra and probability theory)
  • Statistical Methods for CL (or a comparable introductory class on machine learning)

Assessment

  • Regular and active attendance of the seminar (40%)
  • Independent study of the assigned scientific papers; clarity of report and presentation (60%)

Content

This seminar focuses on information extraction (IE) and its applications to business documents. After an overview of traditional IE methods, we will discuss recent research focusing on IE from form-like business documents, such as invoices or purchase orders.
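
To make the setting concrete, below is a minimal Python sketch (a toy illustration, not taken from the seminar materials): it pulls two fields out of an invoice-like text with hand-written regular expressions. The field names and patterns are assumptions for demonstration only; the methods discussed in the seminar replace such brittle, hand-crafted patterns with models learned from annotated documents.

    # Toy baseline: regex-based field extraction from an invoice-like text.
    # All field names and patterns here are illustrative assumptions.
    import re

    INVOICE_TEXT = """
    Invoice No: INV-2021-0042
    Date: 16.04.2021
    Total Amount: 1,234.56 EUR
    """

    # Hand-written patterns stand in for what learned IE models infer from data.
    PATTERNS = {
        "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
        "total_amount": re.compile(r"Total Amount:\s*([\d.,]+ ?[A-Z]{3})"),
    }

    def extract_fields(text: str) -> dict:
        """Return a field -> value mapping for every pattern that matches."""
        fields = {}
        for name, pattern in PATTERNS.items():
            match = pattern.search(text)
            if match:
                fields[name] = match.group(1)
        return fields

    print(extract_fields(INVOICE_TEXT))
    # {'invoice_number': 'INV-2021-0042', 'total_amount': '1,234.56 EUR'}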

Students will be assigned research papers to study and present in the seminar.

Module Overview

Agenda

16.04.2021, 9:15–10:45
  • Overview of information extraction (IE) and motivation for why IE is important for business documents.
  • Seminar logistics, the schedule for the upcoming weeks, and distribution of the papers to read and present.
  Materials: Lecture slides and recording available on Moodle.

23.04.2021, 9:15–10:45
  • Introduction lecture: Named Entity Recognition
  Materials: Lecture slides and recording available on Moodle.

30.04.2021, 9:15–10:45
  • Introduction lecture: Relation Extraction
  Materials: Lecture slides and recording available on Moodle.

07.05.2021, 9:15–10:45
  1. Tong Yu: Tjong Kim Sang and De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
  2. Irina Wüst: Florian et al. 2003. Named Entity Recognition through Classifier Combination

14.05.2021, 9:15–10:45
  1. Hyunji Kim: Lample et al. 2016. Neural Architectures for Named Entity Recognition
  2. Carlos Rubiano: Huang et al. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging

21.05.2021, 9:15–10:45
  1. Dang Nguyen: Akbik et al. 2018. Contextual String Embeddings for Sequence Labeling
  2. Adil Chhabra: Yamada et al. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

28.05.2021, 9:15–10:45
  1. Claudia Rebmann: Joshi et al. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans
  2. Oliver Zobel: Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling

04.06.2021, 9:15–10:45
  1. Luisa Kriener: Zhou et al. 2005. Exploring Various Knowledge in Relation Extraction
  2. Leander Girrbach: Snow et al. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery

11.06.2021, 9:15–10:45
  1. Ufkun Menderes: Riedel et al. 2013. Relation Extraction with Matrix Factorization and Universal Schemas
  2. Jiahui Li: Zhang et al. 2017. Position-aware Attention and Supervised Data Improve Slot Filling

18.06.2021, 9:15–10:45
  1. Lisa Kuhn: Katti et al. 2018. Chargrid: Towards Understanding 2D Documents
  2. Ines Reinig: Denk and Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

25.06.2021, 9:15–10:45
  1. Dorian Heide: Qian et al. 2019. GraphIE: A Graph-Based Framework for Information Extraction
  2. Eileen Dickson: Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents

02.07.2021, 9:15–10:45
  1. Anne-Kathrin Bugert: Li et al. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis
  2. Ines Pisetta: Xu et al. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding

09.07.2021, 9:15–10:45
  1. Trang Nguyen Vu: Liu et al. 2019. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
  2. Christoph Schneider: Klaiman and Lehne. 2021. DocReader: Bounding-Box Free Training of a Document Information Extraction Model

16.07.2021, 9:15–10:45
  1. Claire Sun: Ratner et al. 2016. Data Programming: Creating Large Training Sets, Quickly
  2. Philipp Meier: Herzig et al. 2020. TAPAS: Weakly Supervised Table Parsing via Pre-training

23.07.2021, 9:15–10:45
  1. Arthur Neumüller: Banko et al. 2007. Open Information Extraction from the Web
  2. Benedetto Kotzaneck: Fader et al. 2011. Identifying Relations for Open Information Extraction

Literature

  • Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing
  • Tjong Kim Sang and De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
  • Florian et al. 2003. Named Entity Recognition through Classifier Combination
  • Huang et al. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging
  • Lample et al. 2016. Neural Architectures for Named Entity Recognition
  • Akbik et al. 2018. Contextual String Embeddings for Sequence Labeling
  • Yamada et al. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
  • Zhou et al. 2005. Exploring Various Knowledge in Relation Extraction. ACL.
  • Snow et al. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. NeurIPS.
  • Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling. TAC-13.
  • Riedel et al. 2013. Relation Extraction with Matrix Factorization and Universal Schemas
  • Zhang et al. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. EMNLP.
  • Joshi et al. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans
  • Qian et al. 2019. GraphIE: A Graph-Based Framework for Information Extraction
  • Katti et al. 2018. Chargrid: Towards Understanding 2D Documents
  • Liu et al. 2019. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
  • Denk and Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding
  • Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents
  • Xu et al. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding
  • Li et al. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis
  • Aggarwal et al. 2020. Form2Seq: A Framework for Higher-Order Form Structure Extraction
  • Herzig et al. 2020. TAPAS: Weakly Supervised Table Parsing via Pre-training

