Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik


Information Extraction and Applications

Module Description

Course                Module Abbreviation   Credit Points
BA-2010 [100%|75%]    CS-CL                 6 LP
BA-2010 [50%]         BS-CL                 6 LP
BA-2010               AS-CL                 8 LP
Master                SS-CL, SS-TAC         8 LP
Lecturer Daniel Dahlmeier
Module Type Proseminar / Hauptseminar
Language English
First Session 16.04.2021
Time and Place Friday, 09:15-10:45, Online
Commitment Period tbd.

Prerequisite for Participation

  • Mathematical Foundations of CL (or a comparable introductory class on linear algebra and probability theory)
  • Statistical Methods for CL (or a comparable introductory class on machine learning)

Assessment

  • Regular and active attendance of the seminar (40%)
  • Independent study of the assigned scientific papers; clarity of report and presentation (60%)

Content

This seminar focuses on information extraction (IE) and its applications to business documents. After an overview of traditional IE methods, we will discuss recent research focusing on IE from form-like business documents, such as invoices or purchase orders.
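
To make the setting concrete, below is a minimal Python sketch (a toy illustration, not taken from the seminar materials): it pulls two fields out of an invoice-like text with hand-written regular expressions. The field names and patterns are assumptions for demonstration only; the methods discussed in the seminar replace such brittle, hand-crafted patterns with models learned from annotated documents.

    # Toy baseline: regex-based field extraction from an invoice-like text.
    # All field names and patterns here are illustrative assumptions.
    import re

    INVOICE_TEXT = """
    Invoice No: INV-2021-0042
    Date: 16.04.2021
    Total Amount: 1,234.56 EUR
    """

    # Hand-written patterns stand in for what learned IE models infer from data.
    PATTERNS = {
        "invoice_number": re.compile(r"Invoice No:\s*(\S+)"),
        "total_amount": re.compile(r"Total Amount:\s*([\d.,]+ ?[A-Z]{3})"),
    }

    def extract_fields(text: str) -> dict:
        """Return a field -> value mapping for every pattern that matches."""
        fields = {}
        for name, pattern in PATTERNS.items():
            match = pattern.search(text)
            if match:
                fields[name] = match.group(1)
        return fields

    print(extract_fields(INVOICE_TEXT))
    # {'invoice_number': 'INV-2021-0042', 'total_amount': '1,234.56 EUR'}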

Students will be assigned research papers to study and present in the seminar.

Module Overview

Agenda

16.04.2021, 9:15–10:45
  • Overview of information extraction (IE) and motivation for why IE is important for business documents.
  • Seminar logistics, the schedule for the upcoming weeks, and distribution of the papers to read and present.
  Materials: Lecture slides and recording available on Moodle.

23.04.2021, 9:15–10:45
  • Introduction lecture: Named Entity Recognition
  Materials: Lecture slides and recording available on Moodle.

30.04.2021, 9:15–10:45
  • Introduction lecture: Relation Extraction
  Materials: Lecture slides and recording available on Moodle.

07.05.2021, 9:15–10:45
  1. Tong Yu: Tjong Kim Sang and De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
  2. Irina Wüst: Florian et al. 2003. Named Entity Recognition through Classifier Combination

14.05.2021, 9:15–10:45
  1. Hyunji Kim: Lample et al. 2016. Neural Architectures for Named Entity Recognition
  2. Carlos Rubiano: Huang et al. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging

21.05.2021, 9:15–10:45
  1. Dang Nguyen: Akbik et al. 2018. Contextual String Embeddings for Sequence Labeling
  2. Adil Chhabra: Yamada et al. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

28.05.2021, 9:15–10:45
  1. Claudia Rebmann: Joshi et al. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans
  2. Oliver Zobel: Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling

04.06.2021, 9:15–10:45
  1. Luisa Kriener: Zhou et al. 2005. Exploring Various Knowledge in Relation Extraction
  2. Leander Girrbach: Snow et al. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery

11.06.2021, 9:15–10:45
  1. Ufkun Menderes: Riedel et al. 2013. Relation Extraction with Matrix Factorization and Universal Schemas
  2. Jiahui Li: Zhang et al. 2017. Position-aware Attention and Supervised Data Improve Slot Filling

18.06.2021, 9:15–10:45
  1. Lisa Kuhn: Katti et al. 2018. Chargrid: Towards Understanding 2D Documents
  2. Ines Reinig: Denk and Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding

25.06.2021, 9:15–10:45
  1. Dorian Heide: Qian et al. 2019. GraphIE: A Graph-Based Framework for Information Extraction
  2. Eileen Dickson: Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents

02.07.2021, 9:15–10:45
  1. Anne-Kathrin Bugert: Li et al. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis
  2. Ines Pisetta: Xu et al. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding

09.07.2021, 9:15–10:45
  1. Trang Nguyen Vu: Liu et al. 2019. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
  2. Christoph Schneider: Klaiman and Lehne. 2021. DocReader: Bounding-Box Free Training of a Document Information Extraction Model

16.07.2021, 9:15–10:45
  1. Claire Sun: Ratner et al. 2016. Data Programming: Creating Large Training Sets, Quickly
  2. Philipp Meier: Herzig et al. 2020. TAPAS: Weakly Supervised Table Parsing via Pre-training

23.07.2021, 9:15–10:45
  1. Arthur Neumüller: Banko et al. 2007. Open Information Extraction from the Web
  2. Benedetto Kotzaneck: Fader et al. 2011. Identifying Relations for Open Information Extraction

Literature

  • Daniel Jurafsky and James H. Martin. 2020. Speech and Language Processing
  • Tjong Kim Sang and De Meulder. 2003. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
  • Florian et al. 2003. Named Entity Recognition through Classifier Combination
  • Huang et al. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging
  • Lample et al. 2016. Neural Architectures for Named Entity Recognition
  • Akbik et al. 2018. Contextual String Embeddings for Sequence Labeling
  • Yamada et al. 2020. LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
  • Zhou et al. 2005. Exploring Various Knowledge in Relation Extraction. ACL.
  • Snow et al. 2005. Learning Syntactic Patterns for Automatic Hypernym Discovery. NeurIPS.
  • Surdeanu. 2013. Overview of the TAC2013 Knowledge Base Population Evaluation: English Slot Filling and Temporal Slot Filling. TAC-13.
  • Riedel et al. 2013. Relation Extraction with Matrix Factorization and Universal Schemas
  • Zhang et al. 2017. Position-aware Attention and Supervised Data Improve Slot Filling. EMNLP.
  • Joshi et al. 2020. SpanBERT: Improving Pre-training by Representing and Predicting Spans
  • Qian et al. 2019. GraphIE: A Graph-Based Framework for Information Extraction
  • Katti et al. 2018. Chargrid: Towards Understanding 2D Documents
  • Liu et al. 2019. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents
  • Denk and Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding
  • Majumder et al. 2020. Representation Learning for Information Extraction from Form-like Documents
  • Xu et al. 2020. LayoutLM: Pre-training of Text and Layout for Document Image Understanding
  • Li et al. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis
  • Aggarwal et al. 2020. Form2Seq: A Framework for Higher-Order Form Structure Extraction
  • Herzig et al. 2020. TAPAS: Weakly Supervised Table Parsing via Pre-training

