Ruprecht-Karls-Universität Heidelberg
Institut für Computerlinguistik

Information Extraction and Applications

Module Description

Course               Module Abbreviation   Credit Points
BA-2010[100%|75%]    CS-CL                 6 LP
BA-2010[50%]         BS-CL                 6 LP
BA-2010              AS-CL                 8 LP
Master               SS-CL, SS-TAC         8 LP
Lecturer Daniel Dahlmeier
Module Type Proseminar / Hauptseminar
Language English
First Session 22.04.2022
Time and Place Friday, 08:15–09:45, online
Commitment Period tbd.

Prerequisite for Participation

  • Mathematical Foundations of CL (or a comparable introductory class to linear algebra and theory of probability)
  • Statistical Methods for CL (or a comparable introductory class to machine learning)

Assessment

  • Regular and active attendance of seminar (40%)
  • Independent study of assigned scientific papers, clarity of report and presentation (60%)

Content

This seminar focuses on information extraction (IE) and its applications to business documents. After an overview of traditional IE methods, we will discuss recent research focusing on IE from form-like business documents, such as invoices or purchase orders.


Students will be assigned research papers to study and present in the seminar.

Module Overview

Agenda

Date Session Materials

Literature

  • Daniel Jurafsky, James H. Martin. 2020. Speech and Language Processing
  • ACL 2020 tutorial.
  • Ibrahim, Yusra, Mirek Riedewald, Gerhard Weikum and Demetrios Zeinalipour-Yazti. Bridging Quantities in Tables and Text. ICDE (2019): 1010-1021.
  • Katti, Anoop R., Christian Reisswig, Cordula Guder, Sebastian Brarda, Steffen Bickel, Johannes Höhne and Jean Baptiste Faddoul. Chargrid: Towards Understanding 2D Documents. EMNLP (2018).
  • Qian, Yujie, Enrico Santus, Zhijing Jin, Jiang Guo and Regina Barzilay. GraphIE: A Graph-Based Framework for Information Extraction. NAACL-HLT (2019).
  • Ratner, Alexander, Stephen H. Bach, Henry R. Ehrenberg, Jason Alan Fries, Sen Wu and Christopher Ré. Snorkel: Rapid Training Data Creation with Weak Supervision. PVLDB 11(3) (2017): 269-282.
  • Wu, Sen, Luke Hsiao, Xiao Cheng, Braden Hancock, Theodoros Rekatsinas, Philip Levis and Christopher Ré. Fonduer: Knowledge Base Construction from Richly Formatted Data. SIGMOD (2018): 1301-1316.
  • Bodhisattwa Prasad Majumder, Navneet Potti, Sandeep Tata, James Bradley Wendt, Qi Zhao, Marc Najork. Representation Learning for Information Extraction from Form-like Documents
  • Xiaojing Liu, Feiyu Gao, Qiong Zhang, Huasha Zhao. Graph Convolution for Multimodal Information Extraction from Visually Rich Documents. NAACL 2020
  • Timo I. Denk and Christian Reisswig. 2019. BERTgrid: Contextualized Embedding for 2D Document Representation and Understanding.
  • Sebastian Riedel, Limin Yao, Andrew McCallum, Benjamin M. Marlin. Relation Extraction with Matrix Factorization and Universal Schemas
  • Milan Aggarwal, Hiresh Gupta, Mausoom Sarkar, Balaji Krishnamurthy. Form2Seq: A Framework for Higher-Order Form Structure Extraction
  • Yiheng Xu, Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou. LayoutLM: Pre-training of Text and Layout for Document Image Understanding
  • Jonathan Herzig, Paweł Krzysztof Nowak, Thomas Müller, Francesco Piccinno, Julian Martin Eisenschlos. TAPAS: Weakly Supervised Table Parsing via Pre-training
  • Minghao Li, Yiheng Xu, Lei Cui, Shaohan Huang, Furu Wei, Zhoujun Li, Ming Zhou. 2020. DocBank: A Benchmark Dataset for Document Layout Analysis
  • Müller M, Ghosh S, Rey M, Wittig U, Müller W, Strube M. 2020. Reconstructing Manual Information Extraction with DB-to-Document Backprojection: Experiments in the Life Science Domain. Proceedings of the First Workshop on Scholarly Document Processing, pages 81-90
  • Xavier Holt and Andrew Chisholm. 2018. Extracting structured data from invoices. In Proceedings of Australasian Language Technology Association Workshop
  • Clément Sage, Alex Aussem, Véronique Eglin, Haytham Elghazel, Jérémy Espinas. 2020. End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks. Proceedings of the 4th Workshop on Structured Prediction for NLP
  • Chuwei Luo, Yongpan Wang, Qi Zheng, Liangcheng Li, Feiyu Gao, Shiyu Zhang. 2020. Merge and Recognize: A Geometry and 2D Context Aware Graph Model for Named Entity Recognition from Visual Documents. Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
  • Rasmus Berg Palm, Ole Winther, and Florian Laws. 2017. CloudScan - A configuration-free invoice analysis system using recurrent neural networks. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 406-413. IEEE.
  • Rasmus Berg Palm, Florian Laws, and Ole Winther. 2018. Attend, Copy, Parse - End-to-end information extraction from documents. arXiv preprint arXiv:1812.07248.
