Data Centric Domain Adaptation for Historical Text with OCR Errors

Marz, Luisa; Schweter, Stefan; Poerner, Nina; Roth, Benjamin and Schütze, Hinrich (2 September 2021): Data Centric Domain Adaptation for Historical Text with OCR Errors. International Conference on Document Analysis and Recognition, Lausanne, Switzerland, September 2021. In: Lladós, J.; Lopresti, D. and Uchida, S. (eds.): Document Analysis and Recognition – ICDAR 2021. 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II, Cham: Springer. pp. 748-761 [PDF, 364kB]

Draft

DOI: 10.1007/978-3-030-86331-9_48

Abstract

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, an approach we refer to as data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
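
To make the idea of injecting synthetic OCR errors concrete, the following is a minimal sketch and not the authors' method. It assumes a small, hand-written table of look-alike character substitutions (`CONFUSIONS`) and a hypothetical helper `inject_ocr_noise`; neither comes from the paper. Tokens are corrupted one-to-one, so token-level NER labels would remain aligned with the noisy text.

```python
import random

# Hypothetical confusion table: visually similar characters that an OCR
# engine might mix up. These pairs are illustrative assumptions only.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "i": ["l", "1"],
    "s": ["f"],
    "n": ["u"],
    "u": ["n"],
    "o": ["0", "e"],
    "m": ["rn"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=None):
    """Return a copy of `tokens` with some characters replaced by look-alikes.

    Each character that has an entry in CONFUSIONS is replaced with
    probability `error_rate`. The number of tokens never changes.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            options = CONFUSIONS.get(ch)
            if options and rng.random() < error_rate:
                chars.append(rng.choice(options))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


# Usage: corrupt a clean source-domain sentence before training.
tokens = ["Monsieur", "Dupont", "habite", "à", "Lausanne"]
noisy_tokens = inject_ocr_noise(tokens, error_rate=0.2, seed=0)
```

A fixed substitution table keeps the sketch short. A more realistic variant would presumably estimate substitution probabilities from aligned OCR output and ground-truth text, which is outside the scope of this illustration.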

Document type:	Conference contribution (Paper)
EU Funded Grant Agreement Number:	740516
EU projects:	Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement
Form of publication:	Preprint
Cross-faculty institutions:	Centrum für Informations- und Sprachverarbeitung (CIS)
Subjects:	000 Computer science, information and general works > 000 Computer science, knowledge, systems; 400 Language > 400 Language; 400 Language > 410 Linguistics
URN:	urn:nbn:de:bvb:19-epub-92207-3
Place of publication:	Cham
Language:	English
Document ID:	92207
Date deposited on Open Access LMU:	27 May 2022 11:06
Last modified:	27 May 2022 11:06
