
Marz, Luisa; Schweter, Stefan; Poerner, Nina; Roth, Benjamin and Schütze, Hinrich (2 September 2021): Data Centric Domain Adaptation for Historical Text with OCR Errors. International Conference on Document Analysis and Recognition, Lausanne, Switzerland, September 2021. In: Lladós, J.; Lopresti, D. and Uchida, S. (eds.): Document Analysis and Recognition – ICDAR 2021. 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II, Cham: Springer. pp. 748-761 [PDF, 364kB]


Abstract

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, an approach we refer to as data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
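
To illustrate the idea of injecting synthetic OCR errors into clean source-domain text, the following is a minimal sketch and not the authors' method. It assumes a small hand-written table of look-alike character substitutions; the table `CONFUSIONS`, the function `inject_ocr_noise`, and the `rate` parameter are all hypothetical names introduced here for illustration. Tokens are perturbed one by one, so token-level NER labels stay aligned with the noisy text.

```python
import random

# Hypothetical confusion table (an assumption for illustration, not taken
# from the paper): each character maps to strings an OCR engine might
# plausibly output instead.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "m": ["rn"],
    "s": ["f"],
    "u": ["n"],
}


def inject_ocr_noise(tokens, rate=0.1, seed=None, confusions=CONFUSIONS):
    """Return a new token list with some characters replaced by look-alikes.

    Each character that has an entry in `confusions` is replaced with
    probability `rate`. The number of tokens never changes, so labels
    attached to the tokens remain valid.
    """
    rng = random.Random(seed)  # local generator keeps runs reproducible
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            options = confusions.get(ch)
            if options and rng.random() < rate:
                chars.append(rng.choice(options))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


if __name__ == "__main__":
    sentence = ["Het", "museum", "in", "Amsterdam"]
    print(inject_ocr_noise(sentence, rate=0.3, seed=0))
```

With `rate=0.0` the input is returned unchanged, and with `rate=1.0` every character listed in the table is replaced. A more realistic table could be estimated from aligned pairs of OCR output and ground-truth text, which is likewise an assumption here rather than something the abstract specifies.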
