Abstract
We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, thereby performing data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Both our cross-domain and our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
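To make the idea of injecting synthetic OCR errors concrete, the following is a minimal sketch and not the method described in the paper. It assumes character-level noise (substitution, deletion, insertion) applied to tokenized source-domain text. The function name `inject_ocr_noise`, the `CONFUSIONS` table, and the `error_rate` parameter are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical confusion table of visually similar characters.
# These pairs are illustrative assumptions, not taken from the paper.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "i": ["l", "1"],
    "o": ["0", "c"],
    "s": ["f", "5"],
    "n": ["u", "m"],
    "u": ["n"],
    "m": ["rn"],
}

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def inject_ocr_noise(tokens, error_rate=0.1, seed=0):
    """Return a noisy copy of `tokens`.

    The number of tokens is preserved, so token-level NER labels
    of the source corpus stay aligned with the noisy text.
    """
    rng = random.Random(seed)  # seeded for reproducible corpora
    noisy_tokens = []
    for token in tokens:
        chars = []
        for ch in token:
            if rng.random() >= error_rate:
                chars.append(ch)  # keep the character unchanged
                continue
            op = rng.choice(["substitute", "delete", "insert"])
            if op == "substitute" and ch.lower() in CONFUSIONS:
                chars.append(rng.choice(CONFUSIONS[ch.lower()]))
            elif op == "delete" and len(token) > 1:
                continue  # drop the character
            elif op == "insert":
                chars.append(ch)
                chars.append(rng.choice(ALPHABET))  # spurious extra character
            else:
                chars.append(ch)  # operation not applicable, keep as-is
        # Fall back to the original token rather than emit an empty one.
        noisy_tokens.append("".join(chars) or token)
    return noisy_tokens
```

A possible use is to call `inject_ocr_noise(sentence_tokens, error_rate=0.05)` on each clean training sentence and train the tagger on the noisy copy with the original labels. In practice the confusion table and error rate would be estimated from real OCR output of the target domain rather than fixed by hand.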
| Item Type: | Conference or Workshop Item (Paper) |
|---|---|
| EU Funded Grant Agreement Number: | 740516 |
| EU Projects: | Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement |
| Form of publication: | Preprint |
| Research Centers: | Center for Information and Language Processing (CIS) |
| Subjects: | 000 Computer science, information and general works > 000 Computer science, knowledge, and systems; 400 Language > 400 Language; 400 Language > 410 Linguistics |
| URN: | urn:nbn:de:bvb:19-epub-92207-3 |
| Place of Publication: | Cham |
| Language: | English |
| Item ID: | 92207 |
| Date Deposited: | 27. May 2022 11:06 |
| Last Modified: | 27. May 2022 11:06 |

