Poerner, Nina; Sabet, Masoud Jalili; Roth, Benjamin; Schütze, Hinrich (October 2018): Aligning Very Small Parallel Corpora Using Cross-Lingual Word Embeddings and a Monogamy Objective.
Abstract
Count-based word alignment methods, such as the IBM models or fast-align, struggle on very small parallel corpora. We therefore present an alternative approach based on cross-lingual word embeddings (CLWEs), which are trained on purely monolingual data. Our main contribution is an unsupervised objective to adapt CLWEs to parallel corpora. In experiments on between 25 and 500 sentences, our method outperforms fast-align. We also show that our fine-tuning objective consistently improves a CLWE-only baseline.
Item Type: | Paper (Research Paper) |
---|---|
EU Funded Grant Agreement Number: | 740516 |
EU Projects: | Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement |
Research Centers: | Center for Information and Language Processing (CIS) |
Subjects: | 000 Computer science, information and general works > 000 Computer science, knowledge, and systems; 000 Computer science, information and general works > 004 Data processing computer science; 400 Language > 400 Language; 400 Language > 410 Linguistics |
URN: | urn:nbn:de:bvb:19-epub-61865-8 |
Language: | English |
ID Code: | 61865 |
Deposited On: | 13. May 2019 13:40 |
Last Modified: | 04. Nov 2020 13:39 |