SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings

Sabet, Masoud Jalili; Dufter, Philipp and Schütze, Hinrich (27 April 2020): SimAlign: High Quality Word Alignments without Parallel Training Data using Static and Contextualized Embeddings. In: Findings of ACL: EMNLP 2020 [PDF, 1MB]

DOI: 10.5282/ubm/epub.72200

External full text: https://arxiv.org/abs/2004.08728

Abstract

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings, both static and contextualized, for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are competitive and mostly superior to traditional statistical aligners, even in scenarios with abundant parallel data. For example, for a set of 100k parallel sentences, contextualized embeddings achieve a word alignment F1 for English-German that is more than 5% higher (absolute) than eflomal, a high quality alignment model.
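The key idea, aligning words by comparing their embeddings, can be illustrated with a small sketch. The abstract does not state how alignments are extracted from the embeddings, so the mutual-argmax rule below is an assumption chosen for simplicity. The function names (`cosine`, `similarity_matrix`, `mutual_argmax_alignment`) and the two-dimensional vectors are hypothetical toy values, not output of any real multilingual model.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(src_vecs, tgt_vecs):
    """Rows index source words, columns index target words."""
    return [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]


def mutual_argmax_alignment(sim):
    """Keep (i, j) only if j is the best target for i AND i is the best source for j.

    Assumption: this extraction rule is illustrative; the abstract does not name one.
    """
    if not sim or not sim[0]:
        return []
    n_src, n_tgt = len(sim), len(sim[0])
    best_tgt = [max(range(n_tgt), key=lambda j: sim[i][j]) for i in range(n_src)]
    best_src = [max(range(n_src), key=lambda i: sim[i][j]) for j in range(n_tgt)]
    return [(i, best_tgt[i]) for i in range(n_src) if best_src[best_tgt[i]] == i]


# Hand-made toy embeddings (hypothetical values, for illustration only).
src_words = ["the", "house"]
tgt_words = ["das", "Haus"]
src_vecs = [[1.0, 0.1], [0.1, 1.0]]
tgt_vecs = [[0.9, 0.2], [0.2, 0.9]]

alignment = mutual_argmax_alignment(similarity_matrix(src_vecs, tgt_vecs))
for i, j in alignment:
    print(f"{src_words[i]} <-> {tgt_words[j]}")
```

In practice the vectors would come from a multilingual embedding model trained on monolingual data, as the abstract describes. Nothing in the sketch depends on parallel sentences for training, only on the two sentences being aligned.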

Document type:	Journal article
EU Funded Grant Agreement Number:	740516
EU projects:	Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement
Cross-faculty institutions:	Centrum für Informations- und Sprachverarbeitung (CIS)
Subjects:	000 Computer science, information and general works > 000 Computer science, knowledge, systems; 400 Language > 410 Linguistics
URN:	urn:nbn:de:bvb:19-epub-72200-6
Language:	English
Document ID:	72200
Date published on Open Access LMU:	20 May 2020 07:35
Last modified:	04 Nov 2020 13:53