A Stronger Baseline for Multilingual Word Embeddings

Dufter, Philipp and Schütze, Hinrich (November 2018): A Stronger Baseline for Multilingual Word Embeddings. [PDF, 173kB]

DOI: 10.5282/ubm/epub.61864

Abstract

Levy, Søgaard and Goldberg’s (2017) S-ID (sentence ID) method applies word2vec to tuples containing a sentence ID and a word from the sentence. It has been shown to be a strong baseline for learning multilingual embeddings. Inspired by recent work on concept-based embedding learning, we propose SC-ID, an extension to S-ID: given a sentence-aligned corpus, we use sampling to extract concepts that are then processed in the same manner as S-IDs. We perform experiments on the Parallel Bible Corpus across 1000+ languages and show that SC-ID yields a performance increase of up to 6% in a word translation task. In addition, we provide evidence that SC-ID is easily and widely applicable by reporting competitive results across 8 tasks on a EuroParl-based corpus.
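
As a rough illustration of the S-ID input described in the abstract, the sketch below builds (sentence ID, word) tuples from a toy sentence-aligned corpus. The function name `sid_pairs`, the toy data, the whitespace tokenization, the lowercasing, and the language prefix on each word are assumptions made for illustration, not details taken from the paper. Training word2vec on the resulting pairs and the sampling step of SC-ID are not shown.

```python
from typing import Dict, List, Tuple


def sid_pairs(parallel: Dict[str, Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (sentence ID, language-prefixed word) pairs.

    `parallel` maps a sentence ID to {language code: sentence text}.
    Prefixing words with the language code is an assumption of this sketch;
    it keeps identical spellings in different languages apart.
    """
    pairs: List[Tuple[str, str]] = []
    for sent_id, versions in sorted(parallel.items()):
        for lang, text in sorted(versions.items()):
            # Naive whitespace tokenization (assumption).
            for word in text.lower().split():
                pairs.append((sent_id, f"{lang}:{word}"))
    return pairs


# Hypothetical two-sentence, two-language corpus.
corpus = {
    "s1": {"en": "the light", "de": "das Licht"},
    "s2": {"en": "the word", "de": "das Wort"},
}
pairs = sid_pairs(corpus)
```

Each pair places a word in the context of its sentence ID, so words from different languages that occur in the same aligned sentences share contexts when the pairs are passed to an embedding learner.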

Document type:	Paper
EU Funded Grant Agreement Number:	740516
EU projects:	Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement
Cross-faculty institutions:	Centrum für Informations- und Sprachverarbeitung (CIS)
Subjects:	000 Computer science, information and general works > 000 Computer science, knowledge, systems; 000 Computer science, information and general works > 004 Computer science; 400 Language > 400 Language; 400 Language > 410 Linguistics
URN:	urn:nbn:de:bvb:19-epub-61864-2
Language:	English
Document ID:	61864
Date deposited on Open Access LMU:	13 May 2019 13:39
Last modified:	04 Nov 2020 13:39
