Wine is Not v i n. On the Compatibility of Tokenizations Across Languages

Maronikolakis, Antonis; Dufter, Philipp and Schütze, Hinrich (November 2021): Wine is Not v i n. On the Compatibility of Tokenizations Across Languages. EMNLP 2021, Punta Cana, Dominican Republic, November 7–11, 2021. In: Moens, Marie-Francine; Huang, Xuanjing; Specia, Lucia and Yih, Scott Wen-tau (eds.): Findings of the Association for Computational Linguistics: EMNLP 2021, Stroudsburg, PA: Association for Computational Linguistics. pp. 2382-2399 [PDF, 3MB]

DOI: 10.5282/ubm/epub.92193

Abstract

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., “wine” (word-level) in English vs. “v i n” (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.
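
As a rough illustration of the granularity mismatch described in the abstract, the following Python sketch compares the average number of tokens per word in two tokenized word lists. This is not the compatibility measure proposed in the paper, which is not specified on this page; the functions `fertility` and `granularity_gap` and the toy word lists are hypothetical and chosen only to mirror the "wine" vs. "v i n" example.

```python
def fertility(tokenized_words):
    """Average number of subword tokens per word (a simple granularity proxy)."""
    total_tokens = sum(len(tokens) for tokens in tokenized_words)
    return total_tokens / len(tokenized_words)


def granularity_gap(tokenized_a, tokenized_b):
    """Absolute difference in fertility between two tokenized word lists.

    Assumption: this is only an illustrative proxy, not the paper's measure.
    """
    return abs(fertility(tokenized_a) - fertility(tokenized_b))


# Hypothetical toy data: each word is given as the list of its tokens.
english_word_level = [["wine"], ["is"], ["red"]]
french_char_level = [["v", "i", "n"], ["e", "s", "t"], ["r", "o", "u", "g", "e"]]
french_word_level = [["vin"], ["est"], ["rouge"]]

# Word-level English vs. character-level French: large gap.
gap_char = granularity_gap(english_word_level, french_char_level)
# Word-level English vs. word-level French: no gap.
gap_word = granularity_gap(english_word_level, french_word_level)

print(f"gap (word vs. char): {gap_char:.2f}")
print(f"gap (word vs. word): {gap_word:.2f}")
```

Under this proxy, a gap near zero suggests that both languages are segmented at a similar granularity, while a large gap flags a pairing like word-level English against character-level French.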

Document type:	Conference contribution (Paper)
EU Funded Grant Agreement Number:	740516
EU projects:	Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement
Cross-faculty institutions:	Centrum für Informations- und Sprachverarbeitung (CIS)
Subjects:	000 Computer science, information and general works > 000 Computer science, knowledge, systems; 400 Language > 400 Language; 400 Language > 410 Linguistics
URN:	urn:nbn:de:bvb:19-epub-92193-9
Place:	Stroudsburg, PA
Language:	English
Document ID:	92193
Date published on Open Access LMU:	27 May 2022, 09:17
Last modified:	27 May 2022, 09:17