Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging

Imani, Ayyoob; Severini, Silvia; Sabet, Masoud Jalili; Yvon, François and Schütze, Hinrich (December 2022): Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging. EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1577-1589 [PDF, 425kB]

Creative Commons: Attribution 4.0 (CC-BY)

DOI: 10.18653/v1/2022.emnlp-main.102

External full text: https://aclanthology.org/2022.emnlp-main.102

Abstract

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state-of-the-art for unsupervised POS tagging of low-resource languages.
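The graph construction described in the abstract can be illustrated with a small sketch. Note that this is not the authors' method: the paper propagates labels with a Graph Neural Network augmented with transformer layers, whereas the sketch below swaps in plain iterative majority voting over neighbors, purely to show the data structure (words as nodes, alignment links as edges). The function names, the `(language, position)` node format, the language code `tgt`, and the toy alignments are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_alignment_graph(alignment_links):
    """Build an undirected graph: nodes are (language, position) words,
    edges are word-alignment links between translations of one sentence."""
    graph = defaultdict(set)
    for u, v in alignment_links:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def propagate_labels(graph, source_labels, max_iters=10):
    """Majority-vote propagation (a simple stand-in for the paper's GNN).

    Unlabeled nodes repeatedly take the most frequent tag among their
    already-labeled neighbors; ties go to the alphabetically first tag."""
    labels = dict(source_labels)
    for _ in range(max_iters):
        updates = {}
        for node in graph:
            if node in labels:
                continue
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if votes:
                # sorted() makes tie-breaking deterministic
                updates[node] = max(sorted(votes), key=votes.get)
        if not updates:
            break
        labels.update(updates)
    return labels


# Toy example: two labeled source sentences (eng, deu) and one
# unlabeled target sentence in a hypothetical language "tgt".
links = [
    (("eng", 0), ("deu", 0)), (("eng", 1), ("deu", 1)), (("eng", 2), ("deu", 2)),
    (("eng", 0), ("tgt", 0)), (("deu", 0), ("tgt", 0)),
    (("eng", 1), ("tgt", 1)),
    (("deu", 2), ("tgt", 2)),
]
source_labels = {
    ("eng", 0): "DET", ("eng", 1): "NOUN", ("eng", 2): "VERB",
    ("deu", 0): "DET", ("deu", 1): "NOUN", ("deu", 2): "VERB",
}

result = propagate_labels(build_alignment_graph(links), source_labels)
target_tags = [result[("tgt", i)] for i in range(3)]
print(target_tags)
```

In the setting the abstract describes, target-side tags projected in this way would then serve as training data for a POS tagger; the voting rule here is only a placeholder for the learned propagation model.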

Document type:	Conference contribution (Paper)
EU Funded Grant Agreement Number:	740516
EU projects:	Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement
Form of publication:	Publisher's Version
Cross-faculty institutions:	Centrum für Informations- und Sprachverarbeitung (CIS)
Subjects:	400 Language > 400 Language; 400 Language > 410 Linguistics
URN:	urn:nbn:de:bvb:19-epub-107437-7
Language:	English
Document ID:	107437
Date deposited on Open Access LMU:	20 Oct 2023 07:23
Last modified:	20 Oct 2023 07:23
