Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining

Sajjad, Hassan; Schmid, Helmut; Fraser, Alexander and Schütze, Hinrich (2017): Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining. In: Computational Linguistics, Vol. 43, No. 2: pp. 349-375

Full text not available from 'Open Access LMU'.

DOI: 10.1162/COLI_a_00286

Abstract

We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration sub-model learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
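The interpolation of the two sub-models and the posterior-based disambiguation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy probabilities, and the 0.5 threshold are all assumptions made for illustration. It also simplifies the training described in the abstract by holding both sub-models fixed and re-estimating only the interpolation weight with EM, whereas in the paper the transliteration sub-model is itself learned during training.

```python
def noise_posteriors(p_translit, p_noise, lam):
    """Posterior probability that each word pair was generated by the noise sub-model.

    p_translit[i] and p_noise[i] are the (assumed, strictly positive)
    probabilities of pair i under the two sub-models; lam is the prior
    weight of the noise sub-model in the interpolation.
    """
    return [
        lam * pn / ((1.0 - lam) * pt + lam * pn)
        for pt, pn in zip(p_translit, p_noise)
    ]


def fit_noise_weight(p_translit, p_noise, lam=0.5, iterations=100):
    """Re-estimate only the interpolation weight with EM (a simplification)."""
    for _ in range(iterations):
        # E-step: how strongly each pair is attributed to the noise sub-model.
        post = noise_posteriors(p_translit, p_noise, lam)
        # M-step: the new weight is the average noise posterior.
        lam = sum(post) / len(post)
    return lam


def mine(pairs, p_translit, p_noise, lam, threshold=0.5):
    """Keep the pairs whose noise posterior falls below the (assumed) threshold."""
    post = noise_posteriors(p_translit, p_noise, lam)
    return [pair for pair, q in zip(pairs, post) if q < threshold]


# Toy input: hypothetical word pairs with made-up sub-model probabilities.
pairs = [("london", "landan"), ("paris", "paris"), ("house", "ghar"), ("water", "pani")]
p_translit = [9e-4, 8e-4, 1e-6, 2e-6]
p_noise = [1e-5, 1e-5, 1e-5, 1e-5]

lam = fit_noise_weight(p_translit, p_noise)
mined = mine(pairs, p_translit, p_noise, lam)
```

With these toy numbers the first two pairs are far more probable under the transliteration sub-model than under the noise sub-model, so they are the ones kept in `mined`.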

Item Type:	Journal article
Faculty:	Languages and Literatures > Department 2
Subjects:	400 Language > 400 Language
ISSN:	0891-2017
Language:	English
Item ID:	53302
Date Deposited on Open Access LMU:	14 Jun 2018 09:52
Last Modified:	04 Nov 2020 13:32
