ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain

Zhang, Mike; Goot, Rob van der and Plank, Barbara (2023): ESCOXLM-R: Multilingual Taxonomy-driven Pre-training for the Job Market Domain. The 61st Annual Meeting of the Association for Computational Linguistics, Toronto, Canada, July 9-14, 2023. In: Rogers, Anna (ed.): Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA: Association for Computational Linguistics. pp. 11871-11890 [PDF, 403kB]

Creative Commons: Attribution 4.0 (CC-BY)

Published version

DOI: 10.18653/v1/2023.acl-long.662

Abstract

The increasing number of benchmarks for Natural Language Processing (NLP) tasks in the computational job market domain highlights the demand for methods that can handle job-related tasks such as skill extraction, skill classification, job title classification, and de-identification. While some approaches have been developed that are specific to the job market domain, there is a lack of generalized, multilingual models and benchmarks for these tasks. In this study, we introduce a language model called ESCOXLM-R, based on XLM-R-large, which uses domain-adaptive pre-training on the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy, covering 27 languages. The pre-training objectives for ESCOXLM-R include dynamic masked language modeling and a novel additional objective for inducing multilingual taxonomical ESCO relations. We comprehensively evaluate the performance of ESCOXLM-R on 6 sequence labeling and 3 classification tasks in 4 languages and find that it achieves state-of-the-art results on 6 out of 9 datasets. Our analysis reveals that ESCOXLM-R performs better on short spans and outperforms XLM-R-large on entity-level and surface-level span-F1, likely due to ESCO containing short skill and occupation titles, and encoding information on the entity-level.

Document type:	Conference contribution (Paper)
Cross-faculty institutions:	Centrum für Informations- und Sprachverarbeitung (CIS)
Subjects:	000 Computer science, information science, general works > 004 Computer science; 400 Language > 400 Language
URN:	urn:nbn:de:bvb:19-epub-121962-6
ISBN:	978-1-959429-72-2
Place:	Stroudsburg, PA
Language:	English
Document ID:	121962
Date of publication on Open Access LMU:	04 Nov 2024 14:19
Last modified:	04 Nov 2024 14:19
