Abstract
Word embeddings are a key component of high-performing natural language processing (NLP) systems, but it remains a challenge to learn good representations for novel words on the fly, i.e., for words that did not occur in the training data. The general problem setting is that word embeddings are induced on an unlabeled training corpus and then a model is trained that embeds novel words into this induced embedding space. Currently, two approaches for learning embeddings of novel words exist: (i) learning an embedding from the novel word's surface form (e.g., subword n-grams) and (ii) learning an embedding from the context in which it occurs. In this paper, we propose an architecture that leverages both sources of information, surface form and context, and show that it results in large increases in embedding quality. Our architecture obtains state-of-the-art results on the Definitional Nonce and Contextual Rare Words datasets. As input, we only require an embedding set and an unlabeled corpus for training our architecture to produce embeddings appropriate for the induced embedding space. Thus, our model can easily be integrated into any existing NLP system and enhance its capability to handle novel words.
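The abstract does not spell out how the two sources of information are merged, so the sketch below shows one plausible mechanism: a learned gate that produces a convex combination of a form-based and a context-based vector, trained to mimic the gold embeddings of frequent words. All names (`v_form`, `v_context`, `w`, `b`) and the gating formula are illustrative assumptions, not necessarily the paper's exact architecture.

```python
# Minimal sketch (assumed, not the authors' exact model): combine a
# surface-form embedding (e.g., averaged subword n-gram vectors) with a
# context embedding (e.g., averaged vectors of surrounding words) via a
# learned gate, so the model can lean on whichever source is more
# informative for a given novel word.
import numpy as np

rng = np.random.default_rng(0)
dim = 300  # dimensionality of the pre-trained embedding space

# Hypothetical inputs for one novel word:
v_form = rng.normal(size=dim)     # built from the word's subword n-grams
v_context = rng.normal(size=dim)  # built from the word's observed contexts

# Gate parameters; in practice these would be learned by training the
# model to reproduce the original embeddings of frequent words, for
# which gold vectors already exist in the induced space.
w = rng.normal(size=2 * dim)
b = 0.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Convex combination: alpha in (0, 1) weights form against context.
alpha = sigmoid(w @ np.concatenate([v_form, v_context]) + b)
v_novel = alpha * v_form + (1.0 - alpha) * v_context
print(v_novel.shape)  # (300,) -- a vector in the induced embedding space
```

Because the output lives in the same space as the pre-trained embeddings, the resulting vector can be dropped into any downstream NLP system without retraining that system.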
| Document type | Conference paper |
|---|---|
| EU Funded Grant Agreement Number | 740516 |
| EU projects | Horizon 2020 > ERC Grants > ERC Advanced Grant > ERC Grant 740516: NonSequeToR - Non-sequence models for tokenization replacement |
| Cross-faculty institutions | Centrum für Informations- und Sprachverarbeitung (CIS) |
| Subject areas | 000 Computer science, information & general works > 000 Computer science, knowledge, systems; 000 Computer science, information & general works > 004 Data processing, computer science; 400 Language > 400 Language; 400 Language > 410 Linguistics |
| URN | urn:nbn:de:bvb:19-epub-61859-8 |
| Language | English |
| Document ID | 61859 |
| Date of publication on Open Access LMU | 13 May 2019, 11:42 |
| Last modified | 4 Nov 2020, 13:39 |