A Multilingual BPE Embedding Space for Universal Sentiment Lexicon Induction

We present a new method for sentiment lexicon induction that is designed to be applicable to the entire range of typological diversity of the world’s languages. We evaluate our method on Parallel Bible Corpus+ (PBC+), a parallel corpus of 1593 languages. The key idea is to use Byte Pair Encodings (BPEs) as basic units for multilingual embeddings. Through zero-shot transfer from English sentiment, we learn a seed lexicon for each language in the domain of PBC+. Through domain adaptation, we then generalize the domain-specific lexicon to a general one. We show – across typologically diverse languages in PBC+ – good quality of seed and general-domain sentiment lexicons by intrinsic and extrinsic and by automatic and human evaluation. We make freely available our code, seed sentiment lexicons for all 1593 languages and induced general-domain sentiment lexicons for 200 languages.


Introduction
Lexicons play an important role in sentiment analysis. Sentiment lexicons are available for high-resource languages like English (Pang et al., 2008; Baccianella et al., 2010; Mohammad and Turney, 2013), but not for many low-resource languages. Researchers are trying to fill this gap by inducing lexicons monolingually (Badaro et al., 2014; Eskander and Rambow, 2015; Rouces et al., 2018) as well as multilingually (Chen and Skiena, 2014), often by transfer from high-resource to low-resource languages.
The world's languages are heterogeneous; of particular relevance for us is heterogeneity with respect to morphology and with respect to marking token boundaries. This heterogeneity poses difficulties when designing a universal approach to lexicon induction that works for all languages: implementing a high-quality tokenizer and morphological analyzer for each language is not feasible in the short term.
Footnote 1: cistern.cis.lmu.de
Given the small number of native speakers in low-resource languages (Goldhahn et al., 2016), crowdsourcing cannot easily be carried out either.
To overcome this heterogeneity and provide sentiment resources for low-resource languages, we present a new approach to sentiment lexicon induction that is universal, that is, applicable to the full range of typologically different languages, and apply it to 1593 languages. Our method first takes a parallel corpus as input and applies BPE (Gage, 1994) segmentation to it. We then create a multilingual BPE embedding space, from which a ZS (zero-shot) lexicon for each language L is extracted by zero-shot transfer from English sentiment to L. We use PBC+, an expansion of the Parallel Bible Corpus (Mayer and Cysouw, 2014), as our parallel corpus. The ZS lexicons show high quality, but are specific to the domain of PBC+ (the Bible). We then adapt them to the general domain. For brevity, we also use "generic" to refer to "general-domain".
Our method is universal and language-agnostic: it does not require language-dependent preprocessing. We carry out intrinsic and extrinsic, automatic and human evaluations on 95 languages. Intrinsic evaluation shows that our approach produces word ratings that strongly correlate with gold standard lexicons and human judgments. Extrinsic evaluation on Twitter sentiment classification demonstrates that our lexicons perform comparably to or better than existing lexicons derived in multilingual settings.
We chose an approach to sentiment analysis based on lexicons in this paper because it is transparent and meets high standards of explainability. A classification decision can easily be traced back to the lexicon entries in the document that are responsible. Many more complex methods, e.g., many deep learning approaches, do not meet this standard. Transparency is of particular importance for low-resource languages because error analysis and verification are paramount when working with the small and noisy resources that are typical of low-resource languages.
Our contributions: (i) We propose a new method for inducing sentiment lexicons for a broad range of typologically diverse languages. We use BPEs as basic units and show that they work well across languages. (ii) We carry out extensive evaluation to confirm correctness and high quality of the created lexicons. (iii) We make our code, the 1593 ZS seed sentiment lexicons and 200 generic sentiment lexicons freely available to the community. This is the up-to-now largest sentiment resource in terms of language coverage that has been published.

Related Work
Monolingual Lexicon Induction. Sentiment lexicons for many languages have been induced. Eskander and Rambow (2015), Wang and Ku (2016), and Rouces et al. (2018) create Arabic, Chinese, and Swedish sentiment lexicons, respectively. Monolingually induced sentiment lexicons for specific domains like Twitter and finance have also been devised (Hamilton et al., 2016). These methods are specialized such that applying them to other languages is non-trivial. For example, Eskander and Rambow (2015) link AraMorph (Buckwalter, 2004) with SentiWordNet by additionally considering part-of-speech information, which may not be available in lexical resources in other languages. Inducing Chinese sentiment lexicons (Wang and Ku, 2016) needs properly tokenized corpora, which is not a hard requirement in Swedish. In contrast, we aim to design a method applicable to typologically diverse languages and we apply it to 1500+ languages.
Bi/Multi-Lingual Lexicon Induction. Gao et al. (2015) propose a graph-based method for learning sentiment lexicons in a target language by leveraging English sentiment lexicons. They rely on a high-quality word alignment, which is difficult to produce if languages are typologically diverse and the size of the parallel corpus is small. Chen and Skiena (2014)   In contrast, our approach uses BPE embeddings to extract alignment signals from the parallel corpus, an approach that is more readily applicable across diverse languages. We do not require resources like Wiktionary. We cover more languages than Chen and Skiena (2014) and more words (e.g., 300K for Amharic).
Language-Agnostic NLP. Language-agnostic NLP has demonstrated strong performance in areas such as neural machine translation (NMT) and universal representation learning. A particular difficulty is posed by languages that do not mark token boundaries by whitespace, such as Japanese. We refer to them as non-segmented languages. Sennrich et al. (2016) show the strength of BPE in translating rare words. Kudo (2018) introduces subword regularization that utilizes multiple subword sequences to improve the robustness of NMT models. Sennrich et al. (2016)'s subword-nmt requires preprocessing (specifically, tokenization) for non-segmented languages; in contrast, sentencepiece (Kudo and Richardson, 2018), used by Kudo (2018), requires no preprocessing even for non-segmented languages. This research indicates the potential of language-agnostic NMT.
Effective representations of words (Schütze, 1993), e.g., word embeddings (Mikolov et al., 2013; Pennington et al., 2014), have been extended to be bilingual (Ruder, 2017; Artetxe et al., 2017) or multilingual (Dufter et al., 2018), with and without (Conneau et al., 2017) supervision. Artetxe and Schwenk (2018) train a language-agnostic BiLSTM encoder that creates universal sentence representations of 93 languages and performs strongly in crosslingual tasks. Lample and Conneau (2019) show that pretraining the encoders with a crosslingual language model objective helps in achieving state-of-the-art results in crosslingual classification and NMT. This research demonstrates the strength of language-agnostic methods for representation learning in NLP. Language-agnostic NLP models can generalize across languages without requiring language-dependent preprocessing. These advantages motivate us to design a universal approach for sentiment lexicon induction for 1500+ languages.
Figure 1 shows the four steps of our method: (i) BPE segmentation, (ii) multilingual embedding space creation, (iii) ZS lexicon induction, and (iv) domain adaptation to the general domain. We work with the parallel corpus PBC+. PBC+ extends the Parallel Bible Corpus by adding 500 translations of the New Testament in 334 languages, resulting in a sentence-aligned parallel corpus containing New Testament verses in 2164 translations of 1593 languages. Many languages have several translations of the New Testament in PBC+. We use the term "edition" to refer to a single translation. Table 1 shows a verse in three languages. As shown, the Japanese (jpn) verse is not tokenized.

BPE Segmentation
Given the linguistic heterogeneity of the world's languages, it is crucial to first decide which type of linguistic unit to use to represent a language L in the multilingual space. The word, the linguistic unit typically generated from whitespace tokenization, is not ideal for universal approaches because non-segmented languages require carefully designed tokenizers. Character (or byte) n-grams are an alternative unit (Wieting et al., 2016; Gillick et al., 2016; Schütze, 2017; Dufter et al., 2018), but the optimum length n varies across languages, e.g., n = 2 may be suitable for Chinese (Foo and Li, 2004), but clearly not for English.
In our desire to design a universal approach, we use sentencepiece to segment PBC+ editions in all 1593 languages into sequences of BPE segments. We will show that this segmentation works across languages.
The widely used BPE segmentation algorithm subword-nmt only considers BPE segments within words (Sennrich et al., 2016) and some frequent BPEs are essentially valid words.
sentencepiece adopts this setting for segmented languages like English (Kudo, 2018). But for non-segmented languages, sentencepiece does not require any language-dependent preprocessing: it learns a data-driven "tokenizer" on-the-fly from raw text. Hence, sentencepiece BPE segments can be larger linguistic units than, say, English words, e.g., phrases. Examples of Japanese BPE segments in PBC+ are: "愛のうちに" (in love) and "何と言えばよいでしょうか" (what should I say).
We will use the term "BPE" to refer to all BPE segments produced by sentencepiece, including subwords, words and cross-token units like phrases. Figure 1 (a) shows some sample units. As shown, the English segments can be words or subwords (underlined). Dominant contexts of the subwords shown are: insp (inspiration, inspired); crim (crime, criminals); blasphe (blasphemy, blasphemed); hest (highest, richest).
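To make the segmentation step concrete, the following is a minimal sketch of BPE merge learning applied directly to raw text. It illustrates the general BPE idea (Gage, 1994) and is not sentencepiece's actual implementation; the function names `merge_pair`, `learn_bpe` and `segment` and the toy input are hypothetical. Because no whitespace pre-tokenization is applied, merges are free to cross token boundaries, which is how units larger than words can arise.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe(sentences, num_merges):
    """Learn up to `num_merges` merges from raw sentences (no pre-tokenization)."""
    corpus = [list(s) for s in sentences]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties go to the pair seen first
        best = max(pairs, key=pairs.get)
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges


def segment(sentence, merges):
    """Apply the learned merges, in order, to a new raw sentence."""
    symbols = list(sentence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

In practice the vocabulary size, rather than a fixed number of merges, is the parameter that is set; the sketch uses a merge count only to keep the loop simple.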

Multilingual Space Creation
We next create the multilingual space hosting BPEs in 1593 languages of PBC+. We use the Sentence ID (S-ID) method (Levy et al. (2017), cf. also Le and Mikolov (2014)), a strong baseline in multilingual embedding learning.
Given a sentence-aligned parallel corpus, the S-ID method first creates an embedding training corpus by recording co-occurrences between the sentence ID and the sentence's words (the New Testament verse ID and BPEs in our case) in all languages. Figure 2 shows examples from the training corpus; each BPE is associated with a three-letter ISO 639-3 language code.
After that, an embedding learner is applied to the created corpus to learn the multilingual space. We use word2vec skip-gram (Mikolov et al., 2013) as our embedding learner.
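The construction of the S-ID training corpus can be sketched as follows. This is an illustration under assumptions: the `lang:bpe` prefix format, the verse ID string and the function name `build_sid_corpus` are hypothetical, since the exact layout of the training corpus is not specified in the text. Each emitted pair is one co-occurrence that an embedding learner would consume.

```python
def build_sid_corpus(editions):
    """Yield (verse_id, prefixed_bpe) co-occurrence pairs.

    `editions` maps an ISO 639-3 code to {verse_id: [bpe, ...]}.
    """
    for lang, verses in sorted(editions.items()):
        for verse_id, bpes in sorted(verses.items()):
            for bpe in bpes:
                # The language prefix keeps identical strings from
                # different languages apart in the shared vocabulary.
                yield verse_id, f"{lang}:{bpe}"


# Hypothetical toy input: one verse, two languages.
editions = {
    "eng": {"40001001": ["the", "book"]},
    "deu": {"40001001": ["das", "buch"]},
}
pairs = list(build_sid_corpus(editions))
```

Because BPEs from all languages co-occur with the same verse ID, translations of a verse are pulled toward each other in the learned space.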

Zero-Shot Transfer of English Sentiment
Embeddings encode sentiment information (Pennington et al., 2014; Tang et al., 2014; Amir et al., 2015; Rothe et al., 2016). We exploit this for zero-shot transfer of English sentiment to the other 1592 languages. We train two linear SVMs to classify sentiment of English BPE embeddings as positive vs. non-positive (POS) and as negative vs. non-negative (NEG).
We use this setup, as opposed to binary classification of positive vs. negative, to address the fact that some long BPE segments in non-segmented languages may encode both sentiments. Using two SVMs allows us to identify and then filter out segments with compositional sentiments during zero-shot transfer. This setup also enables direct comparison with Dufter et al. (2018) in Table 2.
The two SVMs are then applied to all embedding vectors in the multilingual space to yield a ZS lexicon for each of the 1593 languages.
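The decision logic of the two-classifier setup can be sketched as below. The 0.7 threshold is the one reported in the experimental settings; the rule that a BPE accepted by both classifiers is dropped as compositional is an assumption about how the filtering is realized, and the names `zs_label` and `build_zs_lexicon` are hypothetical.

```python
def zs_label(p_pos, p_neg, threshold=0.7):
    """Return 'pos', 'neg', or None for a BPE given the two classifier probabilities.

    Assumption: a BPE that both classifiers accept with high confidence is
    treated as carrying mixed (compositional) sentiment and is dropped.
    """
    is_pos = p_pos >= threshold
    is_neg = p_neg >= threshold
    if is_pos and is_neg:
        return None
    if is_pos:
        return "pos"
    if is_neg:
        return "neg"
    return None  # low confidence for both: not sentiment-bearing


def build_zs_lexicon(probabilities, threshold=0.7):
    """`probabilities` maps a BPE to (p_pos, p_neg); returns {bpe: label}."""
    lexicon = {}
    for bpe, (p_pos, p_neg) in probabilities.items():
        label = zs_label(p_pos, p_neg, threshold)
        if label is not None:
            lexicon[bpe] = label
    return lexicon
```

A single positive-vs-negative classifier could not express the "both" outcome, which is why two separate decisions are needed here.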

PBC+ to General Domain Adaptation
Our ZS lexicons show high quality (see §5.2), but are specific to the PBC+ domain, i.e., the Bible. We adapt them to the general domain by obtaining generic embeddings and using ZS lexicon BPEs as labels to predict the sentiment of each generic embedding.
We assume that we have access to generic embeddings or, alternatively, that we can learn them from a generic corpus. We now describe how we predict the sentiment of generic embeddings. Given the PBC+ ZS lexicon $B$ and the generic embeddings of language $L$, we learn a transformation matrix $Q_L$ by minimizing an objective that combines two sub-objectives with a regularization term, where $e_w, e_v \in \mathbb{R}^d$ are embeddings of BPEs $w, v$; $d$ is the embedding dimension; $n$ is the vocabulary size; $\alpha \in [0, 1]$ is the hyperparameter balancing the two sub-objectives; $\lambda$ is a regularization weight; and $P \in \mathbb{R}^{d \times d}$ is an identity matrix in the first dimension, i.e., a selector. This objective concentrates sentiment information in an embedding vector to a 1-dimensional ultradense sentiment space, resulting in a real-valued generic sentiment score. We minimize the objective using stochastic gradient descent (SGD). After training, the generic sentiment score of BPE $w$ in language $L$ is computed as $s_w = P Q_L e_w$. We refer to this method as REG and we call a lexicon computed by REG a generic DA (domain-adapted) lexicon since we always adapt from the Bible to the general domain in this paper.
REG is inspired by Densifier (Rothe et al., 2016), which is state of the art on SemEval2015 10E (Rosenthal et al., 2015), the task of determining the strength of association of Twitter terms with sentiment. Rothe et al. (2016) show that Densifier induces sentiment lexicons of high quality and coverage in a domain adaptation setup. Densifier forces $Q_L$ to be orthogonal to preserve the structure of the embedding space. As we are only interested in accurate sentiment prediction, we replace the orthogonality constraint with l2 regularization: $\frac{\lambda}{2} \lVert P Q_L \rVert_F^2$. The orthogonality constraint in Densifier, which requires computing an SVD after each batch update, is expensive ($O(d^3)$) and requires a non-trivial training regime (Rothe et al., 2016). We will show that our formulation delivers comparable results.
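The following sketch shows one way an objective of this kind could be evaluated. It is an assumption, not the paper's formula: it takes a Densifier-style form in which BPEs with the same ZS label are pulled together and BPEs with different labels are pushed apart along a single sentiment dimension, plus the l2 term in place of orthogonality. Since the selector picks out one dimension, the vector `q` stands in for the corresponding row of the transformation matrix; the names `score` and `reg_objective` and the toy embeddings are hypothetical.

```python
def score(q, e):
    """Sentiment score of embedding e; q plays the role of the selected row of Q_L."""
    return sum(qi * ei for qi, ei in zip(q, e))


def reg_objective(q, embeddings, labels, alpha=0.5, lam=0.1):
    """Assumed Densifier-style objective with l2 regularization.

    Lower is better: same-label pairs should be close, different-label
    pairs far apart, in the 1-dimensional sentiment space.
    """
    words = sorted(embeddings)
    same, diff = 0.0, 0.0
    for i, w in enumerate(words):
        for v in words[i + 1:]:
            gap = abs(score(q, embeddings[w]) - score(q, embeddings[v]))
            if labels[w] == labels[v]:
                same += gap
            else:
                diff += gap
    l2 = sum(qi * qi for qi in q)
    return alpha * same - (1 - alpha) * diff + 0.5 * lam * l2


# Hypothetical 2-dimensional toy embeddings; dimension 0 carries sentiment.
embeddings = {
    "good": [1.0, 0.2],
    "great": [0.8, -0.1],
    "bad": [-1.0, 0.1],
    "awful": [-0.9, -0.2],
}
labels = {"good": 1, "great": 1, "bad": -1, "awful": -1}

q_sentiment = [1.0, 0.0]  # projects onto the sentiment-bearing dimension
q_other = [0.0, 1.0]      # projects onto an uninformative dimension
```

On this toy data the projection onto the sentiment-bearing dimension yields a lower objective value than the uninformative one, which is the behaviour SGD would exploit when searching for the transformation.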
In our experiments, we can use the generic word embeddings provided by Bojanowski et al. (2017) for 157 languages. Additionally, Heinzerling and Strube (2018) create generic BPE embeddings for 257 languages by segmenting Wikipedia articles using sentencepiece and then running GloVe on the segmented corpora. As discussed above (§3.1), some BPEs in the PBC+ ZS lexicons are words and some are subwords, so we can utilize both sets.

Datasets and Settings
We use the 7958 New Testament verses in PBC+ that were also used by Dufter et al. (2018) to create the multilingual BPE embedding space. To cover as many BPEs as we can, we segment each PBC+ edition three times with vocabulary sizes 2000, 4000 and 8000 using sentencepiece. S-ID generates a 31GB embedding training corpus including 7,414,810 BPEs in 1593 languages.
English training set. We employ VADER, a simple but widely used rule-based model for general sentiment analysis (Hutto and Gilbert, 2014), to create sentiment labels for English BPEs. We consider BPEs with sentiment score +0.1 (resp. -0.1) as positive (resp. negative). BPEs with score 0 are treated as neutral. As a result, we have 851 positive, 906 negative and 13,861 neutral training BPEs in English. We uniformly sample 878 = floor((851 + 906)/2) neutral BPEs to speed up training.
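The labeling and sampling arithmetic can be sketched as follows. Two points are assumptions: the comparison operators (at least +0.1 for positive, at most -0.1 for negative) are not spelled out in the text, and BPEs whose score lies strictly between 0 and the threshold are simply skipped here. The names `vader_label` and `sample_neutral` are hypothetical, and the scores are taken as given rather than computed by VADER.

```python
import random


def vader_label(score, threshold=0.1):
    """Map a sentiment score to a label; weak non-zero scores are skipped."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    if score == 0:
        return "neutral"
    return None


def sample_neutral(neutral_bpes, n_pos, n_neg, seed=0):
    """Uniformly sample floor((n_pos + n_neg) / 2) neutral BPEs."""
    k = (n_pos + n_neg) // 2
    return random.Random(seed).sample(neutral_bpes, k)
```

With 851 positive and 906 negative BPEs, the sample size is floor(1757 / 2) = 878, so the three classes end up roughly balanced.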
Zero-shot transfer. The two SVMs for POS and NEG (§3.3) are trained on the English training set (see above), then applied to all vectors in the multilingual BPE embedding space to create ZS lexicons for 1593 languages. We only keep high-confidence BPEs, i.e., those with a predicted probability of ≥ 0.7 for either POS or NEG (Platt et al., 1999), to ensure that ZS lexicons encode clear sentiment signals. The PBC+ ZS lexicon of language L is then the set of all high-confidence sentiment-bearing BPEs from L.
(Abdaoui et al., 2017) and English (EN) (WHM lexicon, the concatenation of Wilson et al. (2005), Hu and Liu (2004) and Mohammad and Turney (2013), created by Rothe et al. (2016)). F1 is the evaluation metric. We always compute F1 on the intersection of our lexicon and the gold lexicon. Gold lexicons are also used in the intrinsic evaluation of generic DA lexicons (Table 6). Additionally, the English WHM lexicon is used in the evaluation of the universality of our approach (Table 8).
For intrinsic evaluation of generic DA lexicons, we compare our results with Densifier. Rothe et al. (2016) provide embeddings and train/validation splits of gold standard lexicons in CZ, DE, ES, FR and EN; we also use them in our experiments. We show that (i) using GEN (the same training words as Densifier), REG (§3.4) induces generic lexicons of comparable quality; (ii) using PBC+ ZS lexicons, the induced generic DA lexicons are also of high quality. Kendall's τ (Kendall, 1938) is the evaluation metric. As Densifier is implemented in MATLAB, we implement our model in NumPy (Oliphant, 2006), which is more accessible to the community.
For extrinsic evaluation of generic DA lexicons, we carry out Twitter sentiment classification in 13 languages. For each language, we retrieve ≈12,000 tweets from the human-annotated dataset devised by Mozetič et al. (2016), and sample a balanced number of positive and negative tweets (for clearer comparisons and descriptions), which are then randomly split 80/20 into train/test. We compare our lexicons with those of Chen and Skiena (2014). Two classification models are used (§5.3): COUNT (count-based, Chen and Skiena (2014)) and ML (machine-learning-based, Eskander and Rambow (2015)). Accuracy is the evaluation metric.
We tune the two linear SVMs for POS and NEG by 5-fold cross validation on the English training set.
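For readers unfamiliar with the metric, a minimal sketch of Kendall's τ is given below. It implements the tau-a variant, in which tied pairs count as neither concordant nor discordant; which tie-handling variant is used in the experiments is not stated, so this choice is an assumption, and the function name `kendall_tau_a` is hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1   # both rankings order the pair the same way
        elif product < 0:
            discordant += 1   # the rankings disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Here `x` would hold the induced sentiment scores and `y` the gold scores for the same words, so a value near 1 means the two lexicons rank words almost identically.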
Following Rothe et al. (2016), when inducing generic DA lexicons, we run a grid search on their train/validation sets to find α and λ. With the same settings, we additionally conduct an experiment on Japanese (JA Wiki), a non-segmented language, to show the universality of our approach. For EN Twitter (SemEval2015 10E), we tune our model on the trial (dev) set and report results on the test set. In all experiments, we search α ∈ {0.3, 0.4, 0.5, 0.6, 0.7}, λ ∈ {0.01, 0.1, 1}. Learning rate is 0.1, batch size 100, and the maximum number of updating steps 30,000.
Following Eskander and Rambow (2015), in machine-learning-based Twitter sentiment classification for each of the 13 languages, we find the optimum SVM (positive vs. negative tweet) hyperparameters (C and kernel) by running 5-fold cross validation on the training set.

Multilingual BPE Space Evaluation
We first evaluate the multilingual BPE space by carrying out the crosslingual verse sentiment classification experiment in Dufter et al. (2018). Two linear SVMs are trained on 2147 English training verses to classify the verse sentiment (positive vs. non-positive, i.e., POS, and negative vs. non-negative, i.e., NEG). A verse is represented as the TF-IDF weighted sum of the embeddings of its BPEs. We then conduct the crosslingual verse sentiment analysis -using the SVMs to classify 476 test verses of Dufter et al. (2018)'s 1664 editions in 1259 languages. Table 2 gives results averaged over 1664 editions. Word and Char are two multilingual spaces created by Dufter et al. (2018). For Word, whitespace tokenization is used to segment all editions. For Char, all editions are segmented to sequences of overlapping byte-ngrams (length n varies across languages, see Dufter et al. (2018)). Next, the S-ID method is utilized to create the two multilingual spaces.
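The verse representation can be sketched as follows. The exact TF-IDF variant is not specified in the text; the smoothed IDF used here is an assumption, as are the function names `idf_table` and `verse_vector` and the rule that BPEs without an embedding are skipped.

```python
import math
from collections import Counter


def idf_table(verses):
    """Smoothed inverse document frequency over BPE-segmented verses."""
    n = len(verses)
    df = Counter(bpe for verse in verses for bpe in set(verse))
    return {bpe: math.log((1 + n) / (1 + count)) + 1.0 for bpe, count in df.items()}


def verse_vector(verse, embeddings, idf):
    """TF-IDF weighted sum of the embeddings of a verse's BPEs."""
    dim = len(next(iter(embeddings.values())))
    vector = [0.0] * dim
    for bpe, tf in Counter(verse).items():
        if bpe in embeddings and bpe in idf:
            weight = tf * idf[bpe]
            for k, value in enumerate(embeddings[bpe]):
                vector[k] += weight * value
    return vector
```

Because all languages share one embedding space, a classifier trained on such vectors for English verses can be applied unchanged to vectors built from any other edition.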
The S-ID BPE space outperforms both S-ID Word and S-ID Char spaces. This observation meets our expectation: the data-driven BPE segmentation is superior to splitting on whitespace (Word) or overlapping byte-ngram segmentation (Char) for non-segmented languages like Japanese, whose PBC+ editions are not tokenized.
For the more challenging subtask POS, we find the biggest improvement of S-ID BPE over Word is for non-segmented languages like Classical Chinese (lzh), Japanese (jpn), Khmer (khm) and S'gaw Karen (ksw), as shown in Table 3 (left). For segmented languages, S-ID BPE delivers performance similar to S-ID Word, as shown in Table 3 (right). This observation also meets our expectation: many BPEs in segmented languages are essentially valid words.
These observations show the universality of our approach. The sentiment information derived from English is successfully transferred to heterogeneous languages without language-dependent preprocessing, even for non-segmented languages.

PBC+ ZS (Zero-Shot) Lexicon Evaluation
Sample entries in the English ZS lexicon are shown in Table 4 (left) as a qualitative evaluation.  two SVMs trained on English BPE embeddings perform strongly in a zero-shot crosslingual setting, and the resulting PBC+ ZS lexicons in difficult (morphologically rich, e.g., Czech; non-segmented, e.g., Japanese) languages encode clear sentiment information.  (Rothe et al., 2016).

Generic DA (Domain-Adapted) Lexicon Evaluation
Intrinsic evaluation: ranking correlation. We compute ranking correlation between our generic DA lexicons and gold standard lexicons. There are overlapping words between our PBC+ ZS lexicon BPEs and the validation/test sets used by Rothe et al. (2016); we discard these training words for a clean comparison.
Columns (i) and (ii) of Table 6 show that REG (§3.4) delivers results comparable to Densifier (ORTH) when using the same set of generic training words (GEN) in lexicon induction. However, our method is more efficient: there is no need to compute the expensive SVD after every batch update.
Comparing columns (ii) and (iii), we see a marginal decrease of τ, between .020 and .057, when GEN is replaced by PBC+ ZS lexicons. Note that PBC+ ZS lexicons have far fewer training BPEs than GEN (e.g., 343 vs. 4298 in JA Wiki); this may contribute to the decrease. These comparable results also reflect the correctness of PBC+ ZS lexicons.
We also use α = 0.4 and λ = 0.01, the optimal hyperparameter values found on the trial set of EN Twitter, to induce generic DA lexicons for the other languages. This is the common setting in real applications, since other languages most likely do not have validation sets available. Results are shown in column (iv). Compared with tuned results (PBC+/T), performance drops slightly as the hyperparameters are not tuned (PBC+/NT) for languages other than EN Twitter. Overall, the performance differences between GEN (based on generic gold standard lexicons) and PBC+ (based on PBC+ ZS lexicons) are small and τ correlations are high. The high quality of generic DA lexicons in these six diverse (morphologically rich and non-segmented) languages again shows the universality of our approach: no language-dependent preprocessing is needed.
Extrinsic evaluation: Twitter sentiment classification. Based on the subset of frequent words only, we use the top 10% most positive and most negative words for this evaluation. We compare with the closest work, the lexicons from Chen and Skiena (2014).
Two classification models are used: the word-count-based model COUNT (Chen and Skiena, 2014) and the machine-learning-based model ML (Eskander and Rambow, 2015). COUNT labels a tweet with the sentiment that has more word occurrences in the tweet (positive in case of ties). COUNT does not require training and the results are from all tweets for each language. In ML, the vector representation of a tweet is created according to Figure 3. Our generic DA lexicons support computing real-valued vectors in this way. Chen and Skiena (2014)'s lexicons are discrete (1/-1); we use these discrete values when applying ML to their lexicons. Finally, for each language, an SVM is trained on the 2-dimensional vectors. Table 7 shows results. The baseline accuracy is 0.5 for all experiments as our dataset is balanced. Rows Ours and C&S show results using our and Chen and Skiena (2014)'s lexicons respectively. As shown, the two sets of lexicons give comparable results in COUNT. But ML generally performs better than COUNT, and our lexicons give better classification results: our real-valued representation of tweets is superior to the discrete one computed with Chen and Skiena (2014)'s lexicons.

Table 7: Accuracy of Twitter sentiment classification with COUNT and ML, using Chen and Skiena (2014)'s lexicons (C&S) and ours (Ours). The last column is the mean over the 13 languages.
Model  Lexicon  sqi  bul  hrv  deu  hun  pol  por  rus  srp  slk  slv  spa  swe  mean
COUNT  C&S      .55  .57  .57  .61  .61  .55  .57  .54  .51  .55  .64  .54  .57  .57
COUNT  Ours     .50  .60  .60  .56  .64  .62  .53  .65  .50  .61  .57  .55  .63  .58
ML     C&S      .58  .59  .60  .62  .64  .56  .54  .56  .51  .57  .66  .53  .59  .58
ML     Ours     .54  .65  .65  .64  .66  .66  .54  .67  .51  .64  .59  .57  .64  .61
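The COUNT rule described above is simple enough to state in a few lines. The sketch below assumes tweets are already split into tokens that can be looked up in the lexicon; the function name `count_classify` and the toy word sets are hypothetical.

```python
def count_classify(tokens, positive_words, negative_words):
    """Label a tweet by which sentiment has more lexicon word occurrences.

    Ties, including tweets with no lexicon words at all, are labeled positive.
    """
    n_pos = sum(1 for token in tokens if token in positive_words)
    n_neg = sum(1 for token in tokens if token in negative_words)
    return "positive" if n_pos >= n_neg else "negative"


# Hypothetical toy lexicon entries.
positive_words = {"good", "love"}
negative_words = {"bad", "hate"}
```

Since the rule has no trainable parameters, it can be evaluated on every tweet of a language, which matches the remark that COUNT results are computed from all tweets.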
Overall, intrinsic and extrinsic evaluations on diverse languages demonstrate the high quality of our generic DA lexicons.

Evaluation of Universality
We further conduct automatic and human evaluations on 95 diverse languages to show the universality of our approach. We focus on intrinsic evaluation: verifying the correctness of PBC+ ZS lexicons with F1, and assessing the quality of generic DA lexicons using τ. The extrinsic evaluation, i.e., Twitter sentiment classification, is not feasible here due to missing human-annotated Twitter datasets in low-resource languages.
Automatic evaluation. Similar to Chen and Skiena (2014) and Abdaoui et al. (2017), we use Google Translate (GT) for automatic evaluation: given a non-English language L, we translate its PBC+ ZS lexicon and generic DA lexicon into English. Translated English lexicons are then evaluated against the gold English lexicon WHM.
GT supports 102 non-English languages. We omit ten languages that (i) are not covered by PBC+ (Corsican, Galician, Pashto, Yiddish); (ii) are covered in PBC+, but not in the alphabet used by GT (Malayalam); (iii) do not have public pretrained embeddings (Filipino, Hmong, Kyrgyz, Sesotho); or (iv) are very close to another language (we keep Croatian, but do not include Bosnian). We conduct separate experiments for Bokmål and Nynorsk, which are not distinguished by GT. Thus, we evaluate on 93 languages. When translating words to English, we discard entries where GT fails (i.e., output is identical to input). As GT requires the uploaded file to be small ( 1MB), we do the evaluation on 600 uniformly sampled words from the top 1% of frequent positive and negative words. For ten languages (Chichewa, Hausa, Hawaiian, Igbo, Lao, Maori, Samoan, Shona, Xhosa, Zulu) that have very small embedding training corpora (<5MB Wikipedia pages and articles) and vocabulary sizes (e.g., 5000 for Hausa), we sample 200 words at 10%. Table 8 shows results. We see that PBC+ ZS lexicons show high consistency with gold labels across all 93 languages (F1 columns), including morphologically rich languages like Czech and Turkish, and non-segmented languages like Japanese and Khmer. The generic DA lexicons show high correlation with gold labels (τ columns), with two exceptions. First, some languages have low-quality embeddings due to small embedding training corpora (e.g., Hawaiian: 998 KB; Igbo: 1014 KB) or because the training corpora apparently have low quality; e.g., the Luxembourgish embedding vocabulary contains a large number of French and German words, suggesting that it was trained on mixed text and that the genuine Luxembourgish part is small. Second, GT does not perform well for some of the languages; again, this is the case for Luxembourgish and also for Frisian.
To give an example from Luxembourgish for both problems: "vergloust" and its first nearest neighbor "verglousten" are translated by GT as "glowed" and "forget about it". We recommend using the higher-quality PBC+ ZS lexicon for these languages.
Apart from the above exceptions, both F1 and τ are reasonably high, evidencing that our universal approach is applicable to a broad range of typologically diverse languages.
We do human evaluation for Hiligaynon and Tibetan, languages not supported by GT.
There are no public pretrained embeddings for Hiligaynon. We train embeddings on a concatenation of texts from project Palito (Dita et al., 2009) and Jehovah's Witnesses e-books (www.jw.org). From the generic DA Hiligaynon and Tibetan lexicons, we uniformly sample 199 from the top 10% positive and negative frequent BPEs.
Two Tibetan scholars and three Hiligaynon speakers annotated these BPEs as positive, negative, neutral, or unclear, where the last category refers to cases where the intended word is not apparent from the BPE. We omit entries labeled as unclear and compute τ. Table 9 shows τ averaged over annotators. We see that our lexicons have consistent positive correlation with the human annotation in both languages.

Conclusion
We proposed a universal approach for sentiment lexicon induction. By creating a multilingual BPE embedding space for 1500+ languages, we successfully transfer sentiment to each language without language-dependent preprocessing. We created 1593 ZS (zero-shot) sentiment lexicons and showed for a subset that they are highly consistent with gold lexicons. To address the fact that the small-size ZS lexicons are specific to PBC+'s domain, we conduct domain adaptation and induce large-size generic DA (domain-adapted) lexicons for 200 languages. Extensive intrinsic and extrinsic, automatic and human evaluations on 95 languages confirm the correctness and good quality of our lexicons. We make our code and lexicons freely available to the community.
To induce generic lexicons, our approach requires generic embeddings, which are not always available for low-resource languages. Solving this problem is non-trivial as many low-resource languages have a limited amount of written text in electronic form (and in any form). In such cases, the PBC+ ZS lexicons can be utilized because they also have high quality.