Heterogeneous document embeddings for cross-lingual text classification

Authors:	Moreo, Alejandro; Pedrotti, Andrea; Sebastiani, Fabrizio
Publication year:	2021
Subject:	Cross lingual; Computer science; business.industry; Feature vector; Posterior probability; 020207 software engineering; Pattern recognition; 02 engineering and technology; Ensemble learning; Class (biology); Transfer learning; ComputingMethodologies_PATTERNRECOGNITION; Dimension (vector space); Word embeddings; 020204 information systems; Text classification; 0202 electrical engineering, electronic engineering, information engineering; Enhanced Data Rates for GSM Evolution; Artificial intelligence; business; Transfer of learning; Heterogeneous transfer learning
Source:	SAC 2021: 36th ACM/SIGAPP Symposium On Applied Computing, pp. 685–688, Online conference, 22-26/03/2021; SAC '21: Proceedings of the 36th Annual ACM Symposium on Applied Computing
DOI:	10.1145/3412841.3442093
Description:	Funnelling (Fun) is a method for cross-lingual text classification (CLC) based on a two-tier ensemble for heterogeneous transfer learning. In Fun, first-tier classifiers, each working on a different, language-dependent feature space, return for each document a vector of calibrated posterior probabilities (with one dimension for each class), and the final classification decision is taken by a metaclassifier that uses this vector as its input. The metaclassifier can thus exploit class-class correlations, and this (among other things) gives Fun an edge over CLC systems in which these correlations cannot be leveraged. Here we describe Generalized Funnelling (gFun), a learning ensemble in which the metaclassifier receives as input the above vector of calibrated posterior probabilities, concatenated with document embeddings (aligned across languages) that embody other types of correlations, such as word-class correlations (as encoded by Word-Class Embeddings) and word-word correlations (as encoded by Multilingual Unsupervised or Supervised Embeddings). We show that gFun improves on Fun by describing experiments on two large, standard multilingual datasets for multi-label text classification.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::6dbede8e63d37e5f652c426a613197da https://doi.org/10.1145/3412841.3442093
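
The description above says that the gFun metaclassifier receives the first-tier posterior probabilities concatenated with language-aligned document embeddings. The following is a minimal sketch of that concatenation step only, not the authors' implementation. The function name `meta_features`, the toy first-tier classifiers, and `toy_embedding` are all hypothetical stand-ins; real calibrated classifiers and real Word-Class or MUSE embeddings are not reproduced here.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def meta_features(
    doc: str,
    lang: str,
    first_tier: Dict[str, Callable[[str], Sequence[float]]],
    embedders: Sequence[Callable[[str, str], Sequence[float]]],
) -> Vector:
    """Build the metaclassifier input for one document.

    first_tier maps a language code to that language's classifier, which is
    assumed to return one calibrated posterior probability per class.
    embedders is a list of functions assumed to return document embeddings
    that are already aligned across languages.
    With no embedders, the result is the plain Fun input (posteriors only).
    """
    views: List[Vector] = [list(first_tier[lang](doc))]
    for embed in embedders:
        views.append(list(embed(doc, lang)))
    # Concatenate all views into one language-independent vector.
    return [value for view in views for value in view]


# Toy stand-ins (assumptions for illustration): two classes, two languages.
first_tier = {
    "en": lambda doc: [0.9, 0.1],
    "it": lambda doc: [0.2, 0.8],
}


def toy_embedding(doc: str, lang: str) -> Vector:
    # Hypothetical 3-dimensional "aligned" embedding; not a real WCE or MUSE vector.
    return [float(len(doc.split())), 1.0, 0.0]


fun_vec = meta_features("a short text", "en", first_tier, [])
gfun_vec = meta_features("un testo breve", "it", first_tier, [toy_embedding])
print(len(fun_vec), len(gfun_vec))
```

Because every document, whatever its language, ends up in the same fixed-dimensional space (number of classes plus the embedding dimensions), a single metaclassifier can be trained on documents from all languages at once.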