Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19

Authors:	Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Dinesh Pabbi, Kunal Verma, Rannie Lin
Language:	English
Year of publication:	2020
Subject:	Social and Information Networks (cs.SI); FOS: Computer and information sciences; Information retrieval; Computer Science - Computation and Language; Coronavirus disease 2019 (COVID-19); Computer science; Computer Science - Social and Information Networks; 02 engineering and technology; Mega; Annotation; 020204 information systems; Scale (social sciences); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Misinformation; Computation and Language (cs.CL)
Source:	EACL
Description:	We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide host of phenomena related to the pandemic. Mega-COV and our models are publicly available.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::dc53ef5d979e59285101eca7f731960c http://arxiv.org/abs/2005.06012