Mega-COV: A Billion-Scale Dataset of 100+ Languages for COVID-19
Author: | Muhammad Abdul-Mageed, AbdelRahim Elmadany, El Moatez Billah Nagoudi, Dinesh Pabbi, Kunal Verma, Rannie Lin |
---|---|
Language: | English |
Publication year: | 2020 |
Subject: |
Social and Information Networks (cs.SI)
FOS: Computer and information sciences; Information retrieval; Computer Science - Computation and Language; Coronavirus disease 2019 (COVID-19); Computer science; Computer Science - Social and Information Networks; 02 engineering and technology; Mega; Annotation; 020204 information systems; Scale (social sciences); 0202 electrical engineering, electronic engineering, information engineering; 020201 artificial intelligence & image processing; Misinformation; Computation and Language (cs.CL) |
Source: | EACL |
Description: | We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available. |
Database: | OpenAIRE |