Blog annotation: from corpus analysis to automatic tag suggestion

Authors: Adeline Nazarenko, François Lévy, Ivan Garrido-Marquez, Jorge García Flores
Contributors: Laboratoire d'Informatique de Paris-Nord (LIPN), Université Paris 13 (UP13)-Institut Galilée-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS), ANR-11-IDEX-0005, EFL, Empirical Foundations of Linguistics: data, methods, models (2011), Université Sorbonne Paris Cité (USPC)-Institut Galilée-Université Paris 13 (UP13)-Centre National de la Recherche Scientifique (CNRS), This work is supported/partially supported by a public grant overseen by the French National Research Agency (ANR) as part of the program 'Investissements d’Avenir' (reference: ANR-10-LABX-0083), Lévy, François, Nazarenko, Adeline, Université Sorbonne Paris Cité - - USPC2011 - ANR-11-IDEX-0005 - IDEX - VALID, Sansonetti, Morgane, Pascale Fung, Tomas Mikolov, Simone Teufel, Piek Vossen
Language: English
Year of publication: 2016
Subject:
Source: Research in Computing Science
Research in Computing Science, National Polytechnic Institute, 2016, Special Issue: Advances in Opinion Mining, Social Network Analysis, and Authorship Attribution, pp.95-106
17th International Conference on Intelligent Text Processing and Computational Linguistics, Apr 2016, Konya, Turkey
HAL
17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2016), Pascale Fung; Tomas Mikolov; Simone Teufel; Piek Vossen, Apr 2016, Konya, Turkey
ISSN: 1870-4069
Description: International audience; Blogs now reach a large audience and have risen from the underground to become part of mainstream media. Blogs contain information on diverse topics, personal opinions, and discussions between bloggers and readers. Tags and categories are structural elements of a blog post that increase the blog's visibility and improve navigation and search within the blog history. We hypothesize that these annotations are made on subjective grounds rather than systematically. Although tools exist to help bloggers tag and categorize their posts, it is still unclear to what extent these tools take into account information contained in previous posts. This paper presents an 11-million-word corpus of blog posts in French dedicated to studying these questions, and an experiment in tag and category prediction. Preliminary results show that around 27% of all tags can be predicted from lexical frequency analysis of blog posts. However, a first comparison experiment with an existing tag suggestion tool shows that a substantial proportion of the tags used to describe blogs do not appear in the blog post itself. This shows that tag suggestion tools should exploit the diachronic analysis of blogs.
Database: OpenAIRE