Blog annotation: from corpus analysis to automatic tag suggestion
Authors: | Adeline Nazarenko, François Lévy, Ivan Garrido-Marquez, Jorge García Flores |
---|---|
Contributors: | Laboratoire d'Informatique de Paris-Nord (LIPN), Université Paris 13 (UP13)-Institut Galilée-Université Sorbonne Paris Cité (USPC)-Centre National de la Recherche Scientifique (CNRS), ANR-11-IDEX-0005, EFL, Empirical Foundations of Linguistics: data, methods, models (2011), This work is supported/partially supported by a public grant overseen by the French National Research Agency (ANR) as part of the program 'Investissements d’Avenir' (reference: ANR-10-LABX-0083), Lévy, François, Nazarenko, Adeline, Université Sorbonne Paris Cité - - USPC2011 - ANR-11-IDEX-0005 - IDEX - VALID, Sansonetti, Morgane, Pascale Fung, Tomas Mikolov, Simone Teufel, Piek Vossen |
Language: | English |
Publication year: | 2016 |
Subject: |
[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI]; [INFO.INFO-TT] Computer Science [cs]/Document and Text Processing; [SHS.LANGUE] Humanities and Social Sciences/Linguistics
Corpus analysis; Tag suggestion; Information retrieval; Exploit; Computer science; Annotation; Lexical frequency; General Medicine; World Wide Web; Diachronic analysis; Categorization; Mainstream; Spam blog; ComputingMilieux_MISCELLANEOUS |
Source: | Research in Computing Science, National Polytechnic Institute, 2016, Special Issue: Advances in Opinion Mining, Social Network Analysis, and Authorship Attribution, pp. 95-106; 17th International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2016), Pascale Fung; Tomas Mikolov; Simone Teufel; Piek Vossen, Apr 2016, Konya, Turkey; HAL |
ISSN: | 1870-4069 |
Description: | International audience; Nowadays, blogs reach a large audience and have risen from the underground to become part of mainstream media. Blogs contain information on diverse topics, personal opinions, and discussions between bloggers and readers. Tags and categories are structural elements of a blog post that increase the blog's visibility and enhance navigation and searching within the blog history. We hypothesize that these annotations are made on subjective grounds rather than in a systematic way. Although tools exist to help bloggers tag and categorize their posts, we still do not know to what extent these tools take into account information contained in previous posts. This paper presents an 11-million-word corpus of blog posts in French dedicated to studying these questions, and an experiment in tag and category prediction. Preliminary results show that around 27% of the overall tags can be predicted from lexical frequency analysis of blog posts. However, a first comparison experiment with an existing tag suggestion tool shows that a significant proportion of the tags used for blog description are not present in the blog post. This suggests that tag suggestion tools should exploit the diachronic analysis of blogs. |
Database: | OpenAIRE |
External link: |
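
The abstract reports that part of the tags can be predicted from lexical frequency analysis of blog posts, and that many tags never occur in the post text. The record does not describe the authors' actual method, so the following is only a minimal sketch of the general idea: suggest the most frequent non-stopword tokens of a post as tags, then measure how many author-assigned tags are recovered. The function names, the tiny stopword list, the length filter, and the sample post are all illustrative assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real French list would be far larger.
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une",
             "et", "en", "pour", "est", "dans"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters (accents are kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def suggest_tags(post_text, k=3):
    """Return the k most frequent non-stopword tokens of a post as candidate tags."""
    counts = Counter(
        token for token in tokenize(post_text)
        if token not in STOPWORDS and len(token) > 2  # drop very short tokens
    )
    return [word for word, _ in counts.most_common(k)]


def tag_recall(suggested, author_tags):
    """Fraction of the author's tags that appear among the suggested tags."""
    author = {tag.lower() for tag in author_tags}
    if not author:
        return 0.0
    return len(author & set(suggested)) / len(author)


if __name__ == "__main__":
    # Hypothetical post and author tags, invented for illustration only.
    post = ("La recette du gateau au chocolat : le chocolat noir donne "
            "au gateau un gout intense. Recette facile, chocolat fondant.")
    suggestions = suggest_tags(post, k=3)
    print(suggestions)
    print(tag_recall(suggestions, ["chocolat", "dessert"]))
```

In this made-up example the author tag "dessert" does not occur anywhere in the post, so no purely frequency-based suggestion can recover it. That is the kind of gap the abstract points to when it argues for exploiting the blog's history as well.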