Coffee and Tea? A comparison of different methods in corpus-based natural language processing on 'coffee' and 'tea'

Author: Natchanun Sanitdee
Language: English
Year of publication: 2023
Subject:
DOI: 10.5281/zenodo.7759451
Description: Coffee and tea have been an integral part of many cultures around the world for centuries, and both words are widely used in everyday life. This paper compares different methods for analyzing the words 'coffee' and 'tea' in a corpus. The methods employed are frequency analysis, Pointwise Mutual Information (PMI) (Evert et al. 2016; Church et al. 1990), Log-Likelihood Ratio (LLR) (Gries 2009; Bybee et al. 2001), colligational analysis using Keyword in Context (KWIC), and the Vector Space Model (VSM). The Python programming language is used in Google Colaboratory to analyze data from COHA (the Corpus of Historical American English). There is also a brief review of topic modeling applied to GloWbE (the Corpus of Global Web-Based English). The comparison reveals that the KWIC method provides the most insightful results because it identifies the use of coffee and tea in the sense of a social gathering. It also reveals a specific domain of commercialization in which coffee appears, as in "coffee franchise" and "coffee stall", as well as the dominant structures in which people usually use coffee and tea: possessive noun phrases and direct objects. However, although tools like AntConc can help separate tokens into different positions, the colligational method requires the most effort: concordances must be prepared, and patterns must be identified, categorized, and analyzed. VSM reflects specific social and cultural contexts in which coffee and tea are used. For coffee, associated words such as beer, wine, liquor, and champagne point to consumption in social settings where alcoholic drinks are present; for tea, associated words such as breakfast, supper, and dinner indicate that tea is usually consumed with meals. Frequency analysis, PMI, and LLR yield a similar set of collocates, although each method scores and ranks them differently.
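To illustrate why PMI and LLR can rank the same collocates differently, the following is a minimal sketch, not the paper's own code. It assumes sentence-level co-occurrence (a collocate counts once per sentence that also contains the node word) and a toy corpus; the function name `collocation_scores` and the sample sentences are hypothetical, and other counting schemes (for example, a fixed token window) would give different values.

```python
import math
from collections import Counter


def collocation_scores(sentences, node):
    """Return {word: (joint_count, pmi, llr)} for words co-occurring with `node`.

    Assumption: co-occurrence is counted at sentence level, so every count
    is a number of sentences and the 2x2 contingency table is always valid.
    """
    n = len(sentences)
    sent_freq = Counter()   # number of sentences containing each word
    joint = Counter()       # number of sentences containing word AND node
    for sent in sentences:
        words = set(sent)
        sent_freq.update(words)
        if node in words:
            joint.update(words - {node})

    f_node = sent_freq[node]
    scores = {}
    for word, o11 in joint.items():
        f_word = sent_freq[word]
        # PMI: log2 of observed joint probability over the product of marginals.
        pmi = math.log2(o11 * n / (f_node * f_word))
        # Observed and expected cells of the 2x2 table (node yes/no x word yes/no).
        observed = [o11, f_node - o11, f_word - o11, n - f_node - f_word + o11]
        expected = [
            f_node * f_word / n,
            f_node * (n - f_word) / n,
            (n - f_node) * f_word / n,
            (n - f_node) * (n - f_word) / n,
        ]
        # Log-likelihood ratio (G^2); empty cells contribute nothing.
        llr = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
        scores[word] = (o11, pmi, llr)
    return scores


toy_sentences = [
    ["a", "cup", "of", "coffee"],
    ["a", "cup", "of", "tea"],
    ["coffee", "and", "cake"],
    ["the", "cat", "sat"],
]
for word, (count, pmi, llr) in sorted(collocation_scores(toy_sentences, "coffee").items()):
    print(f"{word:6s} count={count} PMI={pmi:.2f} LLR={llr:.2f}")
```

In this toy data every collocate has the same raw frequency with the node, yet PMI and LLR still separate them, because both measures also depend on how often each word occurs without the node.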
PMI is observed to underestimate low-frequency collocates such as 'have', which receives a score of 0, but it is not found to overestimate high-frequency collocates such as 'drink' with coffee and 'cup' with tea. In addition, LLR captures approximately 15% more stop words than PMI. Frequency analysis is the only method that can illustrate trends in language use over time, based on the number of occurrences; it suggests that coffee gains popularity while tea loses it. Lastly, topic modeling appears to be sensitive to dataset size in that it captures many stop words in large datasets. This is an interesting observation because stop words are normally filtered out during coding and therefore should not appear in the results. In general, however, topic modeling works better with large datasets (Isoaho, Gritsenko & Mäkelä 2019; Ylä-Anttila 2016).
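The KWIC step mentioned in the description can be sketched as follows. This is an illustrative example only, assuming a pre-tokenized text and a symmetric context window; the function name `kwic`, the `width` parameter, and the sample sentence are hypothetical and do not come from the paper or from AntConc.

```python
def kwic(tokens, node, width=3):
    """Return (left context, keyword, right context) for each hit of `node`.

    Assumption: matching is case-insensitive and the context is a fixed
    number of tokens on each side, truncated at the text boundaries.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


sample = "she poured a cup of coffee for her guests".split()
for left, keyword, right in kwic(sample, "coffee", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{keyword}] {right}")
```

Sorting the returned tuples by the word immediately left or right of the keyword is one simple way to surface recurring patterns, such as the possessive and direct-object structures the description reports.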
References:
Barnbrook, Geoff, David Mason & Roland Krishnamurthy. 2017. Collocation: applications and implications. Routledge.
Brezina, Vaclav, Tony McEnery & Susan Wattam. 2015. Collocations in context: A new perspective on collocation networks. Routledge.
Bybee, Joan L. & Paul J. Hopper. 2001. Introduction to frequency and the emergence of linguistic structure. John Benjamins Publishing.
Church, Kenneth W. & Hanks, Patrick. 1990. Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics, 16(1), 22-29.
Collins English Dictionary. (n.d.). Tea definition. Retrieved March 18, 2023, from https://www.collinsdictionary.com/dictionary/english/tea
de Bolla, Peter, Etienne Jones, Paul Nulty, Giacomo Recchia & James Regan. 2016. The Idea of Liberty, 1600–1800: A Distributional Concept Analysis. Stanford University Press.
Evert, Stefan & Andrew Hardie. 2010. Twenty-first century Corpus Workbench: Updating a query architecture for the new millennium. In Proceedings of the Corpus Linguistics 2011 Conference, 183-193.
Evert, Stefan & Krennmayr, Tina. 2016. Methods for corpus linguistics. In J. Conzett & S. Mair (Eds.), Methods in Contemporary Linguistics (pp. 1-25). De Gruyter Mouton.
Firth, John Rupert. 1957. A synopsis of linguistic theory 1930-1955. Studies in Linguistic Analysis, 1-32.
Gries, Stefan Th. 2009. Dispersions and adjusted frequencies in corpora: further explorations. Language and Computers, 69-80.
Gries, Stefan Th. 2009. Quantitative Corpus Linguistics with R: A Practical Introduction. Routledge.
Gries, Stefan Th. 2013. Statistics for linguistics with R: A practical introduction. Walter de Gruyter.
Gabrielatos, Costas. 2007. Keyness Analysis: nature, metrics and techniques. University of Lancaster.
Manning, Christopher D. & Hinrich Schütze. 1999. Foundations of statistical natural language processing. MIT Press.
Sinclair, John. 1991. Corpus, concordance, collocation. Oxford: Oxford University Press.
van Eijnatten, Joris & Peter Huijnen. 2021. Something Happened to the Future: Reconstructing Temporalities in Dutch Parliamentary Debate, 1814-2018. Brill.
Ylä-Anttila, Tuukka. 2016. Topic modeling for frame analysis: A study of media debates on climate change in India and USA. Public Understanding of Science, 25(2), 202-216. doi: 10.1177/0963662514538644.
Isoaho, K., Gritsenko, D. & Mäkelä, E. 2019. Topic modelling and text analysis for qualitative policy research. Policy Design and Practice, 2(3), 283-299. doi: 10.1080/25741292.2019.1629976.
Database: OpenAIRE