Popis: |
In this work, we aim to predict human intuitions about traits of fictional characters (gender, polarity, and age) from their names alone, attempting to leverage systematic sound-symbolic associations, i.e., non-arbitrary relationships between the sound and the meaning of a word, where aspects of form are associated with certain aspects of meaning. Building on previous studies on the sound symbolism of names [1-3], we first investigate whether letter 1-grams [4,5] allow us to reliably predict human intuitions about fictitious characters' attributes from names only. We then extract word embeddings, i.e., distributed semantic representations, leveraging lexical co-occurrences and character n-grams using FastText (FT [6]). We analyze different types of names (made-up, e.g., Morgra; real, e.g., John; and talking, i.e., relying on existing English words, e.g., Bolt) to assess whether sound symbolism equally impacts intuitions when names do or do not rely on words with established semantics [7].

We derived 63 target names (23 made-up, 20 talking, 20 real) from a fantasy fan-fiction corpus crawled from Archive of Our Own (AOOO names) and 119 (38 made-up, 41 talking, 40 real) from a corpus of children's and Young Adult books (YA names). AOOO names were manually tagged as referring to male/female and good/evil characters; YA names were manually tagged as referring to male/female and young/old characters. We then ran an online survey on Prolific (approved by the Tilburg University Research Ethics and Data Management Committee), asking 300 English monolingual participants to drag a slider bar (anchored between -50 and 50) to indicate how well a name would fit a good/evil, male/female, or young/old character. To investigate the predictive power of surface-form features, names were featurized as count vectors of letter 1-grams. Moreover, we trained two FT models on the Corpus of Contemporary American English (COCA [8]) to derive word embeddings for all names. To tease apart the role of lexical co-occurrences from that of sub-lexical patterns, one FT model was trained leveraging only lexical co-occurrences (FTco-occ), while the other was also trained using character n-grams (FTngrams). We predicted participant ratings using Random Forest (RF) and Neural Network (NN) regressors, and evaluated performance using Mean Absolute Error (MAE).

Results (see Fig. 1) indicate that the NNs trained on FTngrams embeddings outperformed the RF models trained on letter 1-grams on all three dimensions. The FTngrams embeddings performed well when predicting intuitions about gender (R2 = 0.63), best predicting real names (MAE = 16.80), followed by talking names (MAE = 17.11) and made-up names (MAE = 21.20). Performance drops for age (R2 = 0.20) and polarity (R2 = 0.19). Similarly, FTco-occ performed well when predicting gender (R2 = 0.51), best predicting real names (MAE = 15.96) followed by talking names (MAE = 18.04); performance decreases for polarity (R2 = 0.14) and age (R2 = 0.009). Letter 1-grams did not perform well for polarity (R2 = 0.06), best predicting real names (MAE = 14.71) followed by talking names (MAE = 15.22) and made-up names (MAE = 18.50), and did even worse for gender (R2 = -0.007) and age (R2 = -0.005). In general, embeddings built on sub-lexical patterns better predict human intuitions about characters' features from names alone.
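To make the modeling pipeline concrete, the following minimal sketch illustrates the letter 1-gram baseline: names are featurized as count vectors of single characters and a Random Forest regressor is evaluated with MAE on one rating dimension. The names, ratings, split, and hyperparameters are placeholders for illustration only and are not taken from the study.

    # Minimal sketch of the letter 1-gram baseline (placeholder data).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    # Hypothetical names and mean slider ratings (-50..50) for one dimension, e.g. gender.
    names = ["Morgra", "John", "Bolt", "Vexia", "Anna", "Shadow"]
    ratings = [-12.0, 35.0, 20.0, -25.0, -40.0, 5.0]

    # Featurize each name as a count vector of letter 1-grams (single characters).
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 1), lowercase=True)
    X = vectorizer.fit_transform(names).toarray()

    X_train, X_test, y_train, y_test = train_test_split(X, ratings, test_size=0.3, random_state=0)

    # Random Forest regressor predicting ratings from letter counts.
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(X_train, y_train)

    # Evaluate with Mean Absolute Error, the metric reported above.
    print("MAE:", mean_absolute_error(y_test, rf.predict(X_test)))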
Our results suggest that, when predicting people's intuitions about fictional characters' features from their names, semantic representations learned from sub-lexical patterns are more informative than both surface features and semantic representations learned from co-occurrence patterns alone. This applies regardless of the target attribute, suggesting that gender, polarity, and age are all encoded in embedding spaces and can be picked up by a simple classifier in a probing task, approximating human intuitions about characters' features elicited from names alone. Contrary to expectations [9], however, the performance of FTngrams tends to be better than that of FTco-occ for real and talking names, suggesting that even when names rely on an established meaning, sub-lexical patterns are still informative about names' attributes. Finally, and most importantly, we observe that made-up names (akin to pseudowords) entertain informative semantic relations with existing words in a shared representational space [10], suggesting that people can construct informative semantic representations from form alone by exploiting form-meaning regularities in language.

FT: We sentence-tokenized COCA and removed stop words and non-alphabetic strings. Both FT models were trained with 300 dimensions and a window size of 2. FTco-occ was trained with a minimum and maximum n-gram size of 0, while FTngrams was trained with a minimum n-gram size of 2 and a maximum n-gram size of 5. For made-up names in FTco-occ, which have no vector in this model, we used the mean vector over the whole vocabulary: the mean is the best guess when no contextual information is available.

NNs: All neural networks had one hidden layer with ReLU activation and a final layer with linear activation, and were trained with the Adam optimizer and early stopping (patience = 3) on validation loss (mean squared error). Using a grid search, we found the optimal number of nodes and dropout rate for age (nodes = 300, dropout = 0.5), gender (nodes = 512, dropout = 0.3), and polarity (nodes = 512, dropout = 0.5). A code sketch of this FT and NN setup follows the references.

References:
[1] Elsen, H. (2018). Some proper names are more equal than others. The sound symbolic value of new names. Cahiers de lexicologie, 113(2), 79-94.
[2] Smith, R. (2006). Fitting Sense to Sound: Linguistic Aesthetics and Phonosemantics in the Work of J.R.R. Tolkien. Tolkien Studies, 3(1), 1-20.
[3] Sidhu, D. M., & Pexman, P. M. (2019). The Sound Symbolism of Names. Current Directions in Psychological Science, 28(4), 398-402.
[4] Monaghan, P., & Fletcher, M. (2019). Do sound symbolism effects for written words relate to individual phonemes or to phoneme features? Language and Cognition, 11(2), 235-255.
[5] Westbury, C., Hollis, G., Sidhu, D. M., & Pexman, P. M. (2017). Weighing up the evidence for sound symbolism: Distributional properties predict cue strength. Journal of Memory and Language, 99, 122-150.
[6] Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135-146.
[7] Sabbatino, V., Troiano, E., Schweitzer, A., & Klinger, R. (2022). "splink" is happy and "phrouth" is scary: Emotion Intensity Analysis for Nonsense Words. arXiv preprint arXiv:2202.12132.
[8] Davies, M. (2008-). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present.
[9] Sidhu, D. M., & Pexman, P. M. (2015). What's in a Name? Sound Symbolism and Gender in First Names. PLOS ONE, 10(5), 1-22.
[10] Cassani, G., Chuang, Y. Y., & Baayen, R. H. (2020). On the Semantics of Nonwords and Their Lexical Category. Journal of Experimental Psychology: Learning, Memory, and Cognition, 46(4), 621-637. doi:10.1037/xlm0000747
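As a concrete illustration of the FT and NN setup described above, the sketch below uses gensim for FastText and Keras for the regressor. The choice of toolkits, the placeholder corpus, the example names and ratings, and training details such as epochs and batch size are assumptions for illustration only and are not taken from the study.

    # Sketch of the FT + NN pipeline under the assumptions stated above.
    import numpy as np
    from gensim.models import FastText
    from tensorflow import keras
    from tensorflow.keras import layers

    # Placeholder for the preprocessed COCA: sentence-tokenized, stop words and
    # non-alphabetic tokens removed (the real corpus is far larger).
    coca_sentences = [["knight", "rode", "north"], ["queen", "ruled", "wisely"]]

    # FTco-occ: lexical co-occurrences only. The abstract reports a min/max n-gram
    # size of 0; in gensim, character n-grams are disabled by setting max_n below min_n.
    ft_cooc = FastText(sentences=coca_sentences, vector_size=300, window=2,
                       max_n=0, min_count=1, epochs=10)
    # FTngrams: co-occurrences plus character n-grams of length 2-5.
    ft_ngrams = FastText(sentences=coca_sentences, vector_size=300, window=2,
                         min_n=2, max_n=5, min_count=1, epochs=10)

    # Mean vector over the whole vocabulary: fallback for made-up names in FTco-occ.
    mean_vec = ft_cooc.wv.vectors.mean(axis=0)

    def name_vector(model, name, fallback=None):
        token = name.lower()
        if token in model.wv.key_to_index:   # in-vocabulary: learned vector
            return model.wv[token]
        if fallback is not None:             # FTco-occ: vocabulary mean for OOV names
            return fallback
        return model.wv[token]               # FTngrams: composed from character n-grams

    # Hypothetical target names and mean slider ratings for one dimension (gender).
    names = ["Morgra", "John", "Bolt"]
    ratings = np.array([-12.0, 35.0, 20.0])
    X = np.vstack([name_vector(ft_ngrams, n) for n in names])

    # One hidden layer (ReLU) + dropout + linear output, Adam on MSE, early stopping
    # with patience 3 on validation loss; nodes/dropout follow the grid-search values
    # reported above (gender: 512 nodes, 0.3 dropout).
    def build_regressor(input_dim, nodes, dropout):
        model = keras.Sequential([
            keras.Input(shape=(input_dim,)),
            layers.Dense(nodes, activation="relu"),
            layers.Dropout(dropout),
            layers.Dense(1, activation="linear"),
        ])
        model.compile(optimizer="adam", loss="mse")
        return model

    early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
    nn = build_regressor(300, nodes=512, dropout=0.3)
    nn.fit(X, ratings, validation_split=0.34, epochs=100, batch_size=2,
           callbacks=[early_stop], verbose=0)
    print(nn.predict(X, verbose=0).ravel())

The vocabulary-mean fallback mirrors the rationale stated in the FT paragraph: with no contextual information for a made-up name in the co-occurrence-only model, the mean vector is the least committal guess.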