Description: |
This chapter introduces the representation of texts as elements of feature spaces, together with exploratory tools for studying such representations. It investigates how students of the humanities can discover groups of topically similar texts in a large textual collection and how the recurring themes that give rise to this similarity can be detected. Concepts including feature space, bag of words, cosine similarity, document collection, document vector, and document–term matrix are explained through a number of simplified examples; the key engineering procedures in the construction of feature spaces (feature selection and feature scoring) are also introduced. A more complex application example, a study of the Anglo-Saxon Chronicle, demonstrates how to detect similar records and recurring themes using exploratory methods such as dimensionality reduction, clustering, and topic modelling. The chapter concludes by pointing out the limitations of feature space representations and by defining what topical similarity means in the context of language technology. |