Popis: |
Computing textual similarity implies, at least, two aspects: how to represent items, and how to compare item representations (similarity functions). In this paper, we focus on the second problem, and we discuss the empirical properties of general similarity functions. We focus on the Information Contrast Model (ICM), a parameterized generalization of Pointwise Mutual Information (PMI) which has optimal theoretical properties but has not yet been thoroughly tested empirically. We propose an unsupervised parameter estimation criterion for ICM, and we study the empirical behavior of ICM with respect to traditional similarity functions over different representation models (bag of words and word embeddings) and a diverse set of textual similarity problems, including lexical similarity, sentence similarity, and short-text similarity. Our empirical results show that (i) the optimal values of the ICM parameter β always lie within the range predicted by the theory, 1 < β < 2, regardless of the task and the representation method chosen; and (ii) our proposed estimator β̂ closely matches the optimal empirical β value. In our experiments, the unsupervised method for fixing the ICM parameters efficiently predicts the optimal values, and ICM outperforms or at least matches the performance of traditional similarity functions. |
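The abstract presents ICM as a parameterized generalization of PMI governed by a single parameter β. The exact ICM formula is not stated here, so the sketch below rests on an assumption: that ICM(x, y) = β·(IC(x) + IC(y)) − IC(x, y), where IC is information content, so that β = 1 recovers PMI. The probabilities in the usage are illustrative, not from the paper.

```python
import math

def ic(p: float) -> float:
    """Information content in bits: IC(x) = -log2 p(x)."""
    return -math.log2(p)

def pmi(p_x: float, p_y: float, p_xy: float) -> float:
    """Pointwise Mutual Information: PMI(x, y) = log2(p(x,y) / (p(x) p(y)))."""
    return math.log2(p_xy / (p_x * p_y))

def icm(p_x: float, p_y: float, p_xy: float, beta: float = 1.5) -> float:
    """Assumed ICM form: beta * (IC(x) + IC(y)) - IC(x, y).

    With beta = 1 this reduces to PMI; the abstract reports that the
    empirically optimal beta lies in 1 < beta < 2 across tasks.
    """
    return beta * (ic(p_x) + ic(p_y)) - ic(p_xy)

# Illustrative marginal and joint probabilities (hypothetical values).
p_x, p_y, p_xy = 0.1, 0.2, 0.05
print(pmi(p_x, p_y, p_xy))           # PMI score
print(icm(p_x, p_y, p_xy, beta=1.0)) # equals PMI under the assumed form
print(icm(p_x, p_y, p_xy, beta=1.5)) # mid-range beta from the predicted interval
```

Under this assumed form, β weights the marginal information contents against the joint one, which is what makes the single parameter interpolate between PMI-like behavior (β = 1) and stronger penalization of rare-but-unrelated pairs.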