Algorithm to Correct the Bigram Method to Identify an Author's Text.

Authors: Voronina, M. Yu., Kislitsyn, A. A., Orlov, Yu. N.
Source: Mathematical Models & Computer Simulations; Apr 2023, Vol. 15 Issue 2, p245-254, 10p
Abstract: This paper proposes a model for recognizing the authors of literary texts based on the proximity of an individual text to the author's standard. The standard is the empirical frequency distribution of letter combinations, constructed from all reliably attributed works of the author. Proximity is measured in the L1 norm. The tested text is assigned to the author whose standard is closest to it. For identification, a library of authors is used, each of whom has a sufficiently large number of works to define the corresponding standard of two-letter combinations (bigrams). Testing this identification method on the authors of the library has shown it to be highly accurate. The analyzed corpus comprised 1783 texts by 100 authors, and the recognition error of the best method was 0.12. Importantly, after the erroneously recognized texts were excluded, a library of 88 authors and 1450 texts remained, each of which was identified correctly. The problem studied is estimating the probability that the library contains no standard for the author of the tested text. To solve it, the paper analyzes how the probability of erroneous identification depends on the length of the text. For a subgroup of texts that are identified without error, the empirical probability of correctly recognizing a text fragment decreases as the fragment gets shorter, yet still exceeds 0.5 when the text is split into as many as 10 parts. With smaller fragments, some are identified incorrectly. If the correct standard is excluded from consideration, the second-closest standard is assigned instead, but this assignment is unstable: the identification of the fragments' author becomes ambiguous when the text is cut into 4 fragments. Thus, the stability of the identification of the author of text fragments can be proposed as a criterion for the correctness of the method. [ABSTRACT FROM AUTHOR]
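The procedure summarized in the abstract (bigram standards, L1 proximity, nearest-standard assignment, and a fragment-stability check) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, non-letter characters are dropped before pairing letters (so bigrams may span word boundaries), fragments are equal-length character slices, and stability is taken as the share of fragments assigned to the same author as the whole text.

```python
from collections import Counter


def bigram_distribution(text):
    """Empirical frequency distribution of adjacent letter pairs.

    Assumption: case is folded and non-letters are dropped before pairing.
    """
    letters = [ch for ch in text.lower() if ch.isalpha()]
    counts = Counter(a + b for a, b in zip(letters, letters[1:]))
    total = sum(counts.values())
    return {bg: n / total for bg, n in counts.items()} if total else {}


def l1_distance(p, q):
    """L1 norm of the difference between two bigram distributions."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def build_standard(works):
    """Pool an author's known works into one distribution (the 'standard')."""
    return bigram_distribution(" ".join(works))


def rank_authors(text, standards):
    """Authors ordered from closest to farthest standard in L1 distance."""
    dist = bigram_distribution(text)
    return sorted(standards, key=lambda a: l1_distance(dist, standards[a]))


def identify(text, standards):
    """Assign the text to the author with the closest standard."""
    return rank_authors(text, standards)[0]


def fragment_stability(text, standards, parts):
    """Share of equal-length fragments assigned to the whole text's author."""
    size = len(text) // parts if parts > 0 else 0
    if size == 0:
        raise ValueError("text is too short for the requested number of parts")
    whole = identify(text, standards)
    # Any remainder characters after the last full fragment are ignored.
    fragments = [text[i * size:(i + 1) * size] for i in range(parts)]
    return sum(identify(f, standards) == whole for f in fragments) / parts
```

Here `standards` is assumed to be a dictionary mapping an author name to the result of `build_standard`. Removing an author's entry from that dictionary before calling `rank_authors` mimics the case in which the correct standard is absent from the library, so the second-closest standard is the one assigned.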
Database: Complementary Index