Authors: |
Wojciech Dobrowolski, Mikolaj Libura, Maciej Nikodem, Olgierd Unold |
Language: |
English |
Year of publication: |
2024 |
Subject: |
|
Source: |
IEEE Access, Vol 12, Pp 79003-79013 (2024) |
Document type: |
article |
ISSN: |
2169-3536 |
DOI: |
10.1109/ACCESS.2024.3409425 |
Description: |
In automated log analysis, a log sequence is often treated as a language, which implies that it should have a structure of words and sentences. However, the literature does not define words in a log sequence uniformly. One approach splits the sequence line by line; the other retrieves word-like structures from the log sequence. The main challenge in the second approach is measuring the quality of the results. Unsupervised metrics have been proposed, but we found them to be inconsistent. Other methods rely on manually prepared gold standards, yet no gold-standard segmentation benchmark is available for any set of logs. To overcome this problem, we created a benchmark of preprocessed log event IDs gathered from the open-source CloudStack log and from commercial Nokia software execution. We created a gold-standard segmentation with the help of a human expert and made it publicly available. We then tested known unsupervised segmentation methods used for log sequence segmentation and adapted the Nested Pitman-Yor Language Model. We found that the segmentation results of these methods differ significantly between the natural language domain and the log domain. VotingExperts achieved the best F-score, recording 97.3% for CloudStack and 44.1% for Nokia logs. The results are related to the unigram entropy of the log sequence, which differs across software platforms. |
Database: |
Directory of Open Access Journals |
External link: |
|
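
The description above refers to two quantities: the F-score of a segmentation against a gold standard, and the unigram entropy of a log sequence. The sketch below illustrates one common way to compute both for a sequence of event IDs. It is not taken from the article: the function names (`unigram_entropy`, `boundaries`, `boundary_f_score`) are hypothetical, and the boundary-level definition of the F-score is an assumption, since the abstract does not state which variant the authors used.

```python
import math
from collections import Counter


def unigram_entropy(sequence):
    """Shannon entropy, in bits, of the empirical unigram distribution."""
    counts = Counter(sequence)
    total = len(sequence)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def boundaries(segments):
    """Internal boundary positions implied by a list of segments."""
    positions, offset = set(), 0
    for segment in segments[:-1]:  # the end of the sequence is not a boundary
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_f_score(predicted, gold):
    """Harmonic mean of boundary precision and recall (assumed definition)."""
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    hits = len(pred_b & gold_b)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_b)
    recall = hits / len(gold_b)
    return 2 * precision * recall / (precision + recall)


# Toy event-ID sequence: 1 2 3 1 2 3 4 5 (made-up data, not from the benchmark)
gold = [[1, 2, 3], [1, 2, 3], [4, 5]]
predicted = [[1, 2, 3], [1, 2], [3, 4, 5]]
score = boundary_f_score(predicted, gold)
entropy = unigram_entropy([1, 2, 3, 1, 2, 3, 4, 5])
```

In this toy case the predicted segmentation recovers one of the two gold boundaries and adds one spurious boundary, so precision and recall are both one half. Intuitively, a sequence with higher unigram entropy has a flatter distribution of event IDs, which is one plausible reason segmentation quality could vary across platforms, as the description reports.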