An adaptable sentence segmentation based on Indonesian rules

Autor: Johannes Petrus, Ermatita Ermatita, Sukemi Sukemi, Erwin Erwin
Rok vydání: 2023
Předmět:
Popis: Sentence segmentation that breaks textual data strings into individual sentences is an important phase in natural language processing (NLP). Each word in the string that is added a punctuation mark such as a period, question mark, or exclamation point, becomes the location for splitting the string. Humans can easily see the punctuation and split the string into sentences, but not machines. Basically, the three punctuation marks also perform other functions so that the sentence segmentation process must really be able to detect whether a word marked with punctuation is a sentence boundary or not. This research proposes a sentence segmentation system called segmentasi kalimat bahasa Indonesia (SKBI) or Indonesian language Sentence Segmentation by applying a set of rules and can be used in Indonesian texts and can be adapted for English. There are 34 rules built with a combination of 27 fairly complete features that contribute to this research. The experimental results for the Indonesian text show that the SKBI is able to achieve an F1-Score of 96.89% and 97.07% for English. Both need to be improved but now better than previous research.
Databáze: OpenAIRE