Description: |
The COVID-19 pandemic exposed weaknesses in healthcare systems around the world and revealed the importance of efficient biosurveillance systems that can monitor disease outbreaks in real time. Event-based health surveillance systems are popular because they can use health information from internet sources, such as digital newspapers and social networking sites, for early detection of outbreaks. Studies claim that all deadly outbreaks declared by the WHO were first detected through these informal online sources. Unfortunately, existing systems do not provide actionable data for outbreak prevention. Action plans for handling outbreaks can be developed only if region-specific data is available. The proposed study aims to detect local or regional outbreaks, specifically in Kerala, by automatically extracting and examining internet media reports covering Kerala news. In this paper, various methods for retrieving outbreak news from news portals are studied, and a novel method is proposed for retrieving disease-related news items using machine learning (ML) techniques, implemented with various text classification algorithms. A major contribution of the proposed work is a modified term weighting approach that improves classification accuracy. The traditional TF-IDF term weighting algorithm does not consider the significance of a term in a particular domain. The Random Forest classifier achieved a maximum accuracy of 94.48% with TF-IDF weighting, which improved to 100% with our modified term weighting scheme, in which the significance of a term with respect to a particular domain is also considered when determining its weight during vectorization. |
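The abstract does not give the formula for the modified term weighting scheme, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical domain factor: the fraction of documents containing a term that are labelled disease-related. That factor scales a standard smoothed TF-IDF weight. The function name `domain_weighted_tfidf`, the `(1 + significance)` multiplier, and the sample headlines are all illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def domain_weighted_tfidf(docs, labels, domain_label=1):
    """Return one {term: weight} dict per document.

    weight = tf * idf * (1 + significance), where significance is the share
    of documents containing the term that carry `domain_label`.
    This domain factor is an illustrative assumption, not the paper's formula.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    df = Counter()         # number of documents containing each term
    domain_df = Counter()  # same count, restricted to domain documents
    for tokens, label in zip(tokenized, labels):
        for term in set(tokens):
            df[term] += 1
            if label == domain_label:
                domain_df[term] += 1

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {}
        for term, count in counts.items():
            tf = count / len(tokens)
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1  # smoothed IDF
            significance = domain_df[term] / df[term]          # in [0, 1]
            vec[term] = tf * idf * (1 + significance)
        vectors.append(vec)
    return vectors


# Hypothetical sample headlines: 1 = disease-related, 0 = other news.
docs = [
    "dengue fever outbreak reported in kochi",
    "nipah virus outbreak alert in kozhikode",
    "election rally reported in kochi",
    "cricket match alert in kozhikode",
]
labels = [1, 1, 0, 0]
vectors = domain_weighted_tfidf(docs, labels)
```

In this toy corpus, "outbreak" and "reported" have identical TF and IDF in the first headline, but "outbreak" appears only in disease-related documents, so it receives the larger weight. Plain TF-IDF cannot make that distinction. The resulting vectors could then be passed to any classifier, such as a Random Forest.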