Popis: |
Single-nucleotide polymorphisms (SNPs) are genetic variants that are widely used as variables in genome-wide association studies (GWAS), particularly in studies of genetic disorders. A distinctive trait of SNPs is their sheer number, since they arise at many positions along a DNA sequence. Unfortunately, the number of samples investigated is usually far smaller than the number of SNPs, so over-fitting often occurs when constructing a predictive model that classifies a sample as a case or a control. This study investigated a dataset on beta-thalassemia, a genetic disorder common in the Thai population. The samples in the dataset are divided into two groups: severe and mild. The aims of the study were to develop and evaluate methods for screening and ranking SNPs related to this disorder. The screening methods tested were the chi-squared test (χ2), Information Gain, and Gradient Boosting (GB). The SNPs retained by screening were then used to construct a predictive model that classifies a sample as either a severe or a mild case. The model construction methods tested were Support Vector Machine (SVM), GB, and Naive Bayes. Several combinations of a screening method and a model construction method were evaluated, and the results show that the best combination was χ2-SVM with 10 selected SNPs. |
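The χ2 screening step described above can be illustrated with a minimal sketch. This is not the study's implementation. It assumes genotypes are coded as 0, 1, or 2 (copies of the minor allele), and the function names `chi_squared` and `top_k_snps`, the SNP names, and the toy data are hypothetical and invented for illustration only. Each SNP is scored by the Pearson chi-squared statistic of its genotype-by-class contingency table, and the top-ranked SNPs are kept.

```python
from collections import Counter


def chi_squared(genotypes, labels):
    """Pearson chi-squared statistic for one SNP's genotype-by-class table."""
    n = len(labels)
    row_totals = Counter(genotypes)             # samples per genotype
    col_totals = Counter(labels)                # samples per class
    observed = Counter(zip(genotypes, labels))  # samples per (genotype, class) cell
    stat = 0.0
    for genotype, row_n in row_totals.items():
        for cls, col_n in col_totals.items():
            expected = row_n * col_n / n        # cell count expected under independence
            stat += (observed[(genotype, cls)] - expected) ** 2 / expected
    return stat


def top_k_snps(snp_columns, labels, k=10):
    """Return the names of the k SNPs with the highest chi-squared score."""
    scores = {name: chi_squared(col, labels) for name, col in snp_columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (invented): 8 samples, 3 SNPs, genotype coding 0/1/2 is an assumption.
labels = ["severe"] * 4 + ["mild"] * 4
toy_snps = {
    "snp_a": [2, 2, 2, 2, 0, 0, 0, 0],  # genotype separates the two groups perfectly
    "snp_b": [0, 1, 0, 1, 0, 1, 0, 1],  # genotype is unrelated to the group
    "snp_c": [2, 2, 2, 0, 2, 0, 0, 0],  # weak association with the group
}

selected = top_k_snps(toy_snps, labels, k=2)
print(selected)
```

In a pipeline like the one described, the selected SNP columns would then be passed to a classifier such as an SVM, with the screening performed on training data only so that the ranking does not leak information from held-out samples.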