Description: |
Our null hypothesis was that a computer algorithm would not predict breast cancer patients' 10-year survival with greater accuracy than the 64.3% baseline of the Surveillance, Epidemiology, and End Results (SEER) database [3]. The aims of this study were to (1) build an infrastructure to convert SEER data into a machine-readable format; (2) train machine learning (ML) algorithms to predict breast cancer patients' 10-year survival; and (3) measure the predictive accuracy of the ML algorithms. We downloaded the clinical and demographic characteristics of 657,711 breast cancer patients from the SEER database and converted them into machine-readable feature vectors. An oncologist generated a list of candidate variables for the ML algorithms. We trained the Logistic Regression (LR), Naive Bayes, and C4.5 Decision Tree algorithms of the WEKA machine learning package on the data using ten-fold cross-validation. LR, Naive Bayes, and C4.5 Decision Tree achieved accuracies of 76.29%, 59.71%, and 77.43%, respectively. We also compared the results of the LR algorithm with those of a well-known website, Adjuvant! Online. The results rejected the null hypothesis for LR and C4.5 Decision Tree, but failed to reject it for Naive Bayes. Of the algorithms tested, C4.5 was the most accurate predictor of 10-year patient survival. In addition, LR provided more accurate predictions than Adjuvant! Online, without that tool's limitations. |
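
The evaluation procedure described above (ten-fold cross-validation, followed by a comparison against the 64.3% baseline) can be illustrated with a minimal, standard-library Python sketch. This is not the study's WEKA pipeline: the function names, the majority-class stand-in classifier, and the plain threshold comparison are all assumptions made for illustration. The abstract does not say which statistical test, if any, was used to reject the null hypothesis. Only the accuracy figures in `REPORTED` come from the text.

```python
import random
from collections import Counter

BASELINE = 0.643  # SEER 10-year survival baseline quoted in the abstract
REPORTED = {"LR": 0.7629, "Naive Bayes": 0.5971, "C4.5": 0.7743}  # from the abstract


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle range(n_samples) and deal it into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_majority_class(train_features, train_labels):
    """Hypothetical stand-in classifier: always predicts the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda _features: most_common


def cross_validated_accuracy(features, labels, train_fn, k=10, seed=0):
    """Pooled held-out accuracy: each sample is predicted by a model that never saw it."""
    correct = 0
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predict = train_fn([features[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct += sum(predict(features[i]) == labels[i] for i in held_out)
    return correct / len(labels)


def beats_baseline(accuracy, baseline=BASELINE):
    """Simple threshold comparison (an assumption, not a significance test)."""
    return accuracy > baseline


if __name__ == "__main__":
    for name, accuracy in REPORTED.items():
        verdict = "exceeds" if beats_baseline(accuracy) else "does not exceed"
        print(f"{name}: {accuracy:.2%} {verdict} the {BASELINE:.1%} baseline")
```

Any classifier can be substituted for `train_majority_class` as long as it takes training features and labels and returns a prediction function.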