Ovarian Cancer Prediction Using PCA, K-PCA, ICA and Random Forest

Şahin, Asiye; Özcan, Nermin; Nur, Gökhan

doi:10.54856/jiswa.202112168

AKILLI SİSTEMLER VE UYGULAMALARI DERGİSİ
JOURNAL OF INTELLIGENT SYSTEMS WITH APPLICATIONS
J. Intell. Syst. Appl.

E-ISSN: 2667-6893

This work is licensed under a Creative Commons Attribution 4.0 International License.

PCA, K-PCA, ICA ve Random Forest Kullanarak Yumurtalık Kanserinin Tahmini

How to cite: Şahin A, Özcan N, Nur G. Ovarian cancer prediction using PCA, K-PCA, ICA and random forest. Akıllı Sistemler ve Uygulamaları Dergisi (Journal of Intelligent Systems with Applications) 2021; 4(2): 103-108.

Title: Ovarian Cancer Prediction Using PCA, K-PCA, ICA and Random Forest

Abstract: Ovarian cancer, which is common in women and occurs mostly in the post-menopausal period, develops through the uncontrolled proliferation of cells in the ovaries and the formation of tumors. Early diagnosis is very difficult, and in most cases the cancer is already at an advanced stage when first diagnosed. It tends to be treated successfully in the early stages, while it is still confined to the ovary, but it is harder to treat in the advanced stages and is often fatal. For this reason, research has focused on predicting whether a person has ovarian cancer. In our study, we designed a random forest (RF) based ovarian cancer prediction model using a data set of 349 real patients with 49 features, including routine blood tests, general chemistry tests and tumor marker data. Because high-dimensional data increases the time and resources required, we reduced the dimension of the data with the PCA, K-PCA and ICA methods and examined the effect on prediction results and training time. The best result, an F1 score of 0.895, was obtained by training the RF model on the data reduced from 49 to 6 dimensions with PCA; training this model took 18.191 seconds. This result was both better in predictive performance and more economical in training time than the prediction made on the full 49-feature data without any dimension reduction. The study shows that, for predictions made with machine learning models on high-dimensional medical data, dimension reduction methods can improve prediction results while saving time and resources.

Keywords: Dimension reduction; machine learning; ovarian cancer; random forest
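The workflow summarized in the abstract (reduce 49 features to 6 with PCA, K-PCA or ICA, then train a random forest and compare F1 score and training time) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses scikit-learn, a synthetic data set of the same shape (349 samples, 49 features) in place of the patient data, and assumed settings the abstract does not specify (standardization, a 70/30 split, an RBF kernel for K-PCA, 200 trees). Scores and timings will therefore differ from the reported 0.895 and 18.191 seconds.

```python
import time

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FastICA, KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the study's data (349 x 49).
# The real patient data is NOT used here, so results will not match the paper.
X, y = make_classification(
    n_samples=349, n_features=49, n_informative=10, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

N_COMPONENTS = 6  # target dimension reported in the abstract

# Assumed settings: RBF kernel for K-PCA, default FastICA with more iterations.
reducers = {
    "none": None,  # baseline: forest trained on all 49 features
    "PCA": PCA(n_components=N_COMPONENTS, random_state=0),
    "K-PCA": KernelPCA(n_components=N_COMPONENTS, kernel="rbf"),
    "ICA": FastICA(n_components=N_COMPONENTS, random_state=0, max_iter=1000),
}

results = {}
for name, reducer in reducers.items():
    # Scale, optionally reduce the dimension, then classify.
    steps = [StandardScaler()]
    if reducer is not None:
        steps.append(reducer)
    steps.append(RandomForestClassifier(n_estimators=200, random_state=0))
    model = make_pipeline(*steps)

    # Time the full fit (scaling + reduction + forest training).
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start

    results[name] = {
        "f1": f1_score(y_test, model.predict(X_test)),
        "train_seconds": train_seconds,
        "n_features_in_forest": model[-1].n_features_in_,
    }

for name, r in results.items():
    print(f"{name:>5}: F1={r['f1']:.3f}  fit={r['train_seconds']:.3f}s  "
          f"dims={r['n_features_in_forest']}")
```

Fitting the reducer inside the pipeline keeps the projection learned from the training split only, so the held-out F1 score is not influenced by test data. Whether the study used a held-out split or cross-validation is not stated in the abstract.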
