Machine Learning Techniques in Prostate Cancer Diagnosis According to Prostate-Specific Antigen Levels and Prostate Cancer Gene 3 Score
Abstract
Purpose
To explore the role of artificial intelligence and machine learning (ML) techniques in oncological urology. In recent years, our group investigated the predictive role of the prostate cancer gene 3 (PCA3) score, prostate-specific antigen (PSA), and free-PSA for prostate cancer (PCa), using classical binary logistic regression (LR) modeling. In this research, we approached the same clinical problem with several different ML algorithms, to evaluate their performance and feasibility in a real-world evidence PCa detection trial.
Materials and Methods
The occurrence of a positive biopsy was studied in a large cohort of 1,246 Italian men undergoing first or repeat biopsy. Seven supervised ML algorithms were selected to build biomarker-based predictive models: generalized linear model, gradient boosting machine, eXtreme gradient boosting machine (XGBoost), distributed random forest/extremely randomized forest, multilayer artificial Deep Neural Network, naïve Bayes classifier, and an automatic ML ensemble function.
Results
All the ML models showed better performance in terms of area under the curve (AUC) and accuracy when compared with the LR model. Among them, an XGBoost model tuned by the AutoML function reached the best metrics (AUC, 0.830), clearly surpassing the LR results (AUC, 0.738). In the variable importance ranking from this XGBoost model (accuracy, 0.824), the importance of the PCA3 score was 3-fold and 4-fold larger than that of free-PSA and PSA, respectively.
Conclusions
The ML approach proved to be feasible and able to achieve good predictive performances with reproducible results: it may thus be recommended, when applied to PCa prediction based on biomarkers fluctuations.
INTRODUCTION
Screening using prostate-specific antigen (PSA) is characterized by low specificity for prostate cancer (PCa), since elevated PSA may be due to benign conditions, especially within a PSA range of 4–10 ng/mL. Only one fourth of men with PCa suspicion go on to have a positive biopsy.1 Prostate cancer gene 3 (PCA3), first described by Bussemakers et al.2 in 1999, is a noncoding, prostate-specific mRNA highly overexpressed in 95% of PCa cells, with a median 66-fold upregulation compared with adjacent nonneoplastic prostatic cells. As the name implies, PCA3 is specific for PCa and is expressed only in this disease. Since 2012, PCA3 has been approved as an auxiliary biomarker in the molecular diagnosis of PCa in the European Union, Canada, and the United States. Many studies have investigated the diagnostic value of urine PCA3 in PCa, but results regarding its applicability in a clinical setting (first and/or repeat biopsy) have been inconclusive.3
Recently, in the era of big data, artificial intelligence technology and machine learning (ML) techniques have been applied to analyze large amounts of data in the medical field, and their adequacy and usefulness in diagnosis are increasing.4–6 This approach is playing an emerging role in urology as well: pattern recognition and classification, confounder discrimination, identification of new cancer markers and computer-assisted diagnosis, image processing and radiomics, computational biology, validation of new surgical techniques, and the bridging of clinical data with histopathological and genetic/genomic data to build a data warehouse.7,8 When dealing with PCa, ML can assist several procedures, from capsule segmentation, fusion-targeted biopsy, and robotic-assisted surgical systems to digital pathology and automatic diagnostics.9–11
In recent years, one of our main interests has been the predictive role of the PCA3 score for PCa when combined with classical risk factors such as PSA and free-PSA (%fPSA). Our first study investigated this biomarker in a large real-world cohort of Italian men undergoing first or repeat biopsy for PCa. At that time, we used logistic regression (LR) modeling to predict the PCa detection rate at different PCA3 score values.12 To date, no studies have evaluated the role of age, PSA, and PCA3 score in PCa prediction using ML methods. The aim of the current research is to improve on our past results with a modern approach, using several supervised ML algorithms to build biomarker-based predictive models for PCa diagnosis.
MATERIALS AND METHODS
The original study took place in 3 Italian institutions (San Luigi Gonzaga Hospital Orbassano, Gradenigo Hospital Torino, and San Raffaele Hospital Milano) and recruited 3,571 men who consecutively underwent PCA3 testing between October 2008 and December 2010. A total of 3,446 urine samples (96.5%) had adequate levels of PCA3 and PSA mRNAs to calculate the PCA3 score. All patients (n=1,246, 36.1%) who underwent ≥1 biopsy after PCA3 assessment as of December 31, 2010, were enrolled. Seven hundred and thirty-one subjects had their first biopsy due to a serum PSA ≥2.5 ng/mL after urinary tract infections and/or inflammation had been ruled out by clinical history, urine cultures, and digital rectal examination (DRE); the remaining 515 had 1 or 2 previous negative biopsies and underwent repeat biopsy due to persistently elevated PSA. The current study is a reanalysis of the original cohort dataset by ML techniques, without any impact on either the patients' clinical history or future treatment decisions. Due to the retrospective observational nature of this research and according to Italian law (Agenzia Italiana del Farmaco-AIFA, Guidelines for observational studies, March 20, 2008), no formal approval from the local Institutional Review Board/Independent Ethics Committee was needed.
1. Statistical Analysis
At first, the determinants of a positive biopsy (dependent variable, target) were estimated by a multivariate binary LR model. Eight predictors (independent variables, features) were tested as PCa risk factors: 4 continuous (age, PSA, %fPSA, and PCA3 score) and 4 categorical (family history for PCa, DRE, high-grade prostate intraepithelial neoplasia [HG-PIN]). The continuous variables were reported as median (interquartile range [IQR]) and the categorical ones as absolute/relative frequencies. Two different inferential tests were applied, the Mann-Whitney test and the Fisher exact test, for continuous and categorical covariates, respectively. All reported p-values were obtained by the 2-sided exact method at the conventional 5% significance level. Data were analyzed as of February 2021 using R 4.0.5 with the H2O package version 3.32.1.1 (R Foundation for Statistical Computing, Vienna, Austria).13
2. Development and Validation of ML Models
As a second step, 6 different supervised ML algorithms for binomial classification were trained and cross-validated for target prediction (biopsy result for PCa) using the same 8 features; these estimation processes were performed with H2O for R, an open-source distributed ML platform.14 The ML algorithms were generalized linear model (GLM), gradient boosting machine (GBM), eXtreme Gradient Boosting machine (XGBoost), distributed random forest (DRF)/eXtremely randomized forest (XRT), multilayer artificial Deep Neural Network (DNN), and naïve Bayes classifier (NB).15–20 Moreover, the modeling process was also performed with H2O AutoML, an automatic supervised ML ensemble function that sequentially trains, cross-validates, and tunes an ordered series of ML models, ranking them by performance metrics: 3 XGBoosts, a fixed grid of GLMs, 1 DRF, 5 GBMs, 1 DNN, 1 XRT, a random grid of XGBoosts, and 2 Stacked Ensemble models, the former containing all the models, the latter only the best ones from each algorithm class.21 The automatic ML algorithm searches for the optimal combination of a collection of prediction algorithms by stacking together various classifiers; it is considered among the newest frontiers of ML, and its predictions often rival those obtained by manual tuning of ML hyperparameters.
For all the models, the target was balanced in the training and test data via resampling (either oversampling the minority class or undersampling the majority one), and missing values were replaced by the Multiple Imputation by Chained Equations procedure.22 Sample size is critical in ML modeling: so as not to lose statistical power, and unlike in our 2012 study, we investigated the whole cohort of 1,246 patients (instances), regardless of whether they underwent first or repeat biopsy. The original dataset was randomly split into an 80% training frame and a 20% test frame. To decrease the risk of model overfitting, 5-fold cross-validation was used to compare the classifiers and produce a single estimate: the training frame was split into 5 folds, using 4 of them for training and 1 for cross-validation, and repeating the process 5 times so that each fold was used once as the held-out fold. Model performances were evaluated on the test set, and the whole training/cross-validation/test procedure was repeated 20 times for estimation stability, each time using a different training/test partition. The best prediction performance was identified by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The ROC curve shows the trade-off between the false positive rate and the true positive rate, and its AUC allows learning algorithms for binary classification to be compared better than accuracy does. The AUC represents the likelihood that a positive case (patient with positive biopsy) is ranked higher than a negative one (patient with negative biopsy), considering all possible thresholds: the higher the AUC, the better the PCa detection performance.23 Conversely, accuracy measures, for a given threshold (0.5 by default), the percentage of correctly classified cases, regardless of which class (negative or positive biopsy) they belong to.
In binary classification, AUC is thus used to evaluate how well a model distinguishes positive from negative cases, while accuracy estimates the number of correct predictions as a proportion of all predictions.
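The difference between the two metrics can be made concrete with a minimal sketch. The labels and scores below are invented toy values, not study data, and the function names are hypothetical. The AUC is computed directly from its ranking definition (the fraction of positive-negative pairs in which the positive case receives the higher score, with ties counting one half) rather than with H2O.

```python
def auc_by_ranking(labels, scores):
    """AUC as the probability that a random positive case outranks a random negative one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count one half
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold=0.5):
    """Fraction of cases classified correctly at a single fixed threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predicted) if y == p)
    return correct / len(labels)


# Toy example (assumed values): 1 = positive biopsy, 0 = negative biopsy
labels = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.10, 0.20, 0.30, 0.45, 0.40, 0.60, 0.55, 0.90]

auc = auc_by_ranking(labels, scores)   # threshold-free ranking quality
acc = accuracy_at(labels, scores)      # depends on the 0.5 threshold
```

In this toy case the AUC does not change if the threshold is moved, whereas the accuracy does, which is why the two figures need not agree.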
RESULTS
The main patient characteristics are reported in Table 1. Seven hundred fifty cases had complete data, while the other 496 required imputation of missing values, mostly for %fPSA; the training frame comprised 996 cases and the test frame 250.
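The 996/250 partition is what an 80/20 split of 1,246 instances yields. The sketch below illustrates such a split followed by 5-fold cross-validation on the training part. It is a generic illustration in plain Python, not the H2O routine used in the study; the function names and the seed are assumptions.

```python
import random


def split_train_test(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and cut them into a training and a test part."""
    rng = random.Random(seed)          # a different seed gives a different partition
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = int(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


def k_fold_indices(train_indices, k=5):
    """Yield (training, held-out) index lists; each fold is held out exactly once."""
    folds = [train_indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out


train_idx, test_idx = split_train_test(1246)
fold_sizes = [(len(tr), len(ho)) for tr, ho in k_fold_indices(train_idx)]
```

Repeating the whole procedure with 20 different seeds would mirror the 20 modeling runs described in the Methods.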
Among the 1,246 participants, whose median (IQR) age was 67 years (61–72 years), a positive biopsy was found in 325 (26.1%). When comparing the 2 subcohorts (negative vs. positive biopsy), PSA, %fPSA, and PCA3 score were significantly different, with median values of 6.5 ng/mL versus 7.4 ng/mL, 16 versus 13, and 35 versus 63, respectively (p<0.001 for every comparison). Likewise, age and DRE had a different distribution between the 2 subcohorts, while a family history of PCa and the occurrence of HG-PIN were not associated with a higher risk of positive biopsy.
In the multivariate binary LR model with all 8 features, the main risk factors for PCa occurrence were PSA (odds ratio [OR], 1.07), %fPSA (OR, 0.94), and PCA3 score (OR, 1.01) (p<0.001 for every biomarker, Table 2). Using AUC as the measure of model performance for PCa detection, the logistic model reached 0.738: this is our reference for the ML models.
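An OR close to 1 can still correspond to a sizeable effect over a realistic range of a continuous biomarker. The sketch below assumes that each OR refers to a one-unit increase of the predictor, as is usual for continuous covariates in LR; the text does not state the units explicitly. The function name is hypothetical.

```python
import math

# Odds ratios reported for the LR model (assumed here to be per one-unit increase)
odds_ratios = {"PSA": 1.07, "%fPSA": 0.94, "PCA3 score": 1.01}


def odds_ratio_for_change(or_per_unit, delta):
    """Odds ratio for a change of `delta` units: exp(delta * ln(OR)) = OR ** delta."""
    return math.exp(math.log(or_per_unit) * delta)


# Illustrative changes (the deltas are arbitrary examples, not study cutoffs)
or_pca3_50 = odds_ratio_for_change(odds_ratios["PCA3 score"], 50)  # +50 PCA3 points
or_fpsa_10 = odds_ratio_for_change(odds_ratios["%fPSA"], 10)       # +10 %fPSA points
```

Under this assumption, a 50-point increase in PCA3 score multiplies the odds of a positive biopsy by roughly 1.6, and a 10-point increase in %fPSA roughly halves them, other covariates being held fixed.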
Table 3 shows the median (IQR) and the best AUC, as well as the accuracy, obtained by all the ML classifiers on the test set (best AUC stands for the highest among the 20 modeling runs). Notably, all 6 algorithms outperformed the multivariate logistic model, with best AUCs ranging from 0.772 to 0.808 and accuracies from 0.769 to 0.824, always based on the same set of 8 features.
The AutoML function performed better still: the top model was an XGBoost, with AUC 0.830 and accuracy 0.824: 197 of 250 biopsies were correctly classified, with a global error rate of 21.2%, while the marginal error was 17.8% for the 180 of 250 negative biopsies and 30.0% for the 70 of 250 positive ones.
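The error rates above can be cross-checked with simple arithmetic. In the sketch below, the class sizes (180 and 70) and the 197 correct classifications come from the text; the per-class error counts (32 and 21) are not stated and are inferred here as the only integers consistent with the reported 17.8% and 30.0%.

```python
# Class sizes in the test frame (from the text)
n_negative, n_positive = 180, 70

# Misclassified cases per class: inferred from the reported percentages, not stated in the text
errors_negative, errors_positive = 32, 21

total = n_negative + n_positive
total_errors = errors_negative + errors_positive
correct = total - total_errors

marginal_error_negative = errors_negative / n_negative   # error among negative biopsies
marginal_error_positive = errors_positive / n_positive   # error among positive biopsies
global_error = total_errors / total                      # error over all 250 biopsies
fraction_correct = correct / total
```

The fraction correct at this operating point (197/250) is the complement of the global error rate. The text does not specify the threshold behind each reported figure, so this fraction need not coincide with the accuracy value quoted for the model.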
The graphics help to perform an explanatory model analysis for the AutoML models. Fig. 1 reports the model ranking for the best AutoML run: XGBoost is the top classifier and 4 different XGBoost models appear among the top 10. Fig. 2 reports a heatmap with the frequencies of identical predictions; those of most AutoML models are quite correlated (especially XGBoosts and GBMs), while those of XRT/DRFs are not. Fig. 3 represents the variable importance across all AutoML models, after scaling between 0 and 1: the contribution of PCA3 score is clearly prevalent, especially in XGBoosts and GBMs. In Fig. 4, the scaled variable importance for the top XGBoost model is plotted: the PCA3 score clearly leads this ranking, being the most critical feature for a positive biopsy. Finally, Figs. 5–7 show 3 partial dependence plots for PCA3 score, %fPSA, and PSA, respectively: the x-axis reports the biomarker values, while the y-axis represents the likelihood of a positive biopsy. The marginal effect that these features exert on the target follows different patterns: for %fPSA and PSA, the probability of PCa changes progressively without any cutoff. For PCA3 score instead, the risk follows a bimodal distribution: it jumps from around 25% to 60% when the PCA3 score increases from around 80 to 120.
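A partial dependence curve of the kind described above is obtained by fixing one feature at a grid value for every patient, averaging the model predictions, and repeating for each grid value. The sketch below shows the computation in plain Python. The scoring function is a hypothetical stand-in built only to mimic a step-like PCA3 effect; it is not the fitted XGBoost model, and the patient rows are invented.

```python
def partial_dependence(model, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value for all rows."""
    curve = []
    for value in grid:
        predictions = [model({**row, feature: value}) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve


def toy_model(row):
    """Hypothetical risk score: a step in PCA3 plus a small linear %fPSA effect."""
    risk = 0.25 if row["pca3"] < 100 else 0.60
    risk -= 0.005 * (row["fpsa"] - 15)
    return min(max(risk, 0.0), 1.0)


# Invented rows, for illustration only
rows = [
    {"pca3": 20, "fpsa": 10},
    {"pca3": 60, "fpsa": 15},
    {"pca3": 150, "fpsa": 20},
]

pd_curve = partial_dependence(toy_model, rows, "pca3", [40, 80, 120, 160])
```

Plotting the grid values against the averaged predictions gives the curve; with this toy scoring rule the average risk rises from about 0.25 to about 0.60 between the grid points 80 and 120.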
Of note, individual interpretations can also be made for any single patient, e.g., for the 2 patients with the lowest and highest probability of positive biopsy, by estimating the Break Down profile, which shows the contribution of every feature to the target prediction (always working with the top AutoML XGBoost model). Even though patient 130 had a negative biopsy, his probability of PCa was very high (91.0%), mostly due to his extreme PCA3 score: the ML model thus underlines the potential risk that this man had a false negative result at biopsy (Table 4).
DISCUSSION
The management of PCa poses difficult challenges, mainly due to the lack of ideal tools to predict its occurrence. At present, up to two-thirds of patients undergoing a systematic prostate biopsy have a negative histological finding, because the decision to perform biopsy depends on PSA, a sensitive but highly unspecific biomarker. To avoid unnecessary prostate biopsies, multiparametric magnetic resonance imaging (mpMRI) and serum/urine biomarkers such as free-PSA, total/free-PSA ratio and 4Kscore, prostate health index (PHI), and PCA3 score may be used to diagnose PCa.1–3,24
In a recent systematic review and meta-analysis, Muñoz Rodrìguez and Perdomo25 documented an overall sensitivity of 69% and specificity of 65%, at a PCA3 score cutoff of 35. Additionally, the PCa occurrence OR was 4.24 (95% confidence interval, 3.49–5.17), and the area under the curve 0.734.
In our previous study, we investigated the role of PCA3 score on a large real-world cohort of Italian men undergoing first or repeat biopsy for PCa.12 At that time, we used the LR model to predict PCa detection rate at different PCA3 score values. We confirmed the usefulness of the PCA3 score determination among men who had a previous negative biopsy and an elevated PSA level. In this subgroup, a sensitivity of 73.2% and a specificity of 75.5% were documented using a cutoff of 39, its median value.
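Sensitivity and specificity at a given cutoff follow directly from the cross-tabulation of the dichotomized score against the biopsy result. The sketch below uses invented scores and outcomes, not study data, and assumes that a score at or above the cutoff counts as test-positive; the text does not state that convention.

```python
def sensitivity_specificity(scores, biopsy_positive, cutoff):
    """Return (sensitivity, specificity) when scores >= cutoff are called positive."""
    tp = fn = tn = fp = 0
    for score, positive in zip(scores, biopsy_positive):
        test_positive = score >= cutoff   # assumed convention
        if positive and test_positive:
            tp += 1
        elif positive:
            fn += 1
        elif test_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Invented PCA3 scores and biopsy outcomes, for illustration only
pca3_scores = [12, 25, 38, 41, 55, 70, 90, 30, 45, 110]
biopsy_positive = [False, False, False, True, False, True, True, True, False, True]

sens, spec = sensitivity_specificity(pca3_scores, biopsy_positive, cutoff=39)
```

Raising the cutoff generally trades sensitivity for specificity, which is why the choice of cutoff (35 in the meta-analysis, 39 in the previous study) affects both figures.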
Takeuchi et al. reported that the prediction rate for PCa improved by about 5%–10% when using a multilayer artificial neural network compared to classical LR in 334 patients who underwent 3.0T mpMRI before transrectal ultrasound-guided prostate biopsy.26 The authors applied various ML algorithms to calculate the PCa prediction rate, comparing the LR and ML approaches. In patients aged >75 years and with a PSA level of 2.5–10 ng/mL, LR analysis showed the best prediction rate (74.6%); however, ML methods performed better than LR in other patient groups. In particular, the Random Forest model (prediction rate, 70.5%–72.4%) outperformed LR (prediction rate, 65.6%–70.0%) in patients aged 65–74 years.
It is worth noting that our retrospective case series included patients from 2008 to 2010. At that time, the use of mpMRI as a triage test was not yet widespread in Europe, especially in Italy. The first experiences with the use of mpMRI before surgery were published in 2011,27 whilst the use of mpMRI as guidance for targeted biopsies was developed in Italy from 2017.28
Nevertheless, in the current PCa diagnostic scenario, MRI cannot yet be used for screening purposes, being time-consuming and not homogeneously available, as well as requiring experienced radiologists. A possible solution could be the development of an ML-based clinical decision support system based on biomarkers, in this case PCA3 score and PSA, to stratify patients according to their risk of PCa progression, properly selecting candidates for mpMRI and thus reducing unnecessary prostate biopsies.
The present study was planned to investigate ML modeling performance in the prediction of PCa, to confirm the role of PCA3 score as a key determinant of positive biopsy, and to test the feasibility and reproducibility of ML techniques in a real-world urological context.
Each of the ML models, GLM (AUC, 0.793), GBM (AUC, 0.787), XGBoost (AUC, 0.772), DRF (AUC, 0.808), DNN (AUC, 0.776) and NB (AUC, 0.774), outperformed the classical LR model (AUC, 0.738). Notably, the top model was obtained with the AutoML function (AutoML XGBoost; AUC, 0.830). This represents an interesting improvement in PCa detection compared with our previous results: biomarker-driven PCa prediction appears to be feasible with an ML approach.
As for PCA3 score, all ML algorithms recognized it as a main predictor of positive biopsy; it would be of value to test by ML how this biomarker performs when associated either with other biomarkers (4Kscore, PHI, PSA density) or with mpMRI, histopathological, and genetic data, and in specific subcohorts such as subjects with “gray zone” PSA.
As for the feasibility of ML techniques, the H2O ML platform is freely available for both R and Python, which are among the most widely used open-source programming languages. Furthermore, a recent ML addition such as explanatory model analysis, here presented for the best AutoML model (XGBoost), provides a set of graphics that can remarkably help the reader understand what happens inside the “black box” model.
Among the methodological limitations of this research, the first is the bias inherent in the retrospective observational design, which could be overcome by a randomized controlled trial. Second, external validation is lacking: since the development cohort was retrieved from only 3 institutions, the risk of selection bias regarding patient population or biopsy indications should be addressed by external validation. Third, all ML models are affected by the sample size: in artificial intelligence investigations, instances are never excessive and no sample size can really be defined as adequate. Finally, when analyzing a 10-year-old cohort today, it is important to remember that diagnosis and therapy for PCa have changed considerably. In particular, 10 years ago mpMRI was not widespread in Italy and its use was reserved for a minority of cases and only in a repeat biopsy setting. At that time, in all studies on urine or serum biomarkers, the gold standard for PCa detection was the pathological examination of multiple nontargeted systematic transrectal ultrasound (TRUS)-guided prostate biopsies. Intrinsically, this approach implies that a cancer predicted by the biomarker but not found at biopsy could not be counted as a cancer missed by the biopsy.
CONCLUSIONS
In our experience, the ML approach may be recommended, when applied to PCa prediction based on biomarkers fluctuations. It proved to be feasible and had better performances with reproducible results in terms of AUC and accuracy, when compared to the LR model.
Notes
The authors claim no conflicts of interest.