Editorial by Manuel Marfil (on behalf of GIBI230: Research Group on Biomedical Imaging of Instituto de Investigación Sanitaria La Fe, HULAFE)
Grounded in Evidence: Building Trust through Clinical Validation
After more than three years of ProCAncer-I, an EU-funded project, we have come a long and fascinating way and now face a challenging yet crucial task: the final evaluation of the models before their deployment. So far we have witnessed the creation of the cloud infrastructure that enabled the collection of the largest European database of prostate cancer data, the development of AI-based diagnostic support models, and their internal validation with a subset of prospective data. It is now time to evaluate their performance by simulating a realistic environment, using the FUTURE-AI guidelines as a fundamental pillar: a technical and a clinical validation. The technical validation will consist of monitoring the generalization ability of the models with unseen data throughout the remaining project duration, focusing on the principles of Fairness (F), Traceability (T), and Robustness (R). These ensure, respectively, that models do not discriminate against subsets of the population, that their development and methodology are well documented for reproducibility and scientific validity, and that they remain reliable over time and under varying conditions. Once monitoring is complete, this process will allow retraining the models with all repository data and correcting factors that may reduce performance, culminating in the final models ready for deployment.
Simultaneously, the clinical validation of the generated AI models will take place before their deployment. The AI developed in this project aims to integrate into daily clinical practice as a disruptive tool that initially generates a mix of fascination and doubt but eventually becomes indispensable, much like other medical marvels such as blood test analysis, human genome sequencing or multiparametric magnetic resonance imaging (mpMRI). Clinical validation will therefore provide the closest estimation of the interaction between the clinician and the tool.
Analyzing this interaction is critical because, personally, I believe a tool, no matter how good it is considered to be, is useless if no one is willing to use it. This is precisely the challenge we face, well captured by the other three principles of the FUTURE-AI guidelines: Universality (U), Usability (U), and Explainability (E). These ensure that the tools are accessible and applicable internationally, regardless of hospital protocols and methodologies. They also ensure that the tool is as intuitive as possible for clinicians to interact with, and that integration into their work environment is easy to deploy. Last but not least, for these models to be useful and in demand, both the clinician and the patient must trust them. Trust comes from understanding, which is why a fundamental pillar for the model developers is the ability to provide clear explanations of their decisions.
The clinical validation framework will be separated by use cases, mainly classified into classification, detection and segmentation models. The primary goal has been to create a harmonized methodology for all of them, while accommodating certain differences due to their purposes. Regarding the shared aspects, for each use case approximately two dozen clinicians will evaluate the most mature models. To control for the variable of clinical experience, these clinicians will be divided into groups based on their years of experience reporting prostate cases with mpMRI and the estimated number of cases read per year.
Additionally, since multiple clinicians and models exist for each use case, it is necessary to analyze inter-clinician and inter-model variability. Thus, a subset of the cases will be validated by several clinicians, and the outputs of different models will be compared, to assess how these two variables affect the intrinsic variability of data from numerous European centers. This variability stems not only from patient differences tied to geographic location but also from different mpMRI vendors and scanner models, as well as differences in acquisition methodologies.
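For illustration, inter-clinician agreement on binary lesion calls can be quantified with a chance-corrected statistic such as Cohen's kappa. The sketch below uses hypothetical reader labels, not project data, and pairwise kappa is only one of several agreement measures the study could adopt:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two readers' case-level calls."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed proportion of cases where the two readers agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement if the readers labelled independently.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    labels = set(freq_a) | set(freq_b)
    expected = sum((freq_a[l] / n) * (freq_b[l] / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical lesion calls (1 = suspicious) by two readers on 8 cases
reader_1 = [1, 0, 1, 1, 0, 0, 1, 0]
reader_2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(cohens_kappa(reader_1, reader_2), 3))  # prints 0.5
```

The same function applied to the outputs of two models on the same case subset gives an inter-model agreement figure on a comparable scale.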
First, focusing on the classification and detection models, the methodology compares traditional diagnosis with a human-AI symbiosis. Two sessions will be conducted in which the clinician analyzes the same cases, comparing performance and subjective perception in terms of diagnostic time per patient, confidence in the diagnosis, and diagnostic accuracy through metrics such as sensitivity and specificity. The histopathological information will serve as the gold standard for determining whether the clinician was correct. On the input side, diagnosis will rely on a combination of mpMRI sequences (T2W, DWI and ADC) along with clinical data, excluding information that could allow clinicians to "cheat", such as biopsy results, since the goal is to detect lesions mainly from medical images prior to any invasive intervention.
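As a minimal sketch of the accuracy metrics mentioned above, sensitivity and specificity can be computed by comparing case-level calls against the histopathological gold standard. The labels below are hypothetical and serve only to show the calculation:

```python
def sensitivity_specificity(predictions, gold_standard):
    """Diagnostic accuracy against histopathology labels (1 = cancer)."""
    pairs = list(zip(predictions, gold_standard))
    tp = sum(p == 1 and g == 1 for p, g in pairs)  # cancers found
    fn = sum(p == 0 and g == 1 for p, g in pairs)  # cancers missed
    tn = sum(p == 0 and g == 0 for p, g in pairs)  # healthy cleared
    fp = sum(p == 1 and g == 0 for p, g in pairs)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical calls from one reading session vs. histopathology
histopathology  = [1, 1, 1, 0, 0, 0, 1, 0]
clinician_calls = [1, 1, 0, 0, 1, 0, 1, 0]
sens, spec = sensitivity_specificity(clinician_calls, histopathology)
print(sens, spec)  # prints 0.75 0.75
```

Running the same computation on the calls from the AI-assisted session then gives a paired comparison of the two reading conditions.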
Lastly, for segmentation models, the goal differs from the previous use cases: not to compare a clinician with and without AI, but to evaluate the models' ability to generate accurate masks of the prostate from the T2W MRI sequence, compared against manual segmentation by clinicians. The objective is to assess how the generated masks of the region of interest under-segment, over-segment, and overlap with the gold standard, in other words, the manual masks.
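The three quantities above (overlap, under-segmentation, over-segmentation) can be sketched on toy binary masks. The Dice coefficient shown is a standard overlap measure, while normalizing the missed and spurious voxels by the gold-standard volume is an illustrative choice, not the project's fixed protocol:

```python
import numpy as np

def mask_agreement(pred, gold):
    """Overlap and error rates between a model mask and a manual mask."""
    pred, gold = pred.astype(bool), gold.astype(bool)
    intersection = np.logical_and(pred, gold).sum()
    dice = 2 * intersection / (pred.sum() + gold.sum())
    under = np.logical_and(~pred, gold).sum() / gold.sum()  # gold voxels missed
    over = np.logical_and(pred, ~gold).sum() / gold.sum()   # spurious voxels
    return dice, under, over

# Toy 2D masks; real inputs would be 3D voxel masks from T2W volumes
gold = np.zeros((6, 6), dtype=bool); gold[1:5, 1:5] = True  # 16-voxel square
pred = np.zeros((6, 6), dtype=bool); pred[2:6, 1:5] = True  # shifted one row
dice, under, over = mask_agreement(pred, gold)
print(dice, under, over)  # prints 0.75 0.25 0.25
```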
In conclusion, the next step of this project is to build enthusiasm not only among consortium participants but also among participating clinicians who recognize the potential these models could have in their daily practice. Above any other purpose of the project, however, stands the patient suffering from the physical and psychological effects of cancer, for whom these new technologies could not only improve quality of life but also increase the chances of overcoming the disease with fewer side effects.