Medical devices containing a machine learning component, e.g. for diagnosis, have huge market and health potential. But approval to sell a medical device comes only after a thorough certification process. The particularities of statistical learning therefore need to be taken into account as early as possible in the device design; otherwise, the risk of getting trapped in the certification process with suboptimal, viability-killing components is just too high. In this post we describe good practices for machine learning development and operationalization within medical devices, drawn from our experience at PickleTech.
Medical devices that include machine learning components have the potential to improve our health systems. They work with multiple types of data: physiological measurements, blood and other fluid samples, all types of medical images, or voice recordings, to mention some examples. These devices can assist medical practitioners in diagnosis tasks, providing earlier warning signals and triggering faster medical interventions. Some examples in emergency medicine can be found in [1]. But for these to work, good practices for machine learning development adapted to medical devices must be in place.
Compliance with medical device regulations requires meeting specific standards of performance, quality, safety, and efficacy, to ensure the benefits and safety of the device for the end patients. The specific details of the process change depending on the geography (e.g. FDA vs. CE marking) or on the purpose of the device (e.g. medical devices vs. in vitro diagnostic devices).
Without going into the requirement details [2]-[6], certification involves an exhaustive process of research, development, documentation, and validation, including either a clinical trial or at least a clinical evaluation of the device. For this, some of the device characteristics need to be fixed, and well under control, before proceeding with further steps. In contrast, machine learning systems are designed to be dynamic: reactive to increasing or changing data, improving through feedback processes. This adaptability is at the essence of statistical learning.
How do we merge both worlds and develop optimal machine learning models for a medical device that gets certified?
Model Pipeline & Selection Schema
Machine learning solution development usually starts by assessing which existing models have been successful for similar problems in the past. When specific algorithms receive inflated publicity, it is very easy to constrain development to a few models and push for overly specific machine learning constructions. Instead, we need to review which models have been successful in the scientific literature, covering not only the algorithmic side but the whole model pipeline. This includes checking whether the conditions and limitations in the publications are compatible with our medical device context, and it necessarily involves assessing data availability issues. A specific model is not as important as the complete model pipeline.
Domain knowledge for Feature Engineering
Beyond the machine learning algorithm itself, a model pipeline includes feature engineering and a model selection framework. For medical device solutions, the feature engineering process must include the informed input of medical domain experts. It is also very important that the solution includes, from the beginning, the capability to build, filter, and select relevant features as increasing data allows more variables to be successfully ingested by the algorithms.
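As an illustration, here is a minimal sketch of that capability in a scikit-learn setup: a filtering step lives inside the pipeline itself, so the selected feature set can be re-tuned as data accumulates. The public data set, the estimator, and the number of retained features are placeholders for illustration, not a fixed recommendation.

```python
# A minimal sketch of feature filtering embedded in the model pipeline, so the
# selected feature set can evolve as data grows. The estimator and the k
# parameter are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for device data

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # Keep the k most informative features; k can be re-tuned as data accumulates.
    ("select", SelectKBest(score_func=mutual_info_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)
kept = pipeline.named_steps["select"].get_support(indices=True)
print(f"selected feature indices: {kept}")
```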
In the early stages of medical device development, data may be very limited compared to what is expected for the upcoming real-world operationalization of the device. Under these conditions, different models may be adequate for different data size scenarios. A key recommendation for medical device development is to implement a state-of-the-art model selection framework. This framework protects the device against overfitting [8], makes sure the best model pipeline is deployed, and provides an unbiased estimate of the model pipeline's out-of-sample performance and uncertainty. In our experience at PickleTech, nested cross-validation frameworks are often a very good choice [9].
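As a sketch of what this looks like in practice, the snippet below follows the structure of the scikit-learn example in [9]: an inner cross-validation loop selects hyperparameters, while an outer loop provides the unbiased out-of-sample estimate. The data set, pipeline steps, and hyperparameter grid are illustrative placeholders, not our actual device setup.

```python
# A minimal sketch of nested cross-validation for model pipeline selection,
# following the scikit-learn example in [9]. All concrete choices here are
# illustrative placeholders.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for device data

# The full model pipeline: feature scaling + classifier.
pipeline = Pipeline([("scale", StandardScaler()),
                     ("clf", LogisticRegression(max_iter=1000))])
param_grid = {"clf__C": [0.01, 0.1, 1, 10]}

# Inner loop: hyperparameter selection. Outer loop: unbiased performance estimate.
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="roc_auc")
scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")

print(f"Out-of-sample AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```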
Domain knowledge for Benchmarking
The model selection framework should choose the best model pipeline from a pool of multiple candidates, some of them adapted to the different and changing characteristics of the problem the medical device aims to solve. Within that pool, make sure to always include benchmarks well established by the community. These benchmarks, together with pre-defined improvement criteria, are fundamental to assess when a model selection is actually justified. To define suitable performance benchmarks for a medical device, close collaboration with medical domain experts is again necessary.
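A minimal sketch of that comparison logic, assuming a scikit-learn setup: the benchmark model, the candidate, and the pre-defined improvement margin are illustrative placeholders that, in a real project, would be agreed with the medical domain experts.

```python
# A minimal sketch of benchmarking inside the selection framework: a candidate
# replaces the established benchmark only if it beats it by a pre-defined
# margin. Models, data set, and margin are illustrative assumptions.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for device data
cv = KFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    # Established community benchmark (illustrative): regularized logistic regression.
    "benchmark_logreg": make_pipeline(StandardScaler(),
                                      LogisticRegression(max_iter=1000)),
    "candidate_rf": RandomForestClassifier(random_state=0),
}

MIN_IMPROVEMENT = 0.02  # pre-defined AUC margin required to justify selection

results = {name: cross_val_score(est, X, y, cv=cv, scoring="roc_auc").mean()
           for name, est in candidates.items()}
improvement = results["candidate_rf"] - results["benchmark_logreg"]
print(f"improvement over benchmark: {improvement:.3f} "
      f"(selection justified: {improvement >= MIN_IMPROVEMENT})")
```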
Avoid freezing the model pipeline too much too soon
In clinical scenarios, data may be very scarce at the beginning of the device development. If the development plan then lets the requirements of the certification process force a model freeze too soon, the future success of the medical device is at risk. Understanding the expected performance evolution of the machine learning model with increasing data availability is necessary from the medical device inception stage.
Make sure the device development plan avoids freezing the model architecture too early. Getting stuck with a specific algorithm or feature engineering process too soon may mean ending up in a very expensive clinical trial that tells you nothing conclusive about what the real potential of your model pipeline and medical device could have been if properly implemented. It will simply validate the performance of a flawed, underperforming model pipeline.

Monitor & Adapt
Even after ensuring you follow best practices on model development and validation, context aspects may change and affect the actual performance of the medical device model in operation. Two common challenges in the operationalization of machine learning are data drift and concept drift. Data drift refers to changes in the operational data distribution with respect to the training data distribution; concept drift refers to changes in the relation between the input features and the model output behavior [10].
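As a minimal sketch of data drift detection, the snippet below applies a two-sample Kolmogorov-Smirnov test per feature, comparing the training data against a window of operational data. The synthetic data and the alert threshold are illustrative assumptions; production-grade monitoring typically combines several such statistics [10].

```python
# A minimal sketch of data drift detection via per-feature two-sample KS tests.
# The synthetic data simulates a mean shift in operational data; the alert
# threshold is an illustrative assumption.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))  # training distribution
X_live = rng.normal(loc=0.4, scale=1.0, size=(200, 3))    # shifted operational data

ALERT_P_VALUE = 0.01  # assumed significance threshold for a drift alert

for j in range(X_train.shape[1]):
    stat, p = ks_2samp(X_train[:, j], X_live[:, j])
    if p < ALERT_P_VALUE:
        print(f"feature {j}: possible data drift (KS={stat:.3f}, p={p:.2e})")
```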
Operating a machine learning model necessarily implies monitoring
No real-world model in deployment can easily survive without a monitoring capability. Lack of model monitoring leads to unexpected consequences already in the short term, impacting users and solution providers through actual performance setbacks. This is at the root of why algorithm audits are fundamental [11], and it is the reason behind the increasing attention machine learning model operationalization (MLOps) is receiving [12].
There are several automated techniques to continuously monitor for data and concept drift. Although specific implementations are out of the scope of this post, developing a medical device requires planning adequate monitoring of the actual model performance. From the early stages of development, we need to account for how the model is going to be monitored and, more importantly, how the model is going to adapt to changes in performance: in MLOps language, how improvements are going to be orchestrated.
Plan how to maintain performance
Model adaptations may include retraining the machine learning model pipeline when certain performance triggers fire. Adaptations may also involve limiting device usage to specific data conditions. Or we may plan to add human interventions, either to correct data and concept drift issues through extended feedback and labeling phases, or to take full control of the device when extreme conditions require it.
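A minimal sketch of such a performance trigger follows. The rolling window size, the accuracy floor, and the trigger_retraining hook are illustrative assumptions; in a certified device, each of them would be fixed and documented in the quality management system.

```python
# A minimal sketch of a performance trigger for retraining, evaluated over a
# rolling window of labeled operational cases. All thresholds and the
# trigger_retraining hook are illustrative assumptions.
from collections import deque

WINDOW = 200          # number of recent labeled cases to evaluate on
MIN_ACCURACY = 0.85   # assumed acceptable performance floor

recent_outcomes = deque(maxlen=WINDOW)  # 1 if prediction matched label, else 0

def record_case(prediction, label):
    """Register one labeled operational case and check the trigger."""
    recent_outcomes.append(int(prediction == label))
    if len(recent_outcomes) == WINDOW:
        accuracy = sum(recent_outcomes) / WINDOW
        if accuracy < MIN_ACCURACY:
            trigger_retraining(accuracy)

def trigger_retraining(accuracy):
    # Hypothetical hook: hand over to the MLOps orchestration layer, which may
    # retrain the pipeline, restrict device usage, or escalate to a human.
    print(f"performance trigger fired: rolling accuracy {accuracy:.2f}")
```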
Regardless of the specific details behind the monitoring and adaptation module, a quality management system is one of the requirements to get the medical device certified [13]. Getting the system right is difficult if implemented ad hoc during the device development. Instead, we suggest embedding the monitoring and adaptation module as just another dynamic step in the model pipeline, already from the earliest operationalization design tasks.
Include Interpretability when monitoring
Something too often missed in a monitoring module is tracking feature importance and interpretability. Proficient machine learning medical devices involve interpretability aspects to increase and facilitate the trust and confidence of medical domain experts in the use and impact of the device. While this is part of the model development, it is also something to track when monitoring the device. Changes in feature usage and in the interpretability of the model behavior enable us to detect and explain model performance variations.
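As a sketch of interpretability monitoring, the snippet below recomputes permutation feature importance on recent operational data and flags features whose importance shifted with respect to the values recorded at validation time. The data split, the model, and the shift threshold are illustrative assumptions.

```python
# A minimal sketch of interpretability monitoring: permutation feature
# importance recomputed on recent operational data and compared against a
# validation-time baseline. Thresholds and data are illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # stand-in for device data
X_ref, X_live, y_ref, y_live = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=1000)).fit(X_ref, y_ref)

baseline = permutation_importance(model, X_ref, y_ref, n_repeats=10,
                                  random_state=0).importances_mean
current = permutation_importance(model, X_live, y_live, n_repeats=10,
                                 random_state=0).importances_mean

SHIFT_THRESHOLD = 0.02  # assumed tolerance for importance drift
for j, (b, c) in enumerate(zip(baseline, current)):
    if abs(b - c) > SHIFT_THRESHOLD:
        print(f"feature {j}: importance shifted {b:.3f} -> {c:.3f}")
```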
Trace & Understand
At every stage of the medical device development we need proper mechanisms to register and trace which model pipelines were and are being used. This involves tracing which model pipelines have been part of the development phases, within model selection and in any experimentation performed. It also includes accounting for the features considered, even if finally discarded. And of course, we need to trace which model pipelines are deployed at all times.
Tracing involves model pipelines, data, hypotheses and limitations
Tracing also involves the data sets ingested in all those experiments, and particularly the data used to train the model pipeline in deployment. This is the only way to properly monitor the model performance variations and to ensure reproducibility in the medical device development. On top of being a necessity to fulfill regulations, a proper tracing system speeds up model improvements and error fixes when these are necessary. Tracing involves the model pipeline, its expected performance, the data sets, the hypotheses considered, and the usability limitations. In clinical trials it may also be required to trace which organizations or individual practitioners are using the medical device, something that needs to be part of the device development plan.
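As an illustration, here is a minimal sketch of a tracing record for one deployed model pipeline. The fields and the fingerprinting scheme are illustrative assumptions; tools such as MLflow or DVC provide production-grade equivalents of this kind of bookkeeping.

```python
# A minimal sketch of a tracing record for one deployed model pipeline. The
# fields, identifiers, and hashing scheme are illustrative assumptions.
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class PipelineTrace:
    pipeline_id: str          # version of the model pipeline (code + config)
    training_data_hash: str   # fingerprint of the exact training data set
    expected_auc: float       # out-of-sample estimate from model selection
    hypotheses: list          # modeling hypotheses considered
    limitations: list         # documented usability limitations
    deployed_at: str          # deployment timestamp (UTC)

def fingerprint(payload: bytes) -> str:
    """Stable fingerprint for a serialized data set or artifact."""
    return hashlib.sha256(payload).hexdigest()[:16]

trace = PipelineTrace(
    pipeline_id="scaler+logreg@1.3.0",                       # hypothetical version tag
    training_data_hash=fingerprint(b"serialized training data"),
    expected_auc=0.91,                                       # illustrative value
    hypotheses=["i.i.d. sampling across sites"],
    limitations=["adults only", "single scanner vendor"],
    deployed_at=datetime.now(timezone.utc).isoformat(),
)
print(json.dumps(asdict(trace), indent=2))
```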
While a machine learning model within a medical device may aim to operate as autonomously as possible, health solutions require thorough monitoring and tracing processes where the benefits of the device and the safety of the patient are always well under control. Make sure your machine learning experts work closely with the medical experts to optimize the whole medical device architecture.
Powered by Data, Driven by Science
Medical devices that include machine learning components have the potential to improve our health systems. These devices can assist medical practitioners in diagnosis tasks, improve trialing processes, and provide earlier warning signals that trigger medical interventions. Read some examples in emergency medicine in [1]. Implementing best practices in machine learning development is fundamental to ensure an efficient medical device certification process and, ultimately, the safety and performance of the approved medical device.
At PickleTech, we develop tailored solutions to improve competitive aspects related to Health, Sports, and DeepTech. We believe Data Science and advances in Machine Learning, coupled with domain knowledge and experimentation, have the potential to provide new tools to better understand, monitor, and systematically improve the competitive performance of organizations.
[1] Data Science for Emergency Medicine, PickleTech, https://pickletech.eu/blog-emergency/
[2] Artificial Intelligence and Machine Learning in Software as a Medical Device, US Food and Drug Administration, https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-software-medical-device
[3] Medical devices, European Medicines Agency, https://www.ema.europa.eu/en/human-regulatory/overview/medical-devices
[4] CE Mark Versus FDA Approval: Which System Has it Right?, CRST, https://crstodayeurope.com/articles/2015-feb/ce-mark-versus-fda-approval-which-system-has-it-right/
[5] Regulatory Requirements for Medical Devices with Machine Learning, Johner Institute, https://www.johner-institute.com/articles/regulatory-affairs/and-more/regulatory-requirements-for-medical-devices-with-machine-learning/
[6] Guidance on Clinical Evaluation (MDR) / Performance Evaluation (IVDR) of Medical Device Software, by the Medical Device Coordination Group (MDCG) of European Commission, https://health.ec.europa.eu/system/files/2020-09/md_mdcg_2020_1_guidance_clinic_eva_md_software_en_0.pdf
[7] Powered by Data, Driven by Science, PickleTech, https://pickletech.eu/blog/
[8] On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, G. Cawley, N. Talbot, https://jmlr.csail.mit.edu/papers/volume11/cawley10a/cawley10a.pdf
[9] Nested versus non-nested cross-validation, Scikit-learn, https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
[10] Data Drift vs. Concept Drift, Philip Tannor, https://deepchecks.com/data-drift-vs-concept-drift-what-are-the-main-differences/
[11] Algorithm Audit for a Trustworthy AI, PickleTech, https://pickletech.eu/blog-audit/
[12] MLOps, https://en.wikipedia.org/wiki/MLOps
[13] Quality Management Systems, The European Union Medical Device Regulation, https://eumdr.com/quality-management-system/