Addressing Data Drift to Ensure Fair and Optimal Patient Care


Dr. Yelena Yesha, Innovation Officer at the University of Miami Frost Institute for Data Science and Computing (IDSC) and Director of the IDSC AI + Machine Learning program, and her team provide an update on their intermediate results from the 2022 agreement establishing an FDA AI Testbed and the NSF/FDA grant awarded in September 2023, a combined effort to ensure that medical devices are trained to be accurate and reliable.

Most modern medical images, including digital CT scans, are typically stored as DICOM (Digital Imaging and Communications in Medicine) files, which contain both the image data and important metadata (patient information, scan settings, etc.). When working with medical image datasets, “reading” the data means using viewing software or specialized tools to visualize the images in 2D or 3D. At this point, the images are only being viewed, not changed.

Why Change the Data?

Sometimes the raw data from, say, a CT scan isn’t ideal for analysis and must be processed or adjusted to improve accuracy and make comparisons easier. This can include:

Normalization, which is where the term data distribution comes into play. In a raw image, the pixel values are already distributed in a certain way: they reflect the density or brightness of tissues. To better analyze the image, one might redistribute those pixel values to highlight or enhance certain features. Normalization may also mean ensuring all images are on the same scale, especially when working with scans from different machines or settings.

Enhancement may be used to increase the contrast between healthy and abnormal tissue, or to highlight important features (e.g., tumors, bones, or soft tissue) when a lack of contrast makes certain details hard to see.

Taking Enhancement a step further, Segmentation—for example, isolating an organ from surrounding tissue—enables the study of specific areas of interest.

And finally, Cleaning or Filtering removes “noise” (visual distortions that appear as random specks or grains) or irrelevant data that could obscure important details and lead to misinterpretation.
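As a rough illustration of these steps, the sketch below normalizes a synthetic “scan,” stretches its contrast, and median-filters away a noise spike. The function names and toy pixel values are illustrative, not taken from the team’s pipeline:

```python
import numpy as np

def normalize(img):
    """Rescale pixel values to [0, 1] so scans from different
    machines or settings share a common scale."""
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img, dtype=float)

def enhance_contrast(img, low_pct=2, high_pct=98):
    """Stretch contrast by clipping to percentile bounds, making
    differences between tissue intensities easier to see."""
    lo, hi = np.percentile(img, [low_pct, high_pct])
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

def median_filter3(img):
    """3x3 median filter: removes 'salt-and-pepper' specks while
    preserving edges."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    stacked = np.stack([padded[i:i + h, j:j + w]
                        for i in range(3) for j in range(3)])
    return np.median(stacked, axis=0)

# A synthetic 'scan': uniform tissue, one dense region, one noise spike
scan = np.full((8, 8), 100.0)
scan[2:5, 2:5] = 300.0   # dense region (e.g., bone)
scan[0, 0] = 1000.0      # isolated noise spike
clean = median_filter3(enhance_contrast(normalize(scan)))
# The spike is filtered out; the dense region survives at full contrast.
```

This mirrors the sequence described above: put values on a common scale, enhance contrast, then clean residual noise before any model sees the image.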

Addressing Distributional Data Drift

When machine-learning models are deployed to process medical image datasets, distributional data drift may occur over time. The growing use of clinical Artificial Intelligence/Machine Learning (AI/ML)-enabled Software as a Medical Device (SaMD) poses significant challenges for regulatory bodies in ensuring that these “devices” are, to quote the grant language, “valid, robust, transparent, explainable, fair, safe, and accurate.”1 Ignoring data drift can lead to incorrect diagnoses or suboptimal treatment recommendations, potentially compromising patient care.

Drift may result from various factors, including alterations in imaging equipment, differences in imaging protocols, variations in patient demographics, or updates in image-preprocessing techniques. Additionally, current methods have several limitations in detecting drift, such as:

  • Limited specialized feature-extraction (isolating characteristics or patterns) and dimensionality-reduction (reducing the number of variables) techniques
  • Abnormal datasets arising from data-entry errors, sensor malfunctions, or malicious activity, which require efficient anomaly-detection techniques
  • The absence of real-time processing solutions for handling abnormal data, especially during the preprocessing stage. The existing process often relies on hospital experts to manually label and remove invalid data before training, which significantly increases the cost and complexity of model development. When datasets contain substantial noise or inconsistencies, drift detection may incorrectly identify or overlook drifts, compromising the reliability of the analysis.

To address these limitations and help ensure the accuracy and reliability of diagnostic models in dynamic clinical environments, Dr. Yesha’s team answered the challenge of a National Science Foundation/Food and Drug Administration grant entitled “Towards an active surveillance framework to detect AI/ML-enabled Software as a Medical Device (SaMD) data and performance drift in clinical flow” by implementing scalable and reliable solutions.

Successful Solutions with Impressive Results

Data sketches are compact, approximate representations of large datasets that retain key characteristics of the data while using significantly less memory and fewer computational resources. These sketches summarize complex medical image data for faster analysis, querying, and processing, without the need to store or work with the entire dataset in its original form. The team developed a robust baseline library model (a pre-built, reusable set of algorithms or functions that can be applied across different datasets without reinventing the process each time) for real-time anomaly detection, allowing efficient comparison of incoming images and identification of anomalies. They discovered that data sketches significantly enhance data quality by creating compact, noise-reduced representations, which lead to more stable similarity comparisons.
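To make the idea concrete, here is a generic random-projection (SimHash-style) sketch: each high-dimensional feature vector is compressed to a short sign pattern, and sketches can then be compared cheaply. This is a standard sketching technique sketched under our own assumptions, not necessarily the team’s implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sketch(vec, planes):
    """Project a high-dimensional feature vector onto random
    hyperplanes and keep only the signs: a compact sketch."""
    return np.sign(planes @ vec)

def sketch_similarity(s1, s2):
    """Fraction of agreeing signs approximates how closely the
    original vectors point in the same direction."""
    return float(np.mean(s1 == s2))

dim, k = 4096, 128                # feature dimension vs. sketch size
planes = rng.standard_normal((k, dim))

base = rng.standard_normal(dim)                  # stand-in for image features
near = base + 0.05 * rng.standard_normal(dim)    # near-duplicate image
other = rng.standard_normal(dim)                 # unrelated image

sim_near = sketch_similarity(make_sketch(base, planes), make_sketch(near, planes))
sim_other = sketch_similarity(make_sketch(base, planes), make_sketch(other, planes))
# Near-duplicates agree on almost every sketch bit; unrelated vectors
# agree on roughly half, so drifted or anomalous data stands out.
```

The 4096-dimensional vector is reduced to a 128-entry sketch, so comparing incoming images against a baseline library stays cheap even at scale.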

Additionally, the team fine-tuned a pretrained vision transformer (ViT) model to extract relevant features, using breast cancer images as an example, significantly enhancing model accuracy to 99.11%!

Combined with the data-sketch and fine-tuning solutions, a feature-extraction evaluation demonstrated that cosine similarity scores (a metric used to quantify how closely related two images are based on their pixel or feature representations) between similar datasets improved markedly, increasing from around 50% to 100%.
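Cosine similarity itself is simple to compute from flattened pixel or feature vectors; a minimal sketch, with arbitrary example vectors:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two flattened feature vectors:
    1.0 means identical orientation; lower values mean less related."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 2.0, 3.0])
same_direction = cosine_similarity(x, 2 * x)                 # ≈ 1.0: scaling preserves direction
different = cosine_similarity(x, np.array([3.0, 2.0, 1.0]))  # ≈ 0.71: partial overlap
```

Because the score depends only on direction, not magnitude, it tolerates uniform intensity scaling while still flagging genuinely different feature patterns.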

Finally, a sensitivity evaluation showed that the team’s solutions were highly sensitive to even 1% salt-and-pepper and speckle noise, but not sensitive to lighting noise (i.e., lighting conditions have no impact on drift detection).
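A toy version of such a sensitivity check, assuming cosine similarity over raw pixels (the image and noise fractions here are illustrative, not the team’s evaluation data):

```python
import numpy as np

rng = np.random.default_rng(42)

def cosine(a, b):
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def salt_and_pepper(img, frac):
    """Corrupt a random fraction of pixels to the extremes (0.0 or 1.0)."""
    noisy = img.copy()
    idx = rng.choice(noisy.size, size=int(frac * noisy.size), replace=False)
    noisy.flat[idx] = rng.integers(0, 2, size=idx.size).astype(float)
    return noisy

img = rng.random((64, 64))                        # stand-in for a scan
sp_sim = cosine(img, salt_and_pepper(img, 0.01))  # 1% salt-and-pepper noise
light_sim = cosine(img, 1.2 * img)                # uniform brightness change
# Even 1% salt-and-pepper noise measurably lowers the similarity score,
# while a uniform lighting change leaves it essentially unchanged.
```

This matches the behavior one wants from a drift detector: real corruption registers, while harmless global intensity shifts do not trigger false alarms.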

FDA Mandate and Testbed

Answering another of the grant’s mandates, “to promote diversity, equity, and inclusion in STEM fields by providing opportunities for underrepresented minority groups and female scholars-in-residence to work as research scholars at the FDA,” both Principal Investigator Dr. Yelena Yesha and Co-Principal Investigator Dr. Phuong Nguyen are women. Along with co-authors Yusen Wu, Hao Chen, and Alex Pissinou Makki, the team published their findings at arXiv.org2.

Dr. Yesha and her team also utilized their “FDA AI Testbed,” established through an agreement signed with the FDA in 2022. This testbed serves as the platform for post-market evaluation and monitoring of AI/ML-enabled devices, addressing challenges in clinical practice such as:

  • Complexity of clinical records and backwards observation
  • Complexity of measuring algorithmic performance
  • Complexity of EHR/PACS interoperability

The candidate AI devices used for testing and monitoring include:

  • Early lung cancer screening using CT scans
  • Breast cancer screening using mammography
  • Brain tumor classification using CT scans/PET scans

With their advanced methods for detecting distributional drift in CT-scan medical imaging, the team’s data-sketching techniques and fine-tuned ViT model enhanced anomaly-detection accuracy and stability, and offered a scalable solution for improving model robustness. Read more about their findings here.

_________

1      This research was supported by the National Science Foundation under Award #2326034 NSF/FDA: Towards an active surveillance framework to detect AI/ML-enabled Software as a Medical Device (SaMD) data and performance drift in clinical flow as part of the NSF/FDA scholar-in-residence program, with valuable input from Regulatory Scientists at the FDA’s DIDSR within CDRH’s OSEL.

2      Yusen Wu, Hao Chen, Alex Pissinou Makki, Phuong Nguyen, Yelena Yesha (15 August 2024). “Efficient Data-Sketches and Fine-Tuning for Early Detection of Distributional Drift in Medical Imaging” https://arxiv.org/abs/2408.08456.

 

About Dr. Yelena Yesha

At the University of Miami, Dr. Yelena Yesha is the Knight Foundation Endowed Chair of Data Science and AI at the Frost Institute for Data Science and Computing (IDSC). At IDSC, Dr. Yesha is also the Director for the AI + Machine Learning program, as well as Innovation Officer. In her Innovation role, Dr. Yesha assists faculty in engaging government and industrial partners to collaborate with the University and consults with faculty on developing research ideas into innovations.

Dr. Yesha was the Founding Director of the National Science Foundation Center for Accelerated Real Time Analytics (CARTA), an NSF-funded Industry/University Cooperative Research Center (I/UCRC) that aims to develop long-term partnerships among industry, academia, and government. CARTA partners with Rutgers University New Brunswick, North Carolina State University, the University of Maryland Baltimore County (UMBC), Tel Aviv University, and the University of Miami.

Dr. Yesha received her B.Sc. degrees in Computer Science and in Applied Mathematics from York University, Toronto, Canada, and her M.Sc. degree and Ph.D. degree in Computer Science from The Ohio State University. She has published 11 books as author or editor, and more than 200 papers in prestigious refereed journals and refereed conference proceedings, and she has been awarded external funding in a total amount exceeding $65 million. She is currently working with leading industrial companies and government agencies on new innovative technology in the areas of blockchains, cybersecurity, and big data analytics with applications to electronic commerce, climate change, and digital healthcare. Dr. Yesha is a fellow of the IBM Centre for Advanced Studies.

Forbes magazine highlighted Dr. Yesha’s accomplishments in technology in a two-part profile: Part I: Dr. Yelena Yesha: Meet The Tenacious Pioneer Pushing Innovation To Address Real World Problems and Part II: Dr. Yelena Yesha: Pushing Technology Boundaries To Solve The World’s Biggest Problems, and covered her recent work with NASA here: Revolutionizing Satellite Security: NASA’s Groundbreaking Project To Integrate AI, Blockchain, & Nanosatellites

Yelena Yesha, PhD
Innovation Officer, University of Miami Frost Institute for Data Science and Computing (IDSC)
Director, IDSC AI + Machine Learning
Knight Foundation Endowed Chair in Data Science and AI
Professor, Department of Radiology
Professor, Department of Computer Science
Board Member + Senior Liaison for Interaction with Industries and Special Projects, IMSA (Institute of the Mathematical Sciences of the Americas at UM)
Founding Director, NSF CARTA