How Health Systems can Provide Safer Care by Leveraging AI/Machine Learning Technology


The current practice of medicine is incredibly biased because its policies, procedures, technologies and people are all implicitly biased. Although explicitly biased individuals and processes in healthcare have received ongoing attention, implicit bias is also ingrained in long-standing policies, procedures, and technologies.

Recently, many have wondered whether introducing artificial intelligence and machine learning (AI/ML) technologies into the healthcare setting will increase bias and harm. It is possible: when AI/ML solutions use inherently biased studies, policies or processes as inputs, the technology will, of course, produce biased outputs. However, AI/ML technology can also be key to making the practice of medicine more fair and equitable. When done right, it has the potential to greatly reduce bias in medicine by flagging insights or critical moments that a clinician might not see. To create technology that better serves at-risk and underserved individuals and communities, technologists and healthcare organizations must actively work to minimize bias when creating and deploying AI/ML solutions. They can do so by leveraging the following three strategies:


Creating a checklist that evaluates potential sources of bias and what groups may be at risk for inequity;


Proactively evaluating models for bias and robustness; and


Continuously monitoring results and outputs over time.

 

Understanding why healthcare is biased and the sources of bias

Bias enters healthcare in a variety of ways. Depending on how medical instruments were developed, they may not account for patients of different races. For example, pulse oximetry is more likely to miss hypoxemia (as measured by arterial blood gas) in Black patients than in White patients. This is because pulse oximeters were developed and calibrated on light-skinned individuals; since a pulse oximeter reads light passing through the skin, it’s not surprising that skin color could affect readings.

Policies and processes can also hold inherent bias. Many organizations prioritize patients for care management using models that predict a patient’s future cost, based on the assumption that patients with the highest healthcare costs also have the greatest needs. The issue with this assumption is that Black patients tend to generate lower healthcare costs than White patients with the same level of comorbidities, likely because they face more barriers to accessing health care. As a result, resources might be misallocated to patients with lower needs (but higher predicted cost).

Historical studies have also led to inequities in care. Interpretation of spirometry data (for lung capacity) creates unfairness because Black people are assumed to have 15% lower lung capacity than White people, and Asian people are assumed to have 5% lower. These “correction factors” are based on historical studies that conflated average lung capacity with healthy lung capacity, without accounting for socioeconomic distinctions. For instance, lung capacity tends to be reduced in individuals who live near roads, and living near roads is itself correlated with belonging to a disadvantaged ethnic group.

These care disparities have a significant impact. For example, sepsis, a condition that causes over 300,000 deaths per year, disproportionately affects minority communities. According to the Sepsis Alliance, Black and Hispanic patients have a higher incidence of severe sepsis than White patients; Black children are 30% more likely than White children to develop sepsis after surgery; and Black women have more than twice the risk of severe maternal sepsis that White women have.

For health systems, creating tools that actively work to combat these disparities in care isn’t a nice-to-have but a mission-critical must-have. Health systems have a responsibility to provide equitable, safe care, and AI/ML technologies hold the promise of helping them do so.

What can be done to combat bias and promote equity in AI/ML technology?

Health organizations can implement these three strategies when launching AI/ML technologies to drive better, more equitable care outcomes.

Create a checklist that evaluates potential sources of bias and what groups may be at risk for inequity. Prior to validating or deploying a predictive model, it is worthwhile to clearly describe the clinical/business driver(s) for the intended predictive model and how the model will be used. Given the intended use, is there a risk that the model might perform unequally across subgroups and/or result in an unequal allocation of resources or outcomes for specific subgroups? If the prediction target is only a proxy for the outcome of interest, could that lead to unintended disparities between subgroups?

Once the objectives are clearly determined, it is possible to identify potential sources of bias in a given model. Some example questions to address include:

  • Are there inputs that might be predictive of the outcome for some subgroups (e.g., socioeconomic status) that are not included in the model?

  • Is the prediction target measured in the same way for all subgroups?

  • Are input variables more likely to be missing in one subgroup than another?

  • Could end users use the model outputs differently for specific subgroups?
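One of the checklist questions above, whether input variables are more likely to be missing in one subgroup than another, can be checked directly against data. The following is a minimal sketch, not a prescribed method: the record layout, the field names `group` and `lactate`, and the toy values are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def missing_rate_by_subgroup(records, group_key, feature):
    """Return {subgroup: fraction of records where `feature` is None or absent}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec.get(feature) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical toy records; real data would come from the health system's own sources.
records = [
    {"group": "A", "lactate": 1.8},
    {"group": "A", "lactate": 2.4},
    {"group": "A", "lactate": None},
    {"group": "A", "lactate": 1.1},
    {"group": "B", "lactate": None},
    {"group": "B", "lactate": None},
    {"group": "B", "lactate": 3.0},
    {"group": "B", "lactate": 2.2},
]

rates = missing_rate_by_subgroup(records, "group", "lactate")
for group, rate in sorted(rates.items()):
    print(f"subgroup {group}: {rate:.0%} of lactate values missing")
```

A large gap between subgroups does not by itself prove the model is biased, but it is a signal worth investigating before validation.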

 

Proactively evaluate models for bias and robustness. Identifying subgroups at risk of bias or inequity facilitates explicit testing for differences in model performance between subgroups. Understanding differences in performance is necessary to avoid and mitigate bias, but it is not sufficient, because the validation data may still differ in important ways from the environment in which the model is ultimately deployed. Fortunately, new machine learning techniques can evaluate whether models are robust to differences in data and can also identify the conditions under which a model will no longer perform reliably and may become unsafe.
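Explicit testing for performance differences between subgroups can start very simply. The sketch below computes sensitivity and precision per subgroup from binary labels and binary model alerts. The function name, the subgroup labels, and the toy data are assumptions made for illustration; which metrics and subgroups matter will depend on the intended use of the model.

```python
def subgroup_metrics(y_true, y_pred, groups):
    """Compute sensitivity and precision for each subgroup.

    y_true, y_pred: sequences of 0/1 outcomes and 0/1 model alerts.
    groups: sequence of subgroup labels, aligned with y_true and y_pred.
    """
    counts = {}
    for truth, pred, group in zip(y_true, y_pred, groups):
        c = counts.setdefault(group, {"tp": 0, "fp": 0, "fn": 0})
        if pred and truth:
            c["tp"] += 1
        elif pred and not truth:
            c["fp"] += 1
        elif truth and not pred:
            c["fn"] += 1
    results = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]  # patients who truly had the outcome
        flagged = c["tp"] + c["fp"]    # patients the model alerted on
        results[group] = {
            "sensitivity": c["tp"] / positives if positives else None,
            "precision": c["tp"] / flagged if flagged else None,
        }
    return results


# Hypothetical toy data: six patients in each of two subgroups.
groups = ["A"] * 6 + ["B"] * 6
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]

metrics = subgroup_metrics(y_true, y_pred, groups)
sensitivity_gap = metrics["A"]["sensitivity"] - metrics["B"]["sensitivity"]
print(metrics)
print(f"sensitivity gap (A - B): {sensitivity_gap:.2f}")
```

In this toy data the model misses more true cases in subgroup B than in subgroup A, which is the kind of difference this step is meant to surface. As noted above, passing such a check on validation data is necessary but not sufficient.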

Continuously monitor results and outputs over time. Even if models are free from bias when initially validated and deployed, it is essential to keep monitoring them to ensure performance does not degrade over time; otherwise we risk harming patients, making care less safe and potentially exacerbating bias. Models are particularly susceptible to failure after unanticipated changes in technology (e.g., new devices, new code sets), population (e.g., demographic shifts, new diseases), or behavior (e.g., practice patterns, reimbursement incentives). These changes are collectively referred to as dataset shift, because the data encountered in clinical practice differ from the data used to train the predictive model. Although clinicians, administrators, or IT teams can mitigate changes in performance by explicitly identifying scenarios in which dataset shift is likely, it is equally important that solution vendors monitor model performance on an ongoing basis and update the models when needed.
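As one illustration of what ongoing monitoring can look like, the sketch below compares the distribution of a single input between the training period and a recent period using the population stability index (PSI). PSI is a generic drift heuristic chosen here for simplicity, not a technique named in this article; the bin edges, the heart-rate values, and any alerting threshold a team might attach to the score are assumptions.

```python
import math


def population_stability_index(baseline, current, edges, eps=1e-6):
    """PSI between two samples of one numeric input, using fixed bin edges.

    edges must be sorted ascending; values below the first edge fall in bin 0.
    Empty bins are floored at eps so the logarithm stays defined.
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            bin_index = sum(v >= e for e in edges)
            counts[bin_index] += 1
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical heart-rate samples from the training period and a recent period.
baseline = [72, 80, 65, 90, 110, 75, 85, 95]
recent = [95, 100, 112, 88, 120, 105, 92, 115]
edges = [70, 90, 110]

psi_same = population_stability_index(baseline, baseline, edges)
psi_shifted = population_stability_index(baseline, recent, edges)
print(f"PSI, baseline vs itself: {psi_same:.3f}")
print(f"PSI, baseline vs recent: {psi_shifted:.3f}")
```

A score of zero means the binned distributions match, and larger values indicate a larger shift. A check like this only covers inputs; tracking model performance itself (for example, the subgroup metrics from validation) over time is still needed.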

As more health systems and healthcare organizations implement AI/ML technology to enable patient-specific insights and drive improved care, they need to actively work to reduce bias and provide better, more equitable care by implementing these three key strategies. Understanding the potential sources of bias, proactively evaluating models for bias, and monitoring results over time will help reduce differential treatment of patients by race, gender, weight, age, language and income.


Building and deploying AI predictive tools in healthcare isn’t easy. The data are messy and challenging from the start, and building models that can integrate, adapt, and analyze this type of data requires a deep understanding of the latest AI/ML strategies and an ability to employ these strategies effectively. Recent studies and reporting have shown how hard it is to get it right, and how important it is to be transparent with what’s “under the hood” and the effectiveness of any predictive tool. 

What makes this even harder is that the industry is still learning how to evaluate these types of solutions. While there are many entities and groups (such as the FDA) working diligently on creating guidelines and regulations to evaluate AI and predictive tools in healthcare, at the moment, there’s no governing body explaining the right way to do predictive tool evaluations, which is leaving a gap in terms of understanding what a solution should look like and how it should be measured.

As a result, many are making mistakes when evaluating AI and predictive solutions. These mistakes can lead health systems to choose predictive tools that aren’t effective or appropriate for their population. As a long-time researcher in the field, I have seen these mistakes made repeatedly, and I have guided health systems on how to overcome them and arrive at a safe, robust, and reliable tool.

Here are seven common mistakes made when evaluating an AI/predictive healthcare tool, and how to overcome them to ensure an effective tool:

  1. Only the workflow is evaluated, not the models: The models are just as important as the workflow. Look for high-performing models (e.g., with both high sensitivity and high precision) before implementing them within a workflow. Skipping model evaluation before implementation and assuming you can obtain efficacy by optimizing workflows alone is like not knowing whether a drug works and changing its label to try to increase its effectiveness.

  2. The models are evaluated, but with the wrong metrics: The models should be evaluated, but the metrics should be chosen based on the mechanism of action for each condition area. For example, in sepsis, lead time (the median time prior to antibiotic administration) is critical. But you also don’t want to alert on too many people, because low-quality alerts that are not actionable lead to provider burnout and over-treatment. The key criteria to look for in a sepsis tool are high sensitivity, significant lead time, and a low false alerting rate.

  3. Adoption isn’t measured on a granular level: Typically, end user adoption isn’t measured. However, to obtain sustained outcome improvements, a framework for measuring adoption (at varying levels of granularity) and improving adoption is critical. Look to see if the tool also comes with an infrastructure that continuously monitors use, and provides strategies to improve and increase adoption.

  4. The impact on outcomes isn’t measured correctly: Many studies rely on coded data to identify cases and measure outcome impact. Coded data are not reliable for this purpose because coding is highly dependent on documentation practices, and a surveillance tool often changes documentation itself. In fact, a common flawed design is a pre/post study in which the post period uses a surveillance tool that dramatically increases the number of coded cases. This creates the perception that outcomes have improved because the adverse event rate (e.g., the sepsis mortality rate among coded cases) has decreased, even if the main change is a larger denominator. Look for rigorous studies of the tool that account for these types of issues.

  5. The ability to detect and tackle shifts isn’t identified: If a model doesn’t proactively tackle the issue of shifts and transportability, it is at risk of being “unsafe.” Strategies to reduce bias and adapt to dataset shift are critical because practice patterns change frequently (see what happened at one hospital during COVID-19, for example). Look for evidence of high performance across diverse populations to see whether the solution is detecting and tuning appropriately for shifts (read more about best practices for combating dataset shift in this recent New England Journal of Medicine article).

  6. “Apples to oranges” outcome studies are compared: A common mistake is to overlook what the standard of care was in the environment where the outcome studies were done. For example, a 10% improvement in outcomes at a high-reliability organization may be just as impressive as, or more impressive than, a similar improvement at an organization with historically poor outcomes. Understanding the populations in which the studies were done and the standard of care in those environments will help you understand how and why the tool worked.

  7. Assuming a team of informaticists can tune any model to success: Keeping models tuned for high performance over time is a significant lift. Further, a common mistake is to assume any model can be made to work in your environment with enough rules and configurations added on top. The predictive AI tool should come with its own ability to tune, along with an understanding of when and how to tune. Starting with a rudimentary model is akin to being given the names of molecules and being asked to create the right drug by mixing the ingredients correctly.
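To make the sepsis criteria from mistake 2 concrete (high sensitivity, significant lead time, low false alerting rate), here is a minimal sketch of how the three numbers could be computed from encounter-level data. The field names (`sepsis`, `alert_hr`, `abx_hr`), the toy encounters, and the exact metric definitions are assumptions for illustration: lead time is taken as hours from first alert to antibiotics among correctly alerted sepsis cases, and false alert rate as the share of alerts fired on non-sepsis encounters. A real evaluation would need clinically agreed definitions.

```python
from statistics import median


def sepsis_alert_metrics(encounters):
    """Summarize alert quality for a list of encounter dicts.

    Each encounter has: sepsis (bool), alert_hr (hours from admission to the
    first alert, or None if no alert), abx_hr (hours to antibiotics, or None).
    """
    septic = [e for e in encounters if e["sepsis"]]
    alerted = [e for e in encounters if e["alert_hr"] is not None]
    true_alerts = [e for e in alerted if e["sepsis"]]
    lead_times = [
        e["abx_hr"] - e["alert_hr"] for e in true_alerts if e["abx_hr"] is not None
    ]
    return {
        "sensitivity": len(true_alerts) / len(septic) if septic else None,
        "false_alert_rate": (
            (len(alerted) - len(true_alerts)) / len(alerted) if alerted else None
        ),
        "median_lead_time_hr": median(lead_times) if lead_times else None,
    }


# Hypothetical toy encounters, made up purely for illustration.
encounters = [
    {"sepsis": True, "alert_hr": 2.0, "abx_hr": 6.0},
    {"sepsis": True, "alert_hr": 5.0, "abx_hr": 7.0},
    {"sepsis": True, "alert_hr": None, "abx_hr": 9.0},   # missed case
    {"sepsis": True, "alert_hr": 1.0, "abx_hr": 4.0},
    {"sepsis": False, "alert_hr": 3.0, "abx_hr": None},  # false alert
    {"sepsis": False, "alert_hr": None, "abx_hr": None},
]

summary = sepsis_alert_metrics(encounters)
print(summary)
```

Reporting the three numbers together matters: a tool can look strong on sensitivity alone while alerting so often, or so late, that it is not clinically useful.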

 

When dealing with predictive AI tools in the healthcare space, the stakes could not be higher. As a result, predictive solutions need to be monitored and evaluated to ensure effectiveness. Otherwise, the tools are likely to have no impact or, worse, to affect patients negatively. Understanding the common mistakes, as well as the best practices for evaluation, will help health systems identify solutions that are safe, robust, and reliable, and ultimately help physicians and care team members deliver safer, higher-quality care.

Learn more about Bayesian Health’s research-first mentality, recent evaluations and outcome studies here.


Logo
Stay up to date on the latest in machine learning and healthcare

By submitting this form, you agree to receive newsletter emails from Bayesian Health. You can unsubscribe at any time.

© 2026 Bayesian Health

All Rights Reserved
