As generative AI models make their way into healthcare settings, the debate over their efficacy and potential pitfalls intensifies. While early adopters tout the promise of increased efficiency and novel insights, critics raise concerns about inherent flaws and biases that could adversely impact patient outcomes.
Enter Hugging Face. The AI startup has proposed a solution in the form of a newly launched benchmark called Open Medical-LLM. Developed in collaboration with researchers from Open Life Science AI and the University of Edinburgh’s Natural Language Processing Group, Open Medical-LLM aims to standardize the evaluation of generative AI models across a range of medical tasks.
The benchmark combines existing datasets such as MedQA and PubMedQA, spanning medical domains including anatomy, pharmacology, genetics, and clinical practice. By posing both multiple-choice and open-ended questions that demand medical reasoning and comprehension, Open Medical-LLM assesses how generative AI models perform on healthcare-related tasks.
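To make the multiple-choice setup concrete, here is a minimal sketch of what an evaluation loop of this kind looks like in Python. The sample item and the `model_answer` stub are illustrative stand-ins, not actual MedQA data or the benchmark's real harness; a real evaluation would pull items from the underlying datasets and send the formatted prompt to the model under test.

```python
# Minimal sketch of a multiple-choice medical-QA evaluation loop.
# The sample item is illustrative (not an actual MedQA record), and
# model_answer() is a stand-in for a real model call.

items = [
    {
        "question": "Which vitamin deficiency causes scurvy?",
        "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
        "answer": "B",
    },
]

def model_answer(question: str, options: dict[str, str]) -> str:
    """Stand-in for a generative model: format the prompt, return a letter."""
    prompt = question + "\n" + "\n".join(f"{k}. {v}" for k, v in options.items())
    # A real harness would send `prompt` to the model and parse its choice;
    # here we always guess "A" to keep the sketch self-contained.
    return "A"

# Accuracy over the gold answers is the benchmark's basic score.
correct = sum(model_answer(it["question"], it["options"]) == it["answer"] for it in items)
print(f"accuracy: {correct / len(items):.2f}")
```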
While Hugging Face positions Open Medical-LLM as a robust way to assess healthcare-bound AI models, some medical experts urge caution, warning against over-reliance on such benchmarks and pointing to the significant gap between simulated question-answering and real-world clinical practice.
For example, when given a medical query, GPT-3 recommended tetracycline for a pregnant patient even while correctly explaining that the drug is contraindicated in pregnancy because of potential harm to the fetus. Acting on that recommendation could cause bone growth problems in the baby.
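A probe of this kind is easy to reproduce in spirit. The sketch below uses the Hugging Face `transformers` pipeline with "gpt2" purely as a placeholder model, not the GPT-3 system from the anecdote; any clinical output from such a probe is for evaluation only, never for acting on.

```python
# Sketch of probing a model with the kind of query described above.
# "gpt2" is a placeholder; swap in the model actually under evaluation.
# Output is for evaluation only, never for clinical use.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "A pregnant patient presents with a bacterial infection. "
    "Is tetracycline an appropriate antibiotic, and why or why not?"
)

result = generator(prompt, max_new_tokens=100, do_sample=False)
print(result[0]["generated_text"])
```

The failure mode described above is exactly what a probe like this can surface: a model may state the contraindication correctly and still make the unsafe recommendation in the same answer.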
Indeed, deploying generative AI models in healthcare requires careful consideration and rigorous testing. As Hugging Face research scientist Clémentine Fourrier emphasizes, benchmarks offer initial insights, but real-world testing is indispensable for accurately assessing a model’s applicability in clinical settings.
The cautionary tale of Google’s AI screening tool for diabetic retinopathy serves as a stark reminder. Despite strong accuracy in laboratory evaluations, the tool faltered when deployed in real clinics, underscoring how difficult it is to translate lab performance into practical clinical use.
While Open Medical-LLM provides valuable insights, it is not a panacea. As the field of AI in healthcare evolves, rigorous real-world testing remains paramount to ensuring both patient safety and clinical efficacy.