While large language models like ChatGPT, Bard, and LLaMA have their shortcomings, their impact on the healthcare community is without precedent. Equally important is their effect on IT developers who want to create algorithms designed specifically for the medical profession. There are now applications that can pass the United States Medical Licensing Examination (USMLE), answer basic questions in patients’ emails, and summarize the narrative notes in an EMR.
But as our title suggests, this only scratches the surface. Most of the healthcare-related applications now getting the profession’s attention rely on the ability of these chatbots to generate natural-language text. Large language models (LLMs), however, can also analyze other types of data, including the ICD codes and timeline content in EHRs.
To help differentiate two broad categories of LLMs, Michael Wornow of Stanford University and his colleagues divide them into CLaMs and FEMRs. Clinical language models (CLaMs) are trained primarily on clinical and biomedical text, which can be extracted from narrative notes and patient questions. Foundation models for electronic medical records (FEMRs), on the other hand, “are trained on the entire timeline of events in a patient’s medical history. Given a patient’s EMR as input, a FEMR will output not clinical text, but rather, a machine-understandable ‘representation’ for that patient.” The data input for these FEMRs does not depend solely on natural language text but includes patients’ medical history, a variety of codes, lab values, insurance claim data, and so on.
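To make the FEMR idea concrete, here is a minimal, purely illustrative sketch of how a patient's timeline of coded events might be pooled into a fixed-length numeric representation. Every name, the hash-based toy embedding, and the recency weighting below are assumptions for illustration only; they do not reflect any actual FEMR architecture described by Wornow et al.

```python
# Toy sketch: map an EMR timeline of coded events (e.g. ICD-10 codes)
# to a fixed-length vector "representation" instead of generated text.
# All details here are hypothetical stand-ins, not a real FEMR.
from dataclasses import dataclass
import hashlib
import math

@dataclass
class Event:
    code: str   # e.g. a hypothetical ICD-10 code such as "E11.9"
    day: int    # days since the first recorded visit

def _code_embedding(code: str, dim: int = 8) -> list[float]:
    """Deterministic toy embedding: hash the code into a unit vector."""
    digest = hashlib.sha256(code.encode()).digest()
    vec = [b / 255.0 - 0.5 for b in digest[:dim]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def patient_representation(timeline: list[Event], dim: int = 8) -> list[float]:
    """Pool event embeddings, weighting recent events more heavily."""
    if not timeline:
        return [0.0] * dim
    last_day = max(e.day for e in timeline)
    rep = [0.0] * dim
    total_weight = 0.0
    for e in timeline:
        w = math.exp(-(last_day - e.day) / 365.0)  # recency decay
        for i, v in enumerate(_code_embedding(e.code, dim)):
            rep[i] += w * v
        total_weight += w
    return [v / total_weight for v in rep]

# A short hypothetical timeline: diabetes, hypertension, kidney disease codes.
timeline = [Event("E11.9", 0), Event("I10", 120), Event("N18.3", 400)]
rep = patient_representation(timeline)
print(len(rep))  # fixed-length output regardless of timeline length
```

The point of the sketch is the shape of the interface: however long or sparse the patient's history, the output is a fixed-length vector that downstream models can consume, which is what distinguishes a FEMR from a text-generating CLaM.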
This approach has several advantages. It usually costs less than building general-purpose LLMs, which carry multi-million-dollar price tags for training on billions of scraped data sources. Some FEMRs have been successfully trained on public data sets of fewer than 40,000 patients. There is also evidence to suggest that they outperform traditional machine learning models at prediction, with better sensitivity and specificity as classifiers.
Wornow et al. list several other advantages of FEMRs: they require less labeled data, and they exhibit “emergent capabilities that enable new clinical applications.” For example, they can generate patient representations that let developers build time-to-event models for hundreds of clinical outcomes simultaneously. FEMRs can also handle multimodal data and improve clinician–AI conversations. It is even possible to prompt one with a clinical outcome you want to see occur, in which case it may be able to recommend a therapeutic regimen to reach that goal.
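The "many outcomes from one representation" claim can be illustrated with a small sketch: a single frozen patient vector feeds separate lightweight prediction heads, one per outcome. The outcome names, random weights, and the exponential-hazard form below are illustrative assumptions only, not any published model.

```python
# Hypothetical sketch: one patient representation, many time-to-event heads.
# Outcomes, weights, and the toy hazard model are assumptions for illustration.
import math
import random

DIM = 8
OUTCOMES = ["readmission_30d", "aki", "sepsis"]  # stand-ins for hundreds

random.seed(0)
# One linear "head" per outcome; in practice these would be fitted, not random.
HEADS = {o: [random.uniform(-0.5, 0.5) for _ in range(DIM)] for o in OUTCOMES}

def risk_scores(rep: list[float]) -> dict[str, float]:
    """Score every outcome at once from the same representation."""
    return {o: sum(w * x for w, x in zip(ws, rep)) for o, ws in HEADS.items()}

def expected_time_to_event(score: float, base_rate: float = 0.01) -> float:
    """Toy exponential hazard: E[T] = 1 / (base_rate * exp(score)) days."""
    return 1.0 / (base_rate * math.exp(score))

rep = [0.1] * DIM  # a patient representation from an upstream model
for outcome, score in risk_scores(rep).items():
    print(outcome, round(expected_time_to_event(score), 1))
```

Because the expensive part (producing `rep`) is shared, adding another outcome costs only one more small head, which is why a single representation can serve hundreds of time-to-event models at once.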
Nigam Shah, PhD, Chief Data Scientist for Stanford Health Care, and his colleagues summarized the problem currently facing decision makers and developers who are attempting to apply AI tools in healthcare: “By not asking how the intended medical use can shape the training of LLMs and the chatbots or other applications they power, technology companies are deciding what is right for medicine. The medical profession has made a mistake in not shaping the creation, design, and adoption of most information technology systems in healthcare.”
With that in mind, the medical community should be asking stakeholders to rethink their priorities and address the key question: Are the LLMs being trained with the relevant data and the right kind of self-supervision?