
CardioLLM: A Fine-Tuned Large Language Model for Cardiac Condition Prediction from Clinical Notes
Krishiv Bhatia
22/01/2026
Background: Cardiac complications remain a leading cause of mortality in the intensive care unit (ICU), requiring rapid identification from extensive clinical documentation. While traditional NLP approaches based on TF-IDF vectorization provide interpretable baselines, recent medical large language models (LLMs) offer stronger contextual understanding. This study presents instruction fine-tuning of Google's MedGemma model for cardiac outcome prediction, evaluated against a traditional machine learning baseline on a large-scale ICU dataset.
Methods: We extracted 120,000 discharge summaries (3.81 GB of clinical text) from ICU admissions in MIMIC-IV-Note v3.1 (2008-2022), covering patients who experienced cardiac events: myocardial infarction, heart failure, cardiogenic shock, and arrhythmias. The baseline paired TF-IDF vectorization with a logistic regression classifier. For the proposed approach, we instruction fine-tuned MedGemma-2B with QLoRA (4-bit quantization) on cardiac-specific instruction-response pairs generated from the clinical notes; the fine-tuning dataset included explicit reasoning chains for cardiac symptom identification, laboratory value interpretation, and risk stratification. A web-based interface supports real-time deployment of both models.
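The TF-IDF plus logistic regression baseline described above can be sketched in a few lines of scikit-learn. The toy notes, labels, and hyperparameters (n-gram range, vocabulary size, class weighting) below are illustrative assumptions, not the study's exact configuration or data:

```python
# Minimal sketch of a TF-IDF + logistic regression baseline for
# cardiac event prediction from clinical notes. All data and
# hyperparameters here are illustrative, not the study's own.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for de-identified discharge summaries (1 = cardiac event).
notes = [
    "elevated troponin with st elevation consistent with myocardial infarction",
    "reduced ejection fraction and pulmonary edema suggest acute heart failure",
    "routine postoperative course, no acute events, stable vitals at discharge",
    "ambulating well, normal ecg, discharged home in good condition",
]
labels = [1, 1, 0, 0]

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=50_000, sublinear_tf=True),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
baseline.fit(notes, labels)

# Predicted probability of a cardiac event for a new note.
prob = baseline.predict_proba(
    ["crushing substernal chest pain with troponin rise"]
)[0][1]
```

In a real pipeline the same `fit`/`predict_proba` interface would be applied per outcome (one binary classifier per cardiac event type), which is what makes this baseline a natural comparator for the multi-outcome LLM.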
Results: The instruction fine-tuned MedGemma model showed strong discrimination across all four cardiac outcomes, achieving a macro-average AUROC of 0.93 and AUPRC of 0.55 on the held-out test set. Although overall discrimination was slightly below the TF-IDF logistic regression baseline (AUROC = 0.97, AUPRC = 0.61), the LLM was more stable on minority classes such as arrhythmia and cardiogenic shock and generalized better to complex narrative patterns in clinical documentation. MedGemma processed approximately 45 million tokens during training, and inference averaged ~2 seconds per 512-token note, supporting near-real-time prediction in an ICU decision-support setting.
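The macro-averaged AUROC and AUPRC reported above are per-outcome scores averaged over the four cardiac labels. A minimal sketch on synthetic multi-label scores (random data, not study results) illustrates the computation:

```python
# Sketch of macro-averaged AUROC/AUPRC over four cardiac outcomes.
# Labels and scores are synthetic and only illustrate the metric,
# not the study's reported numbers.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_notes, n_outcomes = 200, 4  # notes x cardiac outcomes (multi-label)
y_true = rng.integers(0, 2, size=(n_notes, n_outcomes))
# Scores correlated with the labels so discrimination is above chance.
y_score = y_true * 0.3 + rng.random((n_notes, n_outcomes)) * 0.7

# average="macro": compute the metric per outcome, then take the mean,
# so rare outcomes weigh as much as common ones.
macro_auroc = roc_auc_score(y_true, y_score, average="macro")
macro_auprc = average_precision_score(y_true, y_score, average="macro")
print(f"macro AUROC={macro_auroc:.2f}, macro AUPRC={macro_auprc:.2f}")
```

Macro averaging is what makes minority-class behavior visible in the headline number: a model that collapses on arrhythmia cannot hide behind strong performance on the majority outcomes.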
Conclusions: Instruction fine-tuning MedGemma on domain-specific ICU data yields performance competitive with the traditional TF-IDF approach. Although overall discrimination was modestly lower (AUROC 0.93 vs 0.97 for the baseline), the fine-tuned LLM showed stronger contextual reasoning, more stable minority-class performance, and better recognition of subtle decompensation patterns in clinical narratives. These findings support the complementary value of medical LLMs for interpretable risk stratification in time-critical ICU settings. The open-source implementation and curated instruction dataset provide a reproducible foundation for extending this approach to other critical-care applications.