Harvard + Llama in Medical Diagnosis: What Open Models Prove

Clinical AI performance is no longer exclusive to closed systems. This case shows where open models are credible and where governance still decides outcomes.

VZ editorial frame

Read this piece through one operating lens: AI does not automate first; it amplifies first. If the underlying decision architecture is clear, AI scales clarity. If it is noisy, AI scales noise and cost.

VZ Lens

Through a VZ lens, this is not content for trend consumption - it is a decision signal. The real leverage appears when the insight is translated into explicit operating choices.

TL;DR

A study led by Harvard Medical School found that the open-source Llama 3.1 405B model (70%) outperformed GPT-4 (64%) in complex clinical diagnoses. This is not just a performance victory, but an institutional turning point: for the first time, hospitals can run cutting-edge AI locally on their own infrastructure, bypassing the HIPAA/GDPR challenges posed by the use of external APIs.


For a long time, it was easy to say about open-source AI: “It’s interesting, cheaper, and more flexible—but for serious tasks, you still need proprietary models.”

That claim is now starting to seriously crack.

In March 2025, the journal JAMA Health Forum published an NIH-funded study led by researchers at Harvard Medical School—in collaboration with clinicians at Beth Israel Deaconess Medical Center and Brigham and Women’s Hospital. The subject of the research: the diagnostic performance of the Llama 3.1 405B open-source model on difficult clinical cases, compared to GPT-4.

The result: Llama 70%, GPT-4 64% correct diagnoses across 92 clinically challenging cases. In terms of best-guess accuracy: Llama 41%, GPT-4 37%.

This isn’t just a number. It marks the beginning of an institutional turning point.


What Actually Happened?

Details of the Harvard Study

The researchers examined 92 cases drawn from the New England Journal of Medicine’s weekly diagnostic column—the “Case Records of the Massachusetts General Hospital” series. These cases were deliberately selected to include the most difficult, rare, and complex clinical scenarios: rare diseases, atypical presentations, and complex conditions affecting multiple systems.

The methodological rigor is also noteworthy: the researchers selected 70 cases for the Llama test that had previously been used to evaluate GPT-4’s performance, then added 22 new cases from after Llama’s training period—with the aim of ruling out the possibility of training data leakage (the model may have already encountered the old cases during training).

The results:

  • Total of 92 cases: Llama 70% correct diagnoses, GPT-4 64%
  • Top-1 ranking accuracy: Llama 41%, GPT-4 37%
  • On the 22 newer cases alone: Llama 73% — which is particularly strong, since these cases were definitely not included in the training data
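The figures above can be sanity-checked with a quick back-calculation. The underlying correct-answer counts are not stated in the article; the counts below are hypothetical choices that reproduce the rounded percentages, not data from the study.

```python
def accuracy_pct(correct: int, total: int) -> int:
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)

# Hypothetical counts consistent with the reported, rounded percentages:
assert accuracy_pct(64, 92) == 70   # Llama: correct dx in the differential
assert accuracy_pct(59, 92) == 64   # GPT-4: correct dx in the differential
assert accuracy_pct(38, 92) == 41   # Llama: top-1 ("best guess") accuracy
assert accuracy_pct(34, 92) == 37   # GPT-4: top-1 accuracy
assert accuracy_pct(16, 22) == 73   # Llama on the 22 post-cutoff cases
```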

What’s on the surface, and what’s going on underneath?

The headlines say: “Open-source Llama beat GPT-4 in medical diagnosis.”

This is true — but it’s not the most important message.

The most important message is what the study’s authors explicitly highlight: for the first time, hospitals are able to run a state-of-the-art diagnostic AI on their own private infrastructure without having to send patient data to an external network.

This has been a long-standing structural barrier in healthcare AI. Closed, API-based models—GPT-4, Claude, Gemini—are powerful, but they run on external servers. Sending medical data to an external API raises serious HIPAA compliance and data privacy issues in the United States—and similar problems under GDPR in Europe.

An open model that can run on a local server with state-of-the-art performance would eliminate this barrier.
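In practice, "never send patient data to an external network" becomes an enforceable policy check rather than a guideline. A minimal sketch of such a gate, with hypothetical hostnames (the `internal.hospital.example` suffix is an illustrative placeholder, not a real deployment):

```python
from urllib.parse import urlparse

# Hypothetical policy: clinical prompts may only go to inference endpoints
# on the institution's own infrastructure. Hostnames here are examples.
APPROVED_HOSTS = {"localhost", "127.0.0.1"}
APPROVED_SUFFIXES = (".internal.hospital.example",)

def is_local_endpoint(url: str) -> bool:
    """Return True only for endpoints inside the hospital network."""
    host = urlparse(url).hostname or ""
    return host in APPROVED_HOSTS or host.endswith(APPROVED_SUFFIXES)

# A locally hosted Llama server passes; an external vendor API does not.
assert is_local_endpoint("http://localhost:8000/v1/chat/completions")
assert not is_local_endpoint("https://api.example-vendor.com/v1/chat/completions")
```

The point is architectural: with a local model, this check can sit in front of every inference call, which no amount of contractual language with an external API vendor replaces.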


Why is this important now?

What has changed technologically?

Llama 3.1 405B is Meta’s 2024 open-source release—and, together with its other versions (8B, 70B), one of the most significant open-source model families on the market. The 405B version was the largest generally available open-source model at the time of its release.

What has changed: Open models are no longer just “good enough”—they now outperform state-of-the-art closed models on certain tasks. This represents a qualitative leap in terms of institutional acceptance.

What has changed in healthcare AI?

Healthcare AI is characterized by five simultaneous tensions:

1. Performance requirements: Diagnostic AI is expected to deliver high accuracy—the stakes in healthcare do not tolerate poor performance.

2. Data sovereignty: Patient data is among the most sensitive personal data. HIPAA (U.S.) and GDPR (EU) impose strict compliance restrictions on the transfer of data to third parties.

3. Auditability: Medical decisions require legal and professional accountability. An opaque, closed model makes auditing difficult.

4. Customizability: Every hospital system has its own protocols, terminology, and electronic health records (EHRs). The ability to perform local fine-tuning is crucial.

5. Sustainable cost: Over the long term, frontier API fees can add up to staggering amounts at the hospital level.

Closed frontier models are strong on the first dimension—but weaker on the second, third, fourth, and fifth. An open, local model presents a different profile across all dimensions.
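The cost tension (point 5) is easy to make concrete. Every number in the sketch below is a placeholder assumption for illustration - workload, token prices, hardware, and operations costs all vary widely by institution - but the structure of the comparison holds:

```python
# Rough multi-year cost comparison: metered API fees vs. amortized local hosting.
# All figures are hypothetical assumptions, not quoted prices.

def api_cost(queries_per_day: int, tokens_per_query: int,
             usd_per_million_tokens: float, years: int) -> float:
    tokens = queries_per_day * 365 * years * tokens_per_query
    return tokens / 1_000_000 * usd_per_million_tokens

def local_cost(hardware_usd: float, annual_ops_usd: float, years: int) -> float:
    return hardware_usd + annual_ops_usd * years

# Hypothetical large hospital system: 20,000 queries/day, 4,000 tokens each.
api = api_cost(20_000, 4_000, usd_per_million_tokens=10.0, years=5)
local = local_cost(hardware_usd=400_000, annual_ops_usd=120_000, years=5)
print(f"5-year API:   ${api:,.0f}")
print(f"5-year local: ${local:,.0f}")
```

Under these assumptions the API path costs more over five years, but the crossover point depends entirely on query volume; at low volume the API is cheaper, which is why this is a per-institution calculation, not a universal verdict.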


Where has the public discourse gone wrong?

What does the “open-source AI is dangerous in hospitals” narrative mean?

One of the most common arguments against open-source AI in a healthcare context is the lack of control: “Who is responsible for the output if the model is open?”

This is a valid question—but it is partly misunderstood.

The question of responsibility does not depend on whether the model is open or closed. Clinical AI is always a supplementary tool, not an independent decision-maker—neither in the case of GPT-4 nor Llama. The doctor is responsible for the diagnosis, not the AI.

What actually differs with an open model is that the institution has internal control over the model’s behavior. It can customize, audit, and restrict it—without having to wait for the vendor’s consent or API updates.

This control in enterprise AI—and especially in healthcare AI—is not an abstract technical advantage. It is a matter of compliance, governance, and risk management.

Why isn’t this about “open source winning the AI race in healthcare”?

The performance of proprietary models is also constantly improving. GPT-4o, Claude 4 Sonnet, and Gemini 2.0 are likely already outperforming Llama 3.1 405B in general testing.

What actually follows from the Harvard study is this: open-source models have reached the institutional decision-making threshold. They don’t win every time—but they are no longer automatically treated as a second-rate option.

This marks a shift in the question: it is no longer “Should open models be used in serious environments?”, but rather “In which cases is it more rational to work with an open model, because it is good enough, more controllable, and more easily localized?”


What deeper pattern is emerging?

The logic behind institutional AI adoption

Institutional AI adoption—in hospitals, law firms, financial organizations, and educational institutions—is not driven solely by performance. Rather, it is driven by a complex decision matrix:

  • Risk Management: Who is responsible if the AI makes a mistake? How can the decision be audited?
  • Compliance: Does data processing comply with applicable laws and regulations?
  • Vendor Dependency: What happens if the vendor raises prices, discontinues the service, or changes the API?
  • Customizability: Can it be integrated into internal systems and data structures?
  • Long-term sustainability: What is the multi-year TCO (Total Cost of Ownership)?

Closed frontier models excel in the first dimension—their performance justifies nearly every institutional decision-making process. But they raise serious questions in the other dimensions.

Open models present a different profile: they meet the necessary threshold in terms of performance (as the Harvard study shows) and hold a structural advantage in other dimensions.
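One way to make this matrix operational is a weighted scorecard. The dimensions come from the list above; the weights and 1-5 scores below are illustrative assumptions for the sketch, not measurements—each institution would set its own:

```python
# Illustrative weighted scoring of the institutional decision matrix.
# Weights and scores are assumptions; weights sum to 1.0.
WEIGHTS = {"performance": 0.30, "compliance": 0.25, "vendor_independence": 0.15,
           "customizability": 0.15, "tco": 0.15}

SCORES = {
    "closed_frontier_api": {"performance": 5, "compliance": 2,
                            "vendor_independence": 1, "customizability": 2, "tco": 2},
    "local_open_model":    {"performance": 4, "compliance": 5,
                            "vendor_independence": 5, "customizability": 5, "tco": 4},
}

def weighted_score(scores: dict) -> float:
    return round(sum(WEIGHTS[dim] * s for dim, s in scores.items()), 2)

for option, scores in SCORES.items():
    print(option, weighted_score(scores))
```

With these (deliberately debatable) weights, the local open model scores higher overall despite losing on raw performance—which is exactly the profile shift the article describes.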

Data sovereignty as an institutional competitive advantage

There is one aspect that is missing from most headlines: in the context of the Harvard study, the Llama-based system allows the hospital to train the AI on its own data—its own cases, protocols, and EHR data—without sharing this data with an external vendor.

In the long run, this isn’t just a compliance issue. It’s a source of competitive advantage: a hospital system that fine-tunes diagnostic AI using its own patient data can potentially achieve better diagnostic performance on its own patient population than a general-purpose model.

This is the combination of data sovereignty and the local learning cycle—and it applies across sectors from healthcare to the legal sector, from the financial sector to industrial applications.

Why is this not an isolated incident?

The Harvard healthcare AI case illustrates a broader trend: the “institutional barrier to entry” for open models has rapidly decreased.

Two years ago, the vast majority of healthcare decision-makers automatically ruled out open models—due to concerns about performance and governance. Today, that same group of decision-makers does not automatically rule them out. They evaluate them.

This shift in perspective is significant in itself—because decision cycles in the institutional AI market are slow. If open models are now on the serious evaluation list, the first real institutional deployments will appear in the next 12–24 months.


What are the strategic implications of this?

What does a decision-maker need to understand from this?

Healthcare—and more broadly, institutional—AI strategies will become significantly more differentiated in the coming period.

The automatic “let’s use the frontier API” decision will not disappear. But alongside it, a more rational question will emerge: For which tasks, with what data security profile, and with what governance requirements is it worth deploying a self-hosted, local open model?

Institutions that are now beginning to seriously ask this question have time to build the necessary infrastructure and governance framework. Those that do not ask it will find themselves starting from a reactive position in a year or two.

Where does this create a competitive advantage?

Localization of diagnostic AI. A hospital system that runs an open model fine-tuned on its own patient population and protocols can potentially achieve better performance on its specific tasks than a general frontier model.

Compliance as an architectural decision. GDPR and HIPAA compliance is not necessarily more complicated with local deployment—in many cases, it is simpler than managing the data protection implications associated with an external API.

Institutional AI autonomy. A vendor-independent AI stack reduces the long-term risks of price hikes, API changes, and platform migrations—which is particularly important for institutions that plan in terms of decades.


What should we be watching for now?

What can we expect in the next 6–12 months?

The first real-world hospital Llama deployments. Following the Harvard study, it is expected that some hospital systems will begin testing open-model-based clinical AI—initially in controlled pilot environments.

Clarification of regulatory frameworks. The FDA (U.S.) and the EMA (EU) are working increasingly actively on the regulatory framework for medical AI. Guidelines regarding the handling of open-source models are also expected to be released.

The proliferation of domain-specific medical fine-tuning. The number of open medical models—Llama-based fine-tunes such as MedLlama and OpenBioLLM, alongside other open models like BioMedLM—will grow as open infrastructure matures.


Conclusion

The Harvard study can be summarized in one sentence: open-source AI has arrived on the scene of serious institutional decision-making.

Not because it has outperformed proprietary models in every dimension. But because, in the most important dimension—performance—it has reached the necessary threshold while retaining its structural advantages in the dimensions of data sovereignty, auditability, and customizability.

The logic of institutional AI decisions is changing. Not tomorrow, not overnight. But the process has begun.

The question for decision-makers is not whether to follow the trend. It is when to start preparing for a world where local, controlled, auditable AI deployment will be the norm—not the exception.


Key Takeaways

  • Open models have reached the threshold of institutional acceptance — Llama 3.1 405B’s clinical performance proves that open models are no longer second-rate options but represent a serious alternative to closed frontier models.
  • Data sovereignty and compliance are the main advantages — An open model that can run on a local server eliminates the legal and data protection risks associated with sending patient data to an external API, which is critical from a HIPAA and GDPR perspective.
  • Institutional decision-making is about more than just performance — Hospitals’ choices are influenced not only by accuracy but by a complex matrix of risk management, auditability, vendor lock-in, and total cost of ownership (TCO).
  • Control is the key value of the open model — The institution can internally customize, audit, and restrict the model, which is not possible with closed models, thereby increasing the level of governance.
  • The question has shifted: “When is it worth using an open model?” — The debate is no longer about whether open models are secure, but rather about which tasks it makes more sense to choose them for, due to better control and localizability.

Strategic Synthesis

  • Translate the core idea of “Harvard + Llama in Medical Diagnosis: What Open Models Prove” into one concrete operating decision for the next 30 days.
  • Define the trust and quality signals you will monitor weekly to validate progress.
  • Run a short feedback loop: measure, refine, and re-prioritize based on real outcomes.
