Medyx v2.0 — AI Disease Diagnosis

Enter patient data and click "Run Multi-Agent Diagnosis" to begin.

6 AI agents collaborate in real time.

Evaluation Results

All completed benchmarks were run on 100 cases each; PAPILA could not be run because the dataset was not accessible. DDxPlus was re-run on 2026-05-26 with the v3 architecture. Datasets are sourced from public Hugging Face repositories.

Benchmark	Dataset	Subsystem	Top-1	Secondary Metric	Cases
Chest X-ray Pneumonia	hf-vision/chest-xray-pneumonia	ImageAgent ViT	95.0%	Macro-F1: 0.95	100
DDxPlus Symptoms	aai530-group6/ddxplus · keyword pipeline · v3 (2026-05-26)	Full pipeline	69.0%	Top-3: 87.0%	100
MedQA-USMLE	GBaker/MedQA-USMLE-4-options	Pipeline + Groq MCQ picker	62.0%	N/A	100
Bone Fracture	Hemg/bone-fracture-detection	ImageAgent ViT (bone)	80.0%	F1: 0.889 · Precision: 1.00	100
HAM10000 Skin Lesion	marmal88/skin_cancer · 7 classes	ImageAgent ViT-Large	66.0%	Macro-F1: 0.11 ⚠️	100
Brain Tumor MRI	AIOmarRehan/brain_tumor_mri_dataset · 4 classes	ImageAgent ViT (brain)	39.0%	Macro-F1: 0.29	100
ODIR-5K Retinal	bumbledeep/odir · 8 ODIR classes	ImageAgent DINOv2	36.0%	ODIR score: 0.117	100
PAPILA Glaucoma	imlab-uiip/papila (unavailable)	ImageAgent DINOv2	N/A	Dataset not accessible	N/A
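
The Bone Fracture row can be sanity-checked from its own numbers. The sketch below assumes a binary task with fracture as the positive class (an assumption; the table does not say so) and recovers recall and the implied confusion counts from the reported accuracy, precision, and F1. The helper name is illustrative.

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) to solve for recall R."""
    return f1 * precision / (2 * precision - f1)

# Values reported in the Bone Fracture row of the table.
precision, f1, accuracy, n_cases = 1.00, 0.889, 0.80, 100

recall = recall_from_precision_f1(precision, f1)

# Precision 1.00 means no false positives, so every error is a false negative.
false_negatives = round((1 - accuracy) * n_cases)
# recall = TP / (TP + FN)  =>  TP = FN * recall / (1 - recall)
true_positives = round(false_negatives * recall / (1 - recall))
true_negatives = round(accuracy * n_cases) - true_positives

print(f"recall ~ {recall:.2f}, TP={true_positives}, FN={false_negatives}, TN={true_negatives}")
```

Under these assumptions the implied true-negative count is zero, which would mean the 100-case sample contained only fracture images. That is an inference from the reported metrics and is worth confirming against the evaluation logs.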

Key Findings

✅ Chest X-ray: 95% — strongest result

Binary pneumonia-vs-normal classification with only 5 errors in 100 cases. The ViT model is well suited to this task: model and dataset are tightly matched.

Binary task · Balanced dataset

📈 DDxPlus v3: 68% → 69% top-1, 86% → 87% top-3

Re-evaluated on 2026-05-26 with the v3 architecture (HistoryAgent now shifts BeliefState; 41-marker lab registry). The +1pp gain comes from History prior shifts on cardiovascular cases; on 100 cases, that is one additional correct case. The keyword ClinicalReasoner is active; enabling the Groq LLM is projected to add +15–25pp.

v3 architecture · 25.3s · 0.25s/case
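
To put the 68% → 69% change in context, the sketch below computes a Wilson score interval for each top-1 result on 100 cases. It is a rough illustration, not part of the Medyx evaluation code; the function name is made up here, and because both runs presumably used the same cases, a paired comparison would be the more rigorous check.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

for label, correct in [("previous top-1", 68), ("v3 top-1", 69)]:
    low, high = wilson_interval(correct, 100)
    print(f"{label}: {correct}/100, 95% interval ~ [{low:.3f}, {high:.3f}]")
```

With only 100 cases the two intervals overlap almost entirely, so a one-case difference cannot by itself distinguish the two architectures.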

📈 MedQA fixed: 0% → 62% with Groq MCQ picker

The previous 0% was a format mismatch: the pipeline outputs disease names, but MedQA answers are often drugs or mechanisms. Fix: added a Groq step that selects the best answer letter given the differential. At 62%, the result is well above the 25% chance level for 4-option questions and is competitive with early GPT-3-class baselines on this benchmark.

Fixed format mismatch · llama-3.1-8b-instant
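
The answer-letter step described above could look roughly like the following. This is a minimal sketch, not the Medyx implementation: `build_mcq_prompt`, `pick_answer_letter`, and the prompt wording are hypothetical, and the model call is abstracted as any callable that maps a prompt string to a reply string, so no specific Groq client API is assumed.

```python
import re
from typing import Callable, Optional

def build_mcq_prompt(question: str, options: dict[str, str], differential: list[str]) -> str:
    """Assemble a prompt asking the model to answer with a single option letter."""
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    return (
        f"Question: {question}\n"
        f"Pipeline differential: {', '.join(differential)}\n"
        f"Options:\n{option_lines}\n"
        "Reply with the single letter of the best option."
    )

def pick_answer_letter(
    question: str,
    options: dict[str, str],
    differential: list[str],
    llm: Callable[[str], str],
) -> Optional[str]:
    """Call the model and return the first standalone option letter in its reply."""
    reply = llm(build_mcq_prompt(question, options, differential))
    match = re.search(r"\b([A-D])\b", reply)
    if match and match.group(1) in options:
        return match.group(1)
    return None  # unparseable reply; the caller decides how to score it

# Usage with a stub in place of the real model (illustrative data only).
options = {"A": "Metformin", "B": "Insulin", "C": "Aspirin", "D": "Warfarin"}
stub_llm = lambda prompt: "Answer: B"
print(pick_answer_letter("Best initial therapy?", options, ["diabetic ketoacidosis"], stub_llm))
```

Returning `None` for unparseable replies keeps format failures visible in the score instead of silently counting them as a particular letter.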

⚠️ HAM10000: 66% top-1 but macro-F1 = 0.11 (class collapse)

The skin ViT predicts "melanocytic nevi" for all 100 cases. Since nevi make up ~66% of the sample, this yields 66% accuracy by always predicting the majority class. Critically, melanoma and BCC have zero recall — a clinical safety concern. Model needs class-weighted retraining.

Class collapse · Zero melanoma recall
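
The gap between 66% accuracy and 0.11 macro-F1 follows directly from always predicting one class. The sketch below reproduces it in plain Python. Only the 66 nevi cases come from the report; the split of the remaining 34 cases across the other six HAM10000 labels is hypothetical and does not affect the result, as long as each class appears at least once.

```python
# Hypothetical split of the 34 non-nevi cases; only the 66 "nv" count is from the report.
true_counts = {"nv": 66, "mel": 11, "bkl": 11, "bcc": 5, "akiec": 3, "vasc": 2, "df": 2}
y_true = [label for label, count in true_counts.items() for _ in range(count)]
y_pred = ["nv"] * len(y_true)  # collapsed model: always predicts the majority class

def per_class_f1(y_true, y_pred, label):
    """F1 for one class, defined as 0.0 when the class is never predicted or present."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_scores = {label: per_class_f1(y_true, y_pred, label) for label in true_counts}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

print(f"accuracy={accuracy:.2f}, macro-F1={macro_f1:.2f}")
# → accuracy=0.66, macro-F1=0.11
```

The nevi class scores an F1 of about 0.80 and the other six classes score 0, so the unweighted mean over seven classes is about 0.11.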

⚠️ Brain Tumor: 39% — domain routing failure

Many MRI scans are misrouted to the "chest" domain by the pixel-level heuristic, which then predicts "no tumor". The brain ViT itself performs well when correctly routed. Fix: train a learned domain classifier instead of using pixel heuristics.

Domain routing bug · Model not at fault
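
One way to check the claim that routing, not the brain ViT, is the bottleneck is to split accuracy by whether each case reached the expected domain. The sketch below assumes per-case records with a routed domain, a true label, and a predicted label; the `CaseResult` type, the function name, and the toy records are all hypothetical and are not taken from the evaluation.

```python
from dataclasses import dataclass

@dataclass
class CaseResult:
    routed_domain: str    # domain chosen by the router, e.g. "brain" or "chest"
    true_label: str
    predicted_label: str

def routing_breakdown(results, expected_domain):
    """Separate routing accuracy from model accuracy on correctly routed cases."""
    routed_ok = [r for r in results if r.routed_domain == expected_domain]
    correct_when_routed = sum(r.true_label == r.predicted_label for r in routed_ok)
    correct_overall = sum(r.true_label == r.predicted_label for r in results)
    return {
        "routing_accuracy": len(routed_ok) / len(results),
        "accuracy_when_routed": correct_when_routed / len(routed_ok) if routed_ok else None,
        "overall_accuracy": correct_overall / len(results),
    }

# Toy records for illustration only.
toy = [
    CaseResult("brain", "glioma", "glioma"),
    CaseResult("brain", "meningioma", "meningioma"),
    CaseResult("chest", "pituitary", "no tumor"),
    CaseResult("chest", "glioma", "no tumor"),
]
print(routing_breakdown(toy, expected_domain="brain"))
```

If `accuracy_when_routed` is high while `routing_accuracy` is low on the real evaluation records, that supports replacing the pixel heuristic rather than retraining the brain model.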

🔬 ODIR: improved 3× with correct model (11% → 36%)

The previous eval used a degenerate model that predicted "other" for every image. Switching to Isaskar/dinov2-base-ODIR-5K raised top-1 to 36% and ODIR composite score to 0.117. Strong recall on Diabetes (77%) and Normal (57%). Minority classes need a balanced sample.

3× improvement · DINOv2-base

⚕️ For research and educational use only. Not validated for clinical deployment.