Enter patient data and click "Run Multi-Agent Diagnosis" to begin.
6 AI agents collaborate in real time.
| Benchmark | Subsystem | Top-1 | Secondary Metric | Cases |
|---|---|---|---|---|
| Chest X-ray Pneumonia (hf-vision/chest-xray-pneumonia) | ImageAgent ViT | 95.0% | Macro-F1: 0.95 | 100 |
| DDxPlus Symptoms (aai530-group6/ddxplus · keyword pipeline · v3, 2026-05-26) | Full pipeline | 69.0% | Top-3: 87.0% | 100 |
| MedQA-USMLE (GBaker/MedQA-USMLE-4-options) | Pipeline + Groq MCQ picker | 62.0% | - | 100 |
| Bone Fracture (Hemg/bone-fracture-detection) | ImageAgent ViT (bone) | 80.0% | F1: 0.889 · Precision: 1.00 | 100 |
| HAM10000 Skin Lesion (marmal88/skin_cancer · 7 classes) | ImageAgent ViT-Large | 66.0% | Macro-F1: 0.11 ⚠️ | 100 |
| Brain Tumor MRI (AIOmarRehan/brain_tumor_mri_dataset · 4 classes) | ImageAgent ViT (brain) | 39.0% | Macro-F1: 0.29 | 100 |
| ODIR-5K Retinal (bumbledeep/odir · 8 ODIR classes) | ImageAgent DINOv2 | 36.0% | ODIR score: 0.117 | 100 |
| PAPILA Glaucoma (imlab-uiip/papila, unavailable) | ImageAgent DINOv2 | N/A | Dataset not accessible | - |
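
The Top-1 and Top-3 figures in the table can be read as top-k accuracy over ranked differentials. The following is a minimal sketch of that metric, not the project's evaluation code; the function name `top_k_accuracy` and the toy cases are assumptions made for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true label appears in the first k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(
        1 for ranked, truth in zip(ranked_predictions, truths) if truth in ranked[:k]
    )
    return hits / len(truths)


# Hypothetical ranked differentials (most likely first); not real benchmark data.
ranked = [
    ["pneumonia", "bronchitis", "asthma"],
    ["gerd", "angina", "pericarditis"],
    ["influenza", "urti", "pneumonia"],
    ["anemia", "gerd", "angina"],
]
truths = ["pneumonia", "angina", "pneumonia", "asthma"]

top1 = top_k_accuracy(ranked, truths, k=1)
top3 = top_k_accuracy(ranked, truths, k=3)
print(f"Top-1: {top1:.2f}, Top-3: {top3:.2f}")  # Top-1: 0.25, Top-3: 0.75
```

A large gap between Top-1 and Top-3, as in the DDxPlus row, suggests the correct diagnosis is often in the differential but not ranked first.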
Chest X-ray Pneumonia: Binary pneumonia vs. normal classification, with only 5 errors in 100 cases. The ViT model is well suited to this task because the model and dataset are tightly matched.
Tags (Chest X-ray Pneumonia): Binary task · Balanced dataset

DDxPlus Symptoms: Re-evaluated on 2026-05-26 with the v3 architecture (HistoryAgent now shifts BeliefState; 41-marker lab registry). History prior shifts on cardiovascular cases added 1 pp. The keyword-based ClinicalReasoner is active; enabling the Groq LLM is projected to add 15 to 25 pp.
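
One way to picture a history-driven prior shift is multiplicative reweighting followed by renormalization. The sketch below is only an assumption about how such a shift could work: `shift_belief`, the multiplier values, and the example diagnoses are hypothetical and are not taken from the HistoryAgent or BeliefState implementation.

```python
from typing import Dict


def shift_belief(
    belief: Dict[str, float], prior_multipliers: Dict[str, float]
) -> Dict[str, float]:
    """Scale each diagnosis probability by a history-derived factor, then renormalize.

    Diagnoses without a multiplier keep a factor of 1.0 (assumed behavior).
    """
    shifted = {dx: p * prior_multipliers.get(dx, 1.0) for dx, p in belief.items()}
    total = sum(shifted.values())
    if total <= 0:
        raise ValueError("belief mass must stay positive after shifting")
    return {dx: value / total for dx, value in shifted.items()}


# Hypothetical belief before history is considered.
belief = {"GERD": 0.5, "Unstable angina": 0.3, "Panic attack": 0.2}
# Hypothetical factor for a patient with known cardiovascular risk factors.
multipliers = {"Unstable angina": 2.0}

updated = shift_belief(belief, multipliers)
top_diagnosis = max(updated, key=updated.get)
print(top_diagnosis)  # Unstable angina
```

In this toy case the shift changes the top-ranked diagnosis, which is the kind of effect that would move Top-1 accuracy on cardiovascular cases.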
Tags (DDxPlus Symptoms): v3 architecture · 25.3 s (0.25 s/case)

MedQA-USMLE: The previous 0% score was a format mismatch: the pipeline outputs disease names, but MedQA answers are drugs or mechanisms. Fix: added a Groq step that selects the best answer letter given the differential. 62% is competitive with early GPT-3-class baselines on this benchmark.
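
The answer-letter step can be sketched as a prompt builder plus a reply parser. This is an illustrative outline under stated assumptions: `build_mcq_prompt`, `parse_answer_letter`, and the prompt wording are hypothetical, and the actual LLM call is deliberately left out.

```python
import re
from typing import Mapping, Optional, Sequence


def build_mcq_prompt(
    question: str, options: Mapping[str, str], differential: Sequence[str]
) -> str:
    """Combine the question, lettered options, and pipeline differential into one prompt."""
    option_lines = "\n".join(
        f"{letter}. {text}" for letter, text in sorted(options.items())
    )
    ddx = ", ".join(differential) if differential else "none"
    return (
        "You are answering a USMLE-style multiple-choice question.\n"
        f"Pipeline differential (most likely first): {ddx}\n\n"
        f"Question: {question}\n{option_lines}\n\n"
        "Reply with the single letter of the best option."
    )


def parse_answer_letter(reply: str, valid_letters: str = "ABCD") -> Optional[str]:
    """Return the first standalone capital letter that is a valid option, else None.

    Simple heuristic: a reply starting with the article "A" would be misread.
    """
    pattern = r"\b([" + re.escape(valid_letters) + r"])\b"
    match = re.search(pattern, reply)
    return match.group(1) if match else None


prompt = build_mcq_prompt(
    "Which drug class is first-line for this patient?",
    {"A": "Beta blocker", "B": "Proton pump inhibitor", "C": "SSRI", "D": "Statin"},
    ["GERD", "Unstable angina"],
)
print(parse_answer_letter("The answer is B."))  # B
```

Returning `None` for an unparseable reply lets the caller count that case as wrong instead of guessing a letter.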
Tags (MedQA-USMLE): Fixed format mismatch · llama-3.1-8b-instant

HAM10000 Skin Lesion: The skin ViT predicts "melanocytic nevi" for all 100 cases. Because nevi make up ~66% of the sample, always predicting the majority class yields 66% accuracy. Critically, melanoma and basal cell carcinoma (BCC) have zero recall, which is a clinical safety concern. The model needs class-weighted retraining.
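
The gap between 66% accuracy and a macro-F1 of 0.11 follows directly from class collapse, and a short calculation shows why. In the sketch below only the 66 nevi cases come from the note above; the split of the other 34 cases across the six remaining classes is an assumption, and the result does not depend on it as long as each class has at least one case.

```python
from typing import Dict, Sequence

CLASSES = ["nv", "mel", "bkl", "bcc", "akiec", "vasc", "df"]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_scores(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Per-class recall and F1; undefined ratios are reported as 0.0."""
    scores = {}
    for cls in CLASSES:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        scores[cls] = {"recall": recall, "f1": f1}
    return scores


# 66 nevi (from the note); the remaining 34 are split arbitrarily (assumption).
counts = {"nv": 66, "mel": 11, "bkl": 11, "bcc": 5, "akiec": 3, "vasc": 2, "df": 2}
y_true = [cls for cls, n in counts.items() for _ in range(n)]
y_pred = ["nv"] * len(y_true)  # collapsed model: always predicts nevi

acc = accuracy(y_true, y_pred)
scores = per_class_scores(y_true, y_pred)
macro_f1 = sum(s["f1"] for s in scores.values()) / len(CLASSES)

print(f"accuracy = {acc:.2f}")       # accuracy = 0.66
print(f"macro-F1 = {macro_f1:.2f}")  # macro-F1 = 0.11
print(scores["mel"]["recall"], scores["bcc"]["recall"])  # 0.0 0.0
```

Only the nevi class gets a nonzero F1 (about 0.80), and averaging it with six zeros gives roughly 0.11, which matches the table.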
Tags (HAM10000 Skin Lesion): Class collapse · Zero melanoma recall

Brain Tumor MRI: Many MRI scans are misrouted to the "chest" domain by the pixel-level heuristic, after which the pipeline predicts "no tumor". The brain ViT itself performs well when scans are routed correctly. Fix: train a learned domain classifier instead of using pixel heuristics.
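
To separate routing errors from model errors, end-to-end accuracy can be reported alongside accuracy on correctly routed cases only. The sketch below illustrates that breakdown; `EvalRecord`, `routing_breakdown`, and the ten toy records are hypothetical and do not reproduce the benchmark numbers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EvalRecord:
    true_domain: str    # domain the image actually belongs to
    routed_domain: str  # domain chosen by the router
    correct: bool       # whether the final diagnosis matched the label


def routing_breakdown(records: List[EvalRecord]) -> Dict[str, Optional[float]]:
    """Report end-to-end accuracy, routing accuracy, and accuracy given correct routing."""
    if not records:
        raise ValueError("at least one record is required")
    routed_ok = [r for r in records if r.routed_domain == r.true_domain]
    conditional = (
        sum(r.correct for r in routed_ok) / len(routed_ok) if routed_ok else None
    )
    return {
        "end_to_end_accuracy": sum(r.correct for r in records) / len(records),
        "routing_accuracy": len(routed_ok) / len(records),
        "accuracy_when_routed_correctly": conditional,
    }


# Hypothetical toy data: 4 scans routed correctly, 6 misrouted to "chest".
records = (
    [EvalRecord("brain", "brain", True)] * 3
    + [EvalRecord("brain", "brain", False)] * 1
    + [EvalRecord("brain", "chest", True)] * 1
    + [EvalRecord("brain", "chest", False)] * 5
)

report = routing_breakdown(records)
print(report)
```

If accuracy on correctly routed cases is much higher than end-to-end accuracy, as in this toy data (0.75 vs. 0.40), the router rather than the classifier is the bottleneck.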
Tags (Brain Tumor MRI): Domain routing bug · Model not at fault

ODIR-5K Retinal: The previous eval used a degenerate model that predicted "other" for every image. Switching to Isaskar/dinov2-base-ODIR-5K raised top-1 to 36% and the ODIR composite score to 0.117. Recall is strong on Diabetes (77%) and Normal (57%). Minority classes need a balanced sample.
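
A class-balanced evaluation sample can be drawn by picking the same number of cases per class. The following is a minimal sketch assuming labels are available as a plain list; `balanced_sample` and the toy label counts are hypothetical and not part of the project's evaluation harness.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def balanced_sample(labels: Sequence[str], per_class: int, seed: int = 0) -> List[int]:
    """Return sorted indices with up to `per_class` randomly chosen cases per label."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    chosen: List[int] = []
    for label in sorted(by_class):
        pool = by_class[label]
        chosen.extend(rng.sample(pool, min(per_class, len(pool))))
    return sorted(chosen)


# Hypothetical, heavily imbalanced label list.
labels = ["Normal"] * 50 + ["Diabetes"] * 30 + ["Glaucoma"] * 5 + ["Cataract"] * 4

indices = balanced_sample(labels, per_class=4)
sample_counts = Counter(labels[i] for i in indices)
print(sample_counts)
```

With equal counts per class, per-class recall for minority classes is estimated from more than a handful of cases, at the cost of no longer reflecting the dataset's natural prevalence.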
Tags (ODIR-5K Retinal): 3× improvement · DINOv2-base

⚕️ For research and educational use only. Not validated for clinical deployment.