MULTI-AGENT AI · v2.0

6 AI agents collaborate in real time.

Evaluation Results
Each benchmark was run on 100 cases, except PAPILA, whose dataset was not accessible. DDxPlus was re-run on 2026-05-26 with the v3 architecture. All datasets come from public Hugging Face repositories.
| Benchmark | Dataset / notes | Subsystem | Top-1 | Secondary metric | Cases |
|---|---|---|---|---|---|
| Chest X-ray Pneumonia | hf-vision/chest-xray-pneumonia | ImageAgent ViT | 95.0% | Macro-F1: 0.95 | 100 |
| DDxPlus Symptoms | aai530-group6/ddxplus · keyword pipeline · v3 (2026-05-26) | Full pipeline | 69.0% | Top-3: 87.0% | 100 |
| MedQA-USMLE | GBaker/MedQA-USMLE-4-options | Pipeline + Groq MCQ picker | 62.0% | | 100 |
| Bone Fracture | Hemg/bone-fracture-detection | ImageAgent ViT (bone) | 80.0% | F1: 0.889 · Precision: 1.00 | 100 |
| HAM10000 Skin Lesion | marmal88/skin_cancer · 7 classes | ImageAgent ViT-Large | 66.0% | Macro-F1: 0.11 ⚠️ | 100 |
| Brain Tumor MRI | AIOmarRehan/brain_tumor_mri_dataset · 4 classes | ImageAgent ViT (brain) | 39.0% | Macro-F1: 0.29 | 100 |
| ODIR-5K Retinal | bumbledeep/odir · 8 ODIR classes | ImageAgent DINOv2 | 36.0% | ODIR score: 0.117 | 100 |
| PAPILA Glaucoma | imlab-uiip/papila (unavailable) | ImageAgent DINOv2 | N/A | Dataset not accessible | |
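
The Bone Fracture row reports F1 and precision but not recall. If the F1 shown is the usual binary F1 on the positive class (an assumption; the page does not say), recall can be recovered by inverting F1 = 2PR / (P + R). A minimal sketch, with a hypothetical helper name:

```python
def recall_from_f1(f1: float, precision: float) -> float:
    """Invert F1 = 2PR / (P + R) to solve for recall R."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this F1/precision pair")
    return f1 * precision / denominator


# Values taken from the Bone Fracture row of the results table.
bone_recall = recall_from_f1(f1=0.889, precision=1.00)
print(f"implied recall: {bone_recall:.2f}")
```

Under that assumption the implied recall is about 0.80: the model raises no false alarms but misses roughly one fracture in five.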
Key Findings

✅ Chest X-ray: 95% — strongest result

Binary classification of pneumonia vs. normal, with only 5 errors in 100 cases. The ViT model is well suited to this task: model and dataset are tightly matched.

Binary task · Balanced dataset

📈 DDxPlus v3: 68% → 69% top-1, 86% → 87% top-3

Re-evaluated on 2026-05-26 with the v3 architecture (HistoryAgent now shifts the BeliefState; 41-marker lab registry). The +1pp gain comes from History prior shifts on cardiovascular cases. This run used the keyword ClinicalReasoner; enabling the Groq LLM is projected to add a further 15–25pp.

v3 architecture · 25.3s · 0.25s/case
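
The top-1 and top-3 figures above can be read as top-k accuracy over a ranked differential. A minimal sketch, assuming each case yields a ranked list of diagnosis names and one ground-truth label; the function name and the toy data are hypothetical and are not taken from the pipeline:

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of cases whose true label appears in the first k predictions."""
    if len(ranked) != len(truth):
        raise ValueError("ranked and truth must have the same length")
    if not truth:
        raise ValueError("need at least one case")
    hits = sum(1 for preds, label in zip(ranked, truth) if label in preds[:k])
    return hits / len(truth)


# Hypothetical toy data: three cases, each with a ranked differential.
ranked = [
    ["pneumonia", "bronchitis", "asthma"],
    ["gerd", "angina", "pericarditis"],
    ["anemia", "influenza", "sepsis"],
]
truth = ["pneumonia", "angina", "tuberculosis"]

top1 = top_k_accuracy(ranked, truth, k=1)  # only the first case is a top-1 hit
top3 = top_k_accuracy(ranked, truth, k=3)  # the second case is recovered at rank 2
```

Top-3 can never be lower than top-1 on the same cases, which is why the two numbers are reported together.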

📈 MedQA fixed: 0% → 62% with Groq MCQ picker

The previous 0% was caused by a format mismatch: the pipeline outputs disease names, but MedQA answer options are often drugs or mechanisms. The fix adds a Groq step that selects the best answer letter given the differential. 62% is competitive with early GPT-3-class baselines on this benchmark.

Fixed format mismatch · llama-3.1-8b-instant
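
The answer-letter step described above might look roughly like the following. This is a sketch under stated assumptions, not the app's actual code: `build_mcq_prompt` and `parse_answer_letter` are hypothetical names, the prompt wording is invented for illustration, and the call to the language model is deliberately left out.

```python
import re
from typing import Mapping, Optional, Sequence


def build_mcq_prompt(question: str, options: Mapping[str, str], differential: Sequence[str]) -> str:
    """Assemble a prompt asking for one answer letter, given the differential."""
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    ddx = ", ".join(differential) if differential else "none"
    return (
        f"Question: {question}\n"
        f"Options:\n{option_lines}\n"
        f"Pipeline differential: {ddx}\n"
        "Reply with the single best option letter."
    )


def parse_answer_letter(reply: str, valid: Sequence[str] = ("A", "B", "C", "D")) -> Optional[str]:
    """Return the first standalone uppercase option letter in the reply, or None.

    Limitation: a reply that opens with the article "A" would be read as option A.
    """
    pattern = r"\b([" + re.escape("".join(valid)) + r"])\b"
    match = re.search(pattern, reply)
    return match.group(1) if match else None


# Hypothetical model reply; in the real pipeline this would come from the LLM.
letter = parse_answer_letter("Answer: C")
```

Scoring on the parsed letter rather than on disease names is what removes the format mismatch: the benchmark only checks which option was chosen.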

⚠️ HAM10000: 66% top-1 but macro-F1 = 0.11 (class collapse)

The skin ViT predicts "melanocytic nevi" for all 100 cases. Because nevi make up ~66% of the sample, always predicting the majority class yields 66% accuracy. Critically, melanoma and BCC have zero recall, which is a clinical safety concern. The model needs class-weighted retraining.

Class collapse · Zero melanoma recall
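
The gap between 66% accuracy and a macro-F1 of 0.11 follows directly from the collapse. A short worked check, assuming all 7 classes appear in the sample so that each minority class contributes an F1 of 0 to the average (the helper name is hypothetical):

```python
def collapsed_macro_f1(majority_share: float, n_classes: int) -> float:
    """Macro-F1 when every prediction is the majority class.

    Assumes every class appears in the sample, so each minority class
    has zero recall and contributes F1 = 0 to the average.
    """
    precision = majority_share  # share of the all-majority predictions that are right
    recall = 1.0                # every true majority case is predicted as majority
    majority_f1 = 2 * precision * recall / (precision + recall)
    return majority_f1 / n_classes


# 66% nevi, 7 classes, as reported for the HAM10000 run.
macro_f1 = collapsed_macro_f1(majority_share=0.66, n_classes=7)
print(f"macro-F1 under collapse: {macro_f1:.2f}")
```

The majority class alone scores an F1 of about 0.80, but averaging it with six zeros gives roughly 0.11, matching the reported figure.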

⚠️ Brain Tumor: 39% — domain routing failure

The pixel-level heuristic misroutes many MRI scans to the "chest" domain, and those scans are then predicted as "no tumor". The brain ViT itself performs well when correctly routed. Proposed fix: train a learned domain classifier instead of using pixel heuristics.

Domain routing bug · Model not at fault
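
One way to back up the claim that the router, not the model, is at fault is to score routing and classification separately. A minimal sketch, assuming each evaluated case records the expected domain, the routed domain, and whether the final prediction was correct; the `CaseResult` type and the toy data are hypothetical and do not reproduce the benchmark:

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CaseResult:
    expected_domain: str  # domain the image truly belongs to
    routed_domain: str    # domain chosen by the router
    correct: bool         # whether the final prediction matched the label


def routing_breakdown(results: Iterable[CaseResult]) -> Dict[str, Optional[float]]:
    """Split end-to-end accuracy into routing accuracy and accuracy when routed correctly."""
    cases = list(results)
    if not cases:
        raise ValueError("need at least one case")
    routed_ok = [c for c in cases if c.routed_domain == c.expected_domain]
    return {
        "routing_accuracy": len(routed_ok) / len(cases),
        "end_to_end_accuracy": sum(c.correct for c in cases) / len(cases),
        "accuracy_when_routed_correctly": (
            sum(c.correct for c in routed_ok) / len(routed_ok) if routed_ok else None
        ),
    }


# Hypothetical toy data: two scans routed correctly, two misrouted to "chest".
toy = [
    CaseResult("brain", "brain", True),
    CaseResult("brain", "brain", True),
    CaseResult("brain", "chest", False),
    CaseResult("brain", "chest", False),
]
breakdown = routing_breakdown(toy)
```

A low end-to-end accuracy combined with a high accuracy on correctly routed cases points at the router as the component to fix.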

🔬 ODIR: improved 3× with correct model (11% → 36%)

The previous eval used a degenerate model that predicted "other" for every image. Switching to Isaskar/dinov2-base-ODIR-5K raised top-1 to 36% and the ODIR composite score to 0.117. Recall is strong on Diabetes (77%) and Normal (57%). Assessing the minority classes requires a balanced sample.

3× improvement · DINOv2-base
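
Both problems mentioned above, a model that emits one label for every image and uneven recall across classes, can be checked with a few lines. A minimal sketch using only the standard library; the function names and the toy labels are hypothetical:

```python
from collections import Counter
from typing import Dict, Sequence


def is_degenerate(y_pred: Sequence[str]) -> bool:
    """True when the model emits at most one distinct label."""
    return len(set(y_pred)) <= 1


def per_class_recall(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Recall for every class that appears in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


# Hypothetical toy labels, not the ODIR evaluation data.
y_true = ["diabetes", "diabetes", "normal", "cataract"]
y_pred = ["diabetes", "normal", "normal", "normal"]

recalls = per_class_recall(y_true, y_pred)
collapsed = is_degenerate(["other", "other", "other"])
```

Reporting per-class recall next to top-1 makes a collapse visible immediately, since every class except the predicted one drops to zero.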

⚕️ For research and educational use only. Not validated for clinical deployment.