

Benchmarking Medical Search: How Valyu Powers Reliable AI Reasoning

Today, we’re releasing benchmark results showing Valyu’s state-of-the-art performance on medical retrieval. Valyu’s Search API ranked highest across three core benchmarks: MEDAGENTS-BENCH, Clinical Trials, and Drug Labels. Together, these benchmarks rigorously test whether a search API can find the right evidence for clinical reasoning, surface structured data from trial registries, and extract precise information from regulatory documentation.

These benchmarks expose where retrieval fails in practice. Most systems treat medical documents as generic text blocks, ignoring how meaning depends on structure, qualifiers, and clinical context. Pull the wrong snippet, miss a dosage adjustment, or skip a contraindication, and the output cascades into error. Valyu solves this at the retrieval layer, returning only what is accurate, relevant, and safe to reason over.

The Benchmarks

Our medical evaluation spans three benchmarks, each designed to test whether a system can retrieve accurate, high-stakes information in clinical and regulatory contexts.

  1. MEDAGENTS-BENCH:
    Multi-step clinical reasoning tasks that combine patient context, symptoms, and medical knowledge to determine the correct diagnosis or treatment. Performance here reflects how well a system supports real-world medical decision-making.
  2. Clinical Trials:
    Structured evidence retrieval across outcomes, eligibility criteria, and study design from trial registries like ClinicalTrials.gov. These questions test whether a system can surface timely and relevant trial data to support evidence-based reasoning.
  3. Drug Labels:
    Clause-level lookups targeting dosage, contraindications, and administration based on patient profiles and conditions. This benchmark evaluates whether a system can extract precise regulatory information in safety-critical workflows.

Results

Figure: MedAgent benchmark results. Valyu achieved 48% accuracy, Google 45%, Exa 44%, and Parallel 42% on complex medical queries.


Valyu achieved the highest overall accuracy on MEDAGENTS-BENCH, correctly answering 40.2% of the benchmark’s 482 complex clinical questions, outperforming Google (38.8%), Exa (37.1%), and Parallel (34.4%).

Why This Matters

These benchmark results show that Valyu returns the exact information your models need without extra filtering, post-processing, or patching together multiple APIs. It supports a wide range of medical tasks, from pulling the right clause in a drug label to surfacing trial outcomes or handling multi-hop clinical reasoning. If you’re building agents or workflows in healthcare, this lets you move faster, avoid silent failures, and ship with confidence.
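To make the call pattern concrete, here is a minimal sketch of querying a search API and keeping only high-relevance evidence for the model to reason over. This is illustrative only: the endpoint URL, the `x-api-key` header, and the field names (`max_num_results`, `relevance_threshold`, `relevance_score`) are assumptions, not the documented Valyu API shape — check the docs for the real request and response format.

```python
import json
import urllib.request

# Hypothetical endpoint — see the Valyu docs for the actual URL and auth scheme.
VALYU_SEARCH_URL = "https://api.valyu.network/v1/search"

def build_search_payload(query, max_results=5, min_relevance=0.5):
    """Assemble the JSON request body (field names here are assumptions)."""
    return {
        "query": query,
        "max_num_results": max_results,
        "relevance_threshold": min_relevance,
    }

def filter_results(results, min_relevance=0.5):
    """Client-side guard: keep only results at or above the relevance cutoff,
    strongest first, so the model reasons over high-confidence evidence only."""
    kept = [r for r in results if r.get("relevance_score", 0.0) >= min_relevance]
    return sorted(kept, key=lambda r: r["relevance_score"], reverse=True)

def search(query, api_key):
    """POST the query to the (assumed) search endpoint and return filtered hits."""
    req = urllib.request.Request(
        VALYU_SEARCH_URL,
        data=json.dumps(build_search_payload(query)).encode(),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = json.load(resp)
    return filter_results(body.get("results", []))
```

In an agent loop, a call like `search("warfarin contraindications in renal impairment", api_key)` would feed only the filtered snippets into the model's context, which is the pattern the drug-label benchmark stresses.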

Try it yourself

We’re building search for a future where every AI interaction is grounded in the best and most up-to-date knowledge.

Don’t just trust us — test us.

🔑 Get your API key
📚 Read the docs
🧠 Use with LangChain