
Valyu DeepResearch sets a new state-of-the-art on ScholarQA

>_ Alexander Ng

Last updated: March 2026

Valyu DeepResearch (fast mode) is now #1 on ScholarQA, the leading benchmark for AI-generated scientific literature reviews (published in Nature, February 2026). We beat the previous state-of-the-art, OpenScholar, by 29% on answer quality, and outperform Perplexity and PaperQA2 across 218 expert-annotated questions in computer science, biology, and physics.

ScholarQA is the leading benchmark for evaluating how well AI systems synthesize scientific literature. Published in Nature in February 2026, it was developed by researchers at the University of Washington, Allen Institute for AI, Meta, and Carnegie Mellon. The benchmark comprises 218 expert-annotated questions across two evaluation suites:

  • ScholarQA-Multi (108 questions) - multi-domain questions spanning CS (NLP, HCI), Biology (bioimaging, genetics), and Physics (astrophysics, photonics, biophysics), evaluated using the Prometheus open-source LLM judge
  • ScholarQA-CS (110 questions) - computer science questions evaluated against expert-written rubric criteria using GPT-4o as judge

Each question comes with detailed rubric criteria written by domain experts, who specify exactly what a good answer should cover - concrete items like "discuss transformer architectures" or "compare pre-training approaches" that define what a comprehensive answer requires. The score reflects how well the system actually answers the question, not how well it games a retrieval pipeline.
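To make the rubric idea concrete, here is a toy sketch of coverage-style scoring. The real benchmark uses LLM judges (Prometheus and GPT-4o) rather than string matching, so this is only an illustration of the concept, with made-up criteria and answer text:

```python
# Toy rubric-coverage score: fraction of expert criteria an answer addresses.
# Illustrative only - ScholarQA's judges are LLMs, not keyword matchers.
def rubric_coverage(answer: str, criteria: list[str]) -> float:
    """Return the fraction of rubric criteria mentioned in the answer."""
    hits = sum(1 for c in criteria if c.lower() in answer.lower())
    return hits / len(criteria)

criteria = ["transformer architectures", "pre-training approaches"]
answer = "We survey transformer architectures and contrast pre-training approaches."

print(rubric_coverage(answer, criteria))  # 1.0 - both criteria addressed
```

An answer that addresses only one of the two criteria would score 0.5; the expert rubrics make "comprehensive" measurable rather than subjective.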

We ran both ScholarQA suites against Valyu DeepResearch using our "fast" mode - our lowest tier, with under five minutes of run time per answer, including the PDF deliverable.

ScholarQA-Multi Results (Prometheus Evaluation, 108 questions)

ScholarQA-Multi spans CS, Biology, and Physics. Evaluation uses the Prometheus model - an open-source LLM judge scoring Organization, Coverage, and Relevance on a 5-point scale.

![ScholarQA-Multi results chart](scholarqa_multi_chart.png)


| System | Avg. Score (out of 5) | Percentage |
| --- | --- | --- |
| Valyu DeepResearch (fast mode) | 4.56 | 91.2% |
| OpenScholar-GPT4o | 4.51 | 90.2% |
| Perplexity Pro | 4.15 | 83.0% |
| OpenScholar-8B | 4.12 | 82.4% |
| GPT-4o | 4.01 | 80.2% |
| PaperQA2 | 3.82 | 76.4% |

ScholarQA-CS Results (Rubric Evaluation, 110 questions)

110 computer science questions, each scored against expert-written rubric criteria by a GPT-4o judge. The quality score reflects how well the system addresses every criterion the domain expert specified.

![ScholarQA-CS results chart](scholarqa_cs_chart.png)

| System | Quality Score | Avg. Citations |
| --- | --- | --- |
| Valyu DeepResearch (fast mode) | 74.5% | 23.3 |
| OpenScholar-GPT4o | 57.7% | ~12 |
| OpenScholar-8B | 51.1% | ~12 |


Valyu DeepResearch is #1 on both ScholarQA benchmarks: 74.5% vs 57.7% on ScholarQA-CS (a 29% relative improvement) and 4.56 vs 4.51 on ScholarQA-Multi - beating OpenScholar, Perplexity, GPT-4o, and PaperQA2.

This is with our fast mode, our lowest-cost tier, which consumes 2x less compute than our standard mode, 20x less than Heavy, and 40x less than Max.


How it works

Three things drive the result:

  1. Structured output from rubric criteria. Each ScholarQA question comes with expert-annotated rubric items. We generate a JSON schema per question where every rubric criterion maps to a required output section. DeepResearch fills in the schema, which means every criterion gets addressed. This is the single biggest driver of the quality improvement.
  2. Deep academic indexing. Valyu has indexed the full text of PubMed, arXiv, bioRxiv, and medRxiv for full-text multimodal retrieval - millions of papers searchable in real-time with structured academic citations (title, authors, DOI, publication venue) attached to every source. OpenScholar's retrieval stack relies on Semantic Scholar keyword search and the You.com API - both limited to abstracts and metadata, and not built for AI-native retrieval. The Valyu API searches across full-text multi-modal indexed academic sources, producing 2x more citations per answer (23.3 vs ~12) drawn from a wider, more current, and deeper range of sources.
  3. One Valyu API call. Building AI for knowledge-work domains means solving two very hard problems: building AI-native search and retrieval infrastructure, and securing access to the specialised data sources your AI requires - often through lengthy partnership negotiations with publishers. Valyu handles both. No retrieval pipeline to build, no paper datastore to host, no embedding index to maintain, no publisher deals to negotiate. A single `POST /deepresearch` with a question and an output schema - so builders can focus on their product, not low-level infrastructure. See the docs.
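The rubric-to-schema idea in step 1 and the single API call in step 3 can be sketched together. The `POST /deepresearch` path comes from the text above, but the field names (`query`, `mode`, `output_schema`), the rubric items, and the overall request shape here are illustrative assumptions, not the documented API - we build the payload without sending it:

```python
import json

# Hypothetical rubric items for one ScholarQA question (illustrative only).
rubric = [
    "discuss transformer architectures",
    "compare pre-training approaches",
]

# Step 1: map each rubric criterion to a required section of a JSON schema,
# so a schema-constrained answer must address every criterion.
schema = {
    "type": "object",
    "required": [f"criterion_{i}" for i in range(len(rubric))],
    "properties": {
        f"criterion_{i}": {"type": "string", "description": crit}
        for i, crit in enumerate(rubric)
    },
}

# Step 3: everything needed for one POST /deepresearch call lives in a
# single payload (field names are assumptions - see the docs).
payload = {
    "query": "How do modern LLM pre-training approaches differ?",
    "mode": "fast",
    "output_schema": schema,
}

print(json.dumps(payload, indent=2))
```

Because every rubric criterion becomes a required schema field, the model cannot return a "valid" answer that silently skips a criterion - which is why the post calls this the single biggest driver of the quality improvement.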


What this means

ScholarQA is the most rigorous public benchmark for evaluating AI-generated scientific literature reviews. The previous state-of-the-art was OpenScholar, a purpose-built academic system with a custom retrieval pipeline over 45 million papers, built by researchers across the University of Washington, Allen Institute for AI, Meta, and Carnegie Mellon.

Valyu DeepResearch beats it by 29% on answer quality.

For teams building AI research tools, copilots, or agents that need to reason across academic literature: this is the quality bar you should be comparing against.

This API is already powering real products. Revision Dojo uses Valyu to bring real academic research directly to 450,000+ students (case study). For a look at what building with Valyu looks like in biomedical research, see Lessons from Building AI Agents for Biomedical Research.

For publishers

Valyu partners with publishers to index and serve content into AI systems with full attribution, usage tracking, and revenue sharing. If you're interested in getting your content into AI the right way, visit valyu.ai/rev-share-partner-programme or reach out at contact@valyu.ai.


Get an API key and get started: https://platform.valyu.ai