Appendix B — Grant Proposal: Benchmarking AI Research Evaluation

Coefficient Giving: AI for Forecasting and Sound Reasoning

This proposal is prepared for Coefficient Giving’s RFP on “AI for Forecasting and Sound Reasoning”. We have linked this in our application form.

For detailed analysis, project discussion, and very-preliminary results, see the Introduction and subsequent chapters of this public workspace.

Summary

We consider how frontier LLMs critique, rate, and prioritize global-priorities-relevant economics and policy research, benchmarking and comparing their outputs against human work.1

We will leverage The Unjournal’s unique assets and curate these as a resource for benchmarking AI research evaluation and reasoning.2

This includes

  • 100+ evaluations from commissioned and compensated global experts, including reports, structured quantitative ratings, uncertainty intervals, and identified consensus critiques3

  • 250+ research papers prioritized for impact-potential by internal Field Specialists

  • An ongoing/future pipeline of prioritization and evaluation, enabling out-of-time validation and hybrid human-AI trials, and providing a natural source of human enumerators.

While LLM tools are being built and used for peer-review (often without permission), feedback, grant-prioritization, and scientific discovery, we lack detailed evidence comparing their reasoning to that of human experts. This project aims to understand AI’s potential for evaluating human-led research, as well as for its potentially transformative role in scientific discovery.4 We seek evidence on AI’s reasoning about research: where does it excel, what are its limitations, and what drives the differences from humans?

Core outputs:

  • Public database and curated resources for benchmarking and testing5

  • Quantitative measures of AI calibration, consistency, and alignment with human expert judgment

  • Taxonomy of reasoning failure modes and success patterns

  • Analysis of systematic “research taste” differences between AI and human evaluators

  • Practical guidance on the uses and limitations of AI for peer review and research prioritization

Relevance and fit with RFP criteria

Principles Relevant to Sound Reasoning

For each RFP criterion, our approaches:

Truthfulness

  • Detection: Measure LLM detection of ~confirmed misleading research claims identified by human evaluators. Human enumerators identify ‘false positive’ critiques and misleading and inconsistent statements in AI output.

  • Informativeness: Human ratings of AI critiques; performance of human-AI hybrid evaluators.

Meta-reasoning

  • Calibration: AI evaluators make predictions, e.g., ‘What journal tier will this paper be published in?’, ‘What will be the mean human-expert overall rating?’, and state 90% credible intervals. We can consider the plausibility of stated credible intervals for percentile-ranking measures, and compare them to actual outcomes where available (e.g., publication journal tiers, the human ratings being predicted). We can measure and compare calibration (and sharpness) across models, questions, and framings.

  • Explanation: For example, compare rating justifications across different framings/extraneous features (e.g., author/institution prestige).

Consistency

  • Framing: Test sensitivity of ratings to prompt wording, framing, and non-content features.

  • Sycophancy: Compare ratings across framings, e.g., evaluating “my paper” vs. “generate a report and ratings that will be similar to a human peer reviewer of this paper”.

Navigating debates

  • Indirectly: Our evaluation guidelines ask for ratings of “reasoning transparency” and claim characterization, indirectly engaging with argument analysis.
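As a concrete illustration of the calibration and framing checks above, the sketch below computes the empirical coverage of stated 90% credible intervals against realized outcomes and compares mean ratings across prompt framings. This is a minimal sketch under assumed data: the column names (ci_low, ci_high, outcome, framing) and all values are illustrative, not our actual pipeline or schema.

```python
import pandas as pd

# Illustrative data: one row per (paper, framing), with a stated 90% credible
# interval for a percentile-ranking prediction and, where available, the
# realized outcome (e.g., the mean human-expert rating).
preds = pd.DataFrame({
    "paper_id": ["p1", "p1", "p2", "p2"],
    "framing":  ["neutral", "my_paper", "neutral", "my_paper"],
    "rating":   [72, 80, 55, 61],
    "ci_low":   [60, 70, 40, 50],
    "ci_high":  [85, 90, 70, 75],
    "outcome":  [78, 78, 45, 45],
})

# Calibration: share of realized outcomes falling inside the stated 90% CI.
# Well-calibrated intervals should cover roughly 90% of outcomes.
covered = (preds["outcome"] >= preds["ci_low"]) & (preds["outcome"] <= preds["ci_high"])
coverage = covered.mean()

# Sharpness: narrower intervals are more informative at a given coverage level.
sharpness = (preds["ci_high"] - preds["ci_low"]).mean()

# Framing sensitivity / sycophancy check: compare mean ratings across framings.
by_framing = preds.groupby("framing")["rating"].mean()

print(f"90% CI empirical coverage: {coverage:.2f}")
print(f"Mean CI width (sharpness): {sharpness:.1f}")
print(by_framing)
```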

What We Measure (examples)

  • Agreement: How closely do LLM ratings correlate with human expert ratings across dimensions? See preliminary evidence →
  • Calibration: Are stated confidence intervals plausible and well-calibrated against outcomes?
  • Coverage: Does the LLM identify the methodological issues flagged by human evaluators? See critique comparison → (and the ‘illustrative cases’ below)
  • Precision: How many LLM-identified issues are valid vs. false positives?
  • Bias: Do ratings vary based on author names, institutions, or other non-content features?
  • Taste: Where do AI and human evaluators systematically diverge?
    • What types of critiques do each tend to identify and emphasize?7
    • How do the critiques, text content, and multi-dimensional ratings differently predict overall assessments?
    • What features do each prioritize in considering global-priorities relevance and impact potential?
  • Relative performance across models, prompts, and contexts. See PoC rating distribution by model →
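A minimal sketch of how the agreement, coverage, and precision measures above could be computed from paired human and LLM outputs; the data frames, column names, and counts are illustrative assumptions rather than our actual schema or results.

```python
import pandas as pd

# Illustrative paired ratings: one row per (paper, rating dimension), with the
# mean human-expert rating and the LLM rating on the same percentile scale.
ratings = pd.DataFrame({
    "paper_id":  ["p1", "p1", "p2", "p2", "p3", "p3"],
    "dimension": ["methods", "overall", "methods", "overall", "methods", "overall"],
    "human":     [62, 70, 45, 50, 80, 85],
    "llm":       [70, 75, 55, 65, 78, 88],
})

# Agreement: rank correlation between human and LLM ratings, per dimension.
agreement = ratings.groupby("dimension").apply(
    lambda g: g["human"].corr(g["llm"], method="spearman")
)

# Coverage (recall) and precision of LLM critiques, given adjudicated labels:
#   n_consensus = human consensus critiques for a paper
#   n_matched   = consensus critiques the LLM also raised, at least partially
#   n_llm       = issues raised by the LLM; n_valid = those judged valid
adjudicated = pd.DataFrame({
    "paper_id":    ["p1", "p2", "p3"],
    "n_consensus": [3, 8, 5],
    "n_matched":   [3, 7, 2],
    "n_llm":       [10, 12, 9],
    "n_valid":     [7, 9, 5],
})
coverage = adjudicated["n_matched"].sum() / adjudicated["n_consensus"].sum()
precision = adjudicated["n_valid"].sum() / adjudicated["n_llm"].sum()

print(agreement)
print(f"Critique coverage (recall): {coverage:.2f}, precision: {precision:.2f}")
```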

Limited scale, no general capabilities development

This project focuses on measurement, diagnosis, and insight, and on providing a useful resource for benchmarking and understanding AI reasoning about research. The scale is modest and specific. We are not providing general model fine-tuning as a deliverable: we use existing models and standard prompting (see pipeline documentation →), and (in stage 2) only minimal training/tuning on previous human evaluations. Our work focuses on evaluation, tooling, and human experimentation rather than capabilities R&D.

Preliminary/Pilot Results

Preliminary and illustrative

The results below (and in the other chapters in this document) are from early pilot work and have not yet been carefully vetted. They are meant to illustrate our approach, show viability, and illustrate the types of comparisons we will make. Our methods, prompts, and analysis will improve substantially throughout the project.8

We have completed initial benchmarking with 55 papers (each with 2+ human evaluations) using GPT-5.2 Pro.

Early indicative findings

Rating agreement: Moderate positive correlation between LLM and human ratings (r ≈ 0.4-0.6 depending on dimension), with substantial variation across papers and metrics. See detailed analysis →

Calibration: Non-trivial credible intervals. (More detailed analysis is needed)

Critique coverage (very preliminary): The LLM identifies some but not all human-flagged issues (and raises a substantial number of additional issues that would need vetting). A very preliminary comparison suggests coverage varies by issue type; methodological concerns seem to be captured better than domain-specific errors. However, this is based on GPT-5 Pro’s own comparison and diagnosis of the correspondence between human and LLM critiques; it needs further human vetting and labeling. See the ‘illustrative cases’ below (more broadly, see critique comparison →).

(Very preliminary comparison →)

  • Good apparent alignment (Clancy, 2024): GPT-5 Pro claims the LLM achieved 100% coverage—identifying all 3 prioritized “consensus critiques”, at least partially. Both seem to raise some concerns about neglecting AI risk while considering all the benefits of science (H1 vs L3/L9, although these go in somewhat different directions). Both flag the lack of consistent treatment and justification of the discount rate. Neither is convinced by the privileging of superforecaster estimates over domain experts.

  • Decent alignment (Dullaghan and Zhang, 2022): GPT-5 Pro claims the LLM achieved 88% coverage—identifying 7 of the 8 “consensus critiques”, at least partially.

    • Several important critiques appear very similar between the human and LLM evaluations. E.g., these seem to align over the limitations of the tiny sample of expert forecasters, and the consequently overstated claims (H1/L1). Both expressed concerns over the interpretation of forecasts about conditional claims (H3/L9).

    • Still, a closer inspection reveals the overall agreement may be overstated. E.g., for “H2: Material input-cost framing errors may bias forecasts” the human evaluation identifies an entirely different unit conversion error, while the LLM (L4) refers to the wording of the amino-acids input-cost question, an error already acknowledged by the authors. Nonetheless, GPT-5 Pro reports 90% agreement on this critique. This suggests further human inspection is needed, and/or we need to improve this diagnostic tool.

    • As GPT notes, at least one critique/suggestion was clearly missed by the LLM (H7, about the design not allowing forecaster discussion and update).

  • Middling alignment (Peterman et al. 2025): For a child development meta-analysis, GPT claims the LLM achieved only 40% coverage. Human experts focused on domain-specific issues (ASQ-3 measurement validity in India, CONSORT reporting, clinical cutoffs), while the LLM seemed to focus more on meta-analysis methodology (heterogeneity, RVE implementation). Differences like these would suggest a divergence in taste and prioritization.

  • Rating divergence despite apparently strong critique overlap (Williams et al. 2024): Human experts rated this forest regeneration paper at the 50th percentile overall while GPT-5 Pro rated it at the 87th—a 37-point gap. Still, GPT reports 88% critique coverage. This discrepancy may reflect the relative weights assigned to each issue. More prosaically, it might reflect GPT’s misdiagnosis of the critique overlap. While some diagnoses seem correct,9 the very important H1 (“temporal leakage / contemporaneous predictors in training”) does not seem to be captured by the noted LLM critiques (L7 and L8), which seem more about generalizability. Yet GPT reports this as a 75% match.

As noted, this is all very preliminary. Our project aims to characterize these patterns (domain expertise gaps, different methodological emphasis, severity weighting differences, etc.) more systematically and reliably, and at a larger scale.

Methods documentation: Full pipeline, prompts, and schema documented for reproducibility. See methods →
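For concreteness, a critique-match record in such a pipeline might look like the sketch below. The field names, types, and example values are illustrative assumptions that loosely mirror the H#/L# labels used in the cases above; they are not our documented schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CritiqueMatch:
    """One adjudicated link between a human 'consensus critique' and LLM critiques."""
    paper_id: str
    human_critique_id: str                           # e.g., "H1"
    llm_critique_ids: List[str] = field(default_factory=list)  # e.g., ["L7", "L8"]; empty if missed
    llm_match_score: float = 0.0                     # model-reported overlap (0-1), e.g., GPT-5 Pro's own diagnosis
    enumerator_match_score: Optional[float] = None   # human adjudication, once labeled
    notes: str = ""

# Illustrative record for the Williams et al. (2024) case discussed above:
# GPT reported a 75% match, but the LLM critiques appear to concern
# generalizability rather than temporal leakage, pending human labeling.
example = CritiqueMatch(
    paper_id="williams_2024",
    human_critique_id="H1",
    llm_critique_ids=["L7", "L8"],
    llm_match_score=0.75,
    enumerator_match_score=None,
    notes="LLM critiques seem to address generalizability, not temporal leakage.",
)
```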

Dataset

| Component | Current | Medium Scenario | Large Scenario |
|---|---|---|---|
| Evaluation packages | 55+ | 105+ | 135+ |
| Individual expert evaluations | 100+ | 200+ | 270+ |
| Papers with consensus critiques identified | 14 (single enumerator) | 50+ | 80+ |
| Models compared | 2 | 3+ | 5+ |
| Papers with enumerator adjudication | 0 | 40 | 60 |

Team

Core team:

  • Valentin Klotzbücher (co-PI): Economist and statistician (University of Basel). Applied econometrics, causal inference, and reproducible pipelines.
  • David Reinstein (co-PI): Economist, founder/director of The Unjournal. 25+ years research experience (UC Berkeley, Essex, Exeter, Rethink Priorities). Track record of building evaluation infrastructure.
  • Tianmai Michael Zhang (Technical collaborator, co-author): PhD student, University of Washington. Co-author of “Reviewing Scientific Papers for Critical Problems With Reasoning LLMs.”
  • Lorenzo Pacchiardi (Advisor, co-author): Research Associate, Leverhulme Centre for the Future of Intelligence, Cambridge. Leads Open Philanthropy–funded project on benchmarking LLM data science tasks.

Additional resources: Unjournal’s management and field specialist teams, operations staff, and evaluator pool (200+ credentialed researchers).

Budget

The interactive budget model below reflects our detailed cost breakdown.10

Cost categories: Core team time (co-PIs),11 payments for human research evaluators12 (extending The Unjournal’s pipeline), payments for enumerators/experts to label and adjudicate LLM outputs, incentives for calibration testing,13 pipeline and tooling development,14 and compute/API costs.15

Project Timeline

Key research principles:

  • Pre-registration early: We specify and register a general pre-analysis plan in months 1-3 to constrain researcher degrees of freedom
  • Continuous human evaluations: New Unjournal evaluation packages arrive throughout the project, feeding both the benchmark dataset and predictive modeling
  • Living research: Data and results are openly available and updated continuously; there is no “final dataset,” only versioned releases

Who Will Use This

Research-informing organizations: Program officers at philanthropic foundations and research evaluation organizations regularly assess research quality when making funding and prioritization decisions, and are likely to increasingly use frontier reasoning models and bespoke research evaluation tools for this. Coefficient Giving, GiveWell, Founders Pledge and similar organizations could use our calibration data to understand when AI-assisted research triage is reliable and where human expertise remains essential.

Journal and preprint review initiatives: Editors facing reviewer shortages need to know which aspects of evaluation AI can assist with. Our failure-mode taxonomy helps them deploy AI appropriately—perhaps for methodological checklists but not domain-specific validity.

The Unjournal’s own pipeline: We are a test case. If AI evaluation proves useful, we integrate it into our workflow. If not, we document why. Either outcome is valuable.

AI safety and governance researchers: Our curated benchmark provides ground-truth data for evaluating AI reasoning about complex, real-world domains—complementing existing datasets that skew toward CS/NLP.6

Deliverables

  1. Living benchmark dataset: Continuously updated paper prioritization + human evaluations + LLM outputs + adjudication scores, openly available
  2. Public dashboard: Continuously-updated website where new models can be compared against evolving benchmarks
  3. Open-source codebase: Pipelines for running AI evaluations and computing metrics
  4. Failure-mode taxonomy: Documented patterns of AI reasoning failures
  5. “Taste/alignment” analysis: Quantitative comparison of human vs. AI research preferences
  6. Academic paper(s): Targeting venues like Nature Human Behaviour, Science, or field-relevant journals, as well as journal-independent evaluation initiatives
  7. Practical guidance and general writeups: ‘Playbook’ for using AI peer review, understanding its strengths and limitations; forum-style posts/reports on AI research judgement

Success Metrics

Benchmark quality:

  • 80+ papers with complete paired human/LLM evaluations; generated LLM evaluations externally rated as ‘high-quality’ relative to other cutting-edge tools
  • Further pipeline of Unjournal (human team) research prioritization deemed high-value by stakeholders, with a clear emphasis on GCR-relevant research
  • Further high-value human evaluations (gaining engagement, judged as high-quality by authors and external experts and stakeholders)
  • Inter-rater agreement over the human “consensus critiques”; critiques clearly identified and explained
  • Reliable/consistent critique-matching across enumerators/LLM enumeration (i.e., high precision and recall in correctly diagnosing ‘did the LLM identify the same critique as human evaluators?’); see the sketch after this list
  • 3+ SOTA models systematically compared, along with 1+ bespoke tools
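As one way to operationalize the inter-rater and critique-matching metrics above, the sketch below computes Cohen’s kappa between two enumerators’ binary “did the LLM capture this human critique?” labels. The labels shown are illustrative, not real adjudication data.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative binary labels from two enumerators over the same set of human
# consensus critiques: 1 = "the LLM raised this critique (at least partially)",
# 0 = "the LLM missed it".
enumerator_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
enumerator_b = [1, 1, 0, 1, 1, 1, 0, 0, 1, 0]

kappa = cohen_kappa_score(enumerator_a, enumerator_b)
print(f"Cohen's kappa between enumerators: {kappa:.2f}")
```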

Informative results:

  • Pre-registered quantitative metrics with adequate statistical power/strong Bayesian inference
  • Chosen metrics of LLM-human concordance are easily interpretable and seen as valid and reliable
  • At least one high-quality ‘journal-tier-worthy’ research output; actual ‘publication outcomes’ or journal-independent credible evaluation/review
  • Clear and powerful evidence of specific AI failure modes (or their absence)
  • Intuitive and interesting insights into AI reasoning about research

Practical uptake:

  • Unjournal team adapts tools into evaluation pipeline, or finds the results credible for providing guidance on the limitations of AI use16
  • 2+ external organizations (GCR funders, journals/evaluation initiatives, research users) engage with research outputs
  • Curated data and benchmarks used by other researchers, especially in AI safety evaluation
  • We report findings transparently regardless of whether LLM tools prove practically useful, documenting how they differ from human judgment and what this implies for AI reasoning

Risk Mitigation

  • LLM evaluations differ substantially from human assessments: This is informative, not a failure. We document how and why AI evaluations diverge, with implications for using them as practical tools and for understanding LLM reasoning capacities.
  • Human evaluations are noisy: We use adjudication and curated subsets where multiple evaluators agree; statistical modeling accounts for rater variance.
  • Engineering bottlenecks: Modular pipelines, flexible staffing, and leveraging existing tools; this document provides a PoC.
  • Low external uptake: We’re in contact with researchers/leaders at organizations such as Open Philanthropy, Founders Pledge, Rethink Priorities, GiveWell, ALTER, and The AI Futures project, who have expressed interest. The Unjournal is actively engaged and partnering with a cluster of research evaluation initiatives (e.g., open science initiatives and ‘preprint review’ projects) and has also reached out to academic journal editors (Economic Journal, Journal of Development Economics) for collaboration.

Cost Context

Human research evaluation is expensive. The Unjournal’s current cost is approximately $1,700 per evaluation package (2 evaluators at $350-450 each, plus evaluation management, author incentives, and overhead). If LLM evaluation proved reliable enough to replace some human effort while keeping human prioritization and oversight, costs might drop to ~$400 per Unjournal package (including actual evaluation compute costs of only a few dollars). We expect similar or greater cost reduction for more traditional journal processes.17 The key question this project addresses is whether and when such replacement is appropriate, and what the implications of it would be.

Why Now?

LLMs are now sophisticated enough to plausibly conduct research evaluation, as well as play a major role in producing research, but the window to understand their failure modes before widespread adoption is closing. The Unjournal’s structured evaluations provide a unique ground truth for benchmarking AI reasoning in economics and policy research; no comparable dataset exists. To the extent AI research evaluation and research reasoning becomes reliable, it could reduce costs and accelerate research and evidence synthesis for time-sensitive decisions. If it is unreliable in particular ways and exhibits systematic biases and priorities, this is decision-relevant.

Contact and Resources


  1. We believe the economics/social science context provides a particularly nuanced setting to measure and understand AI tendencies and capabilities, given its lack of methodological consensus and its connection to fundamental questions of human values and welfare. We focus on global priorities and policy: high-stakes domains where AI-assisted evaluation could accelerate evidence synthesis for time-sensitive decisions on issues like AI governance.↩︎

  2. This can also be applied to bespoke agent-driven research evaluation and feedback tools like QED science, https://scholarsreview.com/, and https://www.refine.ink/. We are in conversation with Ben Golub on benchmarking the latter.↩︎

  3. Human reviews, even from vetted, incentivized experts, are not a ‘ground truth’ about research quality. To the extent we’re aiming at ‘making research assessment more accurate and improving research, and assessing AI’s capacity’, this is a limitation. We address this somewhat by focusing on critiques raised by multiple evaluators, particularly where the authors agreed and where we, the managers, agreed. We also aim to measure the ways AI evaluations align with and deviate from those of expert human researchers. We are confident that Unjournal evaluations represent mainstream research expertise. They often come from world-class experts; our evaluators generally have strong credentials and have usually authored high-status peer-reviewed work related to the papers evaluated. Where evaluators leave their names this can be looked up; for anonymous evaluations we ask them to state their years of experience in the field and the number of papers they have reviewed/evaluated.↩︎

  4. “The leaders of the biggest A.I. labs argue that artificial intelligence will usher in a new era of scientific discovery, which will help us cure diseases and accelerate our ability to address the climate crisis” – e.g., Sam Altman on cures for cancer, etc. Recent European Union guidelines argue that “AI has great potential for accelerating scientific discovery and improving the effectiveness and pace of research and verification processes” (Eger et al. (2025)).↩︎

  5. This will include Unjournal evaluation and rating content paired with LLM evaluations, and enumerator evaluations of these evaluations. We also build sets of human-validated ‘consensus critiques’. We will also set up processes for the future Unjournal pipeline to be similarly publicly curated for external users.↩︎

  6. See Eger et al. (2025) and Dycke, Kuznetsov, and Gurevych (2023) for discussion of how existing peer-review datasets skew toward computer science.↩︎

  7. E.g., human reviewers may tend to offer more (correct and incorrect) critiques over  the realism of assumptions and generalizability outside of the context of the research data, while AI may focus more on internal inconsistencies in the research.↩︎

  8. As noted, subject to pre-registration.↩︎

  9. E.g., for “H3: Socioeconomic confounding”, the human reports “the difficulty lies in the fact that biophysical and socioeconomic conditions are deeply interconnected”, while the LLM reports (L9) “dropping socioeconomic covariates causes omitted-variable bias, with biophysical variables acting as proxies for governance/pressure”, which seems closely related.

    H7 mentions the inclusion of “already-regrown areas”, while L2 notes “systematic false negatives”; again this seems to be a very similar point.↩︎

  10. The grant application form records absolute minimum ($40,000) and maximum ($700,000) budget figures. The scenarios below represent our realistic operating range.↩︎

  11. This secures dedicated research time from the PIs, moving from pro-bono/patchwork contributions to focused execution.↩︎

  12. High-quality human evaluation data is the scarce resource here—we need to pay domain experts to evaluate new papers and to provide the “ground truth” benchmark.↩︎

  13. Crowd/prediction market incentives ($10–$20k) support calibration validation by incentivizing forecasts with verifiable outcomes: (1) paying expert forecasters on platforms like Metaculus to predict journal-tier placements and citation outcomes for benchmark papers, creating ground-truth comparison points for LLM calibration; (2) incentivizing human evaluators to make scored predictions using proper scoring rules (e.g., “What will the other evaluator rate this paper?”); and (3) piloting prediction elicitation around Pivotal Questions claims. This directly supports the RFP’s emphasis on meta-reasoning—e.g., we can compare the calibration of LLM CIs to incentivized human forecasters.↩︎

  14. This could include funding technical contractors to build the pipeline and API integrations, and to build/adapt “RoastMyPost”-style feedback agents. However, as AI-powered coding tools develop (e.g., Claude Code), these costs may be reduced.↩︎

  15. Covers extensive multi-shot testing and long-context inference on frontier models. These costs are modest relative to human evaluation costs—see token usage analysis.↩︎

  16. Of course this measure is not a neutral one, as we are affiliated with The Unjournal. Nonetheless, outsiders might be able to judge the apparent credibility and success of this ~internal takeup.↩︎

  17. These require similar human peer-review time per journal, roughly a full day’s work per peer review in the relevant domains, but a research paper often goes through multiple rounds of review and rejection across multiple journals before finding a “home”. These journals do not engage in “prioritization” in the same way The Unjournal does, nor do they provide much editorial discussion and synthesis. However, journal editors sometimes do encourage submissions of promising research, and they engage in “editorial triage”, “desk-rejecting” a substantial share of submitted research based on assessments of fit and quality. We expect they are likely to increasingly employ AI tools for the latter.↩︎