Comparing LLM and human reviews of social science research using data from Unjournal.org

Authors
Affiliations
Valentin Klotzbücher

University of Basel & University Hospital Basel

David Reinstein

The Unjournal

Lorenzo Pacchiardi

University of Cambridge, Leverhulme Centre for the Future of Intelligence

Tianmai Michael Zhang

University of Washington

Published

January 20, 2026

Abstract

We benchmark how frontier LLMs critique, rate, and prioritize research, comparing AI-generated evaluations to human expert assessments from The Unjournal. Human ratings serve as a reference signal—the core goal is measuring and diagnosing AI reasoning quality: understanding calibration, failure modes, and systematic “taste” differences between AI and human evaluators.

Introduction

```r
# Include global setup and parameters
source("setup_params.R")
```

Work in progress

Pages, metrics, and comparisons are under active development. Expect rough edges and frequent updates.

For Grant Reviewers

Looking for our funding proposal? Jump directly to the Proposal chapter for project scope, budget, and deliverables.

Is AI good at peer review? Does it offer useful and valid feedback? Can it predict how human experts will rate research across a range of categories? How can it help academics do this “thankless” task better? Is it particularly good at spotting errors? Are there specific categories, e.g., spotting math errors or judging real-world relevance, where it does surprisingly well or poorly? How does its “research taste” compare to that of humans?

If AI research evaluation works, it could free up a lot of scientific resources – perhaps $1.5 billion/year in the US alone (Aczel, Szaszi, and Holcombe 2021) – and offer more continual and detailed review, helping improve research and address the “peer-review crisis”. Numerous accounts report a system under strain, with widespread difficulty finding human peer reviewers with appropriate expertise who are willing to put in careful effort. An influx of AI-generated or AI-human hybrid research seems likely to exacerbate this problem.1 AI research evaluation could also help characterize methodological strengths and weaknesses across papers, aiding research training and research direction-setting.

Furthermore, “the leaders of the biggest A.I. labs argue that artificial intelligence will usher in a new era of scientific discovery, which will help us cure diseases and accelerate our ability to address the climate crisis” – e.g., Sam Altman on cures for cancer, etc. Understanding how AI engages with research evaluations may provide a window into its values, abilities, and limitations.2

In this project, we pilot the use of The Unjournal’s evaluations as a benchmark to test the capabilities of current large language models (LLMs) and bespoke agent-based tools, considering how the research evaluations these tools generate compare to expert human reviews. The Unjournal systematically prioritizes ‘impactful’ research and pays for high-quality human evaluations, structured quantified ratings, claim identification and assessment, and predictions. They have further labeled and classified “major consensus critiques” from a subset of these evaluations.3

We next prompt several LLMs and agent-based tools4 to review the same quantitative social science research papers evaluated by The Unjournal’s human evaluators, using the same criteria and guidelines. Each paper is assessed on specific dimensions – for example, the strength of its evidence, rigor of methods, clarity of communication, openness/reproducibility, relevance to global priorities, and overall quality. Each model provides quantitative scores (with uncertainty intervals) on these criteria and produces a written evaluation.
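
To make this concrete, the sketch below shows roughly the kind of structured output we request from each model. The criteria track The Unjournal’s rating categories, but the exact field names, scales, and numeric values shown here are illustrative assumptions, not the actual prompt schema.

```r
# Minimal sketch (assumed field names and values) of one structured LLM rating:
# each criterion gets a 0-100 percentile midpoint plus a 90% credible interval.
llm_rating <- list(
  paper_id = "example-paper-001",
  model    = "gpt-5.2-pro",  # label for the model/agent configuration used
  criteria = data.frame(
    criterion = c("overall", "methods", "logic_communication",
                  "open_science", "global_relevance"),
    midpoint  = c(62, 55, 70, 48, 80),  # illustrative values only
    lower_90  = c(45, 38, 55, 30, 65),
    upper_90  = c(78, 70, 82, 65, 90)
  )
)
```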

Our initial dataset will include the 60 research papers that have existing Unjournal human evaluations.5 For each paper, the AI will generate: (1) numeric ratings on the defined criteria, (2) identification of the paper’s key claims, and (3) a detailed review enumerating and discussing the paper’s contributions and weaknesses. We then compare the AI-generated evaluations to the published human evaluations, considering quantitative metrics (e.g., inter-rater reliability), identification of consensus critiques, and qualitative comparison. The next phase will train and tune models and measure their performance on research currently in The Unjournal’s pipeline, i.e., where no human evaluation has yet been made public, enabling out-of-time validation, confirmatory hypothesis tests, and reduced risk of model contamination.
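
As a rough sketch of the quantitative comparison step, assuming a long-format table of ratings with one row per paper, criterion, and source (hypothetical column names, with `source` taking the values `"human"` and `"llm"`), the core agreement metrics could be computed along these lines:

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format ratings: one row per paper x criterion x source,
# with a 0-100 midpoint and a 90% credible interval (lower, upper).
# Assumes human ratings are already aggregated to one row per paper/criterion.
compare_sources <- function(ratings) {
  wide <- ratings |>
    select(paper_id, criterion, source, midpoint, lower, upper) |>
    pivot_wider(names_from = source,
                values_from = c(midpoint, lower, upper))

  wide |>
    group_by(criterion) |>
    summarise(
      # Rank agreement between human and LLM midpoints
      spearman = cor(midpoint_human, midpoint_llm,
                     method = "spearman", use = "complete.obs"),
      # Mean absolute difference on the 0-100 percentile scale
      mean_abs_diff = mean(abs(midpoint_human - midpoint_llm), na.rm = TRUE),
      # Share of human midpoints falling inside the LLM's stated 90% interval
      coverage = mean(midpoint_human >= lower_llm &
                        midpoint_human <= upper_llm, na.rm = TRUE),
      .groups = "drop"
    )
}
```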

This complements other recent work on AI-evaluation approaches and benchmarks, outlined below. Our work offers in-depth comparisons and oversight in a high-nuance, high-value context: global-priorities-relevant economics and social-science research. It also leverages The Unjournal’s ongoing pipeline as a continual experimental and feedback environment.

Research Goals and Questions (overview)

  1. Measure the value of frontier AI research evaluation vs. human peer review
    • Is state-of-the-art AI research evaluation reliable and useful enough to replace human peer review in these contexts?
    • Reverse question: The Unjournal was asked, “Are human evaluations worth it?”
    • How well can AI predict what humans will report?
    • What are AI’s strengths & weaknesses relative to humans, e.g., in terms of: (1) Predicting journal outcomes, citations, etc. (2) Generating feedback deemed most useful by humans (3) Spotting research errors (4) Diagnosing different types of strengths and limitations (e.g., reasoning transparency, justification of methods and assumptions, communication) (5) Judging content without biases/statistical discrimination based on prestige labels, etc.
  2. Compare methods for AI research-evaluation, find the best approaches
    • Which approaches come closest to (the desirable features of) human evaluations and in which ways?
    • What gets closest to the ‘frontier’, doing better in terms of outcomes like those mentioned above?
  3. The nature of frontier AI ‘preferences & tastes’ over research
    • Which types of papers and approaches do default models tend to ‘like’ more/less relative to humans when given the same guidelines?
    • Do we see signs of characteristic preferences that help us understand how AIs are likely to prioritize, conduct, and evaluate research?
  4. Hybrid human-AI performance (future trials)

Our work in context

AI practical research applications6

Luo et al. (2025) survey LLM roles from idea generation to peer review, including experiment planning and automated scientific writing. They highlight opportunities (productivity, coverage of long documents) alongside governance needs (provenance, detection of LLM-generated content, standardizing tooling) and call for reliable evaluation frameworks.

Korinek (2025) considers the specific potential for using AI in economic research, offering “working examples and step-by-step code” to help economists use agents for literature reviews, econometric code, fetching and analyzing economic data, and coordinating complex research workflows. He advocates that economists use AI tools to boost productivity and “to understand [these tools’] capabilities and limitations.” He calls for human researcher oversight “to ensure [these tools] pursue the right objectives and maintain research integrity”. He outlines a range of current limitations and open questions (outlined here), including hallucinations, mistakes compounding through multi-agent workflows, and “brittleness to small variations in prompts”. At a higher level, he argues “the questions that will matter most—what problems deserve investigation, how should we evaluate trade-offs, what constitutes progress—will remain irreducibly human.”

Mitchener, Yiu, et al. (2025) present Kosmos, “an AI scientist for autonomous discovery”. However, evaluations reveal important limitations:

  • “Kosmos was found to be only 57% accurate in statements that required interpretation of results, likely due to its propensity to conflate statistically significant results with scientifically valuable ones”

  • “although 85% of statements derived from data analyses were accurate, our evaluations do not capture if the analyses Kosmos chose to execute were the ones most likely to yield novel or interesting scientific insights.”… “identifying valuable discoveries Kosmos made is a time-intensive process that relies on human scientists with significant domain expertise”

AI peer review, error detection, and research evaluation

Son et al. (2025) evaluate LLMs on 83 papers paired with 91 confirmed, significant scientific errors. This resembles our own approach, although they focus on errata- or retraction-worthy errors, while we consider more nuanced methodological limitations. Their strongest model (OpenAI o3) identified only 21.1% of these errors, with only 6.1% precision.7

Eger et al. (2025) provide a broad review of LLMs in science and a focused discussion of AI‑assisted peer review. They argue: (i) peer‑review data is scarce and concentrated in CS/OpenReview venues; (ii) targeted assistance that preserves human autonomy is preferable to end‑to‑end reviewing; and (iii) ethics and governance (bias, provenance, detection of AI‑generated text) are first‑class constraints.

Dycke, Kuznetsov, and Gurevych (2023) present NLPEER, a “multidomain corpus” of more than 5,000 papers and 11,000 review reports from five different venues, spanning 2012 to 2022.

Zhang and Abernethy (2025) propose deploying LLMs as quality checkers to surface critical problems instead of generating full narrative reviews. Using papers from WITHDRARXIV and an automatic evaluation framework that leverages “LLM-as-judge,” they find the best performance from top reasoning models but still recommend human oversight.

Pataranutaporn et al. (2025) asked four near-state-of-the-art LLMs (GPT-4o mini, Claude 3.5 Haiku, Gemma 3 27B, and LLaMA 3.3 70B) to consider 1,220 unique papers “drawn from 110 economics journals excluded from the training data of current LLMs”. They prompted the models to act “in your capacity as a reviewer for [a top-5 economics journal]” and to make a publication recommendation on a 6-point scale ranging from “1 = Definite Reject…” to “6 = Accept As Is…”. They asked the models to evaluate each paper on a 10-point scale for originality, rigor, scope, impact, and whether it was ‘written by AI’. Separately, they had LLMs rate 330 papers with the authors’ identities removed, with the names replaced by fictitious male or female names paired with real elite or non-elite institutions, or with prominent male or female economists attached as authors.

They compare the LLMs’ ratings with the RePEc rankings of the journals the papers were published in, finding general alignment. They find mixed results on detecting AI-generated papers. In the names/institutions comparisons, the LLMs show biases favoring named high-prestige male authors relative to high-prestige female authors, as well as biases towards elite institutions and US/UK universities.

There have been several other empirical benchmarking projects, including work covered in LLM4SR: A Survey on Large Language Models for Scientific Research and Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation. (See cited surveys for additional benchmarks.)

Zhang et al. (2025) use AI conference paper data and LLM agents for pairwise manuscript comparisons, finding this approach outperforms traditional rating-based methods for identifying high-impact papers by citation metrics, though with some evidence of biases favoring established research institutions.8

We discuss how our approach differs from prior work below.9

AI research taste and potential to guide research

A key motivation for this work stems from broader claims about AI’s potential to accelerate scientific discovery and “revolutionize science.” If AI systems are to contribute meaningfully to research—not just as tools but as evaluators and potential collaborators—we need to understand their “taste”: what they value, what they miss, and whether their priorities align with human expert judgment and real-world research impact.

Understanding LLM preferences and blind spots when evaluating research provides a window into:

  • Whether AI can identify truly impactful research vs. stylistic or superficial features
  • Potential systematic biases in how AI might direct future research priorities
  • The risks and opportunities of AI-assisted or AI-directed research agendas

This connects to broader debates about the returns to science in the presence of technological risk (Clancy 2023), where optimistic forecasts of AI-accelerated science must be weighed against potential downsides and the question of whether AI judgments align with what actually advances knowledge and improves outcomes.

Our present focus, The Unjournal opportunity

Our project distinguishes itself through several unique advantages:

Actual human evaluations: We use actual human evaluations of research in economics and adjacent fields, past and prospective, including written reports, ratings, and predictions.10 Other work has relied on collections of research and grant reviews, including NLPEER, SubstanReview, and reviews from the Swiss National Science Foundation. Those data focus heavily on computer-science-adjacent fields and are less representative of mainstream peer-review practices in older, established academic fields. Note that The Unjournal commissions the evaluation of impactful research, often from high-prestige working-paper archives like NBER, and makes all evaluations public, even if they are highly critical of the paper.

Rich, structured data: The Unjournal’s 55+ evaluation packages enable us to train and benchmark the models. Their detailed written reports and multi-dimensional percentile ratings with credible intervals allow us to compare the ‘taste’, priorities, and ratings of humans relative to AI models across the different criteria and domains.

Ground truth outcomes: The ‘journal tier prediction’ outcomes provide an external ground truth,11 enabling a human-vs-LLM horse race (see the scoring sketch below).

Prospective pipeline: The Unjournal’s pipeline of future evaluations allows for clean out-of-training-data predictions and evaluation, serving as a “live testing lab” for a new judging panel and enabling out-of-time testing and validation by authors and human evaluators.

Hybrid evaluation potential: We are also planning multi-armed trials on these human evaluations (cf. Brodeur et al. 2025) to understand the potential for hybrid human-AI evaluation in this context.
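
As referenced above, here is a minimal sketch of how the journal-tier horse race could be scored. The column names, and the example 0–5 tier scale, are assumptions for illustration rather than the actual data layout.

```r
# Each row: one paper, with the realized journal tier (e.g., on a 0-5 scale)
# and the predicted tiers from human evaluators and from the LLM.
# Column names are assumptions for illustration only.
score_tier_predictions <- function(d) {
  data.frame(
    source   = c("human", "llm"),
    mae      = c(mean(abs(d$tier_actual - d$tier_pred_human), na.rm = TRUE),
                 mean(abs(d$tier_actual - d$tier_pred_llm), na.rm = TRUE)),
    spearman = c(cor(d$tier_actual, d$tier_pred_human,
                     method = "spearman", use = "complete.obs"),
                 cor(d$tier_actual, d$tier_pred_llm,
                     method = "spearman", use = "complete.obs"))
  )
}
```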

The economics, social science, and global-impact policy context provides a uniquely high-value and nuanced setting in which to measure and understand AI tendencies and capabilities. According to Korinek (2025), “economists have a special responsibility. We understand incentives, market dynamics, and resource allocation. We grasp the subtleties of human behavior and social coordination. These insights will be crucial for the transition to an AI-augmented research”.

Next steps and future directions

This project is ongoing. Our planned next steps include:

  1. Further statistical analyses: Intuitive information-theoretic and Bayesian measures and tests; multi-level modeling (random and mixed effects for raters, papers, rating categories, etc.; a minimal model sketch follows this list)

  2. Content-swap ‘bias tests’: Following Pataranutaporn et al. (2025), test whether LLMs show systematic biases based on author names, institutional affiliations, or other non-content features

  3. Predict journal-outcome/bibliometrics horse-race: Compare human vs. LLM predictions against actual publication outcomes

  4. Refine LLM model, compare dimensions, test: Systematic comparison of prompts, models, and approaches

  5. Explore/analyze descriptive evaluation content: Qualitative analysis of LLM-generated evaluations

  6. Human enumerators: Employ human raters to assess whether LLMs identify consensus critiques

  7. Train ‘predict human evaluations’ model, preregister for future evaluations: Out-of-time validation with clean test set

  8. Possible hybrid human-AI trials: Test whether human-AI collaboration improves on either alone

  9. Policy measures and decision criteria: How do we decide and agree on whether AI, human, or hybrid evaluation is ‘better’ at this?

  10. Implications for The Unjournal/evaluation processes: What can we learn to improve evaluation systems?
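
For the multilevel modeling mentioned in item 1, here is a minimal sketch using lme4, assuming the same hypothetical long-format ratings data as above (one row per individual rating, with a source indicator and identifiers for rater, paper, and criterion); the variable names are placeholders.

```r
library(lme4)

# Does the average human-vs-LLM gap vary by criterion, beyond paper- and
# rater-level noise? 'source' is a factor ("human" vs "llm");
# 'midpoint' is the 0-100 percentile rating.
fit <- lmer(
  midpoint ~ source +            # average shift between sources
    (1 | paper_id) +             # papers differ in underlying quality
    (1 + source | criterion) +   # the gap can differ by rating category
    (1 | rater_id),              # idiosyncratic rater/model-run effects
  data = ratings
)
summary(fit)
```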

Project planning, funding, status

This benchmarking project is a collaboration with The Unjournal, a grant-funded nonprofit12 that has published 55+ detailed evaluation packages to build open, quantitative evaluation infrastructure and processes, focusing on global-priorities-relevant research in economics, policy, and social science.13

Team

Core team:

  • David Reinstein (PI): Economist, founder/director of The Unjournal. 25+ years research experience (UC Berkeley, Essex, Exeter, Rethink Priorities). Track record of building evaluation infrastructure and working with impact-focused organizations.
  • Valentin Klotzbücher (Co-lead): Economist and statistician (University of Basel). Applied econometrics, causal inference, and reproducible pipelines (R/Quarto). Co-leads current LLM–Unjournal benchmarking work.
  • Tianmai Michael Zhang (Technical collaborator): PhD student, University of Washington. Co-author of “Reviewing Scientific Papers for Critical Problems With Reasoning LLMs.” Active contributor to the Black Spatula Project.
  • Lorenzo Pacchiardi (Advisor): Research Associate, Leverhulme Centre for the Future of Intelligence, Cambridge. Leads Open Philanthropy–funded project on benchmarking LLM data science tasks. Unjournal advisory board.

We plan to leverage additional support from The Unjournal’s management team, operations staff, and evaluator pool (200+ credentialed researchers).

Funding and budget

We are actively seeking funding, in conjunction with The Unjournal, to expand this work beyond the pilot scale to a more robust and rigorous scientific benchmark. The interactive budget model below is linked to this Google Sheet.

Key cost categories include: Core team time (PI + co-lead), engineering and development, payments for human research evaluators (extending The Unjournal’s pipeline), payments for enumerators/experts to label and provide human feedback on LLM evaluator claims and critiques, and compute/API costs. No model training or large-compute capabilities work—all spending focuses on evaluation, tooling, and human experimentation.

Deliverables

  1. Living benchmark dataset: Continuously updated papers + human evaluations + LLM outputs + adjudication scores, openly available throughout the project
  2. Public dashboard: Updatable website where new models can be compared against evolving benchmarks
  3. Open-source codebase: Pipelines for running AI evaluations and computing metrics
  4. Failure-mode taxonomy: Documented patterns of AI reasoning failures
  5. “Taste/alignment” analysis: Quantitative comparison of human vs. AI research preferences
  6. Academic paper(s): Targeting venues like Nature Human Behaviour, Science, or field-relevant journals
  7. Practical guidance: Playbook for when AI peer review helps and where it fails

Benchmark quality:

  • 80+ papers with complete paired human/LLM evaluations
  • 3+ model families systematically compared
  • Dataset and code publicly released with documentation

Informative results:

  • Pre-registered quantitative metrics with adequate statistical power
  • At least one publishable paper or equivalent preprint
  • Clear evidence of specific failure modes (or their absence)

Practical uptake:

  • Unjournal team finds tools useful for triage and evaluation
  • 2+ external organizations (GCR funders, journals, research users) engage with outputs
  • If tools prove not useful, we report this explicitly as a negative result

Risks and mitigations

  • LLMs perform poorly: Negative results are valuable; focus on taxonomy and practical guidance
  • Human evaluations are noisy: Use adjudication, curated subsets, pre-registration
  • Engineering bottlenecks: Modular pipelines, flexible staffing, leverage existing tools
  • Low external uptake: Build relationships with GCR organizations early; design for usability

LLMs are only now sophisticated enough to plausibly conduct research evaluation, but the window to understand their failure modes before widespread adoption is closing. The Unjournal’s structured evaluations provide a unique “ground truth” for benchmarking AI reasoning in economics and policy research; no comparable dataset exists. If AI research evaluation becomes reliable, it could reduce costs and accelerate evidence synthesis for time-sensitive decisions. If it remains unreliable, decision-makers need to know.

Project timeline

The project follows an iterative, overlapping structure rather than sequential phases. Human evaluations flow continuously from The Unjournal’s pipeline, analysis proceeds alongside data collection, and outputs are shared as living documents throughout.

Key timeline principles:

  • Pre-registration early: We specify and register a general pre-analysis plan in months 1–3 to constrain researcher degrees of freedom, with amendments documented as the project evolves.
  • Continuous human evaluations: New Unjournal evaluation packages arrive throughout the project, feeding both the benchmark dataset and (in larger scenarios) model training.
  • Analysis throughout: Statistical analysis and model refinement proceed iteratively as data accumulates—we don’t wait for “all data” before analyzing.
  • Living research: Data and results are openly available and updated continuously; there is no “final dataset,” only versioned releases with clear provenance.

For more details, see The Unjournal and our project repository.14


  1. E.g., see anecdotal reports. Proposed solutions include credit- or payment-based models. (Further references to be added.)↩︎

  2. Eger et al. (2025): “When it comes to using AI tools for science, ethics is of overarching importance. This is because the AI tools exhibit various limitations, e.g., they (i) may hallucinate and fabricate content, (ii) exhibit bias, (iii) may have limited reasoning abilities and (iv) sometimes lack suitable evaluation, … among many other concerns such as risks of fake science, plagiarism, and lack of human authority. Indeed, the European Union has recently released guidelines on the responsible use of AI for science. In it, it points out that ‘[r]esearch is one of the sectors that could be most significantly disrupted by generative AI’ and that ‘AI has great potential for accelerating scientific discovery and improving the effectiveness and pace of research and verification processes’.”↩︎

  3. An individual rater (David Reinstein, Co-director of The Unjournal) curated a pilot version of this here, focusing on critiques identified by multiple evaluators and/or confirmed by the author or evaluation manager. We are currently (Jan 5, 2026) comparing this enumeration to the issues identified by LLMs (see the “Results: Critiques” chapter). Future work will systematically employ multiple enumerators with a pre-defined rubric.↩︎

  4. Dec 2025 focus: OpenAI’s GPT-5.2 Pro model↩︎

  5. This is slightly off as of January 2026: we only have 56 papers with official Unjournal evaluations. I suspect this count includes some papers that are either duplicated or in our pipeline and given an independent, non-official evaluation.↩︎

  6. We maintain a NotebookLM of the most relevant and related research here.↩︎

  7. They asked domain experts to do a qualitative analysis, which found most false positives were hallucinations or student-level misconceptions.↩︎

  8. Key findings from Zhang et al. (2025): (1) Uses AI conference paper data; (2) Employs LLM agents for pairwise manuscript comparisons; (3) Significantly outperforms traditional rating-based methods in identifying high-impact papers by citation metrics; (4) Shows some evidence of biases/statistical discrimination based on characteristics like ‘papers from established research institutions’.↩︎

  9. See below. Claude’s brief summary: “Our approach differs from prior work by (i) focusing on structured, percentile-based quantitative ratings with credible intervals that map to decision-relevant dimensions used by The Unjournal; (ii) comparing those ratings to published human evaluations rather than using LLM-as-judge; and (iii) curating contamination-aware inputs (paper text extraction with reference-section removal and token caps), with a roadmap to add multi-modal checks when we score figure- or table-dependent criteria”.↩︎

  10. For additional context on The Unjournal’s approach and evaluation criteria, see our evaluator guidelines.↩︎

  11. These represent ground truths about verifiable publication outcomes, not about the ‘true quality’ of the paper.↩︎

  12. Funders include the Survival and Flourishing Fund, the Long Term Future Fund, and EA Funds; they make all grant information public here.↩︎

  13. This section was drafted with AI assistance and reviewed by the authors.↩︎

  14. Methodological implications we plan to adopt from the literature: (i) report stability (repeat runs/seeds) and calibration diagnostics alongside point scores; (ii) keep error verification distinct from quality scoring, and when we do verification, prefer author- or expert-validated ground truth over LLM-as-judge, or clearly disclose the judge and its limits; (iii) track input-format effects (PDF vs. LaTeX vs. extracted text) and costs/latency, since these affect reproducibility and deployment.↩︎