source("setup_params.R")We will build and refine LLM tools to generate peer-reviews and ratings of impactful research, and compare these with human experts’ work (esp. from Unjournal.org): to benchmark performance, understand AI’s research taste, and develop tools to improve research evaluation and dissemination.
source("setup_params.R")Pages, metrics, and comparisons are under active development. Expect rough edges and frequent updates.
Is AI good at peer-reviewing? Does it offer useful and valid feedback? Can it predict how human experts will rate research across a range of categories? How can it help academics do this “thankless” task better? Is it particularly good at spotting errors? Are there specific categories, e.g. spotting math errors or judging real-world relevance, where it does surprisingly well or poorly? How does its “research taste” compare to humans?
If AI research evaluation works, it could free up substantial scientific resources – perhaps $1.5 billion per year in the US alone (Aczel, Szaszi, and Holcombe 2021) – and offer more continual and detailed review, helping improve research. It could also help characterize methodological strengths and weaknesses across papers, aiding training and research direction-setting. Furthermore, a key promise of AI is to directly improve science and research. Understanding how AI engages with research evaluation may provide a window into its values, abilities, and limitations.
In this project, we are testing the capabilities of current large language models (LLMs), examining whether they can generate research paper evaluations comparable to expert human reviews. The Unjournal systematically prioritizes ‘impactful’ research and pays for high-quality human evaluations, structured quantified ratings, claim identification and assessment, and predictions. We use an AI model (OpenAI’s GPT-5 Pro) to review social science research papers under the same criteria used by The Unjournal’s human evaluators.
Each paper is assessed on specific dimensions – for example, the strength of its evidence, rigor of methods, clarity of communication, openness/reproducibility, relevance to global priorities, and overall quality. The LLM will provide quantitative scores (with uncertainty intervals) on these criteria and produce a written evaluation.
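As a rough sketch of the structure we have in mind, each LLM evaluation could be stored as one row per criterion, with a percentile midpoint and a credible interval. The criterion labels and numbers below are illustrative placeholders, not the actual Unjournal rating schema or real model output:

```r
# Illustrative only: criterion labels and numbers are placeholders,
# not the actual Unjournal rating schema or real model output.
llm_ratings <- data.frame(
  paper_id  = "example_paper_01",
  criterion = c("overall", "methods", "claims_evidence",
                "communication", "open_science", "relevance"),
  midpoint  = c(72, 65, 60, 80, 55, 70),  # percentile rating, 0-100
  ci_lower  = c(55, 48, 42, 68, 35, 55),  # lower bound, 90% credible interval
  ci_upper  = c(85, 80, 75, 90, 72, 84)   # upper bound, 90% credible interval
)
llm_ratings
```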
Our initial dataset will include the 59 research papers that have existing Unjournal human evaluations. For each paper, the AI will generate: (1) numeric ratings on the defined criteria, (2) identification of the paper’s key claims, and (3) a detailed review discussing the paper’s contributions and weaknesses. We will then compare the AI-generated evaluations to the published human evaluations.
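The comparison itself could begin with simple agreement summaries. The sketch below assumes `llm_ratings` and `human_ratings` data frames with the hypothetical columns used above; it illustrates the kind of analysis we have in mind, not our final pipeline:

```r
# A minimal sketch of the planned comparison, assuming `llm_ratings` and
# `human_ratings` data frames with the hypothetical columns used above
# (paper_id, criterion, midpoint, ci_lower, ci_upper).
compare_ratings <- function(llm_ratings, human_ratings) {
  merged <- merge(llm_ratings, human_ratings,
                  by = c("paper_id", "criterion"),
                  suffixes = c("_llm", "_human"))
  # Per-criterion agreement: rank correlation of midpoints, mean absolute
  # gap, and how often the human midpoint falls inside the LLM's interval.
  do.call(rbind, lapply(split(merged, merged$criterion), function(d) {
    data.frame(
      criterion    = d$criterion[1],
      spearman_rho = cor(d$midpoint_llm, d$midpoint_human, method = "spearman"),
      mean_abs_gap = mean(abs(d$midpoint_llm - d$midpoint_human)),
      ci_coverage  = mean(d$midpoint_human >= d$ci_lower_llm &
                            d$midpoint_human <= d$ci_upper_llm)
    )
  }))
}
```

Reporting interval coverage alongside correlations lets us ask not only whether the AI ranks papers like humans do, but also whether its stated uncertainty is calibrated against human judgments.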
In the next phase, we will focus on papers currently under evaluation, i.e., where no human evaluation has yet been made public, to allow us to rule out contamination (the human evaluations appearing in the models’ training data).
Luo et al. (2025) survey LLM roles from idea generation to peer review, including experiment planning and automated scientific writing. They highlight opportunities (productivity, coverage of long documents) alongside governance needs (provenance, detection of LLM-generated content, standardizing tooling) and call for reliable evaluation frameworks.
Eger et al. (2025) provide a broad review of LLMs in science and a focused discussion of AI‑assisted peer review. They argue: (i) peer‑review data is scarce and concentrated in CS/OpenReview venues; (ii) targeted assistance that preserves human autonomy is preferable to end‑to‑end reviewing; and (iii) ethics and governance (bias, provenance, detection of AI‑generated text) are first‑class constraints.
Zhang and Abernethy (2025) propose deploying LLMs as quality checkers to surface critical problems instead of generating full narrative reviews. Using papers from WITHDRARXIV and an automatic evaluation framework that leverages “LLM-as-judge,” they find the best performance from top reasoning models but still recommend human oversight.
Pataranutaporn et al. (2025) asked four near-state-of-the-art LLMs (GPT-4o mini, Claude 3.5 Haiku, Gemma 3 27B, and LLaMA 3.3 70B) to consider 1220 unique papers “drawn from 110 economics journals excluded from the training data of current LLMs”. They prompted the models to act “in your capacity as a reviewer for [a top-5 economics journal]” and make a publication recommendation on a 6-point scale ranging from “1 = Definite Reject…” to “6. Accept As Is…”. They also asked the models to evaluate each paper on a 10-point scale for originality, rigor, scope, impact, and whether it was ‘written by AI’. Separately, they had LLMs rate 330 papers with the authors’ identities removed, with the real names replaced by fake male or female names and real elite or non-elite institutions, or with prominent male or female economists attached (details to be confirmed).
They compare the LLMs’ ratings with the RePEc rankings of the journals in which the papers were published, finding general alignment. They find mixed results on detecting AI-generated papers. In the name/institution comparisons, they also find that the LLMs favor named high-prestige male authors relative to high-prestige female authors, and show biases towards elite institutions and US/UK universities (we are double-checking these details).
There have been several other empirical benchmarking projects, including work covered in the surveys LLM4SR: A Survey on Large Language Models for Scientific Research and Transforming Science with Large Language Models: A Survey on AI-assisted Scientific Discovery, Experimentation, Content Generation, and Evaluation. (We will discuss these in a future update.)
Zhang et al. (2025) use data on AI conference papers and a system that “employs LLM agents to perform pairwise comparisons among manuscripts”. They report that this approach “significantly outperforms traditional rating-based methods in identifying high-impact papers” [by citation metrics]. They also find some evidence of biases (or something like statistical discrimination) based on characteristics like ‘papers from established research institutions’.
Our project distinguishes itself through its use of actual human evaluations of research in economics and adjacent fields, past and prospective, including written reports, ratings, and predictions.1 The Unjournal’s 50+ evaluation packages enable us to train and benchmark the models. Its pipeline of future evaluations allows for clean out-of-training-data predictions and evaluation. Its detailed written reports and multi-dimensional ratings also allow us to compare the ‘taste’, priorities, and comparative ratings of humans and AI models across the different criteria and domains. The ‘journal tier prediction’ outcomes also provide an external ground truth2 enabling a human-vs-LLM horse race. We are also planning multi-armed trials on these human evaluations (cf. Brodeur et al. 2025; Qazi et al. 2025) to understand the potential for hybrid human-AI evaluation in this context.
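As an illustration of how the journal-tier horse race might be scored (a sketch under assumed column names, not our final analysis), we could compare human and LLM tier predictions against the realized tier for each paper:

```r
# Hypothetical sketch of the 'journal tier' horse race. `tier_data` is an
# assumed data frame with columns paper_id, tier_actual, tier_pred_human,
# and tier_pred_llm, all on the same numeric tier scale.
tier_horse_race <- function(tier_data) {
  score <- function(pred, actual) {
    c(mean_abs_error = mean(abs(pred - actual)),
      spearman_rho   = cor(pred, actual, method = "spearman"))
  }
  rbind(human = score(tier_data$tier_pred_human, tier_data$tier_actual),
        llm   = score(tier_data$tier_pred_llm,   tier_data$tier_actual))
}
```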
We summarize, somewhat more formally, how our approach differs from prior work in a footnote.3
Other work has relied on collections of research and grant reviews, including NLPEER, SubstanReview, and data from the Swiss National Science Foundation. That data has a heavy focus on computer-science-adjacent fields and is less representative of mainstream peer review practices in older, established academic fields. Note that The Unjournal commissions the evaluation of impactful research, often from high-prestige working paper archives like NBER, and makes all evaluations public, even if they are highly critical of the paper.↩︎
This is ground truth about verifiable publication outcomes, of course, not about the ‘true quality’ of the paper.↩︎
Our approach differs from prior work by (i) focusing on structured, percentile-based quantitative ratings with credible intervals that map to decision-relevant dimensions used by The Unjournal; (ii) comparing those ratings to published human evaluations rather than using LLM-as-judge; and (iii) curating contamination-aware inputs (paper text extraction with reference-section removal and token caps), with a roadmap to add multi-modal checks when we score figure- or table-dependent criteria.↩︎
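As a rough illustration of the input preparation described in the footnote above (the regex, the word-count proxy for tokens, and the `max_tokens` default are all simplifying assumptions, not our actual pipeline):

```r
# Rough sketch of the contamination-aware input preparation mentioned in the
# footnote above. The regex and the word-count proxy for tokens are
# simplifications, and max_tokens = 60000 is an arbitrary placeholder.
prepare_paper_text <- function(full_text, max_tokens = 60000) {
  # Drop everything from a "References" / "Bibliography" heading onward
  trimmed <- sub("(?is)\\n(references|bibliography)\\b.*$", "",
                 full_text, perl = TRUE)
  # Crude token cap: approximate tokens by whitespace-delimited words
  words <- strsplit(trimmed, "\\s+")[[1]]
  if (length(words) > max_tokens) {
    trimmed <- paste(words[seq_len(max_tokens)], collapse = " ")
  }
  trimmed
}
```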