Comparing LLM and human peer reviews and ratings of social science research using data from Unjournal.org evaluations
We will build and refine LLM tools to generate peer reviews and ratings of impactful research, and compare these with the work of human experts (especially Unjournal.org evaluations) to benchmark performance, understand AI’s “research taste”, and develop tools that improve research evaluation and dissemination.
Project Overview
Motivation and questions
Is AI good at peer review? Does it offer useful feedback? Can it predict how human experts will rate research across a range of categories? Where does it do surprisingly well (e.g., catching math errors) or poorly (e.g., judging real-world relevance)? What does its “research taste” look like compared to that of human reviewers?
If AI research evaluation works, it could free up substantial scientific resources (potentially on the order of billions of dollars annually) and enable more continual, structured feedback on research. It could also help characterize methodological strengths/weaknesses across papers, aiding training and research direction-setting. Understanding how AI evaluates research is also a window into its values, abilities, and limitations.
In this project, we are testing whether current large language models (LLMs) can generate research paper evaluations comparable to expert human reviews. The Unjournal systematically prioritizes ‘impactful’ research and pays for high-quality human evaluations, structured quantified ratings, claim identification and assessment, and predictions. We use an AI (OpenAI’s GPT-5 model, for now) to review social science research papers under the same criteria used by The Unjournal’s human evaluators.
Each paper is assessed on specific dimensions: for example, the strength of its evidence, the rigor of its methods, the clarity of its communication, its openness and reproducibility, its relevance to global priorities, and its overall quality. The LLM will provide quantitative scores (with uncertainty intervals) on these criteria and produce a written evaluation.
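To make the rating step concrete, here is a minimal sketch of how we might elicit structured scores and uncertainty intervals from the model. The criterion names, prompt wording, JSON layout, and the rate_paper helper are illustrative assumptions rather than the project’s actual pipeline; the model string is simply a placeholder for whichever current model we use.

```python
"""Minimal sketch: eliciting Unjournal-style ratings with uncertainty intervals.

Assumptions (not taken from the project itself): the criterion names, the JSON
layout, the prompt wording, and the model identifier are all illustrative.
"""
import json
from openai import OpenAI

# Illustrative subset of rating dimensions; the actual Unjournal rubric differs in detail.
CRITERIA = [
    "claims_evidence",    # strength of evidence for the main claims
    "methods",            # rigor of methods and analysis
    "communication",      # clarity of communication
    "open_science",       # openness / reproducibility
    "global_relevance",   # relevance to global priorities
    "overall",            # overall quality
]

SYSTEM_PROMPT = (
    "You are an expert peer reviewer. Rate the paper on each criterion from 0-100 "
    "and give a 90% credible interval (lower, upper) reflecting your uncertainty. "
    "Respond in JSON as: "
    '{"ratings": [{"criterion": ..., "midpoint": ..., "lower": ..., "upper": ..., '
    '"rationale": ...}, ...]}.'
)

def rate_paper(paper_text: str, model: str = "gpt-5") -> list[dict]:
    """Ask the model for quantitative ratings with uncertainty intervals."""
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Criteria: {CRITERIA}\n\nPaper:\n{paper_text}"},
        ],
        response_format={"type": "json_object"},  # request structured JSON output
    )
    return json.loads(response.choices[0].message.content)["ratings"]
```

Asking for a credible interval alongside each midpoint lets us later check whether the model’s stated uncertainty is calibrated against the human ratings.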
Our initial dataset will include research papers that have existing human evaluations. For each paper, the AI will generate: (1) numeric ratings on the defined criteria, (2) identification of the paper’s key claims, and (3) a detailed review discussing the paper’s contributions and weaknesses. We will then compare the AI-generated evaluations to the published human evaluations.
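As a rough illustration of the comparison step, the sketch below computes per-criterion agreement statistics between AI and human midpoint ratings. The table layout, column names, and the specific metrics (Spearman correlation, mean absolute error, interval coverage) are our illustrative choices here, not a fixed analysis plan.

```python
"""Minimal sketch: comparing AI-generated ratings to published human ratings.

Assumes a tidy table with one row per (paper, criterion) pair and hypothetical
column names; the actual Unjournal data layout will differ.
"""
import pandas as pd
from scipy.stats import spearmanr

def compare_ratings(df: pd.DataFrame) -> pd.DataFrame:
    """Per-criterion agreement between AI and human midpoint ratings.

    Expected columns (illustrative): paper_id, criterion,
    ai_mid, ai_lower, ai_upper, human_mid.
    """
    rows = []
    for criterion, g in df.groupby("criterion"):
        rho, pval = spearmanr(g["ai_mid"], g["human_mid"])
        rows.append({
            "criterion": criterion,
            "n_papers": len(g),
            "spearman_rho": rho,
            "p_value": pval,
            # Mean absolute difference between AI and human midpoints
            "mae": (g["ai_mid"] - g["human_mid"]).abs().mean(),
            # Share of human midpoints falling inside the AI's uncertainty interval
            "interval_coverage": (
                (g["human_mid"] >= g["ai_lower"]) & (g["human_mid"] <= g["ai_upper"])
            ).mean(),
        })
    return pd.DataFrame(rows)
```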
Next, we will focus on papers currently under evaluation, i.e., those for which no human evaluation has yet been published, so we can rule out any contamination of the model’s output by existing reviews.