Public Research Evaluation: Unjournal.org and LLMs

David Reinstein

The Unjournal

Valentin Klotzbücher

University of Basel & University Hospital Basel

Tianmai Michael Zhang

University of Washington

2025-10-17

The Unjournal

Public Journal-Independent Evaluation of Impactful Research

  • Focus: rigorous, policy‑relevant social science and economics
  • Not a journal. No accept/reject. We publish evaluations and ratings, not papers
  • Everything citable and reusable

Detour: The Unjournal’s mission, approach, and progress

How it works

Evaluations:

  • Narrative reports focusing on credibility and usefulness

  • Claim identification & assessment

  • Structured percentile ratings: overall + 7 criteria

  • Journal tier ratings and predictions


Publish & update: PubPub package, DOIs, summaries, author responses

Quantitative ratings

Percentile ratings: For each of the criteria below, please rank this paper relative to all serious research in the same area that you have encountered in the last three years.

I. Overall assessment — Heuristic judgment of quality, credibility, importance

  • Midpoint rating
  • 90% credible interval (CI)

Quantitative ratings (0 - 100 pctl.)


  II. Claims, strength and characterization of evidence

  III. Methods: Justification, reasonableness, validity, robustness assessment

  IV. Advancing knowledge & practice

Quant. ratings (0 - 100)


  V. Logic & communication

  VI. Open, collaborative, replicable

  VII. Relevance to global priorities, usefulness for practitioners

Journal ranking tiers (0.0–5.0)


  1. Normative: Where this should be published if the process were fair, unbiased, and ~aligned to the category metrics above.

  2. Predictive: Where this will likely be published given real‑world processes.

Output: Example (mean human eval. ratings)

Our goals and questions: General

  1. Measure the value of frontier AI research-eval. vs. human peer review
  2. Compare methods for AI research-evaluation, find the best approaches
  3. Characterize frontier AI ‘preferences & tastes’ over research
  4. Assess hybrid human-AI performance (later)

Literature

Pataranutaporn et al. (2025)


1220 papers “from 110 economics journals excluded from the training data of current LLMs”

LLMs asked to “make a publication recommendation” on a 6-point scale from “1 = Definite Reject…” to “6 = Accept As Is…”.


Also, a 10-point scale for originality, rigor, scope, impact…

And 330 papers w/ various ‘remove/change names/institutions’ treatments


Results: align w/ RePEc journal rankings; evidence of bias (gender, institution)

Zhang et al. (2025)


AI conference paper data

LLM agents performed pairwise comparisons among manuscripts


Accept/reject decisions: “agreement between AI & human reviewers… roughly on par with the consistency observed between independent human review committees”


Again, evidence of bias / ~statistical discrimination toward, e.g., “papers from established research institutions”

Our context, advantages

  • Actual human evaluations of research

  • In econ. & adjacent

  • Reports, ~benchmarked ratings, predictions (journal-tier ground-truth)


Potential:

  • Train & benchmark models on human evals

  • Future eval. pipeline: clean out-of-time predictions & evaluation

  • Multi-armed trials on human/hybrid evals (cf. Brodeur et al., 2025; Qazi et al., 2025)

Our initial LLM pipeline

Prompt

Copied from the Unjournal evaluator guidelines; removed mentions of “Unjournal”; added a ‘role’ statement & ~debiasing instructions

Your role – You are an academic expert as well as a practitioner across every relevant field – use all your knowledge and insight. You are acting as an expert research evaluator/reviewer.

Do not look at any existing ratings or evaluations of these papers you might find on the internet or in your corpus, do not use the authors’ names, status, or institutions in your judgment – give these ratings based on the content of the papers alone; do the assessment based on your knowledge and insights.

  • API calls via Python script

  • PDFs passed directly to model (direct ingestion, includes tables, graphs…)

  • Enforce a JSON Schema: midpoint and interval bounds for each criterion, plus a short rationale (see the sketch below)

  • Model: GPT-5 Pro (GPT-5)
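
A minimal sketch of what such a pipeline can look like, for illustration only. The file name, schema field names, and the model identifier are assumptions rather than the actual code; the OpenAI Responses API is used here as one way to combine direct PDF ingestion with a strictly enforced JSON Schema.

```python
# Sketch only: send a paper PDF plus the evaluator prompt, force a JSON reply
# with a midpoint, interval bounds, and a short rationale per criterion.
import json

from openai import OpenAI

client = OpenAI()

CRITERIA = ["overall", "claims", "methods", "adv_knowledge",
            "logic_comms", "open_sci", "gp_relevance"]

# Each criterion: midpoint, credible-interval bounds (0-100), short rationale.
per_criterion = {
    "type": "object",
    "properties": {
        "midpoint": {"type": "number"},
        "lower_bound": {"type": "number"},
        "upper_bound": {"type": "number"},
        "rationale": {"type": "string"},
    },
    "required": ["midpoint", "lower_bound", "upper_bound", "rationale"],
    "additionalProperties": False,
}
schema = {
    "type": "object",
    "properties": {c: per_criterion for c in CRITERIA},
    "required": CRITERIA,
    "additionalProperties": False,
}

PROMPT = "Your role – You are an academic expert ..."  # full evaluator prompt, as quoted above

# Direct PDF ingestion: upload the paper and attach it to the request.
uploaded = client.files.create(file=open("paper.pdf", "rb"), purpose="user_data")

response = client.responses.create(
    model="gpt-5-pro",  # assumed model identifier
    input=[{
        "role": "user",
        "content": [
            {"type": "input_file", "file_id": uploaded.id},
            {"type": "input_text", "text": PROMPT},
        ],
    }],
    # Structured output: the reply must conform to the schema above.
    text={"format": {"type": "json_schema", "name": "research_ratings",
                     "schema": schema, "strict": True}},
)

ratings = json.loads(response.output_text)
print(ratings["overall"])  # e.g. {'midpoint': ..., 'lower_bound': ..., ...}
```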

Initial concrete questions (~exploratory/descriptive)

  1. Overall characteristics of ratings

  2. Associations between (untrained) LLM & human ratings & predictions

  3. Informal exploration of “strong LLM-human ratings gaps”

See all results-in-progress here

Overall ratings: Unjournal

Overall ratings: Unjournal vs GPT-5 Pro

  • Matched papers: 40
  • Pearson r: 0.43
  • Spearman rho: 0.67
  • Krippendorff α (interval): 0.12
  • MAE (points): 12.6

Overall characteristics of ratings

criteria        mean (Human/LLM)   sd (Human/LLM)   iqr (Human/LLM)   min (Human/LLM)
adv_knowledge        73 / 84           15 / 9           15 / 12           25 / 60
claims               74 / 83           16 / 10          18 / 12           30 / 48
gp_relevance         78 / 85           15 / 10          19 / 14           12 / 55
logic_comms          75 / 86           13 / 6            9 / 5            30 / 64
methods              70 / 80           18 / 11          18 / 13           10 / 37
open_sci             72 / 64           16 / 13          18 / 12           10 / 30
overall              74 / 85           13 / 8           14 / 12           32 / 54
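
For illustration, a short pandas sketch of how a summary like the one above can be produced, assuming a long-format data frame with columns criterion, source ('Human'/'LLM'), and midpoint; these names are assumptions, not the actual analysis code.

```python
# Sketch only: descriptive statistics (mean, sd, IQR, min) per criterion and source.
import pandas as pd

def describe_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """ratings: long format with columns 'criterion', 'source', 'midpoint' (0-100)."""
    def iqr(x: pd.Series) -> float:
        # Interquartile range: 75th minus 25th percentile.
        return x.quantile(0.75) - x.quantile(0.25)

    summary = (ratings
               .groupby(["criterion", "source"])["midpoint"]
               .agg(mean="mean", sd="std", iqr=iqr, min="min"))
    # One row per criterion, statistics split into Human / LLM columns.
    return summary.unstack("source").round(1)
```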

Measures of association & consistency

criteria        n    r     rho    MAE    a_LH    a_HH
adv_knowledge   39   0.32  0.49   13.22   0.05   0.19
claims          14   0.50  0.54   10.79   0.44   0.42
gp_relevance    40   0.24  0.39   13.28   0.02   0.34
logic_comms     40   0.03  0.25   13.93  -0.22   0.31
methods         39   0.34  0.55   13.36   0.18   0.54
open_sci        40   0.11  0.09   16.62   0.06   0.03
overall         40   0.43  0.67   12.59   0.12   0.52

(n = matched papers; r = Pearson correlation; rho = Spearman correlation; MAE = mean absolute error in rating points; a_LH, a_HH = Krippendorff’s α (interval) for the LLM–human and human–human pairings.)
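
A hedged sketch of how these association and consistency measures can be computed for one criterion, assuming a data frame with one row per matched paper and columns human and llm; the column names and the use of the third-party krippendorff package are assumptions.

```python
# Sketch only: association/consistency metrics for one criterion.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
import krippendorff  # third-party package: pip install krippendorff

def agreement_stats(paired: pd.DataFrame) -> dict:
    """paired: one row per matched paper, columns 'human' and 'llm' (0-100 midpoints)."""
    paired = paired.dropna(subset=["human", "llm"])
    r, _ = pearsonr(paired["human"], paired["llm"])
    rho, _ = spearmanr(paired["human"], paired["llm"])
    mae = (paired["human"] - paired["llm"]).abs().mean()
    # Krippendorff's alpha at the interval level, treating the human (mean)
    # rating and the LLM rating as two 'coders' of the same papers.
    alpha_lh = krippendorff.alpha(
        reliability_data=np.vstack([paired["human"], paired["llm"]]),
        level_of_measurement="interval",
    )
    return {"n": len(paired), "r": r, "rho": rho, "MAE": mae, "a_LH": alpha_lh}
```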

LLM ‘rationales’ and ‘tastes’: informal exploration

Journal tier correlations with quality metrics (orange: LLM)

LLM-human ratings gaps (informal explorations)

Papers with highest relative human vs. AI preference

Most human-preferred (Delta > 0):

  1. The animal welfare cost of meat: evidence from a survey of… (+26.0)
  2. Willful Ignorance and Moral Behavior (+3.0)
  3. Asymmetry in Civic Information: An Experiment on Tax Participation… (+2.0)
  4. Economic vs. Epidemiological Approaches to Measuring the Human… (+2.0)

Most AI-preferred (Delta < 0):

  1. Zero-Sum Thinking, the Evolution of Effort-Suppressing Beliefs, and… (-48.0)
  2. Replicability & Generalisability: A Guide to CEA discounts (-45.5)
  3. Pharmaceutical Pricing and R&D as a Global Public Good (-35.5)
  4. Accelerating Vaccine Innovation for Emerging Infectious Diseases via… (-28.0)
  5. Effects of Emigration on Rural Labor Markets (-21.0)
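
A minimal sketch of how such a gap ranking can be built, assuming Delta is defined as the human overall midpoint minus the LLM overall midpoint; the frame and column names are assumptions.

```python
# Sketch only: rank papers by the human-minus-LLM gap in overall midpoints.
import pandas as pd

def rank_gaps(merged: pd.DataFrame) -> pd.DataFrame:
    """merged: one row per paper, columns 'paper', 'human_overall', 'llm_overall'."""
    out = merged.assign(delta=merged["human_overall"] - merged["llm_overall"])
    # delta > 0: human evaluators rated the paper higher than the LLM did.
    return out.sort_values("delta", ascending=False)[["paper", "delta"]]
```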

Example: Williams et al. ’24, Global potential…

Ratings for Williams et al. (2024)

Criterion       Human rating   LLM rating
overall         50             94
methods         25             90
adv_knowledge   55             93
claims          30             92
gp_relevance    75             95
logic_comms     58             90
open_sci        62             78

LLM rationale excerpts

Overall

High-impact, policy-relevant mapping with fine spatial resolution, clear findings, and strong validation. Quantifies large carbon potential and offers actionable targeting. Some caveats on socioeconomic dynamics, leakage, and input data resolution heterogeneity temper certainty but not overall value.

Methods

Robust random forest modelling on millions of points; careful variable selection; validation including autocorrelation sensitivity; clear predictor set. Conservative choice to rely on biophysical variables aids robustness. Constraints: mixed input resolutions, historical training window, potential residual spatial bias.

Humans for ‘methods’ (paraphrased)

E1: 20 (15 - 25): ~“(1) the trained model does not enable prediction into the future given currently observed conditions, (2) the predictions do not represent pure biophysical potential, and (3) regrowth longevity is not addressed.”


E2: 30 (10 - 50): ~ Data Leakage Concerns; Variables potentially derived using outcome data; Variables potentially incorporating post-2000 outcome data; Underestimation of Uncertainty; Reliance on Older GFC Gain Data; Neglect of Intensive Margin Regrowth.

But GPT-5 Pro shows potential (chat mode test, web search off)

Prompt:

Your role – You are an academic expert as well as a practitioner across every relevant field – use all your knowledge and insight. You are acting as an expert research evaluator/reviewer. … Please check for methodological error/weaknesses in particular


It identifies a number of critiques that overlap with our evaluators’, e.g.:

  1. Post‑treatment leakage. “Land use/land cover” is from 2015 (within the outcome window 2000–2016) and is included in the final biophysical model (variable “landuse” in Extended Data Fig. 2). Because positives often became forest by 2015 while negatives did not, this can leak outcome information into x and inflate fit. …

Both evaluators identified this leakage issue, with E1 specifically mentioning the land use variable.

Our next steps (feedback invited)

Thank you!

Scan to visit
unjournal.org

References

Pataranutaporn, Pat, Nattavudh Powdthavee, Chayapatr Achiwaranguprok, and Pattie Maes, “Can AI solve the peer review crisis? A large-scale cross-model experiment of LLMs’ performance and biases in evaluating over 1000 economics papers,” 2025.
Zhang, Yaohui, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, and Weixin Liang, “From replication to redesign: Exploring pairwise comparisons for LLM-based peer review,” arXiv preprint arXiv:2506.11343, 2025.

GPT-5 vs GPT-5 Pro

Where should this be published: Unjournal

Where should this be published: GPT-5 Pro