Public Research Evaluation: Unjournal.org and LLMs

David Reinstein

The Unjournal

Valentin Klotzbücher

University of Basel & University Hospital Basel

Tianmai Michael Zhang

University of Washington

2025-10-17

The Unjournal

Public Journal-Independent Evaluation of Impactful Research

  • Focus: rigorous, policy‑relevant social science and economics
  • Not a journal. No accept/reject. We publish evaluations and ratings, not papers
  • Everything citable and reusable

Detour: The Unjournal’s mission, approach, and progress

How it works

Evaluations:

  • Narrative reports focusing on credibility and usefulness

  • Claim identification & assessment

  • Structured percentile ratings: overall + 7 criteria

  • Journal tier ratings and predictions


Publish & update: PubPub package, DOIs, summaries, author responses

Quantitative ratings

Percentile ratings: For each of the criteria below, please rank this paper relative to all serious research in the same area that you have encountered in the last three years.

I. Overall assessment — Heuristic judgment of quality, credibility, importance

  • Midpoint rating
  • 90% credible interval (CI)

Quantitative ratings (0 - 100 pctl.)


  II. Claims, strength and characterization of evidence

  III. Methods: Justification, reasonableness, validity, robustness assessment

  IV. Advancing knowledge & practice

Quant. ratings (0 - 100)


  V. Logic & communication

  VI. Open, collaborative, replicable

  VII. Relevance to global priorities, usefulness for practitioners

Journal ranking tiers (0.0–5.0)


  1. Normative: Where this should be published if the process were fair, unbiased, and ~aligned to the category metrics above.

  2. Predictive: Where this will likely be published given real‑world processes.

Output: Example (mean human eval. ratings)

Our goals and questions: General

  1. Measure the value of frontier AI research-eval. vs. human peer review
  2. Compare methods for AI research-evaluation, find the best approaches
  3. Characterize frontier AI ‘preferences & tastes’ over research
  4. Assess hybrid human-AI performance (later)

Literature

Pataranutaporn et al. (2025)


1220 papers “from 110 economics journals excluded from the training data of current LLMs”

LLMs asked to “make a publication recommendation” on a 6-point scale from “1 = Definite Reject…” to “6 = Accept As Is…”.


Also, a 10-point scale for originality, rigor, scope, impact…

And 330 papers w/ various ‘remove/change names/institutions’ treatments


Results: align w/ RePEc journal rankings; evidence of bias (gender, institution)

Zhang et al. (2025)


AI conference paper data

LLM agents performed pairwise comparisons among manuscripts


Accept/reject decisions: “agreement between AI & human reviewers… roughly on par with the consistency observed between independent human review committees”


Again, evidence of bias / ~statistical discrimination toward, e.g., “papers from established research institutions”

Our context, advantages

  • Actual human evaluations of research

  • In econ. & adjacent

  • Reports, ~benchmarked ratings, predictions (journal-tier ground-truth)


Potential:

  • Train & benchmark models on human evals

  • Future eval. pipeline: clean out-of-time predictions & evaluation

  • Multi-armed trials on human/hybrid evals (cf. Brodeur et al., 2025; Qazi et al., 2025)

Our initial LLM pipeline

Prompt

Copied from the Unjournal evaluator guidelines; removed mentions of “Unjournal”; added a ‘role’ statement & ~debiasing instructions

Your role – You are an academic expert as well as a practitioner across every relevant field – use all your knowledge and insight. You are acting as an expert research evaluator/reviewer.

Do not look at any existing ratings or evaluations of these papers you might find on the internet or in your corpus, do not use the authors’ names, status, or institutions in your judgment – give these ratings based on the content of the papers alone; do the assessment based on your knowledge and insights.

  • API calls via Python script

  • PDFs passed directly to model (direct ingestion, includes tables, graphs…)

  • Enforce a JSON Schema: midpoint and interval bounds for each criterion, plus a short rationale (see the sketch below)

  • Model: GPT-5 Pro (GPT-5)
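
A minimal sketch of what such a pipeline can look like, for illustration only. The file name, schema field names, and the model identifier are assumptions rather than the actual code; the OpenAI Responses API is used here as one way to combine direct PDF ingestion with a strictly enforced JSON Schema.

```python
# Sketch only: send a paper PDF plus the evaluator prompt, force a JSON reply
# with a midpoint, interval bounds, and a short rationale per criterion.
import json

from openai import OpenAI

client = OpenAI()

CRITERIA = ["overall", "claims", "methods", "adv_knowledge",
            "logic_comms", "open_sci", "gp_relevance"]

# Each criterion: midpoint, credible-interval bounds (0-100), short rationale.
per_criterion = {
    "type": "object",
    "properties": {
        "midpoint": {"type": "number"},
        "lower_bound": {"type": "number"},
        "upper_bound": {"type": "number"},
        "rationale": {"type": "string"},
    },
    "required": ["midpoint", "lower_bound", "upper_bound", "rationale"],
    "additionalProperties": False,
}
schema = {
    "type": "object",
    "properties": {c: per_criterion for c in CRITERIA},
    "required": CRITERIA,
    "additionalProperties": False,
}

PROMPT = "Your role – You are an academic expert ..."  # full evaluator prompt, as quoted above

# Direct PDF ingestion: upload the paper and attach it to the request.
uploaded = client.files.create(file=open("paper.pdf", "rb"), purpose="user_data")

response = client.responses.create(
    model="gpt-5-pro",  # assumed model identifier
    input=[{
        "role": "user",
        "content": [
            {"type": "input_file", "file_id": uploaded.id},
            {"type": "input_text", "text": PROMPT},
        ],
    }],
    # Structured output: the reply must conform to the schema above.
    text={"format": {"type": "json_schema", "name": "research_ratings",
                     "schema": schema, "strict": True}},
)

ratings = json.loads(response.output_text)
print(ratings["overall"])  # e.g. {'midpoint': ..., 'lower_bound': ..., ...}
```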

Initial concrete questions (~exploratory/descriptive)

  1. Overall characteristics of ratings

  2. Associations between (untrained) LLM & human ratings & predictions

  3. Informal exploration of “strong LLM-human ratings gaps”

See all results-in-progress here

Overall ratings: Unjournal

Overall ratings: Unjournal vs GPT-5 Pro

  • Matched papers: 40
  • Pearson r: 0.43
  • Spearman rho: 0.67
  • Krippendorff α (interval): 0.12
  • MAE (points): 12.6

Overall characteristics of ratings

criteria        mean (Human/LLM)   sd (Human/LLM)   iqr (Human/LLM)   min (Human/LLM)
adv_knowledge        73 / 84           15 / 9           15 / 12           25 / 60
claims               74 / 83           16 / 10          18 / 12           30 / 48
gp_relevance         78 / 85           15 / 10          19 / 14           12 / 55
logic_comms          75 / 86           13 / 6            9 / 5            30 / 64
methods              70 / 80           18 / 11          18 / 13           10 / 37
open_sci             72 / 64           16 / 13          18 / 12           10 / 30
overall              74 / 85           13 / 8           14 / 12           32 / 54
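
For illustration, a short pandas sketch of how a summary like the one above can be produced, assuming a long-format data frame with columns criterion, source ('Human'/'LLM'), and midpoint; these names are assumptions, not the actual analysis code.

```python
# Sketch only: descriptive statistics (mean, sd, IQR, min) per criterion and source.
import pandas as pd

def describe_ratings(ratings: pd.DataFrame) -> pd.DataFrame:
    """ratings: long format with columns 'criterion', 'source', 'midpoint' (0-100)."""
    def iqr(x: pd.Series) -> float:
        # Interquartile range: 75th minus 25th percentile.
        return x.quantile(0.75) - x.quantile(0.25)

    summary = (ratings
               .groupby(["criterion", "source"])["midpoint"]
               .agg(mean="mean", sd="std", iqr=iqr, min="min"))
    # One row per criterion, statistics split into Human / LLM columns.
    return summary.unstack("source").round(1)
```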

Measures of association & consistency

criteria        n    r     rho    MAE    a_LH    a_HH
adv_knowledge   39   0.32  0.49   13.22   0.05   0.19
claims          14   0.50  0.54   10.79   0.44   0.42
gp_relevance    40   0.24  0.39   13.28   0.02   0.34
logic_comms     40   0.03  0.25   13.93  -0.22   0.31
methods         39   0.34  0.55   13.36   0.18   0.54
open_sci        40   0.11  0.09   16.62   0.06   0.03
overall         40   0.43  0.67   12.59   0.12   0.52

(n = matched papers; r = Pearson correlation; rho = Spearman correlation; MAE = mean absolute error in rating points; a_LH, a_HH = Krippendorff’s α (interval) for the LLM–human and human–human pairings.)
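
A hedged sketch of how these association and consistency measures can be computed for one criterion, assuming a data frame with one row per matched paper and columns human and llm; the column names and the use of the third-party krippendorff package are assumptions.

```python
# Sketch only: association/consistency metrics for one criterion.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr
import krippendorff  # third-party package: pip install krippendorff

def agreement_stats(paired: pd.DataFrame) -> dict:
    """paired: one row per matched paper, columns 'human' and 'llm' (0-100 midpoints)."""
    paired = paired.dropna(subset=["human", "llm"])
    r, _ = pearsonr(paired["human"], paired["llm"])
    rho, _ = spearmanr(paired["human"], paired["llm"])
    mae = (paired["human"] - paired["llm"]).abs().mean()
    # Krippendorff's alpha at the interval level, treating the human (mean)
    # rating and the LLM rating as two 'coders' of the same papers.
    alpha_lh = krippendorff.alpha(
        reliability_data=np.vstack([paired["human"], paired["llm"]]),
        level_of_measurement="interval",
    )
    return {"n": len(paired), "r": r, "rho": rho, "MAE": mae, "a_LH": alpha_lh}
```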

LLM ‘rationales’ and ‘tastes’: informal exploration

Journal tier correlations with quality metrics (orange: LLM)

LLM-human ratings gaps (informal explorations)

Papers with highest relative human vs. AI preference

Most human-preferred (Delta > 0):

  1. The animal welfare cost of meat: evidence from a survey of… (+26.0)
  2. Willful Ignorance and Moral Behavior (+3.0)
  3. Asymmetry in Civic Information: An Experiment on Tax Participation… (+2.0)
  4. Economic vs. Epidemiological Approaches to Measuring the Human… (+2.0)

Most AI-preferred (Delta < 0):

  1. Zero-Sum Thinking, the Evolution of Effort-Suppressing Beliefs, and… (-48.0)
  2. Replicability & Generalisability: A Guide to CEA discounts (-45.5)
  3. Pharmaceutical Pricing and R&D as a Global Public Good (-35.5)
  4. Accelerating Vaccine Innovation for Emerging Infectious Diseases via… (-28.0)
  5. Effects of Emigration on Rural Labor Markets (-21.0)
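
A minimal sketch of how such a gap ranking can be built, assuming Delta is defined as the human overall midpoint minus the LLM overall midpoint; the frame and column names are assumptions.

```python
# Sketch only: rank papers by the human-minus-LLM gap in overall midpoints.
import pandas as pd

def rank_gaps(merged: pd.DataFrame) -> pd.DataFrame:
    """merged: one row per paper, columns 'paper', 'human_overall', 'llm_overall'."""
    out = merged.assign(delta=merged["human_overall"] - merged["llm_overall"])
    # delta > 0: human evaluators rated the paper higher than the LLM did.
    return out.sort_values("delta", ascending=False)[["paper", "delta"]]
```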

Example: Williams et al. ’24, Global potential…

Ratings for Williams et al. (2024)

Criterion       Human rating   LLM rating
overall         50             94
methods         25             90
adv_knowledge   55             93
claims          30             92
gp_relevance    75             95
logic_comms     58             90
open_sci        62             78

LLM rationale excerpts

Overall

High-impact, policy-relevant mapping with fine spatial resolution, clear findings, and strong validation. Quantifies large carbon potential and offers actionable targeting. Some caveats on socioeconomic dynamics, leakage, and input data resolution heterogeneity temper certainty but not overall value.

Methods

Robust random forest modelling on millions of points; careful variable selection; validation including autocorrelation sensitivity; clear predictor set. Conservative choice to rely on biophysical variables aids robustness. Constraints: mixed input resolutions, historical training window, potential residual spatial bias.

Humans for ‘methods’ (paraphrased)

E1: 20 (15 - 25): ~“(1) the trained model does not enable prediction into the future given currently observed conditions, (2) the predictions do not represent pure biophysical potential, and (3) regrowth longevity is not addressed.”


E2: 30 (10 - 50): ~ Data Leakage Concerns; Variables potentially derived using outcome data; Variables potentially incorporating post-2000 outcome data; Underestimation of Uncertainty; Reliance on Older GFC Gain Data; Neglect of Intensive Margin Regrowth.

But GPT-5 Pro shows potential (chat mode test, web search off)

Prompt:

Your role – You are an academic expert as well as a practitioner across every relevant field – use all your knowledge and insight. You are acting as an expert research evaluator/reviewer. … Please check for methodological error/weaknesses in particular


It identifies a number of critiques that overlap with our evaluators’, e.g.:

  1. Post‑treatment leakage. “Land use/land cover” is from 2015 (within the outcome window 2000–2016) and is included in the final biophysical model (variable “landuse” in Extended Data Fig. 2). Because positives often became forest by 2015 while negatives did not, this can leak outcome information into x and inflate fit. …

Both evaluators identified this leakage issue, with E1 specifically mentioning the land use variable.

Our next steps (feedback invited)

Thank you!

Scan to visit
unjournal.org

References

Pataranutaporn, Pat, Nattavudh Powdthavee, Chayapatr Achiwaranguprok, and Pattie Maes, “Can AI solve the peer review crisis? A large-scale cross-model experiment of LLMs’ performance and biases in evaluating over 1000 economics papers,” 2025.
Zhang, Yaohui, Haijing Zhang, Wenlong Ji, Tianyu Hua, Nick Haber, Hancheng Cao, and Weixin Liang, “From replication to redesign: Exploring pairwise comparisons for LLM-based peer review,” arXiv preprint arXiv:2506.11343, 2025.

GPT-5 vs GPT-5 Pro

Where should this be published: Unjournal

Where should this be published: GPT-5 Pro