2025-10-17
Public Journal-Independent Evaluation of Impactful Research
Evaluations:
Narrative reports focusing on credibility and usefulness
Claim ID & assessment
Structured percentile ratings: overall + 7 criteria
Journal tier ratings and predictions
Publish & update: PubPub package, DOIs, summaries, author responses
Percentile ratings: For each of the criteria below, please rank this paper relative to all serious research in the same area that you have encountered in the last three years.
I. Overall assessment — Heuristic judgment of quality, credibility, importance
V. Open, collaborative, replicable
Pataranutaporn et al. (2025)
1220 papers “from 110 economics journals excluded from the training data of current LLMs”
LLMs asked: “make a publication recommendation” … 6-point scale from “1 = Definite Reject…” to “6 = Accept As Is…”.
Also, a 10-point scale for originality, rigor, scope, impact…
And 330 papers w/ various ‘remove/change names/institutions’ treatments
Results: align w/ journals’ RePEc rankings; evidence of bias (gender, institution)
Zhang et al. (2025)
AI conference paper data
LLM agents performed pairwise comparisons among manuscripts
Accept/reject decisions: “agreement between AI & human reviewers… roughly on par with the consistency observed between independent human review committees”
Again, evidence of biases/~statistical discrimination towards, e.g., “papers from established research institutions”
Actual human evaluations of research
In econ. & adjacent
Reports, ~benchmarked ratings, predictions (journal-tier ground-truth)
Potential:
Train & benchmark models on human evals
Future eval. pipeline: clean out-of-time predictions & evaluation
Multi-armed trials on human/hybrid evals (cf. Brodeur et al., ’25; Qazi et al., ’25)
Prompt
Copied from Unjournal evaluator guidelines, removed mention of “Unjournal”, added ‘role’ & ~debiasing instructions
Your role – You are an academic expert as well as a practitioner across every relevant field – use all your knowledge and insight. You are acting as an expert research evaluator/reviewer.
Do not look at any existing ratings or evaluations of these papers you might find on the internet or in your corpus, do not use the authors’ names, status, or institutions in your judgment – give these ratings based on the content of the papers alone; do the assessment based on your knowledge and insights.
API calls via Python script
PDFs passed directly to model (direct ingestion, includes tables, graphs…)
Enforce a JSON Schema: midpoint and bounds for each criterion, plus a short rationale
Model: GPT-5 Pro (GPT-5)
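A minimal sketch of the schema-enforcement step described above. The criterion names are taken from the results tables in this deck; the field names (`midpoint`, `lower`, `upper`, `rationale`) and the validator itself are illustrative assumptions, not the actual pipeline code.

```python
# Hypothetical shape of the structured response enforced on each API call:
# for every criterion, a percentile midpoint, an interval (lower/upper
# bound), and a short rationale. Field names are assumptions.

CRITERIA = [
    "overall", "adv_knowledge", "claims", "gp_relevance",
    "logic_comms", "methods", "open_sci",
]

def validate_rating(rating: dict) -> bool:
    """Check one criterion's rating: bounds ordered, midpoint inside them."""
    required = {"midpoint", "lower", "upper", "rationale"}
    if not required <= rating.keys():
        return False
    lo, mid, hi = rating["lower"], rating["midpoint"], rating["upper"]
    return (
        all(0 <= x <= 100 for x in (lo, mid, hi))  # percentiles on 0-100
        and lo <= mid <= hi
        and isinstance(rating["rationale"], str)
    )

def validate_response(response: dict) -> bool:
    """Check the full structured response covers every criterion."""
    return all(c in response and validate_rating(response[c]) for c in CRITERIA)
```

In practice the schema would be passed to the API's structured-output mechanism rather than checked post hoc, but a local validator like this is a cheap guard against malformed responses.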
See all results-in-progress here
| criteria (metric) | mean: Human | mean: LLM | sd: Human | sd: LLM | iqr: Human | iqr: LLM | min: Human | min: LLM |
|---|---|---|---|---|---|---|---|---|
| adv_knowledge | 73 | 84 | 15 | 9 | 15 | 12 | 25 | 60 |
| claims | 74 | 83 | 16 | 10 | 18 | 12 | 30 | 48 |
| gp_relevance | 78 | 85 | 15 | 10 | 19 | 14 | 12 | 55 |
| logic_comms | 75 | 86 | 13 | 6 | 9 | 5 | 30 | 64 |
| methods | 70 | 80 | 18 | 11 | 18 | 13 | 10 | 37 |
| open_sci | 72 | 64 | 16 | 13 | 18 | 12 | 10 | 30 |
| overall | 74 | 85 | 13 | 8 | 14 | 12 | 32 | 54 |
| criteria | n | r | rho | MAE | a_LH | a_HH |
|---|---|---|---|---|---|---|
| adv_knowledge | 39 | 0.32 | 0.49 | 13.22 | 0.05 | 0.19 |
| claims | 14 | 0.50 | 0.54 | 10.79 | 0.44 | 0.42 |
| gp_relevance | 40 | 0.24 | 0.39 | 13.28 | 0.02 | 0.34 |
| logic_comms | 40 | 0.03 | 0.25 | 13.93 | -0.22 | 0.31 |
| methods | 39 | 0.34 | 0.55 | 13.36 | 0.18 | 0.54 |
| open_sci | 40 | 0.11 | 0.09 | 16.62 | 0.06 | 0.03 |
| overall | 40 | 0.43 | 0.67 | 12.59 | 0.12 | 0.52 |
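A minimal pure-Python sketch of how the correlation and error columns above (Pearson r, Spearman rho, MAE) can be computed from paired human/LLM ratings. The function names are mine; the actual analysis presumably uses a standard stats library.

```python
import statistics as st

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    mx, my = st.mean(x), st.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def _ranks(x):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson r applied to the ranks."""
    return pearson_r(_ranks(x), _ranks(y))

def mae(x, y):
    """Mean absolute error between paired ratings."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```

Spearman rho being higher than Pearson r for most criteria (e.g. 0.67 vs 0.43 overall) is consistent with the LLM preserving the ordering of papers better than their absolute percentile levels.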
| Most human-preferred (Delta > 0) | Most AI-preferred (Delta < 0) |
|---|---|
| Criterion | Human rating | LLM rating |
|---|---|---|
| overall | 50 | 94 |
| methods | 25 | 90 |
| adv_knowledge | 55 | 93 |
| claims | 30 | 92 |
| gp_relevance | 75 | 95 |
| logic_comms | 58 | 90 |
| open_sci | 62 | 78 |
Overall
High-impact, policy-relevant mapping with fine spatial resolution, clear findings, and strong validation. Quantifies large carbon potential and offers actionable targeting. Some caveats on socioeconomic dynamics, leakage, and input data resolution heterogeneity temper certainty but not overall value.
Methods
Robust random forest modelling on millions of points; careful variable selection; validation including autocorrelation sensitivity; clear predictor set. Conservative choice to rely on biophysical variables aids robustness. Constraints: mixed input resolutions, historical training window, potential residual spatial bias.
Humans for ‘methods’ (paraphrased)
E1: 20 (15 - 25): ~“(1) the trained model does not enable prediction into the future given currently observed conditions, (2) the predictions do not represent pure biophysical potential, and (3) regrowth longevity is not addressed.”
E2: 30 (10 - 50): ~ Data Leakage Concerns, Variables potentially derived using outcome data, Variables potentially incorporating post-2000 outcome data, Underestimation of Uncertainty, Reliance on Older GFC Gain Data, Neglect of Intensive Margin Regrowth.
Prompt:
Your role – You are an academic expert as well as a practitioner across every relevant field – use all your knowledge and insight. You are acting as an expert research evaluator/reviewer. … Please check for methodological error/weaknesses in particular
The LLM identifies a number of critiques that overlap with our evaluators’, e.g.:
Both evaluators identified this leakage issue, E1 specifically mentioning the land use variable.
unjournal.org