library("here")library("janitor")library("readr")research <-read_csv(here("data", "research.csv"), show_col_types =FALSE)|>clean_names()# research <- research |># filter(status == "50_published evaluations (on PubPub, by Unjournal)")currentmodel ="GPT-5"
LLM‑generated evaluations produced by GPT-5 using a structured JSON‑schema prompt.
Unjournal.org evaluations
We use The Unjournal's public data as the baseline for comparison. In The Unjournal's process, each paper is typically reviewed by 1–3 expert evaluators, who provide quantitative ratings on a 0–100 percentile scale for each criterion (with 90% credible intervals) plus a written evaluation. We extracted these ratings from The Unjournal's records (an evaluator form database and published review reports). For our analysis, we aggregated the human ratings for each paper by computing the average score per criterion (and noting the range of individual scores). All selected papers had completed Unjournal reviews, meaning the authors received a full evaluation on the Unjournal platform. The sample comprises 49 working papers from 2017–2025 in development economics, growth, health policy, environmental economics, and related fields that The Unjournal identified as high-impact. Each paper has quantitative scores from at least one human evaluator, and many have multiple (2–3) human ratings.
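As an illustration of that aggregation step, here is a minimal pandas sketch; the file path and column names (`paper`, `criterion`, `midpoint`) are hypothetical stand-ins for the actual evaluator-form export.

```{python}
#| eval: false
#| code-fold: true
# Sketch of the per-paper, per-criterion aggregation described above.
# "data/human_ratings_long.csv" and its columns (paper, criterion, midpoint)
# are hypothetical stand-ins for the evaluator-form export.
import pandas as pd

ratings = pd.read_csv("data/human_ratings_long.csv")

human_summary = (
    ratings.groupby(["paper", "criterion"])["midpoint"]
    .agg(mean_score="mean", min_score="min", max_score="max", n_evaluators="count")
    .reset_index()
)
```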
## LLM-based evaluation

### Quantitative ratings
Following The Unjournal's [standard guidelines for evaluators](https://globalimpact.gitbook.io/the-unjournal-project-and-communication-space/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators) and their [academic evaluation form](https://coda.io/form/Unjournal-Evaluation-form-academic-stream-Coda-updated-version_dGjfMZ1yXME), evaluators are asked to consider each paper along the following dimensions: **claims & evidence**, **methods**, **logic & communication**, **open science**, **global relevance**, and an **overall** assessment. Ratings are interpreted as percentiles relative to serious recent work in the same area. For each metric, evaluators are asked to provide a midpoint and a 90% credible interval to communicate uncertainty.
```{python}
#| label: setup-env
#| eval: false
#| code-fold: true
import os, pathlib, json, textwrap, re
import pdfplumber, time, tiktoken
import pandas as pd, numpy as np
from openai import OpenAI
import altair as alt; alt.renderers.enable("html")
import plotly.io as pio
import plotly.graph_objects as go
from tqdm import tqdm
from IPython.display import Markdown, display

# Setup chunk:
# * loads the main libraries (OpenAI SDK, Altair, Plotly, …)
# * looks for your OpenAI key in **key/openai_key.txt**
# * initialises a client (`gpt-5`, via `UJ_MODEL`)
# * defines `pdf_to_string()` — drops the reference section and hard-caps at 180k tokens so we stay in-context.

# **API key hygiene.**
# The code only *reads* `key/openai_key.txt` if the environment variable isn't already set.
# Keep this file out of version control (adjust gitignore).

# ---------- API key ----------
key_path = pathlib.Path("key/openai_key.txt")
if os.getenv("OPENAI_API_KEY") is None and key_path.exists():
    os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("No API key found. Put it in key/openai_key.txt or set OPENAI_API_KEY.")
client = OpenAI()

# ---------- Model (GPT-5) ----------
MODEL = os.getenv("UJ_MODEL", "gpt-5")

# ---------- PDF → text with a fixed tokenizer ----------
_enc = tiktoken.get_encoding("o200k_base")  # large-context encoder

def pdf_to_string(path, max_tokens=180_000):
    """Extract text, drop references section, hard-cap by tokens."""
    with pdfplumber.open(path) as pdf:
        text = " ".join(p.extract_text() or "" for p in pdf.pages)
    text = re.sub(r"\s+", " ", text)
    m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)
    if m:
        text = text[: m.start()]
    toks = _enc.encode(text)
    if len(toks) > max_tokens:
        text = _enc.decode(toks[:max_tokens])
    return text
```
Response schema & system prompt: This section defines the contract we expect back from the model:
METRICS lists the categories we score.
metric_schema specifies the shape and bounds of each metric.
response_format (JSON Schema) asks the model to return only a JSON object conforming to that schema.
SYSTEM_PROMPT explains scoring semantics (percentiles + 90% credible intervals) and how to set the overall score.
```{python}
#| label: llm-setup
#| eval: false
#| code-fold: show
# -----------------------------
# 1. Metric list
# -----------------------------
METRICS = [
    "overall",
    "claims_evidence",
    "methods",
    "advancing_knowledge",
    "logic_communication",
    "open_science",
    "global_relevance",
]

# -----------------------------
# 2. JSON schema
# -----------------------------
metric_schema = {
    "type": "object",
    "properties": {
        "midpoint": {"type": "integer", "minimum": 0, "maximum": 100},
        "lower_bound": {"type": "integer", "minimum": 0, "maximum": 100},
        "upper_bound": {"type": "integer", "minimum": 0, "maximum": 100},
        "rationale": {"type": "string"},
    },
    "required": ["midpoint", "lower_bound", "upper_bound", "rationale"],
    "additionalProperties": False,
}

response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "paper_assessment_v1",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "metrics": {
                    "type": "object",
                    "properties": {m: metric_schema for m in METRICS},
                    "required": METRICS,
                    "additionalProperties": False,
                }
            },
            "required": ["metrics"],
            "additionalProperties": False,
        },
    },
}

# -----------------------------
# 3. System prompt
# -----------------------------
SYSTEM_PROMPT = textwrap.dedent(f"""
    You are an expert evaluator.

    We ask for a set of quantitative metrics. For each metric, we ask for a score and a 90% credible interval.

    Percentile rankings
    We ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on. The reference is all serious research in the same area in the last three years.

    Midpoint rating and credible intervals
    For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. We want policymakers, researchers, funders, and managers to be able to use evaluations to update their beliefs and make better decisions. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.

    Overall assessment
    - Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, practical relevance and usefulness, importance to knowledge production, and importance to practice.

    Claims, strength and characterization of evidence
    - Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?

    Methods: justification, reasonableness, validity, robustness
    - Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable?
    - Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?
    - Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, did they do a strong pre-registration and pre-analysis plan, incorporate multiple hypothesis testing corrections, and report flexible specifications?

    Advancing our knowledge and practice
    - To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?
    - Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions?
    - Does the project add useful value to other impactful research? We don't require surprising results; sound and well-presented null results can also be valuable.

    Logic and communication
    - Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?
    - Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?
    - Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims?
    - Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?

    Open, collaborative, replicable research
    - Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?
    - Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?
    - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?
    - Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?

    Relevance to global priorities, usefulness for practitioners
    - Are the paper's chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions?
    - Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic?
    - Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization?
    - Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?

    Return STRICT JSON matching the supplied schema.
    Fill every key in the object `metrics`: {', '.join(METRICS)}

    Definitions are percentile scores (0 – 100) versus serious work in the field from the last 3 years.

    For `overall`:
      • Default = arithmetic mean of the other six midpoints (rounded).
      • If, in your judgment, another value is better (e.g. one metric is far more decision-relevant), choose it and explain why in `overall.rationale`.

    Field meanings
      midpoint    → best-guess percentile
      lower_bound → 5th-percentile plausible value
      upper_bound → 95th-percentile plausible value
      rationale   → ≤100 words; terse but informative.

    Do not wrap the JSON in markdown fences or add extra text.
""").strip()
```
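For concreteness, here is an invented example of a response that conforms to `paper_assessment_v1` (truncated; with `strict: True` the model must return all seven metrics):

```{python}
#| eval: false
# Invented example of a schema-conforming response (truncated for brevity;
# with strict=True the model must return all seven metrics).
example_response = {
    "metrics": {
        "overall": {
            "midpoint": 58,
            "lower_bound": 40,
            "upper_bound": 72,
            "rationale": "Solid design, limited external validity.",
        },
        "methods": {
            "midpoint": 55,
            "lower_bound": 35,
            "upper_bound": 70,
            "rationale": "Reasonable identification; robustness checks are thin.",
        },
        # ... claims_evidence, advancing_knowledge, logic_communication,
        #     open_science, global_relevance follow the same shape.
    }
}
```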
Below is the main helper that runs a single evaluation:

- extracts text with `pdf_to_string()`,
- calls Chat Completions with the strict JSON Schema in `response_format`,
- parses the returned JSON string into a Python dict.
```{python}
#| label: eval-helper-function
#| eval: false
#| code-fold: true
def evaluate_paper(pdf_path: str | pathlib.Path, model: str = MODEL) -> dict:
    paper_text = pdf_to_string(pdf_path)
    chat = client.chat.completions.create(
        model=model,                      # "gpt-5"
        response_format=response_format,  # your JSON-Schema defined above
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": paper_text},
        ],
        # max_tokens=2500,  # optional cap for the JSON block
    )
    raw_json = chat.choices[0].message.content
    return json.loads(raw_json)


# Minimal probe using Chat Completions (works on gpt-5 and o3).
# A quick "smoke test" to confirm the client is alive and willing to emit JSON.
# This intentionally uses the simple `json_object` format for wide model compatibility.
MODEL = "gpt-5"  # "o3"
probe = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "system", "content": "Return only valid JSON: {\"ok\": true}"},
        {"role": "user", "content": "Return {\"ok\": true} only."},
    ],
    # Wide compatibility: ask for a JSON object (no schema)
    response_format={"type": "json_object"},
)
print(probe.choices[0].message.content)  # should print {"ok": true}
```
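As a quick usage sketch (the file name below is a hypothetical example):

```{python}
#| eval: false
#| code-fold: true
# Usage sketch: score a single paper and print midpoints with 90% CIs.
# "papers/example_paper.pdf" is a hypothetical file name.
result = evaluate_paper("papers/example_paper.pdf")
for metric, vals in result["metrics"].items():
    print(f"{metric}: {vals['midpoint']} [{vals['lower_bound']}, {vals['upper_bound']}]")
```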
Batch-evaluate all PDFs: This loop walks every `*.pdf` in `papers/`, calls `evaluate_paper()`, adds a 0.8 s sleep between calls to stay under rate limits, and collects the raw JSON dicts in memory (`records`). The nested `records` structure is then converted to a tidy long format and written to `results/metrics_long.csv`.
```{python}
#| label: eval-many-metrics
#| eval: false
#| code-fold: true
ROOT = pathlib.Path("papers")
OUT = pathlib.Path("results")
OUT.mkdir(exist_ok=True)

pdfs = sorted(ROOT.glob("*.pdf"))
records = []
for pdf in tqdm(pdfs, desc="Metrics"):
    try:
        res = evaluate_paper(pdf)   # <-- API call
        res["paper"] = pdf.stem
        records.append(res)
        time.sleep(0.8)
    except Exception as e:
        print(f"⚠️ {pdf.name}: {e}")

tidy_rows = []
for rec in records:
    paper_id = rec["paper"]
    for metric, vals in rec["metrics"].items():
        tidy_rows.append({
            "paper": paper_id,
            "metric": metric,
            **vals,  # midpoint, lower_bound, upper_bound, rationale
        })
tidy = pd.DataFrame(tidy_rows)
tidy.to_csv(OUT / "metrics_long.csv", index=False)
# tidy.head()

p = pathlib.Path("results/metrics_long.csv")
if not p.exists():
    display(Markdown(
        "::: {.callout-warning}\n**No results yet.** Run the batch evaluation "
        "to generate `results/metrics_long.csv`.\n:::"
    ))
else:
    tidy = pd.read_csv(p)

    # Basic counts
    n_papers = tidy["paper"].nunique()
    n_rows = len(tidy)
    n_metrics = tidy["metric"].nunique()

    # Overall-only slice
    overall = tidy[tidy["metric"] == "overall"].copy()
    mean_overall = overall["midpoint"].mean() if not overall.empty else np.nan
    med_overall = overall["midpoint"].median() if not overall.empty else np.nan

    display(Markdown(
        f"**Batch size:** {n_papers} papers \n"
        f"**Total metric rows:** {n_rows:,} across {n_metrics} metrics \n"
        f"**Overall ratings:** mean = {mean_overall:.1f}, median = {med_overall:.1f}"
    ))
```
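To line these scores up with the human ratings described earlier, here is a minimal sketch; the file `results/human_means.csv` and its columns (`paper`, `metric`, `human_mean`) are hypothetical placeholders for however the aggregated Unjournal averages are stored.

```{python}
#| eval: false
#| code-fold: true
# Sketch: join LLM midpoints with per-criterion human averages.
# "results/human_means.csv" (columns: paper, metric, human_mean) is a
# hypothetical placeholder for the aggregated Unjournal ratings.
import pandas as pd

llm_long = pd.read_csv("results/metrics_long.csv")
human_means = pd.read_csv("results/human_means.csv")

merged = llm_long.merge(human_means, on=["paper", "metric"], how="inner")
merged["diff"] = merged["midpoint"] - merged["human_mean"]

# One row per paper, one column per metric (LLM midpoints only)
llm_wide = (
    llm_long.pivot(index="paper", columns="metric", values="midpoint")
    .add_prefix("llm_")
    .reset_index()
)
```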
### Journal ranking tiers

- What journal ranking tier should this work be published in? (0.0–5.0)
- What journal ranking tier will this work be published in? (0.0–5.0)
```{python}
#| label: tiers-setup
#| eval: false
#| code-fold: true
from __future__ import annotations

import os, pathlib, json, textwrap, re, time, math, random, hashlib, threading
from typing import Any, Dict, Optional
import pdfplumber, tiktoken
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

# Reuse your existing OpenAI client + MODEL set earlier:
# client = OpenAI()
# MODEL = os.getenv("UJ_MODEL", "gpt-5")

# ─────────────────────────────────────────────────────────────────────────────
# Tokenizer + compact extractor for tiering
# ─────────────────────────────────────────────────────────────────────────────
_enc = tiktoken.get_encoding("o200k_base")

def _truncate_tokens(txt: str, max_tokens: int) -> str:
    toks = _enc.encode(re.sub(r"\s+", " ", txt))
    if len(toks) > max_tokens:
        toks = toks[:max_tokens]
    return _enc.decode(toks)

def pdf_to_tier_text(path, max_tokens: int = 25_000) -> str:
    """
    Cheaper extract: try to keep Abstract, Introduction, Results/Discussion, Conclusion.
    Falls back gracefully to the whole text (minus references), then truncates by tokens.
    """
    try:
        with pdfplumber.open(path) as pdf:
            full = " ".join(p.extract_text() or "" for p in pdf.pages)
    except Exception:
        # Fall back to the generic extractor in the earlier chunk, if defined:
        return pdf_to_string(path, max_tokens=max_tokens)  # type: ignore

    text = re.sub(r"\s+", " ", full)

    # Drop references/bibliography
    m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)
    if m:
        text = text[: m.start()]

    # Heuristic section pulls
    chunks = []

    def _grab(label, next_labels=("Introduction", "1 ", "I.", "Background", "Data",
                                  "Model", "Method", "Methods", "Approach")):
        pat = rf"\b{label}\b"
        m = re.search(pat, text, flags=re.I)
        if not m:
            return None
        start = m.start()
        # stop at next section-ish header
        next_pat = r"|".join([rf"\b{re.escape(nl)}\b" for nl in next_labels])
        m2 = re.search(next_pat, text[m.end():], flags=re.I)
        end = m.end() + (m2.start() if m2 else 5000)  # cap a bit if no clear end
        return text[start:end]

    for lab in ["Abstract", "Summary"]:
        s = _grab(lab)
        if s: chunks.append(s)
    for lab in ["Introduction", "Motivation", "Overview"]:
        s = _grab(lab)
        if s: chunks.append(s)
    for lab in ["Results", "Findings", "Discussion"]:
        s = _grab(lab)
        if s: chunks.append(s)
    for lab in ["Conclusion", "Conclusions", "Policy Implications", "Implications"]:
        s = _grab(lab)
        if s: chunks.append(s)

    if not chunks:
        candidate = text
    else:
        candidate = " ".join(chunks)
    return _truncate_tokens(candidate, max_tokens=max_tokens)

# ─────────────────────────────────────────────────────────────────────────────
# JSON schema + prompt (v2)
# ─────────────────────────────────────────────────────────────────────────────
TIERS_PROMPT_VERSION = "journal_tiers_v2_2025-09-14"

TIERS_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "tier_should": {
            "type": "object",
            "additionalProperties": False,
            "properties": {
                "score": {"type": "number", "minimum": 0, "maximum": 5},
                "ci_lower": {"type": "number", "minimum": 0, "maximum": 5},
                "ci_upper": {"type": "number", "minimum": 0, "maximum": 5},
                "rationale": {"type": "string", "maxLength": 400},
            },
            "required": ["score", "ci_lower", "ci_upper", "rationale"],
        },
        "tier_will": {
            "type": "object",
            "additionalProperties": False,
            "properties": {
                "score": {"type": "number", "minimum": 0, "maximum": 5},
                "ci_lower": {"type": "number", "minimum": 0, "maximum": 5},
                "ci_upper": {"type": "number", "minimum": 0, "maximum": 5},
                "rationale": {"type": "string", "maxLength": 400},
            },
            "required": ["score", "ci_lower", "ci_upper", "rationale"],
        },
    },
    "required": ["tier_should", "tier_will"],
}

TIERS_RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "journal_tiers_v2",
        "strict": True,
        "schema": TIERS_SCHEMA,
    },
}

TIERS_SYSTEM_PROMPT = textwrap.dedent("""
    You are an expert evaluator. Return STRICT JSON matching the provided schema.

    Scale (0–5; halves allowed):
      5 = A-journal / top-five general
      4 = top field or marginal A
      3 = solid field
      2 = niche / low-tier field
      1 = working-paper outlet only
      0 = not publishable

    Definitions:
    - tier_should = where the paper deserves to publish if quality alone decides.
    - tier_will = realistic prediction given status/noise/connections.

    Rules:
    - Keep 0 ≤ ci_lower ≤ score ≤ ci_upper ≤ 5.
    - Round all scores to the nearest 0.5.
    - Rationale ≤ 40 words; focus on contribution, credibility, and fit.
    - No extra keys. No markdown. JSON only.
""").strip()

# ─────────────────────────────────────────────────────────────────────────────
# Helpers: rounding, clamping, validation, caching
# ─────────────────────────────────────────────────────────────────────────────
def _round_half(x: float) -> float:
    return round(float(x) * 2.0) / 2.0

def _clamp_0_5(x: float) -> float:
    return max(0.0, min(5.0, float(x)))

def _normalize_block(block: Dict[str, Any]) -> Dict[str, Any]:
    s = _round_half(_clamp_0_5(block.get("score", 0.0)))
    lo = _round_half(_clamp_0_5(block.get("ci_lower", s)))
    hi = _round_half(_clamp_0_5(block.get("ci_upper", s)))
    # enforce ordering
    if lo > s: lo, s = s, lo
    if hi < s: hi, s = s, hi
    if lo > hi: lo, hi = hi, lo
    block["score"], block["ci_lower"], block["ci_upper"] = s, lo, hi
    # Rationale hygiene
    r = str(block.get("rationale", "")).strip()
    block["rationale"] = re.sub(r"\s+", " ", r)
    return block

def _normalize_tiers(payload: Dict[str, Any]) -> Dict[str, Any]:
    if "tier_should" in payload:
        payload["tier_should"] = _normalize_block(payload["tier_should"])
    if "tier_will" in payload:
        payload["tier_will"] = _normalize_block(payload["tier_will"])
    return payload

def _file_hash_bytes(path: pathlib.Path) -> bytes:
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def _cache_key_for_tiers(pdf_path: pathlib.Path, model: str) -> str:
    base = _file_hash_bytes(pdf_path)
    extra = f"{model}|{TIERS_PROMPT_VERSION}".encode()
    return hashlib.sha1(base + extra).hexdigest()

CACHE_DIR = pathlib.Path("cache") / "journal_tiers"
CACHE_DIR.mkdir(parents=True, exist_ok=True)
_cache_lock = threading.Lock()

def evaluate_journal_tiers(pdf_path: str | pathlib.Path,
                           model: str = MODEL,
                           seed: Optional[int] = 2025,
                           max_tokens_extract: int = 25_000,
                           verbose: bool = False) -> Dict[str, Any]:
    """
    Returns a dict with keys tier_should{score, ci_lower, ci_upper, rationale},
    tier_will{...}, plus bookkeeping fields.
    Uses caching keyed to (pdf bytes, model, prompt_version).
    """
    pdf_path = pathlib.Path(pdf_path)
    key = _cache_key_for_tiers(pdf_path, model)
    cache_file = CACHE_DIR / f"{key}.json"

    # Try cache
    if cache_file.exists():
        if verbose:
            print(f"cache hit: {pdf_path.name}")
        data = json.loads(cache_file.read_text())
        return data

    # Extract compact text for this task
    paper_text = pdf_to_tier_text(pdf_path, max_tokens=max_tokens_extract)

    # Call model with strict JSON schema
    backoff = 1.0
    last_err = None
    for attempt in range(6):
        try:
            chat = client.chat.completions.create(
                model=model,
                response_format=TIERS_RESPONSE_FORMAT,
                seed=seed,
                messages=[
                    {"role": "system", "content": TIERS_SYSTEM_PROMPT},
                    {"role": "user", "content": paper_text},
                ],
            )
            raw = chat.choices[0].message.content
            payload = json.loads(raw)
            payload = _normalize_tiers(payload)
            payload["paper"] = pdf_path.stem
            payload["model"] = model
            payload["prompt_version"] = TIERS_PROMPT_VERSION
            cache_file.write_text(json.dumps(payload, ensure_ascii=False))
            return payload
        except Exception as e:
            last_err = e
            # gentle jittered exponential backoff
            time.sleep(backoff + random.random() * 0.25)
            backoff *= 1.8
    raise RuntimeError(f"Tier evaluation failed for {pdf_path.name}: {last_err}")

def batch_evaluate_journal_tiers(root: str | pathlib.Path = "papers",
                                 out_csv_wide: str | pathlib.Path = "results/journal_tiers.csv",
                                 out_csv_long: str | pathlib.Path = "results/journal_tiers_long.csv",
                                 model: str = MODEL,
                                 max_workers: int = 3,
                                 verbose: bool = True) -> pd.DataFrame:
    ROOT = pathlib.Path(root)
    OUT = pathlib.Path(out_csv_wide)
    OUT.parent.mkdir(parents=True, exist_ok=True)

    pdfs = sorted(ROOT.glob("*.pdf"))
    if verbose:
        print(f"Tiering {len(pdfs)} papers with {max_workers} workers...")

    rows = []
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futs = {ex.submit(evaluate_journal_tiers, pdf, model): pdf for pdf in pdfs}
        for fut in as_completed(futs):
            pdf = futs[fut]
            try:
                res = fut.result()
                rows.append({
                    "paper": pdf.stem,
                    "model": res.get("model"),
                    "prompt_version": res.get("prompt_version"),
                    "should_score": res["tier_should"]["score"],
                    "should_ci_lower": res["tier_should"]["ci_lower"],
                    "should_ci_upper": res["tier_should"]["ci_upper"],
                    "should_rationale": res["tier_should"]["rationale"],
                    "will_score": res["tier_will"]["score"],
                    "will_ci_lower": res["tier_will"]["ci_lower"],
                    "will_ci_upper": res["tier_will"]["ci_upper"],
                    "will_rationale": res["tier_will"]["rationale"],
                })
                if verbose:
                    print(f"✓ {pdf.name}")
            except Exception as e:
                print(f"⚠️ {pdf.name}: {e}")

    wide = pd.DataFrame(rows)
    wide.to_csv(OUT, index=False)

    # Long format (two rows per paper)
    long_rows = []
    for r in rows:
        for kind in ("should", "will"):
            long_rows.append({
                "paper": r["paper"],
                "model": r["model"],
                "prompt_version": r["prompt_version"],
                "tier_kind": kind,
                "score": r[f"{kind}_score"],
                "ci_lower": r[f"{kind}_ci_lower"],
                "ci_upper": r[f"{kind}_ci_upper"],
                "rationale": r[f"{kind}_rationale"],
            })
    long_df = pd.DataFrame(long_rows)
    pathlib.Path(out_csv_long).parent.mkdir(parents=True, exist_ok=True)
    long_df.to_csv(out_csv_long, index=False)
    return wide
```
---title: "Data and methods"---```{r}#| label: setup-r#| eval: true#| code-fold: truelibrary("here")library("janitor")library("readr")research <-read_csv(here("data", "research.csv"), show_col_types =FALSE)|>clean_names()# research <- research |># filter(status == "50_published evaluations (on PubPub, by Unjournal)")currentmodel ="GPT-5"```We draw on two main sources:1) Human evaluations from [The Unjournal’s public evaluation data](https://unjournal.github.io/unjournaldata/index.html) (PubPub reports and the Coda evaluation form export). 2) LLM‑generated evaluations produced by `r currentmodel` using a structured JSON‑schema prompt.## Unjournal.org evaluationsWe use The Unjournal’s public data for baseline comparison. In The Unjournal process, each paper is typically reviewed by 1–3 expert evaluators, who provide quantitative ratings on the 0–100 percentile scale for each criterion (with 90% credible intervals) and a written evaluation. We extracted these ratings from The Unjournal’s records (an evaluator form database and published review reports). For our analysis, we aggregated the human ratings for each paper by computing the average score per criterion (and noting the range of individual scores). All selected papers had completed Unjournal reviews (meaning the authors received a full evaluation on the Unjournal platform). The sample includes 49 papers spanning 2017–2025 working papers in development economics, growth, health policy, environmental economics, and related fields that The Unjournal identified as high-impact. Each of these papers has quantitative scores from at least one human evaluator, and many have multiple (2-3) human ratings.## LLM-based evaluation### Quantitative ratingsFollowing The Unjournal's [standard guidelines for evaluators](https://globalimpact.gitbook.io/the-unjournal-project-and-communication-space/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators) and their [academic evaluation form](https://coda.io/form/Unjournal-Evaluation-form-academic-stream-Coda-updated-version_dGjfMZ1yXME), evaluators are asked to consider each paper along the following dimensions: **claims & evidence**, **methods**, **logic & communication**, **open science**, **global relevance**, and an **overall** assessment. Ratings are interpreted as percentiles relative to serious recent work in the same area. For each metric, evaluators are asked to provide a midpoint and a 90% credible interval to communicate uncertainty.```{python}#| label: setup-env#| eval: false#| code-fold: trueimport os, pathlib, json, textwrap, reimport pdfplumber, time, tiktokenimport pandas as pd, numpy as npfrom openai import OpenAIimport altair as alt; alt.renderers.enable("html")import plotly.io as pioimport plotly.graph_objects as gofrom tqdm import tqdm from IPython.display import Markdown, display# Setup chunk:# * loads the main libraries (OpenAI SDK, Altair, Plotly, …)# * looks for your OpenAI key in **key/openai_key.txt** # * initialises a client (`gpt-5`, via `UJ_MODEL`)# * defines `pdf_to_string()` — drops the reference section and hard‑caps at 180k tokens so we stay in‑context.# **API key hygiene.** # The code only *reads* `key/openai_key.txt` if the environment variable isn’t already set. 
Keep this file out of version control (adjust gitignore).# ---------- API key ----------key_path = pathlib.Path("key/openai_key.txt")if os.getenv("OPENAI_API_KEY") isNoneand key_path.exists(): os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()ifnot os.getenv("OPENAI_API_KEY"):raiseValueError("No API key found. Put it in key/openai_key.txt or set OPENAI_API_KEY.")client = OpenAI()# ---------- Model (GPT‑5) ----------MODEL = os.getenv("UJ_MODEL", "gpt-5")# ---------- PDF → text with a fixed tokenizer ----------_enc = tiktoken.get_encoding("o200k_base") # large-context encoderdef pdf_to_string(path, max_tokens=180_000):"""Extract text, drop references section, hard-cap by tokens."""with pdfplumber.open(path) as pdf: text =" ".join(p.extract_text() or""for p in pdf.pages) text = re.sub(r"\s+", " ", text) m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)if m: text = text[: m.start()] toks = _enc.encode(text)iflen(toks) > max_tokens: text = _enc.decode(toks[:max_tokens])return text```Response schema & system prompt: This section defines the *contract* we expect back from the model:- `METRICS` lists the categories we score.- `metric_schema` specifies the shape and bounds of each metric.- `response_format` (JSON Schema) asks the model to return only a JSON object conforming to that schema.- `SYSTEM_PROMPT` explains scoring semantics (percentiles + 90% credible intervals) and how to set the `overall` score.```{python}#| label: llm-setup#| eval: false#| code-fold: show# -----------------------------# 1. Metric list# -----------------------------METRICS = ["overall","claims_evidence","methods","advancing_knowledge","logic_communication","open_science","global_relevance"]# -----------------------------# 2. JSON schema# -----------------------------metric_schema = {"type": "object","properties": {"midpoint": {"type": "integer", "minimum": 0, "maximum": 100},"lower_bound": {"type": "integer", "minimum": 0, "maximum": 100},"upper_bound": {"type": "integer", "minimum": 0, "maximum": 100},"rationale": {"type": "string"} },"required": ["midpoint", "lower_bound", "upper_bound", "rationale"],"additionalProperties": False}response_format = {"type": "json_schema","json_schema": {"name": "paper_assessment_v1","strict": True,"schema": {"type": "object","properties": {"metrics": {"type": "object","properties": {m: metric_schema for m in METRICS},"required": METRICS,"additionalProperties": False } },"required": ["metrics"],"additionalProperties": False } }}# -----------------------------# 3. System prompt# -----------------------------SYSTEM_PROMPT = textwrap.dedent(f"""You are an expert evaluator.We ask for a set of quantitative metrics. For each metric, we ask for a score and a 90% credible interval.Percentile rankingsWe ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on.The reference is all serious research in the same area in the last three years.Midpoint rating and credible intervals For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. We want policymakers, researchers, funders, and managers to be able to use evaluations to update their beliefs and make better decisions. 
Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.Overall assessment- Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.importance to knowledge production, and importance to practice. Claims, strength and characterization of evidence- Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?Methods: Justification, reasonableness, validity, robustness- Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? - Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?- Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, did they do a strong pre-registration and pre-analysis plan, incorporate multiple hypothesis testing corrections, and report flexible specifications? Advancing our knowledge and practice- To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?- Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions? - Does the project add useful value to other impactful research? We don't require surprising results; sound and well-presented null results can also be valuable.Logic and communication- Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?- Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?- Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? - Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?Open, collaborative, replicable research- Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?- Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?? - Consistency: Do the numbers in the paper and/or code output make sense? 
Are they internally consistent throughout the paper?- Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?Relevance to global priorities, usefulness for practitioners- Are the paper’s chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? - Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? - Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? - Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?Return STRICT JSON matching the supplied schema.Fill every key in the object `metrics`:{', '.join(METRICS)}Definitions are percentile scores (0 – 100) versus serious work in the field from the last 3 years.For `overall`: • Default = arithmetic mean of the other six midpoints (rounded). • If, in your judgment, another value is better (e.g. one metric is far more decision-relevant), choose it and explain why in `overall.rationale`.Field meanings midpoint → best-guess percentile lower_bound → 5th-percentile plausible value upper_bound → 95th-percentile plausible value rationale → ≤100 words; terse but informative.Do not wrap the JSON in markdown fences or add extra text.""").strip()```Below is the main helper that runs a single evaluation:- extracts text with `pdf_to_string()`,- calls Chat Completions with the strict JSON Schema in `response_format`,- parses the returned JSON string into a Python dict.```{python}#| label: eval-helper-function#| eval: false#| code-fold: truedef evaluate_paper(pdf_path: str| pathlib.Path, model: str= MODEL) ->dict: paper_text = pdf_to_string(pdf_path) chat = client.chat.completions.create( model=model, # "gpt-5" response_format=response_format, # your JSON-Schema defined above messages=[ {"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": paper_text} ],# max_tokens=2500, # optional cap for the JSON block ) raw_json = chat.choices[0].message.contentreturn json.loads(raw_json)# Minimal probe using Chat Completions (works on gpt-5 and o3)# A quick “smoke test” to confirm the client is alive and willing to emit JSON.# This intentionally uses the simple `json_object` format for wide model compatibility.MODEL ="gpt-5"# "o3"probe = client.chat.completions.create( model=MODEL, messages=[ {"role": "system", "content": "Return only valid JSON: {\"ok\": true}"}, {"role": "user", "content": "Return {\"ok\": true} only."} ],# Wide compatibility: ask for a JSON object (no schema) response_format={"type": "json_object"},)print(probe.choices[0].message.content) # should print {"ok": true}```Batch-evaluate all PDFs: This loop walks every `*.pdf` in `papers/`, calls `evaluate_paper()`,pads a 0.8s sleep for rate‑limits, and collects the raw JSON dicts in memory (`records`).The nested `records` structure is converted to a tidy long format and written to `results/metrics_long.csv`.```{python}#| label: eval-many-metrics#| eval: false#| code-fold: trueROOT = pathlib.Path("papers") OUT = pathlib.Path("results")OUT.mkdir(exist_ok=True)pdfs =sorted(ROOT.glob("*.pdf"))records = []for pdf in tqdm(pdfs, desc="Metrics"):try: res = evaluate_paper(pdf) # <-- API call res["paper"] = 
pdf.stem records.append(res) time.sleep(0.8) exceptExceptionas e:print(f"⚠️ {pdf.name}: {e}")tidy_rows = []for rec in records: paper_id = rec["paper"]for metric, vals in rec["metrics"].items(): tidy_rows.append({"paper": paper_id,"metric": metric,**vals # midpoint, lower_bound, upper_bound, rationale })tidy = pd.DataFrame(tidy_rows)tidy.to_csv(OUT /"metrics_long.csv", index=False)# tidy.head()p = pathlib.Path("results/metrics_long.csv")ifnot p.exists(): display(Markdown("::: {.callout-warning}\n**No results yet.** Run the batch evaluation to generate `results/metrics_long.csv`.\n:::"))else: tidy = pd.read_csv(p)# Basic counts n_papers = tidy["paper"].nunique() n_rows =len(tidy) n_metrics = tidy["metric"].nunique()# Overall-only slice overall = tidy[tidy["metric"] =="overall"].copy() mean_overall = overall["midpoint"].mean() ifnot overall.empty else np.nan med_overall = overall["midpoint"].median() ifnot overall.empty else np.nan display(Markdown(f"**Batch size:** {n_papers} papers \n"f"**Total metric rows:** {n_rows:,} across {n_metrics} metrics \n"f"**Overall ratings:** mean = {mean_overall:.1f}, median = {med_overall:.1f}" ))```### Journal ranking tiers- What journal ranking tier should this work be published in? (0.0-5.0) - What journal ranking tier will this work be published in? (0.0-5.0)```{python}#| label: tiers-setup#| eval: false#| code-fold: truefrom __future__ import annotationsimport os, pathlib, json, textwrap, re, time, math, random, hashlib, threadingfrom typing import Any, Dict, Optionalimport pdfplumber, tiktokenimport pandas as pdfrom concurrent.futures import ThreadPoolExecutor, as_completed# Reuse your existing OpenAI client + MODEL set earlier:# client = OpenAI()# MODEL = os.getenv("UJ_MODEL", "gpt-5")# ─────────────────────────────────────────────────────────────────────────────# Tokenizer + compact extractor for tiering # ─────────────────────────────────────────────────────────────────────────────_enc = tiktoken.get_encoding("o200k_base")def _truncate_tokens(txt: str, max_tokens: int) ->str: toks = _enc.encode(re.sub(r"\s+", " ", txt))iflen(toks) > max_tokens: toks = toks[:max_tokens]return _enc.decode(toks)def pdf_to_tier_text(path, max_tokens: int=25_000) ->str:""" Cheaper extract: try to keep Abstract, Introduction, Results/Discussion, Conclusion. Falls back gracefully to the whole (minus references) then truncates by tokens. 
"""try:with pdfplumber.open(path) as pdf: full =" ".join(p.extract_text() or""for p in pdf.pages)exceptException:# Fall back to the generic extractor in your earlier chunk, if defined:return pdf_to_string(path, max_tokens=max_tokens) # type: ignore text = re.sub(r"\s+", " ", full)# Drop references/bibliography m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)if m: text = text[: m.start()]# Heuristic section pulls chunks = []def _grab(label, next_labels=("Introduction","1 ","I.","Background","Data","Model","Method", "Methods", "Approach")): pat =rf"\b{label}\b" m = re.search(pat, text, flags=re.I)ifnot m: returnNone start = m.start()# stop at next section-ish header next_pat =r"|".join([rf"\b{re.escape(nl)}\b"for nl in next_labels]) m2 = re.search(next_pat, text[m.end():], flags=re.I) end = m.end() + (m2.start() if m2 else5000) # cap a bit if no clear endreturn text[start:end]for lab in ["Abstract", "Summary"]: s = _grab(lab)if s: chunks.append(s)for lab in ["Introduction", "Motivation", "Overview"]: s = _grab(lab)if s: chunks.append(s)for lab in ["Results", "Findings", "Discussion"]: s = _grab(lab)if s: chunks.append(s)for lab in ["Conclusion", "Conclusions", "Policy Implications", "Implications"]: s = _grab(lab)if s: chunks.append(s)ifnot chunks: candidate = textelse: candidate =" ".join(chunks)return _truncate_tokens(candidate, max_tokens=max_tokens)# ─────────────────────────────────────────────────────────────────────────────# JSON schema + prompt (v2)# ─────────────────────────────────────────────────────────────────────────────TIERS_PROMPT_VERSION ="journal_tiers_v2_2025-09-14"TIERS_SCHEMA: Dict[str, Any] = {"type": "object","additionalProperties": False,"properties": {"tier_should": {"type": "object","additionalProperties": False,"properties": {"score": {"type": "number", "minimum": 0, "maximum": 5},"ci_lower": {"type": "number", "minimum": 0, "maximum": 5},"ci_upper": {"type": "number", "minimum": 0, "maximum": 5},"rationale": {"type": "string", "maxLength": 400} },"required": ["score", "ci_lower", "ci_upper", "rationale"] },"tier_will": {"type": "object","additionalProperties": False,"properties": {"score": {"type": "number", "minimum": 0, "maximum": 5},"ci_lower": {"type": "number", "minimum": 0, "maximum": 5},"ci_upper": {"type": "number", "minimum": 0, "maximum": 5},"rationale": {"type": "string", "maxLength": 400} },"required": ["score", "ci_lower", "ci_upper", "rationale"] } },"required": ["tier_should", "tier_will"]}TIERS_RESPONSE_FORMAT = {"type": "json_schema","json_schema": {"name": "journal_tiers_v2","strict": True,"schema": TIERS_SCHEMA }}TIERS_SYSTEM_PROMPT = textwrap.dedent("""You are an expert evaluator. Return STRICT JSON matching the provided schema.Scale (0–5; halves allowed): 5 = A-journal / top-five general 4 = top field or marginal A 3 = solid field 2 = niche / low-tier field 1 = working-paper outlet only 0 = not publishableDefinitions:- tier_should = where the paper deserves to publish if quality-only decides.- tier_will = realistic prediction given status/noise/connections.Rules:- Keep 0 ≤ ci_lower ≤ score ≤ ci_upper ≤ 5.- Round all scores to the nearest 0.5.- Rationale ≤ 40 words; focus on contribution, credibility, and fit.- No extra keys. No markdown. 
JSON only.""").strip()# ─────────────────────────────────────────────────────────────────────────────# Helpers: rounding, clamping, validation, caching# ─────────────────────────────────────────────────────────────────────────────def _round_half(x: float) ->float:returnround(float(x) *2.0) /2.0def _clamp_0_5(x: float) ->float:returnmax(0.0, min(5.0, float(x)))def _normalize_block(block: Dict[str, Any]) -> Dict[str, Any]: s = _round_half(_clamp_0_5(block.get("score", 0.0))) lo = _round_half(_clamp_0_5(block.get("ci_lower", s))) hi = _round_half(_clamp_0_5(block.get("ci_upper", s)))# enforce orderingif lo > s: lo, s = s, loif hi < s: hi, s = s, hiif lo > hi: lo, hi = hi, lo block["score"], block["ci_lower"], block["ci_upper"] = s, lo, hi# Rationale hygiene r =str(block.get("rationale", "")).strip() block["rationale"] = re.sub(r"\s+", " ", r)return blockdef _normalize_tiers(payload: Dict[str, Any]) -> Dict[str, Any]:if"tier_should"in payload: payload["tier_should"] = _normalize_block(payload["tier_should"])if"tier_will"in payload: payload["tier_will"] = _normalize_block(payload["tier_will"])return payloaddef _file_hash_bytes(path: pathlib.Path) ->bytes: h = hashlib.sha1()withopen(path, "rb") as f:for chunk initer(lambda: f.read(1<<20), b""): h.update(chunk)return h.digest()def _cache_key_for_tiers(pdf_path: pathlib.Path, model: str) ->str: base = _file_hash_bytes(pdf_path) extra =f"{model}|{TIERS_PROMPT_VERSION}".encode()return hashlib.sha1(base + extra).hexdigest()CACHE_DIR = pathlib.Path("cache") /"journal_tiers"CACHE_DIR.mkdir(parents=True, exist_ok=True)_cache_lock = threading.Lock()def evaluate_journal_tiers(pdf_path: str| pathlib.Path, model: str= MODEL, seed: Optional[int] =2025, max_tokens_extract: int=25_000, verbose: bool=False) -> Dict[str, Any]:""" Returns: dict with keys: tier_should{score,ci_lower,ci_upper,rationale}, tier_will{...}, plus bookkeeping fields. Uses caching keyed to (pdf bytes, model, prompt_version). 
""" pdf_path = pathlib.Path(pdf_path) key = _cache_key_for_tiers(pdf_path, model) cache_file = CACHE_DIR /f"{key}.json"# Try cacheif cache_file.exists():if verbose: print(f"cache hit: {pdf_path.name}") data = json.loads(cache_file.read_text())return data# Extract compact text for this task paper_text = pdf_to_tier_text(pdf_path, max_tokens=max_tokens_extract)# Call model with strict JSON schema backoff =1.0 last_err =Nonefor attempt inrange(6):try: chat = client.chat.completions.create( model=model, response_format=TIERS_RESPONSE_FORMAT, seed=seed, messages=[ {"role": "system", "content": TIERS_SYSTEM_PROMPT}, {"role": "user", "content": paper_text} ], ) raw = chat.choices[0].message.content payload = json.loads(raw) payload = _normalize_tiers(payload) payload["paper"] = pdf_path.stem payload["model"] = model payload["prompt_version"] = TIERS_PROMPT_VERSION cache_file.write_text(json.dumps(payload, ensure_ascii=False))return payloadexceptExceptionas e: last_err = e# gentle jittered exponential backoff time.sleep(backoff + random.random() *0.25) backoff *=1.8raiseRuntimeError(f"Tier evaluation failed for {pdf_path.name}: {last_err}")def batch_evaluate_journal_tiers(root: str| pathlib.Path ="papers", out_csv_wide: str| pathlib.Path ="results/journal_tiers.csv", out_csv_long: str| pathlib.Path ="results/journal_tiers_long.csv", model: str= MODEL, max_workers: int=3, verbose: bool=True) -> pd.DataFrame: ROOT = pathlib.Path(root) OUT = pathlib.Path(out_csv_wide) OUT.parent.mkdir(parents=True, exist_ok=True) pdfs =sorted(ROOT.glob("*.pdf"))if verbose:print(f"Tiering {len(pdfs)} papers with {max_workers} workers...") rows = []with ThreadPoolExecutor(max_workers=max_workers) as ex: futs = {ex.submit(evaluate_journal_tiers, pdf, model): pdf for pdf in pdfs}for fut in as_completed(futs): pdf = futs[fut]try: res = fut.result() rows.append({"paper": pdf.stem,"model": res.get("model"),"prompt_version": res.get("prompt_version"),"should_score": res["tier_should"]["score"],"should_ci_lower": res["tier_should"]["ci_lower"],"should_ci_upper": res["tier_should"]["ci_upper"],"should_rationale": res["tier_should"]["rationale"],"will_score": res["tier_will"]["score"],"will_ci_lower": res["tier_will"]["ci_lower"],"will_ci_upper": res["tier_will"]["ci_upper"],"will_rationale": res["tier_will"]["rationale"], })if verbose: print(f"✓ {pdf.name}")exceptExceptionas e:print(f"⚠️ {pdf.name}: {e}") wide = pd.DataFrame(rows) wide.to_csv(OUT, index=False)# Long format (two rows per paper) long_rows = []for r in rows:for kind in ("should", "will"): long_rows.append({"paper": r["paper"],"model": r["model"],"prompt_version": r["prompt_version"],"tier_kind": kind,"score": r[f"{kind}_score"],"ci_lower": r[f"{kind}_ci_lower"],"ci_upper": r[f"{kind}_ci_upper"],"rationale": r[f"{kind}_rationale"], })long= pd.DataFrame(long_rows) pathlib.Path(out_csv_long).parent.mkdir(parents=True, exist_ok=True)long.to_csv(out_csv_long, index=False)return wide``````{python}#| eval: false#| code-fold: truebatch_evaluate_journal_tiers( root="papers", out_csv_wide="results/journal_tiers.csv", out_csv_long="results/journal_tiers_long.csv", model=MODEL, max_workers=3, verbose=True)```### Qualitative assessments```{python}#| label: prompt-fulleval#| eval: false#| code-fold: trueEVAL_SCHEMA: Dict[str, Any] = {"type": "object","additionalProperties": False,"properties": {"summary": {"type": "string"},"major_strengths": {"type": "string"},"major_weaknesses": {"type": "string"},"methodological_notes": {"type": "string"},"policy_relevance": {"type": 
"string"},"recommendations": {"type": "string"} },"required": ["summary","major_strengths","major_weaknesses","methodological_notes","policy_relevance","recommendations"]}RESPONSE_FORMAT = {"type": "json_schema","json_schema": {"name": "written_evaluation_v1","strict": True,"schema": EVAL_SCHEMA }}SYSTEM_PROMPT = textwrap.dedent("""You are writing a referee report for The Unjournal.Return STRICT JSON that matches the schema `written_evaluation_v1`.Audience → (i) research users & policy-makers, (ii) departments & funders,(iii) authors.Sections (≤250 words each): • summary → clear one-paragraph overview of purpose & findings • major_strengths → bullet list or short paragraph • major_weaknesses → idem • methodological_notes → focus on justification, assumptions, robustness, open/science aspects • policy_relevance → discuss how (and how strongly) results matter for global priorities / practitioners • recommendations → concrete, numbered advice to improve the workWrite in direct, concise style; no hedging beyond necessary uncertaintyquantification. No markdown fences, no extra keys.""").strip()```---