Include global setup and parameters
source("setup_params.R")
# get the choices printed

We draw on two main sources:

  1. Human evaluations from The Unjournal’s public evaluation data
  2. LLM‑generated evaluations using a structured JSON‑schema prompt modeled on The Unjournal’s guidelines, comparing several models, including gpt-5-pro-2025-10-06 (knowledge cut-off: 30 September 2024).

Unjournal.org evaluations

We use The Unjournal’s public data for a baseline comparison. At The Unjournal, each paper is typically evaluated (aka ‘reviewed’) by two expert evaluators1 who provide quantitative ratings on a 0–100 percentile scale for each of seven criteria (with 90% credible intervals),2 two “journal tier” ratings on a 0.0–5.0 scale,3 a written evaluation (resembling a referee report for a journal), and an identification and assessment of the paper’s “main claim”. For our initial analysis, we extracted these human ratings and aggregated them, taking the average score per criterion across evaluators (and noting the range of individual scores).
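
A minimal pandas sketch of this aggregation step is shown below; the column names (paper, criterion, evaluator, rating) are placeholders and the actual processing of The Unjournal’s export may differ.

import pandas as pd

# Hypothetical long-format human ratings: one row per evaluator x criterion.
human = pd.DataFrame({
    "paper":     ["P1", "P1", "P1", "P1"],
    "criterion": ["overall", "overall", "methods", "methods"],
    "evaluator": ["A", "B", "A", "B"],
    "rating":    [72, 80, 65, 78],
})

# Average per paper and criterion, noting the range of individual scores.
agg = (human.groupby(["paper", "criterion"])["rating"]
            .agg(mean="mean", low="min", high="max", n="count")
            .reset_index())
print(agg)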

All papers have completed The Unjournal’s evaluation process (meaning the authors received a full evaluation on the Unjournal platform, which has been publicly posted at unjournal.pubpub.org). The sample consists of working papers from 2017–2025 in development economics, growth, health policy, environmental economics, and related fields that The Unjournal identified as high-impact. Each of these papers has quantitative scores from at least one human evaluator, and many have multiple (2–3) human ratings.

LLM-based evaluation

Following The Unjournal’s standard guidelines for evaluators and their academic evaluation form, evaluators are asked to consider each paper along the following dimensions: claims & evidence, methods, advancing knowledge, logic & communication, open science, global relevance, and an overall assessment. Ratings are interpreted as percentiles relative to serious recent work in the same area. For each metric, evaluators are asked for the midpoint of their beliefs and a 90% credible interval to communicate their uncertainty. For the journal rankings measure, we ask both “what journal ranking tier should this work be published in? (0.0–5.0)” and “what journal ranking tier will this work be published in? (0.0–5.0)”, with some further explanation. The full prompt can be seen in the code below – essentially copied from The Unjournal’s guidelines page.

We captured the version of each paper that was evaluated by The Unjournal’s human evaluators, downloading the PDFs from the links provided in The Unjournal’s Coda database.

We evaluate each paper by passing the PDF directly to the model and requiring a strict, machine‑readable JSON output. This keeps the assessment tied to the document the authors wrote. Direct ingestion preserves tables, figures, equations, and sectioning, which ad‑hoc text scraping can mangle. It also avoids silent trimming or segmentation choices that would bias what the model sees.

LLM evaluation pipeline setup
import os, time, json, random, hashlib
import pathlib
from typing import Any, Dict, Optional, Union

import pandas as pd
import numpy as np

import openai
from openai import OpenAI

# ---------- Configuration (in-file, no external deps)
API_KEY_PATH = pathlib.Path(os.getenv("OPENAI_KEY_PATH", "key/openai_key.txt"))
MODEL        = os.getenv("OPENAI_MODEL", "gpt-5-pro-2025-10-06")
FILE_PURPOSE = "assistants"  # for Responses API file inputs

# Run ID for organizing outputs - change this for each new evaluation run
# Set via environment variable or modify directly here
RUN_ID       = os.getenv("UJ_RUN_ID", "gpt5_pro_updated_jan2026")
RESULTS_DIR  = pathlib.Path("results")
RESULTS_DIR.mkdir(exist_ok=True)
RUN_DIR      = RESULTS_DIR / RUN_ID
RUN_DIR.mkdir(exist_ok=True)
FILE_CACHE   = RESULTS_DIR / ".file_cache.json"  # shared across runs

# ---------- API key bootstrap
if os.getenv("OPENAI_API_KEY") is None and API_KEY_PATH.exists():
    os.environ["OPENAI_API_KEY"] = API_KEY_PATH.read_text().strip()
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("No API key. Set OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()

# ---------- Small utilities (inlined replacements for llm_utils)

def _resp_as_dict(resp: Any) -> Dict[str, Any]:
    if isinstance(resp, dict):
        return resp
    for attr in ("to_dict", "model_dump", "dict", "json"):
        if hasattr(resp, attr):
            try:
                val = getattr(resp, attr)()
                if isinstance(val, (str, bytes)):
                    try:
                        return json.loads(val)
                    except Exception:
                        pass
                if isinstance(val, dict):
                    return val
            except Exception:
                pass
    # last resort
    try:
        return json.loads(str(resp))
    except Exception:
        return {"_raw": str(resp)}

def _get_output_text(resp: Any) -> str:
    d = _resp_as_dict(resp)
    if "output_text" in d and isinstance(d["output_text"], str):
        return d["output_text"]
    out = d.get("output") or []
    chunks = []
    for item in out:
        if not isinstance(item, dict): continue
        if item.get("type") == "message":
            for c in item.get("content") or []:
                if isinstance(c, dict):
                    if "text" in c and isinstance(c["text"], str):
                        chunks.append(c["text"])
                    elif "output_text" in c and isinstance(c["output_text"], str):
                        chunks.append(c["output_text"])
    # Also check legacy top-level choices-like structures
    if not chunks:
        for k in ("content", "message"):
            v = d.get(k)
            if isinstance(v, str):
                chunks.append(v)
    return "\n".join(chunks).strip()

def _extract_json(s: str) -> Dict[str, Any]:
    """Robustly extract first top-level JSON object from a string."""
    if not s:
        raise ValueError("empty output text")
    # Fast path
    s_stripped = s.strip()
    if s_stripped.startswith("{") and s_stripped.endswith("}"):
        return json.loads(s_stripped)

    # Find first balanced {...} while respecting strings
    start = s.find("{")
    if start == -1:
        raise ValueError("no JSON object start found")
    i = start
    depth = 0
    in_str = False
    esc = False
    for i in range(start, len(s)):
        ch = s[i]
        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        else:
            if ch == '"':
                in_str = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    candidate = s[start:i+1]
                    return json.loads(candidate)
    raise ValueError("no balanced JSON object found")
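
# Illustration (not executed by the pipeline): _extract_json tolerates prose
# surrounding the JSON object, e.g.
#   _extract_json('Here is the assessment: {"overall": {"midpoint": 55}} -- end')
#   returns {'overall': {'midpoint': 55}}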

def call_with_retries(fn, max_tries: int = 6, base_delay: float = 0.8, max_delay: float = 8.0):
    ex = None
    for attempt in range(1, max_tries + 1):
        try:
            return fn()
        except Exception as e:  # broad catch: rate limits, API errors, connection and timeout issues
            ex = e
            sleep = min(max_delay, base_delay * (1.8 ** (attempt - 1))) * (1 + 0.25 * random.random())
            time.sleep(sleep)
    raise ex

def _load_cache() -> Dict[str, Any]:
    if FILE_CACHE.exists():
        try:
            return json.loads(FILE_CACHE.read_text())
        except Exception:
            return {}
    return {}

def _save_cache(cache: Dict[str, Any]) -> None:
    FILE_CACHE.write_text(json.dumps(cache, ensure_ascii=False, indent=2))

def _file_sig(p: pathlib.Path) -> Dict[str, Any]:
    st = p.stat()
    return {"size": st.st_size, "mtime": int(st.st_mtime)}

def get_file_id(path: Union[str, pathlib.Path], client: OpenAI) -> str:
    p = pathlib.Path(path)
    if not p.exists():
        raise FileNotFoundError(p)
    cache = _load_cache()
    key = str(p.resolve())
    sig = _file_sig(p)
    meta = cache.get(key)
    if meta and meta.get("size") == sig["size"] and meta.get("mtime") == sig["mtime"] and meta.get("file_id"):
        return meta["file_id"]
    # Upload fresh
    with open(p, "rb") as fh:
      f = call_with_retries(lambda: client.files.create(file=fh, purpose=FILE_PURPOSE))
    fd = _resp_as_dict(f)
    fid = fd.get("id")
    if not fid:
        raise RuntimeError(f"Upload did not return file id: {fd}")
    cache[key] = {"file_id": fid, **sig}
    _save_cache(cache)
    return fid
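
# For reference, results/.file_cache.json is a small JSON map from resolved PDF
# paths to upload metadata, roughly (values illustrative):
#   {"/abs/path/papers/Example_2024.pdf":
#       {"file_id": "file-abc123", "size": 1048576, "mtime": 1735689600}}
# so unchanged files are never re-uploaded across runs.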

def _reasoning_meta(resp) -> Dict[str, Any]:
    d = _resp_as_dict(resp)
    rid, summary_text = None, None
    out = d.get("output") or []
    if out and isinstance(out, list) and out[0].get("type") == "reasoning":
        rid = out[0].get("id")
        summ = out[0].get("summary") or []
        if summ and isinstance(summ, list):
            summary_text = summ[0].get("text")
    usage = d.get("usage") or {}
    odet  = usage.get("output_tokens_details") or {}
    return {
        "response_id": d.get("id"),
        "reasoning_id": rid,
        "reasoning_summary": summary_text,
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "reasoning_tokens": odet.get("reasoning_tokens"),
    }
    

def read_csv_or_empty(path, columns=None, **kwargs):
    p = pathlib.Path(path)
    if not p.exists():
        return pd.DataFrame(columns=columns or [])
    try:
        df = pd.read_csv(p, **kwargs)
        if df is None or getattr(df, "shape", (0,0))[1] == 0:
            return pd.DataFrame(columns=columns or [])
        return df
    except (pd.errors.EmptyDataError, pd.errors.ParserError, OSError, ValueError):
        return pd.DataFrame(columns=columns or [])    

JSON Schema

We enforce a strict JSON Schema for the output. The model must return one object for each metric, including a midpoint rating and a 90% credible interval. This guarantees that every paper is scored on the same fields with the same types and bounds. We request credible intervals (as we do for human evaluators) to allow the model to communicate its uncertainty rather than suggest false precision.

JSON Schema definition
METRICS = [
    "overall",
    "claims_evidence",
    "methods",
    "advancing_knowledge",
    "logic_communication",
    "open_science",
    "global_relevance",
]

metric_schema = {
    "type": "object",
    "properties": {
        "midpoint":    {"type": "number", "minimum": 0, "maximum": 100},
        "lower_bound": {"type": "number", "minimum": 0, "maximum": 100},
        "upper_bound": {"type": "number", "minimum": 0, "maximum": 100},
    },
    "required": ["midpoint", "lower_bound", "upper_bound"],
    "additionalProperties": False,
}

TIER_METRIC_SCHEMA = {
    "type": "object",
    "properties": {
        "score":   {"type": "number", "minimum": 0, "maximum": 5},
        "ci_lower":{"type": "number", "minimum": 0, "maximum": 5},
        "ci_upper":{"type": "number", "minimum": 0, "maximum": 5},
    },
    "required": ["score", "ci_lower", "ci_upper"],
    "additionalProperties": False,
}

COMBINED_SCHEMA = {
    "type": "object",
    "properties": {
        "assessment_summary": {"type": "string"},
        "metrics": {
            "type": "object",
            "properties": {
                **{m: metric_schema for m in METRICS},
                "tier_should": TIER_METRIC_SCHEMA,
                "tier_will":   TIER_METRIC_SCHEMA,
            },
            "required": METRICS + ["tier_should", "tier_will"],
            "additionalProperties": False,
        },
    },
    "required": ["assessment_summary", "metrics"],
    "additionalProperties": False,
}

TEXT_FORMAT_COMBINED = {
    "type": "json_schema",
    "name": "paper_assessment_with_tiers_v2",
    "strict": True,
    "schema": COMBINED_SCHEMA,
}
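
For concreteness, the sketch below shows a minimal object that satisfies COMBINED_SCHEMA (all numbers are placeholders). The optional check uses the third-party jsonschema package, which the pipeline itself does not require.

example_output = {
    "assessment_summary": "Placeholder diagnostic summary...",
    "metrics": {
        **{m: {"midpoint": 60, "lower_bound": 45, "upper_bound": 75} for m in METRICS},
        "tier_should": {"score": 3.2, "ci_lower": 2.5, "ci_upper": 4.0},
        "tier_will":   {"score": 2.8, "ci_lower": 2.0, "ci_upper": 3.6},
    },
}

# Optional local validation against COMBINED_SCHEMA (requires `pip install jsonschema`).
try:
    import jsonschema
    jsonschema.validate(instance=example_output, schema=COMBINED_SCHEMA)
    print("example_output conforms to COMBINED_SCHEMA")
except ImportError:
    pass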

System Prompt

The system prompt is built from modular sections, each addressing a specific aspect of the evaluation task. We present each section below with brief explanations.

We instruct the model to act as an expert evaluator while explicitly preventing it from using author reputation, publication venue, or any prior knowledge of the paper’s reception as evidence of quality.

Role and debiasing instructions
PROMPT_ROLE = """
Your role -- You are an academic expert as well as a practitioner across every relevant field -- use all your knowledge and insight. You are acting as an expert research evaluator/reviewer.
"""

PROMPT_DEBIASING = """
Do not look at any existing ratings or evaluations of these papers you might find on the internet or in your corpus, do not use the authors' names, status, or institutions in your judgment; ignore where (or whether) the work is published, the prestige of any venue, and how much attention it has received. Do not use this as evidence about quality. You must base all judgments entirely on the content of the PDF.
"""

Before scoring, we ask for a written assessment identifying key issues in the manuscript. This “think first, score second” approach encourages the model to ground its ratings in specific observations.

Diagnostic summary instructions
PROMPT_DIAGNOSTIC = """
Diagnostic summary (Aim for about 1000 words, based only on the PDF):
Provide a compact paragraph that identifies the most important issues you detect in the manuscript itself (e.g., identification threats, data limitations, misinterpretations, internal inconsistencies, missing robustness, replication barriers). Be specific, neutral, and concrete. This summary should precede any scoring and should guide your uncertainty. Output this text in the JSON field `assessment_summary`.
"""

We define what percentile rankings mean and establish the reference group: “serious research in the same area encountered in the last three years.” This anchors the model’s judgments to a consistent baseline.

Percentile scale and reference group
PROMPT_PERCENTILE_INTRO = """
We ask for a set of quantitative metrics, based on your insights. For each metric, we ask for a score and a 90% credible interval. We describe these in detail below.

Percentile rankings relative to a reference group: For some questions, we ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on. Here the population of papers should be all serious research in the same area that you have encountered in the last three years. *Unless this work is in our 'applied and policy stream', in which case the reference group should be "all applied and policy research you have read that is aiming at a similar audience, and that has similar goals".
"""

PROMPT_REFERENCE_GROUP = """
"Serious" research? Academic research?
Here, we are mainly considering research done by professional researchers with high levels of training, experience, and familiarity with recent practice, who have time and resources to devote months or years to each such research project or paper.
These will typically be written as 'working papers' and presented at academic seminars before being submitted to standard academic journals. Although no credential is required, this typically includes people with PhD degrees (or upper-level PhD students). Most of this sort of research is done by full-time academics (professors, post-docs, academic staff, etc.) with a substantial research remit, as well as research staff at think tanks and research institutions (but there may be important exceptions).

What counts as the "same area"?
This is a judgment call. Some criteria to consider... First, does the work come from the same academic field and research subfield, and does it address questions that might be addressed using similar methods? Second, does it deal with the same substantive research question, or a closely related one? If the research you are evaluating is in a very niche topic, the comparison reference group should be expanded to consider work in other areas.

"Research that you have encountered"
We are aiming for comparability across evaluators. If you suspect you are particularly exposed to higher-quality work in this category, compared to other likely evaluators, you may want to adjust your reference group downwards. (And of course vice-versa, if you suspect you are particularly exposed to lower-quality work.)
"""

We define each of the seven percentile metrics, closely following The Unjournal’s guidelines for evaluators. Note the emphasis on global priorities and practical relevance over pure academic novelty.

Metric definitions
PROMPT_METRICS = """
Midpoint rating and credible intervals: For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty.

- "overall" - Overall assessment - Percentile ranking (0-100%): Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness, importance to knowledge production, and importance to practice.

- "claims_evidence" - Claims, strength and characterization of evidence (0-100%): Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?

- "methods" - Justification, reasonableness, validity, robustness (0-100%): Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this? Did the authors take steps to reduce bias from opportunistic reporting and questionable research practices?

- "advancing_knowledge" - Advancing our knowledge and practice (0-100%): To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions? (Applied stream: please focus on 'improvements that are actually helpful'.) Less weight to "originality and cleverness": Originality and cleverness should be weighted less than the typical journal, because we focus on impact. Papers that apply existing techniques and frameworks more rigorously than previous work or apply them to new areas in ways that provide practical insights for GP (global priorities) and interventions should be highly valued. More weight should be placed on 'contribution to GP' than on 'contribution to the academic field'.
    Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions?
    Does the project add useful value to other impactful research?
    We don't require surprising results; sound and well-presented null results can also be valuable.

- "logic_communication" - Logic and communication (0-100%): Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced? Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow? Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?

- "open_science" - Open, collaborative, replicable research (0-100%): This covers several considerations:
    - Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided? Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?
    - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?
    - Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?

- "global_relevance" - Relevance to global priorities, usefulness for practitioners: Are the paper's chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?
"""

We explain what credible intervals mean and how to construct them. This guidance helps ensure the model produces well-calibrated uncertainty estimates rather than artificially narrow intervals.

Credible interval guidance
PROMPT_UNCERTAINTY = """
The midpoint and 'credible intervals': expressing uncertainty - What are we looking for and why?
- We want policymakers, researchers, funders, and managers to be able to use The Unjournal's evaluations to update their beliefs and make better decisions. To do this well, they need to weigh multiple evaluations against each other and other sources of information. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. But it's hard to quantify statements like "very certain" or "somewhat uncertain" – different people may use the same phrases to mean different things. That's why we're asking for a more precise measure: your credible intervals. These metrics are particularly useful for meta-science and meta-analysis. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.
- How do I come up with these intervals? (Discussion and guidance): You may understand the concepts of uncertainty and credible intervals, but you might be unfamiliar with applying them in a situation like this one. You may have a certain best guess for the "Methods..." criterion. Still, even an expert can never be certain. E.g., you may misunderstand some aspect of the paper, there may be a method you are not familiar with, etc. Your uncertainty over this could be described by some distribution, representing your beliefs about the true value of this criterion. Your "best guess" should be the central mass point of this distribution. For some questions, the "true value" refers to something objective, e.g. will this work be published in a top-ranked journal? In other cases, like the percentile rankings, the true value means "if you had complete evidence, knowledge, and wisdom, what value would you choose?" If you are well calibrated your 90% credible intervals should contain the true value 90% of the time. Consider the midpoint as the 'median of your belief distribution'.
- We also ask for the 'midpoint', the center dot on that slider. Essentially, we are asking for the median of your belief distribution. By this we mean the percentile ranking such that you believe "there's a 50% chance that the paper's true rank is higher than this, and a 50% chance that it actually ranks lower than this."
"""

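Purely as an illustration of this guidance (not part of the prompt), the midpoint and bounds can be read off a belief distribution as its median and 5th/95th percentiles; the Beta shape below is an arbitrary assumption.

from scipy import stats  # illustration only; not used elsewhere in the pipeline

# Hypothetical belief distribution over a paper's percentile rank (scaled to 0-100).
belief = stats.beta(a=8, b=4)            # arbitrary shape chosen for the example
midpoint    = 100 * belief.ppf(0.50)     # median of the belief distribution
lower_bound = 100 * belief.ppf(0.05)     # 5th percentile
upper_bound = 100 * belief.ppf(0.95)     # 95th percentile
print(f"midpoint={midpoint:.0f}, 90% CI=({lower_bound:.0f}, {upper_bound:.0f})")
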
In addition to the percentile metrics, we ask for journal tier predictions on a 0.0–5.0 scale. These provide an external benchmark and, for papers that are eventually published, allow us to assess predictive accuracy.

Journal tier predictions
PROMPT_TIERS = """
Additionally, we ask: What journal ranking tier should and will this work be published in?

To help universities and policymakers make sense of our evaluations, we want to benchmark them against how research is currently judged. So, we would like you to assess the paper in terms of journal rankings. We ask for two assessments:
1. a normative judgment about 'how well the research should publish';
2. a prediction about where the research will be published.
As before, we ask for a 90% credible interval.

Journal ranking tiers are on a 0-5 scale, as follows:
    0/5: "Won't publish/little to no value". Unlikely to be cited by credible researchers
    1/5: OK/Somewhat valuable journal
    2/5: Marginal B-journal/Decent field journal
    3/5: Top B-journal/Strong field journal
    4/5: Marginal A-Journal/Top field journal
    5/5: A-journal/Top journal

- We encourage you to consider a non-integer score, e.g. 4.6 or 2.2. If a paper would be most likely to be (or merits being) published in a journal that would rank about halfway between a top tier 'A journal' and a second tier (4/5) journal, you should rate it a 4.5. Similarly, if you think it has an 80% chance of (being/meriting) publication in a 'marginal B-journal' and a 20% chance of a Top B-journal, you should rate it 2.2. Please also use this continuous scale for providing credible intervals.

- Journal ranking tier "should" (0.0-5.0)
    Assess this paper on the journal ranking scale described above, considering only its merit, giving some weight to the category metrics we discussed above. Equivalently, where would this paper be published if:
    1. the journal process was fair, unbiased, and free of noise, and that status, social connections, and lobbying to get the paper published didn't matter;
    2. journals assessed research according to the category metrics we discussed above.

- Journal ranking tier "will" (0.0-5.0)
    What if this work has already been peer reviewed and published? If this work has already been published, and you know where, please report the prediction you would have given absent that knowledge.
"""

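As a quick check of the worked example embedded in the prompt (illustrative only), the non-integer tier is simply a probability-weighted average over adjacent tiers:

# 80% chance of a marginal B-journal (tier 2), 20% chance of a top B-journal (tier 3).
expected_tier = 0.8 * 2 + 0.2 * 3
print(round(expected_tier, 1))  # 2.2
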
Finally, we include instructions for self-consistency checks and the required JSON output format. The validation section encourages the model to ensure its scores align with its written assessment.

Validation and output format
PROMPT_VALIDATION = """
When you set the quantitative metrics:
- Treat `midpoint` as your 50% belief (the value such that you think there is a 50% chance the true value is higher and 50% chance it is lower).
- Treat `lower_bound` and `upper_bound` as an honest 90% credible interval (roughly the 5th and 95th percentiles of your belief distribution).

For all percentile metrics (0–100 scale):
- You must always satisfy: lower_bound < midpoint < upper_bound.

For the journal tier metrics (0.0–5.0):
- You must always satisfy: ci_lower < score < ci_upper.

Before finalising your JSON:
- Check that your numeric scores are consistent with your own assessment_summary. If your summary describes serious or fundamental problems with methods, evidence, or interpretation, your scores for those metrics (and for "overall") should clearly reflect that.
- Conversely, if you assign very high scores in any metric, your summary should explicitly justify why that aspect of the paper is unusually strong relative to other serious work in the field.
- If you find yourself about to make the lower and upper bounds equal to the midpoint, adjust them so they form a non-degenerate interval that honestly reflects your uncertainty. Do not be afraid to use wide credible intervals when you are genuinely uncertain.
"""

PROMPT_OUTPUT = """
Fill both top-level keys:
- `assessment_summary`: about 1000 words.
- `metrics`: object containing all required metrics.

Field names:
- Percentile metrics → `midpoint`, `lower_bound`, `upper_bound`.
- Tier metrics → `score`, `ci_lower`, `ci_upper`.

Return STRICT JSON matching the supplied schema. No preamble. No markdown. No extra text.
"""

The sections above are concatenated to form the complete system prompt:

Prompt assembly
SYSTEM_PROMPT_COMBINED = "\n".join([
    PROMPT_ROLE,
    PROMPT_DEBIASING,
    PROMPT_DIAGNOSTIC,
    PROMPT_PERCENTILE_INTRO,
    PROMPT_REFERENCE_GROUP,
    PROMPT_METRICS,
    PROMPT_UNCERTAINTY,
    PROMPT_TIERS,
    PROMPT_VALIDATION,
    PROMPT_OUTPUT,
]).strip()

The evaluate_paper function uploads a PDF to the API and submits a background job for evaluation:

Evaluation function
def evaluate_paper(pdf_path: Union[str, pathlib.Path],
                   model: Optional[str] = None,
                   use_reasoning: bool = True) -> Dict[str, Any]:
    model = model or MODEL
    fid = get_file_id(pdf_path, client)

    def _payload():
        p = dict(
            model=model,
            text={"format": TEXT_FORMAT_COMBINED},
            input=[
                {"role": "system", "content": [
                    {"type": "input_text", "text": SYSTEM_PROMPT_COMBINED}
                ]},
                {"role": "user", "content": [
                    {"type": "input_file", "file_id": fid},
                    {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
                ]},
            ],
            max_output_tokens=12000,
            background=True,
            store=True,
        )
        if use_reasoning:
            p["reasoning"] = {"effort": "high", "summary": "auto"}
        return p

    kickoff = call_with_retries(lambda: client.responses.create(**_payload()))
    kd = _resp_as_dict(kickoff)
    return {
        "response_id": kd.get("id"),
        "file_id": fid,
        "status": kd.get("status") or "queued",
        "model": model,
        "created_at": kd.get("created_at"),
    }
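
For a single paper, a submission looks roughly like this (the path is a placeholder); the returned dictionary only holds bookkeeping fields, since the actual output is collected later from the background job:

# Submit one PDF as a background job (path is hypothetical).
job = evaluate_paper("papers/Example_2024.pdf")
print(job["response_id"], job["status"])  # e.g. a response id and "queued"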

Relying on GPT-5 Pro, we use a single‑step call with a reasoning model that supports file input. One step avoids hand‑offs and summary loss from a separate “ingestion” stage. The model reads the whole PDF and produces the JSON defined above. We do not retrieve external sources or cross‑paper material for these scores; the evaluation is anchored in the manuscript itself.

The Python pipeline uploads each PDF once and caches the returned file id keyed by path, size, and modification time. We submit one background job per PDF to the OpenAI Responses API with “high” reasoning effort and server‑side JSON‑Schema enforcement. Submissions record the response id, model id, file id, status, and timestamps.

Kick off background jobs → results/jobs_index.csv
import pathlib, time

ROOT = pathlib.Path(os.getenv("UJ_PAPERS_DIR", "papers")).expanduser()
# Use RUN_DIR for isolated outputs, but jobs_index stays in results/ for monitoring
IDX  = RESULTS_DIR / "jobs_index.csv"

pdfs = sorted(ROOT.glob("*.pdf"))
print("Found PDFs:", [p.name for p in pdfs])

cols = ["paper","pdf","response_id","file_id","model","status","created_at","last_update","collected","error"]
idx = read_csv_or_empty(IDX, columns=cols)
for c in cols:
    if c not in idx.columns: idx[c] = pd.NA

existing = dict(zip(idx["paper"], idx["status"])) if not idx.empty else {}
started = []

for pdf in pdfs:
    paper = pdf.stem
    if existing.get(paper) in ("queued","in_progress","incomplete","requires_action"):
        print(f"skip {pdf.name}: job already running")
        continue
    try:
        job = evaluate_paper(pdf, model=MODEL, use_reasoning=True)
        started.append({
            "paper": paper,
            "pdf": str(pdf),
            "response_id": job.get("response_id"),
            "file_id": job.get("file_id"),
            "model": job.get("model"),
            "status": job.get("status"),
            "created_at": job.get("created_at") or pd.Timestamp.utcnow().isoformat(),
            "last_update": pd.Timestamp.utcnow().isoformat(),
            "collected": False,
            "error": pd.NA,
        })
        print(f"✓ Started job for {pdf.name}, waiting 90s before next submission...")
        time.sleep(90)  # Wait 90s between submissions to avoid TPM rate limits
    except Exception as e:
        print(f"⚠️ kickoff failed for {pdf.name}: {e}")

if started:
    idx = pd.concat([idx, pd.DataFrame(started)], ignore_index=True)
    idx.drop_duplicates(subset=["paper"], keep="last", inplace=True)
    idx.to_csv(IDX, index=False)
    print(f"Started {len(started)} jobs → {IDX}")
else:
    print("No new jobs started.")

We then poll job status and, for each completed job, retrieve the raw JSON response and write it to disk.

Poll status, collect completed outputs, save raw JSON only
import json, pathlib, pandas as pd

# Use RUN_DIR for outputs, jobs_index in RESULTS_DIR for monitoring
IDX = RESULTS_DIR / "jobs_index.csv"
JSN = RUN_DIR / "json"; JSN.mkdir(exist_ok=True)

def _safe_read_csv(path, columns):
    p = pathlib.Path(path)
    if not p.exists():
        return pd.DataFrame(columns=columns)
    try:
        df = pd.read_csv(p, dtype={'error': 'object', 'reasoning_id': 'object'})
    except Exception:
        return pd.DataFrame(columns=columns)
    for c in columns:
        if c not in df.columns:
            df[c] = pd.NA
    return df

cols = [
    "paper","pdf","response_id","file_id","model","status",
    "created_at","last_update","collected","error",
    "reasoning_id","input_tokens","output_tokens","reasoning_tokens",
    "reasoning_summary"
]

idx = _safe_read_csv(IDX, cols)

if idx.empty:
    print("Index is empty.")
else:
    term = {"completed","failed","cancelled","expired"}

    # 1) Refresh statuses
    for i, row in idx.iterrows():
        if str(row.get("status")) in term:
            continue
        try:
            r = client.responses.retrieve(str(row["response_id"]))
            d = _resp_as_dict(r)
            idx.at[i, "status"] = d.get("status")
            idx.at[i, "last_update"] = pd.Timestamp.utcnow().isoformat()
            if d.get("status") in term and d.get("status") != "completed":
                idx.at[i, "error"] = json.dumps(d.get("incomplete_details") or {})
        except Exception as e:
            idx.at[i, "error"] = str(e)

    # 2) Collect fresh completed outputs
    newly_done = idx[(idx["status"] == "completed") & (idx["collected"] == False)]
    print(f"Completed and pending collection: {len(newly_done)}")

    for i, row in newly_done.iterrows():
        rid   = str(row["response_id"])
        paper = str(row["paper"])
        try:
            r = client.responses.retrieve(rid)

            # save full raw response JSON
            with open(JSN / f"{paper}.response.json", "w", encoding="utf-8") as f:
                f.write(json.dumps(_resp_as_dict(r), ensure_ascii=False))

            # optional: stash reasoning meta in jobs_index
            m = _reasoning_meta(r)
            idx.at[i, "collected"]         = True
            idx.at[i, "error"]             = pd.NA
            idx.at[i, "reasoning_id"]      = m.get("reasoning_id")
            idx.at[i, "input_tokens"]      = m.get("input_tokens")
            idx.at[i, "output_tokens"]     = m.get("output_tokens")
            idx.at[i, "reasoning_tokens"]  = m.get("reasoning_tokens")
            idx.at[i, "reasoning_summary"] = m.get("reasoning_summary")

        except Exception as e:
            idx.at[i, "error"] = f"collect: {e}"

    # 3) Save updated index and print progress
    idx.to_csv(IDX, index=False)
    counts = idx["status"].value_counts(dropna=False).to_dict()
    print("Status counts:", counts)
    print(f"Progress: {counts.get('completed', 0)}/{len(idx)} completed")
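
Once jobs are collected, the structured scores can be pulled out of the saved raw responses. The sketch below (the output file name is our choice, not fixed by the pipeline) flattens each {paper}.response.json into one row per paper and metric; percentile metrics fill the midpoint/lower_bound/upper_bound columns, while tier metrics fill score/ci_lower/ci_upper.

# Flatten collected raw responses into a long table: one row per paper x metric.
rows = []
for path in sorted(JSN.glob("*.response.json")):
    raw = json.loads(path.read_text(encoding="utf-8"))
    parsed = _extract_json(_get_output_text(raw))  # the schema-enforced JSON payload
    for metric, vals in parsed.get("metrics", {}).items():
        rows.append({"paper": path.name.replace(".response.json", ""),
                     "metric": metric, **vals})

scores_long = pd.DataFrame(rows)
scores_long.to_csv(RUN_DIR / "llm_scores_long.csv", index=False)
print(scores_long.head())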

Multi-Model Evaluation

To compare model performance, we collect evaluations from multiple providers. This allows us to assess whether different models exhibit systematic biases or calibration differences.

For OpenAI models without reasoning (e.g., GPT-4o), we use a simpler synchronous call:

OpenAI multi-model evaluation (GPT-4o, etc.)
OPENAI_MODELS = [
    # "gpt-4o-2024-11-20",
    "gpt-4o-mini-2024-07-18",  # cheaper, faster
]

def evaluate_paper_sync(pdf_path: pathlib.Path, model: str) -> Dict[str, Any]:
    """Synchronous evaluation for models without background/reasoning support."""
    fid = get_file_id(pdf_path, client)

    resp = call_with_retries(lambda: client.responses.create(
        model=model,
        text={"format": TEXT_FORMAT_COMBINED},
        input=[
            {"role": "system", "content": [
                {"type": "input_text", "text": SYSTEM_PROMPT_COMBINED}
            ]},
            {"role": "user", "content": [
                {"type": "input_file", "file_id": fid},
                {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
            ]},
        ],
        max_output_tokens=12000,
    ))

    return {
        "response_id": resp.id,
        "model": model,
        "output_text": _get_output_text(resp),
        "usage": _resp_as_dict(resp).get("usage", {}),
    }

def run_openai_models(pdfs: list, models: list = OPENAI_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across multiple OpenAI models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_sync(pdf, model)
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

Anthropic’s API differs from OpenAI’s: PDFs are passed as base64-encoded content, and calls are synchronous (no background jobs). We use Claude’s native PDF support.

Anthropic Claude API evaluation
import anthropic
import base64

ANTHROPIC_KEY_PATH = pathlib.Path("key/anthropic_key.txt")
if ANTHROPIC_KEY_PATH.exists():
    os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_KEY_PATH.read_text().strip()

anthropic_client = anthropic.Anthropic()

ANTHROPIC_MODELS = [
    "claude-sonnet-4-20250514",
    # "claude-3-5-haiku-20241022",  # faster, cheaper
]

def evaluate_paper_anthropic(pdf_path: pathlib.Path, model: str) -> Dict[str, Any]:
    """Evaluate using Anthropic's Claude API with native PDF support."""
    pdf_base64 = base64.standard_b64encode(pdf_path.read_bytes()).decode("utf-8")

    resp = call_with_retries(lambda: anthropic_client.messages.create(
        model=model,
        max_tokens=12000,
        system=SYSTEM_PROMPT_COMBINED,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_base64},
                },
                {"type": "text", "text": "Return STRICT JSON per schema. No extra text."}
            ],
        }],
    ))

    output_text = resp.content[0].text if resp.content else ""
    return {
        "model": model,
        "output_text": output_text,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
    }

def run_anthropic_models(pdfs: list, models: list = ANTHROPIC_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across Anthropic models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_anthropic(pdf, model)
                result["parsed"] = _extract_json(result["output_text"])
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

Google’s Gemini API supports PDF input via file upload:

Google Gemini API evaluation
import google.generativeai as genai

GOOGLE_KEY_PATH = pathlib.Path("key/google_key.txt")
if GOOGLE_KEY_PATH.exists():
    genai.configure(api_key=GOOGLE_KEY_PATH.read_text().strip())

GOOGLE_MODELS = [
    "gemini-2.0-flash",
    # "gemini-1.5-pro",
]

def evaluate_paper_google(pdf_path: pathlib.Path, model_name: str) -> Dict[str, Any]:
    """Evaluate using Google's Gemini API."""
    uploaded_file = genai.upload_file(pdf_path, mime_type="application/pdf")

    model = genai.GenerativeModel(model_name=model_name, system_instruction=SYSTEM_PROMPT_COMBINED)
    resp = call_with_retries(lambda: model.generate_content(
        [uploaded_file, "Return STRICT JSON per schema. No extra text."],
        generation_config=genai.GenerationConfig(max_output_tokens=12000, response_mime_type="application/json"),
    ))

    try:
        genai.delete_file(uploaded_file.name)
    except Exception:
        pass

    return {
        "model": model_name,
        "output_text": resp.text or "",
        "input_tokens": getattr(resp.usage_metadata, "prompt_token_count", None),
        "output_tokens": getattr(resp.usage_metadata, "candidates_token_count", None),
    }

def run_google_models(pdfs: list, models: list = GOOGLE_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across Google models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_google(pdf, model)
                result["parsed"] = _extract_json(result["output_text"])
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

A unified runner collects evaluations from all providers:

Run all providers
def run_all_models(pdfs: list = None, out_dir: pathlib.Path = None):
    """Run evaluation across all configured models and providers."""
    pdfs = pdfs or sorted(pathlib.Path("papers").glob("*.pdf"))
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    all_results = []

    for name, runner in [("openai", run_openai_models), ("anthropic", run_anthropic_models), ("google", run_google_models)]:
        print(f"\n=== {name.upper()} ===")
        try:
            df = runner(pdfs, out_dir=out_dir)
            df["provider"] = name
            all_results.append(df)
        except Exception as e:
            print(f"{name} failed: {e}")

    if all_results:
        combined = pd.concat(all_results, ignore_index=True)
        combined.to_csv(out_dir / "multi_model_summary.csv", index=False)
        return combined
    return pd.DataFrame()




run_all_models()

GPT-5.2 Pro Focal Run (with Key Issues)

This section evaluates 14 focal papers using GPT-5.2 Pro with an extended schema that outputs a numbered list of key issues alongside the standard metrics.

Focal run configuration
FOCAL_MODEL = "gpt-5.2-pro-2025-12-11"
FOCAL_RUN_DIR = RESULTS_DIR / "gpt52_pro_focal_jan2026"
FOCAL_RUN_DIR.mkdir(exist_ok=True)
(FOCAL_RUN_DIR / "json").mkdir(exist_ok=True)

FOCAL_PAPERS = [
    "Acemoglu_et_al._2024",
    "Adena_and_Hager_2024",
    "Benabou_et_al._2023",
    "Bilal_and_Kaenzig_2024",
    "Blimpo_and_Castaneda-Dower_2025",
    "Bruers_2021",
    "Clancy_2024",
    "Dullaghan_and_Zhang_2022",
    "Frech_et_al._2023",
    "Green_et_al._2025",
    "McGuire_et_al._2024",
    "Peterman_et_al._2025",
    "Weaver_et_al._2025",
    "Williams_et_al._2024",
]

# Extended schema with key_issues array
COMBINED_SCHEMA_WITH_ISSUES = {
    "type": "object",
    "properties": {
        "assessment_summary": {"type": "string"},
        "key_issues": {
            "type": "array",
            "items": {"type": "string"},
        },
        "metrics": {
            "type": "object",
            "properties": {
                **{m: metric_schema for m in METRICS},
                "tier_should": TIER_METRIC_SCHEMA,
                "tier_will":   TIER_METRIC_SCHEMA,
            },
            "required": METRICS + ["tier_should", "tier_will"],
            "additionalProperties": False,
        },
    },
    "required": ["assessment_summary", "key_issues", "metrics"],
    "additionalProperties": False,
}

TEXT_FORMAT_WITH_ISSUES = {
    "type": "json_schema",
    "name": "paper_assessment_with_key_issues_v1",
    "strict": True,
    "schema": COMBINED_SCHEMA_WITH_ISSUES,
}

# Extended output prompt with key_issues instruction
PROMPT_OUTPUT_WITH_ISSUES = """
Fill all three top-level keys:
- `assessment_summary`: about 1000 words.
- `key_issues`: a numbered list (array of strings) identifying the most important methodological, interpretive, or evidential issues in the paper. Each item should be a concise statement (1-2 sentences) that a reader could use as a checklist. Aim for 5-15 issues depending on the paper. Order from most to least important.
- `metrics`: object containing all required metrics.

Field names:
- Percentile metrics → `midpoint`, `lower_bound`, `upper_bound`.
- Tier metrics → `score`, `ci_lower`, `ci_upper`.

Return STRICT JSON matching the supplied schema. No preamble. No markdown. No extra text.
"""

SYSTEM_PROMPT_WITH_ISSUES = "\n".join([
    PROMPT_ROLE,
    PROMPT_DEBIASING,
    PROMPT_DIAGNOSTIC,
    PROMPT_PERCENTILE_INTRO,
    PROMPT_REFERENCE_GROUP,
    PROMPT_METRICS,
    PROMPT_UNCERTAINTY,
    PROMPT_TIERS,
    PROMPT_VALIDATION,
    PROMPT_OUTPUT_WITH_ISSUES,
]).strip()

Kick off focal paper jobs → gpt52_pro_focal_jan2026/jobs_index.csv
import time

FOCAL_IDX = FOCAL_RUN_DIR / "jobs_index.csv"
cols = ["paper","pdf","response_id","file_id","model","status","created_at","last_update","collected","error"]
idx = read_csv_or_empty(FOCAL_IDX, columns=cols)

existing = dict(zip(idx["paper"], idx["status"])) if not idx.empty else {}
started = []

for paper_name in FOCAL_PAPERS:
    pdf_path = pathlib.Path("papers") / f"{paper_name}.pdf"

    if not pdf_path.exists():
        print(f"⚠️ PDF not found: {pdf_path}")
        continue

    if existing.get(paper_name) in ("queued", "in_progress", "incomplete"):
        print(f"⏭️ Skip {paper_name}: job already running")
        continue

    if existing.get(paper_name) == "completed":
        print(f"✅ Skip {paper_name}: already completed")
        continue

    try:
        fid = get_file_id(pdf_path, client)

        kickoff = call_with_retries(lambda: client.responses.create(
            model=FOCAL_MODEL,
            text={"format": TEXT_FORMAT_WITH_ISSUES},
            input=[
                {"role": "system", "content": [
                    {"type": "input_text", "text": SYSTEM_PROMPT_WITH_ISSUES}
                ]},
                {"role": "user", "content": [
                    {"type": "input_file", "file_id": fid},
                    {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
                ]},
            ],
            max_output_tokens=15000,
            background=True,
            store=True,
            reasoning={"effort": "high", "summary": "detailed"},
        ))
        kd = _resp_as_dict(kickoff)

        started.append({
            "paper": paper_name,
            "pdf": str(pdf_path),
            "response_id": kd.get("id"),
            "file_id": fid,
            "model": FOCAL_MODEL,
            "status": kd.get("status") or "queued",
            "created_at": kd.get("created_at") or pd.Timestamp.utcnow().isoformat(),
            "last_update": pd.Timestamp.utcnow().isoformat(),
            "collected": False,
            "error": pd.NA,
        })
        print(f"✓ Started job for {paper_name}")
        time.sleep(90)  # Wait between submissions

    except Exception as e:
        print(f"❌ Failed {paper_name}: {e}")

if started:
    idx = pd.concat([idx, pd.DataFrame(started)], ignore_index=True)
    idx.drop_duplicates(subset=["paper"], keep="last", inplace=True)
    idx.to_csv(FOCAL_IDX, index=False)
    print(f"\n✓ Started {len(started)} jobs → {FOCAL_IDX}")
else:
    print("No new jobs started.")

Collect focal paper results
FOCAL_IDX = FOCAL_RUN_DIR / "jobs_index.csv"
FOCAL_JSON = FOCAL_RUN_DIR / "json"

idx = pd.read_csv(FOCAL_IDX, dtype={'error': 'object'})
print(f"Polling {len(idx)} focal jobs...")

for i, row in idx.iterrows():
    paper = row["paper"]
    resp_id = row["response_id"]

    if pd.isna(resp_id):
        continue

    json_path = FOCAL_JSON / f"{paper}.response.json"
    if json_path.exists() and row.get("collected") == True:
        continue

    try:
        resp = client.responses.retrieve(resp_id)
        rd = _resp_as_dict(resp)
        status = rd.get("status", "unknown")
        idx.at[i, "status"] = status
        idx.at[i, "last_update"] = pd.Timestamp.utcnow().isoformat()

        if status == "completed":
            with open(json_path, "w") as f:
                json.dump(rd, f, ensure_ascii=False, indent=2, default=str)
            idx.at[i, "collected"] = True

            m = _reasoning_meta(resp)
            idx.at[i, "input_tokens"] = m.get("input_tokens")
            idx.at[i, "output_tokens"] = m.get("output_tokens")
            idx.at[i, "reasoning_tokens"] = m.get("reasoning_tokens")
            print(f"✓ Collected: {paper}")

        elif status == "failed":
            idx.at[i, "error"] = rd.get("error", "Unknown")
            print(f"✗ Failed: {paper}")

        else:
            print(f"⏳ {status}: {paper}")

    except Exception as e:
        print(f"⚠️ Error {paper}: {e}")

idx.to_csv(FOCAL_IDX, index=False)
counts = idx["status"].value_counts().to_dict()
print(f"\nStatus: {counts}")

Key Issues Comparison with Human Critiques

To validate how well the LLM identifies substantive issues, we compare its key_issues output against human expert critiques from The Unjournal’s Coda database. These human critiques were written by domain experts during the standard evaluation process, providing a ground-truth benchmark for issue identification.

The comparison proceeds in two stages: first, we extract and align the data sources; then, optionally, we use an LLM to systematically assess the degree of alignment between machine-generated and human-identified issues.

Key issues comparison: parse markdown and run LLM assessment
import re

# =============================================================================
# Configuration
# =============================================================================
KEY_ISSUES_MD_INPUT = RESULTS_DIR / "key_issues_comparison.md"
KEY_ISSUES_OUTPUT = RESULTS_DIR / "key_issues_comparison.json"

# =============================================================================
# Markdown Parser
# =============================================================================
def parse_key_issues_markdown(md_path):
    """Parse key_issues_comparison.md to extract paper data.

    The markdown has this structure for each paper:
    ## PaperName
    **Coda title:** Title
    ### GPT-5.2 Pro Key Issues
    - Issue 1
    - Issue 2
    ### Human Expert Critiques (Coda)
    Critique text...
    ---
    """
    with open(md_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Split by paper sections (## followed by paper name)
    # Pattern handles names like "Blimpo_and_Castaneda-Dower_2025" (with hyphens)
    paper_sections = re.split(r'\n## ([A-Za-z_0-9-]+(?:_et_al\.)?_\d{4})\n', content)

    matched_data = []
    for i in range(1, len(paper_sections), 2):
        if i + 1 >= len(paper_sections):
            break

        paper_name = paper_sections[i].strip()
        section_content = paper_sections[i + 1]

        # Extract Coda title
        coda_match = re.search(r'\*\*Coda title:\*\*\s*(.+?)(?:\n|$)', section_content)
        coda_title = coda_match.group(1).strip() if coda_match else ""

        # Extract GPT key issues (bullet points after "### GPT" header)
        gpt_section = re.search(
            r'### GPT[^\n]*Key Issues\s*\n(.*?)(?=\n### Human|$)',
            section_content,
            re.DOTALL
        )
        gpt_issues = []
        if gpt_section:
            bullets = re.findall(r'^- (.+)$', gpt_section.group(1), re.MULTILINE)
            gpt_issues = [b.strip() for b in bullets if b.strip()]

        # Extract human critiques (everything after "### Human Expert Critiques")
        human_section = re.search(
            r'### Human Expert Critiques[^\n]*\n(.*?)(?=\n---|\Z)',
            section_content,
            re.DOTALL
        )
        human_critique = human_section.group(1).strip() if human_section else ""

        if gpt_issues or human_critique:
            matched_data.append({
                "gpt_paper": paper_name,
                "coda_title": coda_title,
                "gpt_key_issues": gpt_issues,
                "coda_critique": human_critique,
                "num_gpt_issues": len(gpt_issues),
                "coda_critique_length": len(human_critique),
            })

    return matched_data

# =============================================================================
# LLM Comparison Function
# =============================================================================
COMPARISON_PROMPT_TEMPLATE = """You are evaluating how well an LLM's identified issues align with expert human critiques of a research paper.

## Task
Compare the GPT Key Issues against the Human Expert Critiques below. Assess:

1. **Coverage**: What proportion of the substantive issues raised by human experts are captured by the GPT key issues? (Give percentage estimate)
2. **Precision**: Are the GPT issues relevant and substantive, or does it include spurious/irrelevant issues? (Give percentage of GPT issues that are genuinely relevant)
3. **Missed Issues**: List the most important issues raised by human experts that GPT missed entirely
4. **Extra Issues**: List any important issues GPT identified that humans didn't mention (these could be valid additions or false positives)
5. **Overall Assessment**: Rate alignment as Excellent/Good/Moderate/Poor with 1-2 sentence justification

## Human Expert Critiques
{human_critique}

## GPT Key Issues
{gpt_issues}

Respond in JSON format with these exact fields:
- coverage_pct: number 0-100
- precision_pct: number 0-100
- missed_issues: array of strings
- extra_issues: array of strings
- overall_rating: one of "Excellent", "Good", "Moderate", "Poor"
- overall_justification: 1-2 sentences
- detailed_notes: any additional observations
"""

def compare_issues_with_llm(paper_name, coda_critique, gpt_issues):
    """Use LLM to compare the critiques and assess alignment."""
    # Input validation
    if not gpt_issues:
        return {"error": "No GPT issues", "coverage_pct": None, "precision_pct": None, "overall_rating": "N/A"}
    if not coda_critique or len(coda_critique.strip()) < 20:
        return {"error": "No human critique", "coverage_pct": None, "precision_pct": None, "overall_rating": "N/A"}

    # Format inputs
    gpt_issues_text = "\n".join("- " + issue for issue in gpt_issues)
    prompt_text = COMPARISON_PROMPT_TEMPLATE.format(
        human_critique=coda_critique,
        gpt_issues=gpt_issues_text
    )

    # Call LLM using Responses API (same as main evaluation pipeline)
    response = None
    error_msg = None
    try:
        response = call_with_retries(lambda: client.responses.create(
            model=FOCAL_MODEL,
            text={"format": {"type": "json_object"}},
            input=[
                {"role": "user", "content": [
                    {"type": "input_text", "text": prompt_text}
                ]}
            ],
            reasoning={"effort": "medium", "summary": "auto"},
            max_output_tokens=4000,
        ))
    except Exception as exc:
        error_msg = str(exc)
        print(f"  LLM error: {error_msg}")

    # Parse response - extract output text from responses API format
    if response is not None:
        try:
            output_text = None
            for block in response.output:
                if block.type == "message":
                    for content in block.content:
                        if content.type == "output_text":
                            output_text = content.text
                            break
            if output_text:
                return json.loads(output_text)
            else:
                error_msg = "No output text in response"
                print(f"  {error_msg}")
        except Exception as parse_exc:
            error_msg = f"Parse error: {parse_exc}"
            print(f"  {error_msg}")

    return {"error": error_msg or "Unknown error", "coverage_pct": None, "precision_pct": None, "overall_rating": "Error"}

# =============================================================================
# Step 1: Parse markdown and save JSON
# =============================================================================
if not KEY_ISSUES_MD_INPUT.exists():
    raise FileNotFoundError(f"Input markdown not found: {KEY_ISSUES_MD_INPUT}")

matched_data = parse_key_issues_markdown(KEY_ISSUES_MD_INPUT)

print(f"Parsed {len(matched_data)} papers from {KEY_ISSUES_MD_INPUT.name}")
for item in matched_data:
    print(f"  - {item['gpt_paper']}: {item['num_gpt_issues']} GPT issues, {item['coda_critique_length']} chars human critique")

KEY_ISSUES_OUTPUT.parent.mkdir(parents=True, exist_ok=True)
with open(KEY_ISSUES_OUTPUT, 'w') as f:
    json.dump(matched_data, f, indent=2, ensure_ascii=False)
print(f"\nJSON saved to: {KEY_ISSUES_OUTPUT}")

# =============================================================================
# Step 2: Run LLM comparison on each paper
# =============================================================================
print("\n" + "="*60)
print("Running LLM comparison...")
comparison_results = []

for item in matched_data:
    paper = item['gpt_paper']
    print(f"\nComparing: {paper}")

    comparison = compare_issues_with_llm(
        paper, item['coda_critique'], item['gpt_key_issues']
    )

    comparison_results.append({
        **item,
        "comparison": comparison
    })

    if comparison.get('coverage_pct') is not None:
        print(f"  Coverage: {comparison['coverage_pct']}%, Precision: {comparison['precision_pct']}%, Rating: {comparison['overall_rating']}")
    else:
        print(f"  Skipped or error: {comparison.get('error', 'unknown')}")

    time.sleep(2)  # Rate limiting

# Save results with comparison
comparison_output = KEY_ISSUES_OUTPUT.with_name('key_issues_comparison_results.json')
with open(comparison_output, 'w') as f:
    json.dump(comparison_results, f, indent=2, ensure_ascii=False)
print(f"\nComparison results saved to: {comparison_output}")

# =============================================================================
# Step 3: Summary statistics
# =============================================================================
valid_results = [r for r in comparison_results if r['comparison'].get('coverage_pct') is not None]
if valid_results:
    avg_coverage = sum(r['comparison']['coverage_pct'] for r in valid_results) / len(valid_results)
    avg_precision = sum(r['comparison']['precision_pct'] for r in valid_results) / len(valid_results)
    ratings = [r['comparison']['overall_rating'] for r in valid_results]

    print(f"\n{'='*60}")
    print(f"SUMMARY ({len(valid_results)}/{len(comparison_results)} papers with valid comparisons)")
    print(f"Average Coverage: {avg_coverage:.1f}%")
    print(f"Average Precision: {avg_precision:.1f}%")
    print(f"Rating distribution: {dict((r, ratings.count(r)) for r in set(ratings))}")
else:
    print("\nNo valid comparison results to summarize.")

The comparison pipeline uses a manually curated input and produces structured outputs:

Input:

  - results/key_issues_comparison.md: A manually curated markdown document pairing GPT-identified issues with human expert critiques for each paper.

Outputs:

  - results/key_issues_comparison.json: Parsed data in JSON format for programmatic use.
  - results/key_issues_comparison_results.json: LLM-assessed alignment metrics including coverage, precision, missed issues, and overall ratings.

The coverage metric estimates what fraction of human-identified issues the LLM captured; precision estimates what fraction of LLM issues are genuinely relevant. Together with the qualitative ratings, these provide a structured assessment of how well the LLM’s issue identification aligns with expert judgment.
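
To make these two metrics concrete with made-up numbers: if the human evaluators raised 10 distinct issues, the model listed 12, and 7 of the model's issues correspond to human-raised ones, then coverage is 70% and precision is about 58%. (In the pipeline these percentages are estimated holistically by the comparison LLM rather than computed from an explicit issue matching.)

# Illustrative arithmetic only; the pipeline's percentages come from the comparison LLM.
human_issues, gpt_issues, overlap = 10, 12, 7      # hypothetical counts
coverage_pct  = 100 * overlap / human_issues       # share of human issues captured
precision_pct = 100 * overlap / gpt_issues         # share of GPT issues that are relevant
print(f"coverage={coverage_pct:.0f}%, precision={precision_pct:.0f}%")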


  1. Occasionally they use 1 or 3 evaluators.↩︎

  2. See their guidelines here; these criteria include “Overall assessment”, “Claims, strength and characterization of evidence”, “Methods: Justification, reasonableness, validity, robustness”, “Advancing knowledge and practice”, “Logic and communication”, “Open, collaborative, replicable science”, and “Relevance to global priorities, usefulness for practitioners”.↩︎

  3. “a normative judgment about ‘how well the research should publish’” and “a prediction about where the research will be published”↩︎