Include global setup and parameters
source("setup_params.R")
# get the choices printed

We draw on two main sources:

  1. Human evaluations from The Unjournal’s public evaluation data
  2. LLM‑generated evaluations using a structured JSON‑schema prompt modeled on The Unjournal’s guidelines, comparing several models, including gpt-5-pro-2025-10-06 (knowledge cut-off: 30 September 2024).

Unjournal.org evaluations

We use The Unjournal’s public data for a baseline comparison. At The Unjournal, each paper is typically evaluated (aka ‘reviewed’) by two expert evaluators1 who provide quantitative ratings on a 0–100 percentile scale for each of seven criteria (with 90% credible intervals),2 two “journal tier” ratings on a 0.0–5.0 scale,3 a written evaluation (resembling a referee report for a journal), and an identification and assessment of the paper’s “main claim”. For our initial analysis, we extracted these human ratings and aggregated them, taking the average score per criterion across evaluators (and noting the range of individual scores).
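
A minimal pandas sketch of this aggregation step is shown below; the column names (paper, criterion, evaluator, rating) are placeholders and the actual processing of The Unjournal’s export may differ.

import pandas as pd

# Hypothetical long-format human ratings: one row per evaluator x criterion.
human = pd.DataFrame({
    "paper":     ["P1", "P1", "P1", "P1"],
    "criterion": ["overall", "overall", "methods", "methods"],
    "evaluator": ["A", "B", "A", "B"],
    "rating":    [72, 80, 65, 78],
})

# Average per paper and criterion, noting the range of individual scores.
agg = (human.groupby(["paper", "criterion"])["rating"]
            .agg(mean="mean", low="min", high="max", n="count")
            .reset_index())
print(agg)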

All papers have completed The Unjournal’s evaluation process (meaning the authors received a full evaluation on the Unjournal platform, which has been publicly posted at unjournal.pubpub.org). The sample consists of working papers from 2017–2025 in development economics, growth, health policy, environmental economics, and related fields that The Unjournal identified as high-impact. Each of these papers has quantitative scores from at least one human evaluator, and many have multiple (2–3) human ratings.

LLM-based evaluation

Following The Unjournal’s standard guidelines for evaluators and their academic evaluation form, evaluators are asked to consider each paper along the following dimensions: claims & evidence, methods, advancing knowledge, logic & communication, open science, global relevance, and an overall assessment. Ratings are interpreted as percentiles relative to serious recent work in the same area. For each metric, evaluators are asked for the midpoint of their beliefs and a 90% credible interval to communicate their uncertainty. For the journal rankings measure, we ask both “what journal ranking tier should this work be published in? (0.0–5.0)” and “what journal ranking tier will this work be published in? (0.0–5.0)”, with some further explanation. The full prompt can be seen in the code below – essentially copied from The Unjournal’s guidelines page.

We captured the version of each paper that was evaluated by The Unjournal’s human evaluators, downloading the PDFs from the links provided in The Unjournal’s Coda database.

We evaluate each paper by passing the PDF directly to the model and requiring a strict, machine‑readable JSON output. This keeps the assessment tied to the document the authors wrote. Direct ingestion preserves tables, figures, equations, and sectioning, which ad‑hoc text scraping can mangle. It also avoids silent trimming or segmentation choices that would bias what the model sees.

LLM evaluation pipeline setup
import os, time, json, random, hashlib
import pathlib
from typing import Any, Dict, Optional, Union

import pandas as pd
import numpy as np

import openai
from openai import OpenAI

# ---------- Configuration (in-file, no external deps)
API_KEY_PATH = pathlib.Path(os.getenv("OPENAI_KEY_PATH", "key/openai_key.txt"))
MODEL        = os.getenv("OPENAI_MODEL", "gpt-5-pro-2025-10-06")
FILE_PURPOSE = "assistants"  # for Responses API file inputs

# Run ID for organizing outputs - change this for each new evaluation run
# Set via environment variable or modify directly here
RUN_ID       = os.getenv("UJ_RUN_ID", "gpt5_pro_updated_jan2026")
RESULTS_DIR  = pathlib.Path("results")
RESULTS_DIR.mkdir(exist_ok=True)
RUN_DIR      = RESULTS_DIR / RUN_ID
RUN_DIR.mkdir(exist_ok=True)
FILE_CACHE   = RESULTS_DIR / ".file_cache.json"  # shared across runs

# ---------- API key bootstrap
if os.getenv("OPENAI_API_KEY") is None and API_KEY_PATH.exists():
    os.environ["OPENAI_API_KEY"] = API_KEY_PATH.read_text().strip()
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("No API key. Set OPENAI_API_KEY or create key/openai_key.txt")

client = OpenAI()

# ---------- Small utilities (inlined replacements for llm_utils)

def _resp_as_dict(resp: Any) -> Dict[str, Any]:
    if isinstance(resp, dict):
        return resp
    for attr in ("to_dict", "model_dump", "dict", "json"):
        if hasattr(resp, attr):
            try:
                val = getattr(resp, attr)()
                if isinstance(val, (str, bytes)):
                    try:
                        return json.loads(val)
                    except Exception:
                        pass
                if isinstance(val, dict):
                    return val
            except Exception:
                pass
    # last resort
    try:
        return json.loads(str(resp))
    except Exception:
        return {"_raw": str(resp)}

def _get_output_text(resp: Any) -> str:
    d = _resp_as_dict(resp)
    if "output_text" in d and isinstance(d["output_text"], str):
        return d["output_text"]
    out = d.get("output") or []
    chunks = []
    for item in out:
        if not isinstance(item, dict): continue
        if item.get("type") == "message":
            for c in item.get("content") or []:
                if isinstance(c, dict):
                    if "text" in c and isinstance(c["text"], str):
                        chunks.append(c["text"])
                    elif "output_text" in c and isinstance(c["output_text"], str):
                        chunks.append(c["output_text"])
    # Also check legacy top-level choices-like structures
    if not chunks:
        for k in ("content", "message"):
            v = d.get(k)
            if isinstance(v, str):
                chunks.append(v)
    return "\n".join(chunks).strip()

def _extract_json(s: str) -> Dict[str, Any]:
    """Robustly extract first top-level JSON object from a string."""
    if not s:
        raise ValueError("empty output text")
    # Fast path
    s_stripped = s.strip()
    if s_stripped.startswith("{") and s_stripped.endswith("}"):
        return json.loads(s_stripped)

    # Find first balanced {...} while respecting strings
    start = s.find("{")
    if start == -1:
        raise ValueError("no JSON object start found")
    i = start
    depth = 0
    in_str = False
    esc = False
    for i in range(start, len(s)):
        ch = s[i]
        if in_str:
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        else:
            if ch == '"':
                in_str = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    candidate = s[start:i+1]
                    return json.loads(candidate)
    raise ValueError("no balanced JSON object found")
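
# Illustration (not executed by the pipeline): _extract_json tolerates prose
# surrounding the JSON object, e.g.
#   _extract_json('Here is the assessment: {"overall": {"midpoint": 55}} -- end')
#   returns {'overall': {'midpoint': 55}}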

def call_with_retries(fn, max_tries: int = 6, base_delay: float = 0.8, max_delay: float = 8.0):
    ex = None
    for attempt in range(1, max_tries + 1):
        try:
            return fn()
        except Exception as e:  # broad catch: rate limits, API errors, connection and timeout issues
            ex = e
            sleep = min(max_delay, base_delay * (1.8 ** (attempt - 1))) * (1 + 0.25 * random.random())
            time.sleep(sleep)
    raise ex

def _load_cache() -> Dict[str, Any]:
    if FILE_CACHE.exists():
        try:
            return json.loads(FILE_CACHE.read_text())
        except Exception:
            return {}
    return {}

def _save_cache(cache: Dict[str, Any]) -> None:
    FILE_CACHE.write_text(json.dumps(cache, ensure_ascii=False, indent=2))

def _file_sig(p: pathlib.Path) -> Dict[str, Any]:
    st = p.stat()
    return {"size": st.st_size, "mtime": int(st.st_mtime)}

def get_file_id(path: Union[str, pathlib.Path], client: OpenAI) -> str:
    p = pathlib.Path(path)
    if not p.exists():
        raise FileNotFoundError(p)
    cache = _load_cache()
    key = str(p.resolve())
    sig = _file_sig(p)
    meta = cache.get(key)
    if meta and meta.get("size") == sig["size"] and meta.get("mtime") == sig["mtime"] and meta.get("file_id"):
        return meta["file_id"]
    # Upload fresh
    with open(p, "rb") as fh:
      f = call_with_retries(lambda: client.files.create(file=fh, purpose=FILE_PURPOSE))
    fd = _resp_as_dict(f)
    fid = fd.get("id")
    if not fid:
        raise RuntimeError(f"Upload did not return file id: {fd}")
    cache[key] = {"file_id": fid, **sig}
    _save_cache(cache)
    return fid
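
# For reference, results/.file_cache.json is a small JSON map from resolved PDF
# paths to upload metadata, roughly (values illustrative):
#   {"/abs/path/papers/Example_2024.pdf":
#       {"file_id": "file-abc123", "size": 1048576, "mtime": 1735689600}}
# so unchanged files are never re-uploaded across runs.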

def _reasoning_meta(resp) -> Dict[str, Any]:
    d = _resp_as_dict(resp)
    rid, summary_text = None, None
    out = d.get("output") or []
    if out and isinstance(out, list) and out[0].get("type") == "reasoning":
        rid = out[0].get("id")
        summ = out[0].get("summary") or []
        if summ and isinstance(summ, list):
            summary_text = summ[0].get("text")
    usage = d.get("usage") or {}
    odet  = usage.get("output_tokens_details") or {}
    return {
        "response_id": d.get("id"),
        "reasoning_id": rid,
        "reasoning_summary": summary_text,
        "input_tokens": usage.get("input_tokens"),
        "output_tokens": usage.get("output_tokens"),
        "reasoning_tokens": odet.get("reasoning_tokens"),
    }
    

def read_csv_or_empty(path, columns=None, **kwargs):
    p = pathlib.Path(path)
    if not p.exists():
        return pd.DataFrame(columns=columns or [])
    try:
        df = pd.read_csv(p, **kwargs)
        if df is None or getattr(df, "shape", (0,0))[1] == 0:
            return pd.DataFrame(columns=columns or [])
        return df
    except (pd.errors.EmptyDataError, pd.errors.ParserError, OSError, ValueError):
        return pd.DataFrame(columns=columns or [])    

JSON Schema

We enforce a strict JSON Schema for the output. The model must return one object for each metric, including a midpoint rating and a 90% credible interval. This guarantees that every paper is scored on the same fields with the same types and bounds. We request credible intervals (as we do for human evaluators) to allow the model to communicate its uncertainty rather than suggest false precision.

JSON Schema definition
METRICS = [
    "overall",
    "claims_evidence",
    "methods",
    "advancing_knowledge",
    "logic_communication",
    "open_science",
    "global_relevance",
]

metric_schema = {
    "type": "object",
    "properties": {
        "midpoint":    {"type": "number", "minimum": 0, "maximum": 100},
        "lower_bound": {"type": "number", "minimum": 0, "maximum": 100},
        "upper_bound": {"type": "number", "minimum": 0, "maximum": 100},
    },
    "required": ["midpoint", "lower_bound", "upper_bound"],
    "additionalProperties": False,
}

TIER_METRIC_SCHEMA = {
    "type": "object",
    "properties": {
        "score":   {"type": "number", "minimum": 0, "maximum": 5},
        "ci_lower":{"type": "number", "minimum": 0, "maximum": 5},
        "ci_upper":{"type": "number", "minimum": 0, "maximum": 5},
    },
    "required": ["score", "ci_lower", "ci_upper"],
    "additionalProperties": False,
}

COMBINED_SCHEMA = {
    "type": "object",
    "properties": {
        "assessment_summary": {"type": "string"},
        "metrics": {
            "type": "object",
            "properties": {
                **{m: metric_schema for m in METRICS},
                "tier_should": TIER_METRIC_SCHEMA,
                "tier_will":   TIER_METRIC_SCHEMA,
            },
            "required": METRICS + ["tier_should", "tier_will"],
            "additionalProperties": False,
        },
    },
    "required": ["assessment_summary", "metrics"],
    "additionalProperties": False,
}

TEXT_FORMAT_COMBINED = {
    "type": "json_schema",
    "name": "paper_assessment_with_tiers_v2",
    "strict": True,
    "schema": COMBINED_SCHEMA,
}
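
For concreteness, the sketch below shows a minimal object that satisfies COMBINED_SCHEMA (all numbers are placeholders). The optional check uses the third-party jsonschema package, which the pipeline itself does not require.

example_output = {
    "assessment_summary": "Placeholder diagnostic summary...",
    "metrics": {
        **{m: {"midpoint": 60, "lower_bound": 45, "upper_bound": 75} for m in METRICS},
        "tier_should": {"score": 3.2, "ci_lower": 2.5, "ci_upper": 4.0},
        "tier_will":   {"score": 2.8, "ci_lower": 2.0, "ci_upper": 3.6},
    },
}

# Optional local validation against COMBINED_SCHEMA (requires `pip install jsonschema`).
try:
    import jsonschema
    jsonschema.validate(instance=example_output, schema=COMBINED_SCHEMA)
    print("example_output conforms to COMBINED_SCHEMA")
except ImportError:
    pass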

System Prompt

The system prompt is built from modular sections, each addressing a specific aspect of the evaluation task. We present each section below with brief explanations.

We instruct the model to act as an expert evaluator while explicitly preventing it from using author reputation, publication venue, or any prior knowledge of the paper’s reception as evidence of quality.

Role and debiasing instructions
PROMPT_ROLE = """
Your role -- You are an academic expert as well as a practitioner across every relevant field -- use all your knowledge and insight. You are acting as an expert research evaluator/reviewer.
"""

PROMPT_DEBIASING = """
Do not look at any existing ratings or evaluations of these papers you might find on the internet or in your corpus, do not use the authors' names, status, or institutions in your judgment; ignore where (or whether) the work is published, the prestige of any venue, and how much attention it has received. Do not use this as evidence about quality. You must base all judgments entirely on the content of the PDF.
"""

Before scoring, we ask for a written assessment identifying key issues in the manuscript. This “think first, score second” approach encourages the model to ground its ratings in specific observations.

Diagnostic summary instructions
PROMPT_DIAGNOSTIC = """
Diagnostic summary (Aim for about 1000 words, based only on the PDF):
Provide a compact paragraph that identifies the most important issues you detect in the manuscript itself (e.g., identification threats, data limitations, misinterpretations, internal inconsistencies, missing robustness, replication barriers). Be specific, neutral, and concrete. This summary should precede any scoring and should guide your uncertainty. Output this text in the JSON field `assessment_summary`.
"""

We define what percentile rankings mean and establish the reference group: “serious research in the same area encountered in the last three years.” This anchors the model’s judgments to a consistent baseline.

Percentile scale and reference group
PROMPT_PERCENTILE_INTRO = """
We ask for a set of quantitative metrics, based on your insights. For each metric, we ask for a score and a 90% credible interval. We describe these in detail below.

Percentile rankings relative to a reference group: For some questions, we ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on. Here the population of papers should be all serious research in the same area that you have encountered in the last three years. *Unless this work is in our 'applied and policy stream', in which case the reference group should be "all applied and policy research you have read that is aiming at a similar audience, and that has similar goals".
"""

PROMPT_REFERENCE_GROUP = """
"Serious" research? Academic research?
Here, we are mainly considering research done by professional researchers with high levels of training, experience, and familiarity with recent practice, who have time and resources to devote months or years to each such research project or paper.
These will typically be written as 'working papers' and presented at academic seminars before being submitted to standard academic journals. Although no credential is required, this typically includes people with PhD degrees (or upper-level PhD students). Most of this sort of research is done by full-time academics (professors, post-docs, academic staff, etc.) with a substantial research remit, as well as research staff at think tanks and research institutions (but there may be important exceptions).

What counts as the "same area"?
This is a judgment call. Some criteria to consider... First, does the work come from the same academic field and research subfield, and does it address questions that might be addressed using similar methods? Second, does it deal with the same substantive research question, or a closely related one? If the research you are evaluating is in a very niche topic, the comparison reference group should be expanded to consider work in other areas.

"Research that you have encountered"
We are aiming for comparability across evaluators. If you suspect you are particularly exposed to higher-quality work in this category, compared to other likely evaluators, you may want to adjust your reference group downwards. (And of course vice-versa, if you suspect you are particularly exposed to lower-quality work.)
"""

We define each of the seven percentile metrics, closely following The Unjournal’s guidelines for evaluators. Note the emphasis on global priorities and practical relevance over pure academic novelty.

Metric definitions
PROMPT_METRICS = """
Midpoint rating and credible intervals: For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty.

- "overall" - Overall assessment - Percentile ranking (0-100%): Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness, importance to knowledge production, and importance to practice.

- "claims_evidence" - Claims, strength and characterization of evidence (0-100%): Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?

- "methods" - Justification, reasonableness, validity, robustness (0-100%): Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this? Did the authors take steps to reduce bias from opportunistic reporting and questionable research practices?

- "advancing_knowledge" - Advancing our knowledge and practice (0-100%): To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions? (Applied stream: please focus on 'improvements that are actually helpful'.) Less weight to "originality and cleverness": Originality and cleverness should be weighted less than the typical journal, because we focus on impact. Papers that apply existing techniques and frameworks more rigorously than previous work or apply them to new areas in ways that provide practical insights for GP (global priorities) and interventions should be highly valued. More weight should be placed on 'contribution to GP' than on 'contribution to the academic field'.
    Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions?
    Does the project add useful value to other impactful research?
    We don't require surprising results; sound and well-presented null results can also be valuable.

- "logic_communication" - Logic and communication (0-100%): Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced? Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow? Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?

- "open_science" - Open, collaborative, replicable research (0-100%): This covers several considerations:
    - Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided? Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?
    - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?
    - Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?

- "global_relevance" - Relevance to global priorities, usefulness for practitioners: Are the paper's chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?
"""

We explain what credible intervals mean and how to construct them. This guidance helps ensure the model produces well-calibrated uncertainty estimates rather than artificially narrow intervals.

Credible interval guidance
PROMPT_UNCERTAINTY = """
The midpoint and 'credible intervals': expressing uncertainty - What are we looking for and why?
- We want policymakers, researchers, funders, and managers to be able to use The Unjournal's evaluations to update their beliefs and make better decisions. To do this well, they need to weigh multiple evaluations against each other and other sources of information. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. But it's hard to quantify statements like "very certain" or "somewhat uncertain" – different people may use the same phrases to mean different things. That's why we're asking for a more precise measure: your credible intervals. These metrics are particularly useful for meta-science and meta-analysis. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.
- How do I come up with these intervals? (Discussion and guidance): You may understand the concepts of uncertainty and credible intervals, but you might be unfamiliar with applying them in a situation like this one. You may have a certain best guess for the "Methods..." criterion. Still, even an expert can never be certain. E.g., you may misunderstand some aspect of the paper, there may be a method you are not familiar with, etc. Your uncertainty over this could be described by some distribution, representing your beliefs about the true value of this criterion. Your "best guess" should be the central mass point of this distribution. For some questions, the "true value" refers to something objective, e.g. will this work be published in a top-ranked journal? In other cases, like the percentile rankings, the true value means "if you had complete evidence, knowledge, and wisdom, what value would you choose?" If you are well calibrated your 90% credible intervals should contain the true value 90% of the time. Consider the midpoint as the 'median of your belief distribution'.
- We also ask for the 'midpoint', the center dot on that slider. Essentially, we are asking for the median of your belief distribution. By this we mean the percentile ranking such that you believe "there's a 50% chance that the paper's true rank is higher than this, and a 50% chance that it actually ranks lower than this."
"""

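Purely as an illustration of this guidance (not part of the prompt), the midpoint and bounds can be read off a belief distribution as its median and 5th/95th percentiles; the Beta shape below is an arbitrary assumption.

from scipy import stats  # illustration only; not used elsewhere in the pipeline

# Hypothetical belief distribution over a paper's percentile rank (scaled to 0-100).
belief = stats.beta(a=8, b=4)            # arbitrary shape chosen for the example
midpoint    = 100 * belief.ppf(0.50)     # median of the belief distribution
lower_bound = 100 * belief.ppf(0.05)     # 5th percentile
upper_bound = 100 * belief.ppf(0.95)     # 95th percentile
print(f"midpoint={midpoint:.0f}, 90% CI=({lower_bound:.0f}, {upper_bound:.0f})")
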
In addition to the percentile metrics, we ask for journal tier predictions on a 0.0–5.0 scale. These provide an external benchmark and, for papers that are eventually published, allow us to assess predictive accuracy.

Journal tier predictions
PROMPT_TIERS = """
Additionally, we ask: What journal ranking tier should and will this work be published in?

To help universities and policymakers make sense of our evaluations, we want to benchmark them against how research is currently judged. So, we would like you to assess the paper in terms of journal rankings. We ask for two assessments:
1. a normative judgment about 'how well the research should publish';
2. a prediction about where the research will be published.
As before, we ask for a 90% credible interval.

Journal ranking tiers are on a 0-5 scale, as follows:
    0/5: "Won't publish/little to no value". Unlikely to be cited by credible researchers
    1/5: OK/Somewhat valuable journal
    2/5: Marginal B-journal/Decent field journal
    3/5: Top B-journal/Strong field journal
    4/5: Marginal A-Journal/Top field journal
    5/5: A-journal/Top journal

- We encourage you to consider a non-integer score, e.g. 4.6 or 2.2. If a paper would be most likely to be (or merits being) published in a journal that would rank about halfway between a top tier 'A journal' and a second tier (4/5) journal, you should rate it a 4.5. Similarly, if you think it has an 80% chance of (being/meriting) publication in a 'marginal B-journal' and a 20% chance of a Top B-journal, you should rate it 2.2. Please also use this continuous scale for providing credible intervals.

- Journal ranking tier "should" (0.0-5.0)
    Assess this paper on the journal ranking scale described above, considering only its merit, giving some weight to the category metrics we discussed above. Equivalently, where would this paper be published if:
    1. the journal process was fair, unbiased, and free of noise, and that status, social connections, and lobbying to get the paper published didn't matter;
    2. journals assessed research according to the category metrics we discussed above.

- Journal ranking tier "will" (0.0-5.0)
    What if this work has already been peer reviewed and published? If this work has already been published, and you know where, please report the prediction you would have given absent that knowledge.
"""

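As a quick check of the worked example embedded in the prompt (illustrative only), the non-integer tier is simply a probability-weighted average over adjacent tiers:

# 80% chance of a marginal B-journal (tier 2), 20% chance of a top B-journal (tier 3).
expected_tier = 0.8 * 2 + 0.2 * 3
print(round(expected_tier, 1))  # 2.2
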
Finally, we include instructions for self-consistency checks and the required JSON output format. The validation section encourages the model to ensure its scores align with its written assessment.

Validation and output format
PROMPT_VALIDATION = """
When you set the quantitative metrics:
- Treat `midpoint` as your 50% belief (the value such that you think there is a 50% chance the true value is higher and 50% chance it is lower).
- Treat `lower_bound` and `upper_bound` as an honest 90% credible interval (roughly the 5th and 95th percentiles of your belief distribution).

For all percentile metrics (0–100 scale):
- You must always satisfy: lower_bound < midpoint < upper_bound.

For the journal tier metrics (0.0–5.0):
- You must always satisfy: ci_lower < score < ci_upper.

Before finalising your JSON:
- Check that your numeric scores are consistent with your own assessment_summary. If your summary describes serious or fundamental problems with methods, evidence, or interpretation, your scores for those metrics (and for "overall") should clearly reflect that.
- Conversely, if you assign very high scores in any metric, your summary should explicitly justify why that aspect of the paper is unusually strong relative to other serious work in the field.
- If you find yourself about to make the lower and upper bounds equal to the midpoint, adjust them so they form a non-degenerate interval that honestly reflects your uncertainty. Do not be afraid to use wide credible intervals when you are genuinely uncertain.
"""

PROMPT_OUTPUT = """
Fill both top-level keys:
- `assessment_summary`: about 1000 words.
- `metrics`: object containing all required metrics.

Field names:
- Percentile metrics → `midpoint`, `lower_bound`, `upper_bound`.
- Tier metrics → `score`, `ci_lower`, `ci_upper`.

Return STRICT JSON matching the supplied schema. No preamble. No markdown. No extra text.
"""

The sections above are concatenated to form the complete system prompt:

Prompt assembly
SYSTEM_PROMPT_COMBINED = "\n".join([
    PROMPT_ROLE,
    PROMPT_DEBIASING,
    PROMPT_DIAGNOSTIC,
    PROMPT_PERCENTILE_INTRO,
    PROMPT_REFERENCE_GROUP,
    PROMPT_METRICS,
    PROMPT_UNCERTAINTY,
    PROMPT_TIERS,
    PROMPT_VALIDATION,
    PROMPT_OUTPUT,
]).strip()

The evaluate_paper function uploads a PDF to the API and submits a background job for evaluation:

Evaluation function
def evaluate_paper(pdf_path: Union[str, pathlib.Path],
                   model: Optional[str] = None,
                   use_reasoning: bool = True) -> Dict[str, Any]:
    model = model or MODEL
    fid = get_file_id(pdf_path, client)

    def _payload():
        p = dict(
            model=model,
            text={"format": TEXT_FORMAT_COMBINED},
            input=[
                {"role": "system", "content": [
                    {"type": "input_text", "text": SYSTEM_PROMPT_COMBINED}
                ]},
                {"role": "user", "content": [
                    {"type": "input_file", "file_id": fid},
                    {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
                ]},
            ],
            max_output_tokens=12000,
            background=True,
            store=True,
        )
        if use_reasoning:
            p["reasoning"] = {"effort": "high", "summary": "auto"}
        return p

    kickoff = call_with_retries(lambda: client.responses.create(**_payload()))
    kd = _resp_as_dict(kickoff)
    return {
        "response_id": kd.get("id"),
        "file_id": fid,
        "status": kd.get("status") or "queued",
        "model": model,
        "created_at": kd.get("created_at"),
    }
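
For a single paper, a submission looks roughly like this (the path is a placeholder); the returned dictionary only holds bookkeeping fields, since the actual output is collected later from the background job:

# Submit one PDF as a background job (path is hypothetical).
job = evaluate_paper("papers/Example_2024.pdf")
print(job["response_id"], job["status"])  # e.g. a response id and "queued"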

Relying on GPT-5 Pro, we use a single‑step call with a reasoning model that supports file input. One step avoids hand‑offs and summary loss from a separate “ingestion” stage. The model reads the whole PDF and produces the JSON defined above. We do not retrieve external sources or cross‑paper material for these scores; the evaluation is anchored in the manuscript itself.

The Python pipeline uploads each PDF once and caches the returned file id keyed by path, size, and modification time. We submit one background job per PDF to the OpenAI Responses API with “high” reasoning effort and server‑side JSON‑Schema enforcement. Submissions record the response id, model id, file id, status, and timestamps.

Kick off background jobs → results/jobs_index.csv
import pathlib, time

ROOT = pathlib.Path(os.getenv("UJ_PAPERS_DIR", "papers")).expanduser()
# Use RUN_DIR for isolated outputs, but jobs_index stays in results/ for monitoring
IDX  = RESULTS_DIR / "jobs_index.csv"

pdfs = sorted(ROOT.glob("*.pdf"))
print("Found PDFs:", [p.name for p in pdfs])

cols = ["paper","pdf","response_id","file_id","model","status","created_at","last_update","collected","error"]
idx = read_csv_or_empty(IDX, columns=cols)
for c in cols:
    if c not in idx.columns: idx[c] = pd.NA

existing = dict(zip(idx["paper"], idx["status"])) if not idx.empty else {}
started = []

for pdf in pdfs:
    paper = pdf.stem
    if existing.get(paper) in ("queued","in_progress","incomplete","requires_action"):
        print(f"skip {pdf.name}: job already running")
        continue
    try:
        job = evaluate_paper(pdf, model=MODEL, use_reasoning=True)
        started.append({
            "paper": paper,
            "pdf": str(pdf),
            "response_id": job.get("response_id"),
            "file_id": job.get("file_id"),
            "model": job.get("model"),
            "status": job.get("status"),
            "created_at": job.get("created_at") or pd.Timestamp.utcnow().isoformat(),
            "last_update": pd.Timestamp.utcnow().isoformat(),
            "collected": False,
            "error": pd.NA,
        })
        print(f"✓ Started job for {pdf.name}, waiting 90s before next submission...")
        time.sleep(90)  # Wait 90s between submissions to avoid TPM rate limits
    except Exception as e:
        print(f"⚠️ kickoff failed for {pdf.name}: {e}")

if started:
    idx = pd.concat([idx, pd.DataFrame(started)], ignore_index=True)
    idx.drop_duplicates(subset=["paper"], keep="last", inplace=True)
    idx.to_csv(IDX, index=False)
    print(f"Started {len(started)} jobs → {IDX}")
else:
    print("No new jobs started.")

We then poll job status and, for each completed job, retrieve the raw JSON response and write it to disk.

Poll status, collect completed outputs, save raw JSON only
import json, pathlib, pandas as pd

# Use RUN_DIR for outputs, jobs_index in RESULTS_DIR for monitoring
IDX = RESULTS_DIR / "jobs_index.csv"
JSN = RUN_DIR / "json"; JSN.mkdir(exist_ok=True)

def _safe_read_csv(path, columns):
    p = pathlib.Path(path)
    if not p.exists():
        return pd.DataFrame(columns=columns)
    try:
        df = pd.read_csv(p, dtype={'error': 'object', 'reasoning_id': 'object'})
    except Exception:
        return pd.DataFrame(columns=columns)
    for c in columns:
        if c not in df.columns:
            df[c] = pd.NA
    return df

cols = [
    "paper","pdf","response_id","file_id","model","status",
    "created_at","last_update","collected","error",
    "reasoning_id","input_tokens","output_tokens","reasoning_tokens",
    "reasoning_summary"
]

idx = _safe_read_csv(IDX, cols)

if idx.empty:
    print("Index is empty.")
else:
    term = {"completed","failed","cancelled","expired"}

    # 1) Refresh statuses
    for i, row in idx.iterrows():
        if str(row.get("status")) in term:
            continue
        try:
            r = client.responses.retrieve(str(row["response_id"]))
            d = _resp_as_dict(r)
            idx.at[i, "status"] = d.get("status")
            idx.at[i, "last_update"] = pd.Timestamp.utcnow().isoformat()
            if d.get("status") in term and d.get("status") != "completed":
                idx.at[i, "error"] = json.dumps(d.get("incomplete_details") or {})
        except Exception as e:
            idx.at[i, "error"] = str(e)

    # 2) Collect fresh completed outputs
    newly_done = idx[(idx["status"] == "completed") & (idx["collected"] == False)]
    print(f"Completed and pending collection: {len(newly_done)}")

    for i, row in newly_done.iterrows():
        rid   = str(row["response_id"])
        paper = str(row["paper"])
        try:
            r = client.responses.retrieve(rid)

            # save full raw response JSON
            with open(JSN / f"{paper}.response.json", "w", encoding="utf-8") as f:
                f.write(json.dumps(_resp_as_dict(r), ensure_ascii=False))

            # optional: stash reasoning meta in jobs_index
            m = _reasoning_meta(r)
            idx.at[i, "collected"]         = True
            idx.at[i, "error"]             = pd.NA
            idx.at[i, "reasoning_id"]      = m.get("reasoning_id")
            idx.at[i, "input_tokens"]      = m.get("input_tokens")
            idx.at[i, "output_tokens"]     = m.get("output_tokens")
            idx.at[i, "reasoning_tokens"]  = m.get("reasoning_tokens")
            idx.at[i, "reasoning_summary"] = m.get("reasoning_summary")

        except Exception as e:
            idx.at[i, "error"] = f"collect: {e}"

    # 3) Save updated index and print progress
    idx.to_csv(IDX, index=False)
    counts = idx["status"].value_counts(dropna=False).to_dict()
    print("Status counts:", counts)
    print(f"Progress: {counts.get('completed', 0)}/{len(idx)} completed")
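
Once jobs are collected, the structured scores can be pulled out of the saved raw responses. The sketch below (the output file name is our choice, not fixed by the pipeline) flattens each {paper}.response.json into one row per paper and metric; percentile metrics fill the midpoint/lower_bound/upper_bound columns, while tier metrics fill score/ci_lower/ci_upper.

# Flatten collected raw responses into a long table: one row per paper x metric.
rows = []
for path in sorted(JSN.glob("*.response.json")):
    raw = json.loads(path.read_text(encoding="utf-8"))
    parsed = _extract_json(_get_output_text(raw))  # the schema-enforced JSON payload
    for metric, vals in parsed.get("metrics", {}).items():
        rows.append({"paper": path.name.replace(".response.json", ""),
                     "metric": metric, **vals})

scores_long = pd.DataFrame(rows)
scores_long.to_csv(RUN_DIR / "llm_scores_long.csv", index=False)
print(scores_long.head())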

Multi-Model Evaluation

To compare model performance, we collect evaluations from multiple providers. This allows us to assess whether different models exhibit systematic biases or calibration differences.

For OpenAI models without reasoning (e.g., GPT-4o), we use a simpler synchronous call:

OpenAI multi-model evaluation (GPT-4o, etc.)
OPENAI_MODELS = [
    # "gpt-4o-2024-11-20",
    "gpt-4o-mini-2024-07-18",  # cheaper, faster
]

def evaluate_paper_sync(pdf_path: pathlib.Path, model: str) -> Dict[str, Any]:
    """Synchronous evaluation for models without background/reasoning support."""
    fid = get_file_id(pdf_path, client)

    resp = call_with_retries(lambda: client.responses.create(
        model=model,
        text={"format": TEXT_FORMAT_COMBINED},
        input=[
            {"role": "system", "content": [
                {"type": "input_text", "text": SYSTEM_PROMPT_COMBINED}
            ]},
            {"role": "user", "content": [
                {"type": "input_file", "file_id": fid},
                {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
            ]},
        ],
        max_output_tokens=12000,
    ))

    return {
        "response_id": resp.id,
        "model": model,
        "output_text": _get_output_text(resp),
        "usage": _resp_as_dict(resp).get("usage", {}),
    }

def run_openai_models(pdfs: list, models: list = OPENAI_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across multiple OpenAI models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_sync(pdf, model)
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

Anthropic’s API differs from OpenAI’s: PDFs are passed as base64-encoded content, and calls are synchronous (no background jobs). We use Claude’s native PDF support.

Anthropic Claude API evaluation
import anthropic
import base64

ANTHROPIC_KEY_PATH = pathlib.Path("key/anthropic_key.txt")
if ANTHROPIC_KEY_PATH.exists():
    os.environ["ANTHROPIC_API_KEY"] = ANTHROPIC_KEY_PATH.read_text().strip()

anthropic_client = anthropic.Anthropic()

ANTHROPIC_MODELS = [
    "claude-sonnet-4-20250514",
    # "claude-3-5-haiku-20241022",  # faster, cheaper
]

def evaluate_paper_anthropic(pdf_path: pathlib.Path, model: str) -> Dict[str, Any]:
    """Evaluate using Anthropic's Claude API with native PDF support."""
    pdf_base64 = base64.standard_b64encode(pdf_path.read_bytes()).decode("utf-8")

    resp = call_with_retries(lambda: anthropic_client.messages.create(
        model=model,
        max_tokens=12000,
        system=SYSTEM_PROMPT_COMBINED,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "document",
                    "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_base64},
                },
                {"type": "text", "text": "Return STRICT JSON per schema. No extra text."}
            ],
        }],
    ))

    output_text = resp.content[0].text if resp.content else ""
    return {
        "model": model,
        "output_text": output_text,
        "input_tokens": resp.usage.input_tokens,
        "output_tokens": resp.usage.output_tokens,
    }

def run_anthropic_models(pdfs: list, models: list = ANTHROPIC_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across Anthropic models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_anthropic(pdf, model)
                result["parsed"] = _extract_json(result["output_text"])
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

Google’s Gemini API supports PDF input via file upload:

Google Gemini API evaluation
import google.generativeai as genai

GOOGLE_KEY_PATH = pathlib.Path("key/google_key.txt")
if GOOGLE_KEY_PATH.exists():
    genai.configure(api_key=GOOGLE_KEY_PATH.read_text().strip())

GOOGLE_MODELS = [
    "gemini-2.0-flash",
    # "gemini-1.5-pro",
]

def evaluate_paper_google(pdf_path: pathlib.Path, model_name: str) -> Dict[str, Any]:
    """Evaluate using Google's Gemini API."""
    uploaded_file = genai.upload_file(pdf_path, mime_type="application/pdf")

    model = genai.GenerativeModel(model_name=model_name, system_instruction=SYSTEM_PROMPT_COMBINED)
    resp = call_with_retries(lambda: model.generate_content(
        [uploaded_file, "Return STRICT JSON per schema. No extra text."],
        generation_config=genai.GenerationConfig(max_output_tokens=12000, response_mime_type="application/json"),
    ))

    try:
        genai.delete_file(uploaded_file.name)
    except Exception:
        pass

    return {
        "model": model_name,
        "output_text": resp.text or "",
        "input_tokens": getattr(resp.usage_metadata, "prompt_token_count", None),
        "output_tokens": getattr(resp.usage_metadata, "candidates_token_count", None),
    }

def run_google_models(pdfs: list, models: list = GOOGLE_MODELS, out_dir: pathlib.Path = None):
    """Run evaluation across Google models."""
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    results = []

    for model in models:
        model_dir = out_dir / model.replace("-", "_")
        model_dir.mkdir(parents=True, exist_ok=True)
        json_dir = model_dir / "json"
        json_dir.mkdir(exist_ok=True)

        for pdf in pdfs:
            paper = pdf.stem
            out_file = json_dir / f"{paper}.response.json"
            if out_file.exists():
                print(f"skip {paper} ({model}): exists")
                continue
            try:
                print(f"Evaluating {paper} with {model}...")
                result = evaluate_paper_google(pdf, model)
                result["parsed"] = _extract_json(result["output_text"])
                with open(out_file, "w") as f:
                    json.dump(result, f, ensure_ascii=False, indent=2)
                results.append({"paper": paper, "model": model, "status": "completed"})
                time.sleep(2)
            except Exception as e:
                print(f"Error {paper} ({model}): {e}")
                results.append({"paper": paper, "model": model, "status": "failed", "error": str(e)})

    return pd.DataFrame(results)

A unified runner collects evaluations from all providers:

Run all providers
def run_all_models(pdfs: list = None, out_dir: pathlib.Path = None):
    """Run evaluation across all configured models and providers."""
    pdfs = pdfs or sorted(pathlib.Path("papers").glob("*.pdf"))
    out_dir = out_dir or RESULTS_DIR  # multi-model outputs go to results/{model}/
    all_results = []

    for name, runner in [("openai", run_openai_models), ("anthropic", run_anthropic_models), ("google", run_google_models)]:
        print(f"\n=== {name.upper()} ===")
        try:
            df = runner(pdfs, out_dir=out_dir)
            df["provider"] = name
            all_results.append(df)
        except Exception as e:
            print(f"{name} failed: {e}")

    if all_results:
        combined = pd.concat(all_results, ignore_index=True)
        combined.to_csv(out_dir / "multi_model_summary.csv", index=False)
        return combined
    return pd.DataFrame()




run_all_models()

GPT-5.2 Pro Focal Run (with Key Issues)

This section evaluates 14 focal papers using GPT-5.2 Pro with an extended schema that outputs a numbered list of key issues alongside the standard metrics.

Focal run configuration
FOCAL_MODEL = "gpt-5.2-pro-2025-12-11"
FOCAL_RUN_DIR = RESULTS_DIR / "gpt52_pro_focal_jan2026"
FOCAL_RUN_DIR.mkdir(exist_ok=True)
(FOCAL_RUN_DIR / "json").mkdir(exist_ok=True)

FOCAL_PAPERS = [
    "Acemoglu_et_al._2024",
    "Adena_and_Hager_2024",
    "Benabou_et_al._2023",
    "Bilal_and_Kaenzig_2024",
    "Blimpo_and_Castaneda-Dower_2025",
    "Bruers_2021",
    "Clancy_2024",
    "Dullaghan_and_Zhang_2022",
    "Frech_et_al._2023",
    "Green_et_al._2025",
    "McGuire_et_al._2024",
    "Peterman_et_al._2025",
    "Weaver_et_al._2025",
    "Williams_et_al._2024",
]

# Extended schema with key_issues array
COMBINED_SCHEMA_WITH_ISSUES = {
    "type": "object",
    "properties": {
        "assessment_summary": {"type": "string"},
        "key_issues": {
            "type": "array",
            "items": {"type": "string"},
        },
        "metrics": {
            "type": "object",
            "properties": {
                **{m: metric_schema for m in METRICS},
                "tier_should": TIER_METRIC_SCHEMA,
                "tier_will":   TIER_METRIC_SCHEMA,
            },
            "required": METRICS + ["tier_should", "tier_will"],
            "additionalProperties": False,
        },
    },
    "required": ["assessment_summary", "key_issues", "metrics"],
    "additionalProperties": False,
}

TEXT_FORMAT_WITH_ISSUES = {
    "type": "json_schema",
    "name": "paper_assessment_with_key_issues_v1",
    "strict": True,
    "schema": COMBINED_SCHEMA_WITH_ISSUES,
}

# Extended output prompt with key_issues instruction
PROMPT_OUTPUT_WITH_ISSUES = """
Fill all three top-level keys:
- `assessment_summary`: about 1000 words.
- `key_issues`: a numbered list (array of strings) identifying the most important methodological, interpretive, or evidential issues in the paper. Each item should be a concise statement (1-2 sentences) that a reader could use as a checklist. Aim for 5-15 issues depending on the paper. Order from most to least important.
- `metrics`: object containing all required metrics.

Field names:
- Percentile metrics → `midpoint`, `lower_bound`, `upper_bound`.
- Tier metrics → `score`, `ci_lower`, `ci_upper`.

Return STRICT JSON matching the supplied schema. No preamble. No markdown. No extra text.
"""

SYSTEM_PROMPT_WITH_ISSUES = "\n".join([
    PROMPT_ROLE,
    PROMPT_DEBIASING,
    PROMPT_DIAGNOSTIC,
    PROMPT_PERCENTILE_INTRO,
    PROMPT_REFERENCE_GROUP,
    PROMPT_METRICS,
    PROMPT_UNCERTAINTY,
    PROMPT_TIERS,
    PROMPT_VALIDATION,
    PROMPT_OUTPUT_WITH_ISSUES,
]).strip()

Kick off focal paper jobs → gpt52_pro_focal_jan2026/jobs_index.csv
import time

FOCAL_IDX = FOCAL_RUN_DIR / "jobs_index.csv"
cols = ["paper","pdf","response_id","file_id","model","status","created_at","last_update","collected","error"]
idx = read_csv_or_empty(FOCAL_IDX, columns=cols)

existing = dict(zip(idx["paper"], idx["status"])) if not idx.empty else {}
started = []

for paper_name in FOCAL_PAPERS:
    pdf_path = pathlib.Path("papers") / f"{paper_name}.pdf"

    if not pdf_path.exists():
        print(f"⚠️ PDF not found: {pdf_path}")
        continue

    if existing.get(paper_name) in ("queued", "in_progress", "incomplete"):
        print(f"⏭️ Skip {paper_name}: job already running")
        continue

    if existing.get(paper_name) == "completed":
        print(f"✅ Skip {paper_name}: already completed")
        continue

    try:
        fid = get_file_id(pdf_path, client)

        kickoff = call_with_retries(lambda: client.responses.create(
            model=FOCAL_MODEL,
            text={"format": TEXT_FORMAT_WITH_ISSUES},
            input=[
                {"role": "system", "content": [
                    {"type": "input_text", "text": SYSTEM_PROMPT_WITH_ISSUES}
                ]},
                {"role": "user", "content": [
                    {"type": "input_file", "file_id": fid},
                    {"type": "input_text", "text": "Return STRICT JSON per schema. No extra text."}
                ]},
            ],
            max_output_tokens=15000,
            background=True,
            store=True,
            reasoning={"effort": "high", "summary": "detailed"},
        ))
        kd = _resp_as_dict(kickoff)

        started.append({
            "paper": paper_name,
            "pdf": str(pdf_path),
            "response_id": kd.get("id"),
            "file_id": fid,
            "model": FOCAL_MODEL,
            "status": kd.get("status") or "queued",
            "created_at": kd.get("created_at") or pd.Timestamp.utcnow().isoformat(),
            "last_update": pd.Timestamp.utcnow().isoformat(),
            "collected": False,
            "error": pd.NA,
        })
        print(f"✓ Started job for {paper_name}")
        time.sleep(90)  # Wait between submissions

    except Exception as e:
        print(f"❌ Failed {paper_name}: {e}")

if started:
    idx = pd.concat([idx, pd.DataFrame(started)], ignore_index=True)
    idx.drop_duplicates(subset=["paper"], keep="last", inplace=True)
    idx.to_csv(FOCAL_IDX, index=False)
    print(f"\n✓ Started {len(started)} jobs → {FOCAL_IDX}")
else:
    print("No new jobs started.")

Collect focal paper results
FOCAL_IDX = FOCAL_RUN_DIR / "jobs_index.csv"
FOCAL_JSON = FOCAL_RUN_DIR / "json"

idx = pd.read_csv(FOCAL_IDX, dtype={'error': 'object'})
print(f"Polling {len(idx)} focal jobs...")

for i, row in idx.iterrows():
    paper = row["paper"]
    resp_id = row["response_id"]

    if pd.isna(resp_id):
        continue

    json_path = FOCAL_JSON / f"{paper}.response.json"
    if json_path.exists() and row.get("collected") == True:
        continue

    try:
        resp = client.responses.retrieve(resp_id)
        rd = _resp_as_dict(resp)
        status = rd.get("status", "unknown")
        idx.at[i, "status"] = status
        idx.at[i, "last_update"] = pd.Timestamp.utcnow().isoformat()

        if status == "completed":
            with open(json_path, "w") as f:
                json.dump(rd, f, ensure_ascii=False, indent=2, default=str)
            idx.at[i, "collected"] = True

            m = _reasoning_meta(resp)
            idx.at[i, "input_tokens"] = m.get("input_tokens")
            idx.at[i, "output_tokens"] = m.get("output_tokens")
            idx.at[i, "reasoning_tokens"] = m.get("reasoning_tokens")
            print(f"✓ Collected: {paper}")

        elif status == "failed":
            idx.at[i, "error"] = rd.get("error", "Unknown")
            print(f"✗ Failed: {paper}")

        else:
            print(f"⏳ {status}: {paper}")

    except Exception as e:
        print(f"⚠️ Error {paper}: {e}")

idx.to_csv(FOCAL_IDX, index=False)
counts = idx["status"].value_counts().to_dict()
print(f"\nStatus: {counts}")

Key Issues Comparison with Human Critiques

To validate how well the LLM identifies substantive issues, we compare its key_issues output against human expert critiques from The Unjournal’s Coda database. These human critiques were written by domain experts during the standard evaluation process, providing a ground-truth benchmark for issue identification.

The comparison proceeds in two stages: first, we extract and align the data sources; then, optionally, we use an LLM to systematically assess the degree of alignment between machine-generated and human-identified issues.

Key issues comparison: parse markdown and run LLM assessment
import re

# =============================================================================
# Configuration
# =============================================================================
KEY_ISSUES_MD_INPUT = RESULTS_DIR / "key_issues_comparison.md"
KEY_ISSUES_OUTPUT = RESULTS_DIR / "key_issues_comparison.json"

# =============================================================================
# Markdown Parser
# =============================================================================
def parse_key_issues_markdown(md_path):
    """Parse key_issues_comparison.md to extract paper data.

    The markdown has this structure for each paper:
    ## PaperName
    **Coda title:** Title
    ### GPT-5.2 Pro Key Issues
    - Issue 1
    - Issue 2
    ### Human Expert Critiques (Coda)
    Critique text...
    ---
    """
    with open(md_path, 'r', encoding='utf-8') as f:
        content = f.read()

    # Split by paper sections (## followed by paper name)
    # Pattern handles names like "Blimpo_and_Castaneda-Dower_2025" (with hyphens)
    paper_sections = re.split(r'\n## ([A-Za-z_0-9-]+(?:_et_al\.)?_\d{4})\n', content)

    matched_data = []
    for i in range(1, len(paper_sections), 2):
        if i + 1 >= len(paper_sections):
            break

        paper_name = paper_sections[i].strip()
        section_content = paper_sections[i + 1]

        # Extract Coda title
        coda_match = re.search(r'\*\*Coda title:\*\*\s*(.+?)(?:\n|$)', section_content)
        coda_title = coda_match.group(1).strip() if coda_match else ""

        # Extract GPT key issues (bullet points after "### GPT" header)
        gpt_section = re.search(
            r'### GPT[^\n]*Key Issues\s*\n(.*?)(?=\n### Human|$)',
            section_content,
            re.DOTALL
        )
        gpt_issues = []
        if gpt_section:
            bullets = re.findall(r'^- (.+)$', gpt_section.group(1), re.MULTILINE)
            gpt_issues = [b.strip() for b in bullets if b.strip()]

        # Extract human critiques (everything after "### Human Expert Critiques")
        human_section = re.search(
            r'### Human Expert Critiques[^\n]*\n(.*?)(?=\n---|\Z)',
            section_content,
            re.DOTALL
        )
        human_critique = human_section.group(1).strip() if human_section else ""

        if gpt_issues or human_critique:
            matched_data.append({
                "gpt_paper": paper_name,
                "coda_title": coda_title,
                "gpt_key_issues": gpt_issues,
                "coda_critique": human_critique,
                "num_gpt_issues": len(gpt_issues),
                "coda_critique_length": len(human_critique),
            })

    return matched_data

# =============================================================================
# LLM Comparison Function
# =============================================================================
COMPARISON_PROMPT_TEMPLATE = """You are evaluating how well an LLM's identified issues align with expert human critiques of a research paper.

## Task
Compare the GPT Key Issues against the Human Expert Critiques below. Assess:

1. **Coverage**: What proportion of the substantive issues raised by human experts are captured by the GPT key issues? (Give percentage estimate)
2. **Precision**: Are the GPT issues relevant and substantive, or does it include spurious/irrelevant issues? (Give percentage of GPT issues that are genuinely relevant)
3. **Missed Issues**: List the most important issues raised by human experts that GPT missed entirely
4. **Extra Issues**: List any important issues GPT identified that humans didn't mention (these could be valid additions or false positives)
5. **Overall Assessment**: Rate alignment as Excellent/Good/Moderate/Poor with 1-2 sentence justification

## Human Expert Critiques
{human_critique}

## GPT Key Issues
{gpt_issues}

Respond in JSON format with these exact fields:
- coverage_pct: number 0-100
- precision_pct: number 0-100
- missed_issues: array of strings
- extra_issues: array of strings
- overall_rating: one of "Excellent", "Good", "Moderate", "Poor"
- overall_justification: 1-2 sentences
- detailed_notes: any additional observations
"""

def compare_issues_with_llm(paper_name, coda_critique, gpt_issues):
    """Use LLM to compare the critiques and assess alignment."""
    # Input validation
    if not gpt_issues:
        return {"error": "No GPT issues", "coverage_pct": None, "precision_pct": None, "overall_rating": "N/A"}
    if not coda_critique or len(coda_critique.strip()) < 20:
        return {"error": "No human critique", "coverage_pct": None, "precision_pct": None, "overall_rating": "N/A"}

    # Format inputs
    gpt_issues_text = "\n".join("- " + issue for issue in gpt_issues)
    prompt_text = COMPARISON_PROMPT_TEMPLATE.format(
        human_critique=coda_critique,
        gpt_issues=gpt_issues_text
    )

    # Call LLM using Responses API (same as main evaluation pipeline)
    response = None
    error_msg = None
    try:
        response = call_with_retries(lambda: client.responses.create(
            model=FOCAL_MODEL,
            text={"format": {"type": "json_object"}},
            input=[
                {"role": "user", "content": [
                    {"type": "input_text", "text": prompt_text}
                ]}
            ],
            reasoning={"effort": "medium", "summary": "auto"},
            max_output_tokens=4000,
        ))
    except Exception as exc:
        error_msg = str(exc)
        print(f"  LLM error: {error_msg}")

    # Parse response - extract output text from responses API format
    if response is not None:
        try:
            output_text = None
            for block in response.output:
                if block.type == "message":
                    for content in block.content:
                        if content.type == "output_text":
                            output_text = content.text
                            break
            if output_text:
                return json.loads(output_text)
            else:
                error_msg = "No output text in response"
                print(f"  {error_msg}")
        except Exception as parse_exc:
            error_msg = f"Parse error: {parse_exc}"
            print(f"  {error_msg}")

    return {"error": error_msg or "Unknown error", "coverage_pct": None, "precision_pct": None, "overall_rating": "Error"}

# =============================================================================
# Step 1: Parse markdown and save JSON
# =============================================================================
if not KEY_ISSUES_MD_INPUT.exists():
    raise FileNotFoundError(f"Input markdown not found: {KEY_ISSUES_MD_INPUT}")

matched_data = parse_key_issues_markdown(KEY_ISSUES_MD_INPUT)

print(f"Parsed {len(matched_data)} papers from {KEY_ISSUES_MD_INPUT.name}")
for item in matched_data:
    print(f"  - {item['gpt_paper']}: {item['num_gpt_issues']} GPT issues, {item['coda_critique_length']} chars human critique")

KEY_ISSUES_OUTPUT.parent.mkdir(parents=True, exist_ok=True)
with open(KEY_ISSUES_OUTPUT, 'w') as f:
    json.dump(matched_data, f, indent=2, ensure_ascii=False)
print(f"\nJSON saved to: {KEY_ISSUES_OUTPUT}")

# =============================================================================
# Step 2: Run LLM comparison on each paper
# =============================================================================
print("\n" + "="*60)
print("Running LLM comparison...")
comparison_results = []

for item in matched_data:
    paper = item['gpt_paper']
    print(f"\nComparing: {paper}")

    comparison = compare_issues_with_llm(
        paper, item['coda_critique'], item['gpt_key_issues']
    )

    comparison_results.append({
        **item,
        "comparison": comparison
    })

    if comparison.get('coverage_pct') is not None:
        print(f"  Coverage: {comparison['coverage_pct']}%, Precision: {comparison['precision_pct']}%, Rating: {comparison['overall_rating']}")
    else:
        print(f"  Skipped or error: {comparison.get('error', 'unknown')}")

    time.sleep(2)  # Rate limiting

# Save results with comparison
comparison_output = KEY_ISSUES_OUTPUT.with_name('key_issues_comparison_results.json')
with open(comparison_output, 'w') as f:
    json.dump(comparison_results, f, indent=2, ensure_ascii=False)
print(f"\nComparison results saved to: {comparison_output}")

# =============================================================================
# Step 3: Summary statistics
# =============================================================================
valid_results = [r for r in comparison_results if r['comparison'].get('coverage_pct') is not None]
if valid_results:
    avg_coverage = sum(r['comparison']['coverage_pct'] for r in valid_results) / len(valid_results)
    avg_precision = sum(r['comparison']['precision_pct'] for r in valid_results) / len(valid_results)
    ratings = [r['comparison']['overall_rating'] for r in valid_results]

    print(f"\n{'='*60}")
    print(f"SUMMARY ({len(valid_results)}/{len(comparison_results)} papers with valid comparisons)")
    print(f"Average Coverage: {avg_coverage:.1f}%")
    print(f"Average Precision: {avg_precision:.1f}%")
    print(f"Rating distribution: {dict((r, ratings.count(r)) for r in set(ratings))}")
else:
    print("\nNo valid comparison results to summarize.")

The comparison pipeline uses a manually curated input and produces structured outputs:

Input:

  - results/key_issues_comparison.md: A manually curated markdown document pairing GPT-identified issues with human expert critiques for each paper.

Outputs:

  - results/key_issues_comparison.json: Parsed data in JSON format for programmatic use.
  - results/key_issues_comparison_results.json: LLM-assessed alignment metrics including coverage, precision, missed issues, and overall ratings.

The coverage metric estimates what fraction of human-identified issues the LLM captured; precision estimates what fraction of LLM issues are genuinely relevant. Together with the qualitative ratings, these provide a structured assessment of how well the LLM’s issue identification aligns with expert judgment.
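
To make these two metrics concrete with made-up numbers: if the human evaluators raised 10 distinct issues, the model listed 12, and 7 of the model's issues correspond to human-raised ones, then coverage is 70% and precision is about 58%. (In the pipeline these percentages are estimated holistically by the comparison LLM rather than computed from an explicit issue matching.)

# Illustrative arithmetic only; the pipeline's percentages come from the comparison LLM.
human_issues, gpt_issues, overlap = 10, 12, 7      # hypothetical counts
coverage_pct  = 100 * overlap / human_issues       # share of human issues captured
precision_pct = 100 * overlap / gpt_issues         # share of GPT issues that are relevant
print(f"coverage={coverage_pct:.0f}%, precision={precision_pct:.0f}%")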


  1. Occasionally they use 1 or 3 evaluators.↩︎

  2. See their guidelines here; these criteria include “Overall assessment”, “Claims, strength and characterization of evidence”, “Methods: Justification, reasonableness, validity, robustness”, “Advancing knowledge and practice”, “Logic and communication”, “Open, collaborative, replicable science”, and “Relevance to global priorities, usefulness for practitioners”.↩︎

  3. “a normative judgment about ‘how well the research should publish’” and “a prediction about where the research will be published”↩︎