We evaluate each paper along the following dimensions: claims & evidence, methods, logic & communication, open science, global relevance, and an overall assessment. Ratings are interpreted as percentiles relative to serious recent work in the same area. For each metric the model returns a midpoint and a 90% credible interval to communicate uncertainty.
import os, pathlib, json, textwrap, reimport pdfplumber, time, tiktokenimport pandas as pd, numpy as npfrom openai import OpenAIimport altair as alt; alt.renderers.enable("html")import plotly.io as pioimport plotly.graph_objects as gofrom tqdm import tqdm from IPython.display import Markdown, display# Setup chunk:# * loads the main libraries (OpenAI SDK, Altair, Plotly, …)# * looks for your OpenAI key in **key/openai_key.txt** # * initialises a client (`gpt-5`, via `UJ_MODEL`)# * defines `pdf_to_string()` — drops the reference section and hard‑caps at 180k tokens so we stay in‑context.# **API key hygiene.** # The code only *reads* `key/openai_key.txt` if the environment variable isn’t already set. Keep this file out of version control (adjust gitignore).# ---------- API key ----------key_path = pathlib.Path("key/openai_key.txt")if os.getenv("OPENAI_API_KEY") isNoneand key_path.exists(): os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()ifnot os.getenv("OPENAI_API_KEY"):raiseValueError("No API key found. Put it in key/openai_key.txt or set OPENAI_API_KEY.")client = OpenAI()# ---------- Model (GPT‑5) ----------MODEL = os.getenv("UJ_MODEL", "gpt-5")# ---------- PDF → text with a fixed tokenizer ----------_enc = tiktoken.get_encoding("o200k_base") # large-context encoderdef pdf_to_string(path, max_tokens=180_000):"""Extract text, drop references section, hard-cap by tokens."""with pdfplumber.open(path) as pdf: text =" ".join(p.extract_text() or""for p in pdf.pages) text = re.sub(r"\s+", " ", text) m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)if m: text = text[: m.start()] toks = _enc.encode(text)iflen(toks) > max_tokens: text = _enc.decode(toks[:max_tokens])return text# Response schema & system prompt:# This section defines the *contract* we expect back from the model.# - **`METRICS`** lists the categories we score.# - **`metric_schema`** specifies the shape and bounds of each metric.# - **`response_format`** (JSON Schema) asks the model to return **only** a JSON# object conforming to that schema.# - **`SYSTEM_PROMPT`** explains scoring semantics (percentiles + 90% credible# intervals) and how to set the `overall` score.# -----------------------------# 1. Metric list# -----------------------------METRICS = ["overall","claims_evidence","methods","advancing_knowledge","logic_communication","open_science","global_relevance"]# -----------------------------# 2. JSON schema# -----------------------------metric_schema = {"type": "object","properties": {"midpoint": {"type": "integer", "minimum": 0, "maximum": 100},"lower_bound": {"type": "integer", "minimum": 0, "maximum": 100},"upper_bound": {"type": "integer", "minimum": 0, "maximum": 100},"rationale": {"type": "string"} },"required": ["midpoint", "lower_bound", "upper_bound", "rationale"],"additionalProperties": False}response_format = {"type": "json_schema","json_schema": {"name": "paper_assessment_v1","strict": True,"schema": {"type": "object","properties": {"metrics": {"type": "object","properties": {m: metric_schema for m in METRICS},"required": METRICS,"additionalProperties": False } },"required": ["metrics"],"additionalProperties": False } }}# -----------------------------# 3. System prompt# -----------------------------SYSTEM_PROMPT = textwrap.dedent(f"""You are an expert evaluator.We ask for a set of quantitative metrics. For each metric, we ask for a score and a 90% credible interval.Percentile rankingsWe ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on.The population of papers should be all serious research in the same area that you have encountered in the last three years.Midpoint rating and credible intervals For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. We want policymakers, researchers, funders, and managers to be able to use evaluations to update their beliefs and make better decisions. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.Overall assessment- Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.importance to knowledge production, and importance to practice. Claims, strength and characterization of evidence- Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?Methods: Justification, reasonableness, validity, robustness- Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? - Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?- Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, did they do a strong pre-registration and pre-analysis plan, incorporate multiple hypothesis testing corrections, and report flexible specifications? Advancing our knowledge and practice- To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?- Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions? - Does the project add useful value to other impactful research? We don't require surprising results; sound and well-presented null results can also be valuable.Logic and communication- Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?- Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?- Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? - Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?Open, collaborative, replicable research- Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?- Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?? - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?- Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?Relevance to global priorities, usefulness for practitioners- Are the paper’s chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? - Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? - Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? - Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?Return STRICT JSON matching the supplied schema.Fill every key in the object `metrics`:{', '.join(METRICS)}Definitions are percentile scores (0 – 100) versus serious work in the field from the last 3 years.For `overall`: • Default = arithmetic mean of the other six midpoints (rounded). • If, in your judgment, another value is better (e.g. one metric is far more decision-relevant), choose it **and explain why** in `overall.rationale`.Field meanings midpoint → best-guess percentile lower_bound → 5th-percentile plausible value upper_bound → 95th-percentile plausible value rationale → ≤100 words; terse but informative.Do **not** wrap the JSON in markdown fences or add extra text.""").strip()# Below is the main helper that runs a single evaluation:# - extracts text with `pdf_to_string()`,# - calls **Chat Completions** with the strict JSON Schema in `response_format`,# - parses the returned JSON string into a Python dict.def evaluate_paper(pdf_path: str| pathlib.Path, model: str= MODEL) ->dict: paper_text = pdf_to_string(pdf_path) chat = client.chat.completions.create( model=model, # "gpt-5" response_format=response_format, # your JSON-Schema defined above messages=[ {"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": paper_text} ],# max_tokens=2500, # optional cap for the JSON block ) raw_json = chat.choices[0].message.contentreturn json.loads(raw_json)# Minimal probe using Chat Completions (works on gpt-5 and o3)# A quick “smoke test” to confirm the client is alive and willing to emit JSON.# This intentionally uses the simple `json_object` format for wide model compatibility.# MODEL = "gpt-5" # "o3" # probe = client.chat.completions.create(# model=MODEL,# messages=[# {"role": "system", "content": "Return only valid JSON: {\"ok\": true}"},# {"role": "user", "content": "Return {\"ok\": true} only."}# ],# # Wide compatibility: ask for a JSON object (no schema)# response_format={"type": "json_object"},# )# print(probe.choices[0].message.content) # should print {"ok": true}
:::
We iterate over papers/*.pdf, collect raw results and write a long CSV to results/metrics_long.csv.
Show code
# Batch-evaluate all PDFs# This loop walks every `*.pdf` in `papers/`, calls `evaluate_paper()`,# pads a **0.8 s** sleep for rate‑limits, and collects the raw JSON dicts in memory# (`records`). We *don’t* write one JSON file per paper here; instead, the nested `records` structure is converted to a tidy long format and# written to `results/metrics_long.csv`.ROOT = pathlib.Path("papers") # put your PDFs hereOUT = pathlib.Path("results")OUT.mkdir(exist_ok=True)pdfs =sorted(ROOT.glob("*.pdf"))records = []for pdf in tqdm(pdfs, desc="Metrics"):try: res = evaluate_paper(pdf) # <-- res["paper"] = pdf.stem records.append(res) time.sleep(0.8) exceptExceptionas e:print(f"⚠️ {pdf.name}: {e}")tidy_rows = []for rec in records: paper_id = rec["paper"]for metric, vals in rec["metrics"].items(): tidy_rows.append({"paper": paper_id,"metric": metric,**vals # midpoint, lower_bound, upper_bound, rationale })tidy = pd.DataFrame(tidy_rows)tidy.to_csv(OUT /"metrics_long.csv", index=False)# tidy.head()p = pathlib.Path("results/metrics_long.csv")ifnot p.exists(): display(Markdown("::: {.callout-warning}\n**No results yet.** Run the batch evaluation to generate `results/metrics_long.csv`.\n:::"))else: tidy = pd.read_csv(p)# Basic counts n_papers = tidy["paper"].nunique() n_rows =len(tidy) n_metrics = tidy["metric"].nunique()# Overall-only slice overall = tidy[tidy["metric"] =="overall"].copy() mean_overall = overall["midpoint"].mean() ifnot overall.empty else np.nan med_overall = overall["midpoint"].median() ifnot overall.empty else np.nan display(Markdown(f"**Batch size:** {n_papers} papers \n"f"**Total metric rows:** {n_rows:,} across {n_metrics} metrics \n"f"**Overall ratings:** mean = {mean_overall:.1f}, median = {med_overall:.1f}" ))
# Numerical ratingsWe evaluate each paper along the following dimensions: **claims & evidence**, **methods**, **logic & communication**, **open science**, **global relevance**, and an **overall** assessment. Ratings are interpreted as percentiles relative to serious recent work in the same area. For each metric the model returns a midpoint and a 90% credible interval to communicate uncertainty.```{python}#| label: "API setup (Chat Completions + safe tokenizer)"import os, pathlib, json, textwrap, reimport pdfplumber, time, tiktokenimport pandas as pd, numpy as npfrom openai import OpenAIimport altair as alt; alt.renderers.enable("html")import plotly.io as pioimport plotly.graph_objects as gofrom tqdm import tqdm from IPython.display import Markdown, display# Setup chunk:# * loads the main libraries (OpenAI SDK, Altair, Plotly, …)# * looks for your OpenAI key in **key/openai_key.txt** # * initialises a client (`gpt-5`, via `UJ_MODEL`)# * defines `pdf_to_string()` — drops the reference section and hard‑caps at 180k tokens so we stay in‑context.# **API key hygiene.** # The code only *reads* `key/openai_key.txt` if the environment variable isn’t already set. Keep this file out of version control (adjust gitignore).# ---------- API key ----------key_path = pathlib.Path("key/openai_key.txt")if os.getenv("OPENAI_API_KEY") isNoneand key_path.exists(): os.environ["OPENAI_API_KEY"] = key_path.read_text().strip()ifnot os.getenv("OPENAI_API_KEY"):raiseValueError("No API key found. Put it in key/openai_key.txt or set OPENAI_API_KEY.")client = OpenAI()# ---------- Model (GPT‑5) ----------MODEL = os.getenv("UJ_MODEL", "gpt-5")# ---------- PDF → text with a fixed tokenizer ----------_enc = tiktoken.get_encoding("o200k_base") # large-context encoderdef pdf_to_string(path, max_tokens=180_000):"""Extract text, drop references section, hard-cap by tokens."""with pdfplumber.open(path) as pdf: text =" ".join(p.extract_text() or""for p in pdf.pages) text = re.sub(r"\s+", " ", text) m = re.search(r"\b(References|Bibliography)\b", text, flags=re.I)if m: text = text[: m.start()] toks = _enc.encode(text)iflen(toks) > max_tokens: text = _enc.decode(toks[:max_tokens])return text# Response schema & system prompt:# This section defines the *contract* we expect back from the model.# - **`METRICS`** lists the categories we score.# - **`metric_schema`** specifies the shape and bounds of each metric.# - **`response_format`** (JSON Schema) asks the model to return **only** a JSON# object conforming to that schema.# - **`SYSTEM_PROMPT`** explains scoring semantics (percentiles + 90% credible# intervals) and how to set the `overall` score.# -----------------------------# 1. Metric list# -----------------------------METRICS = ["overall","claims_evidence","methods","advancing_knowledge","logic_communication","open_science","global_relevance"]# -----------------------------# 2. JSON schema# -----------------------------metric_schema = {"type": "object","properties": {"midpoint": {"type": "integer", "minimum": 0, "maximum": 100},"lower_bound": {"type": "integer", "minimum": 0, "maximum": 100},"upper_bound": {"type": "integer", "minimum": 0, "maximum": 100},"rationale": {"type": "string"} },"required": ["midpoint", "lower_bound", "upper_bound", "rationale"],"additionalProperties": False}response_format = {"type": "json_schema","json_schema": {"name": "paper_assessment_v1","strict": True,"schema": {"type": "object","properties": {"metrics": {"type": "object","properties": {m: metric_schema for m in METRICS},"required": METRICS,"additionalProperties": False } },"required": ["metrics"],"additionalProperties": False } }}# -----------------------------# 3. System prompt# -----------------------------SYSTEM_PROMPT = textwrap.dedent(f"""You are an expert evaluator.We ask for a set of quantitative metrics. For each metric, we ask for a score and a 90% credible interval.Percentile rankingsWe ask for a percentile ranking from 0-100%. This represents "what proportion of papers in the reference group are worse than this paper, by this criterion". A score of 100% means this is essentially the best paper in the reference group. 0% is the worst paper. A score of 50% means this is the median paper; i.e., half of all papers in the reference group do this better, and half do this worse, and so on.The population of papers should be all serious research in the same area that you have encountered in the last three years.Midpoint rating and credible intervals For each metric, we ask you to provide a 'midpoint rating' and a 90% credible interval as a measure of your uncertainty. We want policymakers, researchers, funders, and managers to be able to use evaluations to update their beliefs and make better decisions. Evaluators may feel confident about their rating for one category, but less confident in another area. How much weight should readers give to each? In this context, it is useful to quantify the uncertainty. You are asked to give a 'midpoint' and a 90% credible interval. Consider this as the smallest interval that you believe is 90% likely to contain the true value.Overall assessment- Judge the quality of the research heuristically. Consider all aspects of quality, credibility, importance to future impactful applied research, and practical relevance and usefulness.importance to knowledge production, and importance to practice. Claims, strength and characterization of evidence- Do the authors do a good job of (i) stating their main questions and claims, (ii) providing strong evidence and powerful approaches to inform these, and (iii) correctly characterizing the nature of their evidence?Methods: Justification, reasonableness, validity, robustness- Are the methods used well-justified and explained; are they a reasonable approach to answering the question(s) in this context? Are the underlying assumptions reasonable? - Are the results and methods likely to be robust to reasonable changes in the underlying assumptions? Does the author demonstrate this?- Avoiding bias and questionable research practices (QRP): Did the authors take steps to reduce bias from opportunistic reporting and QRP? For example, did they do a strong pre-registration and pre-analysis plan, incorporate multiple hypothesis testing corrections, and report flexible specifications? Advancing our knowledge and practice- To what extent does the project contribute to the field or to practice, particularly in ways that are relevant to global priorities and impactful interventions?- Do the paper's insights inform our beliefs about important parameters and about the effectiveness of interventions? - Does the project add useful value to other impactful research? We don't require surprising results; sound and well-presented null results can also be valuable.Logic and communication- Are the goals and questions of the paper clearly expressed? Are concepts clearly defined and referenced?- Is the reasoning "transparent"? Are assumptions made explicit? Are all logical steps clear and correct? Does the writing make the argument easy to follow?- Are the conclusions consistent with the evidence (or formal proofs) presented? Do the authors accurately state the nature of their evidence, and the extent it supports their main claims? - Are the data and/or analysis presented relevant to the arguments made? Are the tables, graphs, and diagrams easy to understand in the context of the narrative (e.g., no major errors in labeling)?Open, collaborative, replicable research- Replicability, reproducibility, data integrity: Would another researcher be able to perform the same analysis and get the same results? Are the methods explained clearly and in enough detail to enable easy and credible replication? For example, are all analyses and statistical tests explained, and is code provided?- Is the source of the data clear? Is the data made as available as is reasonably possible? If so, is it clearly labeled and explained?? - Consistency: Do the numbers in the paper and/or code output make sense? Are they internally consistent throughout the paper?- Useful building blocks: Do the authors provide tools, resources, data, and outputs that might enable or enhance future work and meta-analysis?Relevance to global priorities, usefulness for practitioners- Are the paper’s chosen topic and approach likely to be useful to global priorities, cause prioritization, and high-impact interventions? - Does the paper consider real-world relevance and deal with policy and implementation questions? Are the setup, assumptions, and focus realistic? - Do the authors report results that are relevant to practitioners? Do they provide useful quantified estimates (costs, benefits, etc.) enabling practical impact quantification and prioritization? - Do they communicate (at least in the abstract or introduction) in ways policymakers and decision-makers can understand, without misleading or oversimplifying?Return STRICT JSON matching the supplied schema.Fill every key in the object `metrics`:{', '.join(METRICS)}Definitions are percentile scores (0 – 100) versus serious work in the field from the last 3 years.For `overall`: • Default = arithmetic mean of the other six midpoints (rounded). • If, in your judgment, another value is better (e.g. one metric is far more decision-relevant), choose it **and explain why** in `overall.rationale`.Field meanings midpoint → best-guess percentile lower_bound → 5th-percentile plausible value upper_bound → 95th-percentile plausible value rationale → ≤100 words; terse but informative.Do **not** wrap the JSON in markdown fences or add extra text.""").strip()# Below is the main helper that runs a single evaluation:# - extracts text with `pdf_to_string()`,# - calls **Chat Completions** with the strict JSON Schema in `response_format`,# - parses the returned JSON string into a Python dict.def evaluate_paper(pdf_path: str| pathlib.Path, model: str= MODEL) ->dict: paper_text = pdf_to_string(pdf_path) chat = client.chat.completions.create( model=model, # "gpt-5" response_format=response_format, # your JSON-Schema defined above messages=[ {"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": paper_text} ],# max_tokens=2500, # optional cap for the JSON block ) raw_json = chat.choices[0].message.contentreturn json.loads(raw_json)# Minimal probe using Chat Completions (works on gpt-5 and o3)# A quick “smoke test” to confirm the client is alive and willing to emit JSON.# This intentionally uses the simple `json_object` format for wide model compatibility.# MODEL = "gpt-5" # "o3" # probe = client.chat.completions.create(# model=MODEL,# messages=[# {"role": "system", "content": "Return only valid JSON: {\"ok\": true}"},# {"role": "user", "content": "Return {\"ok\": true} only."}# ],# # Wide compatibility: ask for a JSON object (no schema)# response_format={"type": "json_object"},# )# print(probe.choices[0].message.content) # should print {"ok": true}```We iterate over `papers/*.pdf`, collect raw results and write a long CSV to `results/metrics_long.csv`.```{python}#| label: eval-many-metrics#| eval: false# Batch-evaluate all PDFs# This loop walks every `*.pdf` in `papers/`, calls `evaluate_paper()`,# pads a **0.8 s** sleep for rate‑limits, and collects the raw JSON dicts in memory# (`records`). We *don’t* write one JSON file per paper here; instead, the nested `records` structure is converted to a tidy long format and# written to `results/metrics_long.csv`.ROOT = pathlib.Path("papers") # put your PDFs hereOUT = pathlib.Path("results")OUT.mkdir(exist_ok=True)pdfs =sorted(ROOT.glob("*.pdf"))records = []for pdf in tqdm(pdfs, desc="Metrics"):try: res = evaluate_paper(pdf) # <-- res["paper"] = pdf.stem records.append(res) time.sleep(0.8) exceptExceptionas e:print(f"⚠️ {pdf.name}: {e}")tidy_rows = []for rec in records: paper_id = rec["paper"]for metric, vals in rec["metrics"].items(): tidy_rows.append({"paper": paper_id,"metric": metric,**vals # midpoint, lower_bound, upper_bound, rationale })tidy = pd.DataFrame(tidy_rows)tidy.to_csv(OUT /"metrics_long.csv", index=False)# tidy.head()p = pathlib.Path("results/metrics_long.csv")ifnot p.exists(): display(Markdown("::: {.callout-warning}\n**No results yet.** Run the batch evaluation to generate `results/metrics_long.csv`.\n:::"))else: tidy = pd.read_csv(p)# Basic counts n_papers = tidy["paper"].nunique() n_rows =len(tidy) n_metrics = tidy["metric"].nunique()# Overall-only slice overall = tidy[tidy["metric"] =="overall"].copy() mean_overall = overall["midpoint"].mean() ifnot overall.empty else np.nan med_overall = overall["midpoint"].median() ifnot overall.empty else np.nan display(Markdown(f"**Batch size:** {n_papers} papers \n"f"**Total metric rows:** {n_rows:,} across {n_metrics} metrics \n"f"**Overall ratings:** mean = {mean_overall:.1f}, median = {med_overall:.1f}" ))```---