Appendix A — Did Authors Respond to Unjournal Evaluations?

Author: The Unjournal

Published: November 7, 2025

Overview

This chapter analyzes whether research papers were substantively revised in response to Unjournal evaluations.

Research Questions

  1. Did authors update their papers after receiving Unjournal evaluations?
  2. What were the major changes between the evaluated version and later versions?
  3. Do the changes reflect evaluator suggestions?

Methodology

We use a multi-step approach:

  1. Identify paper pairs: Find papers that exist in both “before” (at time of evaluation) and “after” (latest version) states
  2. Match evaluations: Extract paper titles from evaluation markdown files and match against metadata to identify which papers received evaluations
  3. Extract and compare: Use PDF text extraction to identify line-level changes between versions (a minimal sketch of this step follows the list)
  4. LLM analysis (optional): Use GPT-4 to:
    • Identify major substantive changes
    • Extract key suggestions from evaluations
    • Assess whether changes align with evaluator feedback
  5. Present evidence: Show specific examples of likely evaluation-driven changes

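As a rough illustration of step 3, extract-and-compare can be done with a PDF text extractor and a line-level diff. The sketch below assumes the pypdf package and a unified diff; the actual pipeline scripts may use a different extractor or diff granularity.

import difflib
from pathlib import Path

from pypdf import PdfReader  # assumed extractor; the pipeline may use another

def extract_lines(pdf_path: Path) -> list[str]:
    """Extract text from a PDF and split it into stripped, non-empty lines."""
    reader = PdfReader(str(pdf_path))
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    return [line.strip() for line in text.splitlines() if line.strip()]

def count_line_changes(before_pdf: Path, after_pdf: Path) -> dict:
    """Count added and deleted lines between two versions of a paper."""
    before, after = extract_lines(before_pdf), extract_lines(after_pdf)
    additions = deletions = 0
    for line in difflib.unified_diff(before, after, lineterm=""):
        if line.startswith("+") and not line.startswith("+++"):
            additions += 1
        elif line.startswith("-") and not line.startswith("---"):
            deletions += 1
    return {"additions_count": additions,
            "deletions_count": deletions,
            "total_changes": additions + deletions}

Counting the "+" and "-" lines of the unified diff yields the additions, deletions, and total-change figures reported in the tables below.
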
Data Sources

  • Before versions: Papers in papers/ and more papers/ folders
  • After versions: Papers in latest_papers_post_UJ/ folder
  • Evaluations: Markdown files in unjournal_evaluations/
  • Metadata: Coda table with publication dates and DOI deposit dates
Limitation

The Coda API doesn’t expose DOI deposit dates from the public table. For this pilot analysis, we focus on papers we can confidently match between folders based on author names and titles.
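
For reference, a matching heuristic along these lines can be sketched with only the standard library. The folder names come from the Data Sources list above; the filename normalization and similarity cutoff are illustrative assumptions, not the pipeline's actual rules.

import difflib
import re
from pathlib import Path

def normalize(name: str) -> str:
    """Lowercase a filename stem and collapse non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def match_paper_pairs(before_dir: str, after_dir: str, cutoff: float = 0.6) -> dict:
    """Pair each "before" PDF with the closest-named "after" PDF, if any."""
    before = {normalize(p.stem): p for p in Path(before_dir).glob("*.pdf")}
    after = {normalize(p.stem): p for p in Path(after_dir).glob("*.pdf")}
    pairs = {}
    for key, before_path in before.items():
        hit = difflib.get_close_matches(key, list(after), n=1, cutoff=cutoff)
        if hit:
            pairs[before_path.name] = after[hit[0]].name
    return pairs

# Example call (a real run would also scan the "more papers/" folder):
# match_paper_pairs("papers", "latest_papers_post_UJ")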

Phase 1 Results: Paper Matching and Change Detection

import pandas as pd
import json
from pathlib import Path

# Load results from Phase 1 analysis
with open('paper_change_analysis/change_analysis_results.json', 'r') as f:
    results = json.load(f)

df_all = pd.DataFrame(results)

# Summary statistics
print(f"Total paper pairs analyzed: {len(df_all)}")
print(f"Papers with evaluations: {df_all['has_evaluation'].sum()}")
print(f"Papers with changes (>50 lines): {(df_all['total_changes'] > 50).sum()}")
print(f"Papers with BOTH evaluations AND changes: {((df_all['has_evaluation']) & (df_all['total_changes'] > 50)).sum()}")
Total paper pairs analyzed: 34
Papers with evaluations: 20
Papers with changes (>50 lines): 12
Papers with BOTH evaluations AND changes: 6
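
The columns used in this chapter suggest that each record in change_analysis_results.json carries at least the fields below. This shape is inferred from the code, not a documented schema, and the values shown are placeholders rather than real output.

# Illustrative record shape (field names inferred from the columns used above
# and below; values are placeholders).
example_record = {
    "paper_id": "AuthorAAuthorB_Short Title...",   # concatenated authors plus title
    "total_changes": 0,             # lines added plus lines deleted between versions
    "additions_count": 0,
    "deletions_count": 0,
    "text_length_change_pct": 0.0,  # percent change in extracted text length
    "has_evaluation": False,        # True if a matching evaluation file was found
    "evaluation_files": [],         # matched files in unjournal_evaluations/
}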

Papers with Evaluations and Substantial Changes

These are the most interesting cases for LLM analysis:

# Filter to papers with both evaluations and changes
candidates = df_all[
    (df_all['has_evaluation'] == True) &
    (df_all['total_changes'] > 50)
].copy()

# Display key metrics
candidates_display = candidates[[
    'paper_id', 'total_changes', 'additions_count', 'deletions_count',
    'text_length_change_pct', 'evaluation_files'
]].copy()

candidates_display['evaluation_count'] = candidates_display['evaluation_files'].apply(len)
candidates_display = candidates_display.drop('evaluation_files', axis=1)

candidates_display.columns = [
    'Paper', 'Total Line Changes', 'Lines Added', 'Lines Deleted',
    'Text Change %', 'Evaluation Count'
]

candidates_display
Paper Total Line Changes Lines Added Lines Deleted Text Change % Evaluation Count
7 Mark BuntaineMichael GreenstoneGuojun HeMengdi... 110 39 71 -0.829916 2
8 Vivi AlatasArun G ChandrasekharMarkus MobiusBe... 2403 931 1472 -36.556987 2
16 Verónica Salazar RestrepoGabriel Leite Mariant... 223 106 117 -0.479753 1
19 Augustin BergeronJean-Paul CarvalhoJoseph Henr... 3207 1678 1529 -1.438094 2
25 Robert W HahnNathaniel HendrenRobert D Metcalf... 1291 649 642 0.266932 2
32 Daron AcemogluCevat Giray AksoyCeren BaysanCar... 1660 849 811 3.863107 2

All Papers with Substantial Changes (With or Without Evaluations)

Some papers show substantial changes even without matched evaluations; the table below lists all pairs with more than 200 line changes:

# Papers with significant changes
changed_papers = df_all[df_all['total_changes'] > 200].copy()

changed_display = changed_papers[[
    'paper_id', 'total_changes', 'text_length_change_pct', 'has_evaluation'
]].sort_values('total_changes', ascending=False).copy()

changed_display.columns = ['Paper', 'Total Line Changes', 'Text Change %', 'Has Evaluation']

print(f"\nPapers with >200 line changes: {len(changed_display)}")
changed_display

Papers with >200 line changes: 11
Paper Total Line Changes Text Change % Has Evaluation
11 Abhijit BanerjeeMichael FayeAlan KruegerPaul N... 4489 -13.005971 False
19 Augustin BergeronJean-Paul CarvalhoJoseph Henr... 3207 -1.438094 True
21 Adrien BilalDiego R Känzig_The Macroeconomic I... 2614 2.021894 False
8 Vivi AlatasArun G ChandrasekharMarkus MobiusBe... 2403 -36.556987 True
9 B Kelsey JackSeema JayachandranNamrata KalaRoh... 2382 -41.188777 False
12 Bhargav BhatJonathan de QuidtJohannes Haushofe... 1730 2.253546 False
32 Daron AcemogluCevat Giray AksoyCeren BaysanCar... 1660 3.863107 True
26 Richard T CarsonJoshua S Graff ZivinJeffrey G ... 1401 30.152721 False
25 Robert W HahnNathaniel HendrenRobert D Metcalf... 1291 0.266932 True
2 Michael KremerJonathan D LevinChristopher M Sn... 1163 128.405166 False
16 Verónica Salazar RestrepoGabriel Leite Mariant... 223 -0.479753 True

Summary

From 34 confirmed paper pairs:

  • 20 papers had matched Unjournal evaluations (using title-based matching from evaluation markdown files)
  • 6 papers showed both evaluations AND substantial changes (>50 line changes)
  • 12 papers showed substantial revisions (>50 line changes) overall, some with matched evaluations and some without

The six papers with both evaluations and substantial changes represent promising cases for further analysis to assess whether changes reflected evaluator feedback.

Next Steps

  1. Extract evaluation suggestions: Parse evaluation markdown files for specific suggestions (a rough keyword-based sketch follows this list)
  2. LLM analysis: Run GPT-4 analysis on the 6 candidate papers to assess attribution
  3. Generate case studies: Present detailed examples with evidence

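For step 1, a first pass could be as simple as pulling sentences that contain suggestion-like cue words. The sketch below is a naive keyword heuristic, not the planned extraction logic; the cue list and sentence splitting are illustrative.

import re
from pathlib import Path

# Words that often signal an evaluator suggestion; purely a heuristic.
SUGGESTION_CUES = re.compile(
    r"\b(suggest|recommend|should|could improve|would benefit)\b", re.IGNORECASE
)

def extract_suggestions(eval_path: Path) -> list[str]:
    """Return sentences from an evaluation file that contain suggestion cues."""
    text = eval_path.read_text(encoding="utf-8")
    # Crude sentence split; good enough for a first pass over markdown prose.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s.strip() for s in sentences if SUGGESTION_CUES.search(s)]

# Example over all evaluation files (folder name from the Data Sources list):
# for f in Path("unjournal_evaluations").glob("*.md"):
#     print(f.name, len(extract_suggestions(f)))
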
Placeholder for Results

Results from the LLM analysis will appear here once the analysis pipeline is complete.

Example Structure

For each paper pair, we’ll show:

  1. Paper metadata (authors, title, dates)
  2. Major changes detected (sections added/removed, methods changed, etc.)
  3. Evaluator suggestions (extracted from evaluation files)
  4. Attribution assessment (likely/unlikely influenced by evaluation)
  5. Evidence and quotes (specific textual evidence)
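
One way to hold these five elements for each paper pair is a small record type. The field names below are ours, chosen to mirror the list above; the pipeline may store case studies differently.

from dataclasses import dataclass, field

@dataclass
class CaseStudy:
    metadata: dict                    # authors, title, dates
    major_changes: list[str]          # sections added/removed, methods changed, etc.
    evaluator_suggestions: list[str]  # extracted from the evaluation files
    attribution: str                  # e.g. "likely" or "unlikely" influenced by evaluation
    evidence: list[str] = field(default_factory=list)  # supporting quotes and textual evidence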

Phase 2: LLM Analysis (To Be Run)

The next step is to run GPT-4 analysis on the six papers with both evaluations and substantial changes (a sketch of such a call follows the list below). This will:

  1. Identify major substantive changes in each paper
  2. Extract specific suggestions from evaluations
  3. Assess whether changes likely reflect evaluator feedback
  4. Provide evidence and quotes supporting the assessment

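As a sketch of what such a call might look like, the snippet below uses the openai Python client with one prompt per paper. The prompt wording, model name, and function are illustrative assumptions; llm_change_attribution.py may structure its calls differently.

from openai import OpenAI  # assumes an OpenAI API key is configured in the environment

client = OpenAI()

PROMPT = """You are assessing whether revisions to a paper respond to peer evaluations.

Evaluator suggestions:
{suggestions}

Lines added or removed between versions (excerpt):
{diff_excerpt}

For each suggestion, state whether the changes likely address it, and quote the
added text that supports your assessment."""

def assess_attribution(suggestions: str, diff_excerpt: str, model: str = "gpt-4") -> str:
    """Run one illustrative attribution query; the real script's prompt will differ."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(suggestions=suggestions,
                                            diff_excerpt=diff_excerpt)}],
    )
    return response.choices[0].message.content
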
Estimated cost: $1.50-6.00 (depending on paper/evaluation length)
Estimated time: 10-30 minutes

To run the LLM analysis:

conda activate qpy311_arm
python paper_change_analysis/scripts/llm_change_attribution.py

Results will be saved to paper_change_analysis/llm_analysis/ and can be incorporated into this document.

Running the Full Pipeline

To reproduce this analysis:

# Phase 1: Identify paper pairs and extract text (completed)
conda activate qpy311_arm
python paper_change_analysis/scripts/analyze_paper_changes.py

# Phase 2: LLM analysis of changes and attribution (optional, requires OpenAI API key)
python paper_change_analysis/scripts/llm_change_attribution.py

# Phase 3: Render this document with results
quarto render paper_response_analysis.qmd