title: Bulk extraction audit (laptop, 2026-06-01)
status: handoff

Bulk extraction audit — laptop, 2026-06-01

The laptop bulk run of pipelines/politica/source/llm/poll_extract.py --year 2024 --exclude-states SP --workers 8 finished at ~15:13 UTC, 4h49m after launch. The final stats below are computed from the assembled parquet at pipelines/politica/build/llm/poll_relatorio_2024.parquet.

Companion: sp_slice_analysis.md covers the SP-slice prelim regressions that ran on educloud while this bulk extraction was in flight — those used educloud's earlier SP extraction.

Final counts

Metric	Value
Non-SP PDFs input	9,737
Successful extractions (`ok`)	9,307
Image-only / no text	296
Errors (JSON decode etc.)	9
Cost (gpt-4o-mini)	$11.51
Tokens (in / out)	51.6 M / 6.3 M
Runtime at 8 workers	17,343 s (4h49m)
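
The headline figures imply per-document unit costs that are handy when budgeting a re-run or the SP merge. A minimal sketch with values copied from the table; the variable names are illustrative, and dividing by successful extractions is an assumption, since this note does not say exactly which PDFs were sent to the model.

```python
# Values copied from the "Final counts" table.
pdfs_input = 9_737
ok_extractions = 9_307
cost_usd = 11.51
tokens_in = 51.6e6
tokens_out = 6.3e6
runtime_s = 17_343
workers = 8

# Assumption: cost and tokens are attributed to successful extractions only.
cost_per_ok = cost_usd / ok_extractions
tokens_in_per_ok = tokens_in / ok_extractions
tokens_out_per_ok = tokens_out / ok_extractions

# Throughput over all input PDFs, aggregated across the 8 workers.
pdfs_per_minute = pdfs_input / runtime_s * 60
worker_seconds_per_pdf = runtime_s * workers / pdfs_input

hours, remainder = divmod(runtime_s, 3600)
minutes = remainder // 60

print(f"cost per ok extraction: ${cost_per_ok:.5f}")
print(f"tokens per ok extraction: {tokens_in_per_ok:.0f} in / {tokens_out_per_ok:.0f} out")
print(f"throughput: {pdfs_per_minute:.1f} PDFs/min, {worker_seconds_per_pdf:.1f} worker-seconds per PDF")
print(f"runtime: {hours}h{minutes:02d}m")
```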

Coverage

Slice	Polls in parquet	Out of
Non-SP fresh extractions	8,067	9,737 PDFs → 82.9%
Legacy AC + AL pilot	102	(pre-existing, merged via legacy fallback)
Total	8,169

1,568 PDFs (16%) yielded no parquet rows: 296 image-only, 9 errors, and ~1,156 schema-valid-but-empty extractions (the LLM judged the document unreadable or graphics-only). Per-state coverage ranges from 96% (PB, AC, PI) down to 58% (TO). The lowest-coverage states (TO, SE, AP, PA) likely have a higher share of scanned-image PDFs and are worth an OCR follow-up pass.
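
The bookkeeping behind the no-rows figure can be re-derived from the reported totals. A small sketch, assuming the 1,568 figure is input PDFs minus all protocols in the parquet (fresh plus legacy); the empty-extraction count is approximate in this note, so any residual reflects that approximation or PDFs outside the three named categories.

```python
# Values copied from the "Final counts" and "Coverage" tables and the paragraph above.
pdfs_input = 9_737
fresh_in_parquet = 8_067
legacy_in_parquet = 102
image_only = 296
errors = 9
empty_extractions_approx = 1_156  # given as "~1,156", so approximate

total_in_parquet = fresh_in_parquet + legacy_in_parquet
no_rows = pdfs_input - total_in_parquet

# PDFs explained by the three named categories, and whatever is left over.
named = image_only + errors + empty_extractions_approx
residual = no_rows - named

print("protocols in parquet:", total_in_parquet)
print("PDFs without parquet rows:", no_rows)
print("explained by named categories:", named)
print("residual:", residual)
```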

Per-state coverage (excluding SP, which is extracted on educloud):

    pdfs  parquet_protocols  share
GO  1031   864   83.8%
MG   949   807   85.0%
BA   834   743   89.1%
PE   742   698   94.1%
PR   669   513   76.7%
RN   652   593   90.9%
PI   539   516   95.7%
TO   475   275   57.9%   ← OCR candidate
MA   472   412   87.3%
PA   443   310   69.9%   ← OCR candidate
SE   373   232   62.2%   ← OCR candidate
PB   350   337   96.3%
ES   342   270   78.9%
MS   278   226   81.3%
MT   278   250   89.9%
RJ   258   204   79.1%
SC   246   234   95.1%
CE   211   180   85.3%
RS   205   182   88.8%
AL   148   130   87.8%
AM    95    72   75.8%
RO    48    35   72.9%
AC    48    46   95.8%
RR    26    24   92.3%
AP    25    16   64.0%   ← small
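
The OCR-candidate flags in the table can be reproduced with a simple rule. A sketch under stated assumptions: the 70% share cutoff and the 100-PDF minimum are illustrative thresholds chosen to match the flags above, not values taken from the pipeline.

```python
# (pdfs, parquet_protocols) per state, copied from the table above.
counts = {
    "GO": (1031, 864), "MG": (949, 807), "BA": (834, 743), "PE": (742, 698),
    "PR": (669, 513), "RN": (652, 593), "PI": (539, 516), "TO": (475, 275),
    "MA": (472, 412), "PA": (443, 310), "SE": (373, 232), "PB": (350, 337),
    "ES": (342, 270), "MS": (278, 226), "MT": (278, 250), "RJ": (258, 204),
    "SC": (246, 234), "CE": (211, 180), "RS": (205, 182), "AL": (148, 130),
    "AM": (95, 72), "RO": (48, 35), "AC": (48, 46), "RR": (26, 24),
    "AP": (25, 16),
}

SHARE_CUTOFF = 0.70  # assumed threshold for "low coverage"
MIN_PDFS = 100       # assumed minimum state size for an OCR pass to pay off

shares = {uf: done / pdfs for uf, (pdfs, done) in counts.items()}

# Low-coverage states, worst first.
low = sorted((uf for uf, s in shares.items() if s < SHARE_CUTOFF), key=shares.get)
ocr_candidates = [uf for uf in low if counts[uf][0] >= MIN_PDFS]
too_small = [uf for uf in low if counts[uf][0] < MIN_PDFS]

print("OCR candidates:", ocr_candidates)
print("low coverage but small:", too_small)
```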

Quality

Sub-scenario percentage sums for the headline espontaneo / estimulado scenarios (group by protocol × scenario_type × scenario_label):

Sum bin	Count	Share
0–50 (mostly zero-only)	230	1.5%
50–95	369	2.4%
95–105 (clean)	14,518	93.7%
105–115	168	1.1%
115–200	167	1.1%
> 200	40	0.3%

7,358 of 8,169 protocols (90%) have at least one clean (95–105%) estimulado sub-scenario — the usable headline-analysis sample is comfortable.
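
The binning above can be sketched as follows. The rows are made up for illustration, the column order (protocol, scenario_type, scenario_label, percent) follows the names used in this note, and which bin owns a shared edge (50, 95, 105, ...) is an assumption.

```python
from collections import defaultdict

# Made-up rows: (protocol, scenario_type, scenario_label, percent).
rows = [
    ("P1", "estimulado", "cenario 1", 45.0),
    ("P1", "estimulado", "cenario 1", 30.0),
    ("P1", "estimulado", "cenario 1", 25.0),
    ("P1", "espontaneo", "cenario 1", 20.0),
    ("P1", "espontaneo", "cenario 1", 10.0),
    ("P2", "estimulado", "cenario 1", 60.0),
    ("P2", "estimulado", "cenario 1", 60.0),
]

def sum_bin(total):
    """Map a sub-scenario percentage sum to the bins used in the table."""
    if total < 50:
        return "0-50"
    if total < 95:
        return "50-95"
    if total <= 105:
        return "95-105"
    if total <= 115:
        return "105-115"
    if total <= 200:
        return "115-200"
    return ">200"

# Sum percent within each protocol x scenario_type x scenario_label group.
sums = defaultdict(float)
for protocol, scenario_type, scenario_label, percent in rows:
    sums[(protocol, scenario_type, scenario_label)] += percent

binned = {key: sum_bin(total) for key, total in sums.items()}

# Protocols with at least one clean (95-105) estimulado sub-scenario.
clean_estimulado = {
    protocol
    for (protocol, scenario_type, _), label in binned.items()
    if scenario_type == "estimulado" and label == "95-105"
}
print(binned)
print(sorted(clean_estimulado))
```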

Per-row range: 6 rows have percent outside [0, 100]. Two protocols have a UF mismatch between filename and echoed tse_protocol_display.
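
A per-row QA check for these two conditions might look like the sketch below. The `filename` key and the assumption that both the filename and tse_protocol_display start with a two-letter UF code are hypothetical; the real formats are not specified in this note.

```python
import re

# Assumption: both strings begin with a two-letter UF code,
# e.g. "TO-01234-2024.pdf" and "TO-01234/2024" (hypothetical formats).
UF_PREFIX = re.compile(r"^([A-Z]{2})(?![A-Z])")

def leading_uf(text):
    """Return the leading two-letter UF code, or None if there is none."""
    match = UF_PREFIX.match(text.strip().upper())
    return match.group(1) if match else None

def qa_flags(row):
    """Return a list of QA flags for one parquet row (a dict)."""
    flags = []
    # A NaN percent also fails this comparison and gets flagged.
    if not 0.0 <= row["percent"] <= 100.0:
        flags.append("percent_out_of_range")
    file_uf = leading_uf(row["filename"])
    echoed_uf = leading_uf(row["tse_protocol_display"])
    # Only flag when both codes could be parsed and they disagree.
    if file_uf and echoed_uf and file_uf != echoed_uf:
        flags.append("uf_mismatch")
    return flags

good = {"filename": "TO-01234-2024.pdf", "tse_protocol_display": "TO-01234/2024", "percent": 37.5}
bad = {"filename": "TO-01234-2024.pdf", "tse_protocol_display": "GO-01234/2024", "percent": 137.5}
print(qa_flags(good))
print(qa_flags(bad))
```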

Scenario type distribution (149,934 rows)

Type	Rows	Share
estimulado	47,967	32.0%
espontaneo	42,037	28.0%
rejeicao	32,292	21.5%
avaliacao_governo	13,221	8.8%
votos_validos	6,192	4.1%
segundo_turno_simulacao	6,122	4.1%
outro	1,965	1.3%
invented (not in schema enum)	138	0.1%

The LLM emitted four scenario types beyond the enum: expectativa_vitoria (126), avaliacao_vitoria (6), clima_vitoria (3), percepcao_vitoria (3). Negligible volume but worth folding into outro in a downstream cleanup step or adding to the schema's enum if these are real recurring scenarios in Brazilian poll reports.
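
Folding the invented types into outro could be done with a small normalizer that keeps the original label for later review. A sketch, assuming the schema enum is exactly the seven types listed in the table above; the function name is hypothetical.

```python
from collections import Counter

# Assumption: the schema enum consists of the seven types in the table.
SCHEMA_TYPES = {
    "estimulado", "espontaneo", "rejeicao", "avaliacao_governo",
    "votos_validos", "segundo_turno_simulacao", "outro",
}

def normalize_scenario_type(raw):
    """Return (clean_type, original_label_if_folded_else_None)."""
    if raw in SCHEMA_TYPES:
        return raw, None
    return "outro", raw

# Row counts copied from the table and the paragraph above.
observed = Counter({
    "estimulado": 47_967, "espontaneo": 42_037, "rejeicao": 32_292,
    "avaliacao_governo": 13_221, "votos_validos": 6_192,
    "segundo_turno_simulacao": 6_122, "outro": 1_965,
    "expectativa_vitoria": 126, "avaliacao_vitoria": 6,
    "clima_vitoria": 3, "percepcao_vitoria": 3,
})

folded = Counter()
for scenario_type, n_rows in observed.items():
    clean_type, _ = normalize_scenario_type(scenario_type)
    folded[clean_type] += n_rows

print(folded["outro"], sum(folded.values()))
```

Keeping the original label alongside the folded type leaves the door open to whitelisting a type such as expectativa_vitoria later without re-extracting.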

The rejeicao share (21.5%) matters for design: it gives the symmetric opponent-sponsored test in the within-candidate FE design plenty of statistical heft (does an opponent-sponsored poll report a higher rejeicao for candidate c?). The SP-slice analysis already exploits this — see sp_slice_analysis.md § symmetry test.

Anomalies and follow-up (non-blocking)

Image-only PDFs: 296 plus part of the ~1,156 empty extractions. An OCR pass with pdf2image + tesseract on TO/AP/PA/SE would likely push coverage to ~95%. Small compute cost on educloud, and no LLM cost if a pdftotext-equivalent text layer is used post-OCR.
9 JSON-decode errors: re-extract these specific protocols with --reextract on a tiny protocol list — near-free.
4 invented scenario types: 138 rows total — fold into outro in the next clean step, or whitelist in the schema if Brazilian poll reports use them recurrently.
6 out-of-range percentages and 2 UF mismatches: small enough to flag but not act on; will be visible in any per-poll QA step.
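
For the OCR follow-up, a minimal sketch of the pdf2image + tesseract route. It assumes the pytesseract wrapper, a local poppler and tesseract install with Portuguese language data, and a hypothetical 200-character cutoff for deciding that a PDF has no usable text layer; none of these choices are taken from the pipeline.

```python
def looks_image_only(extracted_text, min_chars=200):
    """Heuristic: treat a PDF as image-only if its text layer is nearly empty.
    The 200-character cutoff is an assumption, not a pipeline setting."""
    return len(extracted_text.strip()) < min_chars

def ocr_pdf(pdf_path, dpi=300, lang="por"):
    """Rasterize each page and run tesseract on it; returns the joined text."""
    # Imported lazily so the heuristic above works without the OCR stack.
    from pdf2image import convert_from_path  # needs poppler installed
    import pytesseract                       # needs the tesseract binary

    pages = convert_from_path(pdf_path, dpi=dpi)
    texts = [pytesseract.image_to_string(page, lang=lang) for page in pages]
    # Form feed between pages, mirroring pdftotext's page separator.
    return "\f".join(texts)

# Usage sketch (paths are hypothetical):
# text = ocr_pdf("some_state/some_protocol.pdf")
# if not looks_image_only(text):
#     ...feed `text` to the same extraction step as pdftotext output...
```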

Cache state and merge with SP

pipelines/politica/build/llm/poll_relatorio/ now has 9,325 cache files on the laptop (8,067 fresh llmkit-format entries from this run + 1,258 from prior runs / smoke tests). All fresh llmkit entries record api_params: {"response_format": "structured_outputs"} per the llmkit upgrade in commit da1cbad. Early smoke-test entries on RR use the legacy json_object mode. The mode is visible in api_params, but a sample audit that does not filter on it would pool both modes without distinguishing them in the analysis.
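
To see how many cache entries fall under each mode, something like the sketch below could be run over the cache directory. It assumes each cache entry is a `*.json` file whose top-level object may carry an `api_params` dict; the on-disk layout is not described in this note, so treat the glob pattern and key handling as hypothetical.

```python
import json
from collections import Counter
from pathlib import Path

def count_response_modes(cache_dir):
    """Count cache entries by api_params.response_format."""
    modes = Counter()
    for path in sorted(Path(cache_dir).glob("*.json")):
        try:
            entry = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            modes["unreadable"] += 1
            continue
        api_params = entry.get("api_params") if isinstance(entry, dict) else None
        if not isinstance(api_params, dict):
            api_params = {}
        # Entries without the key (e.g. older formats) are counted as "missing".
        modes[str(api_params.get("response_format", "missing"))] += 1
    return modes

# Usage sketch:
# print(count_response_modes("pipelines/politica/build/llm/poll_relatorio"))
```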

To merge with educloud's SP extraction (~1,635 SP protocols), rsync or rclone the laptop cache up to bi-dropbox and pull on educloud, then re-run the --validate-cached assemble step. Mixed json_object / SO cache is fine — the assembler is agnostic. Procedure documented in educloud_next_steps.md § 7.

Hand-off

The SP-slice prelim regressions (sp_slice_analysis.md) demonstrate that the design works and give a preliminary point estimate of β = +8.22 on SP alone (n = 15 self-sponsored rows). Merging this bulk-extraction parquet with the SP slice on educloud will scale that analysis to all 26 UFs, so that Spec 3c (race × week FE), which the SP slice could only run on 3 cells, gets a usable sample.
