title: Bulk extraction audit (laptop, 2026-06-01)
status: handoff

Bulk extraction audit — laptop, 2026-06-01

The laptop bulk run of pipelines/politica/source/llm/poll_extract.py --year 2024 --exclude-states SP --workers 8 finished at ~15:13 UTC, 4h49m after launch. Final stats from the assembled parquet at pipelines/politica/build/llm/poll_relatorio_2024.parquet.

Companion: sp_slice_analysis.md covers the SP-slice prelim regressions that ran on educloud while this bulk extraction was in flight — those used educloud's earlier SP extraction.

Final counts

Metric                        Value
Non-SP PDFs input             9,737
Successful extractions (ok)   9,307
Image-only / no text          296
Errors (JSON decode etc.)     9
Cost (gpt-4o-mini)            $11.51
Tokens (in / out)             51.6 M / 6.3 M
Runtime at 8 workers          17,343 s (4h49m)
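As a quick sanity check on the table above, the reported cost can be compared with what the token counts imply, and the run can be expressed per PDF. This is a rough sketch: the per-token prices are an assumption (they are not stated in this audit), and the worker-time figure assumes all 8 workers were busy for the whole run.

```python
# Figures copied from the "Final counts" table.
pdfs = 9_737
tokens_in = 51.6e6
tokens_out = 6.3e6
reported_cost_usd = 11.51
runtime_s = 17_343
workers = 8

# ASSUMPTION: USD per 1M tokens for gpt-4o-mini; not stated in this audit.
PRICE_IN_PER_M = 0.15
PRICE_OUT_PER_M = 0.60

estimated_cost_usd = (
    tokens_in / 1e6 * PRICE_IN_PER_M + tokens_out / 1e6 * PRICE_OUT_PER_M
)
cost_per_pdf_usd = reported_cost_usd / pdfs
wall_s_per_pdf = runtime_s / pdfs            # wall-clock seconds per PDF
worker_s_per_pdf = wall_s_per_pdf * workers  # upper bound: assumes no idle workers

print(f"estimated cost from tokens: ${estimated_cost_usd:.2f}")
print(f"cost per PDF:               ${cost_per_pdf_usd:.5f}")
print(f"wall-clock per PDF:         {wall_s_per_pdf:.2f} s")
print(f"worker time per PDF:        {worker_s_per_pdf:.1f} s")
```

If the estimated and reported costs diverge by more than a few cents, either the assumed prices are wrong or some calls were not logged.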

Coverage

Slice                      Polls in parquet   Out of
Non-SP fresh extractions   8,067              9,737 PDFs → 82.9%
Legacy AC + AL pilot       102                (pre-existing, merged via legacy fallback)
Total                      8,169

1,568 PDFs (16%) yielded no parquet rows: 296 image-only, 9 errors, ~1,156 schema-valid-but-empty extractions (the LLM judged the document unreadable or graphics-only). Per-state coverage ranges from 96% (PB, AC, PI) down to 58% (TO). TO, SE and PA likely have a higher share of scanned-image PDFs and are worth an OCR follow-up pass. AP is also low (64%) but has only 25 PDFs.

Per-state coverage (excluding SP, which is extracted on educloud):

    pdfs  parquet_protocols  share
GO  1031   864   83.8%
MG   949   807   85.0%
BA   834   743   89.1%
PE   742   698   94.1%
PR   669   513   76.7%
RN   652   593   91.0%
PI   539   516   95.7%
TO   475   275   57.9%   ← OCR candidate
MA   472   412   87.3%
PA   443   310   70.0%   ← OCR candidate
SE   373   232   62.2%   ← OCR candidate
PB   350   337   96.3%
ES   342   270   78.9%
MS   278   226   81.3%
MT   278   250   89.9%
RJ   258   204   79.1%
SC   246   234   95.1%
CE   211   180   85.3%
RS   205   182   88.8%
AL   148   130   87.8%
AM    95    72   75.8%
RO    48    35   72.9%
AC    48    46   95.8%
RR    26    24   92.3%
AP    25    16   64.0%   ← small
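The share column and the OCR flags in the table above can be reproduced with a few lines. This is a minimal sketch on a subset of rows; the two cut-offs (`MAX_SHARE`, `MIN_PDFS`) are hypothetical values chosen so that the flags match the table, not thresholds taken from the pipeline.

```python
# (pdfs, parquet_protocols) for a few states, copied from the table above.
counts = {
    "PB": (350, 337),
    "TO": (475, 275),
    "PA": (443, 310),
    "SE": (373, 232),
    "AP": (25, 16),
    "RO": (48, 35),
}

# ASSUMPTION: illustrative cut-offs, not values taken from the pipeline.
MAX_SHARE = 0.75   # flag states below this coverage
MIN_PDFS = 100     # skip states too small for the share to be stable


def coverage(pdfs: int, protocols: int) -> float:
    """Parquet protocols divided by input PDFs for one state."""
    return protocols / pdfs


ocr_candidates = [
    uf
    for uf, (pdfs, protocols) in counts.items()
    if pdfs >= MIN_PDFS and coverage(pdfs, protocols) < MAX_SHARE
]

for uf, (pdfs, protocols) in counts.items():
    flag = "OCR candidate" if uf in ocr_candidates else ""
    print(f"{uf}  {pdfs:5d}  {protocols:5d}  {coverage(pdfs, protocols):6.1%}  {flag}")
```

The minimum-size filter is what keeps AP and RO out of the candidate list: their shares are low, but each rests on fewer than 50 PDFs.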

Quality

Sub-scenario percentage sums for the headline espontaneo / estimulado scenarios (group by protocol × scenario_type × scenario_label):

Sum bin                   Count    Share
0–50 (mostly zero-only)   230      1.5%
50–95                     369      2.4%
95–105 (clean)            14,518   93.7%
105–115                   168      1.1%
115–200                   167      1.1%
> 200                     40       0.3%

7,358 of 8,169 protocols (90%) have at least one clean (95–105%) estimulado sub-scenario — the usable headline-analysis sample is comfortable.
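The sum-bin check described above (group by protocol × scenario_type × scenario_label, sum the percentages, assign a bin) can be sketched with the standard library alone. The rows, protocol identifiers and labels below are made up for illustration, and the treatment of sums that fall exactly on a bin edge is an assumption.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical rows: (protocol, scenario_type, scenario_label, percent).
rows = [
    ("GO-00001/2024", "estimulado", "cenario 1", 40.0),
    ("GO-00001/2024", "estimulado", "cenario 1", 35.5),
    ("GO-00001/2024", "estimulado", "cenario 1", 24.5),
    ("GO-00001/2024", "espontaneo", "", 30.0),
    ("GO-00001/2024", "espontaneo", "", 12.0),
    ("MG-00002/2024", "estimulado", "cenario 1", 60.0),
    ("MG-00002/2024", "estimulado", "cenario 1", 50.0),
    ("MG-00002/2024", "rejeicao", "", 45.0),  # not a headline type, skipped
]

HEADLINE_TYPES = {"espontaneo", "estimulado"}
EDGES = [50, 95, 105, 115, 200]
LABELS = ["0-50", "50-95", "95-105", "105-115", "115-200", ">200"]

# Sum percent within each protocol x scenario_type x scenario_label group.
sums = defaultdict(float)
for protocol, scenario_type, scenario_label, percent in rows:
    if scenario_type in HEADLINE_TYPES:
        sums[(protocol, scenario_type, scenario_label)] += percent


def sum_bin(total: float) -> str:
    # ASSUMPTION: bins are closed on the left, so a sum of exactly 95
    # lands in "95-105" and a sum of exactly 200 lands in ">200".
    return LABELS[bisect_right(EDGES, total)]


bin_counts = Counter(sum_bin(total) for total in sums.values())
for label in LABELS:
    print(f"{label:>8}  {bin_counts[label]}")
```

A protocol then counts as usable for headline analysis if at least one of its estimulado groups lands in the 95-105 bin.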

Per-row range: 6 rows have percent outside [0, 100]. Two protocols have a UF mismatch between filename and echoed tse_protocol_display.
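Both checks are simple row filters. The sketch below shows the idea on made-up records; the file-naming pattern and the assumption that both identifiers begin with the two-letter UF code are hypothetical and would need to be adapted to the real naming scheme.

```python
import re
from typing import Optional

# Hypothetical records; the field values and file naming are illustrative only.
records = [
    {"file": "GO-01234-2024.pdf", "tse_protocol_display": "GO-01234/2024", "percent": 41.0},
    {"file": "MG-00077-2024.pdf", "tse_protocol_display": "MG-00077/2024", "percent": 104.0},
    {"file": "BA-00310-2024.pdf", "tse_protocol_display": "PE-00310/2024", "percent": 12.5},
]

# ASSUMPTION: both identifiers start with the two-letter UF code and a hyphen.
UF_PREFIX = re.compile(r"^([A-Z]{2})-")


def leading_uf(text: str) -> Optional[str]:
    """Return the leading two-letter UF code, or None if the pattern is absent."""
    match = UF_PREFIX.match(text)
    return match.group(1) if match else None


# Rows whose percent falls outside [0, 100].
out_of_range = [r["file"] for r in records if not 0 <= r["percent"] <= 100]

# Rows where the UF in the filename disagrees with the UF echoed by the LLM.
uf_mismatch = [
    r["file"]
    for r in records
    if leading_uf(r["file"]) != leading_uf(r["tse_protocol_display"])
]

print("percent out of range:", out_of_range)
print("UF mismatch:", uf_mismatch)
```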

Scenario type distribution (149,934 rows)

Type                            Rows     Share
estimulado                      47,967   32.0%
espontaneo                      42,037   28.0%
rejeicao                        32,292   21.5%
avaliacao_governo               13,221   8.8%
votos_validos                   6,192    4.1%
segundo_turno_simulacao         6,122    4.1%
outro                           1,965    1.3%
invented (not in schema enum)   138      0.1%

The LLM emitted four scenario types beyond the enum: expectativa_vitoria (126), avaliacao_vitoria (6), clima_vitoria (3), percepcao_vitoria (3). Negligible volume but worth folding into outro in a downstream cleanup step or adding to the schema's enum if these are real recurring scenarios in Brazilian poll reports.
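If the folding route is chosen, the cleanup step could look roughly like the sketch below. The enum values are the ones listed in the distribution table; the function name `normalize_scenario_type` is hypothetical, and the counts are the row counts reported above.

```python
from collections import Counter

# Enum values as listed in the scenario type distribution table.
SCHEMA_ENUM = {
    "estimulado", "espontaneo", "rejeicao", "avaliacao_governo",
    "votos_validos", "segundo_turno_simulacao", "outro",
}


def normalize_scenario_type(value: str) -> str:
    """Map any scenario type outside the enum to 'outro'."""
    return value if value in SCHEMA_ENUM else "outro"


# Row counts for the four out-of-enum types, plus two in-enum types for contrast.
observed = Counter({
    "expectativa_vitoria": 126, "avaliacao_vitoria": 6,
    "clima_vitoria": 3, "percepcao_vitoria": 3,
    "outro": 1_965, "estimulado": 47_967,
})

cleaned = Counter()
for scenario_type, n_rows in observed.items():
    cleaned[normalize_scenario_type(scenario_type)] += n_rows

print(cleaned["outro"])  # 1,965 existing rows plus the 138 folded in
```

Keeping the raw value in a separate column before folding would preserve the option of promoting expectativa_vitoria to the enum later.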

The rejeicao share (21.5%) matters for design: it gives the symmetric opponent-sponsored test in the within-candidate FE design plenty of statistical heft (does an opponent-sponsored poll report a higher rejeicao for candidate c?). The SP-slice analysis already exploits this — see sp_slice_analysis.md § symmetry test.

Anomalies and follow-up (non-blocking)

Cache state and merge with SP

pipelines/politica/build/llm/poll_relatorio/ now has 9,325 cache files on the laptop (8,067 fresh llmkit-format entries from this run

To merge with educloud's SP extraction (~1,635 SP protocols), rsync or rclone the laptop cache up to bi-dropbox and pull on educloud, then re-run the --validate-cached assemble step. Mixed json_object / SO cache is fine — the assembler is agnostic. Procedure documented in educloud_next_steps.md § 7.

Hand-off

The SP-slice prelim regressions (sp_slice_analysis.md) demonstrate that the design works and give a preliminary point estimate of β = +8.22 on SP alone (n = 15 self-sponsored rows). Merging this bulk-extraction parquet with the SP slice on educloud will scale that analysis to all 26 UFs, so Spec 3c (race × week FE), which the SP slice could run on only 3 cells, gets a usable sample.