title: Bulk extraction audit (laptop, 2026-06-01)
status: handoff
Bulk extraction audit — laptop, 2026-06-01
The laptop bulk run of pipelines/politica/source/llm/poll_extract.py --year 2024 --exclude-states SP --workers 8 finished at ~15:13 UTC,
4h49m after launch. The final stats below come from the assembled parquet at
pipelines/politica/build/llm/poll_relatorio_2024.parquet.
Companion: sp_slice_analysis.md covers the SP-slice prelim
regressions that ran on educloud while this bulk extraction was
in flight — those used educloud's earlier SP extraction.
Final counts
| Metric | Value |
|---|---|
| Non-SP PDFs input | 9,737 |
| Successful extractions (ok) | 9,307 |
| Image-only / no text | 296 |
| Errors (JSON decode etc.) | 9 |
| Cost (gpt-4o-mini) | $11.51 |
| Tokens (in / out) | 51.6 M / 6.3 M |
| Runtime at 8 workers | 17,343 s (4h49m) |
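As a sanity check, the cost and runtime rows can be reproduced from the other figures in the table. The sketch below assumes gpt-4o-mini list prices of $0.15 per million input tokens and $0.60 per million output tokens; these prices are not stated in this note, so adjust them if the billed rate differed.

```python
# Back-of-envelope check of the run totals reported in the table.
# ASSUMPTION: the per-million-token prices are not taken from this note.
PRICE_IN_PER_M = 0.15   # USD per 1M input tokens (assumed)
PRICE_OUT_PER_M = 0.60  # USD per 1M output tokens (assumed)

TOKENS_IN_M = 51.6      # from the table
TOKENS_OUT_M = 6.3      # from the table
N_PDFS = 9_737
RUNTIME_S = 17_343
REPORTED_COST = 11.51


def estimated_cost(tokens_in_m, tokens_out_m):
    """Cost in USD from token volumes given in millions."""
    return tokens_in_m * PRICE_IN_PER_M + tokens_out_m * PRICE_OUT_PER_M


def format_runtime(seconds):
    """Render seconds as e.g. '4h49m' (seconds are truncated)."""
    hours, rem = divmod(seconds, 3600)
    return f"{hours}h{rem // 60:02d}m"


cost = estimated_cost(TOKENS_IN_M, TOKENS_OUT_M)
print(f"estimated cost: ${cost:.2f} (reported ${REPORTED_COST:.2f})")
print(f"cost per PDF:   ${REPORTED_COST / N_PDFS:.5f}")
print(f"throughput:     {N_PDFS / RUNTIME_S:.2f} PDFs/s at 8 workers")
print(f"runtime:        {format_runtime(RUNTIME_S)}")
```

Under those assumed prices the estimate lands within a cent or so of the reported figure, which suggests the token counts and the cost row are mutually consistent.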
Coverage
| Slice | Polls in parquet | Out of |
|---|---|---|
| Non-SP fresh extractions | 8,067 | 9,737 PDFs → 82.9% |
| Legacy AC + AL pilot | 102 | (pre-existing, merged via legacy fallback) |
| Total | 8,169 | |
1,568 PDFs (16%) yielded no parquet rows: 296 image-only, 9 errors, ~1,156 schema-valid-but-empty extractions (LLM judged the document unreadable / graphics-only). Per-state coverage ranges from 96% (PB, AC, PI) down to 58% (TO). The lowest-coverage states (TO, SE, AP, PA) likely have a higher share of scanned-image PDFs and are worth an OCR follow-up pass.
Per-state coverage (excluding SP, which is extracted on educloud):
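| UF | pdfs | parquet_protocols | share | Note |
|---|---|---|---|---|
| GO | 1031 | 864 | 83.8% | |
| MG | 949 | 807 | 85.0% | |
| BA | 834 | 743 | 89.1% | |
| PE | 742 | 698 | 94.1% | |
| PR | 669 | 513 | 76.7% | |
| RN | 652 | 593 | 90.9% | |
| PI | 539 | 516 | 95.7% | |
| TO | 475 | 275 | 57.9% | OCR candidate |
| MA | 472 | 412 | 87.3% | |
| PA | 443 | 310 | 69.9% | OCR candidate |
| SE | 373 | 232 | 62.2% | OCR candidate |
| PB | 350 | 337 | 96.3% | |
| ES | 342 | 270 | 78.9% | |
| MS | 278 | 226 | 81.3% | |
| MT | 278 | 250 | 89.9% | |
| RJ | 258 | 204 | 79.1% | |
| SC | 246 | 234 | 95.1% | |
| CE | 211 | 180 | 85.3% | |
| RS | 205 | 182 | 88.8% | |
| AL | 148 | 130 | 87.8% | |
| AM | 95 | 72 | 75.8% | |
| RO | 48 | 35 | 72.9% | |
| AC | 48 | 46 | 95.8% | |
| RR | 26 | 24 | 92.3% | |
| AP | 25 | 16 | 64.0% | small |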
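The share column and the low-coverage flags can be recomputed from the two count columns. A minimal sketch on a handful of rows from the table; the 75% cutoff and the function name are illustrative assumptions, not part of the pipeline.

```python
# Recompute per-state coverage and list low-coverage states.
# ASSUMPTION: the 75% threshold and helper name are illustrative only.
COUNTS = {  # UF: (pdfs, parquet_protocols), a subset of the table
    "PB": (350, 337),
    "GO": (1031, 864),
    "PA": (443, 310),
    "AP": (25, 16),
    "SE": (373, 232),
    "TO": (475, 275),
}


def low_coverage(counts, threshold=0.75):
    """Return (uf, share) pairs below `threshold`, lowest share first."""
    shares = {uf: ok / pdfs for uf, (pdfs, ok) in counts.items()}
    flagged = [(uf, s) for uf, s in shares.items() if s < threshold]
    return sorted(flagged, key=lambda pair: pair[1])


for uf, share in low_coverage(COUNTS):
    print(f"{uf}: {share:.0%}")
```

With small denominators such as AP (25 PDFs), a single document moves the share by four points, so the flag is less informative there than for TO or SE.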
Quality
Sub-scenario percentage sums for the headline espontaneo /
estimulado scenarios (group by protocol × scenario_type × scenario_label):
| Sum bin | Count | Share |
|---|---|---|
| 0–50 (mostly zero-only) | 230 | 1.5% |
| 50–95 | 369 | 2.4% |
| 95–105 (clean) | 14,518 | 93.7% |
| 105–115 | 168 | 1.1% |
| 115–200 | 167 | 1.1% |
| > 200 | 40 | 0.3% |
7,358 of 8,169 protocols (90%) have at least one clean (95–105%) estimulado sub-scenario — the usable headline-analysis sample is comfortable.
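The binning above amounts to summing percent within each protocol × scenario_type × scenario_label group and bucketing the total. A minimal sketch on toy rows: the row layout (dicts with a percent field), the toy protocol ids, and the boundary convention (lower edge inclusive, upper edge exclusive) are assumptions, since this note does not say how boundary values were assigned.

```python
from collections import Counter, defaultdict

HEADLINE_TYPES = {"espontaneo", "estimulado"}

# Bin edges as in the table; the half-open convention is an ASSUMPTION.
BINS = [(0, 50, "0-50"), (50, 95, "50-95"), (95, 105, "95-105 (clean)"),
        (105, 115, "105-115"), (115, 200, "115-200")]


def bin_label(total):
    """Map a percentage sum to its bin label."""
    if total < 0:
        return "negative"  # only reachable with out-of-range rows
    for lo, hi, label in BINS:
        if lo <= total < hi:
            return label
    return "> 200"


def sum_bins(rows):
    """Count headline sub-scenarios per sum bin."""
    totals = defaultdict(float)
    for r in rows:
        if r["scenario_type"] not in HEADLINE_TYPES:
            continue
        key = (r["protocol"], r["scenario_type"], r["scenario_label"])
        totals[key] += r["percent"]
    return Counter(bin_label(t) for t in totals.values())


# Toy data with hypothetical protocol ids.
rows = [
    {"protocol": "P1", "scenario_type": "estimulado", "scenario_label": "A", "percent": 45.0},
    {"protocol": "P1", "scenario_type": "estimulado", "scenario_label": "A", "percent": 35.0},
    {"protocol": "P1", "scenario_type": "estimulado", "scenario_label": "A", "percent": 20.0},
    {"protocol": "P2", "scenario_type": "espontaneo", "scenario_label": "", "percent": 30.0},
    {"protocol": "P2", "scenario_type": "espontaneo", "scenario_label": "", "percent": 25.0},
    {"protocol": "P2", "scenario_type": "rejeicao", "scenario_label": "", "percent": 60.0},
]
print(sum_bins(rows))
```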
Per-row range: 6 rows have percent outside [0, 100]. Two protocols
have a UF mismatch between filename and echoed tse_protocol_display.
Scenario type distribution (149,934 rows)
| Type | Rows | Share |
|---|---|---|
| estimulado | 47,967 | 32.0% |
| espontaneo | 42,037 | 28.0% |
| rejeicao | 32,292 | 21.5% |
| avaliacao_governo | 13,221 | 8.8% |
| votos_validos | 6,192 | 4.1% |
| segundo_turno_simulacao | 6,122 | 4.1% |
| outro | 1,965 | 1.3% |
| invented (not in schema enum) | 138 | 0.1% |
The LLM emitted four scenario types beyond the enum:
expectativa_vitoria (126), avaliacao_vitoria (6),
clima_vitoria (3), percepcao_vitoria (3). Negligible volume but
worth folding into outro in a downstream cleanup step or adding to
the schema's enum if these are real recurring scenarios in Brazilian
poll reports.
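Folding the out-of-enum types into outro is a small mapping in a cleanup step. A minimal sketch: the enum is assumed to be the seven schema types that appear in the distribution table (the real schema may list more), and the function name and the choice to return the original label alongside the recoded one are illustrative.

```python
# ASSUMPTION: enum members are the seven types shown in the table above.
SCHEMA_TYPES = {
    "estimulado", "espontaneo", "rejeicao", "avaliacao_governo",
    "votos_validos", "segundo_turno_simulacao", "outro",
}


def normalize_scenario_type(raw):
    """Return (schema_type, original_label_if_invented).

    Out-of-enum values are folded into "outro"; the raw label is
    returned as well so the recoding stays auditable.
    """
    if raw in SCHEMA_TYPES:
        return raw, None
    return "outro", raw


for raw in ["estimulado", "expectativa_vitoria", "clima_vitoria"]:
    print(raw, "->", normalize_scenario_type(raw))
```

Keeping the raw label makes it cheap to reverse the decision later if expectativa_vitoria turns out to deserve its own enum value.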
The rejeicao share (21.5%) matters for design: it gives the
symmetric opponent-sponsored test in the within-candidate FE
design plenty of statistical heft (does an opponent-sponsored poll
report a higher rejeicao for candidate c?). The SP-slice analysis
already exploits this — see sp_slice_analysis.md § symmetry test.
Anomalies and follow-up (non-blocking)
- Image-only PDFs: 296 + part of the 1,156 empty extractions. OCR pass with pdf2image + tesseract on TO/AP/PA/SE would likely push coverage to ~95%. Small compute cost on educloud, no LLM cost if pdftotext-equivalent layer is used post-OCR.
- 9 JSON-decode errors: re-extract these specific protocols with --reextract on a tiny protocol list; near-free.
- 4 invented scenario types: 138 rows total. Fold into outro in the next clean step, or whitelist in the schema if Brazilian poll reports use them recurrently.
- 6 out-of-range percentages and 2 UF mismatches: small enough to flag but not act on; will be visible in any per-poll QA step.
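The OCR follow-up in the first item could look roughly like the sketch below. It rests on several assumptions: it uses pdf2image's convert_from_path together with pytesseract's image_to_string (pytesseract is one possible Python wrapper; this note only names tesseract), it assumes Poppler, Tesseract and the Portuguese language data are installed, and the 300 dpi setting and the min_chars cutoff are untested placeholders.

```python
def needs_ocr(text_layer, min_chars=200):
    """Heuristic: treat a PDF as image-only if its text layer is near-empty.

    ASSUMPTION: the 200-character cutoff is a placeholder, not the
    pipeline's actual image-only rule.
    """
    return len(text_layer.strip()) < min_chars


def ocr_pdf(pdf_path, lang="por", dpi=300):
    """Rasterize each page and OCR it; returns the text of all pages."""
    # Imported lazily so needs_ocr works without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n\n".join(
        pytesseract.image_to_string(page, lang=lang) for page in pages
    )
```

Only PDFs for which needs_ocr returns true would go through ocr_pdf, which keeps the extra compute limited to the scanned documents.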
Cache state and merge with SP
pipelines/politica/build/llm/poll_relatorio/ now has 9,325 cache
files on the laptop (8,067 fresh llmkit-format entries from this run
+ 1,258 from prior runs / smoke tests). All fresh llmkit entries
record api_params: {"response_format": "structured_outputs"} per the
llmkit upgrade in commit da1cbad. Early smoke-test entries on RR use
the legacy json_object mode. The mode is visible in api_params; a
sample audit would treat the cache as mixed-mode without
distinguishing the two in the analysis.
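A quick way to see the json_object / structured_outputs split is to tally response_format across the cache directory. A minimal sketch, assuming each cache entry is a JSON file with a top-level api_params object; the actual llmkit cache layout is not described in this note, so the file pattern and key path are assumptions.

```python
import json
from collections import Counter
from pathlib import Path


def tally_response_formats(cache_dir):
    """Count cache entries by api_params.response_format.

    ASSUMPTION: one *.json file per entry with a top-level "api_params"
    dict. Entries lacking it count as "unknown"; files that fail to
    parse count as "unreadable".
    """
    counts = Counter()
    for path in sorted(Path(cache_dir).glob("*.json")):
        try:
            entry = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            counts["unreadable"] += 1
            continue
        params = entry.get("api_params") if isinstance(entry, dict) else None
        if not isinstance(params, dict):
            params = {}
        counts[params.get("response_format", "unknown")] += 1
    return counts
```

Run as tally_response_formats("pipelines/politica/build/llm/poll_relatorio"); if the layout assumption holds, the structured_outputs count should be close to the 8,067 fresh entries.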
To merge with educloud's SP extraction (~1,635 SP protocols), rsync
or rclone the laptop cache up to bi-dropbox and pull on educloud, then
re-run the --validate-cached assemble step. Mixed json_object / SO
cache is fine — the assembler is agnostic. Procedure documented in
educloud_next_steps.md § 7.
Hand-off
The SP-slice prelim regressions (sp_slice_analysis.md) demonstrate
the design works and give a first point estimate of
β = +8.22 on SP alone (n = 15 self-sponsored rows). Merging this
bulk-extraction parquet with the SP slice on educloud will scale that
analysis to all 26 UFs. In particular, Spec 3c (race × week FE), which
the SP slice could only run on 3 cells, gets a usable sample.