title: SP-slice analysis (prototype)
status: prototype (2026-06-01)
SP-slice analysis — poll-sponsor-bias
(Originally written as build_audit.md on educloud; renamed to disambiguate
from the laptop bulk-extraction quality audit, which is now build_audit.md.)
Prototype run of the within-candidate FE design on the SP slice only (SP is the UF whose LLM relatório extraction was done on educloud ahead of the laptop bulk run for the other 25 UFs). Code-pipeline sanity check, not the headline.
Data assembly
2026-06-14 update. The SP-specific assembly script was retired
when the national cand_poll.parquet build matured (Routes A+B+C+D
plus bulk LLM extraction). The SP slice is now produced as a filter on the national table:
import pandas as pd
# Keep only the SP rows of the national candidate-poll table.
df = pd.read_parquet("build/assemble/cand_poll.parquet")
df = df[df["uf"] == "SP"]
This is what source/analysis/sp_regressions.py does. Row counts
shift slightly (~5.3k SP candidate-poll rows under the national
pipeline vs ~4.8k under the dropped SP-specific within-muni name
matcher); the coefficients keep their sign and remain
strongly positive and significant.
The numbers below are the pre-retirement counts (from the original SP-only assembly, kept for context):
- 8,684 SP estimulado scenario rows from 1,203 polls
- 1,964 SP mayoral protocols in the TSE registry
- 2,053 SP 2024 PREFEITO 1st-round candidates across 645 munis
- 49 SP sponsor→candidate links (via Routes A+B in poll_sponsor.py)
- 1,423 aggregate rows (Branco/Nulo/Não sabe) dropped
- 4,847 candidate-poll rows after fuzzy-name match (score ≥ 1)
- 3,499 rows in the regression sample (match_score ≥ 2)
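
The two match-score thresholds above drive the row counts, so it helps to see what a score could mean. The source does not define the scoring rule; the sketch below assumes the score is the number of name tokens shared between the poll's candidate string and the registry name (consistent with score-1 rows being single-token matches). `name_tokens`, `match_score`, and the particle list are hypothetical names, not the pipeline's actual matcher.

```python
import unicodedata

# Hypothetical stopword list: connective particles common in Brazilian names.
PARTICLES = {"de", "da", "do", "das", "dos", "e"}

def name_tokens(name: str) -> set[str]:
    """Lowercase, strip accents, and drop connective particles."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return {tok for tok in ascii_only.lower().split() if tok not in PARTICLES}

def match_score(poll_name: str, registry_name: str) -> int:
    """ASSUMED scoring rule: number of name tokens the two strings share."""
    return len(name_tokens(poll_name) & name_tokens(registry_name))

# Illustrative, made-up names (not real candidates).
two_token = match_score("João da Silva", "JOAO PEREIRA DA SILVA")  # shares joao, silva
one_token = match_score("Silva", "MARIA SILVA COSTA")              # shares silva only
print(two_token, one_token)
```

Under this assumed rule, a row like the second example would survive the score ≥ 1 pass but be dropped from the regression sample by the score ≥ 2 filter.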
Per-race coverage:
- 341 races with any data
- 22 races with ≥1 self-sponsored row
- 20 races with both self & opp sponsored polls
- 172 races with ≥2 polls
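
The per-race coverage counts can be reproduced with a simple group-by over candidate-poll rows. This is a minimal sketch on made-up rows; the tuple layout and the `"self"` / `"opp"` / `"none"` labels are assumptions for illustration, not the column names used in cand_poll.parquet.

```python
from collections import defaultdict

# Toy rows (made up): (race_id, poll_id, sponsor_class).
rows = [
    ("race_A", "p1", "self"),
    ("race_A", "p2", "opp"),
    ("race_A", "p3", "none"),
    ("race_B", "p4", "none"),
    ("race_C", "p5", "self"),
    ("race_C", "p6", "none"),
]

polls_by_race = defaultdict(set)
classes_by_race = defaultdict(set)
for race, poll, sponsor_class in rows:
    polls_by_race[race].add(poll)
    classes_by_race[race].add(sponsor_class)

coverage = {
    "any_data": len(polls_by_race),
    "has_self": sum("self" in c for c in classes_by_race.values()),
    "has_self_and_opp": sum({"self", "opp"} <= c for c in classes_by_race.values()),
    "two_plus_polls": sum(len(p) >= 2 for p in polls_by_race.values()),
}
print(coverage)
```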
Preliminary regressions
source/analysis/sp_regressions.py → build/table/sp_regressions.csv. All
SEs are cluster-robust at the race (muni) level.
| Spec | β (sponsored_by) | SE | p | β (opp_sponsored) | SE | p |
|---|---|---|---|---|---|---|
| naive (no FE) | +2.17 | 2.49 | 0.38 | -0.99 | 1.68 | 0.55 |
| Spec 1 (pollster + candidate FE) | +7.64 | 2.56 | 0.003 | -3.11 | 2.79 | 0.27 |
| Spec 2 (+ structured methodology) | +7.24 | 2.46 | 0.003 | -3.14 | 2.63 | 0.23 |
| Spec 2 WLS (weighted by N) | +8.06 | 2.60 | 0.002 | -2.84 | 2.28 | 0.21 |
| Spec 3a (clean comparator + candidate FE) | +8.22 | 3.17 | 0.010 | — | | |
| Spec 3b (clean + race × month FE) | +8.02 | 5.43 | 0.140 | — | | |
| Spec 3c (clean + race × week FE, strict) | +15.73 | 3.41 | <0.001 | — | | |
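
To make the mechanics behind the table concrete, here is a minimal pure-Python sketch of a within-candidate FE estimate with race-clustered standard errors, next to the naive difference in means. It is a simplification under stated assumptions: one regressor (`sponsored_by`), candidate FE only (the actual specs also include pollster FE), and no small-sample correction on the sandwich variance. The data are made up and the function names are hypothetical; this is not the code in sp_regressions.py.

```python
from collections import defaultdict
from math import sqrt

def naive_beta(rows):
    """Difference in mean error between sponsored (x=1) and other (x=0) rows."""
    y1 = [y for _, _, x, y in rows if x == 1]
    y0 = [y for _, _, x, y in rows if x == 0]
    return sum(y1) / len(y1) - sum(y0) / len(y0)

def within_fe_beta(rows):
    """rows: (candidate, race, x, y). Returns (beta, race-clustered SE)."""
    # Within transformation: demean x and y inside each candidate.
    by_cand = defaultdict(list)
    for cand, _, x, y in rows:
        by_cand[cand].append((x, y))
    means = {c: (sum(x for x, _ in v) / len(v), sum(y for _, y in v) / len(v))
             for c, v in by_cand.items()}
    demeaned = [(race, x - means[cand][0], y - means[cand][1])
                for cand, race, x, y in rows]
    sxx = sum(xd * xd for _, xd, _ in demeaned)
    beta = sum(xd * yd for _, xd, yd in demeaned) / sxx
    # Sandwich variance: sum the scores x*e within each race, then square.
    score_by_race = defaultdict(float)
    for race, xd, yd in demeaned:
        score_by_race[race] += xd * (yd - beta * xd)
    var = sum(s * s for s in score_by_race.values()) / (sxx * sxx)
    return beta, sqrt(var)

# Toy data (made up): (candidate, race, sponsored_by, poll error in pp).
toy_rows = [
    ("c1", "r1", 0, -10.0), ("c1", "r1", 0, -8.0), ("c1", "r1", 1, -2.0),
    ("c2", "r2", 0, 1.0), ("c2", "r2", 0, 3.0),
    ("c3", "r3", 0, -6.0), ("c3", "r3", 1, 2.0), ("c3", "r3", 1, 4.0),
]
naive = naive_beta(toy_rows)
beta_fe, se_fe = within_fe_beta(toy_rows)
print(f"naive = {naive:+.2f}, within-candidate FE = {beta_fe:+.2f} (SE {se_fe:.2f})")
```

In the toy data the sponsoring candidates have lower baseline errors than the never-sponsoring one, so the naive contrast understates the within-candidate effect. That is the same qualitative pattern as the naive vs Spec 1 rows, though the numbers here are purely illustrative.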
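Symmetry test (Spec 2): β_self − β_opp = +7.24 − (−3.14) = +10.4. The opposite signs suggest the bias operates on the sponsor's own candidate rather than as a generic pollster house effect, though β_opp is not individually significant (p = 0.23) and no SE for the difference is reported here.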
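
A formal version of the symmetry test needs the covariance between the two coefficients, which the table does not report. The sketch below shows the contrast and its standard error as a function of that covariance; the `cov=0.0` default is an assumption for illustration only, and `linear_contrast` is a hypothetical helper, not part of the pipeline.

```python
from math import sqrt

def linear_contrast(b_self, se_self, b_opp, se_opp, cov=0.0):
    """Return (b_self - b_opp, SE of the difference).

    cov is Cov(b_self, b_opp); Var(diff) = se_self^2 + se_opp^2 - 2*cov.
    """
    diff = b_self - b_opp
    se = sqrt(se_self ** 2 + se_opp ** 2 - 2.0 * cov)
    return diff, se

# Spec 2 point estimates and SEs from the table.
# cov=0.0 is an ASSUMPTION: the real covariance must come from the fitted model.
diff, se_diff = linear_contrast(7.24, 2.46, -3.14, 2.63, cov=0.0)
print(f"diff = {diff:+.2f}, SE (under cov = 0) = {se_diff:.2f}")
```

With the fitted model in hand, the covariance term would come from the estimated coefficient covariance matrix rather than being set to zero.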
Spec 3 family: timing-controlled identification
Endogeneity concern (flagged 2026-06-01): candidates may commission polls when they privately believe they're leading, i.e. time-varying private momentum that within-candidate FE doesn't absorb. The Spec 3 family (3a, 3b, 3c) addresses it with two ingredients:
- Comparator restriction: sample restricted to (a) self-sponsored
rows (treatment) plus (b) rows where the poll is sponsored ONLY by
independent media or pollster-self (no committee / party / individual
sponsor). Cuts the comparator class to "contemporaneous independent
polls", exactly the counterfactual the design needs.
Classification logic in
source/analysis/sp_analysis_table.py::classify_sponsor_row.
- Race × time-window FE: identifies β off polls fielded around the same dates in the same race, killing the timing-of-commission story by construction.
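
The comparator restriction can be sketched as a three-way classification of candidate-poll rows. This is an illustrative guess at the rule described above, not the actual `classify_sponsor_row` implementation: the sponsor-type labels, the function name `classify_row_sketch`, and the treatment of polls with no recorded sponsor (excluded) are all assumptions.

```python
# Assumed sponsor-type labels; the real table may encode these differently.
CLEAN_SPONSOR_TYPES = {"media", "pollster_self"}

def classify_row_sketch(sponsor_types, sponsored_by_candidate):
    """Classify one candidate-poll row for the clean-comparator sample.

    sponsor_types: set of sponsor-type labels attached to the poll.
    sponsored_by_candidate: True if the poll is linked to this row's candidate.
    """
    if sponsored_by_candidate:
        return "treatment"
    # "ONLY independent media or pollster-self": every sponsor must be clean.
    if sponsor_types and sponsor_types <= CLEAN_SPONSOR_TYPES:
        return "clean_comparator"
    # Committee / party / individual sponsors, or no sponsor info (assumption).
    return "excluded"

examples = [
    ({"committee"}, True),
    ({"media"}, False),
    ({"media", "pollster_self"}, False),
    ({"media", "party"}, False),
    (set(), False),
]
labels = [classify_row_sketch(types, own) for types, own in examples]
print(labels)
```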
Spec 3a (clean comparator only, no timing FE) gives β = +8.22, close to Spec 2's +7.24: the within-candidate FE was already doing most of the work, and restricting to clean comparators nudges the point estimate up while widening the SE (3.17 vs 2.46). Spec 3b (race × month FE) gives a similar +8.02 but with SE 5.43 (p = 0.14). Spec 3c (race × week FE) is identified off only 3 (race × week) cells in SP and gives β = +15.7: high, and fragile despite the nominal SE of 3.41, because so few cells identify it. The bulk laptop run should give a usable Spec 3c sample.
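
One way to see why Spec 3c rests on so few observations is to count the race × week cells that contain both a treatment row and a clean-comparator row, since only those cells carry within-cell variation in sponsorship. The sketch below uses ISO weeks and made-up rows; the row layout and group labels are assumptions, and the real specification may define its time windows differently.

```python
from collections import defaultdict
from datetime import date

def identifying_cells(rows):
    """rows: (race, field_date, group), group in {'treatment', 'clean_comparator'}.

    Returns the sorted (race, ISO year, ISO week) cells that contain both groups.
    """
    groups = defaultdict(set)
    for race, field_date, group in rows:
        iso = field_date.isocalendar()
        groups[(race, iso[0], iso[1])].add(group)
    both = {"treatment", "clean_comparator"}
    return sorted(cell for cell, seen in groups.items() if both <= seen)

# Toy rows (made up).
toy = [
    ("race_A", date(2024, 9, 2), "treatment"),
    ("race_A", date(2024, 9, 5), "clean_comparator"),   # same ISO week as above
    ("race_A", date(2024, 9, 10), "clean_comparator"),  # following week
    ("race_B", date(2024, 9, 3), "treatment"),          # no comparator that week
]
cells = identifying_cells(toy)
print(cells)
```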
Pre-poll trajectory placebo
A direct test of the "self-sponsor when leading" hypothesis. For each candidate with a self-sponsored poll, look at their poll number in the most recent INDEPENDENT poll fielded before the self-sponsored one in the same race. If "candidate commissions when leading" is the explanation, the preceding independent poll should already be high (both polls are measuring the same private peak).
7 SP candidates qualify (a self-sponsored poll preceded by an independent poll in the same race). Median time gap: 4 days.
| Metric | Value (pp) |
|---|---|
| Mean error in self-sponsored polls | −1.28 (close to truth) |
| Mean error in preceding independent polls | −12.55 (large negative — they understated) |
| Mean within-candidate jump (self − pre-indep) | +11.27 |
In other words, for the same candidate, the self-sponsored poll lands ~11 pp higher than the immediately preceding independent poll. The time gap is too short (median 4 days) for genuine momentum to plausibly explain that magnitude.
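
The matching rule behind the placebo (most recent independent poll fielded before each self-sponsored poll, same candidate and race) can be sketched as follows. The dict keys, the `"self"` / `"independent"` labels, and the numbers are made up for illustration; error is taken to be poll share minus final share, in pp.

```python
from datetime import date

def pre_poll_jumps(polls):
    """Return (candidate, gap_days, jump) for each self-sponsored poll that has
    an earlier independent poll for the same candidate.

    jump = error of the self-sponsored poll minus error of the latest prior
    independent poll.
    """
    out = []
    for sp in (p for p in polls if p["kind"] == "self"):
        prior = [p for p in polls
                 if p["kind"] == "independent"
                 and p["candidate"] == sp["candidate"]
                 and p["field_date"] < sp["field_date"]]
        if not prior:
            continue  # candidate does not qualify for the placebo
        latest = max(prior, key=lambda p: p["field_date"])
        gap = (sp["field_date"] - latest["field_date"]).days
        out.append((sp["candidate"], gap, sp["error"] - latest["error"]))
    return out

# Toy polls (made up).
toy_polls = [
    {"candidate": "X", "field_date": date(2024, 9, 1), "kind": "independent", "error": -12.0},
    {"candidate": "X", "field_date": date(2024, 9, 10), "kind": "independent", "error": -11.0},
    {"candidate": "X", "field_date": date(2024, 9, 14), "kind": "self", "error": -1.0},
    {"candidate": "Y", "field_date": date(2024, 9, 12), "kind": "self", "error": 3.0},
]
jumps = pre_poll_jumps(toy_polls)
print(jumps)
```

Candidate Y is skipped because no independent poll precedes the self-sponsored one, which is the same qualification rule that leaves 7 SP candidates in the placebo.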
Caveat: only 7 candidates contribute to the placebo, and the bulk run will firm this up. Still, both the direction and the magnitude support the slant interpretation over the timing-of-commission alternative.
Interpretation (caveats apply)
- Naive → FE-adjusted β jumps from +2 to +7. Self-sponsoring candidates are negatively selected on true standing — the within-candidate FE strips this confound.
- ~7 pp self-sponsor bias is large but plausible given Brazil's "encomendada" stereotype. The TSE registration regime means we observe every registered poll, including unpublicized ones, so this estimate isn't contaminated by selection-into-release.
- Spec 3 (LLM methodology controls; distinct from the timing-controlled Spec 3 family above) is deferred: the poll_methodology extractor isn't built yet (queued in pipelines/politica/docs/todo.md). Without it, we can't decompose Channel A (Bayesian-persuasion via design) vs Channel B (residual / fabrication). The fact that β is stable from Spec 1 to Spec 2 (+7.64 → +7.24) hints that structured methodology controls don't absorb much of it, leaving room for both channels.
Caveats
- n = 15 self-sponsored rows in the regression sample is thin; the +7 to +15 estimates have wide CIs even with cluster-robust SEs. Bulk laptop run (~9,737 non-SP PDFs) will give 26 UFs and many more self-sponsored observations.
- Slant vs. correction is undecidable from this slice alone. The preceding independent polls UNDERSHOT the eventual final share by ~12 pp on average. Two interpretations: (i) committee polls slant up via methodology choices / fabrication (the Channel A / B story); (ii) independent polls (often from local media in small munis) systematically undershoot eventual winners, and committee polls simply correct. Within-candidate FE plus the short median time gap (4 days) make (ii) hard to sustain mechanically, but the question is open until Spec 3 (LLM methodology controls) can decompose Channel A from fabrication.
- Route C deferred — partidos-directorate CNPJ table not staged (~728 protocols by the playbook estimate; would grow self-sponsored N by ~3x).
- match_score ≥ 2 filter drops 1,348 candidate-poll rows (4,847 → 3,499); the residue is single-token matches that include both legitimate matches and false positives. The NM_URNA_CANDIDATO (ballot name) pass-through queued in the politica todo would lift match quality enough to relax this filter.
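
To put numbers on the "wide CIs" caveat, the reported point estimates and SEs can be turned into approximate 95% intervals. This sketch uses the normal critical value 1.96, which is an assumption: with few clusters contributing self-sponsored rows, a t-based or bootstrap interval would be wider.

```python
def ci95(beta, se, z=1.96):
    """Normal-approximation 95% confidence interval for a coefficient."""
    return beta - z * se, beta + z * se

# Point estimates and SEs for sponsored_by, taken from the regression table.
specs = [("Spec 2", 7.24, 2.46), ("Spec 3c", 15.73, 3.41)]
intervals = {label: ci95(beta, se) for label, beta, se in specs}
for label, (lo, hi) in intervals.items():
    print(f"{label}: [{lo:.2f}, {hi:.2f}]")
```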
Reproduce
# On educloud, in /workspace/pipelines/politica:
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
python source/llm/poll_extract.py --year 2024 --states SP --validate-cached
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
python source/clean/poll_sponsor.py
# Then in /workspace/projects/poll-sponsor-bias:
python source/assemble/cand_poll.py
python source/analysis/sp_regressions.py