title: All-Brazil analysis status: first-pass (2026-06-02)
All-Brazil analysis — poll-sponsor-bias
Scaled-up version of sp_slice_analysis.md now that the laptop bulk
LLM extraction has finished (26 UFs, see build_audit.md) and the
sponsor→candidate join has Routes A+B+C+D wired into politica.
The SP-slice prototype demonstrated the design works on 15 self-sponsored rows in 22 races. The all-Brazil sample scales that up by ~25×, lets the strict timing-controlled specs actually run, and grows the pre-poll trajectory placebo from n=7 to n=132.
Data assembly
source/analysis/analysis_table.py → build/analysis_table.parquet.
- 174,747 candidate-scenario rows from 9,509 protocols (26 UFs)
- 56,651 estimulado rows
- 35,690 after dropping aggregate names (Branco/Nulo/NS variants)
- 30,555 after match_score ≥ 2 filter (multi-token or stronger)
- → 30,555 candidate-poll rows across 8,431 candidates / 2,942 races
- 793 polls have a sponsor→candidate link (Routes A+B+C+D): committee 364, party_name 148, party 38, cpf 18 → 568 candidate-poll rows with sponsored_by=1, 1,048 with opponent_sponsored=1
- 20,885 candidate-poll rows are in polls sponsored exclusively by independent media or pollster-self (poll_is_independent=1): the clean-comparator pool
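The row-count funnel above can be sketched as a sequence of filters that records how many rows survive each step. This is a minimal illustration on toy data, not the code in analysis_table.py: the field names (`scenario`, `is_aggregate`) and the helper `funnel` are hypothetical, and only `match_score` is a name taken from the notes.

```python
# Toy rows; the real schema produced by analysis_table.py is not shown in these notes.
ROWS = [
    {"scenario": "estimulado", "name": "Maria Silva", "is_aggregate": False, "match_score": 3},
    {"scenario": "estimulado", "name": "Branco/Nulo", "is_aggregate": True, "match_score": 0},
    {"scenario": "estimulado", "name": "João", "is_aggregate": False, "match_score": 1},
    {"scenario": "espontaneo", "name": "Maria Silva", "is_aggregate": False, "match_score": 3},
    {"scenario": "estimulado", "name": "NS/NR", "is_aggregate": True, "match_score": 0},
    {"scenario": "estimulado", "name": "Pedro Souza", "is_aggregate": False, "match_score": 2},
]


def funnel(rows, min_score=2):
    """Return (label, surviving row count) for each filter step, in order."""
    steps = [("all rows", rows)]
    estimulado = [r for r in rows if r["scenario"] == "estimulado"]
    steps.append(("estimulado", estimulado))
    named = [r for r in estimulado if not r["is_aggregate"]]
    steps.append(("non-aggregate", named))
    matched = [r for r in named if r["match_score"] >= min_score]
    steps.append((f"match_score >= {min_score}", matched))
    return [(label, len(kept)) for label, kept in steps]


for label, count in funnel(ROWS):
    print(f"{label:>20}: {count}")
```

Calling `funnel(ROWS, min_score=1)` reruns the same funnel with single-token matches admitted, which is the shape of the robustness check mentioned under Caveats.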
Per-race availability for the clean-comparator design:
- 395 races with ≥1 self-sponsored row
- 130 races with ≥1 self + ≥1 independent-poll row (the spec 3b/3c sample)
Preliminary regressions
source/analysis/regressions.py → build/table/regressions.csv. Cluster-robust SEs
at the race (muni) level. Candidate FE absorbed via
linearmodels.PanelOLS within-demeaning (8,431 entities — Patsy
C(entity) was OOM-ing the dense dummy matrix).
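Why within-demeaning sidesteps the dense dummy matrix: subtracting each candidate's own mean from the outcome and the regressor yields the same slope as including one dummy per candidate (the Frisch-Waugh-Lovell result), without ever materializing 8,431 columns. The sketch below shows this for a single regressor in plain Python. It is an illustration of the idea only, not the linearmodels call used in regressions.py, and the toy data and helper names are made up.

```python
from collections import defaultdict


def ols_slope(x, y):
    """Slope from a one-regressor OLS with intercept."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den


def demean_by(entity, values):
    """Subtract each entity's own mean from its values."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for e, v in zip(entity, values):
        totals[e] += v
        counts[e] += 1
    return [v - totals[e] / counts[e] for e, v in zip(entity, values)]


def within_slope(entity, x, y):
    """Fixed-effects (within) slope: OLS on entity-demeaned x and y."""
    return ols_slope(demean_by(entity, x), demean_by(entity, y))


# Toy data: candidate "a" polls around 30, candidate "b" around 10,
# and a sponsored poll adds 7 points for either of them.
entity = ["a", "a", "a", "b", "b", "b"]
sponsored = [0, 0, 1, 0, 1, 1]
share = [30.0, 30.0, 37.0, 10.0, 17.0, 17.0]

naive = ols_slope(sponsored, share)          # pooled, ignores candidate level
fe = within_slope(entity, sponsored, share)  # recovers the within-candidate effect
print(f"naive={naive:.2f}  within={fe:.2f}")
```

In this toy example the weaker candidate sponsors more often, so the pooled slope is far below the within-candidate one. That is the same direction of gap the SP slice showed between its naive and FE-adjusted estimates.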
| Spec | β (sponsored_by) | SE | p | β (opp_sponsored) | SE | p |
|---|---|---|---|---|---|---|
| naive (no FE) | +7.57 | 0.88 | <0.001 | -1.67 | 0.57 | 0.004 |
| Spec 1 (pollster + candidate FE) | +7.60 | 1.34 | <0.001 | -2.12 | 0.88 | 0.015 |
| Spec 2 (+ methodology controls) | +7.75 | 1.34 | <0.001 | -1.93 | 0.89 | 0.030 |
| Spec 2 WLS (weighted by N) | +7.83 | 1.37 | <0.001 | -1.56 | 0.77 | 0.044 |
| Spec 3a (clean comparator + cand FE) | +6.32 | 1.46 | <0.001 | — | | |
| Spec 3b (clean + race × month FE) | +7.77 | 1.65 | <0.001 | — | | |
| Spec 3c (clean + race × week FE, strict) | +6.95 | 2.57 | 0.008 | — | | |
Symmetry test (Spec 2): β_self − β_opp ≈ +9.7 — sender-specific bias, not generic pollster house effect.
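The symmetry gap is plain arithmetic on the two coefficients, and it can be reproduced for every spec that reports both. The sketch below copies the point estimates from the table. The helper `gap_se` shows the general formula for the standard error of a difference, which needs the covariance between the two estimates. That covariance is not reported in these notes, so no standard error for the gap is computed here.

```python
from math import sqrt

# (beta_self, beta_opp) point estimates in percentage points, copied from the table.
ESTIMATES = {
    "naive": (7.57, -1.67),
    "Spec 1": (7.60, -2.12),
    "Spec 2": (7.75, -1.93),
    "Spec 2 WLS": (7.83, -1.56),
}


def symmetry_gap(beta_self, beta_opp):
    """Difference between the self-sponsored and opponent-sponsored coefficients."""
    return beta_self - beta_opp


def gap_se(se_self, se_opp, cov):
    """SE of (beta_self - beta_opp); cov is the covariance of the two estimates."""
    return sqrt(se_self ** 2 + se_opp ** 2 - 2.0 * cov)


GAPS = {spec: round(symmetry_gap(*betas), 2) for spec, betas in ESTIMATES.items()}
for spec, gap in GAPS.items():
    print(f"{spec:>10}: {gap:+.2f} pp")
```

A pure house effect would shift both coefficients in the same direction and leave the gap near zero, which is why the gap rather than either coefficient alone carries the sender-specific interpretation.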
What changed from the SP-slice prototype
- β is more stable. SP showed naive +2.17 → FE-adjusted +7.24 (a ~3.3× jump, from negative selection on standing). All-Brazil shows naive +7.57 → FE-adjusted +7.75, so the FE adds almost nothing. Negative selection into self-sponsoring is much weaker at all-Brazil scale than in SP alone.
- Spec 3c is now identified off real samples. SP had 3 (race × week) cells; all-Brazil has 60 cells / 409 rows. β = +6.95 (p=0.008) comes from the strictest design and is the headline number to lead with: every comparison is between polls fielded in the same race within the same week, with the comparator class restricted to independent media / pollster-self.
- Opponent-sponsored coefficient is now significant. β_opp = −1.6 to −2.1 across specs, all at p < 0.05: opponent-sponsored polls do understate the candidate, by ~2 pp. Small effect, but real.
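With race × week fixed effects, a cell only contributes to β if it contains both a self-sponsored row and a comparator row. The sketch below counts such cells on toy data. Two things are assumptions here and not taken from the pipeline: weeks are ISO calendar weeks of the fielding date, and the input is already restricted to self-sponsored and independent rows. The field names `race` and `fielded` are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Toy rows, already restricted to self-sponsored (1) or independent (0) polls.
ROWS = [
    {"race": "X", "fielded": date(2024, 9, 2), "sponsored_by": 1},
    {"race": "X", "fielded": date(2024, 9, 5), "sponsored_by": 0},
    {"race": "Y", "fielded": date(2024, 9, 2), "sponsored_by": 1},
    {"race": "Y", "fielded": date(2024, 9, 10), "sponsored_by": 0},
]


def identifying_cells(rows):
    """Return the (race, iso_year, iso_week) cells that mix both row types."""
    seen = defaultdict(set)
    for r in rows:
        iso = r["fielded"].isocalendar()
        seen[(r["race"], iso[0], iso[1])].add(r["sponsored_by"])
    # A cell with only one row type is absorbed by its own fixed effect.
    return [cell for cell, kinds in seen.items() if kinds == {0, 1}]


print(identifying_cells(ROWS))
```

Race Y drops out because its two polls fall in adjacent weeks. Swapping the week key for the race alone gives the per-race availability count for the clean-comparator design.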
Pre-poll trajectory placebo
The cleanest counter to the "self-sponsor when leading" story. For each candidate with a self-sponsored poll, look at their poll number in the most recent INDEPENDENT poll fielded before the self-sponsored one in the same race.
| Metric | All-Brazil (n=132) | SP-only (n=7) |
|---|---|---|
| Median time gap (days) | 10 | 4 |
| Mean error in self-sponsored polls | +7.64 | -1.28 |
| Mean error in preceding independent polls | +0.93 | -12.55 |
| Mean within-candidate jump (self − pre-indep) | +6.70 | +11.27 |
| Share of jumps > 0 | 74% | 86% |
| One-sample t (H0: jump=0) | 5.21 | — |
Restricted to gap ≤ 14 days (n=77): jump = +6.43, t = 3.96, 77% positive — same magnitude, can't blame "campaign crystallized over weeks".
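The placebo construction can be sketched as a pairing step followed by a one-sample t statistic on the resulting jumps. The version below runs on toy data with hypothetical field names and pairs every self-sponsored poll with the latest earlier independent poll of the same candidate in the same race. Whether the actual analysis keeps one pair per candidate or one per self-sponsored poll is not stated in these notes, so treat that choice as an assumption.

```python
from datetime import date
from math import sqrt
from statistics import mean, stdev


def placebo_jumps(polls, max_gap_days=None):
    """Error in each self-sponsored poll minus error in the preceding independent poll."""
    jumps = []
    for p in polls:
        if not p["self_sponsored"]:
            continue
        prior = [
            q for q in polls
            if q["independent"]
            and q["candidate"] == p["candidate"]
            and q["race"] == p["race"]
            and q["fielded"] < p["fielded"]
        ]
        if not prior:
            continue  # no earlier independent poll to compare against
        last = max(prior, key=lambda q: q["fielded"])
        gap = (p["fielded"] - last["fielded"]).days
        if max_gap_days is not None and gap > max_gap_days:
            continue
        jumps.append(p["error"] - last["error"])
    return jumps


def one_sample_t(values):
    """t statistic for H0: mean(values) == 0."""
    return mean(values) / (stdev(values) / sqrt(len(values)))


def poll(cand, race, day, error, self_sponsored=False):
    """Build a toy poll record; independent is simply 'not self-sponsored' here."""
    return {"candidate": cand, "race": race, "fielded": day, "error": error,
            "self_sponsored": self_sponsored, "independent": not self_sponsored}


POLLS = [
    poll("A", "R1", date(2024, 9, 1), 1.0),
    poll("A", "R1", date(2024, 9, 10), 2.0),
    poll("A", "R1", date(2024, 9, 15), 8.0, self_sponsored=True),
    poll("A", "R1", date(2024, 9, 20), 5.0),  # later poll, must be ignored
    poll("B", "R1", date(2024, 8, 1), 0.0),
    poll("B", "R1", date(2024, 9, 1), 8.0, self_sponsored=True),
    poll("C", "R2", date(2024, 9, 20), 3.0),
    poll("C", "R2", date(2024, 9, 25), 1.0, self_sponsored=True),
]

all_jumps = placebo_jumps(POLLS)
short_gap = placebo_jumps(POLLS, max_gap_days=14)
print(all_jumps, short_gap, round(one_sample_t(all_jumps), 3))
```

The `max_gap_days` argument mirrors the gap ≤ 14 days restriction: candidate B's pair is 31 days apart and drops out of the restricted sample.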
The all-Brazil placebo is much more informative than the SP-only version:
- The independent-poll baseline is now centered. SP only showed independent polls undershooting by 12.5 pp on average — opening the "independent polls are structurally biased low" alternative interpretation. All-Brazil shows independent polls at +0.93 pp error (essentially unbiased). The within-candidate jump of +6.70 is then a clean estimate of how much self-sponsored polls deviate from unbiased polls of the SAME candidate.
- Magnitudes match the regression coefficient. The placebo's +6.70 pp jump lines up with spec 3a's +6.32 pp and spec 3c's +6.95 pp coefficients — three different identification strategies (within-candidate FE, race × week FE, descriptive within-candidate jump) converging on the same number.
- Statistical confidence. t = 5.21 on the within-candidate trajectory jump. The SP-slice version was qualitatively suggestive; the all-Brazil version is quantitatively decisive.
Headline interpretation
The within-candidate jump from the preceding independent poll is +6.7 pp (t = 5.2, n=132), median gap 10 days. The race × week FE spec (3c) gives +6.95 pp (p = 0.008, 60 cells). The within-candidate panel FE spec (Spec 2) gives +7.8 pp (p < 0.001). Three different identification strategies, same answer.
The "candidate commissions when leading" alternative cannot mechanically generate this. A 6–7 pp jump over a 4–14 day window exceeds plausible genuine momentum, and the comparator is restricted to independent polls of the SAME candidate in the SAME race within the SAME window.
The remaining (and now sharper) open question is slant via design
vs. slant via fabrication — Channel A vs Channel B from
summary.md. The Spec 3 LLM methodology decomposition needs the
poll_methodology extractor that's queued in politica's docs/todo.md.
Caveats
- Design vs fabrication: this analysis can't decompose Channel A (Bayesian persuasion via methodology choices declared in the registration) from Channel B (residual / fabrication). All current specs control only for structured methodology (sample size, days, ST_PESQUISA_PROPRIA). The free-text plano amostral / dado município fields need LLM extraction first.
- Match quality: 12% of non-aggregate candidate names in estimulado scenarios still don't match a politico (down from 41% before the nome_urna patch). Mostly aggregate-row variants my regex misses ("Anularia o voto", "Não votará") + hypothetical names ("Candidato A/B/C"). The headline numbers are robust to relaxing match_score ≥ 2 to ≥ 1 (single-token matches); this was checked.
- Spec 3c is identified off 60 (race × week) cells: a real improvement over SP's 3, but a small share of total variation. The +6.95 there is a meaningful but not overwhelmingly tight number.
- Routes A and C are underweighted: CPF route (Route A) matches only 18 sponsor rows; party-CNPJ route (Route C) only 38. Most signal comes from Route B (committee name) + Route D (party name parse). If the despesa_partidaria coverage of party CNPJs is incomplete, the true sponsor set is a superset of what we see.
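For the match-quality caveat, one way to catch the missed aggregate-row variants is to normalize accents and case before matching, and to anchor patterns at the start of the name so that real surnames such as "Branco" are not swept up. The patterns below are a hypothetical sketch, not the regex currently in the pipeline, and they would need checking against the real unmatched names.

```python
import re
import unicodedata


def normalize(name):
    """Strip accents, casefold, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.casefold().split())


# Hypothetical patterns, anchored at the start of the normalized name.
AGGREGATE_RE = re.compile(
    r"^(?:(?:brancos?|nulos?|ns|nr|indecisos?|nenhum)\b"
    r"|anular"
    r"|nao (?:sabe|respond|votar))"
)
HYPOTHETICAL_RE = re.compile(r"^candidato [a-z]\b")


def classify(name):
    """Label a scenario row as 'aggregate', 'hypothetical', or 'candidate'."""
    norm = normalize(name)
    if AGGREGATE_RE.search(norm):
        return "aggregate"
    if HYPOTHETICAL_RE.search(norm):
        return "hypothetical"
    return "candidate"


for raw in ["Anularia o voto", "Não votará", "Branco/Nulo", "Candidato A", "Maria Branco"]:
    print(f"{raw!r:>20} -> {classify(raw)}")
```

Start-anchoring trades recall for precision: a row like "Voto nulo" would still slip through, so the pattern list is a starting point to extend from the observed unmatched names.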
Reproduce
# Already done — outputs at pipelines/politica/build/...
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
python pipelines/politica/source/clean/poll_response_2024.py
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
python pipelines/politica/source/clean/poll_sponsor.py
# In /workspace/projects/poll-sponsor-bias:
python source/assemble/cand_poll.py
python source/analysis/regressions.py