title: All-Brazil analysis status: first-pass (2026-06-02)

All-Brazil analysis — poll-sponsor-bias

Scaled-up version of sp_slice_analysis.md now that the laptop bulk LLM extraction has finished (26 UFs, see build_audit.md) and the sponsor→candidate join has Routes A+B+C+D wired into politica.

The SP-slice prototype demonstrated that the design works on 15 self-sponsored rows in 22 races. The all-Brazil sample is roughly 25× larger, lets the strict timing-controlled specs actually run, and raises the pre-poll trajectory placebo from n=7 to n=132.

Data assembly

source/analysis/analysis_table.py → build/analysis_table.parquet.

Per-race availability for the clean-comparator design:

Preliminary regressions

source/analysis/regressions.py → build/table/regressions.csv. SEs are cluster-robust at the race (muni) level. Candidate FE are absorbed via linearmodels.PanelOLS within-demeaning (8,431 entities; a Patsy C(entity) term ran out of memory building the dense dummy matrix).
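A minimal sketch of why within-demeaning avoids the dense dummy matrix: subtracting each candidate's mean from the outcome and the regressor gives the same slope as including one dummy per candidate, with no 8,431-column design matrix. This is a pure-Python illustration for a single regressor, not the code in regressions.py; the function names and toy data are hypothetical, and the sandwich SE omits the finite-sample corrections that a library such as linearmodels may apply.

```python
from collections import defaultdict


def within_demean(values, groups):
    """Subtract each group's mean from its members (the 'within' transform)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def fe_slope_clustered(y, x, entity, cluster):
    """One-regressor fixed-effects slope with a cluster-robust sandwich SE."""
    y_t = within_demean(y, entity)
    x_t = within_demean(x, entity)
    sxx = sum(a * a for a in x_t)
    beta = sum(a * b for a, b in zip(x_t, y_t)) / sxx
    # Sum the score (x_tilde * residual) within each cluster, then square.
    score = defaultdict(float)
    for a, b, c in zip(x_t, y_t, cluster):
        score[c] += a * (b - beta * a)
    se = sum(s * s for s in score.values()) ** 0.5 / sxx
    return beta, se


# Toy data (hypothetical): two candidates, each in one race, three polls each.
entity = ["a", "a", "a", "b", "b", "b"]
cluster = ["r1", "r1", "r1", "r2", "r2", "r2"]
sponsored_by = [0, 1, 0, 0, 0, 1]
poll_error = [40.0, 47.5, 39.5, 20.0, 21.0, 27.0]

beta, se = fe_slope_clustered(poll_error, sponsored_by, entity, cluster)
print(f"beta = {beta:.3f}, clustered SE = {se:.3f}")
```

The candidate-level intercepts (40-ish versus 20-ish in the toy data) never enter the slope, which is the point of the transform.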

| Spec | β (sponsored_by) | SE | p | β (opp_sponsored) | SE | p |
|---|---|---|---|---|---|---|
| naive (no FE) | +7.57 | 0.88 | <0.001 | -1.67 | 0.57 | 0.004 |
| Spec 1 (pollster + candidate FE) | +7.60 | 1.34 | <0.001 | -2.12 | 0.88 | 0.015 |
| Spec 2 (+ methodology controls) | +7.75 | 1.34 | <0.001 | -1.93 | 0.89 | 0.030 |
| Spec 2 WLS (weighted by N) | +7.83 | 1.37 | <0.001 | -1.56 | 0.77 | 0.044 |
| Spec 3a (clean comparator + cand FE) | +6.32 | 1.46 | <0.001 | | | |
| Spec 3b (clean + race × month FE) | +7.77 | 1.65 | <0.001 | | | |
| Spec 3c (clean + race × week FE, strict) | +6.95 | 2.57 | 0.008 | | | |

Symmetry test (Spec 2): β_self − β_opp ≈ +9.7 — sender-specific bias, not generic pollster house effect.
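The symmetry gap is simple arithmetic on the Spec 2 row, sketched below. The helper `se_of_difference` is a hypothetical name; it encodes the standard variance formula for a difference of two coefficients and is deliberately not called on the real estimates, because the covariance between the two coefficients is not reported here.

```python
import math

# Spec 2 point estimates, copied from the regression table.
beta_self = 7.75   # sponsored_by
beta_opp = -1.93   # opp_sponsored

gap = beta_self - beta_opp
print(f"beta_self - beta_opp = {gap:+.2f}")  # prints +9.68


def se_of_difference(se_a, se_b, cov_ab):
    """SE of (a - b): sqrt(Var(a) + Var(b) - 2 * Cov(a, b)).

    cov_ab must come from the fitted model's covariance matrix;
    it cannot be recovered from the two reported SEs alone.
    """
    return math.sqrt(se_a ** 2 + se_b ** 2 - 2.0 * cov_ab)
```

A formal test of β_self = β_opp would plug the Spec 2 covariance into that helper and compare the gap to the resulting SE.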

What changed from the SP-slice prototype

Pre-poll trajectory placebo

The cleanest counter to the "self-sponsor when leading" story. For each candidate with a self-sponsored poll, look at their poll number in the most recent INDEPENDENT poll fielded before the self-sponsored one in the same race.

| Metric | All-Brazil (n=132) | SP-only (n=7) |
|---|---|---|
| Median time gap (days) | 10 | 4 |
| Mean error in self-sponsored polls | +7.64 | -1.28 |
| Mean error in preceding independent polls | +0.93 | -12.55 |
| Mean within-candidate jump (self − pre-indep) | +6.70 | +11.27 |
| Share of jumps > 0 | 74% | 86% |
| One-sample t (H0: jump=0) | 5.21 | |

Restricted to gap ≤ 14 days (n=77): jump = +6.43, t = 3.96, 77% positive — same magnitude, can't blame "campaign crystallized over weeks".
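The matching step behind the placebo can be sketched as follows: for every self-sponsored poll, take the most recent independent poll of the same candidate in the same race fielded strictly earlier, record the gap in days and the jump in poll error, then run a one-sample t-test on the jumps. The row layout, function names, and toy values are hypothetical, and "error" is assumed to be in percentage points; the project's actual assembly code may differ.

```python
from datetime import date
from math import sqrt
from statistics import mean, stdev

# Hypothetical rows: (race, candidate, field_date, self_sponsored, error_pp)
polls = [
    ("race1", "A", date(2024, 9, 1), False, 1.0),
    ("race1", "A", date(2024, 9, 10), False, 0.5),
    ("race1", "A", date(2024, 9, 15), True, 8.0),
    ("race2", "B", date(2024, 9, 3), False, -2.0),
    ("race2", "B", date(2024, 9, 20), True, 3.0),
    ("race3", "C", date(2024, 9, 5), True, 4.0),  # no earlier independent poll
    ("race4", "D", date(2024, 9, 8), False, 2.0),
    ("race4", "D", date(2024, 9, 12), True, 6.0),
]


def placebo_jumps(rows, max_gap_days=None):
    """Return (gap_days, jump) for each self-sponsored poll with a match."""
    jumps = []
    for race, cand, day, is_self, err in rows:
        if not is_self:
            continue
        earlier = [
            r for r in rows
            if r[0] == race and r[1] == cand and not r[3] and r[2] < day
        ]
        if not earlier:
            continue  # unmatched self-sponsored polls drop out of the placebo
        prev = max(earlier, key=lambda r: r[2])  # most recent independent poll
        gap = (day - prev[2]).days
        if max_gap_days is not None and gap > max_gap_days:
            continue
        jumps.append((gap, err - prev[4]))
    return jumps


def one_sample_t(xs):
    """t statistic for H0: mean(xs) == 0."""
    return mean(xs) / (stdev(xs) / sqrt(len(xs)))


all_jumps = placebo_jumps(polls)
tight_jumps = placebo_jumps(polls, max_gap_days=14)
print(all_jumps, one_sample_t([j for _, j in all_jumps]))
print(tight_jumps)
```

Passing `max_gap_days=14` mirrors the restricted version of the placebo, which drops pairs where the independent poll is more than two weeks old.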

The all-Brazil placebo is much more informative than the SP-only version:

  1. The independent-poll baseline is now centered. SP only showed independent polls undershooting by 12.5 pp on average — opening the "independent polls are structurally biased low" alternative interpretation. All-Brazil shows independent polls at +0.93 pp error (essentially unbiased). The within-candidate jump of +6.70 is then a clean estimate of how much self-sponsored polls deviate from unbiased polls of the SAME candidate.
  2. Magnitudes match the regression coefficient. The placebo's +6.70 pp jump lines up with spec 3a's +6.32 pp and spec 3c's +6.95 pp coefficients — three different identification strategies (within-candidate FE, race × week FE, descriptive within-candidate jump) converging on the same number.
  3. Statistical confidence. t = 5.21 on the within-candidate trajectory jump. The SP-slice version was qualitatively suggestive; the all-Brazil version is quantitatively decisive.

Headline interpretation

The within-candidate jump from the preceding independent poll is +6.7 pp (t = 5.2, n=132), median gap 10 days. The race × week FE spec gives +6.95 pp (p = 0.008, 60 cells). The within-candidate panel FE spec gives +7.8 pp (p < 0.001). Three different identification strategies, same answer.

The "candidate commissions when leading" alternative cannot mechanically generate this. A 6–7 pp jump over a 4–14 day window exceeds plausible genuine momentum, and the comparator is restricted to independent polls of the SAME candidate in the SAME race within the SAME window.

The remaining (and now sharper) open question is slant via design vs. slant via fabrication — Channel A vs Channel B from summary.md. The Spec 3 LLM methodology decomposition needs the poll_methodology extractor that's queued in politica's docs/todo.md.

Caveats

Reproduce

# Already done — outputs at pipelines/politica/build/...
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
  python pipelines/politica/source/clean/poll_response_2024.py
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
  python pipelines/politica/source/clean/poll_sponsor.py

# In /workspace/projects/poll-sponsor-bias:
python source/assemble/cand_poll.py
python source/analysis/regressions.py