title: All-Brazil analysis status: first-pass (2026-06-02)

All-Brazil analysis — poll-sponsor-bias

Scaled-up version of sp_slice_analysis.md now that the laptop bulk LLM extraction has finished (26 UFs, see build_audit.md) and the sponsor→candidate join has Routes A+B+C+D wired into politica.

The SP-slice prototype demonstrated the design works on 15 self-sponsored rows in 22 races. The all-Brazil sample multiplies that by ~25×, lets the strict timing-controlled specs actually run, and brings the pre-poll trajectory placebo from n=7 to n=132.

Data assembly

source/analysis/analysis_table.py → build/analysis_table.parquet.

174,747 candidate-scenario rows from 9,509 protocols (26 UFs)
56,651 estimulado rows
35,690 after dropping aggregate names (Branco/Nulo/NS variants)
30,555 after match_score ≥ 2 filter (multi-token or stronger)
→ 30,555 candidate-poll rows across 8,431 candidates / 2,942 races
793 polls have a sponsor→candidate link (Routes A+B+C+D). By route: committee (Route B) 364, party_name (Route D) 148, party CNPJ (Route C) 38, cpf (Route A) 18; these sum to the 568 candidate-poll rows with sponsored_by=1. A further 1,048 rows have opponent_sponsored=1
20,885 candidate-poll rows are in polls sponsored exclusively by independent media or pollster-self (poll_is_independent=1) — the clean-comparator pool
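
The filter sequence above can be sketched in pandas as follows. This is a minimal illustration, not the code in analysis_table.py: the column names (`scenario_type`, `candidate_name`, `match_score`) and the aggregate-name pattern are assumptions made for the example.

```python
import pandas as pd

# Hypothetical pattern for aggregate rows (Branco/Nulo/NS variants).
AGGREGATE_PATTERN = r"(?i)^(?:brancos?|nulos?|ns|nr|n[aã]o sabe)\b"


def sample_funnel(rows: pd.DataFrame) -> tuple[pd.DataFrame, dict[str, int]]:
    """Apply the three filters in order and record the row count after each."""
    counts = {"all": len(rows)}

    # Keep only prompted (estimulado) scenarios.
    est = rows[rows["scenario_type"] == "estimulado"]
    counts["estimulado"] = len(est)

    # Drop aggregate answer categories that are not candidates.
    named = est[~est["candidate_name"].str.match(AGGREGATE_PATTERN)]
    counts["non_aggregate"] = len(named)

    # Keep multi-token or stronger name matches.
    matched = named[named["match_score"] >= 2]
    counts["match_score_ge_2"] = len(matched)
    return matched, counts


# Toy input, invented for illustration only.
toy_rows = pd.DataFrame({
    "scenario_type": ["estimulado", "estimulado", "estimulado", "estimulado", "espontanea"],
    "candidate_name": ["Maria Souza", "Branco/Nulo", "NS/NR", "Joao", "Maria Souza"],
    "match_score": [3, 0, 0, 1, 3],
})
kept_rows, funnel_counts = sample_funnel(toy_rows)
print(funnel_counts)
```

Recording the count after every step makes it easy to compare a rerun against the funnel reported above.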

Per-race availability for the clean-comparator design:

395 races with ≥1 self-sponsored row
130 races with ≥1 self + ≥1 independent-poll row (the spec 3b/3c sample)
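
A sketch of how the two per-race counts above could be computed, assuming the analysis table has a `race_id` column (hypothetical name) alongside the `sponsored_by` and `poll_is_independent` flags:

```python
import pandas as pd


def race_availability(df: pd.DataFrame) -> dict[str, int]:
    """Count races with any self-sponsored row, and races that also have an independent-poll row."""
    # One row per race: True if any candidate-poll row in the race has the flag set.
    flags = df.groupby("race_id")[["sponsored_by", "poll_is_independent"]].max().astype(bool)
    both = flags["sponsored_by"] & flags["poll_is_independent"]
    return {
        "races_with_self": int(flags["sponsored_by"].sum()),
        "races_with_self_and_indep": int(both.sum()),
    }


# Toy input: r1 has both kinds of row, r2 only self-sponsored, r3 only independent.
toy_races = pd.DataFrame({
    "race_id": ["r1", "r1", "r2", "r3"],
    "sponsored_by": [1, 0, 1, 0],
    "poll_is_independent": [0, 1, 0, 1],
})
availability = race_availability(toy_races)
print(availability)
```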

Preliminary regressions

source/analysis/regressions.py → build/table/regressions.csv. Cluster-robust SEs at the race (muni) level. Candidate FE absorbed via linearmodels.PanelOLS within-demeaning (8,431 entities — Patsy C(entity) was OOM-ing the dense dummy matrix).
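
To illustrate why within-demeaning avoids the dense dummy matrix, here is a minimal numpy/pandas sketch of the within estimator with cluster-robust SEs. It is not the linearmodels implementation used in regressions.py: the column names, the dependent variable `error_pp`, and the simple G/(G-1) small-sample correction are assumptions for the example.

```python
import numpy as np
import pandas as pd


def within_ols_clustered(df, y, xs, entity, cluster):
    """OLS on entity-demeaned data with a cluster-robust (sandwich) covariance."""
    cols = [y] + list(xs)
    # Subtract entity means instead of building one dummy column per entity.
    demeaned = df[cols] - df.groupby(entity)[cols].transform("mean")
    X = demeaned[list(xs)].to_numpy(dtype=float)
    Y = demeaned[y].to_numpy(dtype=float)

    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ Y
    resid = Y - X @ beta

    # Sum the score contributions x_i * e_i within each cluster.
    scores = pd.DataFrame(X * resid[:, None])
    cluster_sums = scores.groupby(df[cluster].to_numpy()).sum().to_numpy()
    n_clusters = cluster_sums.shape[0]
    meat = cluster_sums.T @ cluster_sums

    # Assumed correction factor; packages differ in the exact adjustment.
    cov = bread @ meat @ bread * n_clusters / (n_clusters - 1)
    return pd.DataFrame({"beta": beta, "se": np.sqrt(np.diag(cov))}, index=list(xs))


# Simulated data with a true effect of 5 pp and large candidate-level intercepts.
rng = np.random.default_rng(0)
cand = np.repeat(np.arange(200), 6)
toy_panel = pd.DataFrame({
    "candidate_id": cand,
    "race_id": cand // 4,
    "sponsored_by": rng.binomial(1, 0.3, size=cand.size),
})
alpha = rng.normal(0, 10, size=200)[cand]
toy_panel["error_pp"] = alpha + 5.0 * toy_panel["sponsored_by"] + rng.normal(0, 1, size=cand.size)

fit = within_ols_clustered(toy_panel, "error_pp", ["sponsored_by"], "candidate_id", "race_id")
print(fit)
```

The design matrix here has one column per regressor regardless of how many candidates there are, which is the memory saving relative to C(entity) dummies.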

Spec	β (sponsored_by)	SE	p	β (opp_sponsored)	SE	p
naive (no FE)	+7.57	0.88	<0.001	-1.67	0.57	0.004
Spec 1 (pollster + candidate FE)	+7.60	1.34	<0.001	-2.12	0.88	0.015
Spec 2 (+ methodology controls)	+7.75	1.34	<0.001	-1.93	0.89	0.030
Spec 2 WLS (weighted by N)	+7.83	1.37	<0.001	-1.56	0.77	0.044
Spec 3a (clean comparator + cand FE)	+6.32	1.46	<0.001	—
Spec 3b (clean + race × month FE)	+7.77	1.65	<0.001	—
*Spec 3c (clean + race × week FE, strict)*	+6.95	2.57	0.008	—

Symmetry test (Spec 2): β_self − β_opp ≈ +9.7 — sender-specific bias, not generic pollster house effect.
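
The arithmetic behind the symmetry gap can be checked directly from the Spec 2 row. A formal test of β_self = β_opp needs the covariance between the two estimates, which the table does not report, so the sketch below only computes a conservative bound: the SE of a difference can never exceed the sum of the two SEs.

```python
# Spec 2 estimates, copied from the table above.
beta_self, se_self = 7.75, 1.34
beta_opp, se_opp = -1.93, 0.89

gap = beta_self - beta_opp

# Upper bound on SE(beta_self - beta_opp): holds for any covariance between the two.
se_upper = se_self + se_opp

# Resulting lower bound on the t-statistic for H0: beta_self == beta_opp.
t_lower = gap / se_upper

print(f"gap = {gap:.2f} pp, SE <= {se_upper:.2f}, t >= {t_lower:.2f}")
```

Even under this worst-case bound the gap is more than four standard errors from zero; the exact statistic would come from a Wald test on the fitted model.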

What changed from the SP-slice prototype

β is more stable. SP showed naive +2.17 → FE-adjusted +7.24 (a ~3.3× jump, from negative selection on standing). All-Brazil shows naive +7.57 → FE-adjusted +7.75 (Spec 2), so the FE adds almost nothing. Negative selection into self-sponsoring is much weaker at all-Brazil scale than in SP alone.
Spec 3c is now identified off real samples. SP had 3 (race × week) cells; all-Brazil has 60 cells / 409 rows. Spec 3c is the strictest design, and its β = +6.95 (p = 0.008) is the headline number to lead with: every comparison is between polls fielded in the same race within the same week, with the comparator class restricted to independent media / pollster-self.
Opponent-sponsored coefficient is now significant. β_opp ranges from −1.56 to −2.12 across specs, all at p < 0.05: opponent-sponsored polls do understate the candidate, by ~2 pp. Small effect, but real.
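
With race × week FE, only cells that contain both a self-sponsored row and an independent-poll row contribute to β. A sketch of how such identifying cells could be counted, assuming hypothetical `race_id` and `field_date` columns and calendar weeks (the actual week definition in regressions.py may differ):

```python
import pandas as pd


def identifying_cells(df: pd.DataFrame) -> tuple[int, int]:
    """Return (number of race x week cells with variation in sponsored_by, rows in those cells)."""
    # Clean-comparator restriction: self-sponsored rows plus independent-poll rows.
    clean = df[(df["sponsored_by"] == 1) | (df["poll_is_independent"] == 1)]
    clean = clean.assign(week=clean["field_date"].dt.to_period("W"))

    # A cell identifies beta only if sponsored_by takes both values inside it.
    varies = clean.groupby(["race_id", "week"])["sponsored_by"].transform("nunique") > 1
    kept = clean[varies]
    n_cells = kept.groupby(["race_id", "week"]).ngroups
    return n_cells, len(kept)


# Toy input: only race r1 in the week of 2024-09-02 mixes self and independent rows.
toy_cells = pd.DataFrame({
    "race_id": ["r1", "r1", "r1", "r2", "r2"],
    "field_date": pd.to_datetime(
        ["2024-09-02", "2024-09-04", "2024-09-10", "2024-09-03", "2024-09-05"]
    ),
    "sponsored_by": [1, 0, 0, 1, 1],
    "poll_is_independent": [0, 1, 1, 0, 0],
})
n_cells, n_rows = identifying_cells(toy_cells)
print(n_cells, n_rows)
```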

Pre-poll trajectory placebo

The cleanest counter to the "self-sponsor when leading" story. For each candidate with a self-sponsored poll, look at their poll number in the most recent INDEPENDENT poll fielded before the self-sponsored one in the same race.

Metric	All-Brazil (n=132)	SP-only (n=7)
Median time gap (days)	10	4
Mean error in self-sponsored polls	+7.64	-1.28
Mean error in preceding independent polls	+0.93	-12.55
Mean within-candidate jump (self − pre-indep)	+6.70	+11.27
Share of jumps > 0	74%	86%
One-sample t (H0: jump=0)	5.21	—

Restricted to gap ≤ 14 days (n=77): jump = +6.43, t = 3.96, 77% positive — same magnitude, can't blame "campaign crystallized over weeks".
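
The placebo construction described above can be sketched with `pandas.merge_asof`, which pairs each self-sponsored row with the most recent earlier independent row for the same candidate and race. Column names (`race_id`, `candidate_id`, `field_date`, `error_pp`) are assumptions, and this version pairs every self-sponsored row rather than one per candidate, which may differ from the actual script.

```python
import numpy as np
import pandas as pd


def pre_poll_jumps(polls: pd.DataFrame, max_gap_days=None) -> pd.DataFrame:
    """Pair each self-sponsored row with the latest preceding independent row."""
    polls = polls.sort_values("field_date")  # merge_asof requires sorted keys
    self_rows = polls[polls["sponsored_by"] == 1]
    indep_rows = polls.loc[
        polls["poll_is_independent"] == 1,
        ["race_id", "candidate_id", "field_date", "error_pp"],
    ].rename(columns={"field_date": "pre_date", "error_pp": "pre_error_pp"})

    paired = pd.merge_asof(
        self_rows, indep_rows,
        left_on="field_date", right_on="pre_date",
        by=["race_id", "candidate_id"],
        direction="backward",
        allow_exact_matches=False,  # the independent poll must be strictly earlier
    ).dropna(subset=["pre_error_pp"])

    paired["gap_days"] = (paired["field_date"] - paired["pre_date"]).dt.days
    paired["jump_pp"] = paired["error_pp"] - paired["pre_error_pp"]
    if max_gap_days is not None:
        paired = paired[paired["gap_days"] <= max_gap_days]
    return paired


def one_sample_t(x) -> float:
    """t-statistic for H0: mean(x) == 0."""
    x = np.asarray(x, dtype=float)
    return x.mean() / (x.std(ddof=1) / np.sqrt(x.size))


# Toy input: c1 and c2 have a preceding independent poll, c3 does not.
toy_polls = pd.DataFrame({
    "race_id": ["r1"] * 6,
    "candidate_id": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "field_date": pd.to_datetime(
        ["2024-09-01", "2024-09-10", "2024-09-15", "2024-09-01", "2024-09-20", "2024-09-12"]
    ),
    "sponsored_by": [0, 0, 1, 0, 1, 1],
    "poll_is_independent": [1, 1, 0, 1, 0, 0],
    "error_pp": [1.0, 2.0, 8.0, -1.0, 4.0, 3.0],
})
jumps = pre_poll_jumps(toy_polls)
jumps_14d = pre_poll_jumps(toy_polls, max_gap_days=14)
t_stat = one_sample_t(jumps["jump_pp"])
print(jumps[["candidate_id", "gap_days", "jump_pp"]])
```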

The all-Brazil placebo is much more informative than the SP-only version:

The independent-poll baseline is now centered. In the SP-only sample, independent polls undershot by 12.5 pp on average, which left open the alternative interpretation that independent polls are structurally biased low. All-Brazil shows independent polls at +0.93 pp error (essentially unbiased). The within-candidate jump of +6.70 is then a clean estimate of how much self-sponsored polls deviate from unbiased polls of the SAME candidate.
Magnitudes match the regression coefficient. The placebo's +6.70 pp jump lines up with spec 3a's +6.32 pp and spec 3c's +6.95 pp coefficients — three different identification strategies (within-candidate FE, race × week FE, descriptive within-candidate jump) converging on the same number.
Statistical confidence. t = 5.21 on the within-candidate trajectory jump. The SP-slice version was qualitatively suggestive; the all-Brazil version is quantitatively decisive.

Headline interpretation

The within-candidate jump from the preceding independent poll is +6.7 pp (t = 5.2, n=132), median gap 10 days. The race × week FE spec gives +6.95 pp (p = 0.008, 60 cells). The within-candidate panel FE spec gives +7.8 pp (p < 0.001). Three different specs, same answer to within about 1 pp.

The "candidate commissions when leading" alternative cannot mechanically generate this. A 6–7 pp jump over a median gap of 10 days (+6.43 pp when the gap is capped at 14 days) exceeds plausible genuine momentum, and the comparator is restricted to independent polls of the SAME candidate in the SAME race within the SAME window.

The remaining (and now sharper) open question is slant via design vs. slant via fabrication — Channel A vs Channel B from summary.md. The Spec 3 LLM methodology decomposition needs the poll_methodology extractor that's queued in politica's docs/todo.md.

Caveats

Design vs fabrication: this analysis can't separate Channel A (Bayesian persuasion via methodology choices declared in the registration) from Channel B (residual / fabrication). All current specs control only for structured methodology (sample size, days, ST_PESQUISA_PROPRIA). The free-text plano amostral / dado município fields need LLM extraction first.
Match quality: 12% of non-aggregate candidate names in estimulado scenarios still don't match a politico (down from 41% before the nome_urna patch). Mostly aggregate-row variants my regex misses ("Anularia o voto", "Não votará") + hypothetical names ("Candidato A/B/C"). The headline numbers are robust to relaxing match_score ≥ 2 to ≥ 1 (single-token matches) — checked.
Spec 3c is identified off 60 (race × week) cells: a real improvement over SP's 3, but a small share of total variation. The +6.95 there is a meaningful but not overwhelmingly tight number.
Routes A and C are underweighted: CPF route (Route A) matches only 18 sponsor rows; party-CNPJ route (Route C) only 38. Most signal comes from Route B (committee name) + Route D (party name parse). If the despesa_partidaria coverage of party CNPJs is incomplete, the true sponsor set is a superset of what we see.

Reproduce

# Already done — outputs at pipelines/politica/build/...
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
  python pipelines/politica/source/clean/poll_response_2024.py
BASE_DIR=$PWD DATA_DIR=/workspace/data PYTHONPATH=/workspace/packages/llmkit \
  python pipelines/politica/source/clean/poll_sponsor.py

# In /workspace/projects/poll-sponsor-bias:
python source/assemble/cand_poll.py
python source/analysis/regressions.py
