Decisions

2026-06-02 — Promote from idea to project

Decision: Move research/ideas/poll-sponsor-bias/ to projects/poll-sponsor-bias/ with the canonical project structure.

Reason: The design's empirical core is settled. Three independent specs (within-candidate FE, race × week FE, descriptive within-candidate jump) converge on β ≈ +6 to +7 pp, and the pre-poll trajectory placebo (n=132, t = 5.21) decisively rules out the "candidate commissions when leading" alternative. The next 2–3 months of work (Channel A vs B decomposition, theory framing, write-up) will benefit from a project structure (paper/, decisions.md, formal todo).

Alternatives considered:

Keep as idea: rejected — the work has moved past exploration.
Pre-register the design: deferred to the next iteration after the Channel A vs B decomposition lands.

Implications: The project's first todo is the poll_methodology LLM extractor (queued in pipelines/politica/docs/todo.md); Channel A vs B is the project's main unfinished empirical lever.

2026-06-02 — Outcome variable: poll percent renormalized within scenario

Decision: Compute error = poll_percent_normalized - 100 * final_share where poll_percent_normalized = 100 * percent / sum(percent over non-aggregate candidates in scenario), and final_share = candidate_votes / sum(candidate_votes within muni).

Reason: The final-share denominator is valid candidate votes (excludes Branco/Nulo). The poll percent's natural denominator includes "Don't know / Branco / Nulo" responses, which would mechanically bias error negative if not renormalized. Renormalization brings both quantities to a common denominator.
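Illustrative sketch of the renormalization above, not the pipeline code. The row layout (`scenario_id`, `candidate`, `percent`, `is_aggregate`), the function name, and the numbers are assumptions made up for the example; only the formula comes from the Decision.

```python
from collections import defaultdict


def renormalized_errors(poll_rows, votes_by_candidate):
    """Return {(scenario_id, candidate): error in pp} for non-aggregate rows.

    poll_rows: list of dicts with hypothetical keys
        scenario_id, candidate, percent, is_aggregate.
    votes_by_candidate: dict candidate -> valid votes within the muni.
    """
    # Poll-side denominator: percents of named candidates only, per scenario.
    scenario_total = defaultdict(float)
    for row in poll_rows:
        if not row["is_aggregate"]:
            scenario_total[row["scenario_id"]] += row["percent"]

    # Result-side denominator: valid candidate votes within the muni.
    total_votes = sum(votes_by_candidate.values())

    errors = {}
    for row in poll_rows:
        if row["is_aggregate"]:
            continue  # "Don't know / Branco / Nulo" rows have no final share
        normalized = 100 * row["percent"] / scenario_total[row["scenario_id"]]
        final_share = votes_by_candidate[row["candidate"]] / total_votes
        key = (row["scenario_id"], row["candidate"])
        errors[key] = normalized - 100 * final_share
    return errors


# Made-up scenario: two candidates plus an aggregate row.
poll_rows = [
    {"scenario_id": "s1", "candidate": "A", "percent": 40.0, "is_aggregate": False},
    {"scenario_id": "s1", "candidate": "B", "percent": 30.0, "is_aggregate": False},
    {"scenario_id": "s1", "candidate": "NS/Branco/Nulo", "percent": 30.0, "is_aggregate": True},
]
votes_by_candidate = {"A": 5000, "B": 5000}
errors = renormalized_errors(poll_rows, votes_by_candidate)
```

In this toy case the raw differences would be 40 - 50 = -10 and 30 - 50 = -20, both negative. After renormalization the errors are about +7.14 and -7.14 and sum to zero within the scenario.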

Alternatives considered:

Use the votos_validos scenario directly: rejected — only ~4% of scenarios are votos_validos; estimulado is the workhorse and what the public press follows.
Use votes / turnout (raw share of voters): rejected — pollsters don't aim to predict that quantity, and it would inflate the outcome variance without adding identification.

Implications: Error scale interpretation: error = +7 pp means the poll overstates the candidate's valid-vote share by 7 percentage points. This matches the magnitude of the headline finding across all specs.

2026-06-02 — Clean comparator: media + pollster-self sponsored only

Decision: For the timing-controlled specs (3a/3b/3c), restrict the regression sample to candidate-poll rows where the poll is either (a) sponsored by this candidate (treatment) or (b) sponsored exclusively by independent media or pollster-self (control). Drop opponent-sponsored rows and rows in mixed/other-firm polls.

Reason: The "candidate commissions when leading" concern can only be addressed by comparing against polls that are themselves independent of the candidate's signal. Mixed-sponsor polls and polls sponsored by other-candidate committees are biased in unknown directions and would contaminate the comparator.
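Illustrative sketch of the clean-comparator filter, not the pipeline code. The field `poll_sponsor_types` and its labels (`media`, `pollster_self`, `candidate_committee`, `other_firm`) are hypothetical; `sponsored_by` is the flag used elsewhere in this log. The sketch assumes any row with sponsored_by=1 counts as treatment.

```python
# Sponsor types treated as independent of any candidate (assumed labels).
CONTROL_SPONSORS = {"media", "pollster_self"}


def in_clean_sample(row):
    """Keep treatment rows and rows from exclusively independent polls."""
    if row["sponsored_by"] == 1:
        return True  # treatment: poll sponsored by this candidate
    types = row["poll_sponsor_types"]
    # Control: every sponsor of the poll is media or the pollster itself.
    # Opponent-sponsored and mixed/other-firm polls fail this subset test.
    return bool(types) and types <= CONTROL_SPONSORS


# Made-up candidate-poll rows.
rows = [
    {"sponsored_by": 1, "poll_sponsor_types": {"candidate_committee"}},  # treatment
    {"sponsored_by": 0, "poll_sponsor_types": {"media"}},                # control
    {"sponsored_by": 0, "poll_sponsor_types": {"pollster_self"}},        # control
    {"sponsored_by": 0, "poll_sponsor_types": {"candidate_committee"}},  # opponent
    {"sponsored_by": 0, "poll_sponsor_types": {"media", "other_firm"}},  # mixed
]
clean_rows = [row for row in rows if in_clean_sample(row)]
```

The subset test (`types <= CONTROL_SPONSORS`) is what enforces "exclusively": a poll with one media sponsor and one other-firm sponsor is dropped rather than counted as control.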

Alternatives considered:

Keep the full sample with opponent_sponsored as a control: retained as the main spec ladder (Specs 1, 2) but flagged as second-best because the opponent-sponsored direction enters with the opposite sign; it is a sign test, not a clean baseline.
Restrict comparator to media-sponsored only (drop pollster-self): rejected — pollster-self polls are the firms' marketing initiatives, behaviorally independent of any candidate, and account for ~26% of polls. Keeping them roughly doubles the control sample.

Implications: The clean-comparator spec drops 9,102 rows (opponent-sponsored or mixed/other), reducing the sample from 30,555 to 21,453 but identifying β on the cleaner comparison. Headline β estimates are stable (~+6 to +8) across the choice.

2026-06-02 — Routes A+B+C+D for sponsor → candidate

Decision: Match sponsor CPF / CNPJ to candidate via four routes:

A: sponsor CPF == candidate CPF in same muni (~18 matches)
B: committee CNPJ name parsed for ELEIÇÃO 2024 {NAME} PREFEITO (~364 matches)
C: party CNPJ via despesa_partidaria → party's PREFEITO in muni (~38 matches)
D: sponsor name contains party identifier → party's PREFEITO in muni (~148 matches)

Reason: The committee-CNPJ name parse (Route B) catches the bulk of obvious cases. Routes C+D extend coverage to party-directorate sponsorship, which can be attributed to a single candidate because of the 1:1 electoral-law constraint (each party fields at most one mayoral candidate per muni).
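Illustrative sketch of the Route B name parse, not the pipeline code. Only the template ELEIÇÃO 2024 {NAME} PREFEITO comes from the route list; the regex, the tolerance for unaccented spellings, the function name, and the sample names are assumptions.

```python
import re

# Assumed pattern for committee CNPJ names; accepts accented and
# unaccented spellings of "ELEIÇÃO".
COMMITTEE_PATTERN = re.compile(
    r"^ELEI[CÇ][AÃ]O\s+2024\s+(?P<name>.+?)\s+PREFEITO$"
)


def parse_committee_name(cnpj_name):
    """Return the candidate name embedded in a committee CNPJ name, or None."""
    match = COMMITTEE_PATTERN.match(cnpj_name.strip().upper())
    return match.group("name") if match else None


# Made-up CNPJ names.
parsed_accented = parse_committee_name("ELEIÇÃO 2024 MARIA DA SILVA PREFEITO")
parsed_plain = parse_committee_name("Eleicao 2024 Joao Souza Prefeito")
parsed_other = parse_committee_name("PADARIA CENTRAL LTDA")
```

The parsed name would still need to be matched against the mayoral candidates of the same muni; that step is not shown here.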

Alternatives considered:

Drop Routes A and C (low marginal contribution): rejected — they catch substantively different political cases (Route A is literally the candidate paying with their own CPF; Route C is party-level rather than candidate-committee-level).
Add a CNPJ-name fuzzy match for non-committee CNPJs: queued in docs/todo.md as the sponsor-classifier LLM refinement.

Implications: 568 candidate-poll rows with sponsored_by=1 across 793 polls (vs the SP-only 22 with just A+B). The all-Brazil sample is large enough to identify the clean-comparator + race × week FE spec on 60 cells / 409 rows.

2026-06-02 — `match_score ≥ 2` cutoff for the regression sample

Decision: Restrict the regression sample to candidate-poll rows with match_score ≥ 2 (multi-token or stronger match between poll candidate name and the TSE registry).

Reason: Score-1 matches are single-token (e.g., "Luiz" matching "Luiz Roberto" by first name only), which mixes legitimate matches with false positives. Score-2 (multi-token) and above are reliable. The nome_urna patch (score 4) gives most matches; the score-3 (substring) and score-2 (multi-token) tails are also clean.
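Illustrative sketch of the score tiers and the cutoff, not the pipeline code. This log only names the tiers (1 single-token, 2 multi-token, 3 substring, 4 nome_urna); the exact rules below, the function signature, and the sample names are an assumed reading.

```python
def match_score(poll_name, registry_name, nome_urna):
    """Assumed tiering: 4 nome_urna, 3 substring, 2 multi-token, 1 single-token."""
    poll = poll_name.strip().upper()
    full = registry_name.strip().upper()
    if poll == nome_urna.strip().upper():
        return 4  # exact match on the ballot name
    tokens = poll.split()
    # Substring tier is limited to multi-token poll names here, so a bare
    # first name such as "LUIZ" cannot reach score 3.
    if len(tokens) >= 2 and poll in full:
        return 3
    shared = set(tokens) & set(full.split())
    if len(shared) >= 2:
        return 2  # several tokens in common, not contiguous
    if len(shared) == 1:
        return 1  # single-token match, e.g. first name only
    return 0


# Made-up registry entry and poll names.
registry, urna = "Luiz Roberto Silva", "Beto Silva"
poll_names = ["Beto Silva", "Roberto Silva", "Luiz Silva", "Luiz", "Ana Costa"]
scores = {name: match_score(name, registry, urna) for name in poll_names}

# Regression-sample cutoff: keep match_score >= 2.
kept = [name for name, score in scores.items() if score >= 2]
```

Under these assumed rules the bare first name "Luiz" gets score 1 and is excluded by the cutoff, while the three multi-token variants are kept.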

Alternatives considered:

Use score ≥ 1: relaxed-version sensitivity — headline β stable.
Use score ≥ 3: too restrictive — drops legitimate multi-token matches.

Implications: Drops 1,419 candidate-poll rows in the SP-style filter (most are mis-matches that would have introduced noise).
