Decisions
2026-06-02 — Promote from idea to project
Decision: Move research/ideas/poll-sponsor-bias/ to
projects/poll-sponsor-bias/ with the canonical project structure.
Reason: The design's empirical core is settled: three independent specs (within-candidate FE, race × week FE, descriptive within-candidate jump) converge on β ≈ +6 to +7 pp, and the pre-poll trajectory placebo (n = 132, t = 5.21) decisively rules out the "candidate commissions when leading" alternative. The next 2–3 months of work (Channel A vs B decomposition, theory framing, write-up) will benefit from a project structure (paper/, decisions.md, formal todo).
Alternatives considered:
- Keep as idea: rejected — the work has moved past exploration.
- Pre-register the design: deferred to the next iteration after the Channel A vs B decomposition lands.
Implications: The project's first todo is the
poll_methodology LLM extractor (queued in
pipelines/politica/docs/todo.md); the Channel A vs B decomposition is
the project's main unfinished empirical lever.
2026-06-02 — Outcome variable: poll percent renormalized within scenario
Decision: Compute error = poll_percent_normalized - 100 * final_share where poll_percent_normalized = 100 * percent / sum(percent over non-aggregate candidates in scenario), and final_share = candidate_votes / sum(candidate_votes within muni).
Reason: Final-share denominator is valid candidate votes
(excludes Branco/Nulo). Poll percent's natural denominator includes
"Don't know / Branco / Nulo" — which would mechanically bias error
negative if not renormalized. Renormalization brings both to a common
denominator.
Alternatives considered:
- Use the votos_validos scenario directly: rejected, since only ~4% of scenarios are votos_validos; estimulado is the workhorse and what the public press follows.
- Use votes / turnout (raw share of voters): rejected, since pollsters don't aim to predict that quantity, and it would inflate the outcome variance without adding identification.
Implications: Error scale interpretation: error = +7 pp means
the poll overstates the candidate's valid-vote share by 7 percentage
points. Matches the headline finding's magnitude across all specs.
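The renormalization can be illustrated with a minimal sketch. The row layout (candidate, percent, is_aggregate), the function name poll_errors, and the toy numbers are assumptions made for illustration; they are not taken from the project pipeline.

```python
def poll_errors(scenario_rows, final_votes):
    """Compute error = poll_percent_normalized - 100 * final_share.

    scenario_rows: list of (candidate, percent, is_aggregate) for one scenario.
    final_votes:   dict candidate -> valid votes in the municipality.
    Both layouts are hypothetical.
    """
    # Denominator: poll percent summed over non-aggregate candidates only.
    named_total = sum(pct for _, pct, is_agg in scenario_rows if not is_agg)
    vote_total = sum(final_votes.values())
    errors = {}
    for cand, pct, is_agg in scenario_rows:
        if is_agg or cand not in final_votes:
            continue
        normalized = 100 * pct / named_total
        final_share = final_votes[cand] / vote_total
        errors[cand] = normalized - 100 * final_share
    return errors


# Toy scenario: 30% of respondents fall into aggregate categories.
rows = [
    ("A", 40.0, False),
    ("B", 30.0, False),
    ("Branco/Nulo", 10.0, True),
    ("Don't know", 20.0, True),
]
votes = {"A": 5000, "B": 5000}
errors = poll_errors(rows, votes)
```

In this toy case the raw differences would be 40 - 50 = -10 and 30 - 50 = -20, both negative, which is the mechanical bias described above. After renormalization the two errors sum to zero because both sides share a common denominator.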
2026-06-02 — Clean comparator: media + pollster-self sponsored only
Decision: For the timing-controlled specs (3a/3b/3c), restrict the regression sample to candidate-poll rows where the poll is either (a) sponsored by this candidate (treatment) or (b) sponsored exclusively by independent media or pollster-self (control). Drop opponent-sponsored rows and rows in mixed/other-firm polls.
Reason: The "candidate commissions when leading" concern can only be addressed by comparing against polls that are themselves independent of the candidate's signal. Mixed-sponsor polls and polls sponsored by other-candidate committees are biased in unknown directions and would contaminate the comparator.
Alternatives considered:
- Keep the full sample with opponent_sponsored as a control: retained as the main spec ladder (Specs 1, 2) but flagged as second-best because the opponent-sponsored direction enters with the opposite sign; it's a sign-test, not a clean baseline.
- Restrict comparator to media-sponsored only (drop pollster-self): rejected, since pollster-self polls are the firms' marketing initiatives, behaviorally independent of any candidate, and account for ~26% of polls. Keeping them roughly doubles the control sample.
Implications: The clean-comparator spec drops 9,102 rows (opponent-sponsored or mixed/other), reducing the sample from 30,555 to 21,453 but identifying β on the cleaner comparison. Headline β estimates are stable (~+6 to +8) across the choice.
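A minimal sketch of the clean-comparator filter follows. The field sponsor_class and its labels ("candidate", "media", "pollster_self", "mixed_other") are hypothetical stand-ins for whatever the sponsor classifier actually emits; only sponsored_by appears in this log.

```python
from collections import Counter

# Hypothetical poll-level sponsor labels that count as independent controls.
CONTROL_CLASSES = {"media", "pollster_self"}


def comparator_role(row):
    """Return 'treatment', 'control', or 'drop' for one candidate-poll row."""
    if row["sponsored_by"] == 1:
        return "treatment"  # poll sponsored by this candidate
    if row["sponsor_class"] in CONTROL_CLASSES:
        return "control"    # independent media or pollster-self
    return "drop"           # opponent-sponsored or mixed/other-firm


rows = [
    {"candidate": "A", "sponsored_by": 1, "sponsor_class": "candidate"},
    {"candidate": "B", "sponsored_by": 0, "sponsor_class": "candidate"},
    {"candidate": "A", "sponsored_by": 0, "sponsor_class": "media"},
    {"candidate": "B", "sponsored_by": 0, "sponsor_class": "pollster_self"},
    {"candidate": "A", "sponsored_by": 0, "sponsor_class": "mixed_other"},
]
roles = Counter(comparator_role(r) for r in rows)
clean_sample = [r for r in rows if comparator_role(r) != "drop"]

# Row accounting from the Implications line: 30,555 - 9,102 = 21,453.
assert 30_555 - 9_102 == 21_453
```

Candidate B's row in A's own poll is the opponent-sponsored case, so it is dropped rather than used as a control.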
2026-06-02 — Routes A+B+C+D for sponsor → candidate
Decision: Match sponsor CPF / CNPJ to candidate via four routes:
- A: sponsor CPF == candidate CPF in same muni (~18 matches)
- B: committee CNPJ name parsed for ELEIÇÃO 2024 {NAME} PREFEITO (~364 matches)
- C: party CNPJ via despesa_partidaria → party's PREFEITO in muni (~38 matches)
- D: sponsor name contains party identifier → party's PREFEITO in muni (~148 matches)
Reason: The committee-CNPJ name parse (Route B) catches the bulk of obvious cases. Routes C+D extend coverage to party-directorate sponsorship, which is meaningful for the 1:1 electoral-law constraint (each party fields at most one mayoral candidate per muni).
Alternatives considered:
- Drop Routes A and C (low marginal contribution): rejected — they catch politically substantively different cases (Route A is literally the candidate paying with their own CPF; Route C is party-level rather than candidate-committee-level).
- Add a CNPJ-name fuzzy match for non-committee CNPJs: queued in docs/todo.md as the sponsor-classifier LLM refinement.
Implications: 568 candidate-poll rows with sponsored_by=1
across 793 polls (vs the SP-only 22 with just A+B). The all-Brazil
sample is large enough to identify the clean-comparator + race × week
FE spec on 60 cells / 409 rows.
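The Route B name parse might look roughly like the sketch below. The regular expression, the function name parse_committee_name, and the example committee names are assumptions; the actual parser may handle more variants. The route tally reuses the approximate counts listed above.

```python
import re

# Assumed pattern for committee names of the form
# "ELEIÇÃO 2024 {NAME} PREFEITO"; accents on ELEIÇÃO are treated as optional.
COMMITTEE_RE = re.compile(r"ELEI[ÇC][ÃA]O\s+2024\s+(?P<name>.+?)\s+PREFEITO\b")


def parse_committee_name(cnpj_name):
    """Return the candidate name embedded in a committee CNPJ name, or None."""
    match = COMMITTEE_RE.search(cnpj_name.upper())
    return match.group("name").strip() if match else None


# Hypothetical inputs, not real registry entries.
parsed = parse_committee_name("Eleição 2024 Maria da Silva Prefeito")
unparsed = parse_committee_name("PARTIDO EXEMPLO DIRETORIO MUNICIPAL")

# Approximate per-route match counts from the Decision list above.
route_counts = {"A": 18, "B": 364, "C": 38, "D": 148}
total_sponsored_rows = sum(route_counts.values())
```

The four route counts sum to 568, which agrees with the sponsored_by=1 row count in the Implications.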
2026-06-02 — match_score ≥ 2 cutoff for the regression sample
Decision: Restrict the regression sample to candidate-poll rows
with match_score ≥ 2 (multi-token or stronger match between poll
candidate name and the TSE registry).
Reason: Score-1 matches are single-token (e.g., "Luiz" matching
"Luiz Roberto" by first name only), which mixes legitimate matches
with false positives. Score-2 (multi-token) and above are reliable.
The nome_urna patch (score 4) gives most matches; the score-3
(substring) and score-2 (multi-token) tails are also clean.
Alternatives considered:
- Use score ≥ 1: relaxed-version sensitivity; headline β stable.
- Use score ≥ 3: too restrictive; drops legitimate multi-token matches.
Implications: Drops 1,419 candidate-poll rows in the SP-style filter (most are mis-matches that would have introduced noise).
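To make the score tiers concrete, here is one possible scoring sketch. The tier definitions follow the description above (4 = nome_urna, 3 = substring, 2 = multi-token, 1 = single-token), but the exact rules, the precedence between tiers, and the accent-stripping step are assumptions; the real matcher may differ.

```python
import unicodedata


def _tokens(name):
    """Uppercase, strip accents, and split a name into tokens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return ascii_name.upper().split()


def match_score(poll_name, registry_name, nome_urna):
    """Hypothetical 0-4 score for a poll name against one registry entry."""
    poll = _tokens(poll_name)
    if not poll:
        return 0
    if poll == _tokens(nome_urna):
        return 4  # exact ballot-name match
    registry = _tokens(registry_name)
    # Pad with spaces so the substring test respects token boundaries.
    if len(poll) >= 2 and f" {' '.join(poll)} " in f" {' '.join(registry)} ":
        return 3  # contiguous multi-token substring of the full name
    shared = len(set(poll) & set(registry))
    if shared >= 2:
        return 2  # multi-token overlap, any order
    return 1 if shared == 1 else 0  # single-token match is unreliable


# Hypothetical candidate; the cutoff keeps rows with score >= 2.
scores = [
    match_score("Luiz", "Luiz Roberto da Silva", "Dr. Luiz Roberto"),
    match_score("Roberto Luiz", "Luiz Roberto da Silva", "Dr. Luiz Roberto"),
    match_score("Luiz Roberto", "Luiz Roberto da Silva", "Dr. Luiz Roberto"),
    match_score("Dr. Luiz Roberto", "Luiz Roberto da Silva", "Dr. Luiz Roberto"),
]
kept = [s for s in scores if s >= 2]
```

Under these assumed rules the first-name-only match scores 1 and is the only one excluded by the cutoff.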