title: Poll sponsor bias
status: project
tags: [elections, polls, measurement]
related_project: projects/electoral-justice
last_updated: 2026-06-02
Poll sponsor bias
Question. Do registered Brazilian electoral polls systematically overstate the candidate or party that commissioned them? I.e., is there measurable "sponsor bias" / client-specific house effects in the poll numbers?
Why 2024 Brazil is a clean setting
- TSE registers every poll before it can be published (PesqEle / divulgação regime). So the sample includes slanted polls that were never publicized — killing most of the publication/selection bias that limits sponsor-bias studies elsewhere, where only released polls are observed.
- Large N: ~14.9k registered mayoral polls in 2024, ~11.4k with relatório PDFs (scraped and backed up to bi-dropbox:data/TSE/2024/pesquisa_eleitoral/), thousands of races, and a real mix of sponsor types.
- The "encomendada" (commissioned-to-flatter) poll is a recognized phenomenon in Brazilian politics, so there is a substantive prior that the effect is nonzero.
Data
Who ordered / paid for the poll: available as structured CSVs, no PDF extraction needed (verified 2026-06-01 on the downloaded 2024 files at research/data/tse/pesquisa_{contratante,pagante}_2024.zip). TSE dadosabertos distributes, alongside the main Pesquisas eleitorais registration file, two companion per-UF files keyed by NR_PROTOCOLO_REGISTRO:
- pesquisa_contratante: who commissioned the poll. 15,102 rows / 14,887 unique protocols, matching 100% of the 14,887 registry protocols (14,876 mayoral). Up to 6 contratantes per poll.
- pesquisa_pagante: who paid. 14,560 rows / 14,335 protocols (96%).
Fields that make the design work:
- NR_CPF_CNPJ_CONTRATANTE / NR_CPF_CNPJ_PAGANTE: CPF (individual, 11 digits) or CNPJ (entity, 14 digits). A CPF is joinable to the candidate registry (candidato.csv) to identify which candidate sponsored the poll; a CNPJ maps to party/committee entities. This is what enables the within-candidate design (not just self-vs-other).
- DS_ORIGEM_RECURSO: funding source, directly flagging political money: Fundo Partidário (537), Doações Eleitorais (156), Recursos Próprios (2,848), Outros (8,806), #NULO# (2,734), plus combos.
- ST_CONTRATANTE_PAGANTE (S/N): separates commissioner from payer.
- VR_PAGO_CONTRATANTE: amount paid (weighting; price–slant check).
A sponsor-type classification step is still needed (CPF→candidate / CNPJ→party-or-committee vs media outlet vs the pollster itself); only the candidate/party/coligação subset is the treatment.
Vote intentions (the outcome) come from the relatório PDFs already scraped and extracted (build/llm/poll_relatorio.parquet). Note the CSV sponsor data covers all ~14.9k registered polls, including the ~3,400 with no relatório PDF, so sponsor identity is available for a larger set than the vote-intention outcomes.

Benchmark: actual election results (TSE results data, already available).
Pipeline gap (small): the current clean step (projects/electoral-justice/source/clean/poll_response_2024.py) keeps institute and value_brl but does not yet join the contratantes/pagantes CSVs. Adding that join (by protocol) + a sponsor-type classifier is the only build work needed.
Identification
Naive "regress poll error on self-sponsored" is confounded by (a) the candidate's true standing and (b) generic pollster house effects. The within-candidate design addresses both:
error_{c,p} = (poll share of candidate c in poll p) − (c's final vote share)
error_{c,p} = β · SponsoredBy_c_{c,p} + λ_pollster + μ_(c×race) + f(days_to_election) + ε
- μ_(c×race) (candidate-in-race fixed effects) compares the same candidate across polls with different sponsors, differencing out true standing. β is the sponsor effect: do polls this candidate paid for overstate this candidate?
- λ_pollster separates a generically rosy firm from a firm rosy specifically for its client. (Fails where a pollster only ever does sponsored polls; needs firms that do both sponsored and media/institutional polls, which is true of the major firms.)
- Symmetric test: do opponent-sponsored polls understate c?
Threats to pre-empt. (i) Genuine late campaign movement contaminates "error vs final" — restrict to polls in the last ~2–3 weeks or control the time gap. (ii) Strategic timing of when a candidate commissions a poll — within-candidate FE + days-to-election handle most of it. (iii) Pollster–sponsor collinearity (see above).
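As a minimal sketch of what the candidate-in-race fixed effect does, the snippet below demeans both the error and the sponsor dummy inside each (candidate, race) cell and regresses one on the other. It is deliberately one-way only: the pollster fixed effect and the days-to-election control are omitted for brevity, and the function name, field names and toy rows are all hypothetical, not part of the pipeline.

```python
from collections import defaultdict

def within_beta(rows):
    """One-way within estimator for the sponsor effect.

    rows: dicts with keys candidate, race, sponsored (0/1), error (pp).
    Demeans `sponsored` and `error` inside each (candidate, race) cell,
    then returns the OLS slope on the demeaned data.
    """
    cells = defaultdict(list)
    for r in rows:
        cells[(r["candidate"], r["race"])].append(r)

    num = den = 0.0
    for cell in cells.values():
        x_bar = sum(r["sponsored"] for r in cell) / len(cell)
        y_bar = sum(r["error"] for r in cell) / len(cell)
        for r in cell:
            num += (r["sponsored"] - x_bar) * (r["error"] - y_bar)
            den += (r["sponsored"] - x_bar) ** 2
    if den == 0:
        # No candidate appears in both sponsored and non-sponsored polls.
        raise ValueError("sponsor dummy has no within-candidate variation")
    return num / den

# Toy data (hypothetical): X is over-polled everywhere, Y under-polled everywhere,
# but each gains 4pp in the polls they sponsored themselves.
toy = [
    {"candidate": "X", "race": "r1", "sponsored": 1, "error": 5.0},
    {"candidate": "X", "race": "r1", "sponsored": 1, "error": 5.0},
    {"candidate": "X", "race": "r1", "sponsored": 0, "error": 1.0},
    {"candidate": "Y", "race": "r1", "sponsored": 1, "error": -1.0},
    {"candidate": "Y", "race": "r1", "sponsored": 0, "error": -5.0},
    {"candidate": "Y", "race": "r1", "sponsored": 0, "error": -5.0},
    {"candidate": "Y", "race": "r1", "sponsored": 0, "error": -5.0},
]
beta = within_beta(toy)
print(round(beta, 6))
```

Candidates whose polls are all sponsored, or all unsponsored, contribute nothing to the estimate, which is why the count of candidates with both kinds of poll decides the design's power.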
Mechanism: design-driven slant vs residual / fraud
If we find a sponsor effect, the natural follow-up is how it's produced. Two competing mechanisms — and the TSE registration data makes the decomposition feasible because it forces every pollster to register a full sampling plan and methodology narrative before publication.
Why design choices have room to slant at all. In textbook survey theory, a full probability sample of all eligible voters with inverse-inclusion-probability weights against the correct frame is unbiased — so design choices would be irrelevant for sponsor effects. Brazilian electoral polls almost universally are not probability samples; they are quota samples with multi-stage stratified selection of clusters and quota-controlled selection within. Quota sampling is model-based: unbiased only when the chosen quota variables fully explain response heterogeneity, and the chosen population frame is the right one. That assumption is everywhere violated, in legally-disclosed ways, which is what gives Channel A its room.
Channel A — Bayesian persuasion / design-driven (sender strategy). Sponsors choose poll designs that mechanically favor their candidate without violating the methodology. The decomposition has six concrete levers, each declared in registration:
- Coverage exclusion (the strongest version of "zero-probability subpopulation"). DS_DADO_MUNICIPIO typically lists "área urbana", so rural sub-districts get probability zero. If a candidate's base is rural / peripheral / a specific district, this is textbook-legal slant declared in registration. Extracted as a coverage_class flag below.
- Quota variables. Education vs income vs religion vs occupation: none is mandatory, the pollster picks. Different choices give different vote distributions because each quota variable absorbs a different slice of unobserved political heterogeneity. Excluding education when education predicts vote is a tilt without lying.
- Population reference for the quotas. Census-2022 residents, TSE-eligible voters, or an estimated turnout-weighted electorate — each gives different cell shares. The pollster declares one. None is "wrong" by TSE standards.
- Mode. In-person door-to-door under-samples gated communities and working professionals; phone under-samples young & low-income; online over-samples high-internet-penetration. Mode is itself the choice of who's in the frame, and post-stratification weights only partially fix the missingness.
- Question wording / order. Estimulado: order in which candidate names are read can shift answers 2–3pp. Whether familiarity / approval is asked before vote intention primes the answer. Which scenario (estimulado / espontaneo / votos válidos) is the headline number is the pollster's choice.
- Non-response handling. Undecideds: redistribute to leaders, hold separate, distribute proportionally, exclude. Each gives different "votos válidos" numbers from the same raw data.
This is "honest persuasion" — the slant is encoded in design choices the pollster is paid to make. The signal is predictable from registered fields, conditional on extracting the right features from the free-text fields.
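To make the non-response lever concrete, here is a small worked example, assuming made-up raw shares, of how three of the undecided-handling rules listed above turn the same raw data into different headline numbers. The function and rule names are hypothetical labels, not fields in the TSE registration.

```python
def headline_shares(raw, undecided, rule):
    """Return candidate shares (in %) under one undecided-handling rule.

    raw:       dict candidate -> raw percent of all respondents
    undecided: percent of respondents who are undecided / blank / null
    rule:      "hold"    keep undecideds as a separate category
               "exclude" renormalize over decided respondents (votos válidos)
               "leader"  assign every undecided to the current leader
    """
    if rule == "hold":
        return dict(raw)
    if rule == "exclude":
        decided = sum(raw.values())
        return {c: 100 * v / decided for c, v in raw.items()}
    if rule == "leader":
        leader = max(raw, key=raw.get)
        return {c: v + (undecided if c == leader else 0.0) for c, v in raw.items()}
    raise ValueError(f"unknown rule: {rule}")

# Hypothetical raw result: 20% of respondents undecided.
raw = {"A": 40.0, "B": 35.0, "C": 5.0}
undecided = 20.0

for rule in ("hold", "exclude", "leader"):
    shares = headline_shares(raw, undecided, rule)
    print(rule, {c: round(v, 2) for c, v in shares.items()})
```

In this toy case candidate A's headline number moves from 40 to 50 to 60 without any change in the interviews. Distributing undecideds proportionally gives the same shares as excluding them, so it is not shown separately.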
Channel B — residual / fabrication. Results don't match the declared methodology. Same firm, same declared sample design, same nominal sample size — but the numbers for the sponsor's candidate are tilted in a way the design cannot explain. This is the fraud / cooking-the-numbers channel.
Decomposition test. Run the headline within-candidate FE in three specs:
- error = β·SponsoredBy_c + λ_pollster + μ_(c×race) + f(days_to_election) + ε → β₁ = total sponsor effect.
- Add structured methodology controls (sample size, field-period length, mode, ST_PESQUISA_PROPRIA, declared cost): β₂.
- Add LLM-extracted methodology features (sample design class, quota variables, n_stages, audit mechanism, etc.): β₃.
Reading:
- β shrinks toward 0 across the controls (β₁ > β₂ > β₃ ≈ 0): slant is design-driven — Channel A wins. The slant is real but legal — policy implication is to regulate the design choices (mandatory quota variables, mandatory population reference, prohibited coverage exclusions), not to punish pollsters.
- β stays stable (β₁ ≈ β₃): then either (a) there's an unobserved design dimension TSE registration doesn't capture (interviewer instructions, in-field rotation rules, refusal-handling protocols, question-list rotation across respondents), or (b) residual / fabrication — Channel B. Either reading is interesting: (a) tells us which fields the publication regime would need to add to discipline polls; (b) is the fraud finding.
Complementary fraud tests on the residual:
- digit-frequency / Benford on the reported vote percentages of the sponsor's candidate in the slanted-poll tail vs the non-slanted middle;
- suspiciously clean numbers: e.g. multiples of 5, last-digit uniformity tests;
- between-pollster consistency for the same race × date: if pollster X (sponsored by candidate c) reports c at 55% while every other pollster within ±3 days reports c at 35–40%, that's residual sponsor slant that the registered methodology can't explain.
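A minimal sketch of the last-digit uniformity test, assuming the reported percentages are available as printed strings with at least one decimal (if pollsters report whole numbers only, the last digit is not expected to be uniform and this check does not apply). The function name is hypothetical; the statistic is a plain Pearson chi-square against a uniform distribution over the ten digits, compared with the 5% critical value for 9 degrees of freedom.

```python
from collections import Counter

CHI2_CRIT_DF9_05 = 16.919  # 5% critical value, chi-square with 9 degrees of freedom

def last_digit_chi2(percentages):
    """Pearson chi-square statistic for uniformity of the last reported digit.

    percentages: iterable of strings exactly as printed, e.g. "37.4".
    Entries that do not end in a digit are skipped.
    """
    digits = [p.strip()[-1] for p in percentages if p.strip()[-1:].isdigit()]
    n = len(digits)
    if n == 0:
        raise ValueError("no usable percentages")
    expected = n / 10
    counts = Counter(digits)
    return sum((counts.get(str(d), 0) - expected) ** 2 / expected for d in range(10))

# Hypothetical inputs: one evenly spread set, one heaped on .0 and .5.
spread = [f"30.{d}" for d in range(10)] * 5
heaped = ["35.0"] * 30 + ["40.5"] * 20

print(last_digit_chi2(spread), last_digit_chi2(spread) > CHI2_CRIT_DF9_05)
print(last_digit_chi2(heaped), last_digit_chi2(heaped) > CHI2_CRIT_DF9_05)
```

The comparison of interest is between the sponsored tail and the non-sponsored middle, so in practice the statistic would be computed separately for each group rather than read against the critical value alone.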
This decomposition is a stronger contribution than measuring β alone. It also speaks directly to the Batista Pereira & Nunes (2024) alternative (late voter movement), because the within-(c×race) FE + days-to-election control nets out shared movement; what remains is sponsor-specific.
Registration fields available for the test (structured)
Already in each per-UF pesquisa_eleitoral_{year}_*.csv row, no LLM needed:
- ST_PESQUISA_PROPRIA (S/N): self-initiated by the pollster
- QT_ENTREVISTADO: declared sample size (range 140–1,500+ in AC sample, median 400)
- DT_INICIO_PESQUISA, DT_FIM_PESQUISA: field-period start / end
- DT_DIVULGACAO: disclosure date
- NM_ESTATISTICO_RESP, CD_CONRE: responsible statistician name + CONRE (Conselho Regional de Estatística) registration code; lets us track individual statistician fixed effects, separate from firm
- VR_PESQUISA: declared cost in BRL
- NM_EMPRESA, NR_CNPJ_EMPRESA: pollster firm
Registration fields requiring LLM extraction (free text)
Four narrative blocks per poll, median length 150–1,600 chars (max ~3k):
- DS_METODOLOGIA_PESQUISA (~230 chars median): overall methodology
- DS_PLANO_AMOSTRAL (~1,640 chars median): detailed sampling plan, the main signal; describes stratification scheme, PPT vs simple random, quota variables (sex / age / education / income / religion / region) with exact percentages, n_stages
- DS_SISTEMA_CONTROLE (~660 chars median): interviewer training, tablet vs paper, supervisor audit, re-contact verification
- DS_DADO_MUNICIPIO (~150 chars median): coverage area within the municipality
These are LLM-extractable into structured features (mode, sample_design_class, n_stages, quota_variables list, audit_mechanism flag, population_source). Sample of three Rio Branco pollsters (Quaest / F. Façanha / Instituto Verita) confirms substantial heterogeneity in declared design, so the controls have variance to work with.
Related work and positioning
(Full curated index at literature.md; 63 papers + a ## Notes on positioning
synthesis.)
Has this paper been done? No. The six closest predecessors leave a clean gap:
- Meireles & Russo (2022): audits 2,000+ TSE-registered Brazilian polls 2012–2020 including municipal races. Decomposes error by sample size / mode / time-to-election. No sponsor split, no within-candidate FE. Direct same-data benchmark.
- Batista Pereira & Nunes (2024): explains the 2022 presidential polls-vs-results gap with late voter-side change on the same TSE-registered universe. Competing alternative explanation that the paper has to distinguish sponsor bias from.
- Gramacho (2013) / Gramacho (2015): Brazilian poll-accuracy audits at president/governor level. No sponsor identifier, no mandatory pre-registration angle.
- Cantú, Hoyo & Morales (2016): Kalman-filter pipeline for firm-level bias in Mexican multiparty presidential polls. Closest Latin American methodological predecessor; pollster-level, not sponsor-level.
- Leeper & Thorson (2019): survey-sponsor effects but in online opinion surveys, not pre-election polls, and not by candidate sponsor.
- Lee, Zhang & Pak (2024): Bayesian decomposition of Korean news-media selective reporting of polls. Same spirit as the publication-selection problem TSE pre-registration removes.
The combination "mandatory pre-registration + within-candidate FE that splits sponsor from pollster house effects" is unoccupied in the literature.
Industry-insider note. Felipe Nunes — co-author of Batista Pereira & Nunes (2024) and the main academic voice on Brazilian poll performance — co-founded Quaest, one of the major Brazilian polling firms, while continuing as an academic. The paper's intro framing has to acknowledge that the industry has serious academic voices and that we document an average bias, not malfeasance.
Connection to electoral-justice
Mostly standalone. But there is a measurement-error link: the EJ paper's
polling DiD/RD designs (Section 6) use poll movements as outcomes. If litigants
commission biased polls, that is measurement error in the EJ outcome — so a
"are sponsored polls slanted?" check is a legitimate robustness angle for EJ
even if the standalone paper is not built. See
projects/electoral-justice/docs/notes/poll_data_expansion.md for the poll-data
provenance and pipeline.
Complementary data — EJ poll-lawsuits. EJ has a topic classifier
(projects/electoral-justice/source/clean/proc_2024.py, lines 53–58) that
flags injunction cases (Representações) whose assuntos field contains
PESQUISA ELEITORAL. In the 2020 dataset that's 2,376 poll-related
injunctions (~12.5% of all 19,032 EJ injunctions); the same classifier
runs on 2024. The cases are parties / candidates suing about polls —
challenging registration / methodology / publication of a rival poll.
For this project these are useful three ways:
- Perceived-bias validation. A poll sued by a rival is a poll the market believed was slanted. The lawsuit set gives an outside indicator of perceived sponsor bias that we can cross-check against our econometric estimate. Polls in the high-SponsoredBy_c tail should be over-represented in the sued set if our estimate is capturing real bias.
- Mechanism evidence. The lawsuit complaints state which dimension of the poll is alleged to be biased: methodology, sample, sponsor identity, publication timing. A small LLM pass over the petitions in the poll-lawsuit subset would surface the qualitative mechanism behind the average bias we estimate.
- Sponsor-side robustness. Some lawsuits name the pollster or the sponsor as defendant. Cross-referencing defendant CNPJs with the poll_sponsor_2024 table gives a candidate-level treatment indicator "this candidate was sued for a poll they sponsored" that is independent of our regression error definition.

Join key on the EJ side: build/merge/proc.csv (assuntos contains PESQUISA) → defendants → CNPJ.
This is a complement, not a replacement. The within-candidate FE design on the full registered-poll universe remains the headline analysis; the lawsuit subset is a robustness and mechanism layer.
Next step
Feasibility probe (now a CSV join, not a PDF pass): load the contratantes + pagantes CSVs, join to the poll registry by protocol, and measure the sponsor-type distribution — what share of polls are commissioned/paid by a candidate/party/coligação vs a media outlet vs the institute itself, and whether the sponsoring entity is cleanly matchable to a candidate/party in the race. That share (and the count of candidates appearing in both self- and other-sponsored polls, which powers the within-candidate FE) decides whether there is a paper here.
Next steps (handoff for a sandboxed Claude)
Self-contained so a fresh/sandboxed session can take over. Data policy: do
not read raw data files directly; write scripts (source/ of a project, or a
scratch script) that read raw data and emit structural/aggregate output or
build artifacts. Per the feedback_no_inline_python rule, put Python in a file,
not python3 -c / heredocs.
Status (2026-06-01)
- Scrape of 2024 relatório PDFs: complete (11,372 PDFs), backed up to bi-dropbox:data/TSE/2024/pesquisa_eleitoral/relatorios/ (26 per-UF .tar.zst).
- LLM extraction of vote intentions: pilot only, 102 protocols in the orphaned EJ pilot projects/electoral-justice/build/llm/poll_relatorio_2024.parquet (+ 111 cached JSONs in build/llm/poll_relatorio/). The full extraction has NOT been run and is NOT on bi-dropbox.
  - The poll pipeline MOVED out of electoral-justice into pipelines/politica (EJ commit 85c8cd2 / politica cb0523c, 2026-05-28; logic byte-identical apart from path/docstring tweaks). Canonical scripts now live at pipelines/politica/source/{scrape/tse_relatorio.py, llm/poll_extract.py, clean/poll_response_2024.py}; only assemble stays in EJ (projects/electoral-justice/source/assemble/cand_poll_2024.py).
  - Run order (~$11, needs OPENAI_API_KEY): politica/source/llm/poll_extract.py → politica/source/clean/poll_response_2024.py → EJ/source/assemble/cand_poll_2024.py. Outputs: politica/build/llm/poll_relatorio_2024.parquet → politica/build/clean/poll_response_2024.parquet (workspace-wide) → EJ/build/assemble/cand_poll_2024.parquet (EJ candidate-poll wide).
  - Best run on educloud (PDFs pullable from bi-dropbox); first check whether it was already run on educloud under /projects/ec113/henrik.
- Sponsor metadata (contratantes + pagantes): verified + backed up (see Data section). This is the new, ready-to-use input.
- Sponsor cleaning (Step 1 structural part): DONE on laptop sandbox (2026-06-01). pipelines/politica/source/clean/poll_sponsor.py → pipelines/politica/build/clean/poll_sponsor_2024.parquet (29,662 sponsor rows, long by (protocol, role ∈ {contratante, pagante}, sponsor_idx)). See "Step 1 — laptop findings" below. CPF→candidate join NOT done here: the 2024 candidato registry (consulta_cand_2024) is not on this laptop (data/tse/candidato.csv covers 1998–2022 only; see data/tse/README.md); it lives on educloud, where Step 1b should run.
Data inventory (exact paths on this host)
- Sponsor CSVs (zips, per-UF + BRASIL): research/data/tse/pesquisa_contratante_2024.zip, research/data/tse/pesquisa_pagante_2024.zip. Also at bi-dropbox:data/TSE/2024/pesquisa_eleitoral/registration/. Read with sep=';', encoding='latin-1'. Key: NR_PROTOCOLO_REGISTRO. Sponsor id: NR_CPF_CNPJ_CONTRATANTE / NR_CPF_CNPJ_PAGANTE.
- Poll registry (per-UF): data_local/tse_polls_2024/pesquisa_eleitoral_2024_*.csv (key NR_PROTOCOLO_REGISTRO, mayoral = DS_CARGO contains "Prefeito", 14,876 protocols). NB: this lives under data_local, not research/data/tse.
- Candidate registry (for CPF→candidate): on educloud only for 2024 (consulta_cand_2024_BRASIL.zip from TSE dadosabertos). The local research/data/tse/candidato.csv is the cleaned pipelines/politica output and covers 1998–2022 only (per data/tse/README.md), so the CPF→candidate join cannot run on the laptop sandbox. Rerun Step 1b on educloud using the 2024 file there.
- Election results (benchmark final vote shares): research/data/tse/eleicao.csv.
- Vote intentions (outcome): pipelines/politica/build/llm/poll_relatorio_2024.parquet once the full extraction exists (only the orphaned 102-protocol EJ pilot today).
- Sandbox note: claude-sandbox mounts cwd → /workspace and sets DATA_DIR=/workspace/data, with ~/Dropbox (personal, read-only), not bi-dropbox. Launch it from ~/research so /workspace/data = research/data; the data_local registry and bi-dropbox (rclone remote) need separate handling.
Step 1 — Go/no-go: size the treatment (do this first, cheap)
Write a script that:
- Loads contratantes + pagantes (concat per-UF, drop _BRASIL).
- Normalizes NR_CPF_CNPJ_*. Important: read the ID columns as dtype=str at CSV read; pandas' default numeric inference silently drops leading zeros and misclassifies ~22% of CNPJs.
- Joins CPFs to candidato.csv (CPF of 2024 candidates) → flags polls commissioned/paid by a candidate, and identifies which candidate.
- Reports (structural output only): # polls with a candidate-CPF contratante or pagante; cross-tab against DS_ORIGEM_RECURSO; and, the key number, the count of candidates that appear in both a self-sponsored poll and an other-sponsored poll within their own race (this powers the within-candidate FE). CNPJ→party/committee match is a secondary path.
Decision rule: if the within-candidate overlap is only a handful, the clean design is underpowered → reconsider (fall back to race-level or funding-source-flag designs). If it's in the hundreds+, proceed.
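The leading-zero problem in the ID-normalization step can be shown with the standard library alone. This is a sketch, not the poll_sponsor.py logic: the two sample rows, the protocol labels and the classify_id helper are hypothetical, and only the column names and the ';' separator come from the notes above.

```python
import csv
import io
import re

def classify_id(raw):
    """Classify a sponsor ID by digit count: 11 digits = CPF, 14 digits = CNPJ."""
    digits = re.sub(r"\D", "", raw or "")
    return {11: "CPF", 14: "CNPJ"}.get(len(digits), "unknown")

# Hypothetical two-row extract in the layout described above (';' separator).
sample = io.StringIO(
    "NR_PROTOCOLO_REGISTRO;NR_CPF_CNPJ_CONTRATANTE\n"
    "PROTO-1;04123456000199\n"
    "PROTO-2;00012345000167\n"
)
rows = list(csv.DictReader(sample, delimiter=";"))

for row in rows:
    as_text = row["NR_CPF_CNPJ_CONTRATANTE"]  # read as a string: zeros survive
    as_number = str(int(as_text))             # what numeric parsing would keep
    print(row["NR_PROTOCOLO_REGISTRO"], classify_id(as_text), classify_id(as_number))
```

Both sample IDs are 14-digit CNPJs when kept as text. Parsed as numbers, the first shrinks to 13 digits and the second to 11, so the second would be counted as a CPF; passing dtype=str to the pandas reader avoids the same failure there.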
Step 1 — laptop findings (2026-06-01, structural part only)
Script: pipelines/politica/source/clean/poll_sponsor.py. Rerun with
BASE_DIR=… DATA_DIR=… python source/clean/poll_sponsor.py.
Output: pipelines/politica/build/clean/poll_sponsor_2024.parquet (long,
29,662 sponsor rows covering 14,876 mayoral protocols).
Headline numbers (mayoral 2024):
- Sponsor coverage: 100% of protocols have ≥1 contratante; 96% have ≥1 pagante. Sponsor metadata is essentially complete — no missingness story.
- Sponsor-id mix: 93% CNPJ, 7% CPF (rows). The "individual sponsor" route (CPF) is small; 1,035 of 14,876 protocols (7.0%) have ≥1 CPF sponsor. After CPF→candidate match (educloud) this is the upper bound for candidate-self-sponsored polls via the individual route.
- Institutional self-sponsoring (sponsor CNPJ == pollster CNPJ): 26% of sponsor rows. A substantial "institute bankrolling its own poll" subset; these are the rosy-pollster controls in λ_pollster.
- Politically-flagged funding (DS_ORIGEM_RECURSO ∈ {Fundo Partidário, Doações Eleitorais, …}): 712 contratante rows (~4.7%) are explicitly political money. This is a protocol-level treatment indicator that does not need CPF→candidate matching: a clean fallback / complementary design path.
- Within-race CPF diversity (candidate-agnostic proxy for design power): only 148 races have ≥2 distinct CPF sponsors across their polls. That's the ceiling for within-candidate FE on the CPF route alone, which is small.
- Heuristic sponsor-type classification (regex on contratante names; see pipelines/politica/source/clean/poll_sponsor.py follow-up or /tmp/sponsor_classify.py):
  - 16.1% of mayoral polls (2,389/14,876) have a political contratante: candidate CPF (6.9%), candidate-committee CNPJ (4.2%), or party CNPJ (4.9%). This is the upper bound for the treated set with no coalition / generic-committee matching yet (those returned 0 via regex; re-check with a broader pattern).
  - Candidate-committee CNPJs follow a deterministic naming pattern: ELEIÇÃO {ANO} {NOME COMPLETO} {PREFEITO|VICE-PREFEITO}. This means a sponsor name like ELEICAO 2024 FABIO FERREIRA PINHEIRO PREFEITO is the candidate's own campaign committee, and the candidate's full name is parseable from the sponsor name directly. No 2024 candidato registry needed to identify which candidate this poll boosts: the regex extracts it. Hugely lowers the educloud dependency for the candidate-committee subset.
  - 228 races have within-FE contrast structure on the candidate-committee path alone (≥1 candidate-committee poll AND ≥1 non-committee poll in the same race): laptop-resolvable, independent of the educloud CPF/party join. Together with the CPF subset and the party-CNPJ subset (educloud), the total within-FE race count should easily be in the hundreds.
  - Sponsor-type breakdown of all contratante rows: media 39.5%, pollster-self 26.2%, other/unknown 16.4%, candidate-or-individual CPF 7.0%, party CNPJ 4.9%, candidate-committee CNPJ 4.3%, other pollsters 1.7%. The "other/unknown" 16% pool likely contains more political sponsors (MEI-format CNPJs, sectoral associations, political consultancies), so it is worth one LLM-classifier pass to refine before final estimation.
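The candidate-committee naming pattern lends itself to a short regex. The sketch below is an illustration, not the expression used in poll_sponsor.py or /tmp/sponsor_classify.py: it assumes the name always starts with ELEIÇÃO (with or without diacritics), a four-digit year, and ends in PREFEITO or VICE-PREFEITO, exactly as the pattern above states. Other office spellings would need a broader pattern.

```python
import re

# ELEIÇÃO {ANO} {NOME COMPLETO} {PREFEITO|VICE-PREFEITO}, accents optional.
COMMITTEE_RE = re.compile(
    r"^ELEI[CÇ][AÃ]O\s+(?P<year>\d{4})\s+(?P<name>.+?)\s+"
    r"(?P<office>VICE-PREFEITO|PREFEITO)$"
)

def parse_committee(sponsor_name):
    """Return {'year', 'name', 'office'} for a committee-style sponsor name, else None."""
    cleaned = " ".join(sponsor_name.upper().split())  # uppercase, collapse whitespace
    match = COMMITTEE_RE.match(cleaned)
    return match.groupdict() if match else None

print(parse_committee("ELEICAO 2024 FABIO FERREIRA PINHEIRO PREFEITO"))
print(parse_committee("JORNAL EXEMPLO LTDA"))  # hypothetical media sponsor
```

The parsed name still has to be matched to a candidate within the race, so the usual name-normalization caveats apply before it can serve as the SponsoredBy_c indicator.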
Implication for design. The within-candidate FE via the CPF route is underpowered (≤148 races). The viable paths are:
- CNPJ→party→party's mayoral candidate in that race as the workhorse treated set. The map is close to 1:1 within a race: each party can back at most one prefeito ticket per municipality, so a poll commissioned by Party X's CNPJ in city C identifies the prefeito candidate Party X fields (or, where Party X only joins a coligação, the candidate that coligação backs) as the boosted candidate. None of the "which of the party's candidates does this poll boost?" ambiguity that would exist for vereador or proportional state/federal races. The cleanness of this map is the central reason the study scopes to mayoral races: extending to vereador or deputado is not just bigger N, it's a different identification problem. Needs party-CNPJ matching (TSE partidos table) plus the 2024 candidate registry (educloud).
- Funding-source flag as a protocol-level treatment (Fundo Partidário / Doações Eleitorais polls vs media-financed polls): 712 treated rows, no candidate match needed. It identifies partisan-financed polls, not self-financed ones, so it is cleaner for a "are partisan-financed polls biased?" question than for the strict sponsor-bias one. Worth running first as a feasibility check.
- Keep the CPF route as a clean subsample within the larger CNPJ design (high-quality but small).
Step 1b (within-candidate overlap on CPF + CNPJ routes) is deferred to educloud where the 2024 candidato registry lives.
Step 2 — Outcome: full vote-intention extraction
Run / locate the full pipelines/politica/source/llm/poll_extract.py over all
11,372 PDFs (educloud; or pull the parquet if already produced there). Output:
one row per candidate×scenario×poll with percent + tse_protocol at
pipelines/politica/build/llm/poll_relatorio_2024.parquet.
NB: this extractor uses the raw OpenAI SDK + a hand-rolled cache; there is an
open pipelines/politica/docs/todo.md item to migrate it to llmkit before
the full run (standardized cache + audit workflow). Prefer doing that first.
Step 3 — Build the candidate×poll analysis table
Join: extracted vote intentions × sponsor (Step 1) × final results
(eleicao.csv). For each (candidate c, poll p): error = poll_share − final_share
(pick one comparable scenario, e.g. estimulado first round; match candidate
names within race — name matching is the main data-cleaning risk). Carry
days_to_election, pollster (NM_EMPRESA from registry), race id, sponsor flags.
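A minimal sketch of the join, assuming the three inputs have already been reduced to plain Python structures. The function names, dict keys and toy rows are hypothetical; the only logic taken from the notes is error = poll_share − final_share and the need to normalize candidate names before matching within a race.

```python
import unicodedata

def norm_name(name):
    """Uppercase, strip accents and collapse whitespace so names compare across sources."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())

def build_errors(poll_rows, final_share, sponsor_of):
    """Build candidate x poll rows with the poll error and the sponsor flag.

    poll_rows:   dicts with protocol, race, candidate, poll_share
    final_share: {(race, normalized name): final vote share}
    sponsor_of:  {protocol: normalized name of the sponsoring candidate}
    Returns (matched rows, rows whose candidate name found no match).
    """
    table, unmatched = [], []
    for r in poll_rows:
        cand = norm_name(r["candidate"])
        key = (r["race"], cand)
        if key not in final_share:
            unmatched.append(r)  # keep for manual review: the main cleaning risk
            continue
        table.append({
            "protocol": r["protocol"],
            "race": r["race"],
            "candidate": cand,
            "error": r["poll_share"] - final_share[key],
            "sponsored_by_c": sponsor_of.get(r["protocol"]) == cand,
        })
    return table, unmatched

# Toy inputs (hypothetical names and numbers).
polls = [
    {"protocol": "P1", "race": "CITY-1", "candidate": "José  Exemplo", "poll_share": 45.0},
    {"protocol": "P2", "race": "CITY-1", "candidate": "JOSE EXEMPLO", "poll_share": 41.0},
    {"protocol": "P2", "race": "CITY-1", "candidate": "Nome Sem Par", "poll_share": 3.0},
]
final = {("CITY-1", "JOSE EXEMPLO"): 40.0}
sponsors = {"P1": "JOSE EXEMPLO"}

table, unmatched = build_errors(polls, final, sponsors)
print(table)
print(len(unmatched), "unmatched")
```

Returning the unmatched rows separately, instead of dropping them silently, gives a direct count of how much of the sample the name matching loses.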
Step 4 — Estimate
error ~ SponsoredBy_c + pollster_FE + (c × race)_FE + f(days_to_election);
restrict to late polls (last ~2–3 weeks) to limit genuine-movement contamination.
Symmetric opponent-sponsored test. See Identification section above for threats.
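The late-poll restriction can be sketched as a date filter on the end of fieldwork. The helper names are hypothetical, the DD/MM/YYYY input format for DT_FIM_PESQUISA is an assumption to verify against the registry files, and the 21-day window is just the upper end of the "last ~2–3 weeks" range; the first round of the 2024 municipal elections was held on 6 October 2024.

```python
from datetime import date, datetime

FIRST_ROUND_2024 = date(2024, 10, 6)  # first round, 2024 municipal elections

def days_to_election(dt_fim_pesquisa, election_day=FIRST_ROUND_2024):
    """Days from the end of fieldwork to election day (DD/MM/YYYY input assumed)."""
    end = datetime.strptime(dt_fim_pesquisa, "%d/%m/%Y").date()
    return (election_day - end).days

def is_late_poll(dt_fim_pesquisa, window_days=21):
    """True if fieldwork ended within `window_days` before (or on) election day."""
    gap = days_to_election(dt_fim_pesquisa)
    return 0 <= gap <= window_days

for d in ("01/09/2024", "15/09/2024", "29/09/2024"):
    print(d, days_to_election(d), is_late_poll(d))
```

The same days_to_election value can enter the regression as f(days_to_election) when the full sample is used instead of the late-poll subsample.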
Related ideas
- [[effect-of-campaign-spending-on-votes]] — same 2024 electoral data universe.
- [[marginal-winners-and-campaign-finance]] — elections/finance overlap.