title: Poll sponsor bias
status: project
tags: [elections, polls, measurement]
related_project: projects/electoral-justice
last_updated: 2026-06-02

Poll sponsor bias

Question. Do registered Brazilian electoral polls systematically overstate the candidate or party that commissioned them? I.e., is there measurable "sponsor bias" / client-specific house effects in the poll numbers?

Why 2024 Brazil is a clean setting

Data

Identification

Naive "regress poll error on self-sponsored" is confounded by (a) the candidate's true standing and (b) generic pollster house effects. The within-candidate design addresses both:

error_{c,p} = (poll share of candidate c in poll p) − (c's final vote share)
error_{c,p} = β · SponsoredBy_c_{c,p} + λ_pollster + μ_(c×race) + f(days_to_election) + ε
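To make the role of μ_(c×race) concrete, here is a minimal sketch of the within transformation in plain Python. It is a deliberate simplification: it absorbs only the candidate×race fixed effect by demeaning and leaves out λ_pollster and f(days_to_election), which the full specification needs. The function name and the toy rows are illustrative assumptions, not project code.

```python
from collections import defaultdict


def within_estimator(rows):
    """One-way fixed-effects slope via group demeaning.

    rows: iterable of (group, treated, error), where group stands in for a
    candidate-by-race cell, treated is the 0/1 self-sponsored flag, and
    error is poll share minus final vote share.
    """
    rows = list(rows)
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # group -> [sum_x, sum_y, n]
    for group, x, y in rows:
        acc = sums[group]
        acc[0] += x
        acc[1] += y
        acc[2] += 1

    num = den = 0.0
    for group, x, y in rows:
        sum_x, sum_y, n = sums[group]
        x_dev = x - sum_x / n  # deviation from the group mean of treatment
        y_dev = y - sum_y / n  # deviation from the group mean of error
        num += x_dev * y_dev
        den += x_dev * x_dev
    if den == 0:
        raise ValueError("no within-group variation in treatment")
    return num / den


# Toy data: each candidate has its own level of error (the fixed effect),
# and self-sponsored polls add 3 points on top of it.
toy_rows = [
    ("A", 0, 2.0), ("A", 1, 5.0), ("A", 0, 2.0),
    ("B", 0, -1.0), ("B", 1, 2.0),
    ("C", 0, 4.0), ("C", 0, 4.0),  # never self-sponsored: contributes nothing
]
beta = within_estimator(toy_rows)
print(round(beta, 6))
```

Candidate C illustrates why the overlap count matters: a candidate with no variation in sponsorship drops out of the estimate entirely.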

Threats to pre-empt. (i) Genuine late campaign movement contaminates "error vs final" — restrict to polls in the last ~2–3 weeks or control the time gap. (ii) Strategic timing of when a candidate commissions a poll — within-candidate FE + days-to-election handle most of it. (iii) Pollster–sponsor collinearity (see above).

Mechanism: design-driven slant vs residual / fraud

If we find a sponsor effect, the natural follow-up is how it is produced. There are two competing mechanisms, and the TSE registration data make the decomposition feasible because every pollster must register a full sampling plan and methodology narrative before publication.

Why design choices have room to slant at all. In textbook survey theory, a full probability sample of all eligible voters, weighted by inverse inclusion probabilities against the correct frame, is unbiased, so design choices would be irrelevant for sponsor effects. Brazilian electoral polls are almost never probability samples; they are quota samples with multi-stage stratified selection of clusters and quota-controlled selection within them. Quota sampling is model-based: it is unbiased only when the chosen quota variables fully explain response heterogeneity and the chosen population frame is the right one. That assumption is violated everywhere, in legally disclosed ways, which is what gives Channel A its room.
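A small worked example may help show how a quota design can hit its declared margins and still be biased. The sketch below uses an invented four-cell population (age × area); the quota controls age only, and the within-age urban share is left to the design. All numbers, names, and the choice of variables are assumptions made for illustration.

```python
# Toy population: (age, area) -> (population share, support for the candidate).
# Every number here is invented for illustration.
CELLS = {
    ("young", "urban"): (0.30, 0.60),
    ("young", "rural"): (0.20, 0.40),
    ("old", "urban"): (0.20, 0.50),
    ("old", "rural"): (0.30, 0.30),
}


def true_support(cells):
    """Population support: share-weighted average over all cells."""
    return sum(share * support for share, support in cells.values())


def quota_estimate(cells, urban_share_within_age):
    """Estimate from a quota sample that matches the age margins exactly.

    urban_share_within_age: age -> fraction of that age quota filled with
    urban respondents (a design choice the age quota does not constrain).
    """
    estimate = 0.0
    for age, urban_share in urban_share_within_age.items():
        age_share = sum(s for (a, _), (s, _) in cells.items() if a == age)
        urban_support = cells[(age, "urban")][1]
        rural_support = cells[(age, "rural")][1]
        mixed = urban_share * urban_support + (1 - urban_share) * rural_support
        estimate += age_share * mixed
    return estimate


truth = true_support(CELLS)
# Age quotas met, but 90% of interviews in each age group are urban.
slanted = quota_estimate(CELLS, {"young": 0.9, "old": 0.9})
# Same quotas, with the urban share matching the population within each age.
proportional = quota_estimate(CELLS, {"young": 0.6, "old": 0.4})
print(round(truth, 4), round(slanted, 4), round(proportional, 4))
```

Both samples satisfy the declared age quota; only the one whose uncontrolled dimension happens to match the population recovers the true share.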

Channel A — Bayesian persuasion / design-driven (sender strategy). Sponsors choose poll designs that mechanically favor their candidate without violating the methodology. The decomposition has six concrete levers, each declared in registration:

This is "honest persuasion" — the slant is encoded in design choices the pollster is paid to make. The signal is predictable from registered fields, conditional on extracting the right features from the free-text fields.

Channel B — residual / fabrication. Results don't match the declared methodology. Same firm, same declared sample design, same nominal sample size — but the numbers for the sponsor's candidate are tilted in a way the design cannot explain. This is the fraud / cooking-the-numbers channel.

Decomposition test. Run the headline within-candidate FE in three specs:

  1. error = β·SponsoredBy_c + λ_pollster + μ_(c×race) + f(days_to_election) + ε → β₁ = total sponsor effect.
  2. Add structured methodology controls (sample size, field-period length, mode, ST_PESQUISA_PROPRIA, declared cost): β₂.
  3. Add LLM-extracted methodology features (sample design class, quota variables, n_stages, audit mechanism, etc.): β₃.
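The arithmetic that links the three coefficients can be sketched as follows. This is one possible bookkeeping, assuming the controls in specs 2 and 3 capture declared design and nothing else; the function name, the dictionary keys, and the example coefficients are hypothetical.

```python
def decompose(beta1, beta2, beta3):
    """Split the total sponsor effect across the three nested specs.

    beta1: total effect (spec 1)
    beta2: effect after structured methodology controls (spec 2)
    beta3: effect after adding LLM-extracted features (spec 3)
    """
    if beta1 == 0:
        raise ValueError("total effect is zero; shares are undefined")
    return {
        "total": beta1,
        "absorbed_by_structured": beta1 - beta2,
        "absorbed_by_text_features": beta2 - beta3,
        "residual": beta3,
        # Share of the total that registered design features account for.
        "design_share": (beta1 - beta3) / beta1,
    }


# Illustrative coefficients only, in percentage points.
print(decompose(2.0, 1.5, 0.5))
```

Coefficient differences across nested specifications are themselves estimates, so any real comparison would also need standard errors for the differences.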

Reading:

Complementary fraud tests on the residual:

This decomposition is a stronger contribution than measuring β alone. It also speaks directly to the Batista Pereira & Nunes (2024) alternative explanation, late voter movement, because the within-(c×race) FE plus the days-to-election control net out shared movement; what remains is sponsor-specific.

Registration fields available for the test (structured)

Already in each per-UF pesquisa_eleitoral_{year}_*.csv row, no LLM needed:

Registration fields requiring LLM extraction (free text)

Four narrative blocks per poll, median length 150–1,600 chars (max ~3k):

These are LLM-extractable into structured features (mode, sample_design_class, n_stages, quota_variables list, audit_mechanism flag, population_source). Sample of three Rio Branco pollsters (Quaest / F. Façanha / Instituto Verita) confirms substantial heterogeneity in declared design, so the controls have variance to work with.
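One way to pin down the extraction target is a small typed record, sketched below. The field names follow the feature list above; the types, the defaults, the normalization of quota variables, and the placeholder protocol value are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MethodologyFeatures:
    """Structured features extracted from one poll's free-text blocks."""

    tse_protocol: str
    mode: Optional[str] = None                 # e.g. face-to-face, phone
    sample_design_class: Optional[str] = None  # e.g. quota, probability
    n_stages: Optional[int] = None
    quota_variables: List[str] = field(default_factory=list)
    audit_mechanism: Optional[bool] = None
    population_source: Optional[str] = None

    def __post_init__(self):
        if self.n_stages is not None and self.n_stages < 1:
            raise ValueError("n_stages must be a positive integer")
        # Lowercase, trim, and de-duplicate so the same variable always
        # yields the same dummy downstream.
        cleaned = {v.strip().lower() for v in self.quota_variables if v.strip()}
        self.quota_variables = sorted(cleaned)


example = MethodologyFeatures(
    tse_protocol="PLACEHOLDER",  # not a real protocol number
    n_stages=2,
    quota_variables=["Sexo ", "idade", "sexo"],
)
print(example.quota_variables)
```

Leaving every feature optional keeps "not stated in the text" distinct from a negative answer, which matters when the controls enter a regression.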

(Full curated index at literature.md; 63 papers + a ## Notes on positioning synthesis.)

Has this paper been done? No. The six closest predecessors leave a clean gap:

The combination "mandatory pre-registration + within-candidate FE that splits sponsor from pollster house effects" is unoccupied in the literature.

Industry-insider note. Felipe Nunes — co-author of Batista Pereira & Nunes (2024) and the main academic voice on Brazilian poll performance — co-founded Quaest, one of the major Brazilian polling firms, while continuing as an academic. The paper's intro framing has to acknowledge that the industry has serious academic voices and that we document an average bias, not malfeasance.

Connection to electoral-justice

Mostly standalone. But there is a measurement-error link: the EJ paper's polling DiD/RD designs (Section 6) use poll movements as outcomes. If litigants commission biased polls, that is measurement error in the EJ outcome, so an "are sponsored polls slanted?" check is a legitimate robustness angle for EJ even if the standalone paper is not built. See projects/electoral-justice/docs/notes/poll_data_expansion.md for the poll-data provenance and pipeline.

Complementary data — EJ poll-lawsuits. EJ has a topic classifier (projects/electoral-justice/source/clean/proc_2024.py, lines 53–58) that flags injunction cases (Representações) whose assuntos field contains PESQUISA ELEITORAL. In the 2020 dataset that's 2,376 poll-related injunctions (~12.5% of all 19,032 EJ injunctions); the same classifier runs on 2024. The cases are parties / candidates suing about polls — challenging registration / methodology / publication of a rival poll. For this project these are useful three ways:

  1. Perceived-bias validation. A poll sued by a rival is a poll the market believed was slanted. The lawsuit set gives an outside indicator of perceived sponsor bias that we can cross-check against our econometric estimate. Polls in the high-SponsoredBy_c tail should be over-represented in the sued set if our estimate is capturing real bias.

  2. Mechanism evidence. The lawsuit complaints state which dimension of the poll is alleged to be biased — methodology, sample, sponsor identity, publication timing. A small LLM pass over the petitions in the poll-lawsuit subset would surface the qualitative mechanism behind the average bias we estimate.

  3. Sponsor-side robustness. Some lawsuits name the pollster or the sponsor as defendant. Cross-referencing defendant CNPJs with the poll_sponsor_2024 table gives a candidate-level treatment indicator "this candidate was sued for a poll they sponsored" that is independent of our regression error definition.

    Join key on the EJ side: build/merge/proc.csv (assuntos contains PESQUISA) → defendants → CNPJ.

This is a complement, not a replacement. The within-candidate FE design on the full registered-poll universe remains the headline analysis; the lawsuit subset is a robustness and mechanism layer.
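The perceived-bias cross-check in item 1 could be sketched as a comparison of sued rates. The snippet below is an approximation: the substring test only mimics what the EJ classifier is described as doing, the binary self-sponsored flag stands in for the "high-SponsoredBy_c tail", and all protocols are invented.

```python
def is_poll_case(assuntos):
    """Rough stand-in for the EJ topic flag: subject mentions electoral polls."""
    return "PESQUISA ELEITORAL" in (assuntos or "").upper()


def sued_rates(polls, sued_protocols):
    """Share of polls that were sued, split by sponsorship flag.

    polls: iterable of (protocol, self_sponsored)
    sued_protocols: set of protocols named in poll-related lawsuits
    """
    counts = {True: [0, 0], False: [0, 0]}  # flag -> [n_sued, n_total]
    for protocol, self_sponsored in polls:
        bucket = counts[bool(self_sponsored)]
        bucket[0] += int(protocol in sued_protocols)
        bucket[1] += 1
    return {
        flag: (sued / total if total else None)
        for flag, (sued, total) in counts.items()
    }


toy_polls = [("P1", True), ("P2", True), ("P3", False),
             ("P4", False), ("P5", False), ("P6", False)]
rates = sued_rates(toy_polls, {"P1", "P4"})
print(rates)
```

A higher sued rate among self-sponsored polls would be consistent with the lawsuit set tracking perceived slant, though it could equally reflect which polls rivals bother to contest.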

Next step

Feasibility probe (now a CSV join, not a PDF pass): load the contratantes + pagantes CSVs, join to the poll registry by protocol, and measure the sponsor-type distribution — what share of polls are commissioned/paid by a candidate/party/coligação vs a media outlet vs the institute itself, and whether the sponsoring entity is cleanly matchable to a candidate/party in the race. That share (and the count of candidates appearing in both self- and other-sponsored polls, which powers the within-candidate FE) decides whether there is a paper here.

Next steps (handoff for a sandboxed Claude)

Self-contained so a fresh/sandboxed session can take over. Data policy: do not read raw data files directly; write scripts (source/ of a project, or a scratch script) that read raw data and emit structural/aggregate output or build artifacts. Per the feedback_no_inline_python rule, put Python in a file, not python3 -c / heredocs.

Status (2026-06-01)

Data inventory (exact paths on this host)

Step 1 — Go/no-go: size the treatment (do this first, cheap)

Write a script that:

  1. Loads contratantes + pagantes (concat per-UF, drop _BRASIL).
  2. Normalizes NR_CPF_CNPJ_*. Important: read the ID columns as dtype=str at CSV read; pandas' default numeric inference silently drops leading zeros and misclassifies ~22% of CNPJs.
  3. Joins CPFs to candidato.csv (CPF of 2024 candidates) → flags polls commissioned/paid by a candidate, and identifies which candidate.
  4. Reports (structural output only): # polls with a candidate-CPF contratante or pagante; cross-tab against DS_ORIGEM_RECURSO; and — the key number — the count of candidates that appear in both a self-sponsored poll and an other-sponsored poll within their own race (this powers the within-candidate FE). CNPJ→party/committee match is a secondary path.
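The leading-zero problem in step 2 can be reproduced with the standard library alone. The sketch below reads IDs as text (the csv module never infers numeric types; in pandas the equivalent is passing dtype=str to read_csv) and classifies them by exact length. The column name sponsor_id and both ID values are made up for the example, since the real column suffixes are not spelled out here.

```python
import csv
import io
import re


def classify_id(raw):
    """Return (digits, kind) where kind is CPF (11), CNPJ (14), or UNKNOWN."""
    digits = re.sub(r"\D", "", raw or "")  # drop dots, slashes, dashes
    if len(digits) == 11:
        return digits, "CPF"
    if len(digits) == 14:
        return digits, "CNPJ"
    return digits, "UNKNOWN"


def read_ids(handle, id_column):
    """Read one ID column as text and classify each value."""
    return [classify_id(row[id_column]) for row in csv.DictReader(handle)]


# Invented sample; the real files and column names differ.
SAMPLE = "protocol,sponsor_id\nP1,00012345000199\nP2,00000000191\n"
ids = read_ids(io.StringIO(SAMPLE), "sponsor_id")

# What numeric inference does to the first ID: the leading zeros vanish,
# and the 14-digit CNPJ now has 11 digits and passes for a CPF.
mangled = str(int("00012345000199"))
print(ids, classify_id(mangled))
```

Zero-padding after the fact cannot repair this, because a shortened CNPJ and a genuine CPF are indistinguishable once the zeros are gone.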

Decision rule: if the within-candidate overlap is only a handful, the clean design is underpowered → reconsider (fall back to race-level or funding-source-flag designs). If it's in the hundreds+, proceed.
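The key number and the decision rule can be sketched in a few lines. The input layout (race, candidate, flag) and the two numeric thresholds are assumptions; the rule above only says "a handful" versus "hundreds+", so the cut-offs below are placeholders to be set by judgment.

```python
from collections import defaultdict


def count_overlap_candidates(rows):
    """Count candidates seen in both self- and other-sponsored polls.

    rows: iterable of (race_id, candidate_id, self_sponsored), one row per
    candidate-by-poll appearance within the candidate's own race.
    """
    flags_seen = defaultdict(set)
    for race_id, candidate_id, self_sponsored in rows:
        flags_seen[(race_id, candidate_id)].add(bool(self_sponsored))
    return sum(1 for flags in flags_seen.values() if flags == {True, False})


def go_no_go(overlap, handful=10, plenty=200):
    """Map the overlap count to a decision; thresholds are hypothetical."""
    if overlap <= handful:
        return "reconsider"
    if overlap >= plenty:
        return "proceed"
    return "judgment call"


toy = [
    ("r1", "c1", True), ("r1", "c1", False),   # overlap
    ("r1", "c2", False), ("r1", "c2", False),  # never self-sponsored
    ("r2", "c3", True),                        # only self-sponsored
]
overlap = count_overlap_candidates(toy)
print(overlap, go_no_go(overlap))
```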

Step 1 — laptop findings (2026-06-01, structural part only)

Script: pipelines/politica/source/clean/poll_sponsor.py. Rerun with BASE_DIR=… DATA_DIR=… python source/clean/poll_sponsor.py. Output: pipelines/politica/build/clean/poll_sponsor_2024.parquet (long, 29,662 sponsor rows covering 14,876 mayoral protocols).

Headline numbers (mayoral 2024):

Implication for design. The within-candidate FE via the CPF route is underpowered (≤148 races). The viable paths are:

  1. CNPJ→party→party's mayoral candidate in that race as the workhorse treated set. The map is 1:1 within a race: under Brazilian electoral law each party fields at most one prefeito candidate per municipality, so a poll commissioned by Party X's CNPJ in city C unambiguously identifies Party X's prefeito candidate as the boosted candidate. There is none of the "which of the party's candidates does this poll boost?" ambiguity that would exist for vereador or proportional state/federal races. The cleanness of this map is the central reason the study scopes to mayoral races: extending to vereador or deputado is not just a bigger N, it is a different identification problem. Needs party-CNPJ matching (TSE partidos table) plus the 2024 candidate registry (educloud).
  2. Funding-source flag as a protocol-level treatment (Fundo Partidário / Doações Eleitorais polls vs media-financed polls): 712 treated rows, no candidate match needed — but identifies partisan-financed polls, not self-financed; cleaner for a "are partisan-financed polls biased?" question than for the strict sponsor-bias one. Worth running first as a feasibility check.
  3. Keep the CPF route as a clean subsample within the larger CNPJ design (high-quality but small).
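For path 1, the party-to-candidate map and its uniqueness check could look like the sketch below. The tuple layout, the function names, and the toy municipalities and parties are assumptions; the real inputs would come from the TSE partidos table and the 2024 candidate registry.

```python
def build_party_candidate_map(candidates):
    """Map (municipality, party) to that party's mayoral candidate.

    candidates: iterable of (municipality, party, candidate_id).
    Raises if a party shows up with two different mayoral candidates in
    one municipality, since the design relies on that never happening.
    """
    mapping = {}
    for municipality, party, candidate_id in candidates:
        key = (municipality, party)
        if key in mapping and mapping[key] != candidate_id:
            raise ValueError(f"party-candidate map is not unique for {key}")
        mapping[key] = candidate_id
    return mapping


def treated_candidate(mapping, municipality, sponsor_party):
    """Candidate treated by a party-sponsored poll, or None if no match."""
    return mapping.get((municipality, sponsor_party))


# Invented registry rows.
registry = [("city_a", "PX", "cand_1"), ("city_a", "PY", "cand_2"),
            ("city_b", "PX", "cand_3")]
party_map = build_party_candidate_map(registry)
print(treated_candidate(party_map, "city_a", "PX"))
print(treated_candidate(party_map, "city_b", "PY"))
```

A None result is informative in itself: it marks party-sponsored polls in races where that party has no mayoral candidate of its own, which would need a separate rule.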

Step 1b (within-candidate overlap on CPF + CNPJ routes) is deferred to educloud where the 2024 candidato registry lives.

Step 2 — Outcome: full vote-intention extraction

Run / locate the full pipelines/politica/source/llm/poll_extract.py over all 11,372 PDFs (educloud; or pull the parquet if already produced there). Output: one row per candidate×scenario×poll with percent + tse_protocol at pipelines/politica/build/llm/poll_relatorio_2024.parquet. NB: this extractor uses the raw OpenAI SDK + a hand-rolled cache; there is an open pipelines/politica/docs/todo.md item to migrate it to llmkit before the full run (standardized cache + audit workflow). Prefer doing that first.

Step 3 — Build the candidate×poll analysis table

Join: extracted vote intentions × sponsor (Step 1) × final results (eleicao.csv). For each (candidate c, poll p): error = poll_share − final_share (pick one comparable scenario, e.g. estimulado first round; match candidate names within race — name matching is the main data-cleaning risk). Carry days_to_election, pollster (NM_EMPRESA from registry), race id, sponsor flags.
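Since name matching is flagged as the main cleaning risk, a minimal sketch of the join may be useful. It normalizes names by stripping accents, case, and extra whitespace, then computes error = poll_share − final_share and keeps unmatched rows for inspection. The dictionary keys and the sample names are assumptions; real matching would likely need more than exact equality after normalization.

```python
import unicodedata


def normalize_name(name):
    """Uppercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


def build_errors(poll_rows, final_shares):
    """Attach error = poll_share - final_share to each matched row.

    poll_rows: dicts with race_id, candidate_name, poll_share
    final_shares: {(race_id, normalized_name): final vote share}
    Returns (matched_rows, unmatched_rows).
    """
    matched, unmatched = [], []
    for row in poll_rows:
        key = (row["race_id"], normalize_name(row["candidate_name"]))
        if key in final_shares:
            error = row["poll_share"] - final_shares[key]
            matched.append({**row, "error": error})
        else:
            unmatched.append(row)
    return matched, unmatched


# Invented example rows.
finals = {("r1", "JOAO DA SILVA"): 41.0}
polls = [
    {"race_id": "r1", "candidate_name": "João  da Silva", "poll_share": 45.0},
    {"race_id": "r1", "candidate_name": "J. Silva", "poll_share": 44.0},
]
matched, unmatched = build_errors(polls, finals)
print(matched[0]["error"], len(unmatched))
```

Reporting the unmatched count per race is a cheap structural check on how much of the sample the name step silently drops.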

Step 4 — Estimate

error ~ SponsoredBy_c + pollster_FE + (c × race)_FE + f(days_to_election); restrict to late polls (last ~2–3 weeks) to limit genuine-movement contamination. Symmetric opponent-sponsored test. See Identification section above for threats.
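Before estimation, the sample restriction and the two treatment flags could be built as sketched below. This is one reading of the "symmetric opponent-sponsored test": a second indicator for polls sponsored by a rival in the same race. The 21-day window is one value inside the 2 to 3 week range, the field names are hypothetical, and the example uses the 2024 first-round date of 6 October.

```python
from datetime import date


def days_to_election(poll_date, election_day):
    """Calendar days between the poll and election day."""
    return (election_day - poll_date).days


def sponsor_flags(candidate_id, sponsor_candidate_id):
    """Return (sponsored_by_self, sponsored_by_rival) for one candidate row.

    Assumes sponsor_candidate_id, when present, is a candidate in the same
    race; None means the poll has no candidate-linked sponsor.
    """
    if sponsor_candidate_id is None:
        return 0, 0
    if sponsor_candidate_id == candidate_id:
        return 1, 0
    return 0, 1


def late_sample(rows, election_day, max_days=21):
    """Keep polls fielded within max_days of the election and add flags."""
    kept = []
    for row in rows:
        gap = days_to_election(row["poll_date"], election_day)
        if 0 <= gap <= max_days:
            own, rival = sponsor_flags(row["candidate_id"],
                                       row["sponsor_candidate_id"])
            kept.append({**row, "days_to_election": gap,
                         "sponsored_by_self": own,
                         "sponsored_by_rival": rival})
    return kept


FIRST_ROUND_2024 = date(2024, 10, 6)
toy = [
    {"poll_date": date(2024, 9, 20), "candidate_id": "c1", "sponsor_candidate_id": "c1"},
    {"poll_date": date(2024, 9, 20), "candidate_id": "c2", "sponsor_candidate_id": "c1"},
    {"poll_date": date(2024, 8, 1), "candidate_id": "c1", "sponsor_candidate_id": "c1"},
    {"poll_date": date(2024, 10, 1), "candidate_id": "c1", "sponsor_candidate_id": None},
]
sample = late_sample(toy, FIRST_ROUND_2024)
print([(r["days_to_election"], r["sponsored_by_self"], r["sponsored_by_rival"])
       for r in sample])
```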