Data

TSE Pesquisa Eleitoral 2024 — registry + relatório PDFs

Source

provider: TSE dadosabertos (https://dadosabertos.tse.jus.br/dataset/pesquisas-eleitorais-2024)
mirror: bi-dropbox:data/TSE/2024/pesquisa_eleitoral/ (CDN blocks direct curl from educloud sandbox; use rclone via dropbox)
ingested: pipelines/politica/source/scrape/ + build/scrape/tse_polls_2024/
access: 14,887 mayoral protocols; ~11,400 with relatório PDF

Coverage

time period: 2024 electoral cycle, all polls registered before 1st round
geographic scope: 26 UFs (no DF — no municipal elections)
population: every registered municipal poll (prefeito / vice / vereador)

Unit of observation

Registry CSVs: one row per NR_PROTOCOLO_REGISTRO
Relatório PDFs: one PDF per protocol with vote-intention tables
Sponsor CSVs: long format, one row per (protocol, role ∈ {contratante, pagante}, sponsor_idx) — up to 6 contratantes per protocol

Key variables

NR_PROTOCOLO_REGISTRO: TSE registry ID (joins everything)
NR_CPF_CNPJ_CONTRATANTE / NR_CPF_CNPJ_PAGANTE: 11-digit CPF or 14-digit CNPJ — joinable to candidato registry (CPF), party directorate (CNPJ via despesa_partidaria), or media/pollster firms
DS_ORIGEM_RECURSO: funding source — Fundo Partidário (537), Doações Eleitorais (156), Recursos Próprios (2,848), Outros (8,806), #NULO# (2,734)
ST_CONTRATANTE_PAGANTE (S/N): commissioner vs payer flag
VR_PAGO_CONTRATANTE: amount paid (BRL)
DT_INICIO_PESQUISA, DT_FIM_PESQUISA: field period
QT_ENTREVISTADO: sample size
ST_PESQUISA_PROPRIA (S/N): pollster-conducted-for-itself flag
NR_CNPJ_EMPRESA, NM_EMPRESA: pollster identity
DS_METODOLOGIA_PESQUISA, DS_PLANO_AMOSTRAL, DS_SISTEMA_CONTROLE, DS_DADO_MUNICIPIO: free-text methodology fields (median length per field between 150 and 1,640 chars); input to the Channel A vs B decomposition once the LLM methodology extractor runs
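
The CPF/CNPJ distinction in NR_CPF_CNPJ_CONTRATANTE / NR_CPF_CNPJ_PAGANTE decides which join is attempted, so a length check is a natural first step. A minimal sketch, assuming the IDs arrive as strings that may carry punctuation or placeholders such as #NULO#; the function name is hypothetical, and it makes no attempt to restore leading zeros lost to a numeric cast:

```python
import re

def classify_sponsor_id(raw):
    """Return (kind, digits) where kind is "cpf", "cnpj", or "unknown".

    Hypothetical helper: strips every non-digit character, then
    classifies by length (11 digits = CPF, 14 digits = CNPJ).
    """
    digits = re.sub(r"\D", "", str(raw or ""))
    if len(digits) == 11:
        return "cpf", digits
    if len(digits) == 14:
        return "cnpj", digits
    return "unknown", digits
```

CPF rows would go to the candidato join and CNPJ rows to the party-directorate or firm lookups; anything classified as "unknown" is better inspected than silently dropped.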

Sample restrictions (for the regression)

2024 mayoral 1st round only (DS_CARGO ∈ {"Prefeito", "Prefeito, Vereador"})
estimulado scenario only (espontaneo/votos_validos used for robustness)
non-aggregate candidate names (drop Branco/Nulo/NS/Não sabe variants)
match_score ≥ 2 (multi-token or stronger registry match)
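
The row-level restrictions above can be sketched as a single predicate. This is an illustration only: the field names scenario_type, candidate_name and match_score follow the assembled-table description, the set of aggregate labels is a made-up stand-in for the real variant list, and the DS_CARGO restriction is assumed to be applied upstream at the protocol level.

```python
import unicodedata

# Illustrative stand-in: the actual list of aggregate variants is not given here.
AGGREGATE_LABELS = {"branco", "nulo", "branco/nulo", "ns", "nao sabe"}

def normalize(name):
    """Lowercase, trim, and strip accents so 'Não sabe' matches 'nao sabe'."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.strip().lower()

def keep_row(row):
    """True if a (protocol, candidate) row passes the regression restrictions."""
    return (
        row["scenario_type"] == "estimulado"
        and normalize(row["candidate_name"]) not in AGGREGATE_LABELS
        and row["match_score"] >= 2
    )
```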

Data quality

Bulk LLM extraction (gpt-4o-mini) over relatório PDFs: 90% of protocols have at least one clean (95–105%) estimulado sub-scenario; see docs/briefs/bulk_extraction_audit.md. Per-state coverage ranges from 58% (TO) to 96% (PB/AC/SC); TO/AP/PA/SE have a higher share of scanned-image PDFs (OCR follow-up queued).
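
A sketch of the cleanliness check, assuming "clean (95–105%)" means that the extracted percentages of one sub-scenario sum to between 95 and 105. That reading is an assumption, and the audit brief is the authority on the actual rule; the function names are hypothetical.

```python
def is_clean(percents, lo=95.0, hi=105.0):
    """True if the percentages of one sub-scenario sum to within [lo, hi]."""
    return lo <= sum(percents) <= hi

def has_clean_scenario(scenarios):
    """True if at least one sub-scenario of a protocol is clean.

    `scenarios` maps a scenario label to its list of extracted percentages.
    """
    return any(is_clean(percents) for percents in scenarios.values())
```

Under this reading, a protocol counts toward the 90% figure as soon as one of its estimulado sub-scenarios passes, even if others were extracted incompletely.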

TSE candidato 2024 (consulta_cand)

Source

provider: TSE dadosabertos
mirror: cleaned to pipelines/politica/build/clean/candidato.csv (1998–2024, all UFs, all offices)
2024 coverage: 17,005 PREFEITO 1st-round rows across 26 UFs

Key variables

cpf: candidate's CPF (Route A join key)
politico_id: persistent person identifier (across cycles)
municipio_id: TSE 5-digit muni code (race identifier)
nome_urna: ballot name as voters see it; the load-bearing field for matching poll-reported candidate names to the registry (the patch that landed 2026-06-02 lifted the match rate from ~64% to ~88%)
votes: candidate vote count in 1st round (vote share denominator is sum over candidates within muni)
party: party abbreviation

Notes

The cleaned panel covers 1998–2024; the 2024 cycle was staged around late May 2026.
municipio_id is rendered as XXXXX.0 (year-as-float cast artifact); strip the .0 suffix before joining.
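
Stripping the suffix can be sketched as below. The zero-padding back to 5 digits is an assumption on top of the note above (a float cast would also drop leading zeros, if any code has them); the function name is hypothetical.

```python
def clean_municipio_id(value):
    """Normalize a municipio_id such as '71072.0' or 71072.0 to '71072'."""
    text = str(value).strip()
    if text.endswith(".0"):
        text = text[:-2]  # drop the float-cast suffix
    # Assumption: restore TSE's 5-digit width if leading zeros were lost.
    return text.zfill(5)
```

Applying the same function to both sides of a join keeps the key type consistent (string on both sides).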

TSE despesa_partidaria 2024

Source

provider: TSE dadosabertos
ingested: pipelines/politica/build/clean/despesa_partidaria.csv

Use

Input to Route C of the sponsor join: maps party-directorate CNPJs to (party, muni, year). Lets us link a CNPJ-sponsored poll to that party's PREFEITO candidate in the muni (1:1 by the electoral-law constraint that each party fields at most one mayoral candidate per municipality).
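
A minimal sketch of the Route C lookup, using plain dictionaries in place of the cleaned CSVs. The function name, argument names and the toy structures are hypothetical; the sketch uses the municipality recorded for the directorate, and checking it against the poll's own municipality would be an extra guard not described here.

```python
def route_c_candidate(sponsor_cnpj, directorates, mayoral_candidates, year=2024):
    """Resolve a sponsor CNPJ to the politico_id of its party's mayoral candidate.

    directorates:       {cnpj: (party, muni_id, year)}, as derived from despesa_partidaria
    mayoral_candidates: {(party, muni_id): politico_id}, at most one per key
    Returns None when the CNPJ is not a known directorate for that year
    or the party has no mayoral candidate in the municipality.
    """
    entry = directorates.get(sponsor_cnpj)
    if entry is None:
        return None
    party, muni_id, entry_year = entry
    if entry_year != year:
        return None
    return mayoral_candidates.get((party, muni_id))
```

Because each party fields at most one mayoral candidate per municipality, keying the candidate lookup on (party, muni_id) cannot return more than one match.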

TSE eleicao + resultados (for the outcome)

Source

provider: TSE dadosabertos
cleaned: pipelines/politica/build/clean/eleicao.csv, plus per-candidate votes in candidato.csv

Use

Compute final_share = candidate_votes / sum(candidate_votes within muni) — the benchmark the poll percent is compared against to construct error = poll_percent - 100 * final_share.
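
The two formulas above, sketched for a single municipality. The function names are hypothetical; poll_percent is on a 0–100 scale and final_share on a 0–1 scale, as in the definition of error.

```python
def final_shares(votes_by_candidate):
    """Map each candidate to votes / total votes within one municipality."""
    total = sum(votes_by_candidate.values())
    if total == 0:
        raise ValueError("no candidate votes recorded for this municipality")
    return {cand: votes / total for cand, votes in votes_by_candidate.items()}

def poll_error(poll_percent, final_share):
    """error = poll_percent - 100 * final_share (percentage points)."""
    return poll_percent - 100 * final_share
```

A positive error means the poll overstated the candidate relative to the 1st-round result; a negative error means it understated them.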

Assembled tables (`build/assemble/`)

Two project-owned parquets, one per unit of observation. Both are browsable on the site under the Data dropdown.

`build/assemble/poll.parquet`

Built by source/assemble/poll.py.
Grain: one row per protocol (~9.5k mayoral polls across 26 UFs).
Columns: protocol, uf, muni_id, municipality, institute, institute_fantasy, pollster_cnpj, pollster_name, st_pesquisa_propria, sample_size, field_end, field_period_week, days_to_election, sponsor_types, poll_is_independent, poll_has_candidate_sponsor.
Sources: poll_response_2024.parquet (per-protocol metadata), poll_sponsor_2024.parquet (sponsor-type classifier), and pesquisa_eleitoral_2024_BRASIL.csv (for ST_PESQUISA_PROPRIA, not carried by the cleaned poll_2024).

`build/assemble/cand_poll.parquet`

Built by source/assemble/cand_poll.py.
Grain: one row per (protocol, politico_id) (~28k rows). The regression-ready table — every analysis under source/analysis/ reads this. Grain is asserted at the end of the build script.
Filters: scenario_type == "estimulado", non-aggregate candidate_name, match_score >= 2. Relatórios with multiple estimulado scenarios are collapsed to a single canonical scenario per protocol: the one with the most distinct candidates, with ties broken alphabetically on scenario_label for determinism. The grain is therefore unique on (protocol, politico_id), and downstream code that runs groupby([protocol, politico_id]) finds exactly one row in each group.
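
The canonical-scenario rule can be sketched in plain Python as follows. This is an illustration of the rule as stated, not the code in source/assemble/cand_poll.py; rows are assumed to be dicts carrying protocol, scenario_label and politico_id, and the function names are hypothetical.

```python
from collections import defaultdict

def pick_canonical_scenarios(rows):
    """Return {protocol: scenario_label} using the stated rule:
    most distinct candidates first, ties broken alphabetically on the label."""
    candidates = defaultdict(set)  # (protocol, scenario_label) -> {politico_id}
    for row in rows:
        candidates[(row["protocol"], row["scenario_label"])].add(row["politico_id"])

    best = {}  # protocol -> (sort_key, scenario_label)
    for (protocol, label), ids in candidates.items():
        key = (-len(ids), label)  # smaller key wins: more candidates, then A-Z
        if protocol not in best or key < best[protocol][0]:
            best[protocol] = (key, label)
    return {protocol: label for protocol, (_, label) in best.items()}

def collapse_to_canonical(rows):
    """Keep only rows from each protocol's canonical scenario and assert the grain."""
    canonical = pick_canonical_scenarios(rows)
    kept = [r for r in rows if canonical[r["protocol"]] == r["scenario_label"]]
    keys = [(r["protocol"], r["politico_id"]) for r in kept]
    assert len(keys) == len(set(keys)), "grain not unique on (protocol, politico_id)"
    return kept
```

The final assertion mirrors the grain check described above: if a candidate appears twice within the chosen scenario, the build should fail loudly rather than pass duplicates to the regressions.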
Joins poll.parquet on protocol so all poll-level fields are available without a second read.
The SP feasibility-probe slice (sp_regressions.py) is this table filtered to uf == "SP"; there is no separate SP parquet.

Companion folder: `docs/data/`

This file stays the concise index. A source with non-obvious manual provenance (e.g., the LLM extraction details, the Route C CNPJ matching) gets a companion docs/data/<source>.md if/when needed.
