Data
TSE Pesquisa Eleitoral 2024 — registry + relatório PDFs
Source
- provider: TSE dadosabertos (https://dadosabertos.tse.jus.br/dataset/pesquisas-eleitorais-2024)
- mirror: bi-dropbox:data/TSE/2024/pesquisa_eleitoral/ (CDN blocks direct curl from educloud sandbox; use rclone via dropbox)
- ingested: pipelines/politica/source/scrape/ + build/scrape/tse_polls_2024/
- access: 14,887 mayoral protocols; ~11,400 with relatório PDF
Coverage
- time period: 2024 electoral cycle, all polls registered before 1st round
- geographic scope: 26 UFs (no DF — no municipal elections)
- population: every registered municipal poll (prefeito / vice / vereador)
Unit of observation
- Registry CSVs: one row per NR_PROTOCOLO_REGISTRO
- Relatório PDFs: one PDF per protocol with vote-intention tables
- Sponsor CSVs: long format, one row per (protocol, role ∈ {contratante, pagante}, sponsor_idx) — up to 6 contratantes per protocol
Key variables
- NR_PROTOCOLO_REGISTRO: TSE registry ID (joins everything)
- NR_CPF_CNPJ_CONTRATANTE / NR_CPF_CNPJ_PAGANTE: 11-digit CPF or 14-digit CNPJ; joinable to candidato registry (CPF), party directorate (CNPJ via despesa_partidaria), or media/pollster firms
- DS_ORIGEM_RECURSO: funding source, with counts: Fundo Partidário (537), Doações Eleitorais (156), Recursos Próprios (2,848), Outros (8,806), #NULO# (2,734)
- ST_CONTRATANTE_PAGANTE (S/N): commissioner vs payer flag
- VR_PAGO_CONTRATANTE: amount paid (BRL)
- DT_INICIO_PESQUISA, DT_FIM_PESQUISA: field period
- QT_ENTREVISTADO: sample size
- ST_PESQUISA_PROPRIA (S/N): pollster-conducted-for-itself flag
- NR_CNPJ_EMPRESA, NM_EMPRESA: pollster identity
- DS_METODOLOGIA_PESQUISA, DS_PLANO_AMOSTRAL, DS_SISTEMA_CONTROLE, DS_DADO_MUNICIPIO: free-text methodology fields (each median 150–1,640 chars); input to the Channel A vs B decomposition once the LLM methodology extractor runs
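The 11-versus-14-digit rule for NR_CPF_CNPJ_CONTRATANTE / NR_CPF_CNPJ_PAGANTE can be sketched as a small routing helper. This is a minimal sketch, not the pipeline's code: the function name `classify_sponsor_id` and its return labels are hypothetical, and stripping punctuation before counting digits is an assumption about how the raw field may be formatted.

```python
import re


def classify_sponsor_id(raw):
    """Return "cpf", "cnpj", or None for a sponsor ID value (hypothetical helper)."""
    # Assumption: values may carry punctuation, e.g. "12.345.678/0001-90".
    digits = re.sub(r"\D", "", str(raw))
    if len(digits) == 11:
        return "cpf"   # a person: candidate for the join to the candidato registry
    if len(digits) == 14:
        return "cnpj"  # an organization: party directorate, media, or pollster firm
    return None        # e.g. "#NULO#" or a malformed value


print(classify_sponsor_id("123.456.789-09"))
print(classify_sponsor_id("12345678000190"))
print(classify_sponsor_id("#NULO#"))
```

If the column is parsed as an integer, leading zeros are lost and the length test misfires, so reading these fields as strings is the safer choice.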
Sample restrictions (for the regression)
- 2024 mayoral 1st round only (DS_CARGO ∈ {"Prefeito", "Prefeito, Vereador"})
- estimulado scenario only (espontaneo/votos_validos used for robustness)
- non-aggregate candidate names (drop Branco/Nulo/NS/Não sabe variants)
- match_score ≥ 2 (multi-token or stronger registry match)
Data quality
- Bulk LLM extraction (gpt-4o-mini) over relatório PDFs: 90% of protocols have at least one clean (95–105%) estimulado sub-scenario; see docs/briefs/bulk_extraction_audit.md.
- Per-state coverage ranges from 58% (TO) to 96% (PB/AC/SC); TO/AP/PA/SE have a higher share of scanned-image PDFs (OCR follow-up queued).
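One plausible reading of the "clean (95–105%)" criterion is that the percentages reported within a sub-scenario sum to between 95 and 105. The sketch below encodes that reading only; it is an assumption, not the audited definition (docs/briefs/bulk_extraction_audit.md is the authority), and the helper name `is_clean_scenario` is hypothetical.

```python
def is_clean_scenario(percents, low=95.0, high=105.0):
    """True when the reported percentages sum to within [low, high].

    Assumption: `percents` holds every row of one sub-scenario table,
    including aggregate rows such as Branco/Nulo or Não sabe.
    """
    return low <= sum(percents) <= high


print(is_clean_scenario([42.0, 31.5, 10.0, 12.5]))  # rows sum to 96.0
print(is_clean_scenario([42.0, 31.5]))              # rows sum to 73.5
```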
TSE candidato 2024 (consulta_cand)
Source
- provider: TSE dadosabertos
- mirror: cleaned to pipelines/politica/build/clean/candidato.csv (1998–2024, all UFs, all offices)
- 2024 coverage: 17,005 PREFEITO 1st-round rows across 26 UFs
Key variables
- cpf: candidate's CPF (Route A join key)
- politico_id: persistent person identifier (across cycles)
- municipio_id: TSE 5-digit muni code (race identifier)
- nome_urna: ballot name as voters see it (patch landed 2026-06-02); the load-bearing field for matching poll-reported candidate names to the registry (lifted match rate from ~64% to ~88%)
- votes: candidate vote count in 1st round (the vote share denominator is the sum over candidates within muni)
- party: party abbreviation
Notes
- The cleaned panel covers 1998–2024 with 2024 staged ~late May 2026.
- municipio_id is rendered as XXXXX.0 (year-as-float cast artifact); strip the .0 suffix before joining.
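The suffix fix for municipio_id can be sketched as a string normalizer applied to the join key before merging. The function name `normalize_municipio_id` is hypothetical; the sketch assumes the value arrives either as a float or as a string that looks like `71072.0`.

```python
def normalize_municipio_id(value):
    """Return municipio_id as a string without a trailing ".0" (hypothetical helper)."""
    text = str(value).strip()
    # Drop only the exact float-cast suffix; other values pass through unchanged.
    if text.endswith(".0"):
        text = text[:-2]
    return text


print(normalize_municipio_id("71072.0"))
print(normalize_municipio_id(71072.0))
print(normalize_municipio_id("71072"))
```

Whether IDs also need zero-padding to five digits depends on how the other side of the join stores them, which this index does not settle.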
TSE despesa_partidaria 2024
Source
- provider: TSE dadosabertos
- ingested: pipelines/politica/build/clean/despesa_partidaria.csv
Use
- Input to Route C of the sponsor join: maps party-directorate CNPJs to (party, muni, year). Lets us link a CNPJ-sponsored poll to that party's PREFEITO candidate in the muni (1:1 by the electoral-law constraint that each party fields at most one mayoral candidate per municipality).
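The two-hop Route C lookup (CNPJ to party directorate, then directorate to that party's mayoral candidate) can be sketched with plain dictionaries. Every name here is hypothetical, and the tuple layouts are assumptions about what the cleaned tables could be reduced to; the real join lives in the pipeline. The index builder raises if the one-candidate-per-party-per-muni premise fails, rather than silently keeping one of the rows.

```python
def index_mayoral_candidates(candidates):
    """Index PREFEITO rows, given as (party, municipio_id, politico_id), by (party, muni)."""
    index = {}
    for party, muni, politico_id in candidates:
        key = (party, muni)
        # The 1:1 link relies on at most one mayoral candidate per party per muni.
        if key in index and index[key] != politico_id:
            raise ValueError(f"more than one mayoral candidate for {key}")
        index[key] = politico_id
    return index


def route_c_link(sponsor_cnpj, directorate_by_cnpj, mayoral_index):
    """Map a sponsor CNPJ to a politico_id, or None when no link exists.

    directorate_by_cnpj: {cnpj: (party, municipio_id, year)}
    """
    directorate = directorate_by_cnpj.get(sponsor_cnpj)
    if directorate is None:
        return None  # CNPJ is not a known party directorate
    party, muni, _year = directorate
    return mayoral_index.get((party, muni))


# Made-up illustrative values, not real registry data.
directorates = {"11111111000111": ("ABC", "71072", 2024)}
mayoral = index_mayoral_candidates([("ABC", "71072", "p-001"), ("XYZ", "71072", "p-002")])
print(route_c_link("11111111000111", directorates, mayoral))
print(route_c_link("99999999000199", directorates, mayoral))
```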
TSE eleicao + resultados (for the outcome)
Source
- provider: TSE dadosabertos
- cleaned: pipelines/politica/build/clean/eleicao.csv, plus per-candidate votes in candidato.csv
Use
- Compute final_share = candidate_votes / sum(candidate_votes within muni), the benchmark the poll percent is compared against to construct error = poll_percent - 100 * final_share.
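The two formulas can be worked through in a few lines. This is a minimal sketch using plain tuples; the function names and the made-up vote counts are illustrative assumptions, while the arithmetic follows the definitions of final_share and error given here.

```python
from collections import defaultdict


def final_shares(rows):
    """rows: iterable of (municipio_id, politico_id, votes).

    Returns {(municipio_id, politico_id): share}, where the denominator is
    the sum of candidate votes within the municipality.
    """
    rows = list(rows)
    totals = defaultdict(int)
    for muni, _politico_id, votes in rows:
        totals[muni] += votes
    return {
        (muni, politico_id): votes / totals[muni]
        for muni, politico_id, votes in rows
        if totals[muni] > 0  # skip municipalities with no recorded votes
    }


def poll_error(poll_percent, final_share):
    """Poll error in percentage points: poll_percent - 100 * final_share."""
    return poll_percent - 100 * final_share


# Made-up votes: two candidates in one municipality.
shares = final_shares([("71072", "p-001", 600), ("71072", "p-002", 400)])
print(shares[("71072", "p-001")])
print(poll_error(55.0, shares[("71072", "p-001")]))
```

A negative error means the poll understated the candidate's final share; a positive one means it overstated it.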
Assembled tables (build/assemble/)
Two project-owned parquets, one per unit of observation. Both are browsable on the site under the Data dropdown.
build/assemble/poll.parquet
- Built by source/assemble/poll.py.
- Grain: one row per protocol (~9.5k mayoral polls across 26 UFs).
- Columns: protocol, uf, muni_id, municipality, institute, institute_fantasy, pollster_cnpj, pollster_name, st_pesquisa_propria, sample_size, field_end, field_period_week, days_to_election, sponsor_types, poll_is_independent, poll_has_candidate_sponsor.
- Sources: poll_response_2024.parquet (per-protocol metadata), poll_sponsor_2024.parquet (sponsor-type classifier), and pesquisa_eleitoral_2024_BRASIL.csv (for ST_PESQUISA_PROPRIA, not carried by the cleaned poll_2024).
build/assemble/cand_poll.parquet
- Built by source/assemble/cand_poll.py.
- Grain: one row per (protocol, politico_id) (~28k rows). This is the regression-ready table; every analysis under source/analysis/ reads it. Grain is asserted at the end of the build script.
- Filters: scenario_type == "estimulado", non-aggregate candidate_name, match_score >= 2. Relatórios with multiple estimulado scenarios are collapsed to a single canonical scenario per protocol (the one with the most distinct candidates; ties broken alphabetically on scenario_label for determinism), so the grain is unique on (protocol, politico_id) and downstream code that runs groupby([protocol, politico_id]) is guaranteed to produce one row per group.
- Joins poll.parquet on protocol so all poll-level fields are available without a second read.
- The SP feasibility-probe slice (sp_regressions.py) is this table filtered to uf == "SP"; there is no separate SP parquet.
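The canonical-scenario rule described in the Filters bullet can be sketched as follows. This is an illustration under assumptions, not the code in source/assemble/cand_poll.py: the function name and the row layout are hypothetical, and "ties broken alphabetically" is read here as picking the alphabetically first scenario_label.

```python
def canonical_scenario(rows):
    """rows: iterable of (protocol, scenario_label, politico_id), estimulado only.

    Returns {protocol: scenario_label}, choosing per protocol the scenario
    with the most distinct candidates.
    """
    by_protocol = {}
    for protocol, label, politico_id in rows:
        by_protocol.setdefault(protocol, {}).setdefault(label, set()).add(politico_id)

    chosen = {}
    for protocol, by_label in by_protocol.items():
        # Sort key: more distinct candidates first, then label ascending (assumed tie-break).
        chosen[protocol] = min(by_label, key=lambda lab: (-len(by_label[lab]), lab))
    return chosen


# Made-up rows: P1 has a clear winner, P2 is a tie on candidate count.
rows = [
    ("P1", "A", "p-001"), ("P1", "A", "p-002"),
    ("P1", "B", "p-001"), ("P1", "B", "p-002"), ("P1", "B", "p-003"),
    ("P2", "X", "p-010"), ("P2", "X", "p-011"),
    ("P2", "C", "p-010"), ("P2", "C", "p-012"),
]
print(canonical_scenario(rows))
```

Because the tie-break depends only on the labels, the selection is the same regardless of input row order, which is what keeps the (protocol, politico_id) grain reproducible across rebuilds.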
Companion folder: docs/data/
This file stays the concise index. A source with non-obvious manual
provenance (e.g., the LLM extraction details, the Route C CNPJ
matching) gets a companion docs/data/<source>.md if/when needed.