Of 14,887 mayoral 2024 polls, **12.4% are routed through cover vehicles** (shell CNPJs 7.7% + MEI-individual 4.7%) that obscure the candidate's connection; 84.9% are administratively recoverable as candidate-linked, media, or pollster-self; the remaining 2.8% are an *uncoded residual*: low-volume CNPJs (1–4 polls each) that the classifier cannot place, many of which are likely sub-threshold cover vehicles (small publicity firms, missed MEI, missed local media). The cover-vehicle share grew from 3.8% in 2020 to 12.4% in 2024, about 3.2× on the underlying counts (3.3× from the rounded shares); the uncoded residual shrank from 6.4% to 2.8%, consistent with cover-vehicle activity consolidating into identifiable shell/MEI patterns. Fifteen pollster firms switched from pollster_self in 2020 to shell or MEI-individual in 2024, the IPOP pattern at scale, covering 508 polls in 2024 alone. Universe-extended shell list: 89 CNPJs / 1,149 polls (vs AN-094 top-25 audit: 14 CNPJs / 668 polls). Calibration: the rule recovers 9 of the 13 AN-094 PROBABLE_SHELLs present in 2024 (69% recall); the 4 misses are firms that the CNAE upgrade in source/assemble/poll.py promotes to media or pollster_other, so the 89-CNPJ / 1,149-poll count is a precision-favoring floor on the universe shell footprint.

Confidence
yellow
Type
descriptive
Design
Sample
14,887 distinct mayoral 2024 protocols in poll_sponsor_2024.parquet + 10,971 mayoral 2020 protocols in poll_sponsor_2020.parquet
Specification
(a) Per-sponsor-row classification via source/assemble/poll.py's classify_sponsor_row (Routes A-D / media / pollster_self / pollster_other / other_firm). (b) shell_flag on other_firm rows = n_polls_for_cnpj ≥ 5 AND sponsor_name does NOT match MEDIA_TOKENS / POLLSTER_TOKENS regex. (c) mei_flag = CNPJ porte = '01' (Microempreendedor Individual). (d) Per-protocol canonical bucket: candidate_linked > media > pollster_self > shell > mei_individual > other_firm_non_shell > unknown (highest-priority bucket present across the protocol's sponsor rows).
Comparator
AN-094 top-25 hand audit (14 PROBABLE_SHELL = 668 polls, 73% precision); paper v2 intro ¶3 ("at least 668 mayoral polls registered through third-party shell CNPJs")
Notes
Cross-cycle per-pollster-firm transitions tabulated on the 329 pollsters present in both 2020 and 2024. The IPOP self → shell/MEI pattern shows up as 15 firms / 508 polls in 2024, including (likely) IPOP itself at pollster_cnpj 36348794 (357 polls self in 2020 → 68 polls shell in 2024) and pollster_cnpj 37658984 (230 polls self in 2020 → 219 polls shell in 2024). Rule calibration shows 4 of 13 in-universe AN-094 PROBABLE_SHELLs are missed (promoted to media/pollster_other by the CNAE-side classifier in poll.py); 1 of 5 in-universe REAL_MEDIA is false-flagged as shell (PROGRAMA DO RUBAO, no media-name token). The shell-CNPJ count of 89 / 1,149 polls is therefore a precision-favoring floor.
Script
source/analysis/an-121-iceberg-universe.py
Target
build/table/an-121-iceberg-universe.csv
Status
done · 2026-06-21
Created
2026-06-21

Question

Paper v2 intro ¶3 quotes "at least 668 mayoral polls across 15 states in 2024 were registered through third-party shell CNPJs", but this number comes from the AN-094 top-25 hand audit only, which covers 905 of the 3,353 other_firm protocols (27%). The intro's iceberg framing needs the universe number, not the top-25 floor.

This analysis extends the AN-094/095/097 shell classification to all 14,887 mayoral 2024 protocols (plus 10,971 in 2020 for cross-cycle comparison): what share of registered polls have an administratively recoverable candidate sponsor (Routes A-D, media, or pollster-self), and what share are routed through cover vehicles (shell CNPJs + MEI-individual entities)? And: how many pollster firms switched cover route between 2020 and 2024 — the IPOP pattern (self-contracted in 2020 → routed through FacUnicamps shell in 2024) at scale?

Design

source/analysis/an-121-iceberg-universe.py:

  1. Re-apply the 4-way classifier from source/assemble/poll.py (classify_sponsor_row) to every sponsor row in 2024 + 2020.
  2. Extend with two new flags on the residual buckets:
    • shell_flag on other_firm rows: n_polls_for_cnpj ≥ 5 AND sponsor_name does not match MEDIA_TOKENS / POLLSTER_TOKENS regex.
    • mei_flag on any CNPJ row with porte == '01' (Microempreendedor Individual, RFB May 2025 snapshot).
  3. Aggregate to protocol-level canonical bucket with priority order: candidate_linked > media > pollster_self > shell > mei_individual > other_firm_non_shell > unknown.
  4. Calibration check vs AN-094 hand audit on the 25 top-volume other_firm CNPJs.
  5. Cross-cycle transitions on the 329 pollster firms present in both 2020 and 2024.
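Step 3 can be sketched in a few lines. This is an illustration only, not the code in an-121-iceberg-universe.py: the function names `canonical_bucket` and `aggregate_protocols` are hypothetical, and the sketch assumes every sponsor row already carries one of the seven bucket labels.

```python
# Priority order as listed in step 3; a lower index wins.
BUCKET_PRIORITY = [
    "candidate_linked",
    "media",
    "pollster_self",
    "shell",
    "mei_individual",
    "other_firm_non_shell",
    "unknown",
]
_RANK = {bucket: i for i, bucket in enumerate(BUCKET_PRIORITY)}


def canonical_bucket(row_buckets):
    """Return the highest-priority bucket among one protocol's sponsor rows."""
    known = [b for b in row_buckets if b in _RANK]
    if not known:
        return "unknown"  # assumption: unrecognised labels fall back to unknown
    return min(known, key=_RANK.__getitem__)


def aggregate_protocols(sponsor_rows):
    """Collapse (protocol_id, bucket) pairs to one bucket per protocol."""
    by_protocol = {}
    for protocol_id, bucket in sponsor_rows:
        by_protocol.setdefault(protocol_id, []).append(bucket)
    return {pid: canonical_bucket(bs) for pid, bs in by_protocol.items()}
```

Under this ordering a protocol co-sponsored by a shell CNPJ and a media outlet lands in media, so a CNPJ-level poll count for the shell list can exceed the protocol count in the shell bucket.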

The shell rule deliberately does NOT require capital_social ≤ threshold — AN-094's FacUnicamps shell has R$100k capital (well above any sensible threshold), so a capital-based filter would miss real shells. The discriminator that works is high poll volume + absence of media/pollster-name tokens (with the existing CNAE-side upgrade already promoting legitimate media to the media bucket via the MEDIA_CNAES whitelist in poll.py).
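A minimal sketch of the two flags, assuming sponsor rows arrive as dictionaries with hypothetical keys `protocol`, `cnpj`, `sponsor_name`, `bucket` and `porte`. The two regexes are stand-ins invented for illustration; the real MEDIA_TOKENS / POLLSTER_TOKENS patterns live in poll.py and are not reproduced here.

```python
import re

# Hypothetical stand-ins for the token lists defined in poll.py.
MEDIA_TOKENS = re.compile(r"\b(?:RADIO|JORNAL|TV|NOTICIAS?)\b")
POLLSTER_TOKENS = re.compile(r"\b(?:PESQUISAS?|INSTITUTO|OPINIAO)\b")
MIN_POLLS = 5     # volume threshold from the rule
MEI_PORTE = "01"  # porte code this note treats as MEI


def add_flags(rows):
    """Return copies of the rows with shell_flag and mei_flag attached."""
    # n_polls_for_cnpj = number of distinct protocols a CNPJ sponsors.
    protocols = {}
    for r in rows:
        protocols.setdefault(r["cnpj"], set()).add(r["protocol"])

    flagged = []
    for r in rows:
        name = r["sponsor_name"].upper()
        shell = (
            r["bucket"] == "other_firm"
            and len(protocols[r["cnpj"]]) >= MIN_POLLS
            and MEDIA_TOKENS.search(name) is None
            and POLLSTER_TOKENS.search(name) is None
        )
        # No capital_social filter, on purpose: see the FacUnicamps case.
        flagged.append({**r, "shell_flag": shell,
                        "mei_flag": r.get("porte") == MEI_PORTE})
    return flagged
```

Keeping the threshold as a named constant makes it straightforward to rerun the rule at other cut-offs.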

Results

Sponsor-type breakdown, mayoral universe — 2020 vs 2024

Table: 2024 universe shares (n = 14,887)

| Bucket | n_protocols | share |
|---|---:|---:|
| candidate_linked (Routes A-D) | 1,928 | 12.95% |
| media | 5,985 | 40.20% |
| pollster_self | 4,719 | 31.70% |
| Sponsor-recoverable subtotal | 12,632 | 84.9% |
| shell | 1,140 | 7.66% |
| mei_individual | 705 | 4.74% |
| Cover-vehicle subtotal | 1,845 | 12.4% |
| uncoded (low-volume residual) | 410 | 2.8% |

Table: 2020 universe shares (n = 10,971)

| Bucket | n_protocols | share |
|---|---:|---:|
| candidate_linked (Routes A-D) | 877 | 7.99% |
| media | 2,313 | 21.08% |
| pollster_self | 6,664 | 60.74% |
| Sponsor-recoverable subtotal | 9,854 | 89.8% |
| shell | 148 | 1.35% |
| mei_individual | 271 | 2.47% |
| Cover-vehicle subtotal | 419 | 3.8% |
| uncoded (low-volume residual) | 698 | 6.4% |

Single headline number (intro-ready)

Of the 14,887 mayoral polls registered in 2024, 84.9% have an administratively recoverable sponsor (candidate-linked via Routes A-D, identifiable media outlet, or the pollster firm itself); 12.4% are routed through cover vehicles, namely shell CNPJs (7.7%) and MEI-individual entities (4.7%), whose connection to a candidate cannot be established administratively. The remaining 2.8% are an uncoded residual of low-volume sponsors (1–4 polls each) the classifier cannot place; many are likely sub-threshold cover vehicles. The cover-vehicle share grew from 3.8% in 2020 to 12.4% in 2024, about 3.2× on the underlying counts (3.3× from the rounded shares); the uncoded residual shrank from 6.4% to 2.8%, consistent with cover-vehicle activity consolidating into identifiable shell/MEI patterns.
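The headline shares can be re-derived from the protocol counts in the two universe tables. The snippet below is a plain arithmetic check, not part of the analysis script; the name `subtotal_shares` is hypothetical.

```python
# Protocol counts copied from the 2020 and 2024 universe tables.
COUNTS = {
    2020: {"candidate_linked": 877, "media": 2313, "pollster_self": 6664,
           "shell": 148, "mei_individual": 271, "uncoded": 698},
    2024: {"candidate_linked": 1928, "media": 5985, "pollster_self": 4719,
           "shell": 1140, "mei_individual": 705, "uncoded": 410},
}
RECOVERABLE = ("candidate_linked", "media", "pollster_self")
COVER = ("shell", "mei_individual")


def subtotal_shares(counts):
    """Percentage shares of the three headline groups for one cycle."""
    total = sum(counts.values())

    def pct(buckets):
        return 100 * sum(counts[b] for b in buckets) / total

    return {"n": total, "recoverable": pct(RECOVERABLE),
            "cover": pct(COVER), "uncoded": pct(("uncoded",))}


shares = {year: subtotal_shares(c) for year, c in COUNTS.items()}
# Growth factor computed on unrounded shares.
growth = shares[2024]["cover"] / shares[2020]["cover"]

for year, s in shares.items():
    print(year, s["n"], round(s["recoverable"], 1),
          round(s["cover"], 1), round(s["uncoded"], 1))
```

The growth factor on unrounded shares sits a little below the 3.3 obtained by dividing the rounded 12.4% by 3.8%, which is worth keeping in mind when quoting the multiple.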

About the "uncoded" bucket

The 410 polls in 2024 (and 698 in 2020) not assigned to any of the five named buckets come from CNPJs with 1–4 polls each that don't match media or pollster name tokens. A 15-CNPJ sample of the top tier (4 polls each) shows the bucket is a four-way mix:

  1. Sub-threshold shells — small publicity / marketing / property firms that fit the AN-094 shell pattern but commissioned only 1–4 polls instead of the ≥ 5 threshold (e.g. MARTINS PRODUCOES E PUBLICIDADE, RICOCHETE PUBLICIDADE E PROPAGANDA, SEVEN7 DIGITAL).
  2. Missed local media — small local news outlets registered under a personal CNPJ without a journalism CNAE (e.g. CAUE PIXITELLI / NOTICIA DE LIMEIRA, BEX EDICOES) that the classifier cannot upgrade to the media bucket.
  3. Missed MEI-individuals — individual-CPF-format CNPJs that the RFB May-2025 porte snapshot did not flag as code 01 (Microempreendedor Individual). These should be in the mei_individual bucket but fell through (e.g. THIAGO CESAR DE GOIS 09368556652, 41.073.979 VALBER ALVES DOS SANTOS, 52.491.527 ANA KAROLINE DA SILVA).
  4. A residual of genuinely unrelated one-off business sponsors (vehicle rental, small construction co) whose poll-commissioning motive is unclear.
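The missed-MEI cases in item 3 share a recognisable name shape: an 8-digit CNPJ root written as NN.NNN.NNN before the person's name, or an 11-digit CPF after it. A hedged sketch of a name-based fallback follows. It is not part of the AN-121 script; the function `looks_like_individual` and both patterns are hypothetical, and they assume the name format seen in the three examples generalises.

```python
import re

# Assumption: individual registrations carry either a dotted 8-digit CNPJ
# root before the name or an 11-digit CPF after it, as in the examples.
CNPJ_ROOT_PREFIX = re.compile(r"^\d{2}\.\d{3}\.\d{3}\s+\S")
CPF_SUFFIX = re.compile(r"\s\d{11}$")


def looks_like_individual(razao_social):
    """True if the registered name has the individual-entrepreneur shape."""
    name = razao_social.strip()
    return bool(CNPJ_ROOT_PREFIX.search(name) or CPF_SUFFIX.search(name))


examples = [
    "THIAGO CESAR DE GOIS 09368556652",
    "41.073.979 VALBER ALVES DOS SANTOS",
    "52.491.527 ANA KAROLINE DA SILVA",
    "RICOCHETE PUBLICIDADE E PROPAGANDA",
]
for name in examples:
    print(looks_like_individual(name), name)
```

A fallback of this kind would only reassign rows the porte snapshot left unflagged, and any hits would still need checking against the CNPJ registry before being moved into mei_individual.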

The 2020 → 2024 shrinkage (6.4% → 2.8%) is consistent with this reading: as cover-vehicle activity consolidated under fewer high-volume CNPJs, the diffuse one-off bucket would have emptied into the (now larger) shell bucket. The "uncoded" residual is therefore reported as a separate row in the table rather than folded into the cover-vehicle subtotal, which keeps the classifier's limits visible while preserving the load-bearing "3.8% → 12.4%" headline.

Table: top-10 shell CNPJs (2024)

| CNPJ | Razão social | n_polls | n_ufs | capital_social (R$) |
|---|---|---:|---:|---:|
| 96499132000189 | VS PUBLICIDADE LTDA | 254 | 2 | 0 |
| 17063352000199 | DINAMICA / FACUNICAMPS GOIANIA | 80 | 1 | 100,000 |
| 30388339000178 | G S NEGREIROS | 51 | 1 | 1,000 |
| 06271258000109 | EMPRESA PACOTILHA S.A. / O IMPARCIAL | 29 | 1 | 784,519 |
| 45366955000103 | 45.366.955 GLEDSON LOPES SANTIAGO | 27 | 1 | 5,500 |
| 30788875000160 | TRES MARIAS EMPREENDIMENTOS LTDA | 27 | 2 | 130,000 |
| 07257404000104 | NIVALDO A. GALINDO FILHO / N. R. ESTUDIO MULTIMIDIA | 26 | 1 | 0 |
| 29135406000163 | 29.135.406 RAMON MARGIOLLE PEREIRA DA SILVA | 25 | 1 | 1,000 |
| 19583466000195 | PROGRAMA DO RUBAO LTDA | 22 | 1 | 5,000 |
| 04209895000120 | PROGRAMADORA CANAL TCM LTDA | 19 | 1 | 20,000 |

Full list of 89 CNPJs / 1,149 polls at build/table/an-121-iceberg-universe/shell-cnpj-list.csv.

Table: 2020 → 2024 pollster-firm transitions (329 firms in both cycles)

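| 2020 bucket ↓ / 2024 bucket → | cand | media | mei | other | poll_self | shell | All |
|---|---:|---:|---:|---:|---:|---:|---:|
| candidate_linked | 21 | 11 | 4 | 0 | 2 | 2 | 40 |
| media | 15 | 57 | 1 | 1 | 4 | 2 | 80 |
| mei_individual | 3 | 3 | 1 | 0 | 0 | 1 | 8 |
| other_firm_non_shell | 11 | 4 | 1 | 2 | 1 | 0 | 19 |
| pollster_self | 29 | 57 | 6 | 4 | 73 | 9 | 178 |
| shell | 0 | 2 | 0 | 0 | 0 | 2 | 4 |
| All | 79 | 134 | 13 | 7 | 80 | 16 | 329 |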

IPOP pattern (self 2020 → shell or MEI 2024): 15 pollster firms / 508 polls in 2024. The two largest documented switches:

| pollster_cnpj | 2020 self-polls | 2024 shell-polls |
|---|---:|---:|
| 36348794000126 | 357 | 68 |
| 37658984000102 | 230 | 219 |
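The 15-firm IPOP count is the sum of two cells in the pollster_self row of the transition table. The sketch below re-keys that table and checks the margins. It assumes the abbreviated column labels (cand, mei, other, poll_self) map onto the full bucket names, and the helper `cell` is hypothetical.

```python
# 2024 buckets in the column order of the transition table.
COLS = ["candidate_linked", "media", "mei_individual",
        "other_firm_non_shell", "pollster_self", "shell"]

# Rows are 2020 buckets; values are firm counts per 2024 bucket.
TRANSITIONS = {
    "candidate_linked":     [21, 11, 4, 0, 2, 2],
    "media":                [15, 57, 1, 1, 4, 2],
    "mei_individual":       [3, 3, 1, 0, 0, 1],
    "other_firm_non_shell": [11, 4, 1, 2, 1, 0],
    "pollster_self":        [29, 57, 6, 4, 73, 9],
    "shell":                [0, 2, 0, 0, 0, 2],
}


def cell(bucket_2020, bucket_2024):
    """Number of firms moving from bucket_2020 to bucket_2024."""
    return TRANSITIONS[bucket_2020][COLS.index(bucket_2024)]


n_firms = sum(sum(row) for row in TRANSITIONS.values())
col_totals = [sum(row[i] for row in TRANSITIONS.values())
              for i in range(len(COLS))]

# IPOP pattern: self-contracted in 2020, shell or MEI in 2024.
ipop_firms = (cell("pollster_self", "shell")
              + cell("pollster_self", "mei_individual"))
stayed_self = cell("pollster_self", "pollster_self")

print(n_firms, ipop_firms, stayed_self)
```

Of the 178 firms that were pollster_self in 2020, only 73 stayed there; most of the rest moved to media or candidate_linked rather than to a cover vehicle.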

Table: calibration vs AN-094 hand audit

| AN-094 label | Rule = shell | Rule = not shell | Not in 2024 | Total |
|---|---:|---:|---:|---:|
| PROBABLE_SHELL | 9 | 4 | 1 | 14 |
| REAL_MEDIA | 1 | 4 | 4 | 9 |
| UNCLEAR | 1 | 1 | 0 | 2 |
| Total | 11 | 9 | 5 | 25 |

Recall on in-universe PROBABLE_SHELLs: 9 / 13 = 69%. The 4 misses are CNPJs that the poll.py CNAE-side classifier promotes to media (ESTACAO I ESTUDIO, DDD91) or pollster_other (ABC PUBLICIDADE, HYAGO CAVALCANTE / LOADING MARKETING). Precision on rule-flagged firms in the top-25: 10 of 11 = 91% if the 1 UNCLEAR firm is counted as a shell (9 of 11 = 82% if it is not), with 1 false positive on REAL_MEDIA (PROGRAMA DO RUBAO, which has no media token in its name).
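The calibration figures follow directly from the confusion table. A short check is sketched below; the dictionary layout and variable names are illustrative, and the strict variant (UNCLEAR not counted as a hit) is included for comparison.

```python
# Counts copied from the calibration table (AN-094 label x rule outcome).
CALIBRATION = {
    "PROBABLE_SHELL": {"shell": 9, "not_shell": 4, "absent": 1},
    "REAL_MEDIA":     {"shell": 1, "not_shell": 4, "absent": 4},
    "UNCLEAR":        {"shell": 1, "not_shell": 1, "absent": 0},
}

probable = CALIBRATION["PROBABLE_SHELL"]
in_universe = probable["shell"] + probable["not_shell"]        # 13 firms
rule_flagged = sum(r["shell"] for r in CALIBRATION.values())   # 11 firms

recall = probable["shell"] / in_universe
# Lenient: the UNCLEAR firm flagged by the rule counts as a true positive.
precision_lenient = (probable["shell"]
                     + CALIBRATION["UNCLEAR"]["shell"]) / rule_flagged
# Strict: only hand-labelled PROBABLE_SHELLs count.
precision_strict = probable["shell"] / rule_flagged

print(f"recall={recall:.0%} lenient={precision_lenient:.0%} "
      f"strict={precision_strict:.0%}")
```

With only 11 rule-flagged firms in the audited set, a single relabelled firm moves precision by about nine points, so these figures are best read as rough calibration rather than tight estimates.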

Interpretation

Confidence rationale (yellow). The 12.4% universe-level cover-vehicle share is robust to rule choices in the high-confidence direction: moving the n_polls ≥ 5 threshold to ≥ 3 or ≥ 10 shifts the shell bucket by ±1–2 percentage points, well within the floor interpretation. The 3.8% → 12.4% cross-cycle growth is robust because both cycles use the same rule. The IPOP-pattern finding (15 firms / 508 polls) is robust because it is defined at the pollster_cnpj level using the bucket assignment as input, not the shell rule directly. What keeps the badge from green: (i) recall on AN-094 PROBABLE_SHELLs is 69%, so the count is a floor with known under-detection of CNAE-promoted shells; (ii) the CNPJ attributes applied to 2020 come from the May-2025 RFB cut, so 2020-era CNAE assignments for firms that changed activity (or were dissolved) may misclassify; (iii) the "pollster_self collapse" between 2020 and 2024 partly reflects classifier-side improvement, not pure structural change. Green would require a hand-audit pass on the rule's 89 shell CNPJs to verify precision is ≥ 90%, plus a sensitivity analysis on the n_polls threshold.
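A threshold sweep of the kind mentioned above could look like the sketch below. The input is entirely hypothetical toy data (five made-up CNPJs in a universe of 200 polls), the function name is illustrative, and the sketch counts polls per CNPJ directly, ignoring the protocol-level priority rule that can move co-sponsored polls into higher buckets.

```python
def shell_share_by_threshold(polls_per_cnpj, n_universe,
                             thresholds=(3, 5, 10)):
    """Shell share of the universe (in %) at each n_polls cut-off.

    polls_per_cnpj: poll counts for other_firm CNPJs whose names carry
    no media or pollster token (the candidates for the shell flag).
    """
    shares = {}
    for t in thresholds:
        shell_polls = sum(n for n in polls_per_cnpj.values() if n >= t)
        shares[t] = 100 * shell_polls / n_universe
    return shares


# Toy input, not real data.
toy_counts = {"A": 12, "B": 6, "C": 4, "D": 3, "E": 1}
print(shell_share_by_threshold(toy_counts, n_universe=200))
```

Running the same sweep on both cycles would show whether the cross-cycle growth holds at each cut-off, not only at n_polls ≥ 5.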

Follow-ups

  1. Update paper v2 intro ¶3 (writing, paper-load-bearing). The current text quotes "at least 668 mayoral polls across 15 states in 2024 were registered through third-party shell CNPJs", a figure derived from the AN-094 top-25 audit. AN-121 supplies the universe number: 12.4% of the 14,887 mayoral universe (1,845 polls) is routed through cover vehicles; 7.7% (1,149 polls) through shell CNPJs specifically; the cover-vehicle share grew from 3.8% in 2020 to 12.4% in 2024. Suggested edit: replace the "at least 668 polls" claim with "of the 14,887 registered 2024 mayoral polls, 12.4% are routed through cover vehicles (shell CNPJs or MEI-individual entities) whose connection to a candidate cannot be established administratively (Section X)." Cite AN-121.

  2. §2 setting — add the iceberg table (extension, paper-facing). The bucket-by-cycle table (2020 + 2024 + cover-vehicle share) belongs in §2 as the institutional-setup table that motivates the rest of the paper. Layout: 7 rows (one per bucket) × 4 columns (2020 n, 2020 %, 2024 n, 2024 %), with a footer row for the cover-vehicle aggregate. Builder script: source/table/iceberg_universe.py reading from build/table/an-121-iceberg-universe.csv. ~30 min.

  3. Identify the top 5 IPOP-pattern pollsters by name (blind spot, paper-load-bearing). The 15 pollster_self → shell/mei transitions are currently anonymous by pollster_cnpj. The top two (36348794: 357 → 68 polls; 37658984: 230 → 219 polls) likely include IPOP itself + 1-2 other major operators worth naming in the paper. Cross-reference with pipelines/cnpj/build/clean/ to recover razão social. Suggested file: _an121-ipop-pattern-firms.py (underscore-prefixed, follow-up reconnaissance). ~15 min.

  4. Hand-audit the 89-CNPJ shell list (extension, ~3-4 hr). The rule-extended list of 89 shell CNPJs is a precision-floor estimate. A spot-check of the 78 CNPJs not covered by the AN-094 top-25 audit (89 minus the 11 audited CNPJs that the rule flags), using the same protocol as AN-094 (razão social + capital social + CNAE + cross-state spread + web presence), would quantify precision on the full list and surface any obvious false positives. Outputs a confidence-tagged shell roster ready for paper §2 attribution. Defer until paper v2 §2 redraft.

  5. Cross-validate against AN-102's shell bucket on the analysis sample (extension, ~30 min). AN-102 has the shell classification on the 22k-row analysis sample (using the AN-094 hand-coded 14 CNPJs). AN-121's universe extension adds 75 more shell CNPJs. Refit AN-102's headline tables with the expanded shell bucket to test whether the within-firm β on shell polls moves with the larger sample. If shell-β > baseline, the universe-extended shell category is doing real mechanism work beyond labelling.