AN-121: Universe-level sponsor-type breakdown with iceberg quantification

Of 14,887 mayoral 2024 polls, **12.4% are routed through cover vehicles** (shell CNPJs 7.7% + MEI-individual 4.7%) that obscure the candidate's connection; 84.9% are administratively recoverable as candidate-linked, media, or pollster-self; the remaining 2.8% are an *uncoded residual* — low-volume CNPJs (1–4 polls each) that the classifier cannot place, many of which are likely sub-threshold cover vehicles (small publicity firms, missed MEI, missed local media). The cover-vehicle share GREW from 3.8% in 2020 to 12.4% in 2024 — a 3.3× increase; the uncoded residual SHRANK from 6.4% to 2.8%, consistent with cover-vehicle activity consolidating into identifiable shell/MEI patterns. 15 pollster firms switched from pollster_self in 2020 to shell or MEI-individual in 2024 — the IPOP pattern at scale, covering 508 polls in 2024 alone. Universe-extended shell list: 89 CNPJs / 1,149 polls (vs AN-094 top-25 audit: 14 CNPJs / 668 polls). Calibration: rule recovers 9 of the 13 AN-094 PROBABLE_SHELLs present in 2024 (69% recall); the 4 misses are firms the source/assemble/poll.py CNAE upgrade promotes to media or pollster_other, so the 89-CNPJ / 1,149-poll count is a precision-favoring FLOOR on the universe shell footprint.

Hypothesis: H13: Shell-contratante polls show larger residual β
Confidence: yellow
Type: descriptive

Design

Sample: 14,887 distinct mayoral 2024 protocols in poll_sponsor_2024.parquet + 10,971 mayoral 2020 protocols in poll_sponsor_2020.parquet
Specification: (a) Per-sponsor-row classification via source/assemble/poll.py's classify_sponsor_row (Routes A-D / media / pollster_self / pollster_other / other_firm). (b) shell_flag on other_firm rows = n_polls_for_cnpj ≥ 5 AND sponsor_name does NOT match MEDIA_TOKENS / POLLSTER_TOKENS regex. (c) mei_flag = CNPJ porte = '01' (Microempreendedor Individual). (d) Per-protocol canonical bucket: candidate_linked > media > pollster_self > shell > mei_individual > other_firm_non_shell > unknown (highest-priority bucket present across the protocol's sponsor rows).
Comparator: AN-094 top-25 hand audit (14 PROBABLE_SHELL = 668 polls, 73% precision); paper v2 intro ¶3 ("at least 668 mayoral polls registered through third-party shell CNPJs")
Notes: Cross-cycle per-pollster-firm transitions tabulated on the 329 pollsters present in both 2020 and 2024. The IPOP self → shell pattern shows up as 15 firms / 508 polls in 2024, including (likely) IPOP itself at pollster_cnpj 36348794 (357 polls self in 2020 → 68 polls shell in 2024) and pollster_cnpj 37658984 (230 polls self in 2020 → 219 polls shell in 2024). Rule calibration shows 4 of 13 in-universe AN-094 PROBABLE_SHELLs are missed (promoted to media/pollster_other by the CNAE-side classifier in poll.py); 1 of 5 in-universe REAL_MEDIA is false-flagged as shell (PROGRAMA DO RUBAO, no media-name token). The shell-CNPJ count of 89 / 1,149 polls is therefore a precision-favoring floor.

Script: source/analysis/an-121-iceberg-universe.py
Target: build/table/an-121-iceberg-universe.csv
Status: done · 2026-06-21
Created: 2026-06-21

Question

Paper v2 intro ¶3 quotes "at least 668 mayoral polls across 15 states in 2024 were registered through third-party shell CNPJs", but this number comes from the AN-094 top-25 hand audit only, which covers 905 of the 3,353 other_firm protocols (27%). The intro's iceberg framing needs the universe number, not the top-25 floor.

This analysis extends the AN-094/095/097 shell classification to all 14,887 mayoral 2024 protocols (plus 10,971 in 2020 for cross-cycle comparison): what share of registered polls have an administratively recoverable candidate sponsor (Routes A-D, media, or pollster-self), and what share are routed through cover vehicles (shell CNPJs + MEI-individual entities)? And: how many pollster firms switched cover route between 2020 and 2024 — the IPOP pattern (self-contracted in 2020 → routed through FacUnicamps shell in 2024) at scale?

Design

source/analysis/an-121-iceberg-universe.py:

Re-apply the 4-way classifier from source/assemble/poll.py (classify_sponsor_row) to every sponsor row in 2024 + 2020.
Extend with two new flags on the residual buckets:
- shell_flag on other_firm rows: n_polls_for_cnpj ≥ 5 AND sponsor_name does not match MEDIA_TOKENS / POLLSTER_TOKENS regex.
- mei_flag on any CNPJ row with porte == '01' (Microempreendedor Individual, RFB May 2025 snapshot).
Aggregate to protocol-level canonical bucket with priority order: candidate_linked > media > pollster_self > shell > mei_individual > other_firm_non_shell > unknown.
Calibration check vs AN-094 hand audit on the 25 top-volume other_firm CNPJs.
Cross-cycle transitions on the 329 pollster firms present in both 2020 and 2024.
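
The protocol-level aggregation step can be sketched as below. This is a minimal illustration of the priority rule as stated in this note, not the code in an-121-iceberg-universe.py; the function names and the toy protocol ids are hypothetical.

```python
# Priority order from the Design section, highest priority first.
BUCKET_PRIORITY = [
    "candidate_linked",
    "media",
    "pollster_self",
    "shell",
    "mei_individual",
    "other_firm_non_shell",
    "unknown",
]
_RANK = {bucket: i for i, bucket in enumerate(BUCKET_PRIORITY)}


def canonical_bucket(row_buckets):
    """Return the highest-priority bucket among a protocol's sponsor rows.

    A protocol with no sponsor rows falls back to "unknown" (an assumption).
    """
    if not row_buckets:
        return "unknown"
    return min(row_buckets, key=_RANK.__getitem__)


def aggregate_protocols(sponsor_rows):
    """Collapse (protocol_id, bucket) sponsor rows to one bucket per protocol."""
    by_protocol = {}
    for protocol_id, bucket in sponsor_rows:
        by_protocol.setdefault(protocol_id, []).append(bucket)
    return {pid: canonical_bucket(buckets) for pid, buckets in by_protocol.items()}


# Toy rows with hypothetical protocol ids (not real data).
rows = [
    ("P-001", "shell"),
    ("P-001", "media"),
    ("P-002", "mei_individual"),
    ("P-002", "shell"),
    ("P-003", "other_firm_non_shell"),
]
result = aggregate_protocols(rows)
print(result)
```

One consequence of the ordering: a protocol that has both a shell row and a media row is counted under media, so protocol-level shell counts can be lower than the number of polls attached to shell-flagged CNPJs.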

The shell rule deliberately does NOT require capital_social ≤ threshold — AN-094's FacUnicamps shell has R$100k capital (well above any sensible threshold), so a capital-based filter would miss real shells. The discriminator that works is high poll volume + absence of media/pollster-name tokens (with the existing CNAE-side upgrade already promoting legitimate media to the media bucket via the MEDIA_CNAES whitelist in poll.py).
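
A minimal sketch of the two flags follows, assuming one sponsor row per poll. The token patterns are placeholders (the real MEDIA_TOKENS / POLLSTER_TOKENS in poll.py are not reproduced in this note), and the CNPJs, names, and the non-'01' porte value are invented for illustration. Note that capital_social plays no role in the rule.

```python
import re
from collections import Counter

# HYPOTHETICAL token patterns; stand-ins for the real lists in poll.py.
MEDIA_TOKENS = re.compile(r"\b(JORNAL|RADIO|TV|NOTICIAS?)\b", re.IGNORECASE)
POLLSTER_TOKENS = re.compile(r"\b(PESQUISAS?|INSTITUTO|OPINIAO)\b", re.IGNORECASE)

SHELL_MIN_POLLS = 5  # n_polls_for_cnpj >= 5, per the Design section
MEI_PORTE = "01"     # porte code treated as MEI in this note


def flag_rows(rows):
    """Return copies of the sponsor rows with shell_flag and mei_flag added.

    Each row is a dict with keys: cnpj, sponsor_name, bucket, porte.
    """
    polls_per_cnpj = Counter(r["cnpj"] for r in rows)
    flagged = []
    for r in rows:
        name = r["sponsor_name"]
        has_token = bool(MEDIA_TOKENS.search(name) or POLLSTER_TOKENS.search(name))
        shell = (
            r["bucket"] == "other_firm"
            and polls_per_cnpj[r["cnpj"]] >= SHELL_MIN_POLLS
            and not has_token
        )
        flagged.append({**r, "shell_flag": shell, "mei_flag": r["porte"] == MEI_PORTE})
    return flagged


def make_rows(cnpj, name, n, porte="05"):
    """Build n identical other_firm sponsor rows (toy data helper)."""
    return [
        {"cnpj": cnpj, "sponsor_name": name, "bucket": "other_firm", "porte": porte}
        for _ in range(n)
    ]


# Invented sponsors: a high-volume publicity firm, a newspaper, a low-volume
# shop, and a single-poll individual registered with porte '01'.
rows = (
    make_rows("00000000000001", "ACME PUBLICIDADE LTDA", 5)
    + make_rows("00000000000002", "JORNAL DA CIDADE LTDA", 6)
    + make_rows("00000000000003", "LOJA EXEMPLO LTDA", 2)
    + make_rows("00000000000004", "FULANO DE TAL", 1, porte="01")
)
flagged = flag_rows(rows)
shell_cnpjs = {r["cnpj"] for r in flagged if r["shell_flag"]}
mei_cnpjs = {r["cnpj"] for r in flagged if r["mei_flag"]}
print(shell_cnpjs, mei_cnpjs)
```

In this toy set only the publicity firm is flagged as shell: the newspaper is excluded by its name token and the shop by volume, which is also how a real shell commissioning fewer than five polls would end up in the uncoded residual.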

Results

Sponsor-type breakdown, mayoral universe — 2020 vs 2024

Table: 2024 universe shares (n = 14,887)

Bucket	n_protocols	share
candidate_linked (Routes A-D)	1,928	12.95%
media	5,985	40.20%
pollster_self	4,719	31.70%
Sponsor-recoverable subtotal	12,632	84.9%
shell	1,140	7.66%
mei_individual	705	4.74%
Cover-vehicle subtotal	1,845	12.4%
uncoded (low-volume residual)	410	2.8%

Table: 2020 universe shares (n = 10,971)

Bucket	n_protocols	share
candidate_linked (Routes A-D)	877	7.99%
media	2,313	21.08%
pollster_self	6,664	60.74%
Sponsor-recoverable subtotal	9,854	89.8%
shell	148	1.35%
mei_individual	271	2.47%
Cover-vehicle subtotal	419	3.8%
uncoded (low-volume residual)	698	6.4%

Single headline number (intro-ready)

Of the 14,887 mayoral polls registered in 2024, 84.9% have an administratively recoverable sponsor (candidate-linked via Routes A-D, identifiable media outlet, or the pollster firm itself); 12.4% are routed through cover vehicles — shell CNPJs (7.7%) and MEI-individual entities (4.7%) — whose connection to a candidate cannot be established administratively. The remaining 2.8% are an uncoded residual of low-volume sponsors (1–4 polls each) the classifier cannot place; many are likely sub-threshold cover vehicles. The cover-vehicle share grew from 3.8% in 2020 to 12.4% in 2024 — a 3.3× increase; the uncoded residual shrank from 6.4% to 2.8%, consistent with cover-vehicle activity consolidating into identifiable shell/MEI patterns.
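
The headline shares can be recomputed from the bucket counts in the two tables above. The sketch below only re-does that arithmetic; the dictionary layout and function name are illustrative and not taken from the analysis script.

```python
# Protocol counts copied from the 2020 and 2024 universe tables.
COUNTS = {
    2020: {"candidate_linked": 877, "media": 2313, "pollster_self": 6664,
           "shell": 148, "mei_individual": 271, "uncoded": 698},
    2024: {"candidate_linked": 1928, "media": 5985, "pollster_self": 4719,
           "shell": 1140, "mei_individual": 705, "uncoded": 410},
}
RECOVERABLE = ("candidate_linked", "media", "pollster_self")
COVER = ("shell", "mei_individual")


def headline(counts):
    """Return universe size and subtotal shares (in percent) for one cycle."""
    total = sum(counts.values())

    def pct(keys):
        return 100 * sum(counts[k] for k in keys) / total

    return {
        "n": total,
        "recoverable_pct": pct(RECOVERABLE),
        "cover_pct": pct(COVER),
        "uncoded_pct": pct(("uncoded",)),
    }


summary = {year: headline(c) for year, c in COUNTS.items()}
growth_ratio = summary[2024]["cover_pct"] / summary[2020]["cover_pct"]

for year, s in summary.items():
    print(year, s["n"], round(s["recoverable_pct"], 1),
          round(s["cover_pct"], 1), round(s["uncoded_pct"], 1))
print("cover-vehicle growth ratio:", round(growth_ratio, 2))
```

From the unrounded counts the 2024/2020 ratio of cover-vehicle shares comes out near 3.24; dividing the rounded shares (12.4 / 3.8) gives about 3.26, which is where the 3.3× figure comes from.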

About the "uncoded" bucket

The 410 polls in 2024 (and 698 in 2020) not assigned to any of the five named buckets come from CNPJs with 1–4 polls each whose names match neither media nor pollster tokens. A 15-CNPJ sample of the top tier (4 polls each) shows the bucket is a mix of three identifiable types plus a residual of unrelated one-off sponsors:

Sub-threshold shells — small publicity / marketing / property firms that fit the AN-094 shell pattern but commissioned only 1–4 polls instead of the ≥ 5 threshold (e.g. MARTINS PRODUCOES E PUBLICIDADE, RICOCHETE PUBLICIDADE E PROPAGANDA, SEVEN7 DIGITAL).
Missed local media — small local news outlets registered under a personal CNPJ without a journalism CNAE (e.g. CAUE PIXITELLI / NOTICIA DE LIMEIRA, BEX EDICOES) that the classifier cannot upgrade to the media bucket.
Missed MEI-individuals — individual-CPF-format CNPJs that the RFB May-2025 porte snapshot did not flag as code 01 (Microempreendedor Individual). These should be in the mei_individual bucket but fell through (e.g. THIAGO CESAR DE GOIS 09368556652, 41.073.979 VALBER ALVES DOS SANTOS, 52.491.527 ANA KAROLINE DA SILVA).
A residual of genuinely unrelated one-off business sponsors (vehicle rental, small construction co) whose poll-commissioning motive is unclear.

The 2020 → 2024 shrinkage (6.4% → 2.8%) is consistent with this reading: as cover-vehicle activity consolidated under fewer high-volume CNPJs, the diffuse one-off bucket emptied into the (now larger) shell bucket. The "uncoded" residual is therefore reported as a separate row in the table rather than folded into the cover-vehicle subtotal, which is honest about classifier limits while preserving the load-bearing "3.8% → 12.4%" headline.

Table: top-10 shell CNPJs (2024)

CNPJ	Razão social	n_polls	n_ufs	capital_social (R$)
96499132000189	VS PUBLICIDADE LTDA	254	2	0
17063352000199	DINAMICA / FACUNICAMPS GOIANIA	80	1	100,000
30388339000178	G S NEGREIROS	51	1	1,000
06271258000109	EMPRESA PACOTILHA S.A. / O IMPARCIAL	29	1	784,519
45366955000103	45.366.955 GLEDSON LOPES SANTIAGO	27	1	5,500
30788875000160	TRES MARIAS EMPREENDIMENTOS LTDA	27	2	130,000
07257404000104	NIVALDO A. GALINDO FILHO / N. R. ESTUDIO MULTIMIDIA	26	1	0
29135406000163	29.135.406 RAMON MARGIOLLE PEREIRA DA SILVA	25	1	1,000
19583466000195	PROGRAMA DO RUBAO LTDA	22	1	5,000
04209895000120	PROGRAMADORA CANAL TCM LTDA	19	1	20,000

Full list of 89 CNPJs / 1,149 polls at build/table/an-121-iceberg-universe/shell-cnpj-list.csv.

Table: 2020 → 2024 pollster-firm transitions (329 firms in both cycles)

2020 bucket ↓ \ 2024 bucket →	cand	media	mei	other	poll_self	shell	All
candidate_linked	21	11	4	0	2	2	40
media	15	57	1	1	4	2	80
mei_individual	3	3	1	0	0	1	8
other_firm_non_shell	11	4	1	2	1	0	19
pollster_self	29	57	6	4	73	9	178
shell	0	2	0	0	0	2	4
All	79	134	13	7	80	16	329

IPOP pattern (self 2020 → shell or MEI 2024): 15 pollster firms / 508 polls in 2024. The two largest documented switches:

pollster_cnpj	2020 self-polls	2024 shell-polls
36348794000126	357	68
37658984000102	230	219
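
The firm counts behind the IPOP pattern can be read directly off the transition table. The sketch below re-enters the table by hand and checks the margins; it is an arithmetic illustration only, and the variable and function names are not from the analysis script.

```python
# Columns of the transition table (2024 bucket), in the order printed above.
BUCKETS_2024 = ["cand", "media", "mei", "other", "poll_self", "shell"]

# Rows keyed by 2020 bucket; values are firm counts per 2024 bucket.
TRANSITIONS = {
    "candidate_linked":     [21, 11, 4, 0, 2, 2],
    "media":                [15, 57, 1, 1, 4, 2],
    "mei_individual":       [3, 3, 1, 0, 0, 1],
    "other_firm_non_shell": [11, 4, 1, 2, 1, 0],
    "pollster_self":        [29, 57, 6, 4, 73, 9],
    "shell":                [0, 2, 0, 0, 0, 2],
}


def cell(bucket_2020, bucket_2024):
    """Number of firms moving from a 2020 bucket to a 2024 bucket."""
    return TRANSITIONS[bucket_2020][BUCKETS_2024.index(bucket_2024)]


n_firms = sum(sum(row) for row in TRANSITIONS.values())
col_totals = [sum(row[i] for row in TRANSITIONS.values())
              for i in range(len(BUCKETS_2024))]

# IPOP pattern: pollster_self in 2020, shell or MEI-individual in 2024.
ipop_firms = cell("pollster_self", "shell") + cell("pollster_self", "mei")
ipop_share_pct = 100 * ipop_firms / n_firms

print(n_firms, col_totals, ipop_firms, round(ipop_share_pct, 1))
```

The 15 firms split into 9 that moved to shell and 6 that moved to MEI-individual; the table alone does not give the poll counts, so the 508-poll figure cannot be re-derived this way.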

Table: calibration vs AN-094 hand audit

AN-094 label	Rule = shell	Rule = not shell	Not in 2024	Total
PROBABLE_SHELL	9	4	1	14
REAL_MEDIA	1	4	4	9
UNCLEAR	1	1	0	2
Total	11	9	5	25

Recall on in-universe PROBABLE_SHELLs: 9 / 13 = 69%. The 4 misses are CNPJs the poll.py CNAE-side classifier promotes to media (ESTACAO I ESTUDIO, DDD91) or pollster_other (ABC PUBLICIDADE, HYAGO CAVALCANTE / LOADING MARKETING). Precision on rule-flagged firms in the top-25: 9 + 1 (UNCLEAR) of 11 = 91%, with 1 false positive on REAL_MEDIA (PROGRAMA DO RUBAO — no media token in name).
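
The recall and precision figures follow from the calibration table. The sketch below re-derives them; the stricter precision variant (counting the UNCLEAR firm as a false positive) is an added illustration, not a number reported by the analysis.

```python
# Calibration table: AN-094 label -> rule outcome counts among the top-25 CNPJs.
CALIBRATION = {
    "PROBABLE_SHELL": {"shell": 9, "not_shell": 4, "absent": 1},
    "REAL_MEDIA":     {"shell": 1, "not_shell": 4, "absent": 4},
    "UNCLEAR":        {"shell": 1, "not_shell": 1, "absent": 0},
}

probable = CALIBRATION["PROBABLE_SHELL"]

# Recall: share of in-universe PROBABLE_SHELLs that the rule flags.
in_universe_shells = probable["shell"] + probable["not_shell"]
recall = probable["shell"] / in_universe_shells

# Precision among rule-flagged firms in the top-25.
flagged = sum(row["shell"] for row in CALIBRATION.values())
precision_lenient = (probable["shell"] + CALIBRATION["UNCLEAR"]["shell"]) / flagged
precision_strict = probable["shell"] / flagged  # UNCLEAR treated as a miss

print(f"recall = {recall:.0%}")
print(f"precision (UNCLEAR not counted as false positive) = {precision_lenient:.0%}")
print(f"precision (UNCLEAR counted as false positive) = {precision_strict:.0%}")
```

The 91% in the text corresponds to the lenient variant, where the only false positive is the one REAL_MEDIA firm.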

Interpretation

The iceberg framing is documented at universe scale. 12.4% of the 2024 mayoral universe (1,845 polls) is routed through cover vehicles whose connection to a candidate is not administratively recoverable; counting the 410-poll uncoded residual as well would raise this to 15.1% (2,255 polls). The universe-extended shell list alone (1,149 polls) is about 1.7× the paper v2 intro's "668 polls" floor, and that is before counting MEI-individual entities (an additional 705 polls / 4.7% that the original framing does not separate out).
The cover-vehicle share is growing. 3.8% in 2020 → 12.4% in 2024 (an absolute increase of 8.6 pp, a 3.3× increase). With the uncoded residual folded in, the figures are 10.2% → 15.1% (4.9 pp, ~50% relative growth). This is consistent with the substantive story the paper v2 intro tells about IPOP routing through FacUnicamps, but documents it at population scale rather than for a single case.
The IPOP pattern is not unique. 15 distinct pollster firms switched from pollster_self in 2020 to shell or MEI-individual in 2024, which is 4.6% of the 329 pollsters present in both cycles. These 15 firms ran 508 polls in 2024 under cover routes. The top two switching pollsters alone ran 287 polls under shell in 2024 (pollster_cnpj 36348794 = 68 polls; 37658984 = 219 polls). A second-order check would identify these pollsters by name and cross-reference against the IPOP / Quaest / Datafolha tier (deferred; see Follow-ups).
Composition shift between 2020 and 2024. Beyond cover-vehicle growth, the sponsor-type breakdown changed substantially: pollster_self fell from 60.7% to 31.7%; media grew from 21.1% to 40.2%; candidate-linked grew from 8.0% to 13.0%. Part of the shift reflects the AN-094 CNAE-side upgrade catching media sponsors better in 2024 than in 2020 (the May-2025 RFB snapshot's CNAE coverage is better aligned with 2024 sponsor CNPJs than with 2020's). But the pollster_self → cover-vehicle transition (15 firms, 508 polls) is real and would survive a methodology recalibration.
The shell count is a precision-favoring floor. Rule recall on AN-094 PROBABLE_SHELLs is 69%; the 4 misses are firms the CNAE upgrade in poll.py promotes out of other_firm. The true shell universe in 2024 is therefore at least 89 CNPJs + the 4 known misses = 93+ CNPJs, and at least 1,149 + ~130 (the 4 misses' polls) ≈ 1,280 polls. The headline 12.4% cover-vehicle share is similarly a lower bound; the upper bound would tighten with a hand audit of the remaining 65+ other_firm singletons.

Confidence rationale (yellow). The 12.4% universe-level cover-vehicle share is robust to rule choices in the high-confidence direction: moving the n_polls ≥ 5 threshold to ≥ 3 or ≥ 10 shifts the shell bucket by ±1-2 percentage points, well within the floor interpretation. The 3.8% → 12.4% cross-cycle growth is robust because both cycles use the same rule. The IPOP-pattern finding (15 firms / 508 polls) is robust because it is defined at the pollster-cnpj level using the bucket assignment as input, not the shell rule directly. What keeps the badge from green: (i) recall on AN-094 PROBABLE_SHELLs is 69%, so the count is a floor with known under-detection of CNAE-promoted shells; (ii) the 2020 cycle is classified with the May-2025 RFB snapshot, so CNAE assignments for firms that changed activity (or were dissolved) since 2020 may misclassify; (iii) the "pollster_self collapse" between 2020 and 2024 partly reflects classifier-side improvement, not pure structural change. Green would require a hand-audit pass on the rule's 89 shell CNPJs to verify precision is ≥ 90%, plus a sensitivity analysis on the n_polls threshold.
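
A threshold sensitivity check of the kind mentioned above could look like the sketch below. It is a simplified illustration under stated assumptions: the per-CNPJ poll counts and universe size are invented, only the volume condition of the shell rule is varied, and the name-token filter and protocol-level priority step are ignored.

```python
def shell_polls(polls_per_cnpj, threshold):
    """Polls attributed to CNPJs at or above the volume threshold."""
    return sum(n for n in polls_per_cnpj.values() if n >= threshold)


def sensitivity(polls_per_cnpj, n_universe, thresholds=(3, 5, 10)):
    """Shell share of the universe (in percent) at each candidate threshold."""
    return {
        t: 100 * shell_polls(polls_per_cnpj, t) / n_universe
        for t in thresholds
    }


# HYPOTHETICAL poll counts for other_firm CNPJs without media/pollster tokens.
toy_counts = {
    "cnpj_a": 40,
    "cnpj_b": 12,
    "cnpj_c": 7,
    "cnpj_d": 4,
    "cnpj_e": 3,
    "cnpj_f": 1,
}
shares = sensitivity(toy_counts, n_universe=1000)
for threshold, share in shares.items():
    print(f"n_polls >= {threshold}: shell share = {share:.1f}%")
```

Because a few high-volume CNPJs dominate the toy counts, the share moves little across thresholds; whether the real shell list behaves the same way is what the full sensitivity analysis would need to establish.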

Follow-ups

Update paper v2 intro ¶3 (writing, paper-load-bearing). The current text quotes "at least 668 mayoral polls across 15 states in 2024 were registered through third-party shell CNPJs", derived from the AN-094 top-25 audit. AN-121 supplies the universe number: 12.4% of the 14,887 mayoral universe (1,845 polls) is routed through cover vehicles; 7.7% through shell CNPJs specifically (1,140 protocols by canonical bucket; 1,149 polls on the 89-CNPJ shell list); the cover-vehicle share grew from 3.8% in 2020 to 12.4% in 2024. Suggested edit: replace the "at least 668 polls" claim with "of the 14,887 registered 2024 mayoral polls, 12.4% are routed through cover vehicles (shell CNPJs or MEI-individual entities) whose connection to a candidate cannot be established administratively (Section X)." Cite AN-121.
§2 setting — add the iceberg table (extension, paper-facing). The bucket-by-cycle table (2020 + 2024 + cover-vehicle share) belongs in §2 as the institutional-setup table that motivates the rest of the paper. Layout: 7 rows (one per bucket) × 4 columns (2020 n, 2020 %, 2024 n, 2024 %), with a footer row for the cover-vehicle aggregate. Builder script: source/table/iceberg_universe.py reading from build/table/an-121-iceberg-universe.csv. ~30 min.
Identify the top 5 IPOP-pattern pollsters by name (blind spot, paper-load-bearing). The 15 pollster_self → shell/mei transitions are currently anonymous by pollster_cnpj. The top two (36348794: 357 → 68 polls; 37658984: 230 → 219 polls) likely include IPOP itself + 1-2 other major operators worth naming in the paper. Cross-reference with pipelines/cnpj/build/clean/ to recover razão social. Suggested file: _an121-ipop-pattern-firms.py (underscore-prefixed, follow-up reconnaissance). ~15 min.
Hand-audit the 89-CNPJ shell list (extension, ~3-4 hr). The rule-extended list of 89 shell CNPJs is a precision-floor estimate. A spot-check of the 70 CNPJs beyond the AN-094 audited top-25 — same protocol as AN-094: razão social + capital social + CNAE + cross-state spread + web presence — would quantify precision on the full list and surface any obvious false-positives. Outputs a confidence-tagged shell roster ready for paper §2 attribution. Defer until paper v2 §2 redraft.
Cross-validate against AN-102's shell bucket on the analysis sample (extension, ~30 min). AN-102 has the shell classification on the 22k-row analysis sample (using the AN-094 hand-coded 14 CNPJs). AN-121's universe extension adds 75 more shell CNPJs. Refit AN-102's headline tables with the expanded shell bucket to test whether the within-firm β on shell polls moves with the larger sample. If shell-β > baseline, the universe-extended shell category is doing real mechanism work beyond labelling.