id: an-032
hypothesis: bayesian-cluster-selection
headline: "SUPERSEDED by [[an-125]] (2026-06-23): the 'reverse-sign' headline below was a clustering artifact. The 22 muni-rich pairs (n_LV_bairros ≥ 50) come from only 2 unique sponsored polls (HELIO LOPES PSDB-Anápolis, ROGERIO CORREIA PT-BH), each re-paired against 11 independents; naive paired-t treats them as 22 independent draws when the effective n is 2. With cluster-bootstrap CI by sponsored_protocol, the result is null at every threshold where the bootstrap is meaningful (thr=0: 21 clusters, CI [−0.46, +0.34] pp; thr=11: 12 clusters, CI [−0.68, +0.49] pp; thr=20: 5 clusters, CI [−0.45, +0.90] pp). A sharper candidate-own variant (poll's bairros weighted by the sponsoring candidate's OWN 2020 mayoral vote share rather than the party's; n=34 pairs from 18 unique sponsored polls) is also null: Δ = +0.15 pp, cluster-BB CI [−0.49, +1.22] pp. Original (pre-clustering) headline preserved below for the audit trail. [Reverse-sign result on the cleanest subset. On the 22 muni-rich pairs (n_LV_bairros ≥ 50; mostly BH and other large state capitals), within-pair contrast (sponsored − independent oversample-index for the sponsor's party's 2020 strongholds) = −0.0029, t = −5.3, p < 0.001; sign test: only 2 of 22 positive (binomial p = 0.0001). The direction is opposite the Channel A 'Bayesian cluster selection' prediction: sponsored polls' bairro selection is less tilted toward the sponsor's party's prior strongholds than the matched independent poll's bairro selection, not more. On the wider 42-pair usable set the contrast is slightly negative and not significant (t = −0.85, p = 0.40) but the sign test still favors negatives (12/42 positive, binomial p = 0.008). Match rate is the binding constraint: median 0% per side because TSE local_votacao bairros are polling-station-location bairros (1-2 per small muni), not residential bairros (24+ per poll). Muni-rich restriction (≥50 LV bairros) keeps the matchable subset.]"
type: descriptive
question: "Within a sponsored × independent pair on the same muni × same candidate × ±14 days, does the sponsored poll's bairro selection over-represent the sponsor's party's prior-cycle (2020) strongholds relative to the matched independent poll? A positive paired contrast would be direct Channel A 'Bayesian cluster selection' evidence."
tags: ["hyp:bayesian-cluster-selection", channel-a, mechanism, secao-oversampling, paired, an-019-followup, curated-pairs, null-result, superseded]
status: interpreted-superseded
status_date: 2026-06-23
superseded_by: an-125
confidence: yellow
created: 2026-06-02
script: source/analysis/an-032-secao-oversampling-paired.py
target: build/table/secao_oversampling_paired.csv
design:
  sample: "224 pairs (of 244) in build/llm/curated_pairs/pairs_with_extractions.parquet with bairro extractions on both sides AND a sponsoring-candidate party assigned. Within-pair design: same muni × same candidate × ±14 days. Prior-cycle baseline: 2020 prefeito seção-level vote shares from pipelines/politica/build/clean/votacao_secao_2020.parquet. Seção-to-bairro mapping from eleitorado_local_votacao.csv."
  specification: "For each pair, compute oversample_index = weighted_party_share − muni_party_share for sponsored and independent polls separately (party = sponsor's party). Within-pair contrast = oversample_sp − oversample_ind. Paired-t test against zero. Sample-size weighting: per-bairro n_entrevistas when LLM extracted it, else uniform-by-elector-count within matched seções."
  comparator: independent poll on same muni × same candidate × within 14 days
  cluster: pair (paired-t handles by construction)
  weights: per-bairro n_entrevistas where extracted; uniform-elector otherwise
AN-032: Paired sponsored × independent oversample-index test
Question
Channel A's strongest specific prediction is Bayesian cluster selection: sponsors choose pollsters who choose poll designs that oversample bairros where the sponsor's party performed well in the prior cycle. If the prediction is right, comparing a sponsored poll to an independent poll of the same candidate in the same race within a two-week window should show the sponsored poll's bairros tilting toward the sponsor's prior strongholds, while the independent poll's bairros should not.
The _secao_oversampling_scope.py pilot (2026-06-02) showed (a)
the bairro-to-TSE-local_votacao matching machinery works,
(b) within-muni seção-level party-share variance is substantial
(sd ≈ 0.075), and (c) the oversample_index placebo on three
well-matched independent polls is ≈ 0 across the board (max
|index| = 0.003). Machinery validated.
The curated_pairs dataset (build/llm/curated_pairs/pairs_with_extractions.parquet)
already has 244 sponsored × independent pairs with bairro
extractions on both sides. 224 have a sponsoring-candidate party
recorded. This is the natural identifying sample for the test —
much stronger than the 1 sponsored poll in the methodology pilot.
Design
source/analysis/an-032-secao-oversampling-paired.py:
- Load the 224 well-extracted pairs from build/llm/curated_pairs/pairs_with_extractions.parquet.
- For each pair, parse the sponsored and independent bairro JSON lists. Each bairro can include n_entrevistas (LLM-extracted), used as the bairro weight in the weighted-share calculation when present.
- Match poll bairros to TSE eleitorado_local_votacao.csv bairros within the pair's muni_id (normalized string match, exact then substring fallback).
- For each matched bairro, look up the bairro's seções; for each seção, get prior-cycle (2020) prefeito party-share from votacao_secao_2020.parquet.
- Compute:
  - weighted_party_share(poll, sponsor_party) = sum over matched bairros of (bairro_weight × bairro's seção-weighted sponsor-party share)
  - muni_party_share(muni, sponsor_party) = sponsor party's muni-wide weighted 2020 share
  - oversample_index(poll) = weighted_party_share − muni_party_share
- Per pair: contrast = oversample_sp − oversample_ind.
- Paired-t test on contrast against zero. Sign test for robustness.
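The index and contrast definitions above can be sketched in a few lines. This is a minimal illustration, not the code in an-032-secao-oversampling-paired.py: the Bairro container and the example numbers are hypothetical, and it assumes bairro weights are normalized to sum to 1 within each poll (the note does not say whether the script normalizes).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bairro:
    """Hypothetical container for one matched bairro of a poll."""
    weight: float       # n_entrevistas if extracted, else an elector-count weight
    party_share: float  # seção-weighted 2020 sponsor-party share in this bairro


def weighted_party_share(bairros: List[Bairro]) -> float:
    # Assumption: weights are normalized within the poll.
    total = sum(b.weight for b in bairros)
    if total <= 0:
        raise ValueError("poll has no positive bairro weights")
    return sum(b.weight * b.party_share for b in bairros) / total


def oversample_index(bairros: List[Bairro], muni_party_share: float) -> float:
    return weighted_party_share(bairros) - muni_party_share


def pair_contrast(sponsored: List[Bairro], independent: List[Bairro],
                  muni_party_share: float) -> float:
    # contrast = oversample_sp - oversample_ind
    return (oversample_index(sponsored, muni_party_share)
            - oversample_index(independent, muni_party_share))


# Made-up pair: the sponsored poll leans on the 40 % bairro, the independent
# poll leans on the 20 % bairro; the muni-wide share is 30 %.
sponsored = [Bairro(30, 0.40), Bairro(10, 0.20)]
independent = [Bairro(10, 0.40), Bairro(30, 0.20)]
contrast = pair_contrast(sponsored, independent, muni_party_share=0.30)
```

Because both polls in a pair share the same muni and the same sponsor party, muni_party_share cancels in the contrast: it equals the difference of the two weighted_party_share values.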
Results

Sample and match rates
| | Value |
|---|---|
| Pairs in curated_pairs (with both-side bairros + sponsor party) | 224 |
| Pairs with usable oversample_index on BOTH sides | 42 (18.8 %) |
| Median bairro-match share, sponsored side | 0 % |
| Median bairro-match share, independent side | 0 % |
| Most pairs cluster in two munis | BH (12 pairs), Anápolis (10 pairs) |
The match rate is the binding constraint. LLM-extracted bairro strings (e.g., "Itaguai I, II, III", "Centro Sul", "Aarão Reis") often don't match TSE eleitorado_local_votacao's coarser labels (e.g., "Itaguai", "Centro"). Exact normalized matching gave 28 usable pairs; the substring fallback lifted this to 42. The structural mismatch remains: small-muni LV coverage has 1-5 bairros while polls list 10-50 granular ones.
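A minimal sketch of an exact-then-substring matcher, using the example strings from the paragraph above. The normalization rules (accent stripping, lowercasing, punctuation collapse) and the function names are assumptions for illustration; the script's actual normalization may differ.

```python
import re
import unicodedata
from typing import Iterable, Optional, Tuple


def normalize(name: str) -> str:
    """Strip accents, lowercase, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def match_bairro(poll_bairro: str,
                 lv_bairros: Iterable[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (matched LV label, method), or (None, None) if nothing matches."""
    target = normalize(poll_bairro)
    if not target:
        return None, None
    lookup = {normalize(lv): lv for lv in lv_bairros}
    if target in lookup:
        return lookup[target], "exact"
    # Substring fallback: try longer LV labels first so the most specific wins.
    for norm, lv in sorted(lookup.items(), key=lambda kv: -len(kv[0])):
        if norm and (norm in target or target in norm):
            return lv, "substring"
    return None, None


lv_labels = ["Itaguai", "Centro"]
for poll_label in ["Itaguai I, II, III", "Centro Sul", "Aarão Reis", "CENTRO"]:
    print(poll_label, "->", match_bairro(poll_label, lv_labels))
```

In this sketch "Centro Sul" falls back to "Centro", which shows the cost of the substring step: it raises the match count but can merge distinct residential bairros into one polling-station bairro.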
Headline paired test
On the 42 usable pairs (any LV richness):
| Statistic | Value |
|---|---|
| Mean sponsored oversample_index | −0.0019 |
| Mean independent oversample_index | −0.0007 |
| Mean contrast (sp − ind) | −0.0012 |
| SD contrast | 0.0091 |
| Paired-t | −0.85 |
| Two-sided p | 0.40 |
| Sign test: positive pairs | 12 / 42 |
| Binomial p vs 50/50 | 0.008 |
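As a sanity check, the paired-t statistic and the sign-test p-value in the table can be re-derived from its summary rows using only the standard library. The sketch assumes the sign test is two-sided and computed by doubling the smaller binomial tail; the note does not state which variant the script uses.

```python
import math


def paired_t_from_summary(mean_diff: float, sd_diff: float, n: int) -> float:
    """t = mean / (sd / sqrt(n)) for within-pair differences."""
    return mean_diff / (sd_diff / math.sqrt(n))


def sign_test_two_sided(n_positive: int, n: int) -> float:
    """Exact binomial sign test against p = 0.5 (doubles the smaller tail)."""
    k = min(n_positive, n - n_positive)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


t_all = paired_t_from_summary(mean_diff=-0.0012, sd_diff=0.0091, n=42)
p_all = sign_test_two_sided(n_positive=12, n=42)
print(f"{t_all:.2f}")  # -0.85
print(f"{p_all:.3f}")  # 0.008
```

Both values agree with the table. Note that these formulas treat the 42 contrasts as independent draws, which is the assumption the superseding note in the headline calls into question.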
The mean contrast is small but the sign test rejects 50/50 in the negative direction. Restricting to muni-rich munis (where LV bairro coverage is rich enough to make the match informative) sharpens the result dramatically:
Muni-richness sensitivity (the cleanest read)
| Subset (n_LV_bairros ≥) | n pairs | mean contrast | paired-t | p | sign (pos / N) | binom p |
|---|---|---|---|---|---|---|
| 0 (all usable) | 42 | −0.0012 | −0.85 | 0.40 | 12 / 42 | 0.008 |
| 11 | 32 | −0.0019 | −1.03 | 0.31 | 8 / 32 | 0.007 |
| 20 | 25 | −0.0015 | −1.64 | 0.11 | 5 / 25 | 0.004 |
| ≥ 50 (large munis with rich LV) | 22 | −0.0029 | −5.28 | < 0.001 | 2 / 22 | 0.0001 |
The signal sharpens as we restrict to munis where the bairro-string match is informative. On the cleanest 22-pair subset (large munis, mostly state capitals with rich LV coverage), the result is a precisely-estimated negative contrast with t = −5.3 — sponsored polls' bairros are systematically LESS tilted toward the sponsor's party's prior strongholds than the matched independent poll's bairros, in 20 of 22 pairs.
The interpretation is no longer just "null"; it's reversed-sign.
Interpretation
The structural unit-mismatch problem
The match-rate constraint is structural, not a string-matching bug.
TSE eleitorado_local_votacao "bairro" = the bairro where a polling
station SITS (a school, public building). Poll PDF "bairro" = the
RESIDENTIAL bairro where interviewers worked. In big munis these
align — most bairros have their own polling stations. In small
munis a single school serves the whole town, so LV has 1-2 bairros
while polls list 24+ residential ones.
The diagnostic:
| Muni size (electors) | # munis | Median LV bairros | Typical poll bairros |
|---|---|---|---|
| < 10k | 2,924 | 2 | ~ 24 |
| 10-50k | 2,149 | 8 | ~ 24 |
| 50-200k | 393 | 24 | ~ 24 |
| 200-500k | 54 | 48 | ~ 24 |
| > 500k | 49 | 80 | ~ 24 |
Below ~50k voters the bairro-string match cannot work without a finer geographic unit — the LV granularity isn't there. Hence the muni-rich restriction; the ≥50-LV subset is what the test can sharply identify.
Why the sign is negative
Two readings of the −0.003 contrast survive the data:
(R1) Bairro selection is a credibility signal, not a slant lever. Sponsored polls actively avoid stronghold concentration. If a sponsor pays for a poll showing their candidate strong, the poll's methodology must look defensible to the consumers (media, opponents, TSE itself). Concentrating bairros on the sponsor's prior strongholds would be cherry-picking and obviously visible in the registered methodology. Instead, the pollster includes a broader geographic spread — neutral / mixed / opposing-stronghold bairros — and slants through less visible methodology channels: population frame (mixed vs TSE-eligible; finding-paired, AN-019), quota variable choice (AN-022), audit-opacity (AN-021), coverage_class shift (AN-024). The bairro list is visible in the PDF; the population frame and audit rates are buried in the narrative text.
(R2) Independent media polls concentrate on electoral "hotspots" — bairros where the leading candidate's party is strongest, where the political story is happening. Sponsored polls of the same candidate cover more representatively because they need to defend a non-narrative methodology to TSE. This is the same explanation viewed from the independent-poll side: media polls oversample strongholds because that's where the news is.
Both readings predict the same observed pattern (sp < ind on sponsor-party oversample-index) and the data cannot fully distinguish them. The substantive consequence is the same either way: bairro selection does not carry the +7-8 pp sponsor effect.
Refined Channel A story
Combined with AN-019 (coverage_class × sponsor), AN-021 (audit-pct × sponsor), AN-022 (methodology-completeness × sponsor), AN-024 (deferral × sponsor), AN-032 sharpens the Channel A mechanism characterization:
| Lever | Direction | Evidence |
|---|---|---|
| Population reference (mixed vs TSE-eligible) | sp > ind | finding-paired, AN-019 |
| Census-setor as cluster frame | sp > ind | finding-paired |
| Coverage class shift (urban-only, deferred) | sp > ind | AN-019, AN-024 |
| Methodology completeness (audit, training) | sp < ind | AN-021, AN-022 |
| Quota variable mix | mixed | AN-022 |
| Partisan bairro/stronghold selection | sp < ind | AN-032 |
The Channel A levers that DO carry sponsor slant are population frame and operational opacity (low audit, deferred coverage, less-complete methodology). The bairro list does flex (sponsored polls list fewer or different bairros, per finding-paired), but the partisan composition of the bairro list does not — sponsored polls' bairros are more geographically representative of the muni's full partisan map than independent polls'.
Explicit caveats
- Muni-rich (≥50 LV) subset is 22 pairs, heavily concentrated in state capitals and major metros (BH, Anápolis, etc.). The result generalizes most cleanly to these — small-muni dynamics remain inaccessible to this matching strategy.
- 2020 prior-cycle baseline may not reflect 2024 stronghold structure for all candidates. A new candidate or a re-aligning party would have a different stronghold pattern not captured by the 2020 baseline. Setor-code matching + 2022 federal results would sharpen.
- The reversed-sign finding is robust in this matched subset but selection on muni LV richness is not independent of muni political characteristics (capitals tend to have stronger party identification and more pollster competition). The result is conditional on big-muni dynamics.
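The superseding note in the headline adds a further caveat: the 22 muni-rich pairs come from only 2 unique sponsored polls, so the pairs are not independent draws. A cluster bootstrap resamples sponsored polls rather than pairs. The sketch below is a plain percentile bootstrap with hypothetical names and made-up contrasts; it is not the an-125 implementation, which reports a "cluster-BB" interval whose exact procedure is not described here.

```python
import random
from collections import defaultdict
from typing import Hashable, Sequence, Tuple


def cluster_bootstrap_ci(contrasts: Sequence[float],
                         cluster_ids: Sequence[Hashable],
                         n_boot: int = 2000,
                         alpha: float = 0.05,
                         seed: int = 0) -> Tuple[float, float, int]:
    """Percentile CI for the mean contrast, resampling whole clusters."""
    groups = defaultdict(list)
    for value, cid in zip(contrasts, cluster_ids):
        groups[cid].append(value)
    clusters = list(groups.values())
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Draw as many clusters as exist, with replacement, keeping each
        # cluster's pairs together.
        resampled = [rng.choice(clusters) for _ in clusters]
        values = [v for cluster in resampled for v in cluster]
        means.append(sum(values) / len(values))
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi, len(clusters)


# Made-up data shaped like the muni-rich subset: 2 sponsored polls x 11 pairs.
contrasts = [-0.003] * 11 + [-0.002] * 11
cluster_ids = ["poll_a"] * 11 + ["poll_b"] * 11
lo, hi, n_clusters = cluster_bootstrap_ci(contrasts, cluster_ids)
```

With two clusters the bootstrap mean can take only three distinct values, so the interval just spans the two cluster means. That is one way to see why the headline treats the bootstrap as meaningful only at thresholds that retain more clusters.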
Setor-code addendum (2026-06-02)
source/analysis/_setor_oversampling_exploration.py probes what the
LLM-extracted setor codes
(s/i_bairro_detail__setor_codes_sample) can tell us without an
external IBGE setor-polygon shapefile. The proper setor → seção test
needs geobr / shapefiles which are not in the sandbox; the
exploration runs three coarser-but-feasible checks.
(a) LLM extraction validity. Each setor code's first 7 digits should equal the poll's muni IBGE7. 100 % of 107 sponsored and 100 % of 38 independent polls with extracted setor codes pass. The LLM is extracting setor codes correctly.
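Check (a) amounts to a prefix comparison. A minimal sketch, with hypothetical function names and made-up setor codes (only the 7-digit muni prefix rule comes from the text above):

```python
from typing import Iterable, Union


def setor_prefix_ok(code: Union[str, int], muni_ibge7: Union[str, int]) -> bool:
    """True if the setor code is numeric and starts with the muni's IBGE7."""
    text = str(code).strip()
    return text.isdigit() and len(text) >= 7 and text[:7] == str(muni_ibge7)


def poll_passes(setor_codes: Iterable[Union[str, int]],
                muni_ibge7: Union[str, int]) -> bool:
    """A poll passes only if it has codes and every code matches the muni."""
    codes = list(setor_codes)
    return bool(codes) and all(setor_prefix_ok(c, muni_ibge7) for c in codes)


# Made-up setor codes with a Belo Horizonte prefix (IBGE7 3106200).
example_codes = ["310620005620001", "310620060670012"]
print(poll_passes(example_codes, 3106200))
```

Treating a poll with no extracted codes as a failure is a choice of this sketch; the 100 % figures above are computed only over polls that have extracted setor codes.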
(b) n_setores_total contrast (sponsored − independent). On the
25 pairs where both polls actually use setor-based sampling
(n_setores_total > 0 on both sides):
| Statistic | Value |
|---|---|
| Median sponsored n_setores | 50 |
| Median independent n_setores | 100 |
| Mean Δ (sp − ind) | −25.6 (sd 114) |
| Paired-t | −1.12 |
| p | 0.28 |
| Sign: positive of N | 8 / 25 (binom p = 0.11) |
Sponsored polls use about half as many setores as matched independent polls (median 50 vs 100). The mean Δ is not statistically significant on n=25, but the sign and magnitude lean toward sponsored polls using fewer setores. This is consistent with the AN-032 finding that the bairro/setor mechanism is not an "oversample partisan strongholds" lever; sponsored polls simply cover fewer setores, and the partisan composition of those fewer setores is, if anything, less tilted toward the sponsor's strongholds (AN-032 headline).
(c) Distrito-set Jaccard similarity. First 9 digits of setor code = muni + distrito (within muni). On the 20 pairs with setor codes on both sides:
| Statistic | Value |
|---|---|
| Mean Jaccard | 0.675 |
| Median Jaccard | 0.667 |
| Pairs with Jaccard > 0.5 | 11 / 20 |
| Pairs with Jaccard = 0 | 1 / 20 |
Sponsored and independent polls in the same muni × candidate × ±14 days share two-thirds of their distritos on average — they sample broadly similar parts of the muni. The 1 zero-overlap pair is the only case of completely disjoint geographies. The pattern is consistent with bairro/setor selection being a methodology shared across pollsters in a given race, not a sponsor-strategic lever.
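The distrito-set comparison in (c) can be sketched as follows. Function names and the setor codes are made up for illustration; only the 9-digit prefix rule comes from the text.

```python
from typing import Iterable, Optional, Set


def distrito_set(setor_codes: Iterable[str]) -> Set[str]:
    """First 9 digits of each setor code = muni + distrito."""
    cleaned = (str(code).strip() for code in setor_codes)
    return {code[:9] for code in cleaned if len(code) >= 9}


def jaccard(a: Set[str], b: Set[str]) -> Optional[float]:
    """|a ∩ b| / |a ∪ b|, or None when both sets are empty."""
    union = a | b
    if not union:
        return None
    return len(a & b) / len(union)


# Made-up setor codes for one pair in the same muni.
sponsored_codes = ["310620005620001", "310620010620002", "310620025620003"]
independent_codes = ["310620005620009", "310620010620004"]
similarity = jaccard(distrito_set(sponsored_codes), distrito_set(independent_codes))
```

In this example the polls share 2 distritos out of 3 in the union, giving 2/3. A Jaccard of about 0.67 therefore means the shared distritos are two-thirds of the union of both polls' distritos, which is a slightly stricter statement than each poll sharing two-thirds of its own list.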
Combined read. The setor codes confirm the AN-032 main reading:
the bairro/setor lever doesn't carry the +7–8 pp sponsor effect.
Sponsored polls use fewer setores within broadly similar distritos,
and the partisan composition of those setores doesn't tilt toward
the sponsor's strongholds. The proper finer test (setor → seção via
IBGE polygons + spatial join) is blocked on local data (geobr /
geopandas / shapefiles not in sandbox); the exploratory result
is suggestive that the substantive answer wouldn't change.
Follow-ups
- Setor-code matching path via polls' own setor_codes_sample (highest paper-value extension). The curated_pairs parquet has s_bairro_detail__setor_codes_sample and i_bairro_detail__setor_codes_sample columns for polls that listed IBGE setor codes directly. That subset bypasses the bairro-string match entirely; setor → seção via the official IBGE↔TSE crosswalk gives a direct match. Should give a small-but-clean test on big-muni polls. Cost: ~50 lines of crosswalk-lookup code added to AN-032's script.
- Geographic geocoding for small munis (extension, larger lift). For polls without setor codes, geocode the LLM-extracted bairro strings to lat/lon (Brazilian postal/Google API). Match seções via lat/lon polygon containment. Bypasses the LV polling-station-bairro layer entirely. Requires API quota + per-muni shapefiles where available.
- Why "less stronghold concentration"? (puzzle). Two competing readings: (R1) sponsored polls actively defend methodology face-validity by including non-stronghold bairros while slanting elsewhere; (R2) independent media polls concentrate on partisan-active bairros (where the political story lives). Disambiguating requires a direct measurement of "concentration" — e.g., the variance of weighted_party_share across bairros within each poll, not just the level. Add this to AN-032's output table as a sensitivity.
- Substantive note for theory.md / paper.tex § Channel A. The Channel A lever inventory should be revised:
  - REAL Channel A levers: population frame, coverage_class (urban-only/deferral), methodology incompleteness (audit opacity), quota mix.
  - NOT a Channel A lever: partisan bairro / stronghold selection (AN-032 finds the OPPOSITE direction). The bairro list flexes between sponsored and independent polls (count and coverage_class, per finding-paired) but the partisan composition does not tilt toward the sponsor.

  Update theory.md § "Polls as Bayesian persuasion" and paper.tex Channel A description with this refinement. The negative sign is substantively informative: it constrains the Channel A story away from the most obvious "cherry-pick the strongholds" lever toward the more sophisticated "manipulate the frame and methodology" levers.
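For the concentration follow-up (the R1 vs R2 puzzle), one possible measure is the weighted variance of the sponsor-party share across a poll's matched bairros. The sketch below is an assumption about how that sensitivity could be computed, with made-up numbers; it is not part of AN-032's current output.

```python
from typing import Sequence, Tuple


def weighted_mean_and_variance(shares: Sequence[float],
                               weights: Sequence[float]) -> Tuple[float, float]:
    """Weighted mean and (population) variance of bairro-level party shares."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    mean = sum(w * s for s, w in zip(shares, weights)) / total
    variance = sum(w * (s - mean) ** 2 for s, w in zip(shares, weights)) / total
    return mean, variance


# Two made-up polls with the same level (mean 0.30) but different spread.
flat_mean, flat_var = weighted_mean_and_variance([0.30, 0.30], [1, 1])
spread_mean, spread_var = weighted_mean_and_variance([0.10, 0.50], [1, 1])
```

The two polls have identical weighted_party_share, hence identical oversample_index, yet the second mixes a weak bairro with a stronghold. A within-pair contrast on this variance would separate the level of stronghold tilt from its spread, which the level-only index cannot do.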