id: an-032
hypothesis: bayesian-cluster-selection
headline: "SUPERSEDED by [[an-125]] (2026-06-23): the 'reverse-sign' headline below was a clustering artifact. The 22 muni-rich pairs (n_LV_bairros ≥ 50) come from only 2 unique sponsored polls (HELIO LOPES PSDB-Anápolis, ROGERIO CORREIA PT-BH) each re-paired against 11 independents; naive paired-t treats them as 22 independent draws when the effective n is 2. With cluster-bootstrap CI by sponsored_protocol, the result is null at every threshold where the bootstrap is meaningful (thr=0: 21 clusters, CI [−0.46, +0.34] pp; thr=11: 12 clusters, CI [−0.68, +0.49] pp; thr=20: 5 clusters, CI [−0.45, +0.90] pp). A sharper candidate-own variant (poll's bairros weighted by the sponsoring candidate's OWN 2020 mayoral vote share rather than the party's; n=34 pairs from 18 unique sponsored polls) is also null: Δ = +0.15 pp, cluster-BB CI [−0.49, +1.22] pp. Original (pre-clustering) headline preserved below for the audit trail. [Reverse-sign result on the cleanest subset. On the 22 muni-rich pairs (n_LV_bairros ≥ 50; mostly BH and other large state capitals), within-pair contrast (sponsored − independent oversample-index for the sponsor's party's 2020 strongholds) = −0.0029, t = −5.3, p < 0.001; sign test: only 2 of 22 positive (binomial p = 0.0001). The direction is opposite the Channel A 'Bayesian cluster selection' prediction: sponsored polls' bairro selection is less tilted toward the sponsor's party's prior strongholds than the matched independent poll's bairro selection, not more. On the wider 42-pair usable set the contrast is slightly negative and not significant (t = −0.85, p = 0.40) but the sign test still favors negatives (12/42 positive, binomial p = 0.008). Match rate is the binding constraint: median 0% per side because TSE local_votacao bairros are polling-station-location bairros (1-2 per small muni), not residential bairros (24+ per poll). Muni-rich restriction (≥50 LV bairros) keeps the matchable subset.]"
type: descriptive
question: "Within a sponsored × independent pair on the same muni × same candidate × ±14 days, does the sponsored poll's bairro selection over-represent the sponsor's party's prior-cycle (2020) strongholds relative to the matched independent poll? A positive paired contrast would be direct Channel A 'Bayesian cluster selection' evidence."
tags: ["hyp:bayesian-cluster-selection", channel-a, mechanism, secao-oversampling, paired, an-019-followup, curated-pairs, null-result, superseded]
status: interpreted-superseded
status_date: 2026-06-23
superseded_by: an-125
confidence: yellow
created: 2026-06-02
script: source/analysis/an-032-secao-oversampling-paired.py
target: build/table/secao_oversampling_paired.csv
design:
  sample: 224 pairs (of 244) in build/llm/curated_pairs/pairs_with_extractions.parquet with bairro extractions on both sides AND a sponsoring-candidate party assigned. Within-pair design: same muni × same candidate × ±14 days. Prior-cycle baseline: 2020 prefeito seção-level vote shares from pipelines/politica/build/clean/votacao_secao_2020.parquet. Seção-to-bairro mapping from eleitorado_local_votacao.csv.
  specification: "For each pair, compute oversample_index = weighted_party_share − muni_party_share for sponsored and independent polls separately (party = sponsor's party). Within-pair contrast = oversample_sp − oversample_ind. Paired-t test against zero. Sample-size weighting: per-bairro n_entrevistas when LLM extracted it, else uniform-by-elector-count within matched seções."
  comparator: independent poll on same muni × same candidate × within 14 days
  cluster: pair (paired-t handles by construction)
  weights: per-bairro n_entrevistas where extracted; uniform-elector otherwise

AN-032: Paired sponsored × independent oversample-index test

Question

Channel A's strongest specific prediction is Bayesian cluster selection: sponsors choose pollsters who choose poll designs that oversample bairros where the sponsor's party performed well in the prior cycle. If the prediction is right, comparing a sponsored poll to an independent poll of the same candidate in the same race within a two-week window should show the sponsored poll's bairros tilting toward the sponsor's prior strongholds, while the independent poll's bairros should not.

The _secao_oversampling_scope.py pilot (2026-06-02) showed (a) the bairro-to-TSE-local_votacao matching machinery works, (b) within-muni seção-level party-share variance is substantial (sd ≈ 0.075), and (c) the oversample_index placebo on three well-matched independent polls is ≈ 0 across the board (max |index| = 0.003). Machinery validated.

The curated_pairs dataset (build/llm/curated_pairs/pairs_with_extractions.parquet) already has 244 sponsored × independent pairs with bairro extractions on both sides, of which 224 have a sponsoring-candidate party recorded. This is the natural identifying sample for the test, and a much stronger one than the single sponsored poll in the methodology pilot.

Design

source/analysis/an-032-secao-oversampling-paired.py:

  1. Load the 224 well-extracted pairs from build/llm/curated_pairs/pairs_with_extractions.parquet.
  2. For each pair, parse the sponsored and independent bairro JSON lists. Each bairro can include n_entrevistas (LLM-extracted) — used as the bairro weight in the weighted-share calculation when present.
  3. Match poll bairros to TSE eleitorado_local_votacao.csv bairros within the pair's muni_id (normalized string match, exact then substring fallback).
  4. For each matched bairro, look up the bairro's seções; for each seção, get prior-cycle (2020) prefeito party-share from votacao_secao_2020.parquet.
  5. Compute:
    • weighted_party_share(poll, sponsor_party) = sum over matched bairros of (bairro_weight × bairro's seção-weighted sponsor-party share)
    • muni_party_share(muni, sponsor_party) = sponsor party's muni-wide weighted 2020 share
    • oversample_index(poll) = weighted_party_share − muni_party_share
  6. Per pair: contrast = oversample_sp − oversample_ind.
  7. Paired-t test on contrast against zero. Sign test for robustness.
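
Steps 5 and 6 can be sketched in a few lines of Python. This is a minimal illustration and not the code in an-032-secao-oversampling-paired.py: the function names and the dict-based inputs are hypothetical, and the sketch assumes the bairro weights are normalized to sum to one over the matched bairros, which the step list above does not state explicitly.

```python
from typing import Dict, Optional

def weighted_party_share(bairro_shares: Dict[str, float],
                         bairro_weights: Dict[str, float]) -> Optional[float]:
    """Weighted mean of the sponsor party's 2020 share over matched bairros.

    bairro_shares: matched bairro -> sponsor-party share (already seção-weighted).
    bairro_weights: matched bairro -> weight (e.g. n_entrevistas).
    Assumption: weights are normalized over the matched bairros only.
    """
    matched = [b for b in bairro_shares if bairro_weights.get(b, 0.0) > 0]
    total = sum(bairro_weights[b] for b in matched)
    if total == 0:
        return None  # no matched bairro: the index is unusable for this poll
    return sum(bairro_weights[b] * bairro_shares[b] for b in matched) / total

def oversample_index(bairro_shares, bairro_weights, muni_share) -> Optional[float]:
    """Poll-level tilt relative to the muni-wide sponsor-party share."""
    share = weighted_party_share(bairro_shares, bairro_weights)
    return None if share is None else share - muni_share

def pair_contrast(sp_index, ind_index) -> Optional[float]:
    """Within-pair contrast; defined only when both sides are usable."""
    if sp_index is None or ind_index is None:
        return None
    return sp_index - ind_index

# Invented example: two bairros, muni-wide sponsor-party share of 0.30.
shares = {"A": 0.40, "B": 0.20}
sp = oversample_index(shares, {"A": 50, "B": 50}, 0.30)   # evenly spread poll
ind = oversample_index(shares, {"A": 75, "B": 25}, 0.30)  # tilted toward bairro A
contrast = pair_contrast(sp, ind)
```

In the invented example the independent poll puts more weight on the stronger bairro, so the contrast comes out negative, the same sign convention as the results tables.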

Results

Paired within-pair oversample-index contrast

Sample and match rates

| Quantity | Value |
| --- | --- |
| Pairs in curated_pairs (with both-side bairros + sponsor party) | 224 |
| Pairs with usable oversample_index on BOTH sides | 42 (18.8 %) |
| Median bairro-match share, sponsored side | 0 % |
| Median bairro-match share, independent side | 0 % |
| Most pairs cluster in two munis | BH (12 pairs), Anápolis (10 pairs) |

The match rate is the binding constraint. LLM-extracted bairro strings (e.g., "Itaguai I, II, III", "Centro Sul", "Aarão Reis") often don't match TSE eleitorado_local_votacao's coarser labels (e.g., "Itaguai", "Centro"). Exact normalized matching gave 28 usable pairs; the substring fallback lifted that to 42. The structural mismatch remains: small-muni LV coverage has 1-5 bairros while polls list 10-50 granular ones.
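
The two-stage match (exact on normalized strings, then a substring fallback) could look roughly like the sketch below. The normalization rules (accent stripping, uppercasing, collapsing punctuation) and the whole-word containment test are assumptions made for illustration; the actual script may normalize differently.

```python
import re
import unicodedata
from typing import Iterable, Optional

def normalize(name: str) -> str:
    """Strip accents, uppercase, and collapse non-alphanumerics to single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^A-Z0-9]+", " ", no_accents.upper()).strip()

def match_bairro(poll_name: str, lv_names: Iterable[str]) -> Optional[str]:
    """Return the LV bairro label matched to a poll bairro string, or None."""
    target = normalize(poll_name)
    if not target:
        return None
    normalized = {normalize(lv): lv for lv in lv_names}
    # Stage 1: exact match on the normalized string.
    if target in normalized:
        return normalized[target]
    # Stage 2 (assumed rule): whole-word containment in either direction.
    for key, original in normalized.items():
        if key and (f" {key} " in f" {target} " or f" {target} " in f" {key} "):
            return original
    return None

lv_labels = ["Itaguai", "Centro"]
matches = {name: match_bairro(name, lv_labels)
           for name in ["Itaguai I, II, III", "Centro Sul", "Aarão Reis"]}
```

With the example strings from the paragraph above, the fallback maps "Itaguai I, II, III" to "Itaguai" and "Centro Sul" to "Centro", while "Aarão Reis" stays unmatched. The second case also shows the cost of the fallback: a granular residential bairro gets folded into a coarser LV label that may not cover the same area.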

Headline paired test

On the 42 usable pairs (any LV richness):

| Statistic | Value |
| --- | --- |
| Mean sponsored oversample_index | −0.0019 |
| Mean independent oversample_index | −0.0007 |
| Mean contrast (sp − ind) | −0.0012 |
| SD contrast | 0.0091 |
| Paired-t | −0.85 |
| Two-sided p | 0.40 |
| Sign test: positive pairs | 12 / 42 |
| Binomial p vs 50/50 | 0.008 |
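
As an arithmetic check, the paired-t and the sign-test p-value can be recomputed from the summary numbers in the table using only the standard library. The helper names are hypothetical, and the sign test here assumes there are no exactly-zero contrasts (so the 42 pairs split into positives and negatives).

```python
import math

def paired_t_from_summary(mean_diff: float, sd_diff: float, n: int) -> float:
    """Paired-t statistic: mean contrast divided by its standard error."""
    return mean_diff / (sd_diff / math.sqrt(n))

def sign_test_two_sided(n_positive: int, n: int) -> float:
    """Exact two-sided binomial p-value against a 50/50 split (no ties assumed)."""
    k = min(n_positive, n - n_positive)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

t_stat = paired_t_from_summary(-0.0012, 0.0091, 42)
p_sign = sign_test_two_sided(12, 42)
print(round(t_stat, 2), round(p_sign, 3))
```

Both tests treat the 42 pairs as independent draws, which is exactly the assumption the front-matter headline flags as violated when one sponsored poll is re-paired against many independents.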

The mean contrast is small and the paired-t does not reject zero, but the sign test rejects 50/50 in the negative direction. Restricting to LV-rich munis (where LV bairro coverage is rich enough to make the match informative) sharpens the result dramatically:

Muni-richness sensitivity (the cleanest read)

| Subset (n_LV_bairros ≥) | n pairs | mean contrast | paired-t | p | sign (pos / N) | binom p |
| --- | --- | --- | --- | --- | --- | --- |
| 0 (all usable) | 42 | −0.0012 | −0.85 | 0.40 | 12 / 42 | 0.008 |
| 11 | 32 | −0.0019 | −1.03 | 0.31 | 8 / 32 | 0.007 |
| 20 | 25 | −0.0015 | −1.64 | 0.11 | 5 / 25 | 0.004 |
| 50 (large munis with rich LV) | 22 | −0.0029 | −5.28 | < 0.001 | 2 / 22 | 0.0001 |

The signal sharpens as we restrict to munis where the bairro-string match is informative. On the cleanest 22-pair subset (large munis, mostly state capitals with rich LV coverage), the result is a precisely-estimated negative contrast with t = −5.3 — sponsored polls' bairros are systematically LESS tilted toward the sponsor's party's prior strongholds than the matched independent poll's bairros, in 20 of 22 pairs.

The interpretation is no longer just "null"; it's reversed-sign.
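
The front-matter headline records that the 22 muni-rich pairs come from only 2 unique sponsored polls, each re-paired against 11 independents, and that a cluster-bootstrap CI by sponsored_protocol was used in the superseding analysis. The sketch below shows the general idea of a percentile cluster bootstrap with whole sponsored polls as the resampling unit. It is not the an-125 implementation: the function name, the percentile method, and the example values are all assumptions chosen for illustration.

```python
import random
from collections import defaultdict
from statistics import mean

def cluster_bootstrap_ci(contrasts, cluster_ids, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean contrast, resampling whole clusters.

    contrasts: one within-pair contrast per pair.
    cluster_ids: the sponsored-poll identifier of each pair (same length).
    """
    by_cluster = defaultdict(list)
    for value, cid in zip(contrasts, cluster_ids):
        by_cluster[cid].append(value)
    clusters = list(by_cluster.values())
    rng = random.Random(seed)
    boot_means = []
    for _ in range(n_boot):
        # Draw as many clusters as exist, with replacement, keeping each
        # cluster's pairs together.
        drawn = [rng.choice(clusters) for _ in clusters]
        boot_means.append(mean(v for cluster in drawn for v in cluster))
    boot_means.sort()
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Invented data: 2 sponsored polls, 11 pairs each, identical contrast within poll.
values = [-0.004] * 11 + [-0.002] * 11
ids = ["poll_A"] * 11 + ["poll_B"] * 11
ci = cluster_bootstrap_ci(values, ids)
```

With only two clusters the resampled mean can take just three values (either cluster's mean or the pooled mean), so the interval spans the full gap between the two polls no matter how many pairs each contributes. A pair-level t-test on the same invented data would report a far tighter interval.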

Interpretation

The structural unit-mismatch problem

The match-rate constraint is structural, not a string-matching bug. TSE eleitorado_local_votacao "bairro" = the bairro where a polling station SITS (a school, public building). Poll PDF "bairro" = the RESIDENTIAL bairro where interviewers worked. In big munis these align — most bairros have their own polling stations. In small munis a single school serves the whole town, so LV has 1-2 bairros while polls list 24+ residential ones.

The diagnostic:

| Muni size (electors) | # munis | Median LV bairros | Typical poll bairros |
| --- | --- | --- | --- |
| < 10k | 2,924 | 2 | ~ 24 |
| 10-50k | 2,149 | 8 | ~ 24 |
| 50-200k | 393 | 24 | ~ 24 |
| 200-500k | 54 | 48 | ~ 24 |
| > 500k | 49 | 80 | ~ 24 |

Below ~50k voters the bairro-string match cannot work without a finer geographic unit — the LV granularity isn't there. Hence the muni-rich restriction; the ≥50-LV subset is what the test can sharply identify.

Why the sign is negative

Two readings of the −0.003 contrast survive the data:

(R1) Bairro selection is a credibility signal, not a slant lever. Sponsored polls actively avoid stronghold concentration. If a sponsor pays for a poll showing their candidate strong, the poll's methodology must look defensible to the consumers (media, opponents, TSE itself). Concentrating bairros on the sponsor's prior strongholds would be cherry-picking and obviously visible in the registered methodology. Instead, the pollster includes a broader geographic spread — neutral / mixed / opposing-stronghold bairros — and slants through less visible methodology channels: population frame (mixed vs TSE-eligible; finding-paired, AN-019), quota variable choice (AN-022), audit-opacity (AN-021), coverage_class shift (AN-024). The bairro list is visible in the PDF; the population frame and audit rates are buried in the narrative text.

(R2) Independent media polls concentrate on electoral "hotspots" — bairros where the leading candidate's party is strongest, where the political story is happening. Sponsored polls of the same candidate cover more representatively because they need to defend a non-narrative methodology to TSE. This is the same explanation viewed from the independent-poll side: media polls oversample strongholds because that's where the news is.

Both readings predict the same observed pattern (sp < ind on sponsor-party oversample-index) and the data cannot fully distinguish them. The substantive consequence is the same either way: bairro selection does not carry the +7-8 pp sponsor effect.

Refined Channel A story

Combined with AN-019 (coverage_class × sponsor), AN-021 (audit-pct × sponsor), AN-022 (methodology-completeness × sponsor), AN-024 (deferral × sponsor), AN-032 sharpens the Channel A mechanism characterization:

| Lever | Direction | Evidence |
| --- | --- | --- |
| Population reference (mixed vs TSE-eligible) | sp > ind | finding-paired, AN-019 |
| Census-setor as cluster frame | sp > ind | finding-paired |
| Coverage class shift (urban-only, deferred) | sp > ind | AN-019, AN-024 |
| Methodology completeness (audit, training) | sp < ind | AN-021, AN-022 |
| Quota variable mix | mixed | AN-022 |
| Partisan bairro/stronghold selection | sp < ind | AN-032 |

The Channel A levers that DO carry sponsor slant are population frame and operational opacity (low audit, deferred coverage, less-complete methodology). The bairro list does flex (sponsored polls list fewer or different bairros, per finding-paired), but the partisan composition of the bairro list does not — sponsored polls' bairros are more geographically representative of the muni's full partisan map than independent polls'.

Caveats explicit

Setor-code addendum (2026-06-02)

source/analysis/_setor_oversampling_exploration.py probes what the LLM-extracted setor codes (s/i_bairro_detail__setor_codes_sample) can tell us without an external IBGE setor-polygon shapefile. The proper setor → seção test needs geobr / shapefiles which are not in the sandbox; the exploration runs three coarser-but-feasible checks.

(a) LLM extraction validity. Each setor code's first 7 digits should equal the poll's muni IBGE7. 100 % of 107 sponsored and 100 % of 38 independent polls with extracted setor codes pass. The LLM is extracting setor codes correctly.
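
The prefix check in (a) amounts to a string comparison. A minimal sketch follows, with hypothetical helper names and made-up setor codes; it assumes codes arrive as digit strings, possibly with stray separators left over from extraction.

```python
from typing import Iterable, Optional

def setor_matches_muni(setor_code: str, muni_ibge7: str) -> bool:
    """True when the first 7 digits of the setor code equal the muni IBGE7 code."""
    digits = "".join(ch for ch in str(setor_code) if ch.isdigit())
    return len(digits) >= 7 and digits[:7] == str(muni_ibge7)

def validity_rate(setor_codes: Iterable[str], muni_ibge7: str) -> Optional[float]:
    """Share of a poll's extracted setor codes that pass the prefix check."""
    codes = list(setor_codes)
    if not codes:
        return None  # poll has no extracted setor codes
    return sum(setor_matches_muni(c, muni_ibge7) for c in codes) / len(codes)

# Made-up codes for illustration only.
rate_ok = validity_rate(["310620005620001", "310620010630045"], "3106200")
rate_mixed = validity_rate(["310620005620001", "520110805000012"], "3106200")
```

A poll would count as passing when its rate equals 1.0. Note that the check only catches codes attributed to the wrong muni; a hallucinated code with the right prefix would still pass.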

(b) n_setores_total contrast (sponsored − independent). On the 25 pairs where both polls actually use setor-based sampling (n_setores_total > 0 on both sides):

| Statistic | Value |
| --- | --- |
| Median sponsored n_setores | 50 |
| Median independent n_setores | 100 |
| Mean Δ (sp − ind) | −25.6 (sd 114) |
| Paired-t | −1.12 |
| p | 0.28 |
| Sign: positive of N | 8 / 25 (binom p = 0.11) |

Sponsored polls use about half as many setores as matched independent polls (median 50 vs 100). The mean Δ is not statistically significant on n=25, but the sign + magnitude lean toward sponsored polls using fewer setores. This reinforces the AN-032 finding that the bairro/setor mechanism is not an "oversample partisan strongholds" lever; sponsored polls simply cover fewer setores, and the partisan composition of those fewer setores is, if anything, less tilted toward the sponsor's strongholds (AN-032 headline).

(c) Distrito-set Jaccard similarity. First 9 digits of setor code = muni + distrito (within muni). On the 20 pairs with setor codes on both sides:

| Statistic | Value |
| --- | --- |
| Mean Jaccard | 0.675 |
| Median Jaccard | 0.667 |
| Pairs with Jaccard > 0.5 | 11 / 20 |
| Pairs with Jaccard = 0 | 1 / 20 |

Sponsored and independent polls in the same muni × candidate × ±14 days share two-thirds of their distritos on average — they sample broadly similar parts of the muni. The 1 zero-overlap pair is the only case of completely disjoint geographies. The pattern is consistent with bairro/setor selection being a methodology shared across pollsters in a given race, not a sponsor-strategic lever.
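
The distrito-set similarity in (c) can be sketched as follows. The helper names and the example codes are hypothetical; the only rule taken from the text is that the first 9 digits of a setor code identify muni + distrito.

```python
from typing import Iterable, Optional, Set

def distrito_set(setor_codes: Iterable[str]) -> Set[str]:
    """Unique 9-digit muni+distrito prefixes among a poll's setor codes."""
    return {str(code)[:9] for code in setor_codes if len(str(code)) >= 9}

def jaccard(a: Set[str], b: Set[str]) -> Optional[float]:
    """|a ∩ b| / |a ∪ b|; undefined (None) when both sets are empty."""
    union = a | b
    if not union:
        return None
    return len(a & b) / len(union)

# Made-up codes: the sponsored poll covers distritos 05 and 10,
# the independent poll covers 05, 10 and 25.
sp_codes = ["310620005620001", "310620005620002", "310620010630045"]
ind_codes = ["310620005620009", "310620010630001", "310620025640003"]
similarity = jaccard(distrito_set(sp_codes), distrito_set(ind_codes))
```

In this invented pair the two polls share 2 of 3 distritos, a Jaccard of about 0.667, which is the scale of overlap the table reports at the median.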

Combined read. The setor codes are consistent with the AN-032 main reading: the bairro/setor lever doesn't carry the +7–8 pp sponsor effect. Sponsored polls use fewer setores within broadly similar distritos, and the partisan composition of those setores doesn't tilt toward the sponsor's strongholds. The proper finer test (setor → seção via IBGE polygons + spatial join) is blocked on local data (geobr / geopandas / shapefiles not in sandbox); the exploratory result suggests that the substantive answer wouldn't change.

Follow-ups

  1. Setor-code matching path via polls' own setor_codes_sample (highest paper-value extension). The curated_pairs parquet has s_bairro_detail__setor_codes_sample and i_bairro_detail__setor_codes_sample columns for polls that listed IBGE setor codes directly. That subset bypasses the bairro-string match entirely; setor → seção via the official IBGE↔TSE crosswalk gives a direct match. Should give a small-but-clean test on big-muni polls. Cost: ~50 lines of crosswalk-lookup code added to AN-032's script.
  2. Geographic geocoding for small munis (extension, larger lift). For polls without setor codes, geocode the LLM-extracted bairro strings to lat/lon (Brazilian postal/Google API). Match seções via lat/lon polygon containment. Bypasses the LV polling-station-bairro layer entirely. Requires API quota + per-muni shapefiles where available.
  3. Why "less stronghold concentration"? (puzzle). Two competing readings: (R1) sponsored polls actively defend methodology face-validity by including non-stronghold bairros while slanting elsewhere; (R2) independent media polls concentrate on partisan-active bairros (where the political story lives). Disambiguating requires a direct measurement of "concentration" — e.g., the variance of weighted_party_share across bairros within each poll, not just the level. Add this to AN-032's output table as a sensitivity.
  4. Substantive note for theory.md / paper.tex § Channel A. The Channel A lever inventory should be revised:
    • REAL Channel A levers: population frame, coverage_class (urban-only/deferral), methodology incompleteness (audit opacity), quota mix.
    • NOT a Channel A lever: partisan bairro / stronghold selection (AN-032 finds the OPPOSITE direction). The bairro list flexes between sponsored and independent polls (count, coverage_class; see finding-paired) but the partisan composition does not tilt toward the sponsor.
     Update theory.md § "Polls as Bayesian persuasion" and paper.tex Channel A description with this refinement. The negative sign is substantively informative: it constrains the Channel A story away from the most obvious "cherry-pick the strongholds" lever toward the more sophisticated "manipulate the frame and methodology" levers.
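
The concentration sensitivity proposed in follow-up 3 could be sketched as a weighted variance of the bairro-level sponsor-party shares within each poll. This is one possible reading of "variance of weighted_party_share across bairros"; the function name and inputs are hypothetical and nothing like it exists in the current output table.

```python
from typing import Dict, Optional

def weighted_share_variance(bairro_shares: Dict[str, float],
                            bairro_weights: Dict[str, float]) -> Optional[float]:
    """Weighted variance of sponsor-party share across a poll's matched bairros.

    A low value means the poll's bairros sit close to one partisan level
    (concentrated); a high value means it spans strongholds and non-strongholds.
    """
    pairs = [(bairro_shares[b], bairro_weights.get(b, 0.0)) for b in bairro_shares]
    total = sum(w for _, w in pairs)
    if total <= 0:
        return None
    centre = sum(s * w for s, w in pairs) / total
    return sum(w * (s - centre) ** 2 for s, w in pairs) / total

# Invented example: same mean share (0.30), different spread.
spread_poll = weighted_share_variance({"A": 0.40, "B": 0.20}, {"A": 1, "B": 1})
narrow_poll = weighted_share_variance({"C": 0.31, "D": 0.29}, {"C": 1, "D": 1})
```

Two polls with the same oversample_index can differ sharply on this measure, which is why the level alone cannot separate readings R1 and R2.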