id: an-056
hypothesis: methodology-flexibility-a
headline: "Within-pair quota-distribution deltas (income / age / education) do not correlate with within-pair bias contrast. On 244 curated sponsored × independent pairs with cached `poll_sampling` extractions (338/338 protocols cached; quota distributions on 336): |Δ income (SM)| vs |contrast| r = +0.02 (p=0.75), |Δ age| r = +0.13 (p=0.10), |Δ pct-superior| r = −0.05 (p=0.45); signed Δ income vs signed contrast r = +0.06 (p=0.39). Across-stratum: |Δ income| ≈ 1.1-1.2 SM in all three strata (high_bias, well_behaved, understating). Only `population_reference differs` shows a directional pattern: 25.5 % of high_bias pairs vs 14.6 % of well_behaved (chi-square p = 0.12, not significant). Reading: at the cached-quota-distribution granularity, sponsored polls do not deviate from indep-media polls in income / age / education quotas in a way that correlates with the observed bias. Combined with AN-055 (coverage × candidate-base null), the structural Channel A search continues to come up empty on the coverage + quota axis. Live frontiers remain AN-051 (scenario rotation, positive) and the unextracted ponderação description."
type: descriptive
status: interpreted
status_date: 2026-06-14
confidence: yellow
created: 2026-06-14
script: source/analysis/an-056-quota-distance-by-bias.py
target: build/table/an-056-quota-distance-by-bias.csv
design:
  sample: "244 curated sponsored × independent pairs from build/llm/curated_pairs/pairs.parquet (stratified: 110 high_bias, 89 well_behaved, 45 understating). 338/338 unique protocols have cached poll_sampling extractions (the cancellation of the universe-scale sampling batch left only the curated-pair pilot extractions on disk); 336/338 have quota_distributions, 334/338 have non-null population_reference. The pair-level usable n varies by variable due to bin-parsing coverage (income 220 pairs, age 164, education 235)."
  specification: "Per protocol, compute scalar quota summaries from the LLM-extracted bin_labels + bin_percentages: weighted-mean income (SM, midpoints with R$/SM auto-detect and 30-SM sanity cap), weighted-mean age (years), and percentage with any superior education. Per pair, compute within-pair deltas (sponsored − indep). Pearson correlations of |delta| vs |contrast| (manipulation-intensity test) and signed delta vs signed contrast (directional Channel A test). Population_reference differs as a binary cross-tabulated against stratum (chi-square)."
  comparator: indep-media poll on same muni × same candidate × within 14 days (pair structure inherited from curated_pairs)
  cluster: pair (correlations and chi-square treat pair as unit)

AN-056: Within-pair quota distribution distance × bias contrast

Question

The blinded LLM-judge pilot (docs/briefs/blinded_channel_a_pilot.md) flagged income-quota distributions, weighting, and coverage as the dominant high-plausibility Channel-A mechanism domains, with 14 / 16 high-plausibility hypotheses agreeing with the actual sponsored side (87.5 %, p ≈ 0.004). AN-055 ruled coverage × candidate-base out at cheap-Tier-2 granularity. This analysis tests the structural counterpart to the LLM-judge's income-quota theme: do sponsored polls' quota distributions systematically differ from the matched independent poll's in a way that correlates with the observed bias?

A positive result would identify quota choice as the Channel A lever (complementing AN-051's scenario-rotation finding). A null directs attention back toward the unextracted ponderação (weighting) description or elsewhere.

Design

source/analysis/an-056-quota-distance-by-bias.py:

- Load build/llm/curated_pairs/pairs.parquet (244 pairs, 338 protocols).
- Load cached poll_sampling LLM extractions from pipelines/politica/build/llm/poll_sampling/. All 338 pair protocols are cached (a fortunate side effect of the cancelled universe-scale batch: the curated-pair pilot ran first and survived). 336/338 have quota_distributions, 334/338 have non-null population_reference.
- Per protocol, summarise each variable's bin_labels + bin_percentages to a scalar: mean_income_sm, mean_age, pct_superior. The income parser handles SM-only, R$-only (converts at the 2024 minimum wage, R$ 1,412), and mixed labels; midpoints > 30 SM are treated as parse errors (sanity guard).
- Per pair, compute delta_* = sponsored − indep and abs_delta_* = |delta_*|.
- Tests:
  - Intensity (manipulation magnitude): Pearson r of |delta| vs |contrast|.
  - Direction (Channel A directional): Pearson r of delta vs contrast.
  - Population frame mismatch: chi-square of population_reference differs × stratum.
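
The scalar-summary and delta steps above can be sketched roughly as follows. This is a minimal illustration, not the code in an-056-quota-distance-by-bias.py: it assumes bin labels have already been parsed into numeric bounds with a unit tag, and the names `midpoint_sm`, `mean_income_sm`, and `pair_delta` are hypothetical. Voiding the whole protocol when any midpoint exceeds the 30-SM cap is also an assumption; the source only says such midpoints are treated as parse errors.

```python
from typing import Optional, Sequence, Tuple

MIN_WAGE_BRL = 1412.0    # 2024 minimum wage (R$), as stated in the design
MAX_MIDPOINT_SM = 30.0   # sanity cap on bin midpoints, as stated in the design

# Hypothetical pre-parsed bin: (low, high, unit, percentage), unit is "SM" or "BRL".
Bin = Tuple[float, float, str, float]


def midpoint_sm(low: float, high: float, unit: str) -> Optional[float]:
    """Bin midpoint in minimum wages (SM); None if it fails the sanity cap."""
    mid = (low + high) / 2.0
    if unit == "BRL":
        mid /= MIN_WAGE_BRL
    return mid if mid <= MAX_MIDPOINT_SM else None


def mean_income_sm(bins: Sequence[Bin]) -> Optional[float]:
    """Percentage-weighted mean of bin midpoints, in SM."""
    weighted_sum = 0.0
    total_pct = 0.0
    for low, high, unit, pct in bins:
        mid = midpoint_sm(low, high, unit)
        if mid is None:
            return None  # assumption: one implausible bin voids the protocol
        weighted_sum += mid * pct
        total_pct += pct
    return weighted_sum / total_pct if total_pct > 0 else None


def pair_delta(sponsored: Optional[float], indep: Optional[float]):
    """Return (delta, abs_delta) with delta = sponsored - indep, or None if unusable."""
    if sponsored is None or indep is None:
        return None
    delta = sponsored - indep
    return delta, abs(delta)


# Illustrative bins only, not taken from the data.
sponsored_bins = [(0, 2, "SM", 50), (2, 5, "SM", 30), (5, 10, "SM", 20)]
indep_bins = [(0, 2824, "BRL", 60), (2824, 7060, "BRL", 40)]
result = pair_delta(mean_income_sm(sponsored_bins), mean_income_sm(indep_bins))
```

Returning None rather than a clipped value keeps unparseable protocols out of the pair-level n, which is consistent with the usable n varying by variable (220 / 164 / 235).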

Results

Across strata — mean |delta| does not differentiate

| metric | high_bias (n) | well_behaved (n) | understating (n) |
|---|---:|---:|---:|
| \|Δ income (SM)\| | 1.12 (98) | 1.23 (80) | 1.21 (42) |
| \|Δ age (years)\| | 3.42 (78) | 3.10 (58) | 3.47 (28) |
| \|Δ % superior ed\| | 4.65 (106) | 4.18 (85) | 2.24 (44) |
| pop_ref_differs | 25.5 % (110) | 14.6 % (89) | 15.6 % (45) |

Income and age quota deviations are essentially indistinguishable across the three pair strata; education deviations are similar in high_bias and well_behaved (4.65 vs 4.18 points) and smaller in understating (2.24). Only pop_ref_differs shows a pattern aligned with bias: high-bias pairs are ~1.7× as likely as well-behaved pairs to have a sponsored vs independent mismatch on the declared population reference (25.5 % vs 14.6 %). Chi-square test of independence: χ² = 4.26 on 2 d.f., p = 0.12, which is not significant at this sample size.
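
The chi-square figure can be re-derived by hand as a sanity check. The sketch below assumes the cell counts implied by the reported percentages and stratum sizes (28/110, 13/89, 7/45); those counts are reconstructed here, not read from the output table. For 2 degrees of freedom the chi-square survival function is exactly exp(−χ²/2), so no statistics library is needed.

```python
import math

# Counts of pairs where population_reference differs, reconstructed from the
# reported percentages (assumption: 25.5 % of 110, 14.6 % of 89, 15.6 % of 45).
differs = {"high_bias": 28, "well_behaved": 13, "understating": 7}
totals = {"high_bias": 110, "well_behaved": 89, "understating": 45}


def chi_square_2xk(differs: dict, totals: dict) -> float:
    """Pearson chi-square statistic for a 2 x k table (differs / same by stratum)."""
    n = sum(totals.values())
    overall_rate = sum(differs.values()) / n
    chi2 = 0.0
    for stratum, size in totals.items():
        cells = (
            (differs[stratum], size * overall_rate),               # "differs" cell
            (size - differs[stratum], size * (1 - overall_rate)),  # "same" cell
        )
        for observed, expected in cells:
            chi2 += (observed - expected) ** 2 / expected
    return chi2


chi2 = chi_square_2xk(differs, totals)
p_value = math.exp(-chi2 / 2)  # exact only for 2 degrees of freedom
print(f"chi2 = {chi2:.2f}, p = {p_value:.2f}")
```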

Correlations — quota intensity does not predict bias

| metric | r | 95 % CI | p | n |
|---|---:|:---:|---:|---:|
| \|Δ income\| vs \|contrast\| | +0.02 | [−0.11, +0.15] | 0.751 | 220 |
| \|Δ age\| vs \|contrast\| | +0.13 | [−0.03, +0.28] | 0.103 | 164 |
| \|Δ %superior\| vs \|contrast\| | −0.05 | [−0.18, +0.08] | 0.451 | 235 |
| Δ income (signed) vs contrast (signed) | +0.06 | [−0.08, +0.19] | 0.394 | 220 |
| Δ %superior (signed) vs contrast (signed) | −0.02 | [−0.15, +0.10] | 0.721 | 235 |

All five correlations lie within ±0.13, and every confidence interval straddles zero. The signed tests (direction of quota shift × direction of bias) are essentially zero: there is no evidence that sponsored polls systematically over-quota the demographic strata that would mechanically inflate the candidate's share.
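
The interval widths in the table are what a Fisher z-transform gives at these sample sizes. The sketch below is a check under that assumption: the source does not say which CI method the script uses, and `fisher_ci` is a hypothetical helper. Because it starts from the rounded r values, the last digit can differ slightly from the table on some rows.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.959964):
    """Approximate 95 % CI for a Pearson r via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Rounded r and n from the |delta| vs |contrast| rows.
income_ci = fisher_ci(0.02, 220)
education_ci = fisher_ci(-0.05, 235)
print([round(x, 2) for x in income_ci], [round(x, 2) for x in education_ci])
```

The half-width depends almost entirely on n (about 1.96 / sqrt(n − 3)), which is why the age row, with n = 164, has the widest interval.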

Interpretation

What the null does and does not rule out

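The 95 % CIs on the three |Δ| × |contrast| correlations (half-widths of about 0.13 to 0.16) exclude moderate-to-large positive associations: the upper bounds are +0.15 for income, +0.08 for education, and +0.28 for age, so r > 0.3 is effectively excluded on every axis and r > 0.15 on income and education. Small associations (|r| < 0.15) remain admissible throughout. With 110 high-bias pairs the analysis has good but not great power, so this is not a precise null.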

What it does say: the cheap-Tier-2 / cached-quota-distribution route finds no evidence that sponsored polls' quota distributions in income, age, or education differ from the matched independent poll's in a way that explains the within-pair bias contrast.

Reconciling with the blinded LLM-judge brief

The LLM-judge pilot pointed strongly at cotas de renda (income quotas), ponderação por renda (income weighting), and distribuição de renda (income distribution) as recurring high-plausibility themes in the high-bias pairs. The structural test on the same pairs does not corroborate this at the quota-distribution level.

Three readings survive:

- The LLM was confabulating. The 87.5 % agreement with the actual sponsored side may have been driven by the LLM picking up unrelated features of the texts (firm boilerplate, phrasing patterns) and confidently reading them as quota / coverage stories. The structural null here is consistent with this skepticism.
- The mechanism is ponderação, not quotas. Sponsored and independent polls may use similar quota distributions but apply different weighting / post-stratification corrections (or describe them with different specificity). The PollSampling schema does not capture ponderação description; this remains the open extension.
- The mechanism is finer than scalar summaries capture. Mean income (in SM) collapses a 4-5-bin distribution to one number; differential over-allocation of quota to specific income bands (say, the 1-2 SM bin) could move bias without changing the weighted mean. Future test: per-bin total variation distance (TVD) between aligned distributions.
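
The third reading can be made concrete with a toy example: two quota distributions over the same aligned bins can share a weighted mean while differing clearly in TVD. The shares and midpoints below are invented for illustration and are not drawn from any protocol; `tvd` and `weighted_mean` are hypothetical helper names.

```python
from typing import Sequence


def weighted_mean(midpoints: Sequence[float], shares: Sequence[float]) -> float:
    """Mean of bin midpoints weighted by bin shares (shares sum to 1)."""
    return sum(m * s for m, s in zip(midpoints, shares))


def tvd(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions on aligned bins."""
    if len(p) != len(q):
        raise ValueError("distributions must be aligned to the same bins")
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Made-up midpoints (SM) for three aligned income bands.
midpoints = [1.0, 3.0, 5.0]
sponsored = [0.4, 0.2, 0.4]  # mass pushed to both tails
indep = [0.3, 0.4, 0.3]      # mass concentrated in the middle band

mean_gap = weighted_mean(midpoints, sponsored) - weighted_mean(midpoints, indep)
distance = tvd(sponsored, indep)
print(f"mean gap = {mean_gap:.3f} SM, TVD = {distance:.3f}")
```

Here the scalar delta is zero while a fifth of the quota mass sits in different bins, which is the kind of pattern the scalar test in this analysis cannot see.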

Mechanism inventory after AN-055 + AN-056

| Lever | Status | Evidence |
|---|---|---|
| Bairro partisan composition | Reversed sign | AN-032 |
| Coverage class (flat) | Underpowered + 0 | AN-019 |
| Coverage × candidate-base (cheap Tier 2) | Null | AN-055 |
| Coverage deferral | Wrong-signed | AN-024 |
| Audit pct | Heavy overlap | AN-021 |
| Methodology completeness | Wrong-signed | AN-022 |
| Interviewer training | Wrong-signed | AN-042 |
| Mode | Wrong-signed | AN-041 |
| Nonresponse handling | Null-by-data-design | AN-043 |
| Income quota distribution (scalar) | Null | AN-056 |
| Age / education quota distribution | Null | AN-056 |
| Population reference frame | Weakly directional (p=0.12) | AN-056 |
| Name / scenario rotation | Positive (sp under-doc 5×, p ≈ 4×10⁻⁸) | AN-051 |
| Ponderação description | Not yet extracted | — |

The pattern: every structural lever tested has come back null, wrong-signed, weakly directional without significance (population reference frame), or, in AN-051's case, positive on a disclosure-quantity axis (under-documentation of scenario rotation) rather than a design-substantive axis.

Follow-ups

- Ponderação description extractor (next on the critical path). The one remaining theme from the blinded LLM-judge brief that hasn't been structurally tested. Schema fields to extract: described (bool), variables_weighted (list), target_population_claimed (enum), post_stratification_explicit (bool: does the text actually state that weights normalize back to population shares?), correction_method (free-text). Synchronous pilot on the same 338 pair protocols (~$2 with gpt-4o); analysis script mirrors AN-056's shape (within-pair "ponderação description differs" × contrast).
- Per-bin TVD test (robustness; same pair set, no new LLM calls). Re-do AN-056 at the bin level using a label-normalization + alignment step (canonical income bands <2 SM, 2-5 SM, 5+ SM). This catches "shift mass from low to mid" patterns that don't move the weighted mean but could be Channel A.
- Population-reference-differs follow-through. The directional but not-significant pop_ref pattern (25.5 % vs 14.6 %) could be the tip of a real iceberg if universe-scale extraction is restored: the cancelled sampling batch would unlock 14k+ protocols' population_reference, pushing this test from chi-square-on-244 to a vastly better-powered regression. The cancelled-batch resubmission decision now has a substantive argument behind it: the cheap pilot finds a weak directional pop-ref signal worth confirming at scale.
- Re-read the LLM brief against the structural nulls. The 87.5 % agreement statistic deserves a more careful audit: are the high-plausibility hypotheses concentrated on features the structural extractors already capture (and find no difference on), or on features outside the extractor scope? If the latter, the schema needs new fields; if the former, the LLM is confabulating and 87.5 % overstates its actual discrimination.
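
As a starting point for the first follow-up, the proposed fields could be carried in a small record type like the one below. This is a sketch under assumptions: the class name `PonderacaoDescription` and the helper `ponderacao_differs` are hypothetical, the members of the target_population_claimed enum are not listed in this note (so the field is left as an optional string), and the definition of "differs" used here, any structured field unequal, is only one possible choice.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class PonderacaoDescription:
    """Hypothetical container for the proposed weighting-description fields."""
    described: bool
    variables_weighted: List[str] = field(default_factory=list)
    # An enum in the proposal; its members are not specified here, so a plain
    # optional string stands in for it.
    target_population_claimed: Optional[str] = None
    post_stratification_explicit: bool = False
    correction_method: Optional[str] = None  # free text, not compared below


def ponderacao_differs(sponsored: PonderacaoDescription,
                       indep: PonderacaoDescription) -> bool:
    """Within-pair binary flag (assumption: any structured field differing counts)."""
    return (
        sponsored.described != indep.described
        or set(sponsored.variables_weighted) != set(indep.variables_weighted)
        or sponsored.target_population_claimed != indep.target_population_claimed
        or sponsored.post_stratification_explicit != indep.post_stratification_explicit
    )


# Illustrative records only.
sp = PonderacaoDescription(described=True, variables_weighted=["renda", "idade"])
ind = PonderacaoDescription(described=True, variables_weighted=["idade", "renda"])
same_pair_flag = ponderacao_differs(sp, ind)
```

Comparing variables_weighted as a set avoids flagging pairs that list the same weighting variables in a different order; the free-text correction_method is left out of the comparison because exact string equality would flag nearly every pair.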