id: an-056
hypothesis: methodology-flexibility-a
headline: "Within-pair quota-distribution deltas (income / age / education) do not correlate with within-pair bias contrast. On 244 curated sponsored × independent pairs with cached poll_sampling extractions (338/338 protocols cached; quota distributions on 336): |Δ income (SM)| vs |contrast| r = +0.02 (p=0.75), |Δ age| r = +0.13 (p=0.10), |Δ pct-superior| r = −0.05 (p=0.45); signed Δ income vs signed contrast r = +0.06 (p=0.39). Across-stratum: |Δ income| ≈ 1.1-1.2 SM in all three strata (high_bias, well_behaved, understating). Only population_reference differs shows a directional pattern: 25.5 % of high_bias pairs vs 14.6 % of well_behaved (chi-square p = 0.12, not significant). Reading: at the cached-quota-distribution granularity, sponsored polls do not deviate from indep-media polls in income / age / education quotas in a way that correlates with the observed bias. Combined with AN-055 (coverage × candidate-base null), the structural Channel A search continues to come up empty on the coverage + quota axis. Live frontiers remain AN-051 (scenario rotation, positive) and the unextracted ponderação description."
type: descriptive
status: interpreted
status_date: 2026-06-14
confidence: yellow
created: 2026-06-14
script: source/analysis/an-056-quota-distance-by-bias.py
target: build/table/an-056-quota-distance-by-bias.csv
design:
  sample: "244 curated sponsored × independent pairs from build/llm/curated_pairs/pairs.parquet (stratified: 110 high_bias, 89 well_behaved, 45 understating). 338/338 unique protocols have cached poll_sampling extractions (the cancellation of the universe-scale sampling batch left only the curated-pair pilot extractions on disk); 336/338 have quota_distributions, 334/338 have non-null population_reference. The pair-level usable n varies by variable due to bin-parsing coverage (income 220 pairs, age 164, education 235)."
  specification: "Per protocol, compute scalar quota summaries from the LLM-extracted bin_labels + bin_percentages: weighted-mean income (SM, midpoints with R$/SM auto-detect and 30-SM sanity cap), weighted-mean age (years), and percentage with any superior education. Per pair, compute within-pair deltas (sponsored − indep). Pearson correlations of |delta| vs |contrast| (manipulation-intensity test) and signed delta vs signed contrast (directional Channel A test). Population_reference differs as a binary cross-tabulated against stratum (chi-square)."
  comparator: indep-media poll on same muni × same candidate × within 14 days (pair structure inherited from curated_pairs)
  cluster: pair (correlations and chi-square treat pair as unit)

AN-056: Within-pair quota distribution distance × bias contrast

Question

The blinded LLM-judge pilot (docs/briefs/blinded_channel_a_pilot.md) flagged income-quota distributions, weighting, and coverage as the dominant high-plausibility Channel-A mechanism domains, with 14 / 16 high-plausibility hypotheses agreeing with the actual sponsored side (87.5 %, p ≈ 0.004). AN-055 ruled coverage × candidate-base out at cheap-Tier-2 granularity. This analysis tests the structural counterpart to the LLM-judge's income-quota theme: do sponsored polls' quota distributions systematically differ from the matched independent poll's in a way that correlates with the observed bias?

A positive result would identify quota choice as the Channel A lever (complementing AN-051's scenario-rotation finding). A null directs attention back toward the unextracted ponderação (weighting) description or elsewhere.

Design

source/analysis/an-056-quota-distance-by-bias.py:

  1. Load build/llm/curated_pairs/pairs.parquet (244 pairs, 338 protocols).
  2. Load cached poll_sampling LLM extractions from pipelines/politica/build/llm/poll_sampling/. All 338 pair protocols are cached (a fortunate side effect of the cancelled universe-scale batch: the curated-pair pilot ran first and survived). 336/338 have quota_distributions, 334/338 have non-null population_reference.
  3. Per protocol, summarise each variable's bin_labels+bin_percentages to a scalar: mean_income_sm, mean_age, pct_superior. The income parser handles SM-only, R$-only (converts at the 2024 minimum wage, R$ 1,412), and mixed labels; midpoints > 30 SM are treated as parse errors (sanity guard).
  4. Per pair, compute delta_* = sponsored − indep, abs_delta_* = absolute value.
  5. Tests:
    • Intensity (manipulation magnitude): Pearson r of |delta| vs |contrast|.
    • Direction (Channel A directional): Pearson r of delta vs contrast.
    • Population frame mismatch: chi-square of population_reference differs × stratum.
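Steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the code in an-056-quota-distance-by-bias.py: the function names, the regex, the handling of "até" and open-ended bins, and the example labels and shares are all assumptions. Only the R$ 1,412 conversion rate and the 30-SM cap come from the design above.

```python
import re

MIN_WAGE_2024 = 1412.0   # R$ per SM (salário mínimo), the rate given in step 3
MAX_MIDPOINT_SM = 30.0   # sanity cap from step 3


def bin_midpoint_sm(label):
    """Midpoint of one income bin in SM, or None when the label looks unparseable."""
    # Assumption: Brazilian number format ("." for thousands, "," for decimals).
    numbers = [float(tok.replace(".", "").replace(",", "."))
               for tok in re.findall(r"\d[\d.,]*", label)]
    if not numbers:
        return None
    if label.strip().lower().startswith("até"):
        bounds = [0.0, numbers[0]]   # assumption: "até X" is read as the bin 0..X
    else:
        bounds = numbers[:2]         # assumption: open-ended bins keep their single bound
    midpoint = sum(bounds) / len(bounds)
    if "R$" in label:                # unit auto-detect: convert reais to SM
        midpoint /= MIN_WAGE_2024
    return midpoint if midpoint <= MAX_MIDPOINT_SM else None


def weighted_mean(midpoints, percentages):
    """Percentage-weighted mean over the bins that parsed; None if nothing usable."""
    pairs = [(m, p) for m, p in zip(midpoints, percentages)
             if m is not None and p is not None]
    total = sum(p for _, p in pairs)
    if total <= 0:
        return None
    return sum(m * p for m, p in pairs) / total


def mean_income_sm(bin_labels, bin_percentages):
    return weighted_mean([bin_midpoint_sm(lab) for lab in bin_labels], bin_percentages)


# Hypothetical pair: same bin labels, different quota shares.
labels = ["Até 1 SM", "De 1 a 2 SM", "De 2 a 5 SM", "Mais de 5 SM"]
sponsored = mean_income_sm(labels, [30, 40, 20, 10])
indep = mean_income_sm(labels, [40, 40, 15, 5])
delta_income = sponsored - indep          # step 4: sponsored minus indep
abs_delta_income = abs(delta_income)
```

Renormalising by the sum of the parsed shares keeps a protocol usable when one bin fails the sanity cap, at the cost of biasing the mean toward the bins that parsed; dropping such protocols entirely would be the stricter alternative.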

Results

Across strata — mean |delta| does not differentiate

| metric | high_bias (n) | well_behaved (n) | understating (n) |
|---|---:|---:|---:|
| \|Δ income (SM)\| | 1.12 (98) | 1.23 (80) | 1.21 (42) |
| \|Δ age (years)\| | 3.42 (78) | 3.10 (58) | 3.47 (28) |
| \|Δ % superior ed\| | 4.65 (106) | 4.18 (85) | 2.24 (44) |
| pop_ref_differs | 25.5 % (110) | 14.6 % (89) | 15.6 % (45) |

Income and age quota deviations are essentially indistinguishable across the three pair strata; education deviations are similar in high_bias and well_behaved pairs (4.65 vs 4.18) and lower in understating pairs (2.24, n = 44). Only pop_ref_differs shows a directional pattern: high-bias pairs are ~1.7× more likely than well-behaved pairs to have a sponsored vs indep mismatch on the declared population reference. Chi-square test of independence: χ² = 4.26 on 2 d.f., p = 0.12, which is not statistically conclusive at this sample size.
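The chi-square figure can be checked by hand from the table. The sketch below assumes cell counts of 28/110, 13/89, and 7/45, which are reconstructed from the rounded percentages rather than read from the output CSV; the function name is illustrative.

```python
import math

# (pairs where population_reference differs, pairs in stratum)
# Assumption: counts back-calculated from 25.5 %, 14.6 %, 15.6 %.
counts = {"high_bias": (28, 110), "well_behaved": (13, 89), "understating": (7, 45)}


def chi_square_2xk(counts):
    """Pearson chi-square statistic for a 2 x k table of (differs, total) rows."""
    total = sum(n for _, n in counts.values())
    total_differs = sum(d for d, _ in counts.values())
    chi2 = 0.0
    for differs, n in counts.values():
        cells = ((differs, total_differs), (n - differs, total - total_differs))
        for observed, column_total in cells:
            expected = n * column_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


chi2 = chi_square_2xk(counts)
p_value = math.exp(-chi2 / 2)   # exact chi-square survival function for 2 d.f.
```

With 2 degrees of freedom the survival function reduces to exp(−χ²/2), so no statistics library is needed for this check.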

Correlations — quota intensity does not predict bias

| metric | r | 95 % CI | p | n |
|---|---:|:---:|---:|---:|
| \|Δ income\| vs \|contrast\| | +0.02 | [−0.11, +0.15] | 0.751 | 220 |
| \|Δ age\| vs \|contrast\| | +0.13 | [−0.03, +0.28] | 0.103 | 164 |
| \|Δ %superior\| vs \|contrast\| | −0.05 | [−0.18, +0.08] | 0.451 | 235 |
| Δ income (signed) vs contrast (signed) | +0.06 | [−0.08, +0.19] | 0.394 | 220 |
| Δ %superior (signed) vs contrast (signed) | −0.02 | [−0.15, +0.10] | 0.721 | 235 |

All correlations are within ±0.13 with confidence intervals straddling zero. The signed tests (direction-of-quota-shift × direction-of-bias) are essentially zero — sponsored polls do not systematically over-quota the demographic strata that would mechanically inflate the candidate.
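For reference, a minimal sketch of how an r and its 95 % CI of this kind can be computed. It assumes the intervals come from the Fisher z-transformation, which the note does not state; the function names are illustrative.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Confidence interval for r via the Fisher z-transformation (assumed method)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit / math.sqrt(n - 3)
    z = math.atanh(r)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Income intensity row: r = +0.02 on n = 220 pairs.
low, high = fisher_ci(0.02, 220)
```

Under that assumption the income row's interval reproduces the tabulated [−0.11, +0.15] after rounding.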

Interpretation

What the null does and does not rule out

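The 95 % CIs on the three |Δ| × |contrast| correlations (half-widths of roughly ±0.13-0.16) rule out moderate-to-large positive associations: the upper bounds are +0.15 (income), +0.28 (age), and +0.08 (education), so r > 0.3 is effectively excluded on every axis. They still admit small associations, up to about r = 0.15 for income and r = 0.28 for age. With 164-235 usable pairs per variable (110 of the 244 pairs are high-bias) the analysis has good but not great power. This is not a precise null.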
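One way to put a number on "good but not great power" is the smallest correlation detectable at a given power. The sketch below uses the standard Fisher z approximation with a two-sided α = 0.05 and 80 % power; those thresholds and the function name are assumptions, not settings taken from the analysis script.

```python
import math
from statistics import NormalDist


def min_detectable_r(n, alpha=0.05, power=0.80):
    """Smallest |r| detectable with the given power (Fisher z approximation)."""
    normal = NormalDist()
    z_total = normal.inv_cdf(1 - alpha / 2) + normal.inv_cdf(power)
    return math.tanh(z_total / math.sqrt(n - 3))


# Usable pair counts per variable, from the design section.
detectable = {name: min_detectable_r(n)
              for name, n in {"income": 220, "age": 164, "education": 235}.items()}
```

Under these assumptions the detectable correlation sits at roughly 0.18-0.22 for the three variables, which is in line with the reading that moderate associations are excluded and small ones are not.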

What it does say: the cheap-Tier-2 / cached-quota-distribution route finds no evidence that sponsored polls' quota distributions in income, age, or education differ from the matched independent poll's in a way that explains the within-pair bias contrast.

Reconciling with the blinded LLM-judge brief

The LLM-judge pilot pointed strongly at cotas de renda (income quotas), ponderação por renda (weighting by income), and distribuição de renda (income distribution) as recurring high-plausibility themes in the high-bias pairs. The structural test on the same pairs does not corroborate those themes at the quota-distribution level.

Three readings survive:

  1. The LLM was confabulating. The 87.5 % agreement with the actual sponsored side may have been driven by the LLM picking up unrelated features of the texts (firm boilerplate, phrasing patterns) and confidently reading them as quota / coverage stories. The structural null here is consistent with this skepticism.
  2. The mechanism is ponderação, not quotas. Sponsored and indep polls may use similar quota distributions but apply different weighting / post-stratification corrections (or describe them with different specificity). The PollSampling schema does not capture ponderação description; this remains the open extension.
  3. The mechanism is finer than scalar summaries capture. Mean income (in SM) collapses a 4-5-bin distribution to one number; differential over-quota of specific income bands (say, the 1-2 SM bin) could move bias without changing the weighted mean. Future test: per-bin total variation distance (TVD) between aligned distributions.
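The third reading can be made concrete with a small sketch of the per-bin TVD idea. The band names follow the canonical bands proposed in the follow-ups; the midpoints and quota shares are hypothetical, chosen so that the two distributions share a weighted mean.

```python
def normalise(shares):
    """Rescale a {band: share} mapping so the shares sum to 1."""
    total = sum(shares.values())
    return {band: value / total for band, value in shares.items()}


def tvd(p, q):
    """Total variation distance between two distributions over aligned bands."""
    p, q = normalise(p), normalise(q)
    bands = set(p) | set(q)
    return 0.5 * sum(abs(p.get(b, 0.0) - q.get(b, 0.0)) for b in bands)


def mean_over_bands(shares, midpoints):
    shares = normalise(shares)
    return sum(shares[b] * midpoints[b] for b in shares)


# Hypothetical midpoints (SM) and quota shares (%), for illustration only.
midpoints = {"<2 SM": 1.0, "2-5 SM": 3.5, "5+ SM": 6.0}
sponsored = {"<2 SM": 20, "2-5 SM": 60, "5+ SM": 20}
indep = {"<2 SM": 30, "2-5 SM": 40, "5+ SM": 30}

delta_mean = mean_over_bands(sponsored, midpoints) - mean_over_bands(indep, midpoints)
distance = tvd(sponsored, indep)
```

Here the weighted means coincide (delta_mean is zero up to rounding) while a fifth of the quota mass has moved into the middle band, which is the kind of shift the scalar test in this analysis cannot see.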

Mechanism inventory after AN-055 + AN-056

| Lever | Status | Evidence |
|---|---|---|
| Bairro partisan composition | Reversed sign | AN-032 |
| Coverage class (flat) | Underpowered + 0 | AN-019 |
| Coverage × candidate-base (cheap Tier 2) | Null | AN-055 |
| Coverage deferral | Wrong-signed | AN-024 |
| Audit pct | Heavy overlap | AN-021 |
| Methodology completeness | Wrong-signed | AN-022 |
| Interviewer training | Wrong-signed | AN-042 |
| Mode | Wrong-signed | AN-041 |
| Nonresponse handling | Null-by-data-design | AN-043 |
| Income quota distribution (scalar) | Null | AN-056 |
| Age / education quota distribution | Null | AN-056 |
| Population reference frame | Weakly directional (p=0.12) | AN-056 |
| Name / scenario rotation | Positive (sp under-doc 5×, p ≈ 4×10⁻⁸) | AN-051 |
| Ponderação description | Not yet extracted | |

The pattern: every structural lever tested has come back null, wrong-signed, or — in AN-051's case — positive on a disclosure-quantity axis (under-documentation of scenario rotation) rather than a design-substantive axis.

Follow-ups

  1. Ponderação description extractor (next on the critical path). The one remaining theme from the blinded LLM-judge brief that hasn't been structurally tested. Schema fields to extract: described (bool), variables_weighted (list), target_population_claimed (enum), post_stratification_explicit (bool — does the text actually state weights normalize back to population shares?), correction_method (free-text). Synchronous pilot on the same 338 pair protocols (~$2 with gpt-4o); analysis script mirrors AN-056's shape (within-pair "ponderação description differs" × contrast).
  2. Per-bin TVD test (robustness — same pair set, no new LLM). Re-do AN-056 at the bin level using a label-normalization + alignment step (canonical income bands <2 SM, 2-5 SM, 5+ SM). This catches "shift mass from low to mid" patterns that don't move the weighted mean but could be Channel A.
  3. Population-reference-differs follow-through. The directional but not-significant pop_ref pattern (25.5 % vs 14.6 %) could be the tip of a real iceberg if universe-scale extraction is restored — the cancelled sampling batch would unlock 14k+ protocols' population_reference, pushing this test from chi-square-on-244 to a vastly better-powered regression. The cancelled-batch resubmission decision now has a substantive argument behind it: the cheap pilot finds a weak directional pop-ref signal worth confirming at scale.
  4. Re-read the LLM brief against the structural nulls. The 87.5 % agreement statistic deserves a more careful audit: are the high-plausibility hypotheses concentrated on features the structural extractors would capture (and don't), or on features outside the extractor scope? If the latter, the schema needs new fields; if the former, the LLM is confabulating and 87.5 % is overstating actual discrimination.
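The schema fields listed in follow-up 1 could be captured along these lines. This is a sketch only: the class name, the defaults, and the pair-level "differs" rule are assumptions, and the members of the target_population_claimed enum are not specified in the plan, so the field is left as an optional string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PonderacaoDescription:
    """Hypothetical container for the fields named in follow-up 1."""
    described: bool
    variables_weighted: List[str] = field(default_factory=list)
    target_population_claimed: Optional[str] = None   # enum in the plan; members unspecified
    post_stratification_explicit: bool = False
    correction_method: str = ""                        # free text, not compared below


def ponderacao_differs(sponsored, indep):
    """Assumed pair-level flag: True when any structured field differs."""
    return (
        sponsored.described != indep.described
        or sorted(sponsored.variables_weighted) != sorted(indep.variables_weighted)
        or sponsored.target_population_claimed != indep.target_population_claimed
        or sponsored.post_stratification_explicit != indep.post_stratification_explicit
    )


# Hypothetical pair of extractions.
sp = PonderacaoDescription(described=True, variables_weighted=["renda", "idade"])
ind = PonderacaoDescription(described=True, variables_weighted=["idade", "renda"],
                            post_stratification_explicit=True)
flag = ponderacao_differs(sp, ind)
```

Leaving the free-text correction_method out of the comparison avoids flagging pairs that merely word the same method differently; whether that is the right rule depends on how the extractor normalises the text.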