id: an-056
hypothesis: methodology-flexibility-a
headline: "Within-pair quota-distribution deltas (income / age / education) do not correlate with within-pair bias contrast. On 244 curated sponsored × independent pairs with cached poll_sampling extractions (338/338 protocols cached; quota distributions on 336): |Δ income (SM)| vs |contrast| r = +0.02 (p=0.75), |Δ age| r = +0.13 (p=0.10), |Δ pct-superior| r = −0.05 (p=0.45); signed Δ income vs signed contrast r = +0.06 (p=0.39). Across-stratum: |Δ income| ≈ 1.1-1.2 SM in all three strata (high_bias, well_behaved, understating). Only population_reference differs shows a directional pattern: 25.5 % of high_bias pairs vs 14.6 % of well_behaved (chi-square p = 0.12, not significant). Reading: at the cached-quota-distribution granularity, sponsored polls do not deviate from indep-media polls in income / age / education quotas in a way that correlates with the observed bias. Combined with AN-055 (coverage × candidate-base null), the structural Channel A search continues to come up empty on the coverage + quota axis. Live frontiers remain AN-051 (scenario rotation, positive) and the unextracted ponderação description."
type: descriptive
status: interpreted
status_date: 2026-06-14
confidence: yellow
created: 2026-06-14
script: source/analysis/an-056-quota-distance-by-bias.py
target: build/table/an-056-quota-distance-by-bias.csv
design:
sample: 244 curated sponsored × independent pairs from build/llm/curated_pairs/pairs.parquet (stratified: 110 high_bias, 89 well_behaved, 45 understating). 338/338 unique protocols have cached poll_sampling extractions (the cancellation of the universe-scale sampling batch left only the curated-pair pilot extractions on disk); 336/338 have quota_distributions, 334/338 have non-null population_reference. The pair-level usable n varies by variable due to bin-parsing coverage (income 220 pairs, age 164, education 235).
specification: "Per protocol, compute scalar quota summaries from the LLM-extracted bin_labels + bin_percentages: weighted-mean income (SM, midpoints with R$/SM auto-detect and 30-SM sanity cap), weighted-mean age (years), and percentage with any superior education. Per pair, compute within-pair deltas (sponsored − indep). Pearson correlations of |delta| vs |contrast| (manipulation-intensity test) and signed delta vs signed contrast (directional Channel A test). Population_reference differs as a binary cross-tabulated against stratum (chi-square)."
comparator: indep-media poll on same muni × same candidate × within 14 days (pair structure inherited from curated_pairs)
cluster: pair (correlations and chi-square treat pair as unit)
AN-056: Within-pair quota distribution distance × bias contrast
Question
The blinded LLM-judge pilot
(docs/briefs/blinded_channel_a_pilot.md)
flagged income-quota distributions, weighting, and coverage as the
dominant high-plausibility Channel-A mechanism domains, with 14 / 16
high-plausibility hypotheses agreeing with the actual sponsored side
(87.5 %, p ≈ 0.004). AN-055 ruled coverage × candidate-base out at
cheap-Tier-2 granularity. This analysis tests the structural
counterpart to the LLM-judge's income-quota theme: do sponsored
polls' quota distributions systematically differ from the matched
independent poll's in a way that correlates with the observed bias?
A positive result would identify quota choice as the Channel A lever (complementing AN-051's scenario-rotation finding). A null directs attention back toward the unextracted ponderação description or elsewhere.
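
The 14 / 16 agreement figure can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the pilot's p-value is an exact two-sided binomial test against a 50 % chance rate; the brief's actual test is not stated here, and the function name is illustrative.

```python
from math import comb


def binomial_two_sided_p(successes, trials):
    """Exact two-sided binomial p-value against p0 = 0.5.

    With p0 = 0.5 the distribution is symmetric, so the two-sided
    p-value is twice the upper-tail probability (capped at 1).
    """
    k = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


# 14 of 16 high-plausibility hypotheses agreed with the sponsored side.
p_value = binomial_two_sided_p(14, 16)
agreement = 14 / 16
```

Under that assumption the p-value is 2 × 137 / 65536, in line with the reported p ≈ 0.004.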
Design
source/analysis/an-056-quota-distance-by-bias.py:
- Load build/llm/curated_pairs/pairs.parquet (244 pairs, 338 protocols).
- Load cached poll_sampling LLM extractions from pipelines/politica/build/llm/poll_sampling/. All 338 pair protocols are cached (a fortunate side effect of the cancelled universe-scale batch: the curated-pair pilot ran first and survived). 336/338 have quota_distributions, 334/338 have non-null population_reference.
- Per protocol, summarise each variable's bin_labels + bin_percentages to a scalar: mean_income_sm, mean_age, pct_superior. The income parser handles SM-only, R$-only (converts at the 2024 minimum wage, R$ 1,412), and mixed labels; midpoints > 30 SM are treated as parse errors (sanity guard).
- Per pair, compute delta_* = sponsored − indep, abs_delta_* = absolute value.
- Tests:
  - Intensity (manipulation magnitude): Pearson r of |delta| vs |contrast|.
  - Direction (Channel A directional): Pearson r of delta vs contrast.
  - Population frame mismatch: chi-square of population_reference differs × stratum.
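
The per-protocol summary and within-pair delta steps above can be sketched as follows. This is an illustration, not the script's actual parser. The label formats, the rule that open-ended bins use their single bound as the midpoint, and the function names are all assumptions. Only the R$ 1,412 conversion rate and the 30-SM cap come from the design.

```python
import re

MIN_WAGE_2024_BRL = 1412.0  # R$ per SM, the conversion rate stated in the design
MAX_MIDPOINT_SM = 30.0      # midpoints above this are treated as parse errors


def bin_midpoint_sm(label):
    """Return the midpoint of an income bin in SM, or None if unusable.

    Assumed rules: Brazilian number format ("1.412" = 1412, "2,5" = 2.5),
    labels containing "R$" are converted to SM, and a bin with a single
    number ("até 2 SM") uses that number as its midpoint.
    """
    cleaned = label.replace(".", "").replace(",", ".")
    numbers = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", cleaned)]
    if not numbers:
        return None
    bounds = numbers[:2]
    midpoint = sum(bounds) / len(bounds)
    if "R$" in label:
        midpoint /= MIN_WAGE_2024_BRL
    return midpoint if midpoint <= MAX_MIDPOINT_SM else None


def weighted_mean_income_sm(bin_labels, bin_percentages):
    """Percentage-weighted mean of bin midpoints, skipping unparseable bins."""
    usable = []
    for label, pct in zip(bin_labels, bin_percentages):
        midpoint = bin_midpoint_sm(label)
        if midpoint is not None and pct is not None:
            usable.append((midpoint, pct))
    total = sum(pct for _, pct in usable)
    if total <= 0:
        return None
    return sum(midpoint * pct for midpoint, pct in usable) / total


# Made-up pair: one protocol labelled in SM, the other in R$.
sponsored = weighted_mean_income_sm(
    ["até 2 SM", "de 2 a 5 SM", "mais de 5 SM"], [50, 30, 20]
)
indep = weighted_mean_income_sm(
    ["R$ 1.412 a R$ 2.824", "R$ 2.824 a R$ 7.060"], [60, 40]
)
delta_income = sponsored - indep   # sponsored − indep, in SM
abs_delta_income = abs(delta_income)
```

Renormalising by the usable percentage total keeps a protocol with one unparseable bin from being dragged toward zero. Whether the real script drops such protocols instead is not stated.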
Results
Across strata — mean |delta| does not differentiate
| metric | high_bias (n) | well_behaved (n) | understating (n) |
|---|---:|---:|---:|
| \|Δ income (SM)\| | 1.12 (98) | 1.23 (80) | 1.21 (42) |
| \|Δ age (years)\| | 3.42 (78) | 3.10 (58) | 3.47 (28) |
| \|Δ % superior ed\| | 4.65 (106) | 4.18 (85) | 2.24 (44) |
| pop_ref_differs | 25.5 % (110) | 14.6 % (89) | 15.6 % (45) |
Income and age quota deviations are essentially indistinguishable
across the three pair strata. Education deviations are similar in
high_bias and well_behaved pairs (4.65 vs 4.18 points) and smaller
only in understating pairs (2.24). Only pop_ref_differs shows a
directional pattern: high-bias pairs are ~1.7× more likely to have a
sponsored vs indep mismatch on the declared population reference than
well-behaved pairs (25.5 % vs 14.6 %). Chi-square test of
independence: χ² = 4.26 on 2 d.f., p = 0.12, which is not
statistically conclusive at this sample size.
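
The chi-square figure can be reproduced from the table. The sketch below assumes cell counts of 28 / 110, 13 / 89, and 7 / 45. These are back-calculated from the reported percentages and stratum sizes, not read from the output CSV. The helper name is illustrative, and the closed-form p-value holds only for 2 degrees of freedom.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: high_bias, well_behaved, understating.
# Columns: pop_ref differs, pop_ref same (counts are an assumption, see above).
counts = [[28, 82], [13, 76], [7, 38]]
chi2_stat, dof = chi_square_independence(counts)

# For dof == 2 the chi-square survival function is exactly exp(-x / 2).
p_value = math.exp(-chi2_stat / 2)
```

With these counts the statistic rounds to 4.26 and the p-value to 0.12, matching the reported test.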
Correlations — quota intensity does not predict bias
| metric | r | 95 % CI | p | n |
|---|---:|:---:|---:|---:|
| \|Δ income\| vs \|contrast\| | +0.02 | [−0.11, +0.15] | 0.751 | 220 |
| \|Δ age\| vs \|contrast\| | +0.13 | [−0.03, +0.28] | 0.103 | 164 |
| \|Δ %superior\| vs \|contrast\| | −0.05 | [−0.18, +0.08] | 0.451 | 235 |
| Δ income (signed) vs contrast (signed) | +0.06 | [−0.08, +0.19] | 0.394 | 220 |
| Δ %superior (signed) vs contrast (signed) | −0.02 | [−0.15, +0.10] | 0.721 | 235 |
All correlations are within ±0.13 with confidence intervals straddling zero. The signed tests (direction-of-quota-shift × direction-of-bias) are essentially zero — sponsored polls do not systematically over-quota the demographic strata that would mechanically inflate the candidate.
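
The interval widths in the table are what a Fisher z-transform gives at these sample sizes. The sketch below assumes that is how the CIs were computed, which the script may or may not do. It starts from the rounded r values, so the second decimal can differ slightly from the table for some rows.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)                 # Fisher z
    se = 1.0 / math.sqrt(n - 3)       # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


income_ci = pearson_ci(0.02, 220)      # |Δ income| vs |contrast|
education_ci = pearson_ci(-0.05, 235)  # |Δ %superior| vs |contrast|
```

Because the standard error depends only on n, the age row (n = 164) necessarily has the widest interval of the three.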
Interpretation
What the null does and does not rule out
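The 95 % CIs on the three |Δ| × |contrast| correlations (half-widths ≈ 0.13-0.16) rule out moderate-to-large positive associations: the upper bounds are +0.15 for income, +0.08 for education, and +0.28 for age, so r > 0.3 is effectively excluded on every variable. They still admit small associations (|r| < 0.15 for income and education, somewhat larger for age, which has the smallest usable n at 164 pairs). With 110 high-bias pairs the analysis has good but not great power. This is not a precise null.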
What it does say: the cheap-Tier-2 / cached-quota-distribution route finds no evidence that sponsored polls' quota distributions in income, age, or education differ from the matched independent poll's in a way that explains the within-pair bias contrast.
Reconciling with the blinded LLM-judge brief
The LLM-judge pilot pointed strongly at cotas de renda,
ponderação por renda, and distribuição de renda as recurring
high-plausibility themes in the high-bias pairs. The structural test
on the SAME pairs does not corroborate at the quota-distribution level.
Three readings survive:
- The LLM was confabulating. The 87.5 % agreement with the actual sponsored side may have been driven by the LLM picking up unrelated features of the texts (firm boilerplate, phrasing patterns) and confidently reading them as quota / coverage stories. The structural null here is consistent with this skepticism.
- The mechanism is ponderação, not quotas. Sponsored and indep polls may use similar quota distributions but apply different weighting / post-stratification corrections (or describe them with different specificity). The PollSampling schema does not capture ponderação description; this remains the open extension.
- The mechanism is finer than scalar summaries capture. Mean income (in SM) collapses a 4-5-bin distribution to one number; differential over-quoting of specific income bands (say, the 1-2 SM bin) could move bias without changing the weighted mean. Future test: per-bin TVD between aligned distributions.
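
The third reading can be made concrete with a toy example: two quota distributions that share the same weighted mean but differ in shape. The band midpoints (1.0, 3.5, 7.5 SM) and the shares are invented for illustration; TVD here is half the sum of absolute per-bin differences.

```python
def weighted_mean(midpoints, shares):
    """Share-weighted mean of bin midpoints (shares are assumed to sum to 1)."""
    return sum(m * s for m, s in zip(midpoints, shares))


def total_variation_distance(p, q):
    """TVD between two distributions defined on the same aligned bins."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Assumed midpoints for canonical bands <2 SM, 2-5 SM, 5+ SM.
midpoints = [1.0, 3.5, 7.5]

# Hypothetical pair: the sponsored poll moves mass out of the middle band
# and into both tails, leaving the mean unchanged.
indep_shares = [0.50, 0.30, 0.20]
sponsored_shares = [0.58, 0.17, 0.25]

mean_gap = weighted_mean(midpoints, sponsored_shares) - weighted_mean(midpoints, indep_shares)
tvd = total_variation_distance(sponsored_shares, indep_shares)
```

Here the mean gap is zero while 13 % of the quota mass sits in different bands, which a scalar-delta test cannot see.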
Mechanism inventory after AN-055 + AN-056
| Lever | Status | Evidence |
|---|---|---|
| Bairro partisan composition | Reversed sign | AN-032 |
| Coverage class (flat) | Underpowered + 0 | AN-019 |
| Coverage × candidate-base (cheap Tier 2) | Null | AN-055 |
| Coverage deferral | Wrong-signed | AN-024 |
| Audit pct | Heavy overlap | AN-021 |
| Methodology completeness | Wrong-signed | AN-022 |
| Interviewer training | Wrong-signed | AN-042 |
| Mode | Wrong-signed | AN-041 |
| Nonresponse handling | Null-by-data-design | AN-043 |
| Income quota distribution (scalar) | Null | AN-056 |
| Age / education quota distribution | Null | AN-056 |
| Population reference frame | Weakly directional (p=0.12) | AN-056 |
| Name / scenario rotation | Positive (sp under-doc 5×, p ≈ 4×10⁻⁸) | AN-051 |
| Ponderação description | Not yet extracted | — |
The pattern: every structural lever tested has come back null, wrong-signed, or — in AN-051's case — positive on a disclosure-quantity axis (under-documentation of scenario rotation) rather than a design-substantive axis.
Follow-ups
- Ponderação description extractor (next on the critical path). The one remaining theme from the blinded LLM-judge brief that hasn't been structurally tested. Schema fields to extract: described (bool), variables_weighted (list), target_population_claimed (enum), post_stratification_explicit (bool: does the text actually state weights normalize back to population shares?), correction_method (free-text). Synchronous pilot on the same 338 pair protocols (~$2 with gpt-4o); analysis script mirrors AN-056's shape (within-pair "ponderação description differs" × contrast).
- Per-bin TVD test (robustness; same pair set, no new LLM). Re-do AN-056 at the bin level using a label-normalization + alignment step (canonical income bands <2 SM, 2-5 SM, 5+ SM). This catches "shift mass from low to mid" patterns that don't move the weighted mean but could be Channel A.
- Population-reference-differs follow-through. The directional but not-significant pop_ref pattern (25.5 % vs 14.6 %) could be the tip of a real iceberg if universe-scale extraction is restored: the cancelled sampling batch would unlock 14k+ protocols' population_reference, pushing this test from chi-square-on-244 to a vastly better-powered regression. The cancelled-batch resubmission decision now has a substantive argument behind it: the cheap pilot finds a weak directional pop-ref signal worth confirming at scale.
- Re-read the LLM brief against the structural nulls. The 87.5 % agreement statistic deserves a more careful audit: are the high-plausibility hypotheses concentrated on features the structural extractors would capture (and don't), or on features outside the extractor scope? If the latter, the schema needs new fields; if the former, the LLM is confabulating and 87.5 % is overstating actual discrimination.
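
A possible shape for the ponderação extraction record in the first follow-up, with a within-pair "differs" flag, is sketched below. The field names follow the list above. The class name, the defaults, and the rule that the flag compares structured fields only are assumptions, and the enum values for target_population_claimed are left open because they are not yet defined.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PonderacaoDescription:
    """Hypothetical record for one protocol's weighting description."""
    described: bool
    variables_weighted: Tuple[str, ...] = ()
    # Proposed as an enum; values are not specified yet, so a string stands in.
    target_population_claimed: Optional[str] = None
    post_stratification_explicit: bool = False
    correction_method: str = ""  # free text, excluded from the comparison below


def description_differs(sponsored, indep):
    """Within-pair flag: True if any structured field differs.

    Variable order is ignored and the free-text field is skipped, since
    wording differences alone would flag nearly every pair.
    """
    def key(d):
        return (
            d.described,
            frozenset(d.variables_weighted),
            d.target_population_claimed,
            d.post_stratification_explicit,
        )
    return key(sponsored) != key(indep)


# Made-up pair for illustration.
sp = PonderacaoDescription(described=True, variables_weighted=("renda", "idade"))
ind = PonderacaoDescription(
    described=True,
    variables_weighted=("idade", "renda"),
    correction_method="rake to census margins",
)
same_structure = not description_differs(sp, ind)
```

The resulting boolean could then be cross-tabulated against stratum in the same way as pop_ref_differs.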