id: an-057
hypothesis: methodology-flexibility-a
headline: "Within-pair ponderação description does differ systematically between sponsored and indep polls, but in the AN-042 selective-disclosure direction, not the simple Channel A 'sponsors hide methodology' direction. On 244 curated sponsored × independent pairs (338/338 protocols extracted via the new poll_weighting LLM extractor at gpt-4o-mini): (i) the binary flag described_differs is 27.3 % in high_bias vs 16.9 % in well_behaved vs 11.1 % in understating (chi-square p = 0.044, monotone), (ii) postst_differs (post_stratification_explicit binary differs between sides) correlates with |contrast| at r = +0.13 (p = 0.042), (iii) directional asymmetry: when post_stratification_explicit differs between sides, the indep side is MORE LIKELY to omit it (68 / 114 = 59.6 %, binomial p = 0.049). Reading: sponsored polls describe weighting AND post-stratification with greater specificity than the matched indep comparator. This matches AN-042's reframing: sponsored polls under-document sample-shape but over-document visible-rigor dimensions. The pattern is real but does not directly explain the +7 pp bias; the simplest reconciliation is that the claimed corrections do not actually neutralise the quota choices (claimed-but-ineffective ponderação), but identifying that would require a downstream effectiveness test we don't yet have."
type: descriptive
status: interpreted
status_date: 2026-06-14
confidence: yellow
created: 2026-06-14
script: source/analysis/an-057-weighting-by-bias.py
target: build/table/an-057-weighting-by-bias.csv
design:
  sample: >
    244 curated sponsored × independent pairs (110 high_bias, 89 well_behaved, 45 understating) from build/llm/curated_pairs/pairs.parquet.
    Backing extraction: pipelines/politica/source/llm/poll_weighting.py (new, schema_name="poll_weighting", schema_version="v1") ran synchronously on all 338 pair protocols 2026-06-14, gpt-4o-mini, 101s wall, 337 fresh calls + 1 cached.
    Per-protocol marginals: weighting described in 290/338 (85.8 %); application conditional in 231 (68.3 %); target tse_eleitorado 111 / ibge_2022 82 / ibge_2010 70 / not_specified 63; post_stratification_explicit 172/338 (50.9 %).
  specification: "Per pair, compute five within-pair *_differs binary indicators: described_differs, application_differs, target_differs, postst_differs, vars_set_differs, plus the directional flags sp_uniquely_no_postst and ip_uniquely_no_postst. Tests: (a) cross-stratum chi-square of each binary × stratum, (b) Pearson r of each binary vs |contrast|, (c) two-sided exact binomial test of directional asymmetries (sp_uniquely_no_X vs ip_uniquely_no_X)."
  comparator: indep-media poll on same muni × same candidate × within 14 days
  cluster: pair

AN-057: Within-pair ponderação description × bias contrast

Question

After AN-055 (coverage × candidate-base, null) and AN-056 (quota-distribution deltas, null), the only remaining theme the blinded LLM-judge brief identified is the weighting / ponderação description. PollSampling captures quota design but not whether or how a post-fielding correction normalises the realised sample back to a population reference. This analysis fills that gap with a new extractor and tests within-pair ponderação description differences against the observed bias contrast.

The expectation under the simple "sponsors hide methodology" Channel A story: sponsored polls should describe weighting LESS often and less explicitly than the matched indep poll, with the disparity larger in high-bias pairs. AN-042 already complicated this by finding the opposite pattern on interviewer-training and supervisor-role descriptions. AN-057 tests whether weighting follows the AN-042 "selective disclosure" pattern or breaks it.

Design

source/analysis/an-057-weighting-by-bias.py:

  1. Load build/llm/curated_pairs/pairs.parquet (244 pairs, 338 protocols).
  2. Load pipelines/politica/build/llm/poll_weighting/*.json — 338 extractions cached from the synchronous pilot (source/llm/curated_pairs_weighting.py, gpt-4o-mini, 101s wall).
  3. Per pair, compute five within-pair *_differs binaries: described_differs, application_differs, target_differs, postst_differs (post_stratification_explicit), vars_set_differs (Jaccard ≠ 1 on variables_weighted set), plus the directional flags sp_uniquely_no_postst / ip_uniquely_no_postst.
  4. Tests:
    • Chi-square of each binary × stratum (high_bias / well_behaved / understating)
    • Pearson r of each binary vs |contrast|
    • Two-sided binomial test of sp_uniquely_no_postst vs ip_uniquely_no_postst (directional asymmetry).
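A minimal sketch of step 3, assuming each extraction is a plain dict. The field names post_stratification_explicit and variables_weighted come from the extractor output described here; the keys `described`, `application` and `target`, the helper name `pair_flags`, and the example records are hypothetical and may not match the real script.

```python
from typing import Any


def pair_flags(sp: dict[str, Any], ip: dict[str, Any]) -> dict[str, bool]:
    """Within-pair indicators for one sponsored (sp) / independent (ip) pair."""
    sp_vars = set(sp["variables_weighted"])
    ip_vars = set(ip["variables_weighted"])
    union = sp_vars | ip_vars
    # Jaccard similarity; two empty sets are treated as identical (assumption).
    jaccard = len(sp_vars & ip_vars) / len(union) if union else 1.0

    sp_post = bool(sp["post_stratification_explicit"])
    ip_post = bool(ip["post_stratification_explicit"])

    return {
        "described_differs": sp["described"] != ip["described"],
        "application_differs": sp["application"] != ip["application"],
        "target_differs": sp["target"] != ip["target"],
        "postst_differs": sp_post != ip_post,
        "vars_set_differs": jaccard != 1.0,
        # Directional flags: which side is the only one lacking explicit post-stratification.
        "sp_uniquely_no_postst": (not sp_post) and ip_post,
        "ip_uniquely_no_postst": sp_post and (not ip_post),
    }


# Hypothetical records, for illustration only.
sponsored = {
    "described": True,
    "application": "conditional",
    "target": "tse_eleitorado",
    "post_stratification_explicit": True,
    "variables_weighted": ["sexo", "idade"],
}
independent = {
    "described": True,
    "application": "conditional",
    "target": "ibge_2022",
    "post_stratification_explicit": False,
    "variables_weighted": ["sexo", "idade", "renda"],
}
flags = pair_flags(sponsored, independent)
```

By construction, postst_differs is true exactly when one of the two directional flags is true, which is why the directional counts sum to the postst_differs count in each stratum.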

Results

Within-pair 'differs' rates by stratum

| metric | high_bias (n=110) | well_behaved (n=89) | understating (n=45) | chi-square p |
|---|---:|---:|---:|---:|
| described_differs | 27.3 % | 16.9 % | 11.1 % | 0.044 |
| application_differs | 43.6 % | 37.1 % | 28.9 % | 0.217 |
| target_differs | 69.1 % | 76.4 % | 68.9 % | 0.469 |
| postst_differs | 50.9 % | 42.7 % | 44.4 % | 0.485 |
| vars_set_differs | 74.5 % | 74.2 % | 73.3 % | 0.988 |
| sp_uniquely_no_postst | 17.3 % | 24.7 % | 11.1 % | |
| ip_uniquely_no_postst | 33.6 % | 18.0 % | 33.3 % | |
| sp_uniquely_no_described | 10.0 % | 9.0 % | 2.2 % | |

described_differs shows the cleanest monotone gradient — high-bias pairs are 2.5× more likely than understating pairs to have one side describe weighting while the other does not.
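The described_differs chi-square can be checked by hand. The sketch below uses only the standard library and assumes the cell counts implied by the percentages (30 of 110, 15 of 89, 5 of 45); these counts are reconstructed by rounding, not read from the output CSV. The function names are illustrative.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def chi2_sf_dof2(stat):
    """Upper-tail probability of a chi-square with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2)


# Rows: high_bias, well_behaved, understating. Columns: differs, does not differ.
# Counts reconstructed from 27.3 % of 110, 16.9 % of 89, 11.1 % of 45 (assumption).
counts = [[30, 80], [15, 74], [5, 40]]
stat, dof = chi_square_independence(counts)
p_value = chi2_sf_dof2(stat)
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p_value:.3f}")
```

The closed-form tail applies only because a 3 × 2 table has two degrees of freedom; other table shapes need a general chi-square survival function.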

Correlation with |contrast|

| metric | r | 95 % CI | p |
|---|---:|---:|---:|
| postst_differs | +0.130 | [+0.00, +0.25] | 0.042 |
| application_differs | +0.109 | [−0.02, +0.23] | 0.090 |
| described_differs | +0.098 | [−0.03, +0.22] | 0.126 |
| vars_set_differs | +0.064 | [−0.06, +0.19] | 0.320 |
| target_differs | +0.019 | [−0.11, +0.14] | 0.773 |
| sp_uniquely_no_postst | −0.024 | [−0.15, +0.10] | 0.704 |
| sp_uniquely_no_described | +0.020 | [−0.11, +0.15] | 0.756 |

postst_differs carries a small but significant correlation: pairs where the two polls disagree about whether post-stratification is explicit are pairs with larger absolute bias.
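The reported intervals are consistent with a Fisher z-transform confidence interval at n = 244. Whether the script computes them this way is an assumption; the sketch below only shows that the transform reproduces the postst_differs row from r and n alone.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, level=0.95):
    """Confidence interval for a Pearson r via the Fisher z-transform."""
    z = math.atanh(r)                 # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    q = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - q * se), math.tanh(z + q * se)


low, high = pearson_ci(0.130, 244)
print(f"[{low:+.2f}, {high:+.2f}]")
```

The lower bound sits essentially at zero, which is the same borderline picture the p = 0.042 gives.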

Directional asymmetry — opposite of the naïve prediction

| field | sponsored uniquely missing | indep uniquely missing | total asymmetric | binomial p (50/50) |
|---|---:|---:|---:|---:|
| post_stratification_explicit | 46 | 68 | 114 | 0.049 |
| described | 20 | 30 | 50 | 0.203 |

When the two sides disagree, the indep side is the one more likely to omit explicit post-stratification language: in 68 of the 114 discordant pairs the sponsored poll is the only one that states it. The 46-vs-68 split is only marginally distinguishable from 50/50 (exact binomial p = 0.049); the corresponding split for described (20 vs 30) is not (p = 0.203).
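The directional test is a two-sided exact binomial test against a 50/50 split over the discordant pairs. A minimal standard-library sketch follows; the function name is illustrative, and doubling the smaller tail is valid here only because the null proportion is 0.5, which makes the distribution symmetric.

```python
from math import comb


def binom_two_sided_half(k, n):
    """Exact two-sided binomial p-value for H0: p = 0.5, with k successes in n trials."""
    tail = min(k, n - k)
    # Probability of a result at least as extreme in the smaller tail.
    lower = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    # Symmetric null: double the tail, capped at 1.
    return min(1.0, 2 * lower)


p_postst = binom_two_sided_half(46, 114)     # sponsored uniquely missing: 46 of 114
p_described = binom_two_sided_half(20, 50)   # sponsored uniquely missing: 20 of 50
print(f"postst p = {p_postst:.3f}, described p = {p_described:.3f}")
```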

Interpretation

What this confirms

Three signals reach p < 0.05:

  1. described_differs × stratum: monotone gradient (high_bias > well_behaved > understating).
  2. postst_differs × |contrast|: positive correlation.
  3. Directional asymmetry: among pairs that disagree on post_stratification_explicit, the sponsored poll is the explicit side more often than the indep poll (68 vs 46).

These are real patterns. The first two together imply that something about the ponderação description scales with the within-pair bias — pairs where the bias is large are more likely to have one side describing weighting and one side not, or one explicit about post-stratification and the other not.

What's surprising — the direction

The directional pattern is opposite the naïve "sponsors hide methodology" prediction. Sponsored polls over-document ponderação and post-stratification, just as AN-042 found for interviewer training and supervisor role. The selective-disclosure reframing now extends to weighting language.

Two readings of the direction:

(R1) Claimed-but-ineffective post-stratification. Sponsored polls describe rigorous post-stratification in the registration text, but the actual correction does not neutralise the quota choices that move the candidate's share. This would explain why describing weighting more (sponsored) coexists with bias of +7 pp (sponsored). Testing this requires a downstream "effectiveness" measurement we don't have — but the pattern is consistent with a story in which the visible methodology language is a credibility signal aimed at TSE / consumers, while the substantive bias is downstream.

(R2) The mechanism is not in the ponderação text at all. The 27 % vs 17 % vs 11 % described_differs gradient is real but small; the r = +0.13 correlation is significant but explains < 2 % of variance in |contrast|. The +7 pp effect requires explanation; ponderação description differences are not large enough to be the dominant lever. The reading is then: ponderação is correlated with the bias but is not the mechanism.

Combined mechanism inventory (post-AN-057)

| Lever | Status | Direction | Evidence |
|---|---|---|---|
| Bairro partisan composition | Tested | Reversed sign | AN-032 |
| Coverage class flat | Tested | Null/noisy positive | AN-019 |
| Coverage × candidate-base | Tested | Null | AN-055 |
| Coverage deferral | Tested | Wrong-signed | AN-024 |
| Audit pct | Tested | Heavy overlap | AN-021 |
| Methodology completeness | Tested | Wrong-signed | AN-022 |
| Interviewer training | Tested | Wrong-signed (selective disclosure) | AN-042 |
| Mode (phone/in-person) | Tested | Wrong-signed | AN-041 |
| Nonresponse handling | Tested | Null-by-data-design | AN-043 |
| Income / age / education quota deltas | Tested | Null | AN-056 |
| Population reference frame | Tested | Weakly directional (p=0.12) | AN-056 |
| Weighting described differs | Tested | Positive (sp over-doc, monotone gradient) | AN-057 |
| Post-stratification explicit differs | Tested | Positive (r=0.13 with \|contrast\|, p=0.04) | AN-057 |
| Name / scenario rotation | Tested | Positive (sp under-doc 5×, p ≈ 4×10⁻⁸) | AN-051 |

The structural picture that emerges:

R1 vs R2 effectiveness probe (added 2026-06-14)

Restricted AN-056's quota-delta × |contrast| correlations to subsamples defined by the post_stratification_explicit flag from AN-057:

| metric | r (BOTH claim, n=76) | r (NEITHER claims, n=46) | Δr (BOTH − NEITHER) | Fisher z-test p |
|---|---:|---:|---:|---:|
| \|Δ income\| × \|contrast\| | −0.07 | +0.14 | −0.21 | 0.26 |
| \|Δ age\| × \|contrast\| | +0.12 | −0.10 | +0.22 | 0.33 |
| \|Δ % superior ed\| × \|contrast\| | +0.11 | −0.15 | +0.26 | 0.17 |

The income row matches the "claimed correction works" prediction, which is the opposite of what R1 (claimed-but-ineffective) predicts: when both sides claim post-stratification, larger quota deltas don't correlate with larger bias; when neither claims, they do, weakly. The direction of the difference fits that prediction, but Δr = −0.21 is not statistically distinguishable from zero (z = −1.12, p = 0.26) at this sample size.
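The Fisher z-test for two independent correlations can be sketched as below, using the rounded income-row values from the table. Because the inputs are rounded to two decimals, the result lands near, not exactly on, the reported z = −1.12 and p = 0.26. The function name is illustrative.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Fisher z-test for the difference between two independent Pearson correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = 2 * NormalDist().cdf(-abs(z))   # two-sided p-value
    return z, p


# Income row: BOTH claim (r = -0.07, n = 76) vs NEITHER claims (r = +0.14, n = 46).
z_income, p_income = compare_correlations(-0.07, 76, 0.14, 46)
print(f"z = {z_income:.2f}, p = {p_income:.2f}")
```

With n = 46 in the smaller group the standard error of the difference is about 0.19, so only a Δr of roughly 0.38 or more would reach p < 0.05.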

Age and education go the other way — BOTH has higher correlation than NEITHER. Could be noise (n < 50 in NEITHER); could be that post-stratification on income is more effective than on age / education (plausible: income is the noisiest variable and the most-corrected).

|Δ income| means within the high_bias stratum: BOTH-claim 1.42 SM vs NEITHER 1.03 SM. Polls that claim post-stratification have larger quota deltas on average. Consistent with "polls tolerate unusual quotas precisely because they expect to correct them"; whether the correction actually neutralises the slant is the still-open question. Sources: source/analysis/an-058-postst-effectiveness.py, build/table/an-058-postst-effectiveness.csv.

Resolution path: universe-scale poll_weighting extraction (≈ $3 at gpt-4o-mini Batch rates on 14k protocols) would push this test from chi-square-on-244 / z-test-on-122 to a regression over the 14k-protocol universe, which is the natural next step. The caveat is the one noted in the "Strategic context" section of docs/todo.md (added 2026-06-14): the structural search has been asymptoting at modest signals, and scaling is not guaranteed to surface the missing mechanism behind the +7 pp headline.

Follow-ups

  1. Effectiveness test follow-through. AN-058 above is the v1 of this test on the 244-pair set. Universe scale would resolve the inconclusive direction.
  2. Universe-scale weighting extraction (low risk, modest cost). The cancelled sampling batch unblocks this naturally: poll_weighting now exists; running it on the 14k universe would take ~$3 at gpt-4o-mini Batch rates and would convert AN-057's borderline findings (r = +0.13 with p = 0.042; chi-square p = 0.044) into a regression with far more power (roughly 40× as many protocols, 14k vs 338). Worth submitting next time we revisit the cancelled batches.
  3. Re-read the LLM-judge brief in light of AN-057's selective-disclosure direction. The brief's high-plausibility weighting hypotheses (ponderação por renda, cotas de renda) implicitly assume sponsored polls under-describe weighting to hide bias. AN-057 shows the opposite — sponsors over-describe weighting. The LLM-judge's 87.5 % agreement may therefore be picking up whichever side describes more, not whichever side is biased. This is testable: re-run the blinded brief's side_favoured predictions against the data showing sponsored polls more often have the more-described methodology, and see if the LLM was tracking the disclosure side rather than the bias side.