id: an-057
hypothesis: methodology-flexibility-a
headline: "Within-pair ponderação description does differ systematically between sponsored and indep polls — but in the AN-042 selective-disclosure direction, not the simple Channel A 'sponsors hide methodology' direction. On 244 curated sponsored × independent pairs (338/338 protocols extracted via the new poll_weighting LLM extractor at gpt-4o-mini): (i) the binary flag described_differs is 27.3 % in high_bias vs 16.9 % in well_behaved vs 11.1 % in understating (chi-square p = 0.044, monotone), (ii) postst_differs (post_stratification_explicit binary differs between sides) correlates with |contrast| at r = +0.13 (p = 0.042), (iii) directional asymmetry: when post_stratification_explicit differs between sides, the indep side is MORE LIKELY to omit it (68 / 114 = 59.6 %, binomial p = 0.049). Reading: sponsored polls describe weighting AND post-stratification with greater specificity than the matched indep comparator. This matches AN-042's reframing — sponsored polls under-document sample-shape but over-document visible-rigor dimensions. The pattern is real but does not directly explain the +7 pp bias; the simplest reconciliation is that the claimed corrections do not actually neutralise the quota choices (claimed-but-ineffective ponderação), but identifying that would require a downstream effectiveness test we don't yet have."
type: descriptive
status: interpreted
status_date: 2026-06-14
confidence: yellow
created: 2026-06-14
script: source/analysis/an-057-weighting-by-bias.py
target: build/table/an-057-weighting-by-bias.csv
design:
sample: 244 curated sponsored × independent pairs (110 high_bias, 89 well_behaved, 45 understating) from build/llm/curated_pairs/pairs.parquet. Backing extraction: pipelines/politica/source/llm/poll_weighting.py (new, schema_name="poll_weighting", schema_version="v1") ran synchronously on all 338 pair protocols 2026-06-14, gpt-4o-mini, 101s wall, 337 fresh calls + 1 cached. Per-protocol marginals: weighting described in 290/338 (85.8 %); application conditional in 231 (68.3 %); target tse_eleitorado 111 / ibge_2022 82 / ibge_2010 70 / not_specified 63; post_stratification_explicit 172/338 (50.9 %).
specification: "Per pair, compute five within-pair *_differs binary indicators (described_differs, application_differs, target_differs, postst_differs, vars_set_differs) plus the directional flags sp_uniquely_no_postst and ip_uniquely_no_postst. Tests: (a) cross-stratum chi-square of each binary × stratum, (b) Pearson r of each binary vs |contrast|, (c) two-sided exact binomial test of directional asymmetries (sp_uniquely_no_X vs ip_uniquely_no_X)."
comparator: indep-media poll on same muni × same candidate × within 14 days
cluster: pair
AN-057: Within-pair ponderação description × bias contrast
Question
After AN-055 (coverage × candidate-base, null) and AN-056 (quota-distribution deltas, null), the only remaining theme the blinded LLM-judge brief identified is the weighting / ponderação description. PollSampling captures quota design but not whether or how a post-fielding correction normalises the realised sample back to a population reference. This analysis fills that gap with a new extractor and tests within-pair ponderação description differences against the observed bias contrast.
The expectation under the simple "sponsors hide methodology" Channel A story: sponsored polls should describe weighting LESS often and less explicitly than the matched indep poll, with the disparity larger in high-bias pairs. AN-042 already complicated this by finding the opposite pattern on interviewer-training and supervisor-role descriptions. AN-057 tests whether weighting follows the AN-042 "selective disclosure" pattern or breaks it.
Design
source/analysis/an-057-weighting-by-bias.py:
- Load build/llm/curated_pairs/pairs.parquet (244 pairs, 338 protocols).
- Load pipelines/politica/build/llm/poll_weighting/*.json: 338 extractions cached from the synchronous pilot (source/llm/curated_pairs_weighting.py, gpt-4o-mini, 101s wall).
- Per pair, compute five within-pair *_differs binaries: described_differs, application_differs, target_differs, postst_differs (post_stratification_explicit), vars_set_differs (Jaccard ≠ 1 on variables_weighted set), plus the directional flags sp_uniquely_no_postst / ip_uniquely_no_postst.
- Tests:
  - Chi-square of each binary × stratum (high_bias / well_behaved / understating)
  - Pearson r of each binary vs |contrast|
  - Two-sided binomial test of sp_uniquely_no_postst vs ip_uniquely_no_postst (directional asymmetry).
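The per-pair indicator step can be sketched as follows. This is a minimal illustration, not the code in an-057-weighting-by-bias.py: the record keys `described`, `application` and `target` are assumed names for the extractor fields (only `post_stratification_explicit` and `variables_weighted` appear in this note), and treating two empty variable sets as identical is also an assumption.

```python
def pair_flags(sp: dict, ip: dict) -> dict:
    """Within-pair indicators for one sponsored (sp) x independent (ip) pair.

    Key names other than post_stratification_explicit and
    variables_weighted are hypothetical.
    """
    sp_vars = set(sp["variables_weighted"])
    ip_vars = set(ip["variables_weighted"])
    union = sp_vars | ip_vars
    # Jaccard similarity; two empty sets count as identical (assumption).
    jaccard = len(sp_vars & ip_vars) / len(union) if union else 1.0

    sp_post = bool(sp["post_stratification_explicit"])
    ip_post = bool(ip["post_stratification_explicit"])

    return {
        "described_differs": sp["described"] != ip["described"],
        "application_differs": sp["application"] != ip["application"],
        "target_differs": sp["target"] != ip["target"],
        "postst_differs": sp_post != ip_post,
        "vars_set_differs": jaccard != 1.0,
        # Directional flags: exactly one side lacks explicit post-stratification.
        "sp_uniquely_no_postst": (not sp_post) and ip_post,
        "ip_uniquely_no_postst": sp_post and (not ip_post),
    }


# Illustrative records only; these values are made up for the example.
sp = {"described": True, "application": "conditional", "target": "tse_eleitorado",
      "post_stratification_explicit": True, "variables_weighted": ["sexo", "idade"]}
ip = {"described": True, "application": "conditional", "target": "ibge_2022",
      "post_stratification_explicit": False,
      "variables_weighted": ["sexo", "idade", "renda"]}
print(pair_flags(sp, ip))
```

By construction postst_differs is the sum of the two directional flags, which is why the directional rows in the results add up to the postst_differs row within each stratum.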
Results
Within-pair 'differs' rates by stratum
| metric | high_bias (n=110) | well_behaved (n=89) | understating (n=45) | chi-square p |
|---|---|---|---|---|
| described_differs | 27.3 % | 16.9 % | 11.1 % | 0.044 ✓ |
| application_differs | 43.6 % | 37.1 % | 28.9 % | 0.217 |
| target_differs | 69.1 % | 76.4 % | 68.9 % | 0.469 |
| postst_differs | 50.9 % | 42.7 % | 44.4 % | 0.485 |
| vars_set_differs | 74.5 % | 74.2 % | 73.3 % | 0.988 |
| sp_uniquely_no_postst | 17.3 % | 24.7 % | 11.1 % | — |
| ip_uniquely_no_postst | 33.6 % | 18.0 % | 33.3 % | — |
| sp_uniquely_no_described | 10.0 % | 9.0 % | 2.2 % | — |
described_differs shows the cleanest monotone gradient — high-bias pairs are 2.5× more likely than understating pairs to have one side describe weighting while the other does not.
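As a rough check on the described_differs row, the chi-square can be recomputed from counts. The counts 30/110, 15/89 and 5/45 are back-computed from the rounded percentages in the table (an assumption, although they sum to the 50 asymmetric pairs reported in the directional table). The p-value uses the closed-form chi-square survival function for 2 degrees of freedom, so no statistics library is needed.

```python
import math


def chi_square_2xk(hits, totals):
    """Pearson chi-square for a 2 x k table (flag present / absent by stratum)."""
    n = sum(totals)
    p_hit = sum(hits) / n
    stat = 0.0
    for h, t in zip(hits, totals):
        # Each stratum contributes two cells: flag present and flag absent.
        for observed, expected in ((h, t * p_hit), (t - h, t * (1 - p_hit))):
            stat += (observed - expected) ** 2 / expected
    return stat, len(totals) - 1


def chi2_sf_df2(stat):
    """Survival function of chi-square with exactly 2 degrees of freedom."""
    return math.exp(-stat / 2)


# Counts reconstructed from 27.3 %, 16.9 %, 11.1 % (assumption).
hits = [30, 15, 5]        # described_differs = 1
totals = [110, 89, 45]    # high_bias, well_behaved, understating

stat, df = chi_square_2xk(hits, totals)
print(f"chi2 = {stat:.2f}, df = {df}, p = {chi2_sf_df2(stat):.3f}")  # chi2 = 6.26, df = 2, p = 0.044
print(f"high_bias / understating rate ratio = {(30 / 110) / (5 / 45):.2f}")
```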
Correlation with |contrast|
| metric | r | 95 % CI | p |
|---|---|---|---|
| postst_differs | +0.130 | [+0.00, +0.25] | 0.042 ✓ |
| application_differs | +0.109 | [−0.02, +0.23] | 0.090 |
| described_differs | +0.098 | [−0.03, +0.22] | 0.126 |
| vars_set_differs | +0.064 | [−0.06, +0.19] | 0.320 |
| target_differs | +0.019 | [−0.11, +0.14] | 0.773 |
| sp_uniquely_no_postst | −0.024 | [−0.15, +0.10] | 0.704 |
| sp_uniquely_no_described | +0.020 | [−0.11, +0.15] | 0.756 |
postst_differs carries a small but significant correlation: pairs where the two polls disagree about whether post-stratification is explicit are pairs with larger absolute bias.
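To make the size of this correlation concrete, the interval and the share of variance can be reproduced from r and n alone. The sketch assumes the table's 95 % CI was obtained with the Fisher z-transform on all 244 pairs; the note does not say which method the script uses.

```python
import math


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate 95 % CI for a Pearson r via the Fisher z-transform."""
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


r, n = 0.130, 244  # postst_differs vs |contrast|
lo, hi = fisher_ci(r, n)
print(f"95% CI = [{lo:+.2f}, {hi:+.2f}]")   # rounds to [+0.00, +0.25]
print(f"variance explained = {r * r:.4f}")  # r squared, under 2 %
```

The lower bound sits only just above zero, so the p = 0.042 result is best read as marginal.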
Directional asymmetry — opposite of the naïve prediction
| field | sponsored uniquely missing | indep uniquely missing | total asymmetric | binomial p (50/50) |
|---|---|---|---|---|
| post_stratification_explicit | 46 | 68 | 114 | 0.049 ✓ |
| described | 20 | 30 | 50 | 0.203 |
When the two sides disagree, the indep side is the one more likely to omit explicit post-stratification language. Sponsored polls describe post-stratification MORE specifically than the comparator. The 46-vs-68 split is distinguishable from 50/50 at p = 0.049, just under the 0.05 threshold.
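The asymmetry test can be reproduced from the two counts with an exact two-sided binomial test against 50/50. This is a minimal standard-library sketch; the analysis script may well use a library routine instead.

```python
from math import comb


def binom_two_sided(k, n):
    """Exact two-sided binomial p-value for k successes in n trials, H0: p = 0.5.

    Under p = 0.5 the distribution is symmetric, so the p-value is the total
    probability of outcomes at least as far from n/2 as the observed k.
    """
    dist = abs(k - n / 2)
    tail = sum(comb(n, i) for i in range(n + 1) if abs(i - n / 2) >= dist)
    return tail / 2 ** n


# Sponsored-uniquely-missing counts out of all asymmetric pairs.
print(f"post_stratification_explicit: p = {binom_two_sided(46, 114):.3f}")
print(f"described:                    p = {binom_two_sided(20, 50):.3f}")
```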
Interpretation
What this confirms
Three signals reach p < 0.05:
- described_differs × stratum: monotone gradient (high_bias > well_behaved > understating).
- postst_differs × |contrast|: positive correlation.
- Directional asymmetry: sponsored polls describe post-stratification more often than indep polls.
These are real patterns. The first two together imply that something about the ponderação description scales with the within-pair bias — pairs where the bias is large are more likely to have one side describing weighting and one side not, or one explicit about post-stratification and the other not.
What's surprising — the direction
The directional pattern is opposite the naïve "sponsors hide methodology" prediction. Sponsored polls over-document post-stratification (46 vs 68, p = 0.049) and, less clearly, ponderação itself (20 vs 30, same direction but p = 0.203), just as AN-042 found for interviewer training and supervisor role. The selective-disclosure reframing now extends to weighting language.
Two readings of the direction:
(R1) Claimed-but-ineffective post-stratification. Sponsored polls describe rigorous post-stratification in the registration text, but the actual correction does not neutralise the quota choices that move the candidate's share. This would explain why describing weighting more (sponsored) coexists with bias of +7 pp (sponsored). Testing this requires a downstream "effectiveness" measurement we don't have — but the pattern is consistent with a story in which the visible methodology language is a credibility signal aimed at TSE / consumers, while the substantive bias is downstream.
(R2) The mechanism is not in the ponderação text at all. The 27 %
vs 17 % vs 11 % described_differs gradient is real but small; the
r = +0.13 correlation is significant but explains < 2 % of variance
in |contrast|. The +7 pp effect requires explanation; ponderação
description differences are not large enough to be the dominant
lever. The reading is then: ponderação is correlated with the
bias but is not the mechanism.
Combined mechanism inventory (post-AN-057)
| Lever | Status | Direction | Evidence |
|---|---|---|---|
| Bairro partisan composition | Tested | Reversed sign | AN-032 |
| Coverage class flat | Tested | Null/noisy positive | AN-019 |
| Coverage × candidate-base | Tested | Null | AN-055 |
| Coverage deferral | Tested | Wrong-signed | AN-024 |
| Audit pct | Tested | Heavy overlap | AN-021 |
| Methodology completeness | Tested | Wrong-signed | AN-022 |
| Interviewer training | Tested | Wrong-signed (selective disclosure) | AN-042 |
| Mode (phone/in-person) | Tested | Wrong-signed | AN-041 |
| Nonresponse handling | Tested | Null-by-data-design | AN-043 |
| Income / age / education quota deltas | Tested | Null | AN-056 |
| Population reference frame | Tested | Weakly directional (p=0.12) | AN-056 |
| Weighting described differs | Tested | Positive (sp over-doc, monotone gradient) | AN-057 |
| Post-stratification explicit differs | Tested | Positive (r=0.13 with \|contrast\|, p=0.04) | AN-057 |
| Name / scenario rotation | Tested | Positive (sp under-doc 5×, p ≈ 4×10⁻⁸) | AN-051 |
The structural picture that emerges:
- Almost everything tested at the design-substantive level is null or wrong-signed.
- Two design-substantive levers carry the small directional Channel A pattern: population reference frame (AN-056, p=0.12) and post-stratification description (AN-057, p=0.04).
- One disclosure-quantity lever carries a strong directional signal: scenario rotation (AN-051, sp under-doc 5×, p ≈ 4 × 10⁻⁸).
- Sponsored polls over-document visible-rigor dimensions (interviewer training, weighting, post-stratification) — AN-042 + AN-057 selective-disclosure pattern.
R1 vs R2 effectiveness probe (added 2026-06-14)
Restricted AN-056's quota-delta × |contrast| correlations to subsamples defined by the post_stratification_explicit flag from AN-057:
| metric | r (BOTH claim, n=76) | r (NEITHER claims, n=46) | Δr (BOTH − NEITHER) | Fisher z-test p |
|---|---:|---:|---:|---:|
| \|Δ income\| × \|contrast\| | −0.07 | +0.14 | −0.21 | 0.26 |
| \|Δ age\| × \|contrast\| | +0.12 | −0.10 | +0.22 | 0.33 |
| \|Δ % superior ed\| × \|contrast\| | +0.11 | −0.15 | +0.26 | 0.17 |
The income row matches the "claimed correction works" prediction, which is the opposite of R1 (claimed-but-ineffective): when both sides claim post-stratification, larger quota deltas don't correlate with larger bias; when neither claims, they do. The direction of the difference fits an effective correction, but Δr = −0.21 is not statistically distinguishable from zero (z = −1.12, p = 0.26) at this sample size.
Age and education go the other way — BOTH has higher correlation than NEITHER. Could be noise (n < 50 in NEITHER); could be that post-stratification on income is more effective than on age / education (plausible: income is the noisiest variable and the most-corrected).
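For reference, the comparison of two independent correlations used in the probe can be sketched with the Fisher z-transform. The inputs below are the rounded income-row values from the table, so the result (z close to −1.10) differs slightly from the reported z = −1.12, presumably because the script works from unrounded correlations.

```python
import math


def fisher_z_diff(r1, n1, r2, n2):
    """Two-sided z-test for the difference between two independent Pearson r's."""
    z = (math.atanh(r1) - math.atanh(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# |delta income| x |contrast|: BOTH claim (n=76) vs NEITHER claims (n=46).
z, p = fisher_z_diff(-0.07, 76, 0.14, 46)
print(f"z = {z:.2f}, p = {p:.2f}")  # z = -1.10, p = 0.27
```

With subsamples this small the standard error of the difference is about 0.19 on the z scale, so only a Δr of roughly 0.37 or more would clear the 0.05 threshold.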
|Δ income| means within the high_bias stratum: BOTH-claim 1.42 SM
(salários mínimos, i.e. minimum wages) vs NEITHER 1.03 SM. Polls that
claim post-stratification have
larger quota deltas on average. Consistent with "polls tolerate
unusual quotas precisely because they expect to correct them"; whether
the correction actually neutralises the slant is the still-open
question. Sources: source/analysis/an-058-postst-effectiveness.py,
build/table/an-058-postst-effectiveness.csv.
Resolution path: universe-scale poll_weighting extraction (≈ $3
at gpt-4o-mini Batch rates on 14k protocols) would push this test
from chi-square-on-244 / z-test-on-122 to a regression on 14k pairs,
which is the natural next step — but with the caveat noted in the
"Strategic context" section of docs/todo.md (added 2026-06-14):
the structural search has been asymptoting at modest signals, and
scaling is not guaranteed to surface the missing mechanism behind
the +7 pp headline.
Follow-ups
- Effectiveness test follow-through. AN-058 above is the v1 of this test on the 244-pair set. Universe scale would resolve the inconclusive direction.
- Universe-scale weighting extraction (low risk, modest cost). The cancelled sampling batch unblocks this naturally: poll_weighting now exists; running it on the 14k universe would take ~$3 at gpt-4o-mini Batch rates and would turn AN-057's r = +0.13 (p = 0.042) and chi-square p = 0.044 findings into a regression on roughly 40× as many protocols (14k vs 338). Worth submitting next time we revisit the cancelled batches.
- Re-read the LLM-judge brief in light of AN-057's selective-disclosure direction. The brief's high-plausibility weighting hypotheses (ponderação por renda, cotas de renda) implicitly assume sponsored polls under-describe weighting to hide bias. AN-057 shows the opposite: sponsors over-describe weighting. The LLM-judge's 87.5 % agreement may therefore be picking up whichever side describes more, not whichever side is biased. This is testable: re-run the blinded brief's side_favoured predictions against the data showing sponsored polls more often have the more-described methodology, and see if the LLM was tracking the disclosure side rather than the bias side.