Per-poll detectability and blind-audit auditability of the +7 pp bias

🟢 The +7 pp bias is visible per-poll to an observer with sponsor information and an unbiased benchmark (AN-108: z ≈ 2.8 on the median sponsored poll against binomial SE) but buried for ~96 % of sponsored polls against the realized empirical cross-firm noise floor (AN-110: pooled DEFF* = 12.59 implies z ≈ 0.79). A calibrated blind audit nonetheless catches sponsored polls at 12.3 % vs 4.8 % on independent controls — a 2.6× lift — at protocol-level Bonferroni (AN-109 v2).

The detectability picture has three load-bearing comparators, each relevant to a different observer:

Binomial SE on the sponsor's-candidate share (AN-108). On the median sponsored poll, n = 360, p = 54.8 %, binomial SE = 2.49 pp. A +7 pp shift is z ≈ 2.8, and 99.3 % of sponsored polls have binomial SE < 3.5 pp. Even with DEFF = 2 the median z stays at 2.0. The bias is statistically loud against any benchmark that approximates the true share at the sponsored poll's sample size.
Empirical cross-firm SD (AN-110). Pooled empirical DEFF* across 673 race × week × candidate cells with ≥ 3 independent polls is 12.59 (median cell 3.93, P75 9.59); realized cross-poll SD on a candidate's share is ≈ 5.0 pp. Under the empirical noise floor a +7 pp single-poll shift is ≈ 0.8 SDs — within the typical cross-firm fluctuation. Only 4.3 % of sponsor's-candidate rows are tail-extreme individually at empirical-DEFF Bonferroni.
Calibrated blind multi-candidate audit (AN-109 v2). With empirical-DEFF-inflated SE + Bonferroni keyed to n_candidates_in_race + protocol-level "at least one share extreme" rollup, the test is calibrated (independent-protocol FP = 4.8 % at Bonferroni, 5.1 % at Holm, ≈ nominal 5 %). Sponsored protocols catch at 12.3 % — a 2.6× lift. The bias has tail concentration that a calibrated regulator-facing instrument identifies even though the median sponsored poll's share is buried in the noise floor.

Reputation implication. A sponsor's own pollster running internal sanity checks against a tight benchmark (an unbiased prior, the true share at fielding, careful QA) would see the bias as statistically extreme. A pollster benchmarking against published competitor polls (the realized cross-firm distribution) would not. Reputation pressure is therefore conditional on firm-level QA practice and on whether firms hold an internal unbiased estimate of the true share that the published number is allowed to drift from.

Policy implication. A TSE / journalist blind audit using (a) empirical-DEFF-calibrated SE from the registered-polls universe, (b) Bonferroni / Holm correction per protocol's k, and (c) a protocol-level "any share extreme" rollup yields a 2.6× lift on sponsored vs independent polls. This is operationally deployable from existing public data. The instrument would be calibrated (≈ 5 % FP), target the tail of biased polls (≈ 12 % catch rate on sponsored), and support an aggregate excess-detection-rate audit by sponsorship class rather than only per-poll enforcement.

Scope: this finding is silent on the channel decomposition. Whether the +7 pp comes from lawful design tilt (Channel A) or from fabrication (Channel B) is not addressed here — a design tilt and a fabrication of the same magnitude have identical detectability properties. The channel question is tracked separately in channel-decomposition-open.

Caveats.

AN-109's analyzable subset is 114 of 450 sponsored protocols (25 % carry-through) — those with a same-candidate × race-week independent comparator. The carried-through subset is plausibly higher-profile races; the 2.6× lift is a lower bound for that subset and does not directly speak to the 75 % of sponsored polls without within-cell comparators.
AN-110's empirical DEFF* absorbs (a) true sampling DEFF, (b) within-week share drift, (c) firm-mode-methodology heterogeneity, and (d) firm-level systematic bias variance. It is an upper bound on the pure sampling DEFF and the right number for an outside- observer noise floor.
AN-108's "per-poll loud" call assumes single-hypothesis testing on the known sponsor. Outside-auditor framing without sponsor information is the AN-109 question.

Sources.

Own analysis: AN-108 (build/table/an-108-sampling-se-detectability.csv; source/analysis/an-108-sampling-se-detectability.py); AN-109 (build/table/an-109-per-poll-z-blind-audit.csv; source/analysis/an-109-per-poll-z-blind-audit.py); AN-110 (build/table/an-110-empirical-deff.csv; source/analysis/an-110-empirical-deff.py)
Reports: none direct
News anchors: none direct
Cross-refs: headline-sponsor-bias (aggregate β is the underlying empirical fact whose per-poll detectability is studied here); pollster-heterogeneity (per-firm β heterogeneity is a contributor to the empirical DEFF* upper bound); channel-decomposition-open (mechanism — separate question)

Cited analyses

AN-108 green descriptive

Binomial SE on the sponsor candidate's share is ≤ 3.5 pp on 99.3 % of sponsored polls (median 2.49 pp at median n=360); +7 pp is z ≈ 2.8 against binomial SE on the median sponsored poll. Per-poll bias is statistically visible against a tight (binomial / unbiased) benchmark — informs the reputation / auditability story, not the channel decomposition. AN-110 walks this back against the empirical noise floor.

AN-110 green descriptive

Pooled empirical DEFF* = 12.59 across 673 race × week × candidate cells with ≥3 independent polls (median cell DEFF* = 3.93, P75 = 9.59); realized cross-poll SD on a candidate's share is ≈ 5.0 pp, well above the binomial + DEFF≤3 model. Explains AN-109's miscalibration and qualifies AN-108's per-poll-loud reading: against the empirical cross-firm noise floor (not the binomial benchmark), a +7 pp single-poll shift is ≈ 0.8 SDs — within the noise floor. Aggregate regression β survives because N=568 averages the noise out. Informative about reputation / auditability comparators, not about channel decomposition.

AN-109 green descriptive

**v2 (2026-06-19; supersedes):** calibrated at AN-110 pooled empirical DEFF\\* = 12.59, the blind audit reaches nominal FP (independent protocols 4.8 % at Bonferroni vs nominal 5 %); sponsored excess detection persists at **+7.5 pp Bonferroni / +11.9 pp single-hypothesis** — a calibrated regulator-facing blind audit distinguishes sponsored from independent at a 2.6× lift. The bias has tail concentration: only 4.3 % of sponsor's-candidate rows fail at the empirical DEFF (per-poll bias is buried for 95.7 %), but 12.3 % of sponsored protocols catch via 'at least one share extreme'. **v1 (binomial + DEFF ≤ 3):** the test was uncalibrated (FP 30–60 %); sponsored excess was +13 to +19 pp but the gap inflated by miscalibration.