Three forensic tests targeting the AN-013 blind spots (design-respecting fabrication) return **two clean nulls and one ambiguous-cause positive**.

- (T1) Standardised-error variance test: var(z|sponsored, demeaned) = 79.7 vs var(z|indep, demeaned) = 315.3. The F-test ratio is 0.253 (p < 0.0001), but Levene's median-centred p = 0.122, so the F-test result is outlier-driven. Sponsored polls are NOT 'too clean': a variance of 80 σ² means SD ≈ 9 σ around the bias mean, plenty of spread.
- (T2) Bias-concentration test: 11.5 % of sponsored polls fall within ±2 pp of their own group mean vs 19.0 % for indep (z = −4.75, p < 0.0001, in the *anti*-fabrication direction: sponsored polls are MORE spread, not less).
- (T3) Within-firm rounding TVD on the tenths digit of poll_percent_raw: 12 firms qualify with ≥10 sponsored and ≥10 indep rows each; mean TVD = 0.39 (vs ~0.15-0.20 expected under H0); 3 of 12 firms are significant at chi-square p < 0.05 (vs 0.6 expected).

Read: T1+T2 argue *against* simple sample-design-consistent fabrication as the headline mechanism, because the sponsored-error distribution is wider, not tighter, than the indep distribution. T3 picks up real within-firm processing differences (could be differential subcontracting, customer-specific reporting templates, or fabrication; the cause is not separable from this test). Combined with AN-013 v1 (no per-row digit-tampering signature), the cumulative weight of evidence is that **the +7 pp is not concentrated in a single big fabrication lever**. The residual likely lives in a constellation of small effects across sample-frame contamination (1-4 pp prior), interviewer scripting (0-2 pp), and strategic timing (1-2 pp), each individually modest.
Question
AN-013 ruled out crude per-row tampering via digit-frequency tests (uniform last-digit, Benford leading-digit, round-number frequency within sponsored polls). AN-013 explicitly listed three blind spots: sophisticated manipulation preserving digit distributions, proportional within-poll rescaling, and pre-publication data work that leaves no digit signature.
The residual decomposition added to docs/thinking.md on 2026-06-14
flagged "sample-design-consistent fabrication" (prior magnitude 2-5
pp) as the largest remaining untested mechanism after AN-059 zeroed
out firm-level selection. AN-013 v2 designs three tests sensitive to
design-respecting fabrication, which AN-013 v1 could not detect.
Tests
source/analysis/an-013v2-fabrication-forensics.py:
T1 — Standardised-error variance ("too clean" test). Under honest sampling at declared n with true share p, SE(error in pp) = 100×√(p(1-p)/n). Standardise: z = error / SE_expected. Under H0 (honest sampling + systematic bias mean) the demeaned z should have variance ≈ 1. Under design-respecting fabrication the manipulator chooses biases without adding sampling noise, so var(z | sponsored, demeaned) should fall below var(z | indep, demeaned). Tests: F-test of variance ratio (sensitive to outliers) and Levene's median-centred test (robust).
T2 — Bias concentration test. Fraction of polls within ±2 pp of own-group mean error. Under honest slant + sampling noise the sponsored-error distribution should be wide around the bias mean. Under fabrication the sponsored errors cluster tightly around the chosen bias. Two-sample binomial-proportion z-test on the difference.
T3 — Within-firm rounding-pattern shift. Tenths-digit of
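poll_percent_raw, i.e. the first digit after the decimal point (10 × value, taken modulo 10). For each firm with ≥10 sponsored AND ≥10 indep rows, compare the digit distribution across the two sponsorship types. Per-firm chi-square test; aggregate via the mean and median Total Variation Distance, TVD = ½ Σ_d |p_d − q_d| over the ten digits d, where p and q are the sponsored and indep digit shares.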
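The T3 quantities can be illustrated with a minimal sketch. It is not the code in an-013v2-fabrication-forensics.py: the function names are hypothetical, it assumes poll_percent_raw is reported to one decimal place, and it computes only the Pearson chi-square statistic and its degrees of freedom (the p-value would come from a chi-square reference distribution).

```python
from collections import Counter


def tenths_digit(value):
    """First digit after the decimal point, e.g. 37.4 -> 4.

    Rounds to one decimal first so float noise (373.9999...) cannot shift the digit.
    """
    return int(round(abs(value) * 10)) % 10


def digit_distribution(values):
    """Share of each tenths digit 0..9 in a list of raw percentages."""
    counts = Counter(tenths_digit(v) for v in values)
    return [counts[d] / len(values) for d in range(10)]


def tvd(p, q):
    """Total Variation Distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def chi_square_stat(values_a, values_b):
    """Pearson chi-square for the 2 x (digits) table; returns (statistic, df).

    Digits that appear in neither group are dropped, which reduces df.
    """
    ca = Counter(tenths_digit(v) for v in values_a)
    cb = Counter(tenths_digit(v) for v in values_b)
    na, nb = len(values_a), len(values_b)
    stat, used = 0.0, 0
    for d in range(10):
        col = ca[d] + cb[d]
        if col == 0:
            continue
        used += 1
        for obs, n_row in ((ca[d], na), (cb[d], nb)):
            expected = n_row * col / (na + nb)
            stat += (obs - expected) ** 2 / expected
    return stat, used - 1


# Hypothetical rows for one firm (not real data)
sponsored_raw = [41.0, 38.0, 22.0, 35.5, 40.0, 19.0]
indep_raw = [39.4, 37.2, 21.7, 36.1, 40.3, 18.8]
print(tvd(digit_distribution(sponsored_raw), digit_distribution(indep_raw)))
print(chi_square_stat(sponsored_raw, indep_raw))
```

With only ≥10 rows per group spread over ten digit cells, expected counts are small, so the chi-square approximation is rough at the qualifying threshold.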
Results
| Test | Direction predicted under fabrication | Observed | Reading |
|---|---|---|---|
| T1 — F-test on demeaned z variances | ratio < 1 | ratio = 0.253, F p < 0.0001 | Driven by indep outliers; Levene p = 0.12 not sig |
| T1 — Levene's median-centred | indep − sponsored > 0 | p = 0.122 | Variances not significantly different |
| T2 — frac within ±2pp of own mean | sponsored > indep | sponsored 11.5 % < indep 19.0 % | z = −4.75, p < 0.0001 anti-fabrication |
| T3 — within-firm rounding TVD | TVD elevated | mean TVD = 0.39 (vs ~0.15-0.20 baseline); 3/12 firms chi-square sig | Real within-firm processing differential; cause ambiguous |
T1 detail — sponsored polls are not "too clean"
Sponsored z mean = +5.87 (≈ +7 pp bias ÷ ~1.2 pp expected SE; for scale, 1.2 pp is the SE at roughly n = 1,700, p = 0.4, whereas n = 500, p = 0.4 gives 100×√(0.4×0.6/500) ≈ 2.2 pp), so the bias is huge relative to sampling noise. But var(z | sponsored, demeaned) = 79.68, i.e. SD ≈ 8.9 σ around the bias mean. The sponsored-error distribution is wide, not tight. A classical fabrication that targets a fixed +7 pp would produce var(z) ≈ 1 (just sampling noise on top of a fixed shift). The observed 80 rules that out.
The F-test's significant ratio (0.253) is misleading: indep has var(z) = 315 — dominated by outliers from small candidates with near-zero true share and small expected SE. Levene's median-centred test (robust to outliers) shows no significant variance difference.
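The T1 computation (standardise, demean, then compare variances two ways) can be sketched in pure Python. This is an illustration under assumptions, not the analysis script: the function names and toy rows are hypothetical, and only the test statistics are computed. The p-values would come from F reference distributions; if SciPy is available, `scipy.stats.levene(a, b, center='median')` returns the same median-centred statistic together with its p-value.

```python
import math
from statistics import mean, median, variance


def expected_se_pp(p, n):
    """Binomial sampling SE of a reported share, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def demeaned_z(rows):
    """rows: (error_pp, true_share, n) tuples for one sponsorship group."""
    z = [err / expected_se_pp(p, n) for err, p, n in rows]
    centre = mean(z)
    return [v - centre for v in z]


def variance_ratio(z_a, z_b):
    """Sample-variance ratio var(z_a) / var(z_b), the F-test statistic."""
    return variance(z_a) / variance(z_b)


def brown_forsythe_w(*groups):
    """Levene's W with median centring (Brown-Forsythe variant)."""
    devs = []
    for g in groups:
        med = median(g)
        devs.append([abs(x - med) for x in g])
    k = len(devs)
    n_total = sum(len(d) for d in devs)
    grand = sum(sum(d) for d in devs) / n_total
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in devs)
    within = sum((x - mean(d)) ** 2 for d in devs for x in d)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical toy rows (error_pp, true_share, n); not real data
sponsored = [(9.0, 0.40, 800), (4.0, 0.35, 1000), (8.0, 0.30, 600), (6.5, 0.45, 1200)]
indep = [(3.0, 0.40, 800), (-2.0, 0.35, 1000), (5.0, 0.05, 600), (1.0, 0.45, 1200)]

z_s, z_i = demeaned_z(sponsored), demeaned_z(indep)
print("variance ratio:", variance_ratio(z_s, z_i))
print("Brown-Forsythe W:", brown_forsythe_w(z_s, z_i))
```

The toy indep row with true share 0.05 shows the outlier mechanism: a small expected SE inflates z, which moves the plain variance ratio far more than the median-centred statistic.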
T2 detail — sponsored polls spread more, not less
Sponsored mean error = +12.12 pp (note: this is the raw cross-sectional mean, much larger than the within-cand FE +7.85 from the headline, because sponsored polls cluster on leading candidates who are naturally over-stated). 11.5 % of sponsored polls fall within ±2pp of this mean. The indep distribution has 19.0 % within ±2pp of its own mean of +2.86 pp.
Sponsored polls are less concentrated around their group mean than indep polls. Mechanism-agnostic reading: the sponsor effect is heterogeneous (some sponsored polls slant heavily, others barely), which is what we'd expect under design-driven slant calibrated per race, not under uniform fabrication.
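The T2 comparison reduces to a two-sample proportion z-test on the share of polls inside the ±2 pp band. A minimal sketch follows; the function names and counts are hypothetical, and the pooled standard error is an assumption, since the note does not say whether the pooled or unpooled form was used.

```python
import math


def count_within_band(errors, band_pp=2.0):
    """Number of polls whose error lies within +/- band_pp of the group mean."""
    centre = sum(errors) / len(errors)
    hits = sum(1 for e in errors if abs(e - centre) <= band_pp)
    return hits, len(errors)


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-sample z-test for p1 - p2; returns (z, two-sided p)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided normal tail: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts (sponsored first, indep second); not the AN-013 v2 sample sizes
z, p = two_proportion_z(10, 100, 20, 100)
print(z, p)
```

With sponsored as the first group, a negative z means sponsored polls are less concentrated around their own mean, which is the anti-fabrication direction.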
T3 detail — within-firm rounding shows real signal
Of 12 firms with ≥10 sponsored and ≥10 indep rows: 11/12 show TVD > 0.20 between sponsored and indep tenths-digit distributions (vs ~0.15-0.20 baseline under sampling noise at this n). 3 of 12 firms reach chi-square p<0.05 vs 0.6 expected under H0 — a 5× elevation.
The signal is real but the cause is not separable from this test alone. Candidate explanations:
- Fabrication of sponsored-poll data with a different rounding habit.
- Subcontracting: the firm farms sponsored work to a partner with different tabulation software.
- Customer-specific reporting: candidate customers may demand integer-percentage reports while media customers accept tenths.
- Different question-batteries: sponsored polls may report a different scenario than indep polls and the tenths-distribution reflects scenario differences, not tampering.
The third and fourth explanations plausibly account for most of the T3 signal, though this test cannot confirm that. Discriminating between them requires a within-firm × scenario test that is not built here.
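One way to calibrate a firm's TVD against sampling noise at its own row counts is a label-permutation null: pool the firm's rows, shuffle the sponsored/indep labels, and recompute the TVD. The sketch below is an alternative illustration, not necessarily how the ~0.15-0.20 baseline above was derived; the function names, seed, and permutation count are hypothetical choices.

```python
import random
from collections import Counter


def tenths_digit(value):
    """First digit after the decimal point, assuming one-decimal reporting."""
    return int(round(abs(value) * 10)) % 10


def digit_tvd(values_a, values_b):
    """TVD between the tenths-digit distributions of two groups of rows."""
    ca = Counter(tenths_digit(v) for v in values_a)
    cb = Counter(tenths_digit(v) for v in values_b)
    na, nb = len(values_a), len(values_b)
    return 0.5 * sum(abs(ca[d] / na - cb[d] / nb) for d in range(10))


def permutation_null(values_a, values_b, n_perm=2000, seed=0):
    """Returns (observed TVD, mean null TVD, permutation p-value)."""
    rng = random.Random(seed)
    pooled = list(values_a) + list(values_b)
    na = len(values_a)
    observed = digit_tvd(values_a, values_b)
    draws = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign sponsorship labels at random
        draws.append(digit_tvd(pooled[:na], pooled[na:]))
    exceed = sum(1 for d in draws if d >= observed - 1e-12)
    p_value = (1 + exceed) / (1 + n_perm)
    return observed, sum(draws) / n_perm, p_value


# Hypothetical firm: sponsored rows all integer-rounded, indep rows all at .5
print(permutation_null([10.0] * 10, [10.5] * 10, n_perm=500))
```

The mean null TVD depends strongly on the number of rows per group, so the baseline is best computed firm by firm rather than assumed constant.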
Interpretation
Net result: simple fabrication is unlikely
T1 and T2 together argue against a single-lever fabrication story: sponsored polls spread MORE than indep around their own bias mean, not less. A fabricator targeting +7 pp would leave a tighter distribution; we see a wider one. AN-013 v1's null on per-row digit tampering + AN-013 v2's null on distribution-shape "too clean" tests shrink the plausible magnitude of the fabrication category from the prior 2-5 pp toward 0-2 pp.
T3's within-firm rounding signal is real and warrants documentation, but mundane explanations (customer-specific reporting formats, subcontracting, scenario differences) are at least as plausible as fabrication.
Where the residual now lives
The docs/thinking.md residual decomposition (2026-06-14, updated
2026-06-14 with AN-059 + AN-013 v2):
| Category | Prior magnitude | Post-update (this AN) |
|---|---|---|
| Sample-design-consistent fabrication | 2-5 pp | 0-2 pp (T1+T2 nulls) |
| Firm-level slant-for-hire selection | 2-4 pp | 0 pp (AN-059) |
| Wave selection | 1-3 pp | 0 pp (AN-003 placebo) |
| Sample frame contamination | 1-4 pp | unchanged (untestable from registered data) |
| Interviewer scripting | 0-2 pp | unchanged (untestable from registered data) |
| Strategic timing × news events | 1-2 pp | unchanged (needs event db) |
The unexplained 2-6 pp is now concentrated in the three "effectively untestable from registered data" categories. The mechanism story for the paper increasingly looks like a constellation:
- ~1-2 pp from under-documented scenario rotation (AN-051)
- ~0-2 pp from selective disclosure of weighting (ponderação) (AN-057)
- ~0-1 pp from population-reference mismatch (AN-056)
- A residual 2-6 pp from sample-frame contamination, interviewer scripting, and strategic timing — small individually, additive collectively, individually untestable from public data.
Follow-ups
- T3 disambiguation (low-priority, modest paper value). Within-firm × scenario test of the tenths-digit distribution would separate "different scenarios reported" from genuine processing differential. Cheap (no new data) but the signal is already documented; pursuing it further is a refinement, not a discovery.
- Event-database build for the strategic-timing test, #6 in the residual decomposition (highest remaining test value). The pipelines/justica EJ pipeline already has event-side data (campaign-event filings, lawsuit cycles); joining it to poll field_period_week would test whether sponsored polls cluster in post-event windows. Estimated effort: ~1-2 days for the join + analysis.
- Update paper's Channel A vs B narrative. The accumulating null/wrong-signed pattern across structural levers + the fabrication-unlikely finding here together argue the +7 pp is a constellation, not a single lever. The paper currently leaves the mechanism unresolved; this set of analyses turns "unresolved" into "concentrated in the data-inaccessible categories" — a sharper claim worth making explicitly.
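The event-window join proposed in the strategic-timing follow-up could look roughly like the sketch below. Everything except field_period_week is an assumption: the `race`, `event_week`, and `sponsored` keys, the integer week index, and the two-week window are hypothetical placeholders, since the event-side schema is not described here.

```python
from collections import defaultdict


def post_event_share(polls, events, window_weeks=2):
    """Share of polls fielded 0..window_weeks after an event in the same race.

    Returns {True: share among sponsored, False: share among indep};
    a group with no polls maps to None.
    """
    event_weeks = defaultdict(list)
    for ev in events:
        event_weeks[ev["race"]].append(ev["event_week"])

    hits = {True: 0, False: 0}
    totals = {True: 0, False: 0}
    for poll in polls:
        week = poll["field_period_week"]
        in_window = any(
            0 <= week - ew <= window_weeks for ew in event_weeks[poll["race"]]
        )
        totals[poll["sponsored"]] += 1
        hits[poll["sponsored"]] += int(in_window)
    return {k: (hits[k] / totals[k] if totals[k] else None) for k in totals}


# Hypothetical toy rows; not real data
events = [{"race": "A", "event_week": 10}]
polls = [
    {"race": "A", "field_period_week": 11, "sponsored": True},
    {"race": "A", "field_period_week": 9, "sponsored": True},
    {"race": "A", "field_period_week": 20, "sponsored": False},
    {"race": "B", "field_period_week": 10, "sponsored": False},
]
print(post_event_share(polls, events))
```

The two shares could then be compared with a two-sample proportion test; a higher post-event share among sponsored polls would be consistent with strategic timing.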