AN-065: Publication-selection proxy via firm-race coverage

Firm-race coverage proxy for the publication-selection alternative. In the runoff-eligible 172-muni sample, sponsored polls of firms with NO media coverage show MORE bias (mean error +15.2 pp, n=10) than sponsored polls of firms WITH coverage (+7.2 pp, n=28) — the opposite of what publication selection predicts. n=10 is too thin for inference, but the direction does not support publication selection driving the headline.

Hypothesis: H1: Self-sponsored polls overstate the sponsoring candidate
Confidence: yellow
Type: descriptive

Design

Sample: All cand_poll rows with matched_share==1, joined to AN-025's 172-muni runoff-eligible sample on (muni_id, institute). Three strata: outside_an025 (no coverage data), in_an025_no_hit, in_an025_any_hit.
Specification: Descriptive: mean error by (sponsored × coverage stratum). No regression; n=10 in the no-hit sponsored stratum is too thin.
Comparator: Firm-race coverage indicator from AN-025 (any_hit = n_hits ≥ 1 on the Google News firm-race query).
Notes: Firm-race coverage is a coarse proxy for per-poll publication. The clean test (per-poll publication status) would require a per-protocol scrape — flagged in todo.md G2 as data-limited.

Script: source/analysis/an-065-coverage-proxy.py
Target: build/table/an-065-coverage-proxy.csv
Status: interpreted · 2026-06-14
Created: 2026-06-14

Question

GPT-5-pro's 2026-06-14 pre-submission review flagged that the intro's "the universe of registered polls plausibly includes slanted polls that never reached the wider electorate" assertion deserves a direct empirical test (G2 in docs/todo.md).

The clean test (does β differ between published and unpublished sponsored polls?) requires per-poll publication status. The Google News scrape from AN-025 queries firm-name × race-name rather than individual protocols, so the cleanest test the data permit is a firm-race coverage proxy.

Design

Join cand_poll.parquet to AN-025's media_amplification.csv on (muni_id, institute). Three strata for each candidate-poll row:

Stratum	Meaning	n (total)	n (sponsored_by==1)
`outside_an025`	muni not in the 172-muni runoff-eligible AN-025 sample (no data)	15,674	412
`in_an025_no_hit`	firm-race in the sample, any_hit = 0	955	10
`in_an025_any_hit`	firm-race in the sample, any_hit = 1 (n_hits ≥ 1)	6,036	28

Only 38 of the 450 sponsored polls fall in the runoff-eligible sample where the coverage indicator exists. The runoff-eligible sample over-represents large pollsters in big munis — exactly the firms that §sec:within-firm shows do not carry the headline slant. Sponsored polls from small low-volume firms (where the slant concentrates) are mostly in outside_an025, where no coverage data exists.
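The join and stratum assignment described above can be sketched roughly as follows. This is a minimal illustration, not the contents of an-065-coverage-proxy.py: the function name `assign_strata` is hypothetical, and it assumes that any candidate-poll row without a matching (muni_id, institute) row in the coverage file counts as `outside_an025`.

```python
import pandas as pd

def assign_strata(cand_poll: pd.DataFrame, media: pd.DataFrame) -> pd.DataFrame:
    """Left-join firm-race coverage onto candidate-poll rows and label strata.

    Assumes `media` has columns muni_id, institute, any_hit (0/1) and that
    unmatched poll rows belong to the outside_an025 stratum.
    """
    keys = ["muni_id", "institute"]
    # Keep one row per firm-race so the join cannot duplicate poll rows.
    coverage = media[keys + ["any_hit"]].drop_duplicates(subset=keys)
    merged = cand_poll.merge(coverage, on=keys, how="left", validate="many_to_one")

    # Default to "outside"; rows with coverage data overwrite the label.
    merged["stratum"] = "outside_an025"
    merged.loc[merged["any_hit"] == 0, "stratum"] = "in_an025_no_hit"
    merged.loc[merged["any_hit"] == 1, "stratum"] = "in_an025_any_hit"
    return merged
```

A left join keeps every candidate-poll row, so the three stratum counts should sum to the input row count; `validate="many_to_one"` makes pandas raise if the coverage side is not unique on the join keys.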

Results

Coverage stratum	n (sponsored / non-sponsored)	Sponsored mean error (pp)	Non-sponsored mean error (pp)	Within-stratum sponsored–not gap (pp)
`outside_an025` (no coverage data)	412 / 15,262	+8.49	+0.56	+7.93
`in_an025_no_hit`	10 / 945	+15.17	+1.21	+13.96
`in_an025_any_hit`	28 / 6,008	+7.24	+1.69	+5.55
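A table of this shape can be produced with a single group-by. The sketch below is illustrative only: the function name `stratum_summary` and the error column name `error_pp` are assumptions, while `stratum` and `sponsored_by` follow the names used in this note.

```python
import pandas as pd

def stratum_summary(df: pd.DataFrame, error_col: str = "error_pp") -> pd.DataFrame:
    """Cell counts, mean error by sponsorship, and the sponsored-minus-not gap.

    Assumes `sponsored_by` is coded 0/1 and `error_col` is in percentage points.
    """
    stats = df.groupby(["stratum", "sponsored_by"])[error_col].agg(["count", "mean"])
    # Pivot sponsorship into columns: (statistic, sponsored_by) pairs.
    wide = stats.unstack("sponsored_by")
    out = pd.DataFrame({
        "n_sponsored": wide[("count", 1)],
        "n_non_sponsored": wide[("count", 0)],
        "mean_sponsored": wide[("mean", 1)],
        "mean_non_sponsored": wide[("mean", 0)],
    })
    # Within-stratum gap: sponsored mean minus non-sponsored mean.
    out["gap"] = out["mean_sponsored"] - out["mean_non_sponsored"]
    return out
```

The gap column is a plain difference of the two means, so the no-hit row above corresponds to 15.17 − 1.21 = 13.96 pp.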

Interpretation

The publication-selection alternative predicts that sponsored polls of firms with no media coverage should show lower error than sponsored polls of covered firms — because slanted polls would be filtered out before reaching media. The observed pattern is the opposite:

Sponsored mean error in in_an025_no_hit cells: +15.2 pp (n=10)
Sponsored mean error in in_an025_any_hit cells: +7.2 pp (n=28)

The within-stratum sponsored-vs-not gap is +14.0 pp in the no-hit stratum versus +5.6 pp in the any-hit stratum (13.96 vs 5.55 before rounding), so sponsor bias is larger in uncovered firm-race cells. n=10 is too thin for inference, but the direction does not support publication selection driving the headline.
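One way to gauge how noisy a mean on n=10 is would be a percentile bootstrap interval. The helper below is a hypothetical sketch, not part of the AN-065 script. It treats rows as independent, which may not hold if several candidate rows come from the same poll or firm, and with so few observations the interval itself is only a rough guide.

```python
import numpy as np

def bootstrap_mean_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a small sample.

    Assumes observations are independent; clustered rows would need a
    cluster-level resampling scheme instead.
    """
    x = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Draw n_boot resamples with replacement, each the size of the sample.
    idx = rng.integers(0, x.size, size=(n_boot, x.size))
    boot_means = x[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Applied to the ten sponsored errors in the no-hit stratum, a wide interval would confirm that the +15.2 pp mean supports a statement about direction only.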

Caveats

This is a proxy, not the clean test:

Firm-race, not per-poll. A firm can have media hits in a race even if some of its individual polls in that race were never published. We cannot identify which protocols were published.
Small n in the no-hit sponsored stratum. Only 10 sponsored polls fall in the no-hit cells; the mean estimate is noisy.
Runoff-eligible sample is the wrong tail. The small low-volume pollsters that carry the headline slant (§sec:within-firm) serve smaller munis where AN-025's Google News scrape doesn't have data at all. The descriptive evidence here is informative about the outside_an025 polls only indirectly (mean error +8.5 pp on n=412, broadly consistent with the headline).

A direct per-poll test would require a per-protocol scrape with poll dates and firm names — flagged in docs/todo.md G2 as data-limited.

Follow-ups

A per-protocol Google News scrape would enable the clean version of this test. Cost: re-scrape with poll-date queries; non-trivial.
Within-firm β by coverage tier (already in §sec:within-firm via the AN-025 cross-link) is a complementary cut.