Firm-race coverage proxy for the publication-selection alternative. In the runoff-eligible 172-muni sample, sponsored polls of firms with NO media coverage show MORE bias (mean error +15.2 pp, n=10) than sponsored polls of firms WITH coverage (+7.2 pp, n=28) — the opposite of what publication selection predicts. n=10 is too thin for inference, but the direction does not support publication selection driving the headline.

Confidence: yellow
Type: descriptive
Design
Sample: All cand_poll rows with matched_share==1, joined to AN-025's 172-muni runoff-eligible sample on (muni_id, institute). Three strata: outside_an025 (no coverage data), in_an025_no_hit, in_an025_any_hit.
Specification: Descriptive: mean error by (sponsored × coverage stratum). No regression; n=10 in the no-hit sponsored stratum is too thin.
Comparator: Firm-race coverage indicator from AN-025 (any_hit = n_hits ≥ 1 on the Google News firm-race query).
Notes: Firm-race coverage is a coarse proxy for per-poll publication. The clean test (per-poll publication status) would require a per-protocol scrape; flagged in todo.md G2 as data-limited.
Script: source/analysis/an-065-coverage-proxy.py
Target: build/table/an-065-coverage-proxy.csv
Status: interpreted · 2026-06-14
Created: 2026-06-14

Question

GPT-5-pro's 2026-06-14 pre-submission review flagged that the intro's "the universe of registered polls plausibly includes slanted polls that never reached the wider electorate" assertion deserves a direct empirical test (G2 in docs/todo.md).

The clean test (does β differ between published and unpublished sponsored polls?) requires per-poll publication status. The Google News scrape from AN-025 queries firm-name × race-name, not individual protocols, so the cleanest test the data permit is a firm-race coverage proxy.

Design

Join cand_poll.parquet to AN-025's media_amplification.csv on (muni_id, institute). Three strata for each candidate-poll row:

| Stratum | Meaning | n (total) | n (sponsored_by==1) |
| --- | --- | --- | --- |
| outside_an025 | muni not in the 172-muni runoff-eligible AN-025 sample (no data) | 15,674 | 412 |
| in_an025_no_hit | firm-race in the sample, any_hit = 0 | 955 | 10 |
| in_an025_any_hit | firm-race in the sample, any_hit = 1 (n_hits ≥ 1) | 6,036 | 28 |
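The stratum assignment described above can be sketched as a left join with a merge indicator. This is a minimal illustration, not the code in an-065-coverage-proxy.py: the column names muni_id, institute, any_hit, and sponsored_by come from this note, while the function name `assign_stratum` and the toy rows are hypothetical.

```python
import pandas as pd


def assign_stratum(cand_poll: pd.DataFrame, media: pd.DataFrame) -> pd.DataFrame:
    """Left-join polls to firm-race coverage and label each row's stratum."""
    keys = ["muni_id", "institute"]
    merged = cand_poll.merge(
        media[keys + ["any_hit"]],
        on=keys,
        how="left",
        indicator=True,          # adds a _merge column: "both" or "left_only"
        validate="many_to_one",  # fail loudly if a firm-race key is duplicated
    )
    in_sample = merged["_merge"] == "both"
    # Rows without a firm-race match have no coverage data at all.
    merged["stratum"] = "outside_an025"
    merged.loc[in_sample & (merged["any_hit"] == 0), "stratum"] = "in_an025_no_hit"
    merged.loc[in_sample & (merged["any_hit"] == 1), "stratum"] = "in_an025_any_hit"
    return merged.drop(columns="_merge")


# Toy rows for illustration only; these are not values from the real data.
cand_poll = pd.DataFrame({
    "muni_id": [1, 1, 2, 3],
    "institute": ["A", "B", "A", "C"],
    "sponsored_by": [1, 0, 0, 1],
})
media = pd.DataFrame({
    "muni_id": [1, 1],
    "institute": ["A", "B"],
    "any_hit": [0, 1],
})

labelled = assign_stratum(cand_poll, media)
print(labelled[["muni_id", "institute", "sponsored_by", "stratum"]])
```

Using the merge indicator rather than testing any_hit for missing values keeps "not in the sample" distinct from "in the sample with zero hits", which is the distinction the three strata rely on.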

Only 38 of the 450 sponsored polls fall in the runoff-eligible sample where the coverage indicator exists. The runoff-eligible sample over-represents large pollsters in big munis — exactly the firms that §sec:within-firm shows do not carry the headline slant. Sponsored polls from small low-volume firms (where the slant concentrates) are mostly in outside_an025, where no coverage data exists.

Results

| Coverage stratum | n (sponsored / non-sponsored) | Sponsored mean error (pp) | Non-sponsored mean error (pp) | Within-stratum sponsored minus non-sponsored gap (pp) |
| --- | --- | --- | --- | --- |
| outside_an025 (no coverage data) | 412 / 15,262 | +8.49 | +0.56 | +7.93 |
| in_an025_no_hit | 10 / 945 | +15.17 | +1.21 | +13.96 |
| in_an025_any_hit | 28 / 6,008 | +7.24 | +1.69 | +5.55 |
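A table of this shape can be produced with a grouped mean followed by a within-stratum difference. The sketch below assumes a row-level frame with columns stratum and sponsored_by (named in this note) plus a per-row error column, here called error_pp, which is a hypothetical name. The toy values are invented for illustration and are not the study's data.

```python
import pandas as pd


def gap_table(df: pd.DataFrame) -> pd.DataFrame:
    """Mean error by stratum and sponsorship, plus the within-stratum gap."""
    stats = (
        df.groupby(["stratum", "sponsored_by"])["error_pp"]
        .agg(["mean", "size"])
        .unstack("sponsored_by")  # columns become (statistic, sponsored_by)
    )
    out = pd.DataFrame({
        "n_sponsored": stats[("size", 1)],
        "n_non_sponsored": stats[("size", 0)],
        "mean_sponsored": stats[("mean", 1)],
        "mean_non_sponsored": stats[("mean", 0)],
    })
    # Gap = sponsored mean minus non-sponsored mean within the same stratum.
    out["gap"] = out["mean_sponsored"] - out["mean_non_sponsored"]
    return out


# Invented toy rows; real inputs would come from the joined cand_poll data.
toy = pd.DataFrame({
    "stratum": ["in_an025_no_hit"] * 4 + ["in_an025_any_hit"] * 4,
    "sponsored_by": [1, 1, 0, 0, 1, 1, 0, 0],
    "error_pp": [10.0, 20.0, 1.0, 3.0, 6.0, 8.0, 2.0, 2.0],
})

table = gap_table(toy)
print(table)
```

Reporting the cell sizes next to the means keeps the thin cells visible, which matters here because one of the sponsored cells has only 10 polls.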

Interpretation

The publication-selection alternative predicts that sponsored polls of firms with no media coverage should show lower error than sponsored polls of covered firms, because under that account the slant arises from selecting which polls reach the media and should therefore concentrate in covered firm-race cells. The observed pattern is the opposite:

The within-stratum sponsored-vs-not gap is +14 pp in the no-hit stratum vs +5.6 pp in the any-hit stratum — sponsor bias is larger in uncovered firm-race cells. n=10 is too thin for inference, but the direction does not support publication selection driving the headline.
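As a quick arithmetic check, the gaps can be recomputed from the stratum means reported in the Results table. The snippet uses only the published figures; the variable names are illustrative.

```python
# Stratum means (pp) and gaps copied from the Results table above.
reported = {
    "outside_an025": {"sponsored": 8.49, "non_sponsored": 0.56, "gap": 7.93},
    "in_an025_no_hit": {"sponsored": 15.17, "non_sponsored": 1.21, "gap": 13.96},
    "in_an025_any_hit": {"sponsored": 7.24, "non_sponsored": 1.69, "gap": 5.55},
}

# Recompute each within-stratum gap as sponsored minus non-sponsored.
gaps = {
    name: round(row["sponsored"] - row["non_sponsored"], 2)
    for name, row in reported.items()
}

# How much larger the gap is in uncovered cells than in covered cells.
gap_difference = round(gaps["in_an025_no_hit"] - gaps["in_an025_any_hit"], 2)

for name, gap in gaps.items():
    print(f"{name}: recomputed gap {gap:+.2f} pp, reported {reported[name]['gap']:+.2f} pp")
print(f"no-hit gap minus any-hit gap: {gap_difference:+.2f} pp")
```

The difference between the two in-sample gaps is a point comparison only; with 10 sponsored polls in the no-hit stratum it carries no inferential weight.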

Caveats

This is a proxy, not the clean test:

  1. Firm-race, not per-poll. A firm can have media hits in a race even when some of its polls in that race were never published. We cannot identify which protocols were published.
  2. Small n in the no-hit sponsored stratum. Only 10 sponsored polls fall in the no-hit cells, so the mean estimate is noisy.
  3. Runoff-eligible sample is the wrong tail. The small low-volume pollsters that carry the headline slant (§sec:within-firm) serve smaller munis where AN-025's Google News scrape has no data at all. The evidence here therefore speaks to the outside_an025 polls only indirectly; their mean error (+8.5 pp, n=412) is broadly consistent with the headline.

A direct per-poll test would require a per-protocol scrape with poll dates and firm names — flagged in docs/todo.md G2 as data-limited.

Follow-ups