Firm-race coverage proxy for the publication-selection alternative. In the runoff-eligible 172-muni sample, sponsored polls of firms with NO media coverage show MORE bias (mean error +15.2 pp, n=10) than sponsored polls of firms WITH coverage (+7.2 pp, n=28) — the opposite of what publication selection predicts. n=10 is too thin for inference, but the direction does not support publication selection driving the headline.
Question
GPT-5-pro's 2026-06-14 pre-submission review flagged that the
intro's "the universe of registered polls plausibly includes
slanted polls that never reached the wider electorate" assertion
deserves a direct empirical test (G2 in docs/todo.md).
The clean test (does β differ between published and unpublished
sponsored polls?) requires per-poll publication status. The
Google News scrape from AN-025 queries firm-name × race-name,
not per-protocol, so the cleanest test the data permits is a
firm-race coverage proxy.
Design
Join cand_poll.parquet to AN-025's media_amplification.csv on
(muni_id, institute). Three strata for each candidate-poll row:
| Stratum | Meaning | n (total) | n (sponsored_by==1) |
|---|---|---|---|
| outside_an025 | muni not in the 172-muni runoff-eligible AN-025 sample (no data) | 15,674 | 412 |
| in_an025_no_hit | firm-race in the sample, any_hit = 0 | 955 | 10 |
| in_an025_any_hit | firm-race in the sample, any_hit ≥ 1 | 6,036 | 28 |
Only 38 of the 450 sponsored polls fall in the runoff-eligible
sample where the coverage indicator exists. The runoff-eligible
sample over-represents large pollsters in big munis — exactly the
firms that §sec:within-firm shows do not carry the headline
slant. Sponsored polls from small low-volume firms (where the
slant concentrates) are mostly in outside_an025, where no
coverage data exists.
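The stratum assignment described above can be sketched as follows. This is a minimal illustration, not the analysis script itself: the function name, the toy muni IDs and firm names, and the choice to treat an in-sample firm-race that is missing from the coverage file as any_hit = 0 are all assumptions made here for the example. Only the key (muni_id, institute), the any_hit indicator, the sponsored_by flag, and the three stratum labels come from the note.

```python
from collections import Counter


def assign_stratum(muni_id, institute, sample_munis, any_hit_by_cell):
    """Return the coverage stratum for one candidate-poll row.

    sample_munis: set of muni_ids in the runoff-eligible AN-025 sample.
    any_hit_by_cell: dict mapping (muni_id, institute) -> any_hit value.
    """
    if muni_id not in sample_munis:
        return "outside_an025"
    # Assumption: a firm-race in a sampled muni with no row in the
    # coverage file is treated as having zero hits.
    hits = any_hit_by_cell.get((muni_id, institute), 0)
    return "in_an025_any_hit" if hits >= 1 else "in_an025_no_hit"


# Hypothetical toy inputs standing in for cand_poll.parquet and
# media_amplification.csv.
sample_munis = {101, 102}
any_hit_by_cell = {(101, "FirmA"): 3, (102, "FirmB"): 0}
rows = [
    {"muni_id": 101, "institute": "FirmA", "sponsored_by": 1},
    {"muni_id": 102, "institute": "FirmB", "sponsored_by": 0},
    {"muni_id": 999, "institute": "FirmC", "sponsored_by": 1},
]

strata = [
    assign_stratum(r["muni_id"], r["institute"], sample_munis, any_hit_by_cell)
    for r in rows
]
stratum_counts = Counter(strata)
```

In a dataframe workflow the same logic would correspond to a left join of the poll rows onto the coverage file, followed by the same three-way classification.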
Results
| Coverage stratum | n (sponsored / non-sponsored) | Sponsored mean error (pp) | Non-sponsored mean error (pp) | Within-stratum sponsored–not gap (pp) |
|---|---|---|---|---|
| outside_an025 (no coverage data) | 412 / 15,262 | +8.49 | +0.56 | +7.93 |
| in_an025_no_hit | 10 / 945 | +15.17 | +1.21 | +13.96 |
| in_an025_any_hit | 28 / 6,008 | +7.24 | +1.69 | +5.55 |
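For readers who want to reproduce the table layout, the sketch below shows one way to compute the per-stratum means and the sponsored-minus-not gap from row-level data. The column name error_pp, the function name, and the toy rows are hypothetical; the final dictionary only re-derives the gap column from the means reported in the table.

```python
from collections import defaultdict
from statistics import mean


def sponsor_gap_by_stratum(rows):
    """Per-stratum mean error by sponsorship flag, plus the gap.

    Each row needs 'stratum', 'sponsored_by' (0 or 1) and 'error_pp'.
    Raises statistics.StatisticsError if a stratum lacks either group.
    """
    errors = defaultdict(lambda: {0: [], 1: []})
    for row in rows:
        errors[row["stratum"]][row["sponsored_by"]].append(row["error_pp"])

    summary = {}
    for stratum, by_flag in errors.items():
        sponsored_mean = mean(by_flag[1])
        non_sponsored_mean = mean(by_flag[0])
        summary[stratum] = {
            "n_sponsored": len(by_flag[1]),
            "n_non_sponsored": len(by_flag[0]),
            "sponsored_mean": sponsored_mean,
            "non_sponsored_mean": non_sponsored_mean,
            "gap": sponsored_mean - non_sponsored_mean,
        }
    return summary


# Hypothetical toy rows; the error values are illustrative only.
toy_rows = [
    {"stratum": "in_an025_no_hit", "sponsored_by": 1, "error_pp": 10.0},
    {"stratum": "in_an025_no_hit", "sponsored_by": 1, "error_pp": 20.0},
    {"stratum": "in_an025_no_hit", "sponsored_by": 0, "error_pp": 1.0},
    {"stratum": "in_an025_no_hit", "sponsored_by": 0, "error_pp": 3.0},
]
summary = sponsor_gap_by_stratum(toy_rows)

# Re-derive the gap column from the means reported in the table:
# (sponsored mean, non-sponsored mean) in percentage points.
reported_means = {
    "outside_an025": (8.49, 0.56),
    "in_an025_no_hit": (15.17, 1.21),
    "in_an025_any_hit": (7.24, 1.69),
}
reported_gaps = {
    name: round(sponsored - non_sponsored, 2)
    for name, (sponsored, non_sponsored) in reported_means.items()
}
```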
Interpretation
The publication-selection alternative predicts that sponsored polls of firms with no media coverage should show lower error than sponsored polls of covered firms — because slanted polls would be filtered out before reaching media. The observed pattern is the opposite:
- Sponsored mean error in in_an025_no_hit cells: +15.2 pp (n=10)
- Sponsored mean error in in_an025_any_hit cells: +7.2 pp (n=28)
The within-stratum sponsored-vs-not gap is +14 pp in the no-hit stratum vs +5.6 pp in the any-hit stratum — sponsor bias is larger in uncovered firm-race cells. n=10 is too thin for inference, but the direction does not support publication selection driving the headline.
Caveats
This is a proxy, not the clean test:
- Firm-race, not per-poll. A firm can have media hits in a race even if some of its polls in that race were never published. We cannot identify which protocols were published.
- Small n in the no-hit sponsored stratum. Only 10 sponsored polls fall in the no-hit cells; the mean estimate is noisy.
- Runoff-eligible sample is the wrong tail. The small low-volume
pollsters that carry the headline slant (§sec:within-firm) serve
smaller munis where AN-025's Google News scrape doesn't have data
at all. The descriptive evidence here is informative about the
outside_an025 polls only indirectly (mean error +8.5 pp on n=412, broadly consistent with the headline).
A direct per-poll test would require a per-protocol scrape with poll
dates and firm names — flagged in docs/todo.md G2 as data-limited.
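To make the small-n caveat concrete, one option is a percentile bootstrap interval around a stratum mean. The sketch below is illustrative only: the ten error values are hypothetical placeholders, not the actual no-hit sponsored polls, and the resampling treats rows as independent, which ignores any clustering of candidate rows within polls or firms.

```python
import random
from statistics import mean


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of a small sample."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    boot_means = sorted(
        mean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    lo_idx = int(round((alpha / 2) * n_boot))
    hi_idx = int(round((1 - alpha / 2) * n_boot)) - 1
    return boot_means[lo_idx], boot_means[hi_idx]


# Hypothetical errors (pp) for a stratum with n=10; NOT the real data.
hypothetical_errors = [2.0, 31.5, 8.0, 22.0, -4.5, 18.0, 27.0, 5.5, 12.0, 30.0]
lo, hi = bootstrap_mean_ci(hypothetical_errors)
```

With only ten observations an interval of this kind will typically be wide, which is the practical meaning of "too thin for inference" here.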
Follow-ups
- A per-protocol Google News scrape would enable the clean version of this test. Cost: re-scrape with poll-date queries; non-trivial.
- Within-firm β by coverage tier (already in §sec:within-firm via the AN-025 cross-link) is a complementary cut.