AN-066: Benjamini-Hochberg FDR correction on the design-inventory table

Benjamini-Hochberg FDR correction on the 10 within-pair Channel-A directional tests. 6 of 10 survive at q < 0.05 and 7 of 10 at q < 0.10. The 'small positive, marginal' population-frame mismatch lever (p = 0.12) does NOT survive correction (q_BH = 0.15), sharpening the 'no single lever carries +7 pp' conclusion.

Hypothesis: H1: Self-sponsored polls overstate the sponsoring candidate
Confidence: green
Type: robustness

Design

Sample: 10 within-pair tests displayed in Table 3 of paper/paper.tex (paper Channel-A design-inventory).
Specification: BH q_i = p_i * m / rank_i with enforced monotonicity. m = 10.
Notes: p-values hardcoded from the published table to avoid recomputing across 10 different analysis scripts; each row in the script cites the AN page or analysis that produced its p.

Script: source/analysis/an-066-fdr-bh.py
Target: build/table/an-066-fdr-bh.csv
Status: interpreted · 2026-06-14
Created: 2026-06-14

Question

GPT-5-pro's 2026-06-14 pre-submission review flagged that the Channel-A design-inventory in Table 3 of paper/paper.tex runs 10 within-pair directional tests on different design levers and invites a multiple-testing critique. The standard remedy is to BH-correct the displayed p-values and report q-values alongside.

Design

10 within-pair tests, one per lever, ordered as in Table 3:

Rank by p	Lever	p (displayed)	Source AN
1	Scenario-rotation documentation	~4e-8	AN-051
2	Sample-design-consistent fabrication	<10⁻⁴	AN-013v2
3	Phone-mode substitution	0.0003	AN-041
4	Partisan stronghold over-sampling	<0.001	bairro-string oversample
5	Interviewer-training omission	0.002	AN-042
6	Coverage deferral at registration	0.02	AN-024
7	Ponderação specificity	0.04	AN-057
8	Population-frame mismatch (mixed)	0.12	AN-020 + frame
9	Methodology completeness gap	0.22	AN-022
10	Audit-rate floor	1.00	AN-021

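BH q-value: q_i = p_i * m / rank_i, where rank_i is the ascending rank of p_i among the m = 10 tests. A right-to-left running min (from rank 10 down to rank 1) then enforces monotonicity, so no q-value exceeds the q-value of a higher-ranked test.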
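The specification above can be sketched in a few lines of Python. This is an illustrative sketch, not the contents of an-066-fdr-bh.py: the names P_VALUES and bh_qvalues are hypothetical, and the p-values are the point values shown in the Results table of this note (where the displayed table gives only a bound, the Results-table value is used as-is).

```python
# Illustrative sketch only; names are hypothetical, not taken from the script.
# p-values are the point values from the Results table of this note.
P_VALUES = {
    "Scenario-rotation documentation": 4e-8,
    "Sample-design-consistent fabrication": 1e-6,
    "Phone-mode substitution": 0.0003,
    "Partisan stronghold over-sampling": 0.001,
    "Interviewer-training omission": 0.002,
    "Coverage deferral at registration": 0.02,
    "Ponderação specificity": 0.04,
    "Population-frame mismatch": 0.12,
    "Methodology completeness gap": 0.22,
    "Audit-rate floor": 1.00,
}


def bh_qvalues(pvals):
    """Return BH q-values in the same order as the input p-values."""
    m = len(pvals)
    # Indices sorted by ascending p; rank = position in this order + 1.
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p to the smallest (right to left) so each
    # q-value is capped by every q-value at a higher rank.
    for pos in range(m - 1, -1, -1):
        i = order[pos]
        raw = pvals[i] * m / (pos + 1)
        running_min = min(running_min, raw)
        q[i] = running_min
    return q


levers = list(P_VALUES)
q_bh = bh_qvalues([P_VALUES[name] for name in levers])
for name, q in zip(levers, q_bh):
    print(f"{name}: q = {q:.3g}")

# Survivor counts at the two thresholds used in this note.
n_05 = sum(q < 0.05 for q in q_bh)
n_10 = sum(q < 0.10 for q in q_bh)
print(f"survive q<0.05: {n_05}; survive q<0.10: {n_10}")
```

The running minimum is a no-op on these inputs because the raw q-values are already increasing in rank, but keeping it makes the sketch safe if an upstream p-value changes.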

Results

Rank	Lever	p	q (BH)	Survives q<0.05?	Survives q<0.10?
1	Scenario-rotation documentation	4e-8	4e-7	✓	✓
2	Sample-design-consistent fabrication	1e-6	5e-6	✓	✓
3	Phone-mode substitution	0.0003	0.001	✓	✓
4	Partisan stronghold over-sampling	0.001	0.0025	✓	✓
5	Interviewer-training omission	0.002	0.004	✓	✓
6	Coverage deferral at registration	0.02	0.033	✓	✓
7	Ponderação specificity	0.04	0.057	✗	✓
8	Population-frame mismatch	0.12	0.15	✗	✗
9	Methodology completeness gap	0.22	0.244	✗	✗
10	Audit-rate floor	1.00	1.00	✗	✗

6 of 10 survive at q<0.05; 7 of 10 at q<0.10.

Interpretation

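The strong rejections (reversed-sign Channel-A predictions: rotation documentation, fabrication, phone-mode, partisan-stronghold, interviewer-training, coverage-deferral) all survive correction at q < 0.05. The first five do so by a wide margin (q ≤ 0.004); coverage deferral is the closest to the threshold at q = 0.033.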
The "small positive, marginal" rows in the table — the only same-direction-as-Channel-A entries — do NOT survive correction: population-frame mismatch (p=0.12, q=0.15) and methodology completeness (p=0.22, q=0.24) are both above any standard FDR threshold.
Combined verdict: under multiple-testing discipline, the table contains no surviving lever whose direction matches Channel A's prediction with the headline-required magnitude. Sharpens the "no single lever carries +7 pp" claim.

Caveats

The p-values are hardcoded off the published table rather than recomputed from each underlying analysis script. This is a trade-off (faster + traceable to the table the reader sees, but requires manual sync if any upstream analysis changes). Each entry in an-066-fdr-bh.py:LEVERS cites the source AN page so the trace is grep-checkable.
BH controls FDR under independence or positive regression dependence (PRDS); under arbitrary dependence the more conservative Benjamini-Yekutieli (BY) procedure applies. The 10 tests are on different design levers in partly overlapping samples, so PRDS is plausible but not formally shown. Conservative readers should treat the BH q-values as optimistic lower bounds: the BY adjustment multiplies each q-value by c(m) = 1 + 1/2 + ... + 1/m, which is about 2.93 for m = 10, capped at 1.
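For readers who want the conservative reading, the BY adjustment can be sketched as below. This is a hedged illustration rather than part of an-066-fdr-bh.py: P_SORTED and by_qvalues are hypothetical names, and the inputs are the Results-table p-values in rank order. BY uses the same step-up rule as BH with the extra factor c(m), the sum of 1/k for k = 1..m.

```python
# Illustrative sketch only; names are hypothetical, not taken from the script.
# Results-table p-values, already sorted ascending (ranks 1..10).
P_SORTED = [4e-8, 1e-6, 0.0003, 0.001, 0.002, 0.02, 0.04, 0.12, 0.22, 1.00]


def by_qvalues(p_sorted):
    """Benjamini-Yekutieli q-values for p-values sorted in ascending order."""
    m = len(p_sorted)
    # Harmonic-sum penalty that makes the bound valid under arbitrary dependence.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    q = [0.0] * m
    running_min = 1.0  # starting at 1.0 also caps every q-value at 1
    for pos in range(m - 1, -1, -1):
        raw = p_sorted[pos] * m * c_m / (pos + 1)
        running_min = min(running_min, raw)
        q[pos] = running_min
    return q, c_m


q_by, c_m = by_qvalues(P_SORTED)
print(f"c(m) = {c_m:.4f}")
for rank, q in enumerate(q_by, start=1):
    print(f"rank {rank}: q_BY = {q:.3g}")

# Survivor counts under BY at the two thresholds used in this note.
n_by_05 = sum(q < 0.05 for q in q_by)
n_by_10 = sum(q < 0.10 for q in q_by)
```

Comparing n_by_05 and n_by_10 against the BH counts shows how much of the survival pattern depends on the PRDS assumption.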

Follow-ups

A pre-registered lever list with q-values + effect-size bounds rather than just signs (per GPT-5-pro's review) is the natural next step. Most of the analyses already report effect sizes; a small wrapper to extract and harmonize them would close the loop.