Benjamini-Hochberg FDR correction on the 10 within-pair Channel-A directional tests. 6 of 10 survive at q < 0.05 and 7 of 10 at q < 0.10. The only 'small positive, marginal' lever (population-frame mismatch, p = 0.12) does NOT survive correction (q_BH = 0.15), sharpening the 'no single lever carries +7 pp' conclusion.
Question
GPT-5-pro's 2026-06-14 pre-submission review flagged that the
Channel-A design-inventory in Table 3 of paper/paper.tex runs 10
within-pair directional tests on different design levers and invites a
multiple-testing critique. The standard remedy is to BH-correct the
displayed p-values and report q-values alongside.
Design
10 within-pair tests, one per lever, ordered as in Table 3:
| Rank by p | Lever | p (displayed) | Source AN |
|---|---|---|---|
| 1 | Scenario-rotation documentation | ~4e-8 | AN-051 |
| 2 | Sample-design-consistent fabrication | <10⁻⁴ | AN-013v2 |
| 3 | Phone-mode substitution | 0.0003 | AN-041 |
| 4 | Partisan stronghold over-sampling | <0.001 | bairro-string oversample |
| 5 | Interviewer-training omission | 0.002 | AN-042 |
| 6 | Coverage deferral at registration | 0.02 | AN-024 |
| 7 | Ponderação specificity | 0.04 | AN-057 |
| 8 | Population-frame mismatch (mixed) | 0.12 | AN-020 + frame |
| 9 | Methodology completeness gap | 0.22 | AN-022 |
| 10 | Audit-rate floor | 1.00 | AN-021 |
BH q-value: q_i = p_i * m / rank_i, with right-to-left running min for monotonicity. m = 10.
Results
| Rank | Lever | p | q (BH) | Survives q<0.05? | Survives q<0.10? |
|---|---|---|---|---|---|
| 1 | Scenario-rotation documentation | 4e-8 | 4e-7 | ✓ | ✓ |
| 2 | Sample-design-consistent fabrication | 1e-6 | 5e-6 | ✓ | ✓ |
| 3 | Phone-mode substitution | 0.0003 | 0.001 | ✓ | ✓ |
| 4 | Partisan stronghold over-sampling | 0.001 | 0.0025 | ✓ | ✓ |
| 5 | Interviewer-training omission | 0.002 | 0.004 | ✓ | ✓ |
| 6 | Coverage deferral at registration | 0.02 | 0.033 | ✓ | ✓ |
| 7 | Ponderação specificity | 0.04 | 0.057 | ✗ | ✓ |
| 8 | Population-frame mismatch | 0.12 | 0.15 | ✗ | ✗ |
| 9 | Methodology completeness gap | 0.22 | 0.244 | ✗ | ✗ |
| 10 | Audit-rate floor | 1.00 | 1.00 | ✗ | ✗ |
6 of 10 survive at q<0.05; 7 of 10 at q<0.10.
Interpretation
- The strong rejections (reversed-sign Channel-A predictions: rotation documentation, fabrication, phone-mode, partisan-stronghold, interviewer-training, coverage-deferral) all survive correction comfortably.
- The "small positive, marginal" rows in the table — the only same-direction-as-Channel-A entries — do NOT survive correction: population-frame mismatch (p=0.12, q=0.15) and methodology completeness (p=0.22, q=0.24) are both above any standard FDR threshold.
- Combined verdict: under multiple-testing discipline, the table contains no surviving lever whose direction matches Channel A's prediction with the headline-required magnitude. Sharpens the "no single lever carries +7 pp" claim.
Caveats
- The p-values are hardcoded off the published table rather than
recomputed from each underlying analysis script. This is a
trade-off (faster + traceable to the table the reader sees, but
requires manual sync if any upstream analysis changes). Each entry
in
an-066-fdr-bh.py:LEVERScites the source AN page so the trace is grep-checkable. - BH controls FDR under positive regression dependence (PRDS) or independence; under arbitrary dependence the more conservative BY procedure applies. The 10 tests are on different design levers in partly overlapping samples, so PRDS is plausible but not formally shown. Conservative readers should treat the q-values as a sharp lower bound on the family-wise discipline.
Follow-ups
- A pre-registered lever list with q-values + effect-size bounds rather than just signs (per GPT-5-pro's review) is the natural next step. Most of the analyses already report effect sizes; a small wrapper to extract and harmonize them would close the loop.