Double-machine-learning (race-confounder residualization via cross-fitted XGBoost regression, then XGBoost classifier on residuals) achieves OOS AUC 0.717 — sits between AN-105 Mode B linear demeaning (0.693) and AN-104 raw (0.742). Non-linear race effects worth +0.024 AUC vs linear demean; remaining +0.025 to AN-104 raw is race-correlated signal that DML cannot remove (race confounders are imperfect proxies). R² of residualization shows the race-leakiest features are log_sample (0.825), n_peers (0.775), error_concentration (0.693), n_cand_rows (0.690) — those were carrying mostly race signal. Spec ladder shows is_shell coefficient flips back to NEGATIVE at S0 (−0.015\*\*) under DML — substantively cleanest 'shells mimic media' finding. is_other_firm at S1 (+0.008\*) borderline positive but weakens further at S3 (+0.007 ns).
User-flagged follow-up #3 from AN-105: linear within-race demeaning may miss non-linear race effects. The cleaner approach is double-machine-learning (Robinson 1988 / Chernozhukov et al. 2018 style):
- Stage 1: for each feature f, train a flexible model predicting f from race-level confounders (XGBoost regression, 5-fold OOS). → residual_f = f − f̂(race_confounders)
- Stage 2: train bias classifier on residuals.
This handles non-linear interactions of race characteristics that within-race linear demeaning misses.
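The two stages above can be sketched as follows. This is a minimal illustration on synthetic data, not the AN-106 script: scikit-learn's gradient boosting stands in for XGBoost, and the function name `crossfit_residualize`, the hyperparameters, and all arrays are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_predict


def crossfit_residualize(features, confounders, n_splits=5, seed=0):
    """Stage 1: subtract each feature's out-of-fold prediction from the confounders."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    residuals = np.empty(features.shape, dtype=float)
    for j in range(features.shape[1]):
        # Stand-in learner (assumption): the note uses XGBoost regression here.
        model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=seed)
        fitted = cross_val_predict(model, confounders, features[:, j], cv=folds)
        residuals[:, j] = features[:, j] - fitted
    return residuals


# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 600
confounders = rng.normal(size=(n, 3))
race_driven = np.sin(2 * confounders[:, 0]) + confounders[:, 1] ** 2  # non-linear in confounders
within_poll = rng.normal(size=n)                                       # unrelated to confounders
features = np.column_stack([race_driven + 0.1 * rng.normal(size=n), within_poll])
labels = (within_poll + rng.normal(size=n) > 1.0).astype(int)

residuals = crossfit_residualize(features, confounders)

# Stage 2: classifier trained on the residuals and scored out of fold.
clf = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=0)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oos_prob = cross_val_predict(clf, residuals, labels, cv=strat, method="predict_proba")[:, 1]
oos_auc = roc_auc_score(labels, oos_prob)
```

In this toy setup the first column loses most of its variance after residualization, while the second column, which is unrelated to the confounders, is left nearly unchanged.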
Race-level confounders (35 columns)
- rc_n_polls (total polls in race)
- rc_n_distinct_firms
- rc_n_candidates
- rc_n_weeks (polling spread)
- rc_max_log_sample, rc_mean_log_sample (investment level)
- rc_first_poll_dte, rc_last_poll_dte (timing range)
- rc_polling_density (polls per week)
- UF one-hot (~26 columns)
R² of residualization per feature
How much of each feature's variance is explained by race confounders (out-of-fold R² from stage 1). Higher R² = feature was mostly race-predicted; the residual is the genuine within-poll signal.
| Feature | R² | Race-leak interpretation |
|---|---|---|
| log_sample | 0.825 | Almost entirely race-predicted |
| n_peers | 0.775 | Heavily race-predicted |
| error_concentration | 0.693 | Mostly race-predicted |
| n_cand_rows | 0.690 | Mostly race-predicted |
| herfindahl_shares | 0.432 | Moderately |
| top1_share | 0.375 | Moderately |
| days_to_election | 0.331 | Moderately |
| n_cands_at_zero | 0.300 | Moderately |
| peer_std | 0.288 | Moderately |
| share_skew | 0.281 | Moderately |
| signed_spike | 0.237 | Mostly within-poll |
| poll_std_dev | 0.235 | Mostly within-poll |
| mean_abs_dev | 0.228 | Mostly within-poll |
| max_signed_dev | 0.152 | Mostly within-poll |
| skew_dev | −0.037 | Pure within-poll (race model predicts no better than the mean) |
The race-leakiest features were exactly the suspected ones: log_sample (race investment), n_peers (race attention), error_concentration / n_cand_rows (race structure). These were roughly 69–83% race signal (R² 0.690 to 0.825). The within-poll consensus features (signed_spike, poll_std_dev, mean_abs_dev, max_signed_dev, skew_dev) are more than 75% within-poll signal (R² below 0.25); those are the cleanest slant fingerprints.
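A small sketch of how such an R² can be computed, and why it can fall below zero, assuming it is evaluated on out-of-fold predictions as the 5-fold OOS setup in stage 1 suggests. The helper name `oos_r2` and the numbers are hypothetical.

```python
def oos_r2(actual, predicted):
    """R² of held-out predictions: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out values, for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
informative = [1.1, 1.9, 3.2, 3.8]    # tracks the feature closely
uninformative = [3.0, 2.0, 3.5, 1.5]  # worse than predicting the mean

r2_good = oos_r2(actual, informative)    # close to 1: mostly race-predicted
r2_bad = oos_r2(actual, uninformative)   # below 0: no usable race signal

# Share of variance left in the residual, flooring negative R² at zero.
within_poll_share = 1.0 - max(r2_bad, 0.0)
```

Read this way, an R² of 0.237 (signed_spike) leaves about 76% of the variance in the residual, and a slightly negative R² such as skew_dev's simply means the confounders carry no out-of-sample information about that feature.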
Classifier results
5-fold CV OOS on race-residualized 28 features:
| Model | AUC | Log-loss | AP |
|---|---|---|---|
| XGBoost | 0.717 | 0.390 | 0.312 |
| LightGBM | 0.709 | 0.408 | 0.309 |
Detection AUC progression — complete table
| Method | Features | AUC | Reading |
|---|---|---|---|
| AN-100 / AN-101 | 7 basic | 0.69 | Baseline |
| AN-105 Mode A | 7 theoretical signals | 0.614 | Pure slant signatures, weak |
| AN-105 Mode B | 28 features, linear demean | 0.693 | Linear within-race signal |
| AN-106 DML | 28 features, non-linear residualize | 0.717 | + non-linear race effects |
| AN-104 raw | 28 features, unfiltered | 0.742 | (Partly race-proxy) |
| AN-103 full | + firm aggregates | 0.911 | (+ firm-identity leak) |
Three-step decomposition of the AN-101 → AN-104 gap (AUC 0.69 → 0.742):
- 0.69 → 0.693: noise (essentially same as basic)
- 0.693 → 0.717: non-linear race effects (+0.024)
- 0.717 → 0.742: remaining race-correlated signal DML can't remove (+0.025) — imperfect race confounders
- Genuine within-poll slant signal: ~0.69-0.72
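The increments in this decomposition are plain differences between adjacent rows of the AUC progression table. A quick arithmetic check, with the AUC values copied from that table and the dictionary labels chosen only for the example:

```python
# AUC values as reported in the progression table, ordered from least to most inflated.
auc_by_method = {
    "AN-100/101 basic": 0.69,
    "AN-105 Mode B linear demean": 0.693,
    "AN-106 DML": 0.717,
    "AN-104 raw": 0.742,
}

steps = list(auc_by_method.items())
increments = [
    (f"{name_a} -> {name_b}", round(auc_b - auc_a, 3))
    for (name_a, auc_a), (name_b, auc_b) in zip(steps, steps[1:])
]
total_gap = round(sum(delta for _, delta in increments), 3)

for label, delta in increments:
    print(f"{label}: {delta:+.3f}")
print(f"total: {total_gap:+.3f}")
```

The three steps sum to the full 0.052 gap between the basic and raw models.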
Spec ladder with DML predictions
Reference = media. Cluster SE at race.
| Bucket | S0 No FE | S1 Race FE | S2 Firm FE | S3 Firm + Race FE |
|---|---|---|---|---|
| is_candidate | +0.071*** | −0.004 | +0.033*** | −0.004 |
| is_pollster_self | −0.000 | +0.000 | +0.002 | −0.004 |
| is_other_firm | +0.011** | +0.008* | +0.003 | +0.007 |
| is_shell | −0.015** | −0.006 | +0.008 | +0.011 |
| log_sample | −0.118*** | +0.022*** | −0.123*** | +0.009 |
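For readers unfamiliar with the ladder, the sketch below contrasts an S0-style regression (no fixed effects) with an S1-style one (race fixed effects via within-race demeaning), both with standard errors clustered at race. It assumes the outcome is a per-poll predicted score, uses synthetic data with a made-up true effect, and hand-rolls the estimator in NumPy; the helper names `demean_by` and `ols_cluster` are hypothetical and this is not the AN-106 code.

```python
import numpy as np


def demean_by(values, groups):
    """Within transform: subtract each group's mean (absorbs group fixed effects)."""
    out = np.asarray(values, dtype=float).copy()
    for g in np.unique(groups):
        idx = groups == g
        out[idx] -= out[idx].mean(axis=0)
    return out


def ols_cluster(y, X, groups):
    """OLS coefficients with cluster-robust (sandwich) standard errors.

    No small-sample correction is applied, so packaged estimators may
    report slightly different standard errors.
    """
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        idx = groups == g
        score = X[idx].T @ resid[idx]  # summed score within one cluster
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))


# Synthetic data (hypothetical): shells are more common in high-score races.
TRUE_EFFECT = -0.02
rng = np.random.default_rng(0)
n_races, n = 100, 3000
race = rng.integers(0, n_races, size=n)
race_effect = rng.normal(0.0, 0.05, size=n_races)
p_shell = 1.0 / (1.0 + np.exp(-(-1.0 + 20.0 * race_effect[race])))
is_shell = (rng.random(n) < p_shell).astype(float)
pred = TRUE_EFFECT * is_shell + race_effect[race] + rng.normal(0.0, 0.03, size=n)

# S0: intercept plus bucket dummy, no fixed effects.
X0 = np.column_stack([np.ones(n), is_shell])
beta_s0, se_s0 = ols_cluster(pred, X0, race)

# S1: race fixed effects absorbed by demeaning outcome and regressor within race.
X1 = demean_by(is_shell, race)[:, None]
beta_s1, se_s1 = ols_cluster(demean_by(pred, race), X1, race)
```

Because shell prevalence is correlated with the race effect in this toy data, the S0 coefficient is pulled away from the true value while the S1 coefficient recovers it, which is the kind of movement across columns the ladder is designed to expose.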
Cross-method comparison of key coefficients
| Method | is_other_firm S1 | is_shell S0 |
|---|---|---|
| AN-103 (full, GBM, firm-leak) | +0.012 | −0.005 |
| AN-104 raw (28 features) | +0.006 | −0.055*** |
| AN-105 Mode A (7 theoretical) | −0.000 | +0.016*** |
| AN-105 Mode B (linear demean) | +0.007 | +0.020*** |
| AN-106 DML (non-linear) | +0.008* | −0.015** |
The is_shell sign flips across methods are diagnostic:
- AN-105 Mode A (small theoretical feature set): +0.016 → shells look slightly MORE sponsored on pure slant signatures
- AN-105 Mode B (linear demean): +0.020 → similar small positive after removing linear race signal
- AN-106 DML (non-linear): −0.015 → shells look LESS sponsored after PROPER (flexible) race control
- AN-104 raw (no race control): −0.055 → exaggerated negative
The DML version is the most methodologically defensible "do shells look sponsored?" answer: −0.015 at S0, null under FE. Under proper non-linear race control, shells DO mimic media polls — but the magnitude (−0.015) is much smaller than AN-104's −0.055 suggested. The shell "professional evasion" interpretation is real but modest, not dramatic.
The is_other_firm signal also weakens further: from +0.013** at S3 in AN-104 raw to +0.007 (ns) at S3 in AN-106 DML. The genuine within-firm-within-race shell-style signal in the unaudited tail is small.
Substantive synthesis
The DML analysis converges on three honest claims:
Genuine blind detection from public TSE data: AUC ~0.69–0.72. The 0.69 (linear) and 0.72 (non-linear) bracket the honest "what within-poll signal achieves" range. The AN-104 0.742 was inflated by race-proxy correlation: ~0.05 AUC relative to linear demeaning (0.693), of which ~0.025 remains relative to DML (0.717) and reflects race-correlated signal that imperfect confounders can't remove.
Shells modestly mimic media. Under proper race control, the is_shell coefficient is −0.015 (p<0.05) at S0: meaningful but modest. The "professional evasion" framing from AN-103/AN-104 overstated the magnitude; shells achieve partial structural blend-in, not absolute invisibility.
The within-poll consensus-deviation features are the cleanest slant signatures. signed_spike, poll_std_dev, mean_abs_dev, max_signed_dev all have R² < 0.25 in residualization — meaning >75% within-poll variance. These are the features the §Policy story should highlight as public, blindly-detectable signs of slant.
Honest §Policy framing — final version
| Detection regime | Method | AUC | Cleanest interpretation |
|---|---|---|---|
| Theoretical-feature-only | Hand-curated slant signals | 0.61 | Pure interpretable slant — small |
| Public-data blind | Within-poll + race-controlled | 0.69–0.72 | What public registry can do |
| Firm-augmented | + firm-level aggregates | 0.91 | Requires firm pooling |
| CNPJ-side audit | Capital + CNAE + web presence | qualitative | High-precision shell ID |
The defensible "blind detection from public data" range is 0.69–0.72. That's the public-policy-relevant number. AUC > 0.72 requires firm-level pooling; AUC < 0.65 is what theory-only achieves. Both AN-094 CNPJ-side audit and AN-100/101/106 within-poll signals are needed to cover both axes of evasion.
Caveats
- DML is sensitive to race-confounder definition. Adding more flexible confounders (e.g., 3-state interactions) might push AUC down further. The 0.717 is a reasonable upper bound for "race-controlled" detection.
- 5-fold cross-fitting is standard DML protocol. Cross-validation within each stage is required to avoid leakage.
- Race confounders include UF one-hot (~26 dummy columns). Together with continuous race aggregates, this provides reasonable race-characteristic coverage. Adding firm-level confounders (e.g., firm × UF interaction terms) would push toward the AN-103 firm-augmented regime.
- Fully rigorous DML cross-fits both stages. This script cross-fits only the stage 1 residualization and then uses plain CV at stage 2, which is less rigorous but computationally cheaper, so the 0.717 estimate should be read as approximate.
Follow-ups
- Train DML on 2020 + 2024 jointly with cycle as additional confounder. Tests cross-cycle generalization.
- Add firm-level confounders to stage 1 for a "race AND firm controlled" detector — closer to the firm-leak adjusted AN-103 regime.
- SHAP attribution on the DML-residualized classifier to identify the within-poll features that actually drive the 0.72 ceiling.
- Cross-fitted nested CV for fully-orthogonal DML inference. More compute-intensive but methodologically sharper.
Artifacts
- Script: source/analysis/an-106-double-ml-detection.py
- Model comparison: build/table/an-106-double-ml-detection.csv
- Headline JSON: build/table/an-106-double-ml-detection.json
- Distributable: build/analysis/poll_ml_predictions_dml.parquet (race-residualized OOS DML predictions per poll)