AN-106: Double-ML race residualization

Double machine learning (race-confounder residualization via cross-fitted XGBoost regression, then an XGBoost classifier on the residuals) achieves OOS AUC 0.717, between AN-105 Mode B linear demeaning (0.693) and AN-104 raw (0.742). Non-linear race effects are worth +0.024 AUC over the linear demean; the remaining +0.025 up to AN-104 raw is race-correlated signal that DML cannot remove (the race confounders are imperfect proxies). Residualization R² shows the race-leakiest features are log_sample (0.825), n_peers (0.775), error_concentration (0.693), and n_cand_rows (0.690); these were carrying mostly race signal. The spec ladder shows the is_shell coefficient flipping back to NEGATIVE at S0 (−0.015**) under DML, the substantively cleanest 'shells mimic media' finding. is_other_firm is borderline positive at S1 (+0.008*) and weakens further at S3 (+0.007 ns).

Hypothesis: H13: Shell-contratante polls show larger residual β
Confidence: green
Type: ml-double-ml

Script: source/analysis/an-106-double-ml-detection.py
Target: build/table/an-106-double-ml-detection.csv
Status: interpreted · 2026-06-17
Created: 2026-06-17

User-flagged follow-up #3 from AN-105: linear within-race demeaning may miss non-linear race effects. The cleaner approach is double-machine-learning (Robinson 1988 / Chernozhukov et al. 2018 style):

Stage 1: for each feature f, train a flexible model predicting f from race-level confounders (XGBoost regression, 5-fold OOS). → residual_f = f − f̂(race_confounders)
Stage 2: train bias classifier on residuals.

This handles non-linear interactions of race characteristics that within-race linear demeaning misses.
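
A minimal sketch of the stage-1 cross-fitting step, using only the standard library. The learner here is a one-variable least-squares line standing in for the XGBoost regressor, and the function names (kfold_indices, fit_line, cross_fit_residuals) are hypothetical, not taken from the script. It only illustrates that each row's residual comes from a model that never saw that row.

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Stand-in learner (simple least squares); the note uses XGBoost here."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)

def cross_fit_residuals(confounder, feature, k=5, fit=fit_line):
    """Return feature minus its out-of-fold prediction from the confounder."""
    n = len(feature)
    resid = [0.0] * n
    for fold in kfold_indices(n, k):
        held_out = set(fold)
        train = [i for i in range(n) if i not in held_out]
        predict = fit([confounder[i] for i in train], [feature[i] for i in train])
        for i in fold:
            # Each row is predicted by a model that never saw it.
            resid[i] = feature[i] - predict(confounder[i])
    return resid

# Toy data: the feature is fully determined by one race-level confounder.
rc = [float(i) for i in range(20)]
f = [2.0 * x + 1.0 for x in rc]
residuals = cross_fit_residuals(rc, f)
```

On this toy input the residuals are numerically zero, the analogue of a feature with R² close to 1. In the real pipeline the fit argument would wrap a multi-column regressor over all race confounders.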

Race-level confounders (35 columns)

rc_n_polls (total polls in race)
rc_n_distinct_firms
rc_n_candidates
rc_n_weeks (polling spread)
rc_max_log_sample, rc_mean_log_sample (investment level)
rc_first_poll_dte, rc_last_poll_dte (timing range)
rc_polling_density (polls per week)
UF one-hot (~26 columns)
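
A rough sketch of how a few of these race-level aggregates could be built from poll rows. The input keys (race_id, firm_id, week) are hypothetical, and reading rc_n_weeks as "distinct weeks with at least one poll" is an assumption; the script may define the spread differently.

```python
from collections import defaultdict

def race_confounders(polls):
    """Aggregate poll rows into one confounder record per race.

    Each poll is a dict with hypothetical keys: race_id, firm_id, week.
    """
    by_race = defaultdict(list)
    for p in polls:
        by_race[p["race_id"]].append(p)
    out = {}
    for race_id, rows in by_race.items():
        weeks = {r["week"] for r in rows}
        out[race_id] = {
            "rc_n_polls": len(rows),
            "rc_n_distinct_firms": len({r["firm_id"] for r in rows}),
            # Assumption: rc_n_weeks counts distinct weeks with a poll.
            "rc_n_weeks": len(weeks),
            "rc_polling_density": len(rows) / len(weeks),
        }
    return out

polls = [
    {"race_id": "A", "firm_id": "f1", "week": 1},
    {"race_id": "A", "firm_id": "f2", "week": 1},
    {"race_id": "A", "firm_id": "f1", "week": 3},
    {"race_id": "B", "firm_id": "f3", "week": 2},
]
rc = race_confounders(polls)
```

Every poll in a race receives the same rc_ values, which is what makes them race-level confounders rather than poll-level features.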

R² of residualization per feature

How much of each feature variance is explained by race confounders. Higher R² = feature was mostly race-predicted; the residual is the genuine within-poll signal.

Feature	R²	Race-leak interpretation
log_sample	0.825	Almost entirely race-predicted
n_peers	0.775	Heavily race-predicted
error_concentration	0.693	Mostly race-predicted
n_cand_rows	0.690	Mostly race-predicted
herfindahl_shares	0.432	Moderately
top1_share	0.375	Moderately
days_to_election	0.331	Moderately
n_cands_at_zero	0.300	Moderately
peer_std	0.288	Moderately
share_skew	0.281	Moderately
signed_spike	0.237	Mostly within-poll
poll_std_dev	0.235	Mostly within-poll
mean_abs_dev	0.228	Mostly within-poll
max_signed_dev	0.152	Mostly within-poll
skew_dev	−0.037	Pure within-poll

The race-leakiest features were exactly the suspected ones: log_sample (race investment), n_peers (race attention), error_concentration / n_cand_rows (race structure). Race confounders explain roughly 69–83% of their variance. The within-poll consensus features (signed_spike, poll_std_dev, mean_abs_dev, max_signed_dev, skew_dev) all have R² < 0.25, so more than 75% of their variance is within-poll; those are the cleanest slant fingerprints.
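
For reference, a small sketch of the out-of-sample R² behind the table, assuming the usual definition 1 − SS_res / SS_tot applied to out-of-fold predictions (the script's exact convention is not shown here). It also shows why a value like skew_dev's −0.037 is possible: held-out predictions that do worse than the constant mean give a negative R².

```python
def oos_r2(actual, predicted):
    """R^2 = 1 - SS_res / SS_tot, using out-of-fold predictions."""
    mean = sum(actual) / len(actual)
    ss_tot = sum((a - mean) ** 2 for a in actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return 1.0 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = oos_r2(y, y)                        # predictions match exactly
mean_only = oos_r2(y, [2.5] * 4)              # no better than the mean
worse = oos_r2(y, [4.0, 3.0, 2.0, 1.0])       # worse than the mean
```

Here perfect is 1.0, mean_only is 0.0, and worse is negative.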

Classifier results

5-fold CV OOS on race-residualized 28 features:

Model	AUC	Log-loss	AP
XGBoost	0.717	0.390	0.312
LightGBM	0.709	0.408	0.309
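
As a reminder of what the AUC column measures, a minimal rank-based sketch: the share of (positive, negative) pairs in which the positive poll gets the higher score, with ties counted as half. Treating label 1 as the flagged class is an assumption for illustration.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            # Ties contribute half a win.
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(pos) * len(neg))

example = auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In the toy call, three of the four positive-negative pairs are ordered correctly, so example is 0.75.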

Detection AUC progression — complete table

Method	Features	AUC	Reading
AN-100 / AN-101	7 basic	0.69	Baseline
AN-105 Mode A	7 theoretical signals	0.614	Pure slant signatures, weak
AN-105 Mode B	28 features, linear demean	0.693	Linear within-race signal
AN-106 DML	28 features, non-linear residualize	0.717	+ non-linear race effects
AN-104 raw	28 features, unfiltered	0.742	(Partly race-proxy)
AN-103 full	+ firm aggregates	0.911	(+ firm-identity leak)

Three-way decomposition of the AN-101 → AN-104 gap (0.69 → 0.742):

0.69 → 0.693: noise (essentially same as basic)
0.693 → 0.717: non-linear race effects (+0.024)
0.717 → 0.742: remaining race-correlated signal DML can't remove (+0.025) — imperfect race confounders
Genuine within-poll slant signal: ~0.69-0.72
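
The step sizes in the decomposition can be checked with a few lines of arithmetic. The dictionary keys are just labels for this sketch; the AUC values are the ones from the progression table.

```python
auc_by_method = {
    "AN-100/101 basic": 0.69,
    "AN-105 Mode B linear demean": 0.693,
    "AN-106 DML": 0.717,
    "AN-104 raw": 0.742,
}
names = list(auc_by_method)
# Difference between each method and the next one up the ladder.
steps = {
    f"{a} -> {b}": round(auc_by_method[b] - auc_by_method[a], 3)
    for a, b in zip(names, names[1:])
}
total = round(auc_by_method["AN-104 raw"] - auc_by_method["AN-100/101 basic"], 3)
```

The three steps come out as 0.003, 0.024 and 0.025, summing to the total gap of 0.052.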

Spec ladder with DML predictions

Reference = media. Cluster SE at race.

Bucket	S0 No FE	S1 Race FE	S2 Firm FE	S3 Firm + Race FE
is_candidate	+0.071***	−0.004	+0.033***	−0.004
is_pollster_self	−0.000	+0.000	+0.002	−0.004
is_other_firm	+0.011**	+0.008*	+0.003	+0.007
is_shell	−0.015**	−0.006	+0.008	+0.011
log_sample	−0.118***	+0.022***	−0.123***	+0.009
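
To illustrate why coefficients can move so much between S0 and S1, here is a toy sketch of a race fixed-effects (within) estimator: demean outcome and regressor by race, then take the OLS slope. This is only one standard way to absorb race FE; the note does not state how the ladder was estimated, and clustered standard errors are not shown. All names and data below are hypothetical.

```python
from collections import defaultdict

def demean_by_group(values, groups):
    """Subtract each group's mean from its members (absorbs group fixed effects)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

def within_slope(y, x, groups):
    """OLS slope of y on x after removing group means from both."""
    yd, xd = demean_by_group(y, groups), demean_by_group(x, groups)
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

# Toy data: race "A" has a high baseline, race "B" a low one; the true
# within-race effect of the indicator is 0.5 in both.
race = ["A", "A", "A", "B", "B", "B"]
is_shell = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
score = [10.0, 10.5, 10.5, 1.0, 1.0, 1.5]
beta_within = within_slope(score, is_shell, race)
```

The within estimate recovers the 0.5 effect because the race baselines are differenced out before the slope is computed.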

Cross-method comparison of key coefficients

Method	is_other_firm S1	is_shell S0
AN-103 (full, GBM, firm-leak)	+0.012	−0.005
AN-104 raw (28 features)	+0.006	−0.055***
AN-105 Mode A (7 theoretical)	−0.000	+0.016***
AN-105 Mode B (linear demean)	+0.007	+0.020***
AN-106 DML (non-linear)	+0.008*	−0.015**

The is_shell sign flips across methods are diagnostic:

AN-105 Mode A (small theoretical feature set): +0.016 → shells look slightly MORE sponsored on pure slant signatures
AN-105 Mode B (linear demean): +0.020 → similar small positive after removing linear race signal
AN-106 DML (non-linear): −0.015 → shells look LESS sponsored after PROPER (flexible) race control
AN-104 raw (no race control): −0.055 → exaggerated negative

The DML version is the most methodologically defensible "do shells look sponsored?" answer: −0.015 at S0, null under FE. Under proper non-linear race control, shells DO mimic media polls — but the magnitude (−0.015) is much smaller than AN-104's −0.055 suggested. The shell "professional evasion" interpretation is real but modest, not dramatic.

The is_other_firm signal also weakens further: from +0.013** at S3 in AN-104 raw to +0.007 (ns) at S3 in AN-106 DML. The genuine within-firm-within-race shell-style signal in the unaudited tail is small.

Substantive synthesis

The DML analysis converges on three honest claims:

Genuine blind detection from public TSE data: AUC ~0.69–0.72. The 0.69 (linear) and 0.72 (non-linear) bracket the honest "what within-poll signal achieves" range. The AN-104 0.742 was inflated by ~0.05 AUC of race-proxy correlation that even DML can't fully remove.
Shells modestly mimic media. Under proper race control, is_shell coefficient is −0.015 (p<0.05) at S0: meaningful but modest. The "professional evasion" framing from AN-103/AN-104 was over-quantified; shells achieve partial structural blend-in, not absolute invisibility.
The within-poll consensus-deviation features are the cleanest slant signatures. signed_spike, poll_std_dev, mean_abs_dev, max_signed_dev all have R² < 0.25 in residualization — meaning >75% within-poll variance. These are the features the §Policy story should highlight as public, blindly-detectable signs of slant.

Honest §Policy framing — final version

Detection regime	Method	AUC	Cleanest interpretation
Theoretical-feature-only	Hand-curated slant signals	0.61	Pure interpretable slant — small
Public-data blind	Within-poll + race-controlled	0.69–0.72	What public registry can do
Firm-augmented	+ firm-level aggregates	0.91	Requires firm pooling
CNPJ-side audit	Capital + CNAE + web presence	qualitative	High-precision shell ID

The defensible "blind detection from public data" range is 0.69–0.72. That's the public-policy-relevant number. AUC > 0.72 requires firm-level pooling; AUC < 0.65 is what theory-only achieves. Both the AN-094 CNPJ-side audit and the AN-100/101/106 within-poll signals are needed to cover both axes of evasion.

Caveats

DML is sensitive to race-confounder definition. Adding more flexible confounders (e.g., 3-state interactions) might push AUC down further. The 0.717 is a reasonable upper bound for "race-controlled" detection.
5-fold cross-fitting is standard DML protocol. Cross-validation within each stage is required to avoid leakage.
Race confounders include UF one-hot (~26 dummy columns). Together with continuous race aggregates, this provides reasonable race-characteristic coverage. Adding firm-level confounders (e.g., firm × UF interaction terms) would push toward the AN-103 firm-augmented regime.
Two-stage residualization with cross-fitting at each stage is the proper DML protocol; this script uses single cross-fitted residualization then plain CV at stage 2 (less rigorous but computationally cheaper). The 0.717 estimate is approximately right.

Follow-ups

Train DML on 2020 + 2024 jointly with cycle as additional confounder. Tests cross-cycle generalization.
Add firm-level confounders to stage 1 for a "race AND firm controlled" detector — closer to the firm-leak adjusted AN-103 regime.
SHAP attribution on the DML-residualized classifier to identify the within-poll features that actually drive the 0.72 ceiling.
Cross-fitted nested CV for fully-orthogonal DML inference. More compute-intensive but methodologically sharper.

Artifacts

Script: source/analysis/an-106-double-ml-detection.py
Model comparison: build/table/an-106-double-ml-detection.csv
Headline JSON: build/table/an-106-double-ml-detection.json
Distributable: build/analysis/poll_ml_predictions_dml.parquet (race-residualized OOS DML predictions per poll)