AN-103: Modern ML for bias detection

Modern ML pipeline (XGBoost/LightGBM, 37 features, 5-fold CV) achieves OOS AUC 0.91 for detecting candidate-sponsored polls — up from 0.69 (AN-101 sponsor-blind features only) and 0.69 (AN-100 within-poll only). Big jump comes from firm-level aggregates (firm_share_candidate_work is the #1 feature at 0.19 importance; firm_has_beta_estimate at 0.16). Gradient boosting (LightGBM 0.911, XGBoost 0.911) modestly beats logistic regression (0.901) and random forest (0.905). Within-poll consensus-deviation features still rank in the top 10 (error_concentration, poll_std_dev, mean_abs_dev) — the within-poll fingerprint is real. Other_firm bucket (residual, no shells) has predicted-bias coefficient +0.018 (p<0.01) under firm FE — sponsored-like pattern persists even with the powerful classifier. Shell bucket has −0.072 (p<0.01) at S0, null under FE — at AUC 0.91 shells STILL don't light up, sharpening the 'professional evasion' interpretation.

Hypothesis: H13: Shell-contratante polls show larger residual β
Confidence: green
Type: ml-pipeline

Script: source/analysis/an-103-ml-bias-detection.py
Target: build/table/an-103-ml-bias-detection.csv
Status: interpreted · 2026-06-17
Created: 2026-06-17

User asked (2026-06-17): comprehensive feature pool + serious ML to find the best possible bias predictor.

Pipeline

37 features, 5 families (error structure vs final results, consensus deviation vs peer polls, within-poll distribution shape, poll metadata, race + firm aggregates)
4 models (logistic L2, random forest, XGBoost, LightGBM)
5-fold cross-validated training; OOS predictions for all polls
2 targets: Y1 = poll_has_candidate_sponsor (primary, matches AN-101); Y2 = cand-sponsored OR shell-touched (broad bias)
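The out-of-sample scheme in the list above can be sketched in plain Python. This is a minimal illustration, not the code in an-103-ml-bias-detection.py: the helper names (stratified_folds, oos_predict, fit_base_rate, predict_base_rate) are hypothetical, stratifying folds by label is an assumption, and a trivial base-rate model stands in for the four real models.

```python
import random

def stratified_folds(y, k=5, seed=0):
    """Assign each row to one of k folds, dealing each class round-robin."""
    rng = random.Random(seed)
    folds = [0] * len(y)
    for label in set(y):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[i] = pos % k
    return folds

def oos_predict(X, y, fit, predict, k=5, seed=0):
    """Every row is scored by a model that never saw that row in training."""
    folds = stratified_folds(y, k, seed)
    oos = [None] * len(y)
    for f in range(k):
        train = [i for i in range(len(y)) if folds[i] != f]
        test = [i for i in range(len(y)) if folds[i] == f]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in test])
        for i, p in zip(test, preds):
            oos[i] = p
    return oos

# Hypothetical stand-in model: always predicts the training base rate.
def fit_base_rate(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_base_rate(model, X_test):
    return [model] * len(X_test)

y = [1] * 10 + [0] * 40        # toy labels, 20% positives
X = [[0.0] for _ in y]         # toy single-feature matrix
oos = oos_predict(X, y, fit_base_rate, predict_base_rate)
```

Any fit/predict pair with the same shape can be swapped in for the stand-in model; the point of the sketch is only that each poll's score comes from the four folds that exclude it.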

Model results

Y1: poll_has_candidate_sponsor (n=6,710, positives=1,026)

Model	AUC	Log-loss	Avg Precision
Logistic L2	0.901	0.269	0.663
Random Forest	0.905	0.264	0.650
XGBoost	0.911	0.255	0.660
LightGBM	0.911	0.268	0.658

Y2: cand-sponsored OR shell-touched (n=6,710, positives=1,255)

Model	AUC	Log-loss	Avg Precision
Logistic L2	0.844	0.350	0.612
Random Forest	0.890	0.310	0.690
XGBoost	0.899	0.292	0.704
LightGBM	0.899	0.302	0.699
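For reading the two tables above, it can help to have the metrics written out. The sketch below is a hedged, stdlib-only illustration: roc_auc uses the rank-sum (Mann-Whitney) form of AUC, and base_rate_log_loss gives the log-loss of a constant predictor at the class base rate, which is a natural yardstick for the log-loss column. The function names are hypothetical; the counts are the ones reported in this note.

```python
import math

def roc_auc(y_true, scores):
    """AUC via the rank-sum formula, with average ranks for tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # 1-based average rank of the tie block
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def base_rate_log_loss(n_pos, n):
    """Log-loss of always predicting the base rate n_pos / n."""
    p = n_pos / n
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

baseline_y1 = base_rate_log_loss(1026, 6710)   # Y1 counts from this note
baseline_y2 = base_rate_log_loss(1255, 6710)   # Y2 counts from this note
```

Under these counts the constant-predictor log-loss is about 0.43 for Y1 and about 0.48 for Y2, so the reported values (0.255 to 0.269 for Y1, 0.292 to 0.350 for Y2) sit well below the no-information baseline.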

Progression of detection AUC across analyses

Analysis	Features	Method	AUC
AN-100	Within-poll only (7 feats)	Logit	0.69
AN-101	Within-poll + sponsor-blind (7 feats)	Logit OOS	0.69
AN-101	Same	GBM OOS	0.69
AN-103	37 features, all families	LightGBM	0.91

A 0.22-point AUC jump (0.69 → 0.91) from richer features + better models. Most of the gain is from feature engineering, not model class.
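The split between feature gain and model-class gain can be checked with simple arithmetic on the AUCs reported in this note. This is a rough sketch that assumes the AN-101 and AN-103 logistic models are comparable (same sample, same target); the variable names are illustrative.

```python
# AUCs as reported in this note.
auc_logit_7_feats = 0.69     # AN-101, logistic, 7 features
auc_logit_37_feats = 0.901   # AN-103, logistic L2, 37 features
auc_lgbm_37_feats = 0.911    # AN-103, LightGBM, 37 features

# Hold the model class fixed (logistic) and change the feature set.
feature_gain = auc_logit_37_feats - auc_logit_7_feats
# Hold the feature set fixed (37 features) and change the model class.
model_gain = auc_lgbm_37_feats - auc_logit_37_feats

total_gain = feature_gain + model_gain
feature_share = feature_gain / total_gain

print(f"feature gain {feature_gain:.3f}, model gain {model_gain:.3f}, "
      f"feature share {feature_share:.1%}")
```

On this path roughly 0.21 of the 0.22 gain comes from the feature set. The other path agrees: with 7 features, GBM and logistic are both at 0.69 in AN-101, so model class adds nothing until the richer features are present.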

Top 10 features (XGBoost importance)

Rank	Feature	Importance	Family
1	firm_share_candidate_work	0.189	E (firm)
2	firm_has_beta_estimate	0.158	E (firm)
3	log_sample	0.030	D (metadata)
4	race_n_polls	0.029	E (race)
5	firm_n_ufs	0.025	E (firm)
6	error_concentration (L-shape)	0.024	A (error)
7	poll_std_dev	0.023	B (consensus)
8	mean_abs_dev	0.022	B (consensus)
9	n_cand_rows	0.022	C (within-poll)
10	n_peers	0.022	B (consensus)

The firm-level features dominate. Together top-2 = 35% of importance. These features encode "what does this firm's typical sponsorship look like?" — a form of label leakage at the firm level. A genuinely firm-blind classifier would have lower AUC (closer to AN-101's 0.69).
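The 35% figure, and the concentration by family, can be reproduced from the top-10 table. The sketch below just sums the reported importances; it assumes, as the "35% of importance" wording implies, that importances are normalized to sum to 1 across all 37 features.

```python
from collections import defaultdict

# (feature, importance, family) rows copied from the top-10 table.
top10 = [
    ("firm_share_candidate_work", 0.189, "E"),
    ("firm_has_beta_estimate",    0.158, "E"),
    ("log_sample",                0.030, "D"),
    ("race_n_polls",              0.029, "E"),
    ("firm_n_ufs",                0.025, "E"),
    ("error_concentration",       0.024, "A"),
    ("poll_std_dev",              0.023, "B"),
    ("mean_abs_dev",              0.022, "B"),
    ("n_cand_rows",               0.022, "C"),
    ("n_peers",                   0.022, "B"),
]

top2_share = top10[0][1] + top10[1][1]

by_family = defaultdict(float)
for _name, importance, family in top10:
    by_family[family] += importance

top10_total = sum(importance for _name, importance, _family in top10)
# Assumption: importances sum to 1, so the other 27 features share the rest.
remaining_27 = 1.0 - top10_total
```

Within the top 10, family E (race + firm aggregates) carries about 0.40 of total importance against about 0.07 for the consensus family B, which is the quantitative version of "firm-level features dominate".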

The within-poll signals (L-shape, consensus features) rank in positions 6–10 — meaningful but secondary. A fully blind classifier achievable from public TSE data alone is ~0.69; the 0.91 figure includes the firm-aggregate signal that requires firm identity to be known.
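To make the firm-level leak concrete: if firm_share_candidate_work is computed from the sponsor labels of all polls, a held-out poll's own label (and its fold-mates' labels) feeds into its own feature. One common mitigation is to compute the share out of fold. The sketch below is an assumption-laden illustration of that idea, not the method used in this note or in AN-104 to AN-106; the function name, the prior fallback, and the toy data are hypothetical.

```python
from collections import defaultdict

def firm_share_out_of_fold(firm_ids, labels, folds, prior):
    """For each poll, the share of candidate-sponsored polls among the same
    firm's polls in OTHER folds; falls back to `prior` if there are none."""
    firm_total = defaultdict(lambda: [0, 0])   # firm -> [label sum, count]
    firm_fold = defaultdict(lambda: [0, 0])    # (firm, fold) -> [label sum, count]
    for firm, y, f in zip(firm_ids, labels, folds):
        firm_total[firm][0] += y
        firm_total[firm][1] += 1
        firm_fold[(firm, f)][0] += y
        firm_fold[(firm, f)][1] += 1

    shares = []
    for firm, f in zip(firm_ids, folds):
        label_sum = firm_total[firm][0] - firm_fold[(firm, f)][0]
        count = firm_total[firm][1] - firm_fold[(firm, f)][1]
        shares.append(label_sum / count if count > 0 else prior)
    return shares

# Toy data: firm A has four polls over two folds, firm B has one poll.
firm_ids = ["A", "A", "A", "A", "B"]
labels   = [1,   1,   0,   1,   1]
folds    = [0,   0,   1,   1,   0]
shares = firm_share_out_of_fold(firm_ids, labels, folds, prior=0.5)
```

A firm with a single poll, or with all its polls in one fold, gets only the prior. That is the situation an auditor faces with a newly seen firm, and it is why a firm-blind figure is lower than the firm-augmented 0.91.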

Spec ladder with LightGBM predicted bias as outcome

Reference = media. SEs clustered at race, in parentheses. *** p<0.01, ** p<0.05.

Bucket	S0 No FE	S1 Race FE	S2 Firm FE	S3 Firm + Race FE
is_candidate	+0.399*** (0.012)	+0.203*** (0.016)	+0.075*** (0.009)	+0.027** (0.012)
is_pollster_self	−0.024*** (0.005)	−0.020*** (0.007)	+0.003 (0.007)	−0.002 (0.007)
is_other_firm	+0.066*** (0.008)	+0.040*** (0.009)	+0.018*** (0.006)	+0.011 (0.007)
is_shell	−0.072*** (0.009)	+0.015 (0.012)	−0.003 (0.008)	+0.014 (0.010)
log_sample	−0.088*** (0.006)	+0.001 (0.011)	−0.100*** (0.009)	−0.037*** (0.010)
n	6,710	5,481	6,645	5,384

Three substantive findings:

is_candidate survives even S3 (+0.027, p<0.05): with a powerful classifier, within firm × race, candidate-sponsored polls still receive higher predicted bias than media polls of the same firm × race. This is a much sharper finding than AN-095's null |error| under S3.
is_other_firm is positive across S0/S1/S2 (all p<0.01). The other_firm tier (excluding shells) consistently looks sponsored-like in the rich feature space. The sponsored-like pattern in the unaudited tail is robust.
is_shell is significantly NEGATIVE at S0 (−0.072, p<0.01). At the ceiling of detection power we can achieve from 2024 data, shells STILL don't light up. The "professional evasion" reading sharpens: shells aren't beating a weak detector; they're beating a strong one.
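The drop in the is_candidate coefficient from S0 (+0.399) to S2 (+0.075) is what a fixed-effects (within) estimator does when much of the raw gap lies between firms rather than within them. The toy example below is a hedged illustration of that mechanism with a single regressor and made-up numbers; it is not the spec-ladder code and ignores clustering and the other buckets.

```python
from collections import defaultdict

def ols_slope(x, y):
    """Slope from regressing y on one regressor x, with an intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def demean_by_group(values, groups):
    """Subtract each group's mean: the 'within' transformation behind FE."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

# Made-up data: firm A does mostly candidate work and all its polls score
# high; firm B does mostly media work and all its polls score low.
firm         = ["A"] * 4 + ["B"] * 4
is_candidate = [1, 1, 1, 0, 0, 0, 0, 1]
pred_bias    = [0.8] * 4 + [0.2] * 4

raw_slope = ols_slope(is_candidate, pred_bias)                 # no FE
fe_slope = ols_slope(demean_by_group(is_candidate, firm),
                     demean_by_group(pred_bias, firm))          # firm FE
```

In this toy case the raw slope is 0.30 and the firm-FE slope is zero, because the score varies only across firms. The real S2/S3 coefficients stay positive, which is why the surviving +0.027 under firm + race FE is informative.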

Substantive interpretation

The shell signal is structurally distinct from typical candidate-sponsored slant:

AN-094 audit: shells have CNAE / capital social / web-presence signatures (the right identification tool is registry-side, not poll-pattern-side).
AN-096 (other session) bipartite: shells route to one pollster captively; 8 of 14 have a top-pollster share ≥ 80%.
AN-102 + AN-103: shells run media-typical sample sizes, produce media-typical within-poll patterns, and route through firms with media-typical customer mix.

The shell architecture optimizes for blend-in across multiple margins simultaneously. A statistical pattern-based detector (AN-100, AN-101, AN-103) cannot reach them; CNPJ-side audit (AN-094-style) can.

Policy implication update: the AN-100 "publicly-computable suspicion score" recommendation stands but should be paired with CNPJ-side classification for shell detection. The two detection mechanisms are complementary:

Statistical detector: catches the unaudited other_firm tail + candidate-sponsored polls + (less-sophisticated shells, if any)
CNPJ-side audit: catches the AN-094-style professional shells

Combined, they cover both axes of evasion.

Update (2026-06-17): AN-103 → AN-106 honest framing

The 0.911 AUC reported here uses firm-level aggregates (firm_share_candidate_work, firm_has_beta_estimate) that encode label information at the firm level. The defensible "blind detection from public data" figure for the §Policy story is AN-106's 0.72, not 0.91. The AN-103 0.91 is the firm-augmented ceiling — relevant for the firm-level reputation mechanism (see paper appendix), but not what an auditor of a single newly-released poll can achieve. See AN-104 → AN-105 → AN-106 for the cleanup chain that establishes the honest 0.72 ceiling.

Caveats

AUC 0.91 inflated by firm-aggregate leak. Firm-level features encode label information (which firms take how much candidate work). True "blind" detection AUC is ~0.69 (AN-100/101). The 0.22-point gap is what firm identity buys.
6,710 polls in the analysis sample. Polls without ≥1 peer poll for consensus computation are dropped, losing ~3k polls from the matched-share sample. The detection AUC may not generalize to low-attention races.
Tree models converged to similar AUC. XGBoost and LightGBM are statistically indistinguishable. Random forest is 0.6 pp below. Logistic regression is 1.0 pp below — non-linear interactions matter modestly but not dramatically.
5-fold CV with default hyperparameters. Hyperparameter tuning (Bayesian optimization or random search) would realistically add 0.5–1.5 pp of AUC (0.005–0.015) but is not paradigm-changing.
No SHAP analysis in this pass. Tree-based importance is reported. SHAP would give more reliable feature attribution but is a follow-up.
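One way to back a claim like "statistically indistinguishable" is a paired bootstrap on the AUC difference between two models' out-of-sample scores. The sketch below is a minimal stdlib illustration under stated assumptions: it resamples polls independently (it does not respect race-level clustering), the function names are hypothetical, and it is not part of the AN-103 script.

```python
import random

def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=500, seed=0):
    """95% percentile interval for AUC(a) - AUC(b), resampling rows jointly."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y_b = [y_true[i] for i in idx]
        if not 0 < sum(y_b) < n:
            continue  # a resample needs both classes for AUC to be defined
        a = auc(y_b, [scores_a[i] for i in idx])
        b = auc(y_b, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Toy scores for two hypothetical models on the same 40 rows.
toy_rng = random.Random(1)
y = [1 if i % 4 == 0 else 0 for i in range(40)]
scores_a = [0.6 * label + 0.4 * toy_rng.random() for label in y]
scores_b = [0.5 * label + 0.5 * toy_rng.random() for label in y]
lo, hi = bootstrap_auc_diff(y, scores_a, scores_b)
```

If the interval covers zero the two models are not distinguishable at that level. With the pairwise AUC above the cost grows with positives times negatives per resample, so a rank-based AUC would be preferable at n=6,710.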

Unsupervised pretraining — explicit decision

Skipped. Reasoning:

n is too small for self-supervised methods to outperform supervised gradient boosting on engineered features. Modern tabular pretraining methods (SAINT, FT-Transformer, TabNet) need n > 50k–100k to beat XGBoost reliably; at n=6,710 they underperform.
The supervised signal is strong. 1,026 positive labels for Y1 is plenty for supervised gradient boosting on 37 features.
Firm-level aggregation is the de facto "embedding" available here. The firm_share_candidate_work and firm_has_beta_estimate features encode firm identity in a way that captures most of the value pre-training would add. Multi-cycle pooling (2020 + 2022 + 2024) ≈ 30-50k polls is where pretraining could plausibly start to help — noted as follow-up #1.

Follow-ups

Multi-cycle pooling. Build the 2020 + 2022 + 2024 panel, re-train with cross-cycle features (firm's track record across cycles). Could push AUC to 0.93–0.95 and enable self-supervised pretraining. ~1 month of work.
Hyperparameter tuning. Bayesian optimization over XGBoost / LightGBM hyperparams with nested CV. ~5-day job. Expected gain: ~0.5–1.5 pp of AUC (0.005–0.015).
Firm-blind ablation. Re-run dropping all firm-level features. Honest "fully-blind" AUC for the §Policy story — probably lands at AN-101's 0.69 range.
SHAP feature attribution with the LightGBM model. Replaces tree-importance with theoretically-justified attribution.
Out-of-sample 2020-train → 2024-test. Tests generalization across cycles.
Multi-class target: predict the 5-bucket sponsor class directly. Direct shell-vs-candidate-vs-media discrimination in one classifier.
Distributable score: the OOS predictions are saved at build/analysis/poll_ml_predictions.parquet — per-poll suspicion score for downstream use.

Artifacts

Script: source/analysis/an-103-ml-bias-detection.py
Model comparison: build/table/an-103-ml-bias-detection.csv
Feature importance: build/table/an-103-feature-importance.csv
Headline JSON: build/table/an-103-ml-bias-detection.json
Distributable: build/analysis/poll_features_ml.parquet (37 features per poll), build/analysis/poll_ml_predictions.parquet (OOS predicted bias from LightGBM)

AN-100 sponsor-blind detection — original blind detector
AN-101 predicted bias as outcome — predecessor with 7 features
AN-102 headline with shell bucket
AN-094 (other session) shell audit
AN-098 noise floor