Modern ML pipeline (XGBoost/LightGBM, 37 features, 5-fold CV) achieves OOS AUC 0.91 for detecting candidate-sponsored polls — up from 0.69 (AN-101 sponsor-blind features only) and 0.69 (AN-100 within-poll only). Big jump comes from firm-level aggregates (firm_share_candidate_work is the #1 feature at 0.19 importance; firm_has_beta_estimate at 0.16). Gradient boosting (LightGBM 0.911, XGBoost 0.911) modestly beats logistic regression (0.901) and random forest (0.905). Within-poll consensus-deviation features still rank in the top 10 (error_concentration, poll_std_dev, mean_abs_dev) — the within-poll fingerprint is real. Other_firm bucket (residual, no shells) has predicted-bias coefficient +0.018 (p<0.01) under firm FE — sponsored-like pattern persists even with the powerful classifier. Shell bucket has −0.072 (p<0.01) at S0, null under FE — at AUC 0.91 shells STILL don't light up, sharpening the 'professional evasion' interpretation.
User asked (2026-06-17): comprehensive feature pool + serious ML to find the best possible bias predictor.
Pipeline
- 37 features, 5 families (error structure vs final results, consensus deviation vs peer polls, within-poll distribution shape, poll metadata, race + firm aggregates)
- 4 models (logistic L2, random forest, XGBoost, LightGBM)
- 5-fold cross-validated training; OOS predictions for all polls
- 2 targets:
  - poll_has_candidate_sponsor (primary, matches AN-101)
  - cand-sponsored OR shell-touched (broad bias)
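A minimal sketch of the out-of-sample scheme described above, assuming scikit-learn is available. The data are synthetic (`make_classification`), not the poll features, and only the logistic and random-forest legs are shown. XGBoost and LightGBM could be added through their scikit-learn wrappers (`xgboost.XGBClassifier`, `lightgbm.LGBMClassifier`), omitted here to keep dependencies small. The helper name `oos_scores` is hypothetical, and this is not the AN-103 script.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, log_loss, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the poll feature matrix (NOT the real data):
# 37 features, roughly 15% positives to mimic the Y1 class balance.
X, y = make_classification(
    n_samples=2000, n_features=37, n_informative=10,
    weights=[0.85, 0.15], random_state=0,
)

models = {
    # LogisticRegression uses an L2 penalty by default.
    "Logistic L2": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5 stratified folds: every poll is scored by a model that never saw it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def oos_scores(model, X, y):
    """Out-of-fold predicted probability of the positive class for every row."""
    return cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]


results = {}
for name, model in models.items():
    p = oos_scores(model, X, y)
    results[name] = {
        "auc": roc_auc_score(y, p),
        "log_loss": log_loss(y, p),
        "avg_precision": average_precision_score(y, p),
    }

for name, metrics in results.items():
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

Because each prediction comes from a fold that excluded that row, the resulting scores can be reused as an outcome variable downstream without in-sample optimism from the classifier itself.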
Model results
Y1: poll_has_candidate_sponsor (n=6,710, positives=1,026)
| Model | AUC | Log-loss | Avg Precision |
|---|---|---|---|
| Logistic L2 | 0.901 | 0.269 | 0.663 |
| Random Forest | 0.905 | 0.264 | 0.650 |
| XGBoost | 0.911 | 0.255 | 0.660 |
| LightGBM | 0.911 | 0.268 | 0.658 |
Y2: cand-sponsored OR shell-touched (n=6,710, positives=1,255)
| Model | AUC | Log-loss | Avg Precision |
|---|---|---|---|
| Logistic L2 | 0.844 | 0.350 | 0.612 |
| Random Forest | 0.890 | 0.310 | 0.690 |
| XGBoost | 0.899 | 0.292 | 0.704 |
| LightGBM | 0.899 | 0.302 | 0.699 |
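For reading the tables above, it helps to know what a no-skill scorer would get at these class balances. The sketch below computes AUC from ranks (the Mann-Whitney form) and the no-skill baselines implied by the reported n and positive counts. The function names are illustrative, not taken from the AN-103 script.

```python
from math import log


def roc_auc(labels, scores):
    """AUC = P(random positive outranks random negative); ties count 1/2."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores and give each the average 1-based rank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def no_skill_baselines(n, positives):
    """Metrics for a scorer that always predicts the base rate."""
    p = positives / n
    return {
        "prevalence": p,
        "auc": 0.5,
        "avg_precision": p,  # a constant scorer's precision equals the base rate
        "log_loss": -(p * log(p) + (1 - p) * log(1 - p)),
    }


y1 = no_skill_baselines(6710, 1026)  # Y1: poll_has_candidate_sponsor
y2 = no_skill_baselines(6710, 1255)  # Y2: cand-sponsored OR shell-touched
print(round(y1["prevalence"], 3), round(y2["prevalence"], 3))
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With a Y1 base rate of about 0.153, an average precision near 0.66 is roughly four times the no-skill level, which is a more informative comparison than the AUC alone at this class imbalance.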
Progression of detection AUC across analyses
| Analysis | Features | Method | AUC |
|---|---|---|---|
| AN-100 | Within-poll only (7 feats) | Logit | 0.69 |
| AN-101 | Within-poll + sponsor-blind (7 feats) | Logit OOS | 0.69 |
| AN-101 | Same | GBM OOS | 0.69 |
| AN-103 | 37 features, all families | LightGBM | 0.91 |
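A 0.22 AUC jump (0.69 to 0.91) from richer features + better models. Most of the gain is from feature engineering, not model class: logistic regression already reaches 0.901 on the 37 features, while GBM on the 7 AN-101 features stays at 0.69.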
Top 10 features (XGBoost importance)
| Rank | Feature | Importance | Family |
|---|---|---|---|
| 1 | firm_share_candidate_work | 0.189 | E (firm) |
| 2 | firm_has_beta_estimate | 0.158 | E (firm) |
| 3 | log_sample | 0.030 | D (metadata) |
| 4 | race_n_polls | 0.029 | E (race) |
| 5 | firm_n_ufs | 0.025 | E (firm) |
| 6 | error_concentration (L-shape) | 0.024 | A (error) |
| 7 | poll_std_dev | 0.023 | B (consensus) |
| 8 | mean_abs_dev | 0.022 | B (consensus) |
| 9 | n_cand_rows | 0.022 | C (within-poll) |
| 10 | n_peers | 0.022 | B (consensus) |
The firm-level features dominate. Together top-2 = 35% of importance. These features encode "what does this firm's typical sponsorship look like?" — a form of label leakage at the firm level. A genuinely firm-blind classifier would have lower AUC (closer to AN-101's 0.69).
The within-poll signals (L-shape, consensus features) rank in positions 6–10 — meaningful but secondary. A fully blind classifier achievable from public TSE data alone is ~0.69; the 0.91 figure includes the firm-aggregate signal that requires firm identity to be known.
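To make the firm-level leak concrete, here is a sketch of how a feature like firm_share_candidate_work can carry the label. It assumes, hypothetically, that the share is computed over all of a firm's polls, including the one being scored. The note does not say how AN-103 builds the feature, so the field names and both functions are illustrative only. The out-of-fold variant removes the poll's own fold from the aggregate, but it still requires knowing the firm and its history, so it does not produce a firm-blind score.

```python
from collections import defaultdict


def firm_share_naive(polls):
    """Share of candidate-sponsored polls per firm, using ALL of the firm's polls."""
    total, cand = defaultdict(int), defaultdict(int)
    for p in polls:
        total[p["firm"]] += 1
        cand[p["firm"]] += p["is_candidate"]
    return [cand[p["firm"]] / total[p["firm"]] for p in polls]


def firm_share_out_of_fold(polls, folds):
    """Same share, but each poll only sees the firm's polls from OTHER folds."""
    total, cand = defaultdict(int), defaultdict(int)
    fold_total, fold_cand = defaultdict(int), defaultdict(int)
    for p, f in zip(polls, folds):
        total[p["firm"]] += 1
        cand[p["firm"]] += p["is_candidate"]
        fold_total[(p["firm"], f)] += 1
        fold_cand[(p["firm"], f)] += p["is_candidate"]
    shares = []
    for p, f in zip(polls, folds):
        n = total[p["firm"]] - fold_total[(p["firm"], f)]
        c = cand[p["firm"]] - fold_cand[(p["firm"], f)]
        # None when the firm has no polls outside this fold (no history to use).
        shares.append(c / n if n > 0 else None)
    return shares


# Toy data: firm A has four polls across two folds, firm B has a single poll.
polls = [
    {"firm": "A", "is_candidate": 1},
    {"firm": "A", "is_candidate": 1},
    {"firm": "A", "is_candidate": 0},
    {"firm": "A", "is_candidate": 1},
    {"firm": "B", "is_candidate": 0},
]
folds = [0, 0, 1, 1, 0]

naive = firm_share_naive(polls)
oof = firm_share_out_of_fold(polls, folds)
print(naive)
print(oof)
```

In the naive version each poll's own label sits inside its feature value. The out-of-fold version avoids that specific channel, and a firm with no other history (firm B here) gets no value at all, which is the situation an auditor faces with a new or unknown firm.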
Spec ladder with LightGBM predicted bias as outcome
Reference category = media. Standard errors (in parentheses) are clustered at the race level. Significance: *** p<0.01, ** p<0.05.
| Bucket | S0 No FE | S1 Race FE | S2 Firm FE | S3 Firm + Race FE |
|---|---|---|---|---|
| is_candidate | +0.399*** (0.012) | +0.203*** (0.016) | +0.075*** (0.009) | +0.027** (0.012) |
| is_pollster_self | −0.024*** (0.005) | −0.020*** (0.007) | +0.003 (0.007) | −0.002 (0.007) |
| is_other_firm | +0.066*** (0.008) | +0.040*** (0.009) | +0.018*** (0.006) | +0.011 (0.007) |
| is_shell | −0.072*** (0.009) | +0.015 (0.012) | −0.003 (0.008) | +0.014 (0.010) |
| log_sample | −0.088*** (0.006) | +0.001 (0.011) | −0.100*** (0.009) | −0.037*** (0.010) |
| n | 6,710 | 5,481 | 6,645 | 5,384 |
Three substantive findings:
is_candidate survives even S3 (+0.027, p<0.05): with a powerful classifier, within firm × race, candidate-sponsored polls still receive higher predicted bias than media polls of the same firm × race. This is a much sharper finding than AN-095's null |error| under S3.
is_other_firm is positive and significant across S0/S1/S2 (all p<0.01), though it loses significance at S3 (+0.011, SE 0.007). The other_firm tier (excluding shells) consistently looks sponsored-like in the rich feature space. The sponsored-like pattern in the unaudited tail is robust.
is_shell is significantly NEGATIVE at S0 (−0.072, p<0.01). At the ceiling of detection power we can achieve from 2024 data, shells STILL don't light up. The "professional evasion" reading sharpens: shells aren't beating a weak detector; they're beating a strong one.
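The move from the no-FE column to the firm-FE column in the spec ladder can be illustrated with a deliberately simplified sketch: one regressor, firm fixed effects absorbed by demeaning, and a cluster-robust (sandwich) standard error without finite-sample correction. The actual specification has several bucket dummies plus log_sample and was presumably estimated with a regression package. The function names and toy data below are assumptions for illustration only.

```python
from collections import defaultdict
from math import sqrt


def demean_by(values, groups):
    """Subtract each group's mean (the within transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]


def fe_slope_clustered(y, x, fe_group, cluster):
    """Slope of y on x with fe_group fixed effects; SE clustered on `cluster`."""
    y_t = demean_by(y, fe_group)
    x_t = demean_by(x, fe_group)
    sxx = sum(v * v for v in x_t)
    beta = sum(a * b for a, b in zip(x_t, y_t)) / sxx
    resid = [b - beta * a for a, b in zip(x_t, y_t)]
    # Sum the score x*e within each cluster, then form the sandwich variance.
    score = defaultdict(float)
    for a, e, c in zip(x_t, resid, cluster):
        score[c] += a * e
    var = sum(s * s for s in score.values()) / (sxx * sxx)
    return beta, sqrt(var)


# Toy data: the firm with the higher baseline (F2) also does more candidate work.
firm = ["F1"] * 4 + ["F2"] * 4
race = ["R1", "R1", "R2", "R2", "R1", "R2", "R3", "R3"]
x = [0, 0, 0, 1, 1, 1, 1, 0]          # is_candidate
firm_effect = {"F1": 0.1, "F2": 0.5}
y = [0.3 * xi + firm_effect[f] for xi, f in zip(x, firm)]  # true slope 0.3

beta_pooled, _ = fe_slope_clustered(y, x, ["all"] * len(y), race)  # intercept only
beta_fe, se_fe = fe_slope_clustered(y, x, firm, race)              # firm FE
print(beta_pooled, beta_fe, se_fe)
```

In the toy data the pooled slope is 0.5 while the firm-FE slope recovers 0.3 (up to floating-point error), the same direction of shrinkage seen for is_candidate between S0 and S2: part of the raw gap is between-firm composition rather than within-firm differences.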
Substantive interpretation
The shell signal is structurally distinct from typical candidate-sponsored slant:
- AN-094 audit: shells have CNAE / capital social / web-presence signatures (the right identification tool is registry-side, not poll-pattern-side).
- AN-096 (other session) bipartite: shells route to one pollster captively; 8 of 14 shells have a top-pollster share ≥ 80%.
- AN-102 + AN-103: shells run media-typical sample sizes, produce media-typical within-poll patterns, and route through firms with media-typical customer mix.
The shell architecture optimizes for blend-in across multiple margins simultaneously. A statistical pattern-based detector (AN-100, AN-101, AN-103) cannot reach them; CNPJ-side audit (AN-094-style) can.
Policy implication update: the AN-100 "publicly-computable suspicion score" recommendation stands but should be paired with CNPJ-side classification for shell detection. The two detection mechanisms are complementary:
- Statistical detector: catches the unaudited other_firm tail + candidate-sponsored polls + (less-sophisticated shells, if any)
- CNPJ-side audit: catches the AN-094-style professional shells
Combined, they cover both axes of evasion.
Update (2026-06-17): AN-103 → AN-106 honest framing
The 0.911 AUC reported here uses firm-level aggregates (firm_share_candidate_work, firm_has_beta_estimate) that encode label information at the firm level. The defensible "blind detection from public data" figure for the §Policy story is AN-106's 0.72, not 0.91. The AN-103 0.91 is the firm-augmented ceiling: relevant for the firm-level reputation mechanism (see paper appendix), but not what an auditor of a single newly-released poll can achieve. See AN-104 → AN-105 → AN-106 for the cleanup chain that establishes the honest 0.72 ceiling.
Caveats
- AUC 0.91 is inflated by the firm-aggregate leak. Firm-level features encode label information (which firms take how much candidate work). "Blind" detection AUC is ~0.69 in AN-100/101 (0.72 after the AN-106 cleanup noted in the update above). The gap of roughly 0.19 to 0.22 is what firm identity buys.
- 6,710 polls in the analysis sample. Polls without ≥1 peer poll for consensus computation are dropped, losing ~3k polls from the matched-share sample. The detection AUC may not generalize to low-attention races.
- Tree models converged to similar AUC. XGBoost and LightGBM are statistically indistinguishable. Random forest is 0.6 pp below. Logistic regression is 1.0 pp below — non-linear interactions matter modestly but not dramatically.
- 5-fold CV with default hyperparameters. Hyperparameter tuning (Bayesian optimization or random search) would realistically add 0.5–1.5 AUC points but is not paradigm-changing.
- No SHAP analysis in this pass. Tree-based importance is reported. SHAP would give more reliable feature attribution but is a follow-up.
Unsupervised pretraining — explicit decision
Skipped. Reasoning:
- n is too small for self-supervised methods to outperform supervised gradient boosting on engineered features. Modern tabular pretraining methods (SAINT, FT-Transformer, TabNet) need n > 50k–100k to beat XGBoost reliably; at n=6,710 they underperform.
- The supervised signal is strong. 1,026 positive labels for Y1 is plenty for supervised gradient boosting on 37 features.
- Firm-level aggregation is the de facto "embedding" available here. The firm_share_candidate_work and firm_has_beta_estimate features encode firm identity in a way that captures most of the value pre-training would add. Multi-cycle pooling (2020 + 2022 + 2024) ≈ 30-50k polls is where pretraining could plausibly start to help — noted as follow-up #1.
Follow-ups
- Multi-cycle pooling. Build the 2020 + 2022 + 2024 panel, re-train with cross-cycle features (firm's track record across cycles). Could push AUC to 0.93–0.95 and enable self-supervised pretraining. ~1 month of work.
- Hyperparameter tuning. Bayesian optimization over XGBoost / LightGBM hyperparams with nested CV. ~5-day job. Expected ~+0.5–1.5 AUC points.
- Firm-blind ablation. Re-run dropping all firm-level features. Honest "fully-blind" AUC for the §Policy story — probably lands at AN-101's 0.69 range.
- SHAP feature attribution with the LightGBM model. Replaces tree-importance with theoretically-justified attribution.
- Out-of-sample 2020-train → 2024-test. Tests generalization across cycles.
- Multi-class target: predict the 5-bucket sponsor class directly. Direct shell-vs-candidate-vs-media discrimination in one classifier.
- Distributable score: the OOS predictions are saved at build/analysis/poll_ml_predictions.parquet, a per-poll suspicion score for downstream use.
Artifacts
- Script: source/analysis/an-103-ml-bias-detection.py
- Model comparison: build/table/an-103-ml-bias-detection.csv
- Feature importance: build/table/an-103-feature-importance.csv
- Headline JSON: build/table/an-103-ml-bias-detection.json
- Distributable: build/analysis/poll_features_ml.parquet (37 features per poll), build/analysis/poll_ml_predictions.parquet (OOS predicted bias from LightGBM)
Related
- AN-100 sponsor-blind detection — original blind detector
- AN-101 predicted bias as outcome — predecessor with 7 features
- AN-102 headline with shell bucket
- AN-094 (other session) shell audit
- AN-098 noise floor