A modern ML pipeline (XGBoost/LightGBM, 37 features, 5-fold CV) reaches an out-of-sample AUC of 0.91 for detecting candidate-sponsored polls, up from 0.69 in both AN-101 (sponsor-blind features only) and AN-100 (within-poll features only). The jump comes mainly from firm-level aggregates: firm_share_candidate_work is the #1 feature at 0.19 importance, followed by firm_has_beta_estimate at 0.16. Gradient boosting (LightGBM 0.911, XGBoost 0.911) modestly beats random forest (0.905) and logistic regression (0.901). Within-poll error and consensus-deviation features still rank in the top 10 (error_concentration, poll_std_dev, mean_abs_dev), so the within-poll fingerprint is real. The other_firm bucket (residual, no shells) has a predicted-bias coefficient of +0.018 (p<0.01) under firm FE: the sponsored-like pattern persists even with the powerful classifier. The shell bucket has −0.072 (p<0.01) at S0 and is null under FE: at AUC 0.91 shells STILL don't light up, sharpening the 'professional evasion' interpretation.

Confidence: green
Type: ml-pipeline
Script: source/analysis/an-103-ml-bias-detection.py
Target: build/table/an-103-ml-bias-detection.csv
Status: interpreted · 2026-06-17
Created: 2026-06-17

User asked (2026-06-17): comprehensive feature pool + serious ML to find the best possible bias predictor.

Pipeline

Model results

Y1: poll_has_candidate_sponsor (n=6,710, positives=1,026)

| Model | AUC | Log-loss | Avg Precision |
|---|---|---|---|
| Logistic L2 | 0.901 | 0.269 | 0.663 |
| Random Forest | 0.905 | 0.264 | 0.650 |
| XGBoost | 0.911 | 0.255 | 0.660 |
| LightGBM | 0.911 | 0.268 | 0.658 |

Y2: cand-sponsored OR shell-touched (n=6,710, positives=1,255)

| Model | AUC | Log-loss | Avg Precision |
|---|---|---|---|
| Logistic L2 | 0.844 | 0.350 | 0.612 |
| Random Forest | 0.890 | 0.310 | 0.690 |
| XGBoost | 0.899 | 0.292 | 0.704 |
| LightGBM | 0.899 | 0.302 | 0.699 |
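To make the three reported metrics concrete, here is a minimal pure-Python sketch of how AUC and log-loss can be computed from out-of-fold scores, plus the base rates implied by the Y1 and Y2 sample sizes. The toy labels and scores are invented for illustration; only the counts (6,710 polls, 1,026 and 1,255 positives) come from the tables above. The actual pipeline presumably relies on a library implementation, which this does not try to reproduce.

```python
import math

def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def base_rate_log_loss(prevalence):
    """Log-loss of a constant predictor that always outputs the positive share."""
    return -(prevalence * math.log(prevalence)
             + (1 - prevalence) * math.log(1 - prevalence))

# Toy example (invented values, not project data).
y_toy = [0, 0, 1, 1]
p_toy = [0.1, 0.4, 0.35, 0.8]
auc_toy = roc_auc(y_toy, p_toy)   # 3 of 4 pairs ranked correctly
ll_toy = log_loss(y_toy, p_toy)

# Base rates from the reported sample sizes.
prev_y1 = 1026 / 6710
prev_y2 = 1255 / 6710
baseline_ll_y1 = base_rate_log_loss(prev_y1)
baseline_ll_y2 = base_rate_log_loss(prev_y2)

print(f"toy AUC={auc_toy:.2f}, toy log-loss={ll_toy:.3f}")
print(f"Y1 prevalence={prev_y1:.3f}, constant-predictor log-loss={baseline_ll_y1:.3f}")
print(f"Y2 prevalence={prev_y2:.3f}, constant-predictor log-loss={baseline_ll_y2:.3f}")
```

The base rates give a reference point for reading the tables: a random ranking has an expected average precision of roughly the prevalence (about 0.15 for Y1 and 0.19 for Y2), and a constant base-rate predictor has a log-loss of roughly 0.43 for Y1 and 0.48 for Y2.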

Progression of detection AUC across analyses

| Analysis | Features | Method | AUC |
|---|---|---|---|
| AN-100 | Within-poll only (7 feats) | Logit | 0.69 |
| AN-101 | Within-poll + sponsor-blind (7 feats) | Logit OOS | 0.69 |
| AN-101 | Same | GBM OOS | 0.69 |
| AN-103 | 37 features, all families | LightGBM | 0.91 |

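AUC rises by 0.22 (0.69 → 0.91, i.e. 22 points on a 0–100 scale) with richer features and better models. Most of the gain is from feature engineering, not model class: on the AN-101 feature set, switching from logit to GBM left the OOS AUC at 0.69.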
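The feature-versus-model split can be checked with simple arithmetic on the reported AUCs. This is a rough decomposition, assuming the AN-101 logit and GBM figures (both 0.69, rounded to two decimals) and the AN-103 Y1 figures for logistic regression (0.901) and LightGBM (0.911) are comparable; the dictionary keys are labels chosen for this sketch.

```python
# Reported AUCs, keyed by (feature set, model class). Keys are illustrative labels.
auc = {
    ("an101_features", "logit"): 0.69,   # AN-101, Logit OOS
    ("an101_features", "gbm"): 0.69,     # AN-101, GBM OOS
    ("an103_features", "logit"): 0.901,  # AN-103, Logistic L2 (Y1)
    ("an103_features", "gbm"): 0.911,    # AN-103, LightGBM (Y1)
}

# Gain from richer features, holding the model class at logistic regression.
feature_gain = auc[("an103_features", "logit")] - auc[("an101_features", "logit")]

# Gain from switching model class, holding the feature set fixed.
model_gain_small = auc[("an101_features", "gbm")] - auc[("an101_features", "logit")]
model_gain_rich = auc[("an103_features", "gbm")] - auc[("an103_features", "logit")]

total_gain = auc[("an103_features", "gbm")] - auc[("an101_features", "logit")]

print(f"feature gain (logit held fixed): {feature_gain:.3f}")
print(f"model gain on AN-101 features:   {model_gain_small:.3f}")
print(f"model gain on AN-103 features:   {model_gain_rich:.3f}")
print(f"total gain:                      {total_gain:.3f}")
```

On these numbers, about 0.21 of the 0.22 total comes from the richer feature set and about 0.01 from the switch to gradient boosting, subject to the rounding in the 0.69 baseline.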

Top 10 features (XGBoost importance)

| Rank | Feature | Importance | Family |
|---|---|---|---|
| 1 | firm_share_candidate_work | 0.189 | E (firm) |
| 2 | firm_has_beta_estimate | 0.158 | E (firm) |
| 3 | log_sample | 0.030 | D (metadata) |
| 4 | race_n_polls | 0.029 | E (race) |
| 5 | firm_n_ufs | 0.025 | E (firm) |
| 6 | error_concentration (L-shape) | 0.024 | A (error) |
| 7 | poll_std_dev | 0.023 | B (consensus) |
| 8 | mean_abs_dev | 0.022 | B (consensus) |
| 9 | n_cand_rows | 0.022 | C (within-poll) |
| 10 | n_peers | 0.022 | B (consensus) |
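The importances in the table above can be tallied by family with a few lines of arithmetic. This sketch assumes the importances are normalized to sum to 1 across all 37 features, which the source does not state explicitly; under that assumption the remainder is the share held by the 27 features outside the top 10.

```python
from collections import defaultdict

# (feature, importance, family letter) copied from the top-10 table.
top10 = [
    ("firm_share_candidate_work", 0.189, "E"),
    ("firm_has_beta_estimate", 0.158, "E"),
    ("log_sample", 0.030, "D"),
    ("race_n_polls", 0.029, "E"),
    ("firm_n_ufs", 0.025, "E"),
    ("error_concentration", 0.024, "A"),
    ("poll_std_dev", 0.023, "B"),
    ("mean_abs_dev", 0.022, "B"),
    ("n_cand_rows", 0.022, "C"),
    ("n_peers", 0.022, "B"),
]

top2_share = sum(imp for _, imp, _ in top10[:2])
top10_share = sum(imp for _, imp, _ in top10)

# Sum importance within each family letter.
family_share = defaultdict(float)
for _, imp, fam in top10:
    family_share[fam] += imp

# Assumption: importances sum to 1 over all 37 features.
remaining_share = 1.0 - top10_share

print(f"top-2 share:  {top2_share:.3f}")
print(f"top-10 share: {top10_share:.3f}")
for fam in sorted(family_share):
    print(f"family {fam}: {family_share[fam]:.3f}")
print(f"remaining 27 features (if normalized): {remaining_share:.3f}")
```

Within the top 10, family E accounts for about 0.40 of importance, while the error, consensus, and within-poll families (A, B, C) together account for about 0.11.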

The firm-level features dominate: together the top two account for 35% of total importance. These features encode "what does this firm's typical sponsorship look like?", which is a form of label leakage at the firm level. A genuinely firm-blind classifier would have lower AUC (closer to AN-101's 0.69).

The within-poll signals (L-shape, consensus features) rank in positions 6–10: meaningful but secondary. A fully blind classifier built from public TSE data alone reaches an AUC of ~0.69; the 0.91 figure includes the firm-aggregate signal, which requires firm identity to be known.
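The firm-level leakage mechanism can be illustrated with a toy example. The source does not say how firm_share_candidate_work is constructed, so everything here is hypothetical: the firm names, the labels, the fallback prior, and the two helper functions. The sketch contrasts a share computed over all of a firm's polls (which includes the poll's own label) with a leave-one-out version that excludes it.

```python
from collections import defaultdict

# Hypothetical (firm, is_candidate_sponsored) pairs; not project data.
polls = [
    ("firm_a", 1), ("firm_a", 1), ("firm_a", 0),
    ("firm_b", 0), ("firm_b", 0),
    ("firm_c", 1),  # firm with a single poll
]

def firm_counts(rows):
    """Total polls and candidate-sponsored polls per firm."""
    total, positive = defaultdict(int), defaultdict(int)
    for firm, label in rows:
        total[firm] += 1
        positive[firm] += label
    return total, positive

def firm_share_full(rows):
    """Share of candidate work per firm, including each poll's own label."""
    total, positive = firm_counts(rows)
    return [positive[firm] / total[firm] for firm, _ in rows]

def firm_share_leave_one_out(rows, prior):
    """Same share, excluding the current poll; falls back to `prior` for one-poll firms."""
    total, positive = firm_counts(rows)
    shares = []
    for firm, label in rows:
        others = total[firm] - 1
        shares.append((positive[firm] - label) / others if others > 0 else prior)
    return shares

full = firm_share_full(polls)
loo = firm_share_leave_one_out(polls, prior=0.5)  # prior value is an arbitrary placeholder

for (firm, label), f, l in zip(polls, full, loo):
    print(f"{firm} label={label} full_share={f:.2f} leave_one_out={l:.2f}")
```

For the single-poll firm, the full-sample share equals the poll's own label, which is the extreme case of the leakage described above. A leave-one-out or fold-wise construction removes the poll's own label from the feature but still requires knowing the firm's identity and its other polls' sponsors.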

Spec ladder with LightGBM predicted bias as outcome

Reference category: media polls. Standard errors (in parentheses) are clustered at the race level. *** p<0.01, ** p<0.05.

| Bucket | S0 No FE | S1 Race FE | S2 Firm FE | S3 Firm + Race FE |
|---|---|---|---|---|
| is_candidate | +0.399*** (0.012) | +0.203*** (0.016) | +0.075*** (0.009) | +0.027** (0.012) |
| is_pollster_self | −0.024*** (0.005) | −0.020*** (0.007) | +0.003 (0.007) | −0.002 (0.007) |
| is_other_firm | +0.066*** (0.008) | +0.040*** (0.009) | +0.018*** (0.006) | +0.011 (0.007) |
| is_shell | −0.072*** (0.009) | +0.015 (0.012) | −0.003 (0.008) | +0.014 (0.010) |
| log_sample | −0.088*** (0.006) | +0.001 (0.011) | −0.100*** (0.009) | −0.037*** (0.010) |
| n | 6,710 | 5,481 | 6,645 | 5,384 |
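The significance stars in the table above can be sanity-checked from the rounded coefficients and standard errors. This is an approximation under stated assumptions: a two-sided normal test (the actual cluster-robust inference may use a t distribution with fewer degrees of freedom), the star convention *** p<0.01, ** p<0.05, * p<0.10, and inputs rounded to three decimals. The function names are chosen for this sketch.

```python
import math

def two_sided_p(coef, se):
    """Two-sided p-value for coef/se under a standard normal approximation."""
    z = abs(coef / se)
    return math.erfc(z / math.sqrt(2))

def stars(p):
    """Assumed star convention: *** p<0.01, ** p<0.05, * p<0.10."""
    if p < 0.01:
        return "***"
    if p < 0.05:
        return "**"
    if p < 0.10:
        return "*"
    return ""

# S3 (firm + race FE) column, copied from the table: (coefficient, standard error).
s3 = {
    "is_candidate": (0.027, 0.012),
    "is_pollster_self": (-0.002, 0.007),
    "is_other_firm": (0.011, 0.007),
    "is_shell": (0.014, 0.010),
    "log_sample": (-0.037, 0.010),
}

s3_stars = {}
for name, (coef, se) in s3.items():
    p = two_sided_p(coef, se)
    s3_stars[name] = stars(p)
    print(f"{name:18s} z={coef / se:+.2f} p={p:.3f} {s3_stars[name]}")
```

Under this approximation the S3 column reproduces the reported pattern: is_candidate is significant at the 5% level, log_sample at the 1% level, and the other three buckets are not distinguishable from zero.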

Three substantive findings:

  1. is_candidate survives even S3 (+0.027, p<0.05): with a powerful classifier, within firm × race, candidate-sponsored polls still receive higher predicted bias than media polls of the same firm × race. This is a much sharper finding than AN-095's null |error| under S3.

  2. is_other_firm is positive across S0/S1/S2 (all p<0.01), though it loses significance at S3 (+0.011). The other_firm tier (excluding shells) consistently looks sponsored-like in the rich feature space. The sponsored-like pattern in the unaudited tail is robust to race FE and firm FE taken separately.

  3. is_shell is significantly NEGATIVE at S0 (−0.072, p<0.01). At the ceiling of detection power we can achieve from 2024 data, shells STILL don't light up. The "professional evasion" reading sharpens: shells aren't beating a weak detector; they're beating a strong one.

Substantive interpretation

The shell signal is structurally distinct from typical candidate-sponsored slant:

The shell architecture is optimized to blend in across multiple margins simultaneously. A statistical, pattern-based detector (AN-100, AN-101, AN-103) cannot reach shells; a CNPJ-side audit (AN-094-style) can.

Policy implication update: the AN-100 "publicly-computable suspicion score" recommendation stands but should be paired with CNPJ-side classification for shell detection. The two detection mechanisms are complementary:

Combined, they cover both axes of evasion.

Update (2026-06-17): AN-103 → AN-106 honest framing

The 0.911 AUC reported here uses firm-level aggregates (firm_share_candidate_work, firm_has_beta_estimate) that encode label information at the firm level. The defensible "blind detection from public data" figure for the §Policy story is AN-106's 0.72, not 0.91. The AN-103 0.91 is the firm-augmented ceiling: relevant for the firm-level reputation mechanism (see paper appendix), but not what an auditor of a single newly-released poll can achieve. See AN-104, AN-105, and AN-106 for the cleanup chain that establishes the honest 0.72 ceiling.

Caveats

Unsupervised pretraining — explicit decision

Skipped. Reasoning:

Follow-ups

  1. Multi-cycle pooling. Build the 2020 + 2022 + 2024 panel, re-train with cross-cycle features (firm's track record across cycles). Could push AUC to 0.93–0.95 and enable self-supervised pretraining. ~1 month of work.
  2. Hyperparameter tuning. Bayesian optimization over XGBoost / LightGBM hyperparameters with nested CV. ~5-day job. Expected gain: roughly +0.005–0.015 AUC (0.5–1.5 points on a 0–100 scale).
  3. Firm-blind ablation. Re-run dropping all firm-level features. Honest "fully-blind" AUC for the §Policy story — probably lands at AN-101's 0.69 range.
  4. SHAP feature attribution with the LightGBM model. Replaces tree-importance with theoretically-justified attribution.
  5. Out-of-sample 2020-train → 2024-test. Tests generalization across cycles.
  6. Multi-class target: predict the 5-bucket sponsor class directly. Direct shell-vs-candidate-vs-media discrimination in one classifier.
  7. Distributable score: the OOS predictions are saved at build/analysis/poll_ml_predictions.parquet — per-poll suspicion score for downstream use.

Artifacts