AN-101: Predicted bias as outcome — shell detection via ML

Out-of-sample predicted bias (from 5-fold CV gradient boosting trained on AN-100 sponsor-blind features → poll_has_candidate_sponsor label) shows other_firm polls have +1.6 to +2.4 pp higher predicted-bias probability than major-media polls under S0/S1/S2 (p<0.10 to p<0.01). The classifier was never trained on the other_firm label; it only learned the candidate-sponsorship pattern. The fact that other_firm polls light up as positively biased is independent algorithmic evidence for shell-style slant. The signal collapses under joint S3 (firm + race FE) — same composition story as the |error| version. AUC of out-of-sample classification = 0.69 (matches AN-100). Provides a publicly distributable per-poll suspicion score at build/analysis/poll_suspicion_score.parquet.

Hypothesis: H13: Shell-contratante polls show larger residual β
Confidence: green
Type: ml-policy-mechanism

Script: source/analysis/an-101-predicted-bias-as-outcome.py
Target: build/table/an-101-predicted-bias-as-outcome.csv
Status: interpreted · 2026-06-17
Created: 2026-06-17

User suggestion (2026-06-17): "Apply [the AN-100 detector] to all polls and then see if predicted bias varies statistically with sponsor type etc (same big table as before but with predicted bias as outcome). We could eventually do this with ML methods to improve precision of the prediction further."

This script does exactly that: produces out-of-sample predicted bias for every poll with computable consensus, then regresses the predicted bias on the 5-bucket sponsor classification with the spec ladder.

Pipeline

Build consensus_dev per cand-poll row (AN-099 logic: deviation from the median of OTHER unsponsored polls of the same cand × race within ±14 days).
Compute poll-level features (AN-100): max_signed_dev, signed_spike, poll_std_dev, mean_abs_dev, skew_dev, max_abs_dev, log_sample.
5-fold cross-validation: train a classifier on poll_has_candidate_sponsor label, predict probabilities out-of-sample.
Two classifiers:
- Logistic regression (linear, interpretable)
- Gradient boosting (200 trees, max depth 3, learning rate 0.05)
Regression: pred_bias ~ sponsor_buckets + log_sample | FE spec ladder.
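The first pipeline step can be read as a leave-one-out median. The sketch below is an illustration only, not the AN-099 code: the row fields (`poll_id`, `race`, `cand`, `field_end`, `share`, `sponsored`) and the helper name `consensus_dev` are assumptions made for the example, and the toy rows are invented.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from statistics import median
from typing import Optional


@dataclass
class Row:
    # Hypothetical cand-poll row; field names are assumptions for this sketch.
    poll_id: str
    race: str
    cand: str
    field_end: date
    share: float      # candidate's vote share in this poll, in pp
    sponsored: bool   # True if the poll has a candidate sponsor


def consensus_dev(row: Row, rows: list[Row], window_days: int = 14) -> Optional[float]:
    """Signed deviation of `row` from the median of OTHER unsponsored polls
    of the same candidate and race within +/- window_days.
    Returns None when no peer poll exists in the window."""
    peers = [
        r.share
        for r in rows
        if r.poll_id != row.poll_id
        and not r.sponsored
        and r.race == row.race
        and r.cand == row.cand
        and abs((r.field_end - row.field_end).days) <= window_days
    ]
    if not peers:
        return None
    return row.share - median(peers)


# Invented toy rows: three polls close in time, one far outside the window.
rows = [
    Row("p1", "SP-mayor", "A", date(2024, 9, 1), 30.0, False),
    Row("p2", "SP-mayor", "A", date(2024, 9, 5), 32.0, False),
    Row("p3", "SP-mayor", "A", date(2024, 9, 8), 38.0, True),
    Row("p4", "SP-mayor", "A", date(2024, 10, 20), 35.0, False),
]
devs = {r.poll_id: consensus_dev(r, rows) for r in rows}
```

Polls with no peer in the window return `None`, which is the sample-selection step noted under Caveats.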

The classifier sees only a binary label (candidate-sponsored or not) and learns the within-poll pattern that distinguishes the two classes. It is NOT trained on the small_media, pollster_self, or other_firm labels. If those buckets' predicted-bias values differ from major-media's, the classifier is detecting their similarity to candidate-sponsored polls in its learned feature space.
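The cross-validated prediction step might look roughly like the following. This is a sketch on synthetic data, not the AN-101 script: the feature matrix is random noise standing in for the seven poll-level features, and the stratified shuffled folds and the scaling step for the logit are assumptions. Only the GBM settings (200 trees, depth 3, learning rate 0.05) and the 5 folds come from the note.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-ins for the seven poll-level features (NOT real data).
X = rng.normal(size=(n, 7))
# Binary label only: 1 = candidate-sponsored, 0 = every other bucket.
logit = -1.5 + 0.8 * X[:, 0] - 0.8 * X[:, 6]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Assumption: stratified, shuffled folds; the note only says "5-fold".
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "logit": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gbm": GradientBoostingClassifier(
        n_estimators=200, max_depth=3, learning_rate=0.05, random_state=0
    ),
}

# Each poll's probability comes from the fold model that never saw that poll.
pred_bias = {
    name: cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    for name, model in models.items()
}
```

Because the sponsor bucket never enters `X` or `y` beyond the binary label, any bucket-level difference in `pred_bias` has to come through the features.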

Out-of-sample AUC

Classifier	OOS AUC
Logistic regression	0.688
Gradient boosting	0.692

Both classifiers are in the same "fair triage" range as AN-100. The GBM is marginally better (0.692 vs 0.688).
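As a reading aid: an AUC near 0.69 means that a randomly drawn candidate-sponsored poll gets a higher score than a randomly drawn other poll about 69% of the time. The pairwise definition can be written out directly. The function name `pairwise_auc` and the toy scores are invented for illustration.

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Invented toy example: two sponsored polls, three others.
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
labels = [1, 1, 0, 0, 0]
auc = pairwise_auc(scores, labels)  # 5 of the 6 pairs are ranked correctly
```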

Spec ladder: predicted bias as outcome (GBM)

Cluster SE at race. Reference = major_media. Coefficients are in probability-points (0.05 = a 5 pp probability shift).

Bucket (ref = major_media)	S0 No FE	S1 Race FE	S2 Firm FE	S3 Firm + Race FE
is_small_media	+0.0093* (0.005)	+0.0096* (0.005)	+0.0159 (0.010)	+0.0090 (0.010)
is_candidate	+0.0436*** (0.009)	+0.0026 (0.009)	+0.0319** (0.014)	−0.0140 (0.016)
is_pollster_self	+0.0084 (0.005)	+0.0070 (0.006)	+0.0096 (0.011)	−0.0024 (0.012)
is_other_firm	+0.0236*** (0.008)	+0.0162** (0.007)	+0.0213* (0.011)	+0.0046 (0.013)
log_sample	−0.078***	−0.044***	−0.077***	−0.054***
n	2,428	2,403	2,360	2,294
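One rung of this ladder (S1: race FE, SEs clustered at race) can be sketched with plain NumPy. This is a simplified illustration on synthetic data, not the AN-101 regression: it keeps a single bucket dummy, the helper names `demean_by` and `ols_cluster` are hypothetical, and the sandwich estimator omits the finite-sample corrections that standard packages apply, so its SEs would differ slightly from the table.

```python
import numpy as np


def demean_by(a, groups):
    """Subtract group means (the within transformation that absorbs one FE)."""
    a = np.asarray(a, dtype=float)
    _, idx = np.unique(groups, return_inverse=True)
    counts = np.bincount(idx)
    if a.ndim == 1:
        return a - (np.bincount(idx, weights=a) / counts)[idx]
    return np.column_stack([demean_by(a[:, j], groups) for j in range(a.shape[1])])


def ols_cluster(y, X, clusters):
    """OLS coefficients with cluster-robust sandwich SEs
    (no small-sample correction)."""
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        m = clusters == g
        score = X[m].T @ resid[m]      # sum of x_i * e_i within the cluster
        meat += np.outer(score, score)
    se = np.sqrt(np.diag(bread @ meat @ bread))
    return beta, se


# Synthetic data with a known +0.02 other_firm effect (NOT real data).
rng = np.random.default_rng(1)
n, n_races = 2000, 100
race = rng.integers(0, n_races, size=n)
is_other_firm = (rng.random(n) < 0.2).astype(float)
log_sample = rng.normal(6.0, 0.4, size=n)
race_effect = rng.normal(0, 0.03, size=n_races)[race]
pred_bias = (
    0.02 * is_other_firm
    - 0.05 * (log_sample - 6.0)
    + race_effect
    + rng.normal(0, 0.05, size=n)
)

X = np.column_stack([is_other_firm, log_sample])
beta, se = ols_cluster(demean_by(pred_bias, race), demean_by(X, race), race)
```

Adding firm FE on top of race FE (S3) needs two-way demeaning or explicit dummies, which this sketch does not cover.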

Spec ladder (logit)

Bucket	S0 No FE	S1 Race FE	S2 Firm FE	S3 Firm + Race FE
is_small_media	−0.0007 (0.005)	+0.0014 (0.004)	−0.0085 (0.010)	−0.0041 (0.005)
is_candidate	+0.0140** (0.006)	+0.0024 (0.005)	−0.0024 (0.011)	−0.0118 (0.008)
is_pollster_self	+0.0000 (0.005)	+0.0017 (0.003)	−0.0072 (0.010)	−0.0089 (0.006)
is_other_firm	+0.0090 (0.006)	+0.0044 (0.005)	−0.0013 (0.011)	−0.0052 (0.006)
log_sample	−0.093***	−0.073***	−0.089***	−0.071***

The logit ladder shows a weaker signal, consistent with the GBM capturing non-linear interactions between features that the linear model misses.

Interpretation

The key result: is_other_firm lights up positively (+1.6 to +2.4 pp probability under S0/S1/S2 in the GBM ladder) without the classifier ever seeing the other_firm label. The classifier was trained on the binary question "is this poll candidate-sponsored?" and learned the within-poll deviation pattern. Applied to other_firm polls, it predicts they look candidate-sponsored, even though their formal contratante is a third-party firm. This is the algorithmic correlate of the shell-sponsoring hypothesis: shell polls share the within-poll fingerprint of candidate-sponsored polls.

Three pieces of evidence aligned:

AN-082 / AN-085 / AN-094: other_firm polls' raw deviation on margin/winner outcomes
AN-099: other_firm polls' consensus deviation (cross-sectional pattern)
AN-101: classifier trained only on candidate-sponsorship predicts other_firm polls as sponsored-like

The third is the cleanest independent evidence because the classifier has no contact with the other_firm label.

small_media also shows positive predicted bias in the GBM ladder (+0.9 to +1.0 pp at S0/S1, p<0.10); the logit ladder shows nothing for this bucket. This corroborates the AN-085 finding that the media bucket is heterogeneous: some "media-sponsored" polls share the within-poll pattern of candidate-sponsored polls. Consistent with the "small media as shell channel" hypothesis but with weaker signal than other_firm.

pollster_self is null across all specs: the firm-self-contracted polls don't show the within-poll pattern of candidate-sponsored polls. Consistent with AN-085's finding that 2024 pollster_self is dominated by trusted-firm showcase polls (Datafolha / Quaest doing brand-protection work), not the IPOP-style fraud channel that migrated to other_firm.

Under S3 (firm + race FE) everything collapses — same composition story as AN-093 / AN-094 / AN-095. The within-firm within-race comparison is sharp enough to absorb the predicted-bias signal too. Aggregate sorting (which firms hire which sponsors, which races attract which sponsors) carries most of the signal.

log_sample dominates the classifier. The strongest predictor of "sponsored" is small sample size. Sponsored polls run systematically smaller samples (median ~360–400 vs ~408 for media). The within-poll deviation features add signal but sample-size leakage is the bigger channel.
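The sample-size channel can be separated from the within-poll pattern by refitting without log_sample and comparing out-of-sample AUC. The sketch below runs on synthetic data in which sample size is deliberately the dominant signal; the helper name `oos_auc`, the three stand-in deviation features, and the stratified folds are assumptions, and the resulting numbers say nothing about the real polls.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def oos_auc(X, y, seed=0):
    """5-fold out-of-sample AUC for a GBM with the settings quoted in the note."""
    model = GradientBoostingClassifier(
        n_estimators=200, max_depth=3, learning_rate=0.05, random_state=seed
    )
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    p = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    return roc_auc_score(y, p)


# Synthetic data (NOT real): label depends strongly on log_sample,
# weakly on one within-poll deviation feature.
rng = np.random.default_rng(2)
n = 800
dev_features = rng.normal(size=(n, 3))
log_sample = rng.normal(6.0, 0.5, size=n)
logit = -1.0 + 0.4 * dev_features[:, 0] - 3.0 * (log_sample - 6.0)
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

auc_full = oos_auc(np.column_stack([dev_features, log_sample]), y)
auc_blind = oos_auc(dev_features, y)  # sample-size-blind version
```

The gap between `auc_full` and `auc_blind` is the share of discrimination attributable to sample size in this toy setup.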

Policy contribution: a distributable poll suspicion score

Saved at: build/analysis/poll_suspicion_score.parquet

Per-protocol record:

pred_bias_logit: probability ∈ [0,1] from cross-validated logit
pred_bias_gbm: probability ∈ [0,1] from cross-validated GBM
bucket: 5-class sponsor classification
poll_has_candidate_sponsor: ground-truth label
muni_id, institute, field_end, log_sample: identifiers

This artifact is distributable — every per-poll score is computed from public TSE data without revealing private firm information beyond what's already in the registry. A regulator, journalist, or academic can:

Triage: flag the top-N polls by suspicion score for closer audit.
Aggregate: pool scores across a firm's polls — even at AUC 0.69 per-poll, the firm-level z-score is large for firms with ≥10 polls.
Track: cross-cycle changes in a firm's average suspicion score = a reputation signal.
Cross-validate sponsor-identity disclosure: polls with high suspicion scores AND shell-style contratantes (FacUnicamps, etc.) are higher-confidence shell-flagging cases.
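The triage and aggregation uses above could be prototyped in a few lines. This is a sketch under assumptions: records are plain dicts carrying the `institute` and `pred_bias_gbm` fields listed in the record description, `poll_id` is a hypothetical identifier, and the z-score shown (firm mean against the pooled mean, scaled by pooled SD over root n) is one possible definition rather than the one used in the analysis. The example values are invented.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, pstdev


def top_n(records, n, key="pred_bias_gbm"):
    """Triage: the n polls with the highest suspicion score."""
    return sorted(records, key=lambda r: r[key], reverse=True)[:n]


def firm_z_scores(records, key="pred_bias_gbm", min_polls=10):
    """Aggregate: z-score of each firm's mean score against the pooled mean.
    Only firms with at least min_polls polls are returned."""
    scores = [r[key] for r in records]
    mu, sigma = mean(scores), pstdev(scores)
    by_firm = defaultdict(list)
    for r in records:
        by_firm[r["institute"]].append(r[key])
    return {
        firm: (mean(v) - mu) / (sigma / sqrt(len(v)))
        for firm, v in by_firm.items()
        if len(v) >= min_polls and sigma > 0
    }


# Invented records: firm A scores high, firm B low, firm C has too few polls.
records = (
    [{"poll_id": f"A{i}", "institute": "A", "pred_bias_gbm": 0.30} for i in range(10)]
    + [{"poll_id": f"B{i}", "institute": "B", "pred_bias_gbm": 0.10} for i in range(10)]
    + [{"poll_id": f"C{i}", "institute": "C", "pred_bias_gbm": 0.20} for i in range(2)]
)
flagged = top_n(records, 3)
z = firm_z_scores(records)
```

Averaging over a firm's polls shrinks the noise by root n, which is why a weak per-poll score can still separate firms.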

The GBM model's exact tree structure can be exposed publicly without compromising the detection mechanism — there's no adversarial-defense argument for keeping it secret. (Sponsors who learn the rules might try to game them, but they're already not field-randomizing their slant in the obvious ways the classifier exploits.)

Caveats

Train/test labels are binary: candidate-sponsored vs not. The classifier ignores all the nuance of the 5-bucket classification. Multi-class label would be a natural extension.
log_sample is the strongest classifier feature. A sample-size-blind version (drop log_sample) drops AUC to ~0.62. Cleanest within-poll-pattern test requires this ablation; current results conflate within-poll pattern with sample-size leakage.
2,428 polls in regression sample (vs 22k cand-poll rows). Sample selection — consensus computation requires ≥1 peer poll in window. High-attention races over-represented.
5-fold CV: random fold assignment may leave some variance in the OOS predictions. K=10 or LOOCV would tighten but require more compute.
GBM has 200 trees, max depth 3, lr 0.05 — reasonable defaults but not hyperparameter-tuned. Better hyperparameter search likely buys 1–3 AUC points.
The S3 collapse is the same noise floor story as AN-095, not a property of the ML version of the analysis.

Follow-ups

Sample-size-blind classifier: drop log_sample, refit. Tests whether within-poll pattern alone is informative. Expected AUC ≈ 0.62.
Hyperparameter-tuned GBM + larger feature set (consensus deviation moments, cross-cell-position rank of max-deviation cand, etc.). Could realistically push AUC to 0.75–0.78.
Out-of-sample 2020 → 2024 validation: train on 2020 polls, predict on 2024. Tests structural stability of the classifier.
Firm-level aggregation test: for each firm with ≥10 polls, compute mean predicted_bias. Firm-level AUC should be much higher (n_firms × per-firm-pooling reduces noise).
Multi-class label: train classifier to predict the 5-bucket sponsor class instead of binary. Direct evidence of shell-detection in a single classifier rather than leakage-inference.
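The 2020 → 2024 validation differs from the 5-fold setup only in how rows are split: by election cycle instead of at random. A minimal sketch on synthetic data follows; the `year` array, the four stand-in features, and the label-generating process are all invented, and no result here reflects the real polls.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Synthetic data (NOT real): same label process in both cycles.
rng = np.random.default_rng(3)
n = 1200
year = rng.choice([2020, 2024], size=n)
X = rng.normal(size=(n, 4))
logit = -1.0 + 1.2 * X[:, 0]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

train, test = year == 2020, year == 2024

# Fit on the earlier cycle only, then score the later cycle.
model = GradientBoostingClassifier(
    n_estimators=200, max_depth=3, learning_rate=0.05, random_state=0
)
model.fit(X[train], y[train])
auc_2024 = roc_auc_score(y[test], model.predict_proba(X[test])[:, 1])
```

On real data, a 2024 AUC well below the cross-validated figure would suggest that the sponsored-poll fingerprint shifted between cycles.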

Artifacts

Script: source/analysis/an-101-predicted-bias-as-outcome.py
Spec-level coefficients: build/table/an-101-predicted-bias-as-outcome.csv
Headline JSON: build/table/an-101-predicted-bias-as-outcome.json
Distributable per-poll score: build/analysis/poll_suspicion_score.parquet

AN-099 consensus deviation
AN-100 sponsor-blind detection
AN-098 noise floor
AN-093 paper-ready spec ladder: predecessor with |error| as outcome