Design-driven bias: the menu of pollster design choices

Reference doc — consolidates the inventory of legal design choices a Brazilian pollster makes during TSE PesqEle registration, and how each lever can mechanically tilt a poll toward a sponsor's candidate without violating the disclosed methodology. This is the supply-side menu that populates Channel A of the decomposition.

Cross-refs:

Why design choices have room to slant at all

Textbook survey theory: a full probability sample of all eligible voters with inverse-inclusion-probability weights against the correct frame is unbiased — sponsor design choices would be irrelevant.

Brazilian electoral polls are almost universally not probability samples. They are quota samples with multi-stage stratified selection of clusters and quota-controlled selection within. Quota sampling is model-based: unbiased only when (i) the chosen quota variables fully explain response heterogeneity and (ii) the chosen population frame is the right one. Both assumptions are routinely violated, in legally disclosed ways — which is exactly what makes Channel A possible and (in principle) measurable from the registration filings.

The disclosure regime (PesqEle) makes the methodology public before the result is released, so receivers can in principle penalize a poll whose methodology obviously favors the sponsor. But quota sampling with multi-stage selection gives substantial latitude in declared methodology that does not trigger obvious red flags (e.g. urban-only DS_DADO_MUNICIPIO is common and not by itself disqualifying).

The levers

Each entry: the choice, the slant mechanism, the TSE field(s) where it shows up, and whether it is structured or requires LLM extraction.

1. Coverage exclusion (geographic)

Choice. Which sub-areas of the municipality the sample frame covers. DS_DADO_MUNICIPIO typically lists "área urbana"; rural sub-districts, peripheral neighborhoods, or favela areas are routinely assigned probability zero.

Slant. Strongest version of the zero-probability-subpopulation move. If the sponsor's candidate has a rural / peripheral / specific-district base, urban-only coverage mechanically depresses the opponent's count (or vice versa) without anyone lying. Textbook-legal slant, declared in registration.
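
The arithmetic behind this lever can be sketched in a few lines. This is an illustrative calculation, not project code: the helper name `coverage_bias` and the example shares are hypothetical, and it assumes the urban sample is itself unbiased for urban voters while rural voters have inclusion probability zero.

```python
def coverage_bias(urban_share: float, support_urban: float, support_rural: float) -> float:
    """Bias (as a share) of an urban-only estimate for one candidate.

    Hypothetical helper: assumes the urban sample is unbiased for urban voters
    and that rural voters are excluded from the frame entirely.
    """
    if not 0.0 <= urban_share <= 1.0:
        raise ValueError("urban_share must be in [0, 1]")
    # True support is the population-weighted mix of both areas.
    true_support = urban_share * support_urban + (1.0 - urban_share) * support_rural
    # An urban-only poll can only ever recover the urban figure.
    urban_only_estimate = support_urban
    return urban_only_estimate - true_support


# Illustrative numbers only: 70% urban electorate, candidate at 45% urban and 30% rural.
bias = coverage_bias(urban_share=0.70, support_urban=0.45, support_rural=0.30)
print(f"urban-only estimate overstates the candidate by {bias * 100:.1f} pp")
```

Under these assumptions the bias equals the excluded share times the urban-rural support gap, so it grows with both the size of the excluded area and how differently it votes.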

TSE field. DS_DADO_MUNICIPIO (free text, ~150 chars median).

Extraction. LLM → coverage_class ∈ {full-municipality, urban-only, urban-plus-selected-rural, specific-neighborhoods, other}. FLAGGED as the single most consequential field in the extraction schema.

2. Quota variables

Choice. Which demographic variables the quota cells are defined over. Brazilian practice: sex and age are universal; education, income, religion, occupation, race are picked à la carte.

Slant. Each quota variable absorbs a different slice of unobserved political heterogeneity. Excluding education when education predicts vote leaves the education-vote correlation unaccounted for — a tilt without lying. The choice of which quota variables to discipline is the lever.

TSE field. DS_PLANO_AMOSTRAL (free text, ~1,640 chars median — the longest narrative block; describes stratification scheme, PPT (probabilidade proporcional ao tamanho, i.e. probability proportional to size) vs simple random, quota variables with exact percentages, n_stages).

Extraction. LLM → quota_variables: list[sex, age, education, income, religion, occupation, region, race], is_quota_sample: bool, sample_design_class ∈ {simple-random, stratified, quota, multi-stage, PPT, mixed}, n_stages ∈ {1, 2, 3+}.
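
One way to hold the extractor output for this lever is a small validated record. The sketch below is an assumption about how such a record could look, not the project's actual schema code: the class name `SampleDesignExtraction` is hypothetical, the allowed values are copied from the extraction line above, and storing n_stages as a string is a choice made here to accommodate the "3+" bucket.

```python
from dataclasses import dataclass
from typing import List

# Allowed values copied from the extraction spec; the constants' names are hypothetical.
QUOTA_VARIABLES = {"sex", "age", "education", "income", "religion", "occupation", "region", "race"}
SAMPLE_DESIGN_CLASSES = {"simple-random", "stratified", "quota", "multi-stage", "PPT", "mixed"}
N_STAGES = {"1", "2", "3+"}  # strings because of the open-ended "3+" bucket (assumption)


@dataclass(frozen=True)
class SampleDesignExtraction:
    """Hypothetical container for the lever-2 fields extracted from DS_PLANO_AMOSTRAL."""

    quota_variables: List[str]
    is_quota_sample: bool
    sample_design_class: str
    n_stages: str

    def __post_init__(self) -> None:
        # Reject anything outside the declared enums so bad LLM output fails loudly.
        unknown = set(self.quota_variables) - QUOTA_VARIABLES
        if unknown:
            raise ValueError(f"unknown quota variables: {sorted(unknown)}")
        if self.sample_design_class not in SAMPLE_DESIGN_CLASSES:
            raise ValueError(f"unknown sample_design_class: {self.sample_design_class!r}")
        if self.n_stages not in N_STAGES:
            raise ValueError(f"unknown n_stages: {self.n_stages!r}")


# Illustrative record, not taken from any real filing.
record = SampleDesignExtraction(
    quota_variables=["sex", "age", "education"],
    is_quota_sample=True,
    sample_design_class="quota",
    n_stages="2",
)
print(record)
```

Validating at construction time means an out-of-enum value from the LLM surfaces as an error at parse time rather than as a silent extra category in the regression controls.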

3. Population reference frame

Choice. Which population the quotas are normalized against — Census-2022 residents, TSE-eligible voters, or an estimated turnout-weighted electorate. The pollster declares one. None is "wrong" by TSE standards.

Slant. Each frame gives different cell shares. Census residents over-weight non-voters and youth (non-registered); TSE-eligible over-weights low-turnout demographics relative to actual electorates; turnout-weighted requires its own model. The frame choice shifts the implicit weighting of every quota cell.
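
A minimal sketch of how the frame choice moves the estimate while the field data stay fixed. Everything here is illustrative: the function name `weighted_share`, the age cells, the per-cell support figures, and both sets of frame shares are made-up assumptions, not Census or TSE numbers.

```python
from typing import Dict


def weighted_share(cell_support: Dict[str, float], frame_shares: Dict[str, float]) -> float:
    """Candidate share implied by per-cell support and a reference frame's cell shares."""
    if set(cell_support) != set(frame_shares):
        raise ValueError("support and frame must cover the same quota cells")
    total = sum(frame_shares.values())
    if total <= 0:
        raise ValueError("frame shares must sum to a positive number")
    # Normalize the frame so it works with shares or raw population counts.
    return sum(cell_support[cell] * frame_shares[cell] / total for cell in cell_support)


# Same hypothetical field results per age cell...
cell_support = {"16-24": 0.30, "25-59": 0.45, "60+": 0.55}
# ...normalized against two hypothetical frames.
census_frame = {"16-24": 0.20, "25-59": 0.60, "60+": 0.20}
tse_frame = {"16-24": 0.15, "25-59": 0.62, "60+": 0.23}

census_estimate = weighted_share(cell_support, census_frame)
tse_estimate = weighted_share(cell_support, tse_frame)
print(f"census frame: {census_estimate:.4f}, TSE frame: {tse_estimate:.4f}")
```

With these made-up inputs the two frames differ by about one percentage point on identical responses, which is the sense in which the frame choice shifts the implicit weight of every cell.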

TSE field. DS_PLANO_AMOSTRAL (overlaps with quota variables).

Extraction. LLM → population_reference ∈ {census_2022_residents, TSE_eligible, turnout_weighted, other, not_specified}, population_source ∈ {TSE, IBGE, both, other}, census_sectors_used: bool.

4. Mode of contact

Choice. In-person door-to-door, phone (CATI), online (CAWI), or mixed.

Slant. Mode is the choice of who's in the frame. Door-to-door under-samples gated communities, working professionals, and night-shift voters; phone under-samples young and low-income; online over-samples high-internet-penetration. Post-stratification weights only partially fix the missingness — anyone literally not reachable by the chosen mode contributes nothing to the inputs that drive the weights.

TSE field. DS_METODOLOGIA_PESQUISA (free text, ~230 chars median) + DS_SISTEMA_CONTROLE (~660 chars; describes tablet vs paper).

Extraction. LLM → mode ∈ {in-person, phone, online, mixed}, collection_device ∈ {tablet, paper, mixed}.

5. Question wording, order, and headline-scenario choice

Choice. (a) Order in which candidate names are read in the estimulado prompt; (b) whether familiarity / approval / rejection is asked before vote intention; (c) which scenario (estimulado / espontâneo / votos válidos) is reported as the headline.

Slant. (a) Published estimates put order effects at 2–3pp. (b) Asking approval first primes positivity toward whoever the respondent already likes. (c) The sponsor's candidate may be strongest in one scenario (e.g. high estimulado, weak espontâneo) — picking that scenario as the headline is a free lever.

TSE field. DS_METODOLOGIA_PESQUISA and DS_PLANO_AMOSTRAL (narrative). Headline scenario is often only inferable from the results file (cenário with the most prominence), not from registration text alone.

Extraction. LLM → question_order_described: bool. Gap: headline scenario choice is the part of this lever we cannot easily capture from registration; it would need a separate pass over the publication output (pesquisa_eleitoral_*_cenarios files) cross-referenced with registration.

6. Non-response handling

Choice. What to do with undecideds and refusals — redistribute proportionally to leaders, redistribute by demographic similarity, hold as a separate category, or exclude from the denominator. Each gives different votos válidos numbers from the same raw responses.

Slant. Excluding undecideds inflates leaders disproportionately (since front-runners are more likely to be named first by partial-recallers). Redistributing proportionally locks in the current distribution. The choice changes the headline number without changing field collection.
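
To make the denominator effect concrete, the sketch below computes headline shares from one set of raw responses under two of the handling rules named above: holding undecideds as a separate category, and excluding them from the denominator. The function name `headline_shares` and the raw counts are hypothetical.

```python
from typing import Dict


def headline_shares(
    raw_counts: Dict[str, int],
    undecided_key: str = "undecided",
    exclude_undecided: bool = False,
) -> Dict[str, float]:
    """Percent shares from raw counts, optionally dropping undecideds from the denominator."""
    counts = dict(raw_counts)
    if exclude_undecided:
        counts.pop(undecided_key, None)
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("no responses left in the denominator")
    return {name: 100.0 * n / total for name, n in counts.items()}


# Illustrative raw responses from a hypothetical 1,000-interview poll.
raw = {"A": 400, "B": 350, "C": 50, "undecided": 200}

separate = headline_shares(raw)                          # undecideds kept as a category
validos = headline_shares(raw, exclude_undecided=True)   # votos válidos style

print(f"A: {separate['A']:.2f} -> {validos['A']:.2f}")
print(f"A-B gap: {separate['A'] - separate['B']:.2f} -> {validos['A'] - validos['B']:.2f}")
```

In this example candidate A moves from 40% to 50% and the A-B gap widens from 5 to 6.25 points, with no change to the field data.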

TSE field. DS_PLANO_AMOSTRAL and DS_SISTEMA_CONTROLE (narrative).

Extraction. LLM → nonresponse_handling: str (free-form notes; hard to categorize sharply ex-ante).

Adjacent levers we are not capturing structurally

These move slant into Channel B in our decomposition — i.e. they will register as residual not explained by Spec 3 controls, even though they are real design choices.

The implication for interpretation: a stable β across Spec 1 → Spec 3 cannot cleanly distinguish "fabrication" from "design slant via dimensions we didn't extract." Sharpening the extracted dictionary narrows the residual but cannot eliminate it. See theory.md § Open testability concern.

What this menu does not cover: Channel B

Channel B is the residual — results that do not match the declared methodology. Same firm, same declared sample design, same nominal sample size, but the numbers for the sponsor's candidate are tilted in a way the declared design cannot explain. This is the fabrication / cooking-the-numbers channel.

The complementary fraud tests on the residual (digit-frequency / Benford on reported percentages, suspiciously clean numbers, between-pollster consistency for the same race × date) live in summary.md § Complementary fraud tests. They are not design levers and are out of scope for this doc.

Dimensions surfaced in the 2020 poll-lawsuit pilot

A 50-case llmkit pilot over PESQUISA cases in projects/electoral-justice/build/merge/proc.csv (filingyear 2020, seed 42; see projects/poll-sponsor-bias/source/llm/pilot_poll_lawsuit.py) extracted alleged-bias dimensions from the sentenca text. Two findings that bear on this menu:

Caveat: the lawsuit universe is registration-driven, not methodology-driven. Of the 50 pilot cases, 33 (66%) classified as registration_missing, 7 as divulgation_violation, 5 as enquete_not_pesquisa, 2 as registration_late. Zero had methodology_bias or fabrication_allegation as the primary cause of action. Implication: validating high-SponsoredBy_c polls against the "sued" set (todo.md § Complementary data use #1) will mostly capture registration non-compliance, not design bias. Use the lawsuit pass for surfacing the menu of bias dimensions that petitioners can plausibly invoke, not as a quantitative proxy for biased polls.

Allegations within the six-lever menu (appearing as secondary arguments in registration cases):

New dimensions NOT in the six-lever menu — candidates for addition or for explicit Channel B treatment:

  1. Pollster-as-sponsor shell — the pollster itself is the contratante, and the firm has low capital social or recent incorporation. "fraude consiste no registro de pesquisas onde a própria empresa prestadora é a pagante, o que causa estranheza devido ao baixo capital social" [English: "the fraud consists in registering polls where the service provider itself is the payer, which is odd given its low share capital"]; "Evidências de ocultação do contratante da pesquisa" [English: "Evidence of concealment of the poll's sponsor"]. The signal is a cross-link between NR_CNPJ_EMPRESA (pollster) and NR_CNPJ_CONTRATANTE (sponsor) indicating that the two are related or that the contratante is a thinly capitalized vehicle. Already partially captured by ST_PESQUISA_PROPRIA, but the shell-CNPJ pattern is a distinct flavor.
  2. Contract value below market — "contratada por um valor muito abaixo do comum de mercado" [English: "contracted at a value far below the usual market rate"]. The petitioner treats an anomalously low VR_PAGO_CONTRATANTE (or VR_PESQUISA) as a signal that the field operation can't have been real. This makes the already-structured cost field a bias signal, not just a control.
  3. Data reuse / no fresh collection — "realizada com dados de uma pesquisa anterior, sem a devida realização da coleta de dados" [English: "carried out with data from a previous poll, without the data collection actually being done"]. A specific fabrication signature: the declared field period was fake; numbers come from a prior wave. Belongs in Channel B but is a detectable sub-form (between-wave consistency tests).
  4. Funding-source disclosure failure — DS_ORIGEM_RECURSO left blank or non-specific. "não informa a origem dos recursos despendidos" [English: "does not report the source of the funds spent"]. A registration-form failure that doubles as a bias signal (concealed sponsor → bias incentive).
  5. Missing required disclosure content — list of bairros, nota fiscal, period of collection. These are registration-form gaps that don't directly bias the result but co-occur with cases where the pollster is being uncooperative; useful as a confounder when treating other levers.
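
Several of the dimensions above can be screened from already-structured fields. The sketch below is one hedged way to do that for a single registration row. The field names come from the text; the function names, the 25% low-value threshold, the use of VR_PESQUISA as the cost field, and the example row are all hypothetical. It relies on the fact that the first 8 digits of a 14-digit CNPJ identify the legal entity, and it assumes CNPJs may have lost leading zeros in the CSV.

```python
import re
from typing import Dict


def cnpj_root(cnpj: str) -> str:
    """First 8 digits of a CNPJ, which identify the legal entity across branches."""
    digits = re.sub(r"\D", "", cnpj)
    if not digits or len(digits) > 14:
        raise ValueError(f"not a usable CNPJ: {cnpj!r}")
    # Assumption: shorter values lost leading zeros when stored as numbers.
    return digits.zfill(14)[:8]


def registration_flags(
    row: Dict[str, object],
    market_median_value: float,
    low_value_ratio: float = 0.25,  # hypothetical threshold, to be calibrated
) -> Dict[str, bool]:
    """Screening flags for one registration; signals to review, not findings of fraud."""
    same_entity = cnpj_root(str(row["NR_CNPJ_EMPRESA"])) == cnpj_root(str(row["NR_CNPJ_CONTRATANTE"]))
    low_value = float(row["VR_PESQUISA"]) < low_value_ratio * market_median_value
    blank_funding = not str(row.get("DS_ORIGEM_RECURSO") or "").strip()
    return {
        "pollster_is_sponsor": same_entity,
        "contract_value_low": low_value,
        "funding_source_blank": blank_funding,
    }


# Made-up row: same CNPJ root on both sides, low value, empty funding source.
example_row = {
    "NR_CNPJ_EMPRESA": "12.345.678/0001-90",
    "NR_CNPJ_CONTRATANTE": "12345678000271",
    "VR_PESQUISA": "1500.00",
    "DS_ORIGEM_RECURSO": "  ",
}
flags = registration_flags(example_row, market_median_value=8000.0)
print(flags)
```

Matching on the CNPJ root only catches the case where pollster and sponsor are the same legal entity. Related-party shells with distinct CNPJs would need ownership data that this sketch does not assume.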

Schema iteration note. The llmkit pilot extracted these as free text in lever_other_text rather than mapping them to enum codes, because the prompt framed the lever enum as "design bias as theory of harm" not "type of design dimension touched". This is the right choice when the goal is to find dimensions outside the menu, but a second-pass schema with the typology framing would give us countable incidence of each lever across the lawsuit universe. Queued under todo.md (next iteration of the poll-lawsuit extractor).

Per-lever expected slant direction

The decomposition test (Spec 1 → Spec 3 shrinkage) is direction-agnostic — it tests whether the sponsor effect runs through the methodology, not which direction each lever pushes. But for the heterogeneity analyses in theory.md § Predictions (point 2), we need priors on which levers matter where:

| Lever | Where slant room is largest |
|---|---|
| Coverage exclusion | Rural-heavy munis; munis with sharply geographically segregated bases |
| Quota variables | Races with strong demographic skew (income / education / race correlates with vote) |
| Population reference | Munis with high recent population change (census vs TSE registries diverge) |
| Mode | Heterogeneous internet / phone access; gated-community-heavy munis |
| Question order / scenario | Races with high non-recall / weak candidate name recognition |
| Non-response handling | Races with high undecided share (early polls, low-information races) |

These map directly to the muni-level moderators (rural share, demographic Gini, race competitiveness) the heterogeneity spec uses.

Status

Reference doc — content consolidated from summary.md, theory.md, and the LLM extractor schema in todo.md. No new claims; re-organization for use by (a) the LLM extractor implementation, (b) the paper's mechanism section, (c) the EJ poll-lawsuits descriptive pass — to check which of these levers shows up most often in petitioners' bias claims.