Design-driven bias: the menu of pollster design choices
Reference doc — consolidates the inventory of legal design choices a Brazilian pollster makes during TSE PesqEle registration, and how each lever can mechanically tilt a poll toward a sponsor's candidate without violating the disclosed methodology. This is the supply-side menu that populates Channel A of the decomposition.
Cross-refs:
- theory.md § Polls as Bayesian persuasion: formal model (sender chooses signal structure σ subject to disclosure constraint)
- methods.md § Channel A vs B decomposition: three-spec regression ladder that operationalizes the test
- summary.md § Why design choices have room to slant: discursive framing inside the project overview
- institutions.md: TSE PesqEle regulatory regime + CONRE
- todo.md § Mechanism decomposition: LLM extractor schema this doc justifies field-by-field
Why design choices have room to slant at all
Textbook survey theory: a full probability sample of all eligible voters with inverse-inclusion-probability weights against the correct frame is unbiased — sponsor design choices would be irrelevant.
Brazilian electoral polls almost universally are not probability samples. They are quota samples with multi-stage stratified selection of clusters and quota-controlled selection within. Quota sampling is model-based: unbiased only when (i) the chosen quota variables fully explain response heterogeneity and (ii) the chosen population frame is the right one. Both assumptions are routinely violated, in legally-disclosed ways — which is exactly what makes Channel A possible and (in principle) measurable from the registration filings.
The disclosure regime (PesqEle) makes the methodology public before
the result is released, so receivers can in principle penalize a poll
whose methodology obviously favors the sponsor. But quota sampling with
multi-stage selection gives substantial latitude in declared
methodology that does not trigger obvious red flags (e.g. urban-only
DS_DADO_MUNICIPIO is common and not by itself disqualifying).
The levers
Each entry: the choice, the slant mechanism, the TSE field(s) where it shows up, and whether it is structured or requires LLM extraction.
1. Coverage exclusion (geographic)
Choice. Which sub-areas of the municipality the sample frame
covers. DS_DADO_MUNICIPIO typically lists "área urbana"; rural
sub-districts, peripheral neighborhoods, or favela areas are routinely
assigned probability zero.
Slant. Strongest version of the zero-probability-subpopulation move. If a candidate's base is rural, peripheral, or concentrated in specific districts, excluding those areas mechanically depresses that candidate's count. A sponsor whose opponent has such a base therefore gains from urban-only coverage without anyone lying. Textbook-legal slant, declared in registration.
TSE field. DS_DADO_MUNICIPIO (free text, ~150 chars median).
Extraction. LLM → coverage_class ∈ {full-municipality, urban-only, urban-plus-selected-rural, specific-neighborhoods, other}. FLAGGED as
the single most consequential field in the extraction schema.
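A minimal arithmetic sketch of the urban-only mechanism. The strata shares and support levels are invented for illustration and do not come from PesqEle data; `poll_estimate` is a hypothetical helper.

```python
# Hypothetical municipality: two strata with different support for candidate A.
# All numbers are illustrative assumptions, not estimates.
strata = {
    "urban": {"pop_share": 0.70, "support_a": 0.40},
    "rural": {"pop_share": 0.30, "support_a": 0.60},
}


def poll_estimate(strata, covered):
    """Expected share for candidate A when only `covered` strata are sampled.

    Covered strata are weighted by population share, renormalized over the
    covered set; excluded strata get inclusion probability zero.
    """
    total = sum(strata[s]["pop_share"] for s in covered)
    return sum(
        strata[s]["pop_share"] / total * strata[s]["support_a"] for s in covered
    )


full = poll_estimate(strata, ["urban", "rural"])  # whole municipality
urban_only = poll_estimate(strata, ["urban"])     # declared "área urbana" frame

print(f"full coverage: {full:.3f}")
print(f"urban only:    {urban_only:.3f}")
print(f"coverage bias: {urban_only - full:+.3f}")
```

With these made-up inputs, candidate A's expected share falls from 0.46 to 0.40 purely through the declared frame, with no change to any individual answer.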
2. Quota variables
Choice. Which demographic variables the quota cells are defined over. Brazilian practice: sex and age are universal; education, income, religion, occupation, race are picked à la carte.
Slant. Each quota variable absorbs a different slice of unobserved political heterogeneity. Excluding education when education predicts vote leaves the education-vote correlation unaccounted for — a tilt without lying. The choice of which quota variables to discipline is the lever.
TSE field. DS_PLANO_AMOSTRAL (free text, ~1,640 chars median —
the longest narrative block; describes stratification scheme, PPT vs
simple random, quota variables with exact percentages, n_stages).
Extraction. LLM → quota_variables: list[sex, age, education, income, religion, occupation, region, race], is_quota_sample: bool,
sample_design_class ∈ {simple-random, stratified, quota, multi-stage, PPT, mixed}, n_stages ∈ {1, 2, 3+}.
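One possible way to hold the extracted sample-plan fields, sketched as a validated dataclass. The field names and allowed values are taken from the extraction targets above; the class name `SamplePlanExtraction`, the string encoding of `n_stages`, and the `undisciplined_variables` helper are assumptions made for illustration, not the project's actual schema code.

```python
from dataclasses import dataclass

# Allowed values copied from the extraction targets listed above.
QUOTA_VARIABLES = ("sex", "age", "education", "income",
                   "religion", "occupation", "region", "race")
SAMPLE_DESIGN_CLASSES = ("simple-random", "stratified", "quota",
                         "multi-stage", "PPT", "mixed")
N_STAGES = ("1", "2", "3+")  # assumption: stored as strings so "3+" fits


@dataclass(frozen=True)
class SamplePlanExtraction:
    """Hypothetical container for one parsed DS_PLANO_AMOSTRAL narrative."""
    quota_variables: tuple
    is_quota_sample: bool
    sample_design_class: str
    n_stages: str

    def __post_init__(self):
        # Reject values outside the declared enums so LLM drift is caught early.
        unknown = set(self.quota_variables) - set(QUOTA_VARIABLES)
        if unknown:
            raise ValueError(f"unknown quota variables: {sorted(unknown)}")
        if self.sample_design_class not in SAMPLE_DESIGN_CLASSES:
            raise ValueError(f"unknown design class: {self.sample_design_class!r}")
        if self.n_stages not in N_STAGES:
            raise ValueError(f"unknown n_stages: {self.n_stages!r}")

    def undisciplined_variables(self):
        """Menu variables the pollster left out of the quota cells."""
        return [v for v in QUOTA_VARIABLES if v not in self.quota_variables]


plan = SamplePlanExtraction(
    quota_variables=("sex", "age"),
    is_quota_sample=True,
    sample_design_class="quota",
    n_stages="2",
)
print(plan.undisciplined_variables())
```

The helper makes the lever explicit: for a sex-and-age-only design it lists the variables whose correlation with vote is left unconstrained.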
3. Population reference frame
Choice. Which population the quotas are normalized against: Census-2022 residents, TSE-eligible voters, or an estimated turnout-weighted electorate. The pollster declares one. None is "wrong" by TSE standards.
Slant. Each frame gives different cell shares. Census residents over-weight non-voters and youth (non-registered); TSE-eligible over-weights low-turnout demographics relative to actual electorates; turnout-weighted requires its own model. The frame choice shifts the implicit weighting of every quota cell.
TSE field. DS_PLANO_AMOSTRAL (overlaps with quota variables).
Extraction. LLM → population_reference ∈ {census_2022_residents, TSE_eligible, turnout_weighted, other, not_specified},
population_source ∈ {TSE, IBGE, both, other}, census_sectors_used: bool.
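To make the frame effect concrete, the sketch below reweights the same cell-level responses against three frames. The age cells, cell shares, and support levels are hypothetical; only the frame labels come from the population_reference enum above.

```python
# Same field data in every scenario: support for candidate A by age cell.
# All numbers are illustrative assumptions.
support_a = {"16-24": 0.55, "25-59": 0.45, "60+": 0.35}

# Hypothetical cell shares under each declared reference population.
frames = {
    "census_2022_residents": {"16-24": 0.20, "25-59": 0.60, "60+": 0.20},
    "TSE_eligible":          {"16-24": 0.15, "25-59": 0.62, "60+": 0.23},
    "turnout_weighted":      {"16-24": 0.12, "25-59": 0.62, "60+": 0.26},
}


def weighted_estimate(support, shares):
    """Quota-weighted estimate: sum over cells of frame share times cell support."""
    assert abs(sum(shares.values()) - 1.0) < 1e-9, "cell shares must sum to 1"
    return sum(shares[cell] * support[cell] for cell in support)


estimates = {
    name: weighted_estimate(support_a, shares) for name, shares in frames.items()
}
for name, value in estimates.items():
    print(f"{name:<24} {value:.3f}")
```

In this toy setup the headline moves from 0.450 to 0.442 to 0.436 as the frame shifts weight away from the youngest cell, even though no interview changed.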
4. Mode of contact
Choice. In-person door-to-door, phone (CATI), online (CAWI), or mixed.
Slant. Mode is the choice of who's in the frame. Door-to-door under-samples gated communities, working professionals, and night-shift voters; phone under-samples young and low-income; online over-samples high-internet-penetration. Post-stratification weights only partially fix the missingness — anyone literally not reachable by the chosen mode contributes nothing to the inputs that drive the weights.
TSE field. DS_METODOLOGIA_PESQUISA (free text, ~230 chars
median) + DS_SISTEMA_CONTROLE (~660 chars; describes tablet vs paper).
Extraction. LLM → mode ∈ {in-person, phone, online, mixed},
collection_device ∈ {tablet, paper, mixed}.
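The limit of post-stratification under mode-driven missingness can be illustrated with a toy phone poll. Every number below is an assumption chosen for the example, as are the function names.

```python
# Hypothetical phone poll. Within the "young" cell, half the voters cannot be
# reached by phone and lean more toward candidate A than the reachable half.
cells = {
    "young": {"pop_share": 0.40, "reachable": 0.50,
              "support_reachable": 0.60, "support_unreachable": 0.70},
    "old":   {"pop_share": 0.60, "reachable": 1.00,
              "support_reachable": 0.40, "support_unreachable": 0.40},
}


def true_support(cells):
    """Population support, mixing reachable and unreachable voters in each cell."""
    return sum(
        c["pop_share"] * (c["reachable"] * c["support_reachable"]
                          + (1 - c["reachable"]) * c["support_unreachable"])
        for c in cells.values()
    )


def raw_estimate(cells):
    """Unweighted estimate: respondents arrive in proportion to reachable population."""
    reach = {k: c["pop_share"] * c["reachable"] for k, c in cells.items()}
    total = sum(reach.values())
    return sum(reach[k] / total * cells[k]["support_reachable"] for k in cells)


def post_stratified_estimate(cells):
    """Weights restore each cell's population share, but every respondent
    inside a cell still comes from the reachable group."""
    return sum(c["pop_share"] * c["support_reachable"] for c in cells.values())


truth = true_support(cells)
raw = raw_estimate(cells)
weighted = post_stratified_estimate(cells)
print(f"truth {truth:.3f} | raw {raw:.3f} | post-stratified {weighted:.3f}")
```

Here weighting moves the estimate from 0.45 to 0.48 against a true value of 0.50: the cell-size distortion is repaired, while the within-cell gap between reachable and unreachable voters remains. With other inputs the correction can be smaller or even go the wrong way.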
5. Question wording, order, and headline-scenario choice
Choice. (a) Order in which candidate names are read in the estimulado prompt; (b) whether familiarity / approval / rejection is asked before vote intention; (c) which scenario (estimulado / espontaneo / votos válidos) is reported as the headline.
Slant. (a) Published estimates put order effects at roughly 2–3 pp. (b) Asking approval first primes positivity toward whoever the respondent already likes. (c) The sponsor's candidate may be strongest in one scenario (e.g. high estimulado, weak espontaneo), so picking that scenario as the headline is a free lever.
TSE field. DS_METODOLOGIA_PESQUISA and DS_PLANO_AMOSTRAL
(narrative). Headline scenario is often only inferable from the
results file (cenário with the most prominence), not from
registration text alone.
Extraction. LLM → question_order_described: bool. Gap:
headline scenario choice is the part of this lever we cannot
easily capture from registration; it would need a separate pass over
the publication output (pesquisa_eleitoral_*_cenarios files)
cross-referenced with registration.
6. Non-response handling
Choice. What to do with undecideds and refusals: redistribute
proportionally to leaders, redistribute by demographic similarity, hold
as a separate category, or exclude from the denominator. These options
can yield different votos válidos numbers from the same raw responses.
Slant. Excluding undecideds inflates leaders disproportionately (since front-runners are more likely to be named first by partial-recallers). Redistributing proportionally locks in the current distribution. The choice changes the headline number without changing field collection.
TSE field. DS_PLANO_AMOSTRAL and DS_SISTEMA_CONTROLE
(narrative).
Extraction. LLM → nonresponse_handling: str (free-form notes;
hard to categorize sharply ex-ante).
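A small worked example of how the handling rule changes the headline from one set of raw responses. The counts and the similarity-based allocation are hypothetical, and the function names are illustrative.

```python
# Hypothetical raw responses out of 100 interviews.
raw = {"A": 40, "B": 30, "C": 10, "undecided": 20}


def hold_separate(raw):
    """Report undecideds as their own category."""
    n = sum(raw.values())
    return {k: v / n for k, v in raw.items()}


def exclude_undecided(raw):
    """Drop undecideds from the denominator (votos válidos)."""
    named = {k: v for k, v in raw.items() if k != "undecided"}
    n = sum(named.values())
    return {k: v / n for k, v in named.items()}


def redistribute(raw, allocation):
    """Assign undecideds to candidates according to `allocation` (sums to 1)."""
    n = sum(raw.values())
    u = raw["undecided"]
    return {k: (raw[k] + u * allocation[k]) / n for k in allocation}


separate = hold_separate(raw)
excluded = exclude_undecided(raw)
# Proportional redistribution uses the named-candidate shares as the allocation.
proportional = redistribute(raw, excluded)
# Assumed allocation from a demographic-similarity model (made up here).
similarity = redistribute(raw, {"A": 0.3, "B": 0.6, "C": 0.1})

for label, shares in [("separate", separate), ("excluded", excluded),
                      ("proportional", proportional), ("similarity", similarity)]:
    print(label, {k: round(v, 3) for k, v in shares.items()})
```

In this arithmetic, exclusion and proportional redistribution coincide (A at 0.50 in both), and the leader gains the most percentage points relative to the held-separate figure of 0.40. The similarity-based rule, under the assumed allocation, narrows the A versus B gap from 10 points to 4.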
Adjacent levers we are not capturing structurally
These move slant into Channel B in our decomposition — i.e. they will register as residual not explained by Spec 3 controls, even though they are real design choices.
- Interviewer instructions beyond what DS_SISTEMA_CONTROLE records (verbatim probing scripts, fallback prompts).
- In-field cluster rotation when targets are short: which clusters are re-tried vs replaced.
- Weighting trimming thresholds (post-stratification weight caps).
- Question-list rotation across respondents (whether order rotates per respondent or per cluster).
- Sequencing inside the questionnaire: exactly where vote intention sits relative to approval, rejection, familiarity, issue questions, and demographic backmatter.
- Sample replacement rules for non-contacts (next-door, next-cluster, drop).
The implication for interpretation: a stable β across Spec 1 → Spec 3
cannot cleanly distinguish "fabrication" from "design slant via
dimensions we didn't extract." Sharpening the extracted dictionary
narrows the residual but cannot eliminate it. See theory.md § Open
testability concern.
What this menu does not cover: Channel B
Channel B is the residual — results that do not match the declared methodology. Same firm, same declared sample design, same nominal sample size, but the numbers for the sponsor's candidate are tilted in a way the declared design cannot explain. This is the fabrication / cooking-the-numbers channel.
The complementary fraud tests on the residual (digit-frequency /
Benford on reported percentages, suspiciously clean numbers,
between-pollster consistency for the same race × date) live in
summary.md § Complementary fraud tests. They are not design
levers and are out of scope for this doc.
Dimensions surfaced in the 2020 poll-lawsuit pilot
A 50-case llmkit pilot over PESQUISA cases in
projects/electoral-justice/build/merge/proc.csv (filingyear 2020,
seed 42; see projects/poll-sponsor-bias/source/llm/pilot_poll_lawsuit.py)
extracted alleged-bias dimensions from the sentenca text. Two findings
that bear on this menu:
Caveat: the lawsuit universe is registration-driven, not
methodology-driven. Of the 50 pilot cases, 33 (66%) classified as
registration_missing, 7 as divulgation_violation, 5 as
enquete_not_pesquisa, 2 as registration_late. Zero had
methodology_bias or fabrication_allegation as the primary cause of
action. Implication: validating high-SponsoredBy_c polls against the
"sued" set (the todo.md § Complementary data use #1) will mostly
capture registration non-compliance, not design bias. Use the
lawsuit pass for surfacing the menu of bias dimensions that
petitioners can plausibly invoke, not as a quantitative proxy for
biased polls.
Allegations within the six-lever menu (appearing as secondary arguments in registration cases):
- Coverage exclusion — most common. "abrangeu apenas a zona urbana, excluindo a zona rural"; "não incluiu assentamentos específicos e alguns bairros"; "ausência de delimitação da área de aplicação".
- Question wording / order — specifically candidate omission from scenario lists. "ausência do nome do candidato ANTONIO CARLOS"; "pergunta sobre 2T não incluiu o nome do Representante e de outros candidatos". This is a sharper version of the order-effect lever worth flagging separately.
- Quota variables / declared-vs-collected mismatch — petitioners treat divergence between declared quota distribution and the collected sample as evidence of bias: "Irregularidade da faixa etária dos entrevistados, divergência no percentual de gênero, divergência nos dados econômicos na fonte de dados informada".
New dimensions NOT in the six-lever menu — candidates for addition or for explicit Channel B treatment:
- Pollster-as-sponsor shell: the pollster itself is the contratante, and the firm has low capital social or recent incorporation. "fraude consiste no registro de pesquisas onde a própria empresa prestadora é a pagante, o que causa estranheza devido ao baixo capital social"; "Evidências de ocultação do contratante da pesquisa". Cross-link between NR_CNPJ_EMPRESA (pollster) and NR_CNPJ_CONTRATANTE (sponsor) indicating they're related or that the contratante is a thinly-capitalized vehicle. Already partially captured by ST_PESQUISA_PROPRIA, but the shell-CNPJ pattern is a distinct flavor.
- Contract value below market: "contratada por um valor muito abaixo do comum de mercado". The petitioner uses VR_PAGO_CONTRATANTE (or VR_PESQUISA) anomalously low as a signal the field operation can't have been real. This makes the already-structured cost field a bias signal, not just a control.
- Data reuse / no fresh collection: "realizada com dados de uma pesquisa anterior, sem a devida realização da coleta de dados". A specific fabrication signature: the declared field period was fake; numbers come from a prior wave. Belongs in Channel B but is a detectable sub-form (between-wave consistency tests).
- Funding-source disclosure failure: DS_ORIGEM_RECURSO left blank or non-specific. "não informa a origem dos recursos despendidos". A registration-form failure that doubles as a bias signal (concealed sponsor → bias incentive).
- Missing required disclosure content: list of bairros, nota fiscal, period of collection. These are registration-form gaps that don't directly bias the result but co-occur with cases where the pollster is being uncooperative; useful as a confounder when treating other levers.
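Three of the dimensions above touch already-structured registration fields, so they could be flagged without LLM extraction. The sketch below is one hedged way to do that. The column names come from this doc; the function names, the flag labels, the value threshold, the `blank_markers` parameter, and the assumption that values arrive as strings and floats are all hypothetical. Exact CNPJ equality only catches the simplest self-sponsorship case; related-party shells would need external corporate data.

```python
import re


def digits_only(cnpj):
    """Normalize a CNPJ string to its digits so formatting differences do not matter."""
    return re.sub(r"\D", "", cnpj or "")


def registration_flags(row, low_value_threshold, blank_markers=("",)):
    """Return illustrative flag labels for one registration row (a dict).

    `low_value_threshold` and `blank_markers` are analyst choices, not
    values defined by TSE; markers are compared after strip() and upper().
    """
    flags = []

    pollster = digits_only(row.get("NR_CNPJ_EMPRESA"))
    sponsor = digits_only(row.get("NR_CNPJ_CONTRATANTE"))
    if pollster and pollster == sponsor:
        flags.append("pollster_is_sponsor")

    value = row.get("VR_PESQUISA")
    if value is not None and value < low_value_threshold:
        flags.append("contract_value_low")

    origin = (row.get("DS_ORIGEM_RECURSO") or "").strip().upper()
    if origin in blank_markers:
        flags.append("funding_source_missing")

    return flags


# Made-up row for illustration only.
example = {
    "NR_CNPJ_EMPRESA": "12.345.678/0001-90",
    "NR_CNPJ_CONTRATANTE": "12345678000190",
    "VR_PESQUISA": 500.0,
    "DS_ORIGEM_RECURSO": "   ",
}
print(registration_flags(example, low_value_threshold=1000.0))
```

A reasonable threshold would likely be set relative to comparable polls (same office, municipality size, sample size) rather than as a fixed amount.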
Schema iteration note. The llmkit pilot extracted these as free
text in lever_other_text rather than mapping them to enum codes,
because the prompt framed the lever enum as "design bias as theory of
harm" not "type of design dimension touched". This is the right
choice when the goal is to find dimensions outside the menu, but a
second-pass schema with the typology framing would give us countable
incidence of each lever across the lawsuit universe. Queued under
todo.md (next iteration of the poll-lawsuit extractor).
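As a sketch of what the typology framing could look like, the six levers can be coded as an enum and counted once per case. The enum member names, the string codes, and the sample cases below are hypothetical and are not the llmkit schema.

```python
from collections import Counter
from enum import Enum


class Lever(Enum):
    """Hypothetical typology codes: the six levers of this doc plus a catch-all."""
    COVERAGE_EXCLUSION = "coverage_exclusion"
    QUOTA_VARIABLES = "quota_variables"
    POPULATION_REFERENCE = "population_reference"
    MODE = "mode"
    QUESTION_WORDING_ORDER = "question_wording_order"
    NONRESPONSE_HANDLING = "nonresponse_handling"
    OTHER = "other"


def lever_incidence(cases):
    """Count how many cases touch each lever; a case counts once per lever."""
    counts = Counter()
    for codes in cases:
        # The set removes duplicates within a case; Lever(code) raises
        # ValueError on any code outside the enum.
        counts.update({Lever(code) for code in codes})
    return counts


# Made-up extractor output: one list of lever codes per lawsuit case.
pilot_like_cases = [
    ["coverage_exclusion", "question_wording_order"],
    ["coverage_exclusion", "coverage_exclusion"],
    ["other"],
    [],
]
incidence = lever_incidence(pilot_like_cases)
for lever in Lever:
    print(f"{lever.value}: {incidence[lever]}")
```

Keeping the free-text lever_other_text field alongside such an enum would preserve the pilot's ability to surface dimensions outside the menu.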
Per-lever expected slant direction
The decomposition test (Spec 1 → Spec 3 shrinkage) is
direction-agnostic: it tests whether the sponsor effect runs through the
methodology, not which direction each lever pushes. But for the
heterogeneity analyses in theory.md § Predictions (point 2), we
need priors on which levers matter where:
| Lever | Where slant room is largest |
|---|---|
| Coverage exclusion | Rural-heavy munis; munis with sharply geographically segregated bases |
| Quota variables | Races with strong demographic skew (income / education / race correlates with vote) |
| Population reference | Munis with high recent population change (census vs TSE registries diverge) |
| Mode | Heterogeneous internet / phone access; gated-community-heavy munis |
| Question order / scenario | Races with high non-recall / weak candidate name recognition |
| Non-response handling | Races with high undecided share (early polls, low-information races) |
These map directly to the muni-level moderators (rural share, demographic Gini, race competitiveness) the heterogeneity spec uses.
Status
Reference doc — content consolidated from summary.md, theory.md,
and the LLM extractor schema in todo.md. No new claims;
re-organization for use by (a) the LLM extractor implementation,
(b) the paper's mechanism section, (c) the EJ poll-lawsuits
descriptive pass — to check which of these levers shows up most
often in petitioners' bias claims.