title: Educloud playbook status: handoff (2026-06-01)
Educloud playbook — poll-sponsor-bias
What to do on the educloud server (/projects/ec113/henrik) now that the
laptop work has landed in git. SP was extracted via LLM on educloud
before the laptop bulk run started; the non-SP extraction is still
finishing on the laptop and will be merged in later.
1. Pull the latest commits from all four repos
cd ~/research/research && git pull
cd ~/research/pipelines/politica && git pull
cd ~/research/packages/llmkit && git pull && pip install -e .
cd ~/research/research-kit && git pull
What's new (commit summaries):
- research: new ideas/poll-sponsor-bias/ folder with summary.md (full project pitch + positioning + mechanism / Bayesian-persuasion decomposition + EJ lawsuit complement), literature.md (64 entries + positioning notes), references.bib, todo.md, coauthor_candidates.md, and this playbook. Plus ideas/index.md updated to point at the promoted folder.
- politica: source/clean/poll_sponsor.py (new; sponsor cleaning), source/llm/{schemas,poll_relatorio}.py and prompts/ (the llmkit migration), source/llm/poll_extract.py (rewritten runner with --states / --exclude-states / --validate-cached), docs/done.md and docs/todo.md.
- llmkit: Structured Outputs path + schema-aware cache key (both opt-in via use_structured_outputs=True / schema_in_cache_key=True); LLMCache.__bool__ fixed.
- research-kit: skill doc updated.
2. Verify the SP LLM extraction cache is intact
The laptop and educloud share a logical cache structure but separate
file systems. SP was extracted on educloud first; non-SP is finishing on
the laptop. Both write under pipelines/politica/build/llm/poll_relatorio/
keyed by hash(doc_id, text_hash, model), so once both halves are in
the same directory the runner assembles them transparently.
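Why copying the two halves into one directory is enough: the key depends only on the document, its text and the model, not on the machine that produced the entry. A minimal sketch of that idea, assuming a SHA-256 digest over the three fields. The real llmkit key function and its on-disk file naming are not shown in this playbook, so `cache_key`, `text_digest` and the example values below are hypothetical.

```python
import hashlib


def text_digest(text: str) -> str:
    """Hash of the extracted document text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(doc_id: str, text_hash: str, model: str) -> str:
    """Hypothetical stand-in for the llmkit cache key.

    Joins the three fields named in the playbook with a unit separator
    (so field boundaries stay unambiguous) and hashes the result.
    """
    payload = "\x1f".join([doc_id, text_hash, model])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same document, same text, same model: same key on either machine.
laptop_key = cache_key("SP-00001/2024", text_digest("relatorio text"), "some-model")
educloud_key = cache_key("SP-00001/2024", text_digest("relatorio text"), "some-model")

# A different model yields a different key, so entries never collide.
other_model_key = cache_key("SP-00001/2024", text_digest("relatorio text"), "other-model")
```

Because the key carries no host or path information, entries written on the laptop and on educloud can sit side by side without renaming.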
On educloud, confirm:
cd ~/research/pipelines/politica
ls build/llm/poll_relatorio/ | wc -l # should be ~1635 (SP protocols + a few extras)
python source/llm/poll_extract.py --year 2024 --states SP --validate-cached
--validate-cached is the PDF-free path: it re-validates every cached
entry against the current schema and writes the assembled parquet
without an LLM call. Output: build/llm/poll_relatorio_2024.parquet.
3. Stage the 2024 candidato registry
The 2024 TSE candidate registry lives only on educloud (Henrik's BA →
educloud sync; the local data/tse/candidato.csv cleaned by the
politica pipeline covers 1998–2022 only). Confirm the registry is
present:
ls -la "$DATA_DIR/tse/consulta_cand_2024_BRASIL.csv" # or the per-UF zips
If absent, pull from TSE dadosabertos: https://dadosabertos.tse.jus.br/dataset/candidatos-2024.
4. Run Step 1b — CPF + candidate-committee → 2024 candidato join
This is the load-bearing go/no-go number for the within-candidate FE design. The laptop computed a 16.1% sponsor-political-share via heuristic; Step 1b grounds that in the 2024 candidate registry.
Three routes to resolve sponsor → candidate; all three should run on educloud:
Route A — CPF sponsor → candidate. ~1,033 protocols have an
individual-CPF contratante. Join on cpf to the 2024 candidato
registry filtered to office == "PREFEITO". Returns the candidate
(if the CPF belongs to one); the residue is campaign managers /
treasurers, also informative as controls.
Route B — candidate-committee CNPJ → candidate name (deterministic
parse). The sponsor name pattern ELEIÇÃO 2024 {NOME COMPLETO} {PREFEITO|VICE-PREFEITO} lets us extract the candidate's full name from the
sponsor name without needing the registry — verified on 628 polls
in the laptop heuristic. Cross-check by joining the parsed name back
to candidato registry (fuzzy on name, exact on municipio).
Discrepancies flag committees with non-standard naming.
Route C — party CNPJ → party's prefeito candidate. ~728 polls. Match contratante CNPJ to TSE's partidos table → party → in that municipality, the party's single prefeito candidate (1:1 by the electoral-law constraint we lean on). Needs the partidos CNPJ table which TSE distributes alongside candidato.
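The Route A lookup and the Route B parse can be sketched with the standard library alone. This is an illustration under assumptions, not the code in poll_sponsor.py: the field names (`cpf`, `name`, `office`), the helper names and the toy registry rows are hypothetical, and the regex only covers the standard committee naming pattern quoted above.

```python
import re

# Route B: ELEIÇÃO 2024 {NOME COMPLETO} {PREFEITO|VICE-PREFEITO}
COMMITTEE_RE = re.compile(
    r"^ELEI[CÇ][AÃ]O\s+2024\s+(?P<name>.+?)\s+(?P<office>VICE-PREFEITO|PREFEITO)\s*$"
)


def parse_committee(sponsor_name):
    """Return (candidate_name, office), or None for non-standard naming."""
    m = COMMITTEE_RE.match(sponsor_name.strip().upper())
    if m is None:
        return None
    return m.group("name"), m.group("office")


def normalize_cpf(raw):
    """Keep digits only and restore leading zeros lost in CSV round-trips."""
    return re.sub(r"\D", "", raw).zfill(11)


# Route A: toy registry rows with made-up CPFs (hypothetical field names).
registry = [
    {"cpf": "123.456.789-09", "name": "MARIA DE SOUZA", "office": "PREFEITO"},
    {"cpf": "98765432100", "name": "JOSE LIMA", "office": "VEREADOR"},
]
prefeito_by_cpf = {
    normalize_cpf(row["cpf"]): row["name"]
    for row in registry
    if row["office"] == "PREFEITO"
}


def route_a(sponsor_cpf):
    """Candidate name if the sponsor CPF is a prefeito candidate, else None."""
    return prefeito_by_cpf.get(normalize_cpf(sponsor_cpf))
```

Sponsors for which `parse_committee` returns None are the non-standard committee names that the registry cross-check is meant to flag; CPFs for which `route_a` returns None form the residue of campaign managers and treasurers.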
Output the headline power number:
How many 2024 mayoral candidates appear in both a self-sponsored poll (CPF, committee CNPJ, or party CNPJ) and an other-sponsored poll within their own race?
Decision rule (from summary.md): if the within-candidate overlap is
in the hundreds, proceed with the within-candidate FE design. If only
a handful, fall back to the race-level / funding-flag designs.
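The headline count itself is a small grouping exercise once each candidate×poll row carries a self-sponsored flag. A minimal sketch, assuming rows of the form (race_id, candidate_id, self_sponsored); those names and the toy rows are hypothetical and would come from the sponsor join in practice.

```python
from collections import defaultdict


def within_candidate_overlap(rows):
    """Count candidates polled under both sponsorship types in their race.

    rows: iterable of (race_id, candidate_id, self_sponsored) tuples, one
    per candidate x poll observation. self_sponsored is True when the
    poll's sponsor resolves to that candidate by any of the routes.
    """
    seen = defaultdict(set)
    for race_id, candidate_id, self_sponsored in rows:
        seen[(race_id, candidate_id)].add(bool(self_sponsored))
    # A candidate counts only if both True and False were observed.
    return sum(1 for kinds in seen.values() if kinds == {True, False})


toy_rows = [
    ("race-1", "cand-a", True),   # self-sponsored poll
    ("race-1", "cand-a", False),  # other-sponsored poll, same race
    ("race-1", "cand-b", False),  # only other-sponsored
    ("race-2", "cand-c", True),   # only self-sponsored (twice)
    ("race-2", "cand-c", True),
]
overlap = within_candidate_overlap(toy_rows)
```

Only `cand-a` contributes here; candidates seen under a single sponsorship type carry no within-candidate variation and so add nothing to the FE design.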
Script lives at pipelines/politica/source/clean/poll_sponsor.py;
add a follow-up poll_sponsor.py that does the candidato
join and writes build/clean/poll_sponsor_2024.parquet
(adding columns sponsor_candidate_cpf, sponsor_candidate_name,
sponsor_route ∈ {cpf, committee, party}).
5. Build the candidate×poll analysis table on the SP slice first
SP has ~1,635 polls fully extracted (LLM done). That's ~11% of the 2024 mayoral corpus — enough to run the headline regression as a prototype and shake out the code path while the laptop bulk run finishes the other 25 UFs.
# Sketch: put under pipelines/politica/source/clean/ or projects/electoral-justice/source/assemble/
import os

import pandas as pd

DATA_DIR = os.environ['DATA_DIR']  # same variable as $DATA_DIR in step 3
poll_relatorio = pd.read_parquet('build/llm/poll_relatorio_2024.parquet')
sponsor = pd.read_parquet('build/clean/poll_sponsor_2024.parquet')
eleicao = pd.read_csv(f'{DATA_DIR}/tse/eleicao.csv')
# Filter to SP and the estimulado scenario (prefeito / primeiro turno filters still to be written)
sp = poll_relatorio[poll_relatorio['protocol'].str.startswith('SP')]
sp = sp[sp['scenario_type'] == 'estimulado']
# Still to do:
# ... join sponsor by protocol; join election results by (UF, municipio, candidate)
# error_{c,p} = poll_share - final_share
# match candidate names within race (name matching is the main risk)
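Since name matching is flagged as the main risk, here is one way the within-race match could look, using only the standard library. It is a sketch under assumptions: accent stripping plus a `difflib` similarity ratio, with a threshold of 0.85 that is a placeholder to be tuned, not a validated value. `normalize_name` and `match_candidate` are hypothetical helpers.

```python
import difflib
import unicodedata


def normalize_name(name):
    """Strip accents, upper-case, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.upper().split())


def match_candidate(poll_name, race_candidates, threshold=0.85):
    """Best match for poll_name among the candidates of one race.

    Returns the registry spelling of the best candidate, or None when no
    candidate reaches the (placeholder) similarity threshold.
    """
    target = normalize_name(poll_name)
    best_name, best_score = None, 0.0
    for candidate in race_candidates:
        score = difflib.SequenceMatcher(
            None, target, normalize_name(candidate)
        ).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


race = ["JOÃO DA SILVA", "MARIA SOUZA"]
hit = match_candidate("Joao da Silva", race)
miss = match_candidate("Pedro Alves", race)
```

Restricting the candidate list to a single race keeps the search space small, which is what makes a fuzzy threshold tolerable; unmatched names should be logged rather than silently dropped.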
6. Run preliminary regressions on the SP slice
Per summary.md § Identification:
error_{c,p} = β · SponsoredBy_c_{c,p}
+ λ_pollster + μ_(c × race) + f(days_to_election) + ε
Plus the three-spec decomposition (summary.md § Mechanism):
- Sponsor only.
- Add structured methodology controls (sample size, field period, ST_PESQUISA_PROPRIA, declared cost).
- Add LLM-extracted methodology features once the methodology extractor runs (pipelines/politica/source/llm/poll_methodology.py, queued in todo.md).
Reading: β shrinks across specs → Channel A (Bayesian persuasion via design); β stable → Channel B (residual / fabrication).
SP is large enough to give a meaningful first read but small enough that confidence intervals will be wide; treat as a code-pipeline sanity check, not the headline. Run the opponent-sponsored symmetry test in the same pass.
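To make the within-candidate comparison concrete, the sketch below computes β by demeaning both the error and the sponsorship flag within each candidate × race group. It is a deliberate simplification of the specification above: it omits the pollster fixed effects and the days-to-election control, uses no standard errors, and runs on made-up numbers. `within_beta` is a hypothetical helper, not part of the pipeline.

```python
from collections import defaultdict


def within_beta(obs):
    """Within-group (fixed-effects) slope for a single regressor.

    obs: list of (group, x, y), where group is the candidate x race cell,
    x is the self-sponsored indicator and y is the poll error.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # sum of x, sum of y, count
    for group, x, y in obs:
        s = sums[group]
        s[0] += x
        s[1] += y
        s[2] += 1
    sxy = sxx = 0.0
    for group, x, y in obs:
        sum_x, sum_y, n = sums[group]
        x_dev = x - sum_x / n  # deviation from the group mean
        y_dev = y - sum_y / n
        sxy += x_dev * y_dev
        sxx += x_dev * x_dev
    if sxx == 0.0:
        raise ValueError("no within-group variation in sponsorship")
    return sxy / sxx


# Toy data built as y = 2.0 * x + group effect, so beta should be 2.0.
toy = [
    ("A", 1, 3.0), ("A", 0, 1.0),                    # group effect  1.0
    ("B", 1, 1.5), ("B", 0, -0.5), ("B", 0, -0.5),   # group effect -0.5
    ("C", 0, 4.0),                                   # never self-sponsored
]
beta = within_beta(toy)
```

Group C drops out of the estimate entirely because its sponsorship flag never varies, which is why the overlap count from Step 1b determines the power of this design.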
7. Merge with the laptop bulk extraction
When the laptop run finishes (~2h after kickoff at 2026-06-01 09:53 sandbox time; full ~9,737 non-SP PDFs), copy the cache up to bi-dropbox with rclone and pull it down on educloud:
# on laptop (host shell, not sandbox)
rclone copy ~/research/pipelines/politica/build/llm/poll_relatorio/ \
bi-dropbox:data/TSE/2024/pesquisa_eleitoral/llm_cache/
# on educloud
rclone copy bi-dropbox:data/TSE/2024/pesquisa_eleitoral/llm_cache/ \
~/research/pipelines/politica/build/llm/poll_relatorio/
# re-assemble (run from ~/research/pipelines/politica)
python source/llm/poll_extract.py --year 2024 --validate-cached
The new cache entries have the new SO metadata
(api_params.response_format == "structured_outputs"); the SP entries
have legacy json_object metadata. Mixed cache is fine — assemble is
agnostic.
8. (Optional) PDF download for the literature
The 37 paywalled PDFs flagged in
ideas/poll-sponsor-bias/_pdfs/paywalled_urls.json need a campus-net
Chrome session. Educloud doesn't have a browser, so this is a host /
laptop step on the campus VPN. See todo.md for the priority Brazilian
papers.
9. Side: methodology free-text extraction
When you're ready to do the Channel A vs Channel B decomposition, the
methodology free-text extraction (DS_PLANO_AMOSTRAL,
DS_DADO_MUNICIPIO, etc.) is the next data step. Schema and command
queued in todo.md. Cost ~$15, can run on educloud or laptop. Best run
on educloud since OPENAI_API_KEY is shared.
What the laptop will hand off
When the laptop bulk run finishes the runner prints:
Live-run counts: {'ok': N, 'cached': M, 'image_only': K, 'error': J}
Assembled X candidate-scenario rows from Y polls → build/llm/poll_relatorio_2024.parquet
I'll commit and push that parquet's summary diagnostics (zero-sum sub-scenarios,
state coverage, scenario-type distribution) to the idea folder as
build_audit.md, not the parquet itself (too large for git). The cache
files at pipelines/politica/build/llm/poll_relatorio/ go via bi-dropbox
as documented in step 7.
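For orientation, the three diagnostics could be computed along these lines. This is a sketch, not the audit script: it assumes one dict per candidate-scenario row with fields `protocol`, `scenario_id`, `scenario_type` and `share`, reads "zero-sum" as a scenario whose shares add up to zero, and takes the state from the first two characters of the protocol. `scenario_id` and `share` in particular are hypothetical field names.

```python
from collections import Counter, defaultdict


def audit(rows):
    """Summary diagnostics over a list of candidate-scenario rows."""
    totals = defaultdict(float)
    scenario_types = {}
    for r in rows:
        key = (r["protocol"], r["scenario_id"])
        totals[key] += r["share"] or 0.0  # treat a missing share as 0
        scenario_types[key] = r["scenario_type"]
    protocols = {r["protocol"] for r in rows}
    return {
        # scenarios whose candidate shares sum to exactly zero
        "zero_sum_scenarios": sorted(k for k, t in totals.items() if t == 0.0),
        # number of distinct polls per state prefix
        "polls_by_state": dict(Counter(p[:2] for p in protocols)),
        # number of scenarios per scenario type
        "scenario_types": dict(Counter(scenario_types.values())),
    }


toy_rows = [
    {"protocol": "SP-00001/2024", "scenario_id": 1, "scenario_type": "estimulado", "share": 40.0},
    {"protocol": "SP-00001/2024", "scenario_id": 1, "scenario_type": "estimulado", "share": 35.0},
    {"protocol": "SP-00001/2024", "scenario_id": 2, "scenario_type": "espontaneo", "share": None},
    {"protocol": "RJ-00002/2024", "scenario_id": 1, "scenario_type": "estimulado", "share": 0.0},
]
report = audit(toy_rows)
```

The resulting dict is small enough to render into build_audit.md, while the parquet stays out of git.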