title: Educloud playbook status: handoff (2026-06-01)

Educloud playbook — poll-sponsor-bias

What to do on the educloud server (/projects/ec113/henrik) now that the laptop work has landed in git. SP was extracted via LLM on educloud before the laptop bulk run started; the non-SP extraction is still finishing on the laptop and will be merged in later.

1. Pull the latest commits from all four repos

cd ~/research/research          && git pull
cd ~/research/pipelines/politica && git pull
cd ~/research/packages/llmkit    && git pull && pip install -e .
cd ~/research/research-kit       && git pull

What's new (commit summaries):

2. Verify the SP LLM extraction cache is intact

The laptop and educloud share a logical cache structure but separate file systems. SP was extracted on educloud first; non-SP is finishing on the laptop. Both write under pipelines/politica/build/llm/poll_relatorio/ keyed by hash(doc_id, text_hash, model), so once both halves are in the same directory the runner assembles them transparently.
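To illustrate why the two halves can be dropped into one directory without collisions, here is a minimal sketch of a content-addressed key. The function name `cache_key`, the SHA-256 choice, and the example protocol and model strings are assumptions for illustration; the runner's actual hashing may differ.

```python
import hashlib


def cache_key(doc_id: str, text_hash: str, model: str) -> str:
    """Hypothetical cache key: depends only on the inputs, never on the machine."""
    # Join with a separator that cannot appear in the fields, then hash.
    payload = "\x1f".join([doc_id, text_hash, model]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Same document, same text, same model: identical key on laptop and educloud.
laptop_key = cache_key("SP-00001/2024", "abc123", "model-x")
educloud_key = cache_key("SP-00001/2024", "abc123", "model-x")
print(laptop_key == educloud_key)

# A different model yields a different entry, so the two never overwrite each other.
print(cache_key("SP-00001/2024", "abc123", "model-y") == laptop_key)
```

Because the key is a pure function of its inputs, copying files between machines can only add entries or overwrite an entry with an equivalent one.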

On educloud, confirm:

cd ~/research/pipelines/politica
ls build/llm/poll_relatorio/ | wc -l           # should be ~1635 (SP protocols + a few extras)
python source/llm/poll_extract.py --year 2024 --states SP --validate-cached

--validate-cached is the PDF-free path: it re-validates every cached entry against the current schema and writes the assembled parquet without an LLM call. Output: build/llm/poll_relatorio_2024.parquet.

3. Stage the 2024 candidato registry

The 2024 TSE candidate registry lives only on educloud (Henrik's BA → educloud sync; the local data/tse/candidato.csv cleaned by the politica pipeline covers 1998–2022 only). Confirm the registry is present:

ls -la "$DATA_DIR/tse/consulta_cand_2024_BRASIL.csv"  # or the per-UF zips

If absent, pull from TSE dadosabertos: https://dadosabertos.tse.jus.br/dataset/candidatos-2024.

4. Run Step 1b — CPF + candidate-committee → 2024 candidato join

This is the load-bearing go/no-go number for the within-candidate FE design. The laptop computed a 16.1% sponsor-political-share via heuristic; Step 1b grounds that in the 2024 candidate registry.

Three routes to resolve sponsor → candidate; all of them should run on educloud:

Route A — CPF sponsor → candidate. ~1,033 protocols have an individual-CPF contratante. Join on cpf to the 2024 candidato registry filtered to office == "PREFEITO". Returns the candidate (if the CPF belongs to one); the residue is campaign managers / treasurers, also informative as controls.

Route B — candidate-committee CNPJ → candidate name (deterministic parse). The sponsor name pattern ELEIÇÃO 2024 {NOME COMPLETO} {PREFEITO| VICE-PREFEITO} lets us extract the candidate's full name from the sponsor name without needing the registry — verified on 628 polls in the laptop heuristic. Cross-check by joining the parsed name back to candidato registry (fuzzy on name, exact on municipio). Discrepancies flag committees with non-standard naming.

Route C — party CNPJ → party's prefeito candidate. ~728 polls. Match contratante CNPJ to TSE's partidos table → party → in that municipality, the party's single prefeito candidate (1:1 by the electoral-law constraint we lean on). Needs the partidos CNPJ table which TSE distributes alongside candidato.
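Routes A and B can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the registry row keys (`cpf`, `name`, `office`), the helper names, and the toy data are hypothetical, and the real step would run against the 2024 candidato registry. Route C is omitted because it needs the partidos CNPJ table.

```python
import re

# Pattern from the text: ELEIÇÃO 2024 {NOME COMPLETO} {PREFEITO | VICE-PREFEITO}.
# The unaccented spelling is tolerated as well (an assumption about data quality).
COMMITTEE_RE = re.compile(
    r"ELEI[ÇC][ÃA]O 2024 (?P<name>.+?) (?P<office>VICE-PREFEITO|PREFEITO)"
)


def parse_committee_name(sponsor_name):
    """Route B: return (candidate_name, office), or None for non-standard names."""
    cleaned = " ".join(sponsor_name.upper().split())
    match = COMMITTEE_RE.fullmatch(cleaned)
    if match is None:
        return None
    return match.group("name"), match.group("office")


def match_cpf_sponsors(sponsor_cpfs, registry_rows):
    """Route A: split CPF sponsors into mayoral candidates and the residue."""
    prefeitos = {r["cpf"]: r for r in registry_rows if r["office"] == "PREFEITO"}
    matched = {cpf: prefeitos[cpf] for cpf in sponsor_cpfs if cpf in prefeitos}
    residue = [cpf for cpf in sponsor_cpfs if cpf not in prefeitos]
    return matched, residue


# Toy data, invented for illustration only.
registry = [
    {"cpf": "11111111111", "name": "MARIA DA SILVA", "office": "PREFEITO"},
    {"cpf": "22222222222", "name": "JOAO SOUZA", "office": "VEREADOR"},
]
matched, residue = match_cpf_sponsors(["11111111111", "33333333333"], registry)
print(matched["11111111111"]["name"], residue)
print(parse_committee_name("Eleição 2024 Maria da Silva Prefeito"))
print(parse_committee_name("Comitê Financeiro Municipal"))
```

Sponsor names that return `None` from the parser are the ones to inspect by hand, since they are the committees with non-standard naming.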

Output the headline power number:

How many 2024 mayoral candidates appear in both a self-sponsored poll (CPF, committee CNPJ, or party CNPJ) and an other-sponsored poll within their own race?

Decision rule (from summary.md): if the within-candidate overlap is in the hundreds, proceed with the within-candidate FE design. If only a handful, fall back to the race-level / funding-flag designs.
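A minimal sketch of the overlap count and the decision rule, assuming each candidate×poll row has already been reduced to a race id, a candidate id, and a self-sponsored flag. The row layout, the function names, and the cutoff of 100 (one reading of "in the hundreds") are assumptions, not values taken from summary.md.

```python
from collections import defaultdict


def within_candidate_overlap(rows):
    """Count (race, candidate) pairs seen in both self- and other-sponsored polls.

    Each row is (race_id, candidate_id, self_sponsored), with self_sponsored a bool.
    """
    kinds_seen = defaultdict(set)
    for race_id, candidate_id, self_sponsored in rows:
        kinds_seen[(race_id, candidate_id)].add(bool(self_sponsored))
    return sum(1 for kinds in kinds_seen.values() if kinds == {True, False})


def design_choice(overlap, threshold=100):
    """Hypothetical cutoff: proceed with within-candidate FE at or above it."""
    return "within-candidate FE" if overlap >= threshold else "race-level / funding-flag"


# Toy rows, invented for illustration only.
toy_rows = [
    ("race-1", "cand-A", True),   # cand-A polled by own committee...
    ("race-1", "cand-A", False),  # ...and by an outside sponsor: counts
    ("race-1", "cand-B", False),  # only other-sponsored: does not count
    ("race-2", "cand-C", True),   # only self-sponsored: does not count
]
overlap = within_candidate_overlap(toy_rows)
print(overlap, design_choice(overlap))
```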

Script lives at pipelines/politica/source/clean/poll_sponsor.py; add a follow-up script alongside it (under a different file name, so it does not overwrite poll_sponsor.py) that does the candidato join and writes build/clean/poll_sponsor_2024.parquet (adding columns sponsor_candidate_cpf, sponsor_candidate_name, sponsor_route ∈ {cpf, committee, party}).

5. Build the candidate×poll analysis table on the SP slice first

SP has ~1,635 polls fully extracted (LLM done). That's ~11% of the 2024 mayoral corpus — enough to run the headline regression as a prototype and shake out the code path while the laptop bulk run finishes the other 25 UFs.

# Sketch: put under pipelines/politica/source/clean/ or projects/electoral-justice/source/assemble/
import os

import pandas as pd

DATA_DIR = os.environ['DATA_DIR']  # assumes DATA_DIR is exported, as in step 3

poll_relatorio = pd.read_parquet('build/llm/poll_relatorio_2024.parquet')
sponsor        = pd.read_parquet('build/clean/poll_sponsor_2024.parquet')
eleicao        = pd.read_csv(f'{DATA_DIR}/tse/eleicao.csv')

# Filter to SP and the estimulado scenario (prefeito / primeiro turno filters still to add)
sp = poll_relatorio[poll_relatorio['protocol'].str.startswith('SP')]
sp = sp[sp['scenario_type'] == 'estimulado']
# ... join sponsor by protocol; join election results by (UF, municipio, candidate)
# error_{c,p} = poll_share - final_share
# match candidate names within race (name matching is the main risk)
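Since name matching is flagged as the main risk, here is one possible way to match a poll's candidate name to the official results within a single race, using only the standard library. The helper names, the 0.8 similarity cutoff, and the toy shares are assumptions; a production version would likely need extra rules, for example for ballot names that differ from registered names.

```python
import difflib
import unicodedata


def normalize(name):
    """Uppercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name.upper())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())


def match_in_race(poll_name, race_candidates, cutoff=0.8):
    """Return the best-matching official name within one race, or None."""
    by_norm = {normalize(c): c for c in race_candidates}
    hits = difflib.get_close_matches(
        normalize(poll_name), list(by_norm), n=1, cutoff=cutoff
    )
    return by_norm[hits[0]] if hits else None


def poll_error(poll_share, final_share):
    """error_{c,p} = poll_share - final_share, in percentage points."""
    return poll_share - final_share


# Toy race, invented for illustration only (official name -> final vote share).
final_shares = {"JOSÉ ANTÔNIO PEREIRA": 41.2, "ANA LIMA": 35.0}
official = match_in_race("Jose Antonio Pereira", final_shares)
error = poll_error(45.0, final_shares[official])
print(official, round(error, 1))
```

Restricting candidates to one race keeps the candidate pool small, which is what makes a fairly loose fuzzy cutoff tolerable.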

6. Run preliminary regressions on the SP slice

Per summary.md § Identification:

error_{c,p} = β · SponsoredBy_{c,p}
            + λ_pollster + μ_(c × race) + f(days_to_election) + ε_{c,p}

Plus the three-spec decomposition (summary.md § Mechanism):

  1. Sponsor only.
  2. Add structured methodology controls (sample size, field period, ST_PESQUISA_PROPRIA, declared cost).
  3. Add LLM-extracted methodology features once the methodology extractor runs (pipelines/politica/source/llm/poll_methodology.py, queued in todo.md).

Reading: β shrinks across specs → Channel A (Bayesian persuasion via design); β stable → Channel B (residual / fabrication).
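To make the role of the candidate×race fixed effect concrete, the sketch below estimates β by demeaning within each candidate×race group. It is deliberately simplified: it omits the pollster fixed effect and f(days_to_election), uses toy numbers, and the function name `within_beta` is hypothetical. The real specifications would be run with a proper regression package.

```python
from collections import defaultdict


def within_beta(rows):
    """Within-group OLS slope of error on the sponsor dummy.

    rows: iterable of (group, sponsored, error), where group identifies a
    candidate x race cell and sponsored is 0 or 1.
    """
    groups = defaultdict(list)
    for group, sponsored, error in rows:
        groups[group].append((float(sponsored), float(error)))

    sxy = 0.0
    sxx = 0.0
    for obs in groups.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    if sxx == 0:
        raise ValueError("no within-group variation in the sponsor dummy")
    return sxy / sxx


# Toy data, invented for illustration only.
toy = [
    ("cand-A|race-1", 1, 5.0),
    ("cand-A|race-1", 1, 3.0),
    ("cand-A|race-1", 0, 1.0),
    ("cand-A|race-1", 0, -1.0),
    ("cand-B|race-1", 0, 2.0),  # never self-sponsored: contributes nothing to beta
    ("cand-B|race-1", 0, 0.0),
]
beta = within_beta(toy)
print(beta)
```

Candidates who appear only in self-sponsored or only in other-sponsored polls drop out of the estimate, which is why the overlap count from Step 1b determines the power of this design.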

SP is large enough to give a meaningful first read but small enough that confidence intervals will be wide; treat as a code-pipeline sanity check, not the headline. Run the opponent-sponsored symmetry test in the same pass.

7. Merge with the laptop bulk extraction

When the laptop run finishes (~2h after kickoff at 2026-06-01 09:53 sandbox time; full ~9,737 non-SP PDFs), copy the cache up to bi-dropbox with rclone and pull it down on educloud:

# on laptop (host shell, not sandbox)
rclone copy ~/research/pipelines/politica/build/llm/poll_relatorio/ \
            bi-dropbox:data/TSE/2024/pesquisa_eleitoral/llm_cache/

# on educloud
rclone copy bi-dropbox:data/TSE/2024/pesquisa_eleitoral/llm_cache/ \
            ~/research/pipelines/politica/build/llm/poll_relatorio/

# re-assemble
python source/llm/poll_extract.py --year 2024 --validate-cached

The new cache entries carry the new structured-outputs (SO) metadata (api_params.response_format == "structured_outputs"); the SP entries carry legacy json_object metadata. A mixed cache is fine: the assemble step is agnostic to which format produced an entry.
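After the merge, a quick tally of the metadata can confirm that both halves arrived. The sketch below assumes each cache entry is a JSON file with an `api_params` object; the on-disk format, the `*.json` glob, and the function name are assumptions, so adjust them to the real cache layout.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def tally_response_formats(cache_dir):
    """Count cache entries by api_params.response_format (assumed JSON files)."""
    counts = Counter()
    for path in Path(cache_dir).glob("*.json"):
        entry = json.loads(path.read_text(encoding="utf-8"))
        fmt = entry.get("api_params", {}).get("response_format", "unknown")
        if not isinstance(fmt, str):
            # Keep the counter key hashable if the value is a nested object.
            fmt = json.dumps(fmt, sort_keys=True)
        counts[fmt] += 1
    return counts


# Demo on a throwaway directory with three invented entries.
with tempfile.TemporaryDirectory() as tmp:
    for i, fmt in enumerate(["structured_outputs", "json_object", "json_object"]):
        body = json.dumps({"api_params": {"response_format": fmt}})
        (Path(tmp) / f"entry{i}.json").write_text(body, encoding="utf-8")
    demo_counts = tally_response_formats(tmp)

print(dict(demo_counts))
```

On the merged cache, the json_object count should be close to the SP entry count from step 2 and the structured_outputs count close to the number of non-SP entries the laptop produced.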

8. (Optional) PDF download for the literature

The 37 paywalled PDFs flagged in ideas/poll-sponsor-bias/_pdfs/paywalled_urls.json need a campus-net Chrome session. Educloud doesn't have a browser, so this is a host / laptop step on the campus VPN. See todo.md for the priority Brazilian papers.

9. Side: methodology free-text extraction

When you're ready to do the Channel A vs Channel B decomposition, the methodology free-text extraction (DS_PLANO_AMOSTRAL, DS_DADO_MUNICIPIO, etc.) is the next data step. Schema and command queued in todo.md. Cost ~$15, can run on educloud or laptop. Best run on educloud since OPENAI_API_KEY is shared.

What the laptop will hand off

When the laptop bulk run finishes the runner prints:

Live-run counts: {'ok': N, 'cached': M, 'image_only': K, 'error': J}
Assembled X candidate-scenario rows from Y polls → build/llm/poll_relatorio_2024.parquet

I'll commit and push that parquet's summary diagnostics (zero-sum sub-scenarios, state coverage, scenario-type distribution) to the idea folder as build_audit.md, not the parquet itself (too large for git). The cache files at pipelines/politica/build/llm/poll_relatorio/ go via bi-dropbox as documented in step 7.
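Two of the diagnostics (state coverage and scenario-type distribution) could be produced along these lines. This is a sketch under assumptions: rows are taken to be dicts with `protocol` and `scenario_type` keys, the state is read from the first two characters of the protocol, and the toy rows and the `build_audit` name are invented. The zero-sum sub-scenario check is left out because its definition is not given here.

```python
from collections import Counter


def build_audit(rows):
    """Render state coverage and scenario-type counts as a small markdown report."""
    rows = list(rows)
    polls_by_state = Counter()
    seen_protocols = set()
    for row in rows:
        # Count each poll once, even though it has many candidate-scenario rows.
        if row["protocol"] not in seen_protocols:
            seen_protocols.add(row["protocol"])
            polls_by_state[row["protocol"][:2]] += 1
    scenario_types = Counter(row["scenario_type"] for row in rows)

    lines = ["# build_audit", "", "## State coverage (polls)"]
    lines += [f"- {uf}: {n}" for uf, n in sorted(polls_by_state.items())]
    lines += ["", "## Scenario-type distribution (rows)"]
    lines += [f"- {kind}: {n}" for kind, n in sorted(scenario_types.items())]
    return "\n".join(lines) + "\n"


# Toy rows, invented for illustration only.
toy_rows = [
    {"protocol": "SP-00001/2024", "scenario_type": "estimulado"},
    {"protocol": "SP-00001/2024", "scenario_type": "espontaneo"},
    {"protocol": "RJ-00002/2024", "scenario_type": "estimulado"},
]
report = build_audit(toy_rows)
print(report)
```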