HPO Sweep Guide
Overview
The HPO sweep system runs hyperparameter optimization on top of existing
zetta_utils training specs. A sweep spec is itself a normal builder/CUE
spec with "@type": "hpo_sweep". Running it with zetta run starts a
controller process that:
- samples or requests trial configs from a pluggable brain;
- writes one standalone training CUE file per launched attempt;
- launches each generated spec with zetta run;
- discovers status and metrics from Weights & Biases;
- passes one latest-attempt Trial per logical config back to the brain.
No new CLI command is required.
Minimal Sweep Spec
"@type": "hpo_sweep"
base_spec_path: "specs/team/project/train_base.cue"
specs_dir: "specs/team/project/sweeps/lr_thr"
sweep_name: "lr_thr"
hparams: [
    {
        "@type": "LogContinuousHParam"
        short_name: "lr"
        cue_var: "#LR"
        low: 1e-5
        high: 1e-3
    },
    {
        "@type": "DiscreteHParam"
        short_name: "thr"
        cue_var: "#THR"
        choices: [1, 2, 3, 5, 8]
    },
]
brain: {
    "@type": "RandomSearchBrain"
    max_trials: 20
    seed: 1
}
max_parallel: 3
poll_interval_sec: 120
Run it the same way as any other CUE spec:
zetta run specs/team/project/sweeps/lr_thr.cue
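As a rough illustration of what a random-search brain might draw from the two hparams in the spec above, the sketch below samples "lr" uniformly in log space and "thr" uniformly from its choices. The guide does not describe how RandomSearchBrain samples, so the function names and the sampling scheme here are assumptions for illustration only.

```python
import math
import random


def sample_log_continuous(rng: random.Random, low: float, high: float) -> float:
    """Draw a value uniformly in log space between low and high (assumed scheme)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_discrete(rng: random.Random, choices: list):
    """Pick one of the listed choices with equal probability (assumed scheme)."""
    return rng.choice(choices)


def sample_config(seed: int) -> dict:
    """Build one config keyed by short_name, mirroring the spec's two hparams."""
    rng = random.Random(seed)
    return {
        "lr": sample_log_continuous(rng, 1e-5, 1e-3),
        "thr": sample_discrete(rng, [1, 2, 3, 5, 8]),
    }


config = sample_config(seed=1)
```

Seeding the generator makes the draw reproducible, which is presumably the purpose of the seed field in the brain block.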
Base Spec Requirements
The base spec should be a normal training spec that already works with
zetta run. The HPO system performs conservative text replacement on it.
Required pieces:
- each swept CUE variable appears exactly once as a single-line declaration, such as #LR: 0.001;
- the trainer is a ZettaDefaultTrainer;
- the trainer either uses experiment_name: #EXP_NAME or has exactly one experiment_name field;
- #EXP_VERSION may exist already. If it does, HPO replaces it. If it does not, HPO prepends it.
The generated spec overrides:
- swept #VAR declarations;
- #EXP_NAME or the trainer experiment_name to the sweep name;
- #EXP_VERSION to the attempt ID;
- experiment_version to #EXP_VERSION;
- wandb_kwargs with HPO tags and config metadata.
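The "exactly once, single-line declaration" rule can be pictured with a small sketch. The helper below is hypothetical and is not the zetta_utils implementation; it only shows one way to replace a declaration line and refuse to proceed when the base spec is ambiguous.

```python
import re


def replace_cue_var(spec_text: str, cue_var: str, value) -> str:
    """Replace a single-line declaration such as `#LR: 0.001`.

    Hypothetical helper: raises if the declaration is missing or duplicated
    instead of guessing which line to edit.
    """
    # Match only lines that start with the variable followed by a colon,
    # so uses such as `lr: #LR` are left alone.
    pattern = re.compile(
        rf"^([ \t]*){re.escape(cue_var)}:[^\n]*$", re.MULTILINE
    )
    matches = pattern.findall(spec_text)
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one declaration of {cue_var}, found {len(matches)}"
        )
    # Keep the original indentation and rewrite the rest of the line.
    return pattern.sub(lambda m: f"{m.group(1)}{cue_var}: {value}", spec_text)


base = "#LR: 0.001\n#THR: 3\ntrainer: {lr: #LR}\n"
generated = replace_cue_var(base, "#LR", 0.0001)
```

Failing loudly on zero or multiple matches is the point of the sketch: a silent edit to the wrong line would produce a training run with the wrong hyperparameters.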
Generated Attempt Files
The controller distinguishes logical trials from launched attempts.
Logical trial ID:
lr0.0001_thr3
Attempt IDs:
lr0.0001_thr3-attempt-1
lr0.0001_thr3-attempt-2
Each attempt gets its own spec file:
specs/team/project/sweeps/lr_thr/lr0.0001_thr3-attempt-1.cue
The W&B run name is also the attempt ID. The W&B project is the sweep name.
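The naming scheme above can be sketched as follows. The helper names are hypothetical, and the way floats are rendered into the ID (Python's default string formatting) is an assumption; the real controller may format values differently.

```python
from pathlib import PurePosixPath


def logical_trial_id(config: dict) -> str:
    """Join short_name/value pairs, e.g. {"lr": 0.0001, "thr": 3} -> lr0.0001_thr3."""
    return "_".join(f"{name}{value}" for name, value in config.items())


def attempt_id(trial_id: str, attempt: int) -> str:
    """Every attempt, including the first, carries an -attempt-N suffix."""
    return f"{trial_id}-attempt-{attempt}"


def attempt_spec_path(specs_dir: str, attempt: str) -> str:
    """One generated CUE file per attempt inside specs_dir."""
    return str(PurePosixPath(specs_dir) / f"{attempt}.cue")


def split_attempt_id(attempt: str) -> tuple:
    """Recover (logical trial ID, attempt number) from an attempt ID."""
    trial_id, _, number = attempt.rpartition("-attempt-")
    return trial_id, int(number)


trial = logical_trial_id({"lr": 0.0001, "thr": 3})
first = attempt_id(trial, 1)
path = attempt_spec_path("specs/team/project/sweeps/lr_thr", first)
```

Because the suffix is uniform, the attempt ID can be split back into its parts without special-casing the first attempt.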
Brain Actions
Brains return a list of actions from suggest.
- HPOStartAction: Contains a config, for example {"lr": 0.0001, "thr": 3}. The controller derives the logical trial ID from the config and launches the next attempt number. Returning the same config again means retry the same logical trial as a new attempt.
- HPOKillAction: Contains a logical trial ID. The controller terminates all active local zetta run subprocesses for that logical trial.
The controller owns process management, spec generation, attempt numbering, and W&B discovery. The brain owns optimization policy, retry policy, early-stop policy, and sweep completion policy.
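To make the division of responsibility concrete, here is a toy policy written against stand-in types. StartAction and KillAction below are simplified placeholders for HPOStartAction and HPOKillAction; their field names, the shape of the trials mapping, and the suggest signature are all assumptions, not the actual zetta_utils interface.

```python
from dataclasses import dataclass


@dataclass
class StartAction:
    """Stand-in for HPOStartAction: carries a config to launch."""
    config: dict


@dataclass
class KillAction:
    """Stand-in for HPOKillAction: carries a logical trial ID to stop."""
    logical_trial_id: str


def suggest(trials: dict, max_attempts: int = 2, loss_limit: float = 10.0) -> list:
    """Toy policy: retry failed trials once, stop running trials that diverge.

    `trials` maps logical trial ID -> latest-attempt info (hypothetical shape).
    """
    actions = []
    for trial_id, trial in trials.items():
        if trial["status"] == "failed" and trial["attempt"] < max_attempts:
            # Same config again: the controller would launch the next attempt.
            actions.append(StartAction(config=trial["config"]))
        elif trial["status"] == "running" and trial["loss"] > loss_limit:
            actions.append(KillAction(logical_trial_id=trial_id))
    return actions


trials = {
    "lr0.0001_thr3": {"config": {"lr": 0.0001, "thr": 3}, "status": "failed", "attempt": 1, "loss": 0.0},
    "lr0.001_thr5": {"config": {"lr": 0.001, "thr": 5}, "status": "running", "attempt": 1, "loss": 42.0},
    "lr0.0005_thr2": {"config": {"lr": 0.0005, "thr": 2}, "status": "failed", "attempt": 2, "loss": 0.0},
}
actions = suggest(trials)
```

Note that the policy never touches processes or files; it only returns actions, which matches the ownership split described above.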
W&B Metadata
Each generated training spec passes HPO metadata through
ZettaDefaultTrainer.wandb_kwargs. The metadata is intentionally explicit so
the controller does not have to rely only on display-name parsing.
Tags include:
- hpo
- hpo_sweep:<sweep_name>
- hpo_trial:<logical_trial_id>
- hpo_attempt:<attempt_number>
Config includes:
- hpo_sweep_name
- hpo_logical_trial_id
- hpo_attempt_id
- hpo_attempt
- hpo_config
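Put together, the metadata for one attempt might look like the dictionary built below. The tag and config keys are taken from the lists above; the helper name and the exact nesting under wandb_kwargs are assumptions (tags and config are standard keyword arguments of wandb.init, but how the trainer forwards them is not specified here).

```python
def build_wandb_kwargs(sweep_name: str, trial_id: str, attempt: int, config: dict) -> dict:
    """Hypothetical helper assembling HPO tags and config for one attempt."""
    attempt_id = f"{trial_id}-attempt-{attempt}"
    return {
        "tags": [
            "hpo",
            f"hpo_sweep:{sweep_name}",
            f"hpo_trial:{trial_id}",
            f"hpo_attempt:{attempt}",
        ],
        "config": {
            "hpo_sweep_name": sweep_name,
            "hpo_logical_trial_id": trial_id,
            "hpo_attempt_id": attempt_id,
            "hpo_attempt": attempt,
            "hpo_config": dict(config),  # copy so later edits do not leak in
        },
    }


kwargs = build_wandb_kwargs("lr_thr", "lr0.0001_thr3", 2, {"lr": 0.0001, "thr": 3})
```

With both tags and config present, runs can be grouped by sweep or logical trial without parsing the display name.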
Metric Collection
Brains declare the W&B metrics they need via required_metrics. The
controller collects those metrics for each W&B run on every poll.
Available metric helpers:
- LatestMetric: Reads the latest value from run.summary.
- HistoryMetric: Reads the full metric curve from run.history.
- WindowMetric: Reads the last N values from run.history.
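The difference between the three helpers can be shown on plain in-memory data. The functions below are illustrative stand-ins that operate on a summary dict and a list of history rows; they do not call W&B and are not the actual LatestMetric, HistoryMetric, or WindowMetric classes.

```python
def latest_metric(summary: dict, key: str):
    """Latest value only, as a run summary would report it."""
    return summary.get(key)


def history_metric(rows: list, key: str) -> list:
    """Full curve: every logged value of `key`, in logging order."""
    return [row[key] for row in rows if key in row]


def window_metric(rows: list, key: str, n: int) -> list:
    """Last n logged values of `key`."""
    if n <= 0:
        raise ValueError("n must be positive")
    return history_metric(rows, key)[-n:]


# Hypothetical logged data for one run.
summary = {"val/loss": 0.31}
rows = [
    {"step": 0, "val/loss": 0.90},
    {"step": 1, "train/loss": 0.70},  # row without the metric is skipped
    {"step": 2, "val/loss": 0.45},
    {"step": 3, "val/loss": 0.31},
]

latest = latest_metric(summary, "val/loss")
curve = history_metric(rows, "val/loss")
tail = window_metric(rows, "val/loss", 2)
```

A brain that only ranks finished trials needs the latest value, while an early-stopping policy would typically look at the curve or a trailing window.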
Operational Notes
- max_parallel counts active attempts discovered locally or through W&B.
- After controller restart, W&B-active attempts still count against max_parallel.
- A restarted controller can observe old active attempts through W&B but cannot terminate old local subprocess handles that no longer exist.
- Killing a local zetta run subprocess is expected to trigger clean downstream cleanup through the existing zetta run remote-training behavior. That cleanup contract is intentionally kept outside the HPO brain.
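One way to read the max_parallel rule is as a count over the union of attempts seen locally and attempts seen through W&B. The sketch below assumes attempts are identified by attempt ID and deduplicated across both sources; that bookkeeping is an assumption, not a description of the controller's code.

```python
def free_slots(max_parallel: int, local_active: set, wandb_active: set) -> int:
    """Number of new attempts that may start right now (hypothetical helper)."""
    # An attempt visible both locally and in W&B counts once.
    active = local_active | wandb_active
    return max(0, max_parallel - len(active))


# Before a restart: two local subprocesses, both also visible in W&B.
before = free_slots(3, {"a-attempt-1", "b-attempt-1"}, {"a-attempt-1", "b-attempt-1"})

# After a restart: no local handles, but W&B still reports both as running.
after = free_slots(3, set(), {"a-attempt-1", "b-attempt-1"})
```

In this reading a restart does not free capacity: attempts that are still running according to W&B keep occupying slots even though the new controller holds no subprocess handle for them.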
Why These Decisions
- Use zetta run instead of a new CLI command
Existing training specs already go through builder loading, CUE parsing, remote launch, and run tracking. Keeping HPO as "@type": "hpo_sweep" reuses that path and keeps generated trial specs easy to rerun manually.
- Generate standalone CUE specs per attempt
A failed or interesting attempt can be debugged by running the generated file directly. It also makes controller state easier to reconstruct after a restart.
- Separate logical trial IDs from attempt IDs
A config can be retried without losing the identity of the hyperparameter point being evaluated. The brain sees one latest attempt per logical trial, while W&B and filenames still keep every attempt distinct.
- Make every run use -attempt-N, including the first attempt
Uniform naming avoids special cases in parsing, restart handling, and generated filenames.
- Put retry decisions in the brain
Retry behavior is part of optimization policy. Some brains may retry failed configs, some may mark them terminal, and some may use richer failure handling. The controller only executes actions.
- Use explicit W&B metadata in addition to run names
W&B display names are convenient for humans but are not a strong data model. Explicit config/tags make grouping and restart logic less fragile.
- Use conservative text replacement for CUE generation
The implementation only edits clearly identified declaration lines and trainer fields. If the base spec is ambiguous, generation fails rather than silently producing a wrong training spec.