HPO Sweep Guide

Overview

The HPO sweep system runs hyperparameter optimization on top of existing zetta_utils training specs. A sweep spec is itself a normal builder/CUE spec with "@type": "hpo_sweep". Running it with zetta run starts a controller process that:

  • samples or requests trial configs from a pluggable brain;

  • writes one standalone training CUE file per launched attempt;

  • launches each generated spec with zetta run;

  • discovers status and metrics from Weights & Biases;

  • passes one Trial per logical config, built from its latest attempt, back to the brain.

No new CLI command is required.

Minimal Sweep Spec

"@type": "hpo_sweep"

base_spec_path: "specs/team/project/train_base.cue"
specs_dir:      "specs/team/project/sweeps/lr_thr"
sweep_name:     "lr_thr"

hparams: [
    {
        "@type":    "LogContinuousHParam"
        short_name: "lr"
        cue_var:    "#LR"
        low:        1e-5
        high:       1e-3
    },
    {
        "@type":    "DiscreteHParam"
        short_name: "thr"
        cue_var:    "#THR"
        choices:    [1, 2, 3, 5, 8]
    },
]

brain: {
    "@type":    "RandomSearchBrain"
    max_trials: 20
    seed:       1
}

max_parallel:      3
poll_interval_sec: 120

Run it the same way as any other CUE spec:

zetta run specs/team/project/sweeps/lr_thr.cue
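The guide does not say how each hparam type is sampled. The sketch below assumes that LogContinuousHParam means uniform sampling in log space between low and high, and that DiscreteHParam means a uniform pick from choices. The function names are hypothetical and are not part of zetta_utils.

```python
import math
import random


def sample_log_continuous(rng: random.Random, low: float, high: float) -> float:
    """Draw uniformly in log space, so each decade of the range is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def sample_config(rng: random.Random) -> dict:
    # Keys are the short_name values from the sweep spec above.
    return {
        "lr": sample_log_continuous(rng, 1e-5, 1e-3),
        "thr": rng.choice([1, 2, 3, 5, 8]),
    }


rng = random.Random(1)  # fixed seed, mirroring seed: 1 in the brain block
configs = [sample_config(rng) for _ in range(20)]  # mirrors max_trials: 20
```

Seeding a dedicated random.Random instance, rather than the module-level generator, keeps the sampled sequence reproducible regardless of other code that uses randomness.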

Base Spec Requirements

The base spec should be a normal training spec that already works with zetta run. The HPO system performs conservative text replacement on it.

Required pieces:

  • each swept CUE variable appears exactly once as a single-line declaration, such as #LR: 0.001;

  • the trainer is a ZettaDefaultTrainer;

  • the trainer either uses experiment_name: #EXP_NAME or has exactly one experiment_name field;

  • #EXP_VERSION may exist already. If it does, HPO replaces it. If it does not, HPO prepends it.

The generated spec overrides:

  • swept #VAR declarations;

  • #EXP_NAME or the trainer experiment_name to the sweep name;

  • #EXP_VERSION to the attempt ID;

  • experiment_version to #EXP_VERSION;

  • wandb_kwargs with HPO tags and config metadata.
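The "exactly once, single-line" requirement can be illustrated with a small sketch. This is not the actual implementation; override_cue_var is a hypothetical helper that shows one way to replace a declaration line and refuse ambiguous input.

```python
import re


def override_cue_var(spec_text: str, cue_var: str, value: object) -> str:
    """Replace the single-line declaration of cue_var, or fail if it is ambiguous."""
    # Match only lines that start (after indentation) with "<cue_var>:".
    pattern = re.compile(rf"^([ \t]*){re.escape(cue_var)}:[^\n]*$", re.MULTILINE)
    count = len(pattern.findall(spec_text))
    if count != 1:
        raise ValueError(
            f"expected exactly one declaration of {cue_var}, found {count}"
        )
    # Keep the original indentation and rewrite the rest of the line.
    return pattern.sub(lambda m: f"{m.group(1)}{cue_var}: {value}", spec_text)


base = "#LR: 0.001\n#THR: 3\ntrainer: lr: #LR\n"
generated = override_cue_var(base, "#LR", 0.0001)
print(generated)
```

Uses of #LR elsewhere in the spec, such as the trainer line, are left untouched because only lines beginning with the declaration are matched.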

Generated Attempt Files

The controller distinguishes logical trials from launched attempts.

Logical trial ID:

lr0.0001_thr3

Attempt IDs:

lr0.0001_thr3-attempt-1
lr0.0001_thr3-attempt-2

Each attempt gets its own spec file:

specs/team/project/sweeps/lr_thr/lr0.0001_thr3-attempt-1.cue

The W&B run name is also the attempt ID. The W&B project is the sweep name.
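The naming scheme above can be sketched as follows. The helper names are hypothetical, and the value formatting is an assumption: the sketch uses Python's default string conversion, which yields 0.0001 for this example but would print smaller values in scientific notation.

```python
import re


def logical_trial_id(config: dict) -> str:
    # Join "<short_name><value>" pairs in config order.
    return "_".join(f"{name}{value}" for name, value in config.items())


def attempt_id(trial_id: str, attempt: int) -> str:
    # Every attempt carries the suffix, including the first one.
    return f"{trial_id}-attempt-{attempt}"


_ATTEMPT_RE = re.compile(r"^(?P<trial>.+)-attempt-(?P<n>\d+)$")


def parse_attempt_id(value: str) -> tuple[str, int]:
    """Split an attempt ID back into (logical trial ID, attempt number)."""
    match = _ATTEMPT_RE.match(value)
    if match is None:
        raise ValueError(f"not an attempt ID: {value!r}")
    return match.group("trial"), int(match.group("n"))


trial = logical_trial_id({"lr": 0.0001, "thr": 3})
first = attempt_id(trial, 1)
spec_path = f"specs/team/project/sweeps/lr_thr/{first}.cue"
```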

Brain Actions

Brains return a list of actions from suggest.

HPOStartAction

Contains a config, for example {"lr": 0.0001, "thr": 3}. The controller derives the logical trial ID from the config and launches the next attempt number. Returning the same config again retries the same logical trial as a new attempt.

HPOKillAction

Contains a logical trial ID. The controller terminates all active local zetta run subprocesses for that logical trial.

The controller owns process management, spec generation, attempt numbering, and W&B discovery. The brain owns optimization policy, retry policy, early-stop policy, and sweep completion policy.
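The division of labor can be sketched with stand-in types. StartAction, KillAction, ControllerState, and apply_actions are hypothetical names used for illustration only. The real HPOStartAction and HPOKillAction may carry other fields, and the real controller launches and terminates subprocesses rather than recording events.

```python
from dataclasses import dataclass, field


@dataclass
class StartAction:  # hypothetical stand-in for HPOStartAction
    config: dict


@dataclass
class KillAction:  # hypothetical stand-in for HPOKillAction
    logical_trial_id: str


@dataclass
class ControllerState:
    attempts: dict = field(default_factory=dict)  # logical trial ID -> attempts launched
    active: dict = field(default_factory=dict)    # attempt ID -> logical trial ID


def apply_actions(state: ControllerState, actions: list) -> list:
    """Execute brain actions; the controller only numbers attempts and tracks processes."""
    events = []
    for action in actions:
        if isinstance(action, StartAction):
            trial_id = "_".join(f"{k}{v}" for k, v in action.config.items())
            number = state.attempts.get(trial_id, 0) + 1
            state.attempts[trial_id] = number
            attempt = f"{trial_id}-attempt-{number}"
            state.active[attempt] = trial_id
            events.append(("start", attempt))
        elif isinstance(action, KillAction):
            # A kill targets every active attempt of the logical trial.
            doomed = [a for a, t in state.active.items() if t == action.logical_trial_id]
            for attempt in doomed:
                del state.active[attempt]
                events.append(("kill", attempt))
    return events


state = ControllerState()
config = {"lr": 0.0001, "thr": 3}
events = apply_actions(
    state,
    [StartAction(config), StartAction(config), KillAction("lr0.0001_thr3")],
)
```

In this sketch the second StartAction with the same config becomes attempt 2 of the same logical trial, and the kill removes both active attempts.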

W&B Metadata

Each generated training spec passes HPO metadata through ZettaDefaultTrainer.wandb_kwargs. The metadata is intentionally explicit so the controller does not have to rely only on display-name parsing.

Tags include:

  • hpo

  • hpo_sweep:<sweep_name>

  • hpo_trial:<logical_trial_id>

  • hpo_attempt:<attempt_number>

Config includes:

  • hpo_sweep_name

  • hpo_logical_trial_id

  • hpo_attempt_id

  • hpo_attempt

  • hpo_config
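As an illustration, the metadata for one attempt might be assembled like this. The tag and config key names come from the lists above. The top-level "tags" and "config" keys, and the helper name hpo_wandb_kwargs, are assumptions about how wandb_kwargs is shaped.

```python
def hpo_wandb_kwargs(sweep_name: str, config: dict, attempt: int) -> dict:
    """Build HPO tags and config metadata for one attempt (illustrative shape)."""
    trial_id = "_".join(f"{k}{v}" for k, v in config.items())
    return {
        "tags": [
            "hpo",
            f"hpo_sweep:{sweep_name}",
            f"hpo_trial:{trial_id}",
            f"hpo_attempt:{attempt}",
        ],
        "config": {
            "hpo_sweep_name": sweep_name,
            "hpo_logical_trial_id": trial_id,
            "hpo_attempt_id": f"{trial_id}-attempt-{attempt}",
            "hpo_attempt": attempt,
            "hpo_config": dict(config),  # copy, so later edits do not leak in
        },
    }


kwargs = hpo_wandb_kwargs("lr_thr", {"lr": 0.0001, "thr": 3}, attempt=2)
```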

Metric Collection

Brains declare the W&B metrics they need via required_metrics. The controller collects those metrics for each W&B run on every poll.

Available metric helpers:

LatestMetric

Reads the latest value from run.summary.

HistoryMetric

Reads the full metric curve from run.history.

WindowMetric

Reads the last N values from run.history.
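The difference between the three helpers can be shown on plain data. The functions below are hypothetical and are not the LatestMetric, HistoryMetric, or WindowMetric classes themselves. They assume the summary is available as a mapping and the history as a list of per-step rows, where a row may lack a metric that was not logged at that step.

```python
from typing import Any, Mapping, Optional, Sequence


def latest_metric(summary: Mapping[str, Any], key: str) -> Optional[float]:
    """Latest value only, as stored in the run summary."""
    return summary.get(key)


def history_metric(rows: Sequence[Mapping[str, Any]], key: str) -> list:
    """Full curve: every logged value of key, in step order."""
    return [row[key] for row in rows if row.get(key) is not None]


def window_metric(rows: Sequence[Mapping[str, Any]], key: str, n: int) -> list:
    """Last n logged values of key."""
    if n <= 0:
        return []
    return history_metric(rows, key)[-n:]


summary = {"val_loss": 0.21}
rows = [
    {"_step": 0, "val_loss": 0.9},
    {"_step": 1, "train_loss": 0.7},  # val_loss not logged at this step
    {"_step": 2, "val_loss": 0.4},
    {"_step": 3, "val_loss": 0.21},
]
curve = history_metric(rows, "val_loss")
tail = window_metric(rows, "val_loss", 2)
```

A brain that only ranks finished trials needs the latest value, while an early-stop policy typically needs the curve or a trailing window.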

Operational Notes

  • max_parallel caps the number of active attempts, counting those discovered locally or through W&B.

  • After controller restart, W&B-active attempts still count against max_parallel.

  • A restarted controller can observe old active attempts through W&B but cannot terminate them, because the local subprocess handles from the previous controller no longer exist.

  • Killing a local zetta run subprocess is expected to trigger clean downstream cleanup through the existing zetta run remote-training behavior. That cleanup contract is intentionally kept outside the HPO brain.
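The max_parallel accounting described in these notes can be sketched as a union of two sets of attempt IDs. free_slots is a hypothetical helper, and the way each set is obtained is left out.

```python
def free_slots(max_parallel: int, local_active: set, wandb_active: set) -> int:
    """Number of new attempts that may be launched right now."""
    # An attempt seen both locally and in W&B must be counted once.
    active = local_active | wandb_active
    return max(0, max_parallel - len(active))


# Normal operation: one attempt is visible through both sources.
normal = free_slots(3, {"a-attempt-1"}, {"a-attempt-1", "b-attempt-1"})

# After a restart: no local handles, but W&B still reports three active attempts.
after_restart = free_slots(3, set(), {"a-attempt-1", "b-attempt-1", "c-attempt-1"})
```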

Why These Decisions

Use zetta run instead of a new CLI command

Existing training specs already go through builder loading, CUE parsing, remote launch, and run tracking. Keeping HPO as "@type": "hpo_sweep" reuses that path and keeps generated trial specs easy to rerun manually.

Generate standalone CUE specs per attempt

A failed or interesting attempt can be debugged by running the generated file directly. It also makes controller state easier to reconstruct after a restart.

Separate logical trial IDs from attempt IDs

A config can be retried without losing the identity of the hyperparameter point being evaluated. The brain sees one latest attempt per logical trial, while W&B and filenames still keep every attempt distinct.
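A minimal sketch of the "one latest attempt per logical trial" view, assuming each discovered run is represented as a dict carrying the hpo_logical_trial_id and hpo_attempt config values. The latest_attempts helper and the state field are hypothetical.

```python
def latest_attempts(runs: list) -> dict:
    """Keep only the highest-numbered attempt for each logical trial."""
    latest = {}
    for run in runs:
        trial_id = run["hpo_logical_trial_id"]
        current = latest.get(trial_id)
        if current is None or run["hpo_attempt"] > current["hpo_attempt"]:
            latest[trial_id] = run
    return latest


runs = [
    {"hpo_logical_trial_id": "lr0.0001_thr3", "hpo_attempt": 1, "state": "failed"},
    {"hpo_logical_trial_id": "lr0.0001_thr3", "hpo_attempt": 2, "state": "running"},
    {"hpo_logical_trial_id": "lr0.001_thr5", "hpo_attempt": 1, "state": "finished"},
]
view = latest_attempts(runs)
```

Every attempt remains in the input list, so nothing is lost for inspection in W&B or on disk; only the view handed to the brain is collapsed.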

Make every run use -attempt-N, including the first attempt

Uniform naming avoids special cases in parsing, restart handling, and generated filenames.

Put retry decisions in the brain

Retry behavior is part of optimization policy. Some brains may retry failed configs, some may mark them terminal, and some may use richer failure handling. The controller only executes actions.

Use explicit W&B metadata in addition to run names

W&B display names are convenient for humans but are not a strong data model. Explicit config/tags make grouping and restart logic less fragile.

Use conservative text replacement for CUE generation

The implementation only edits clearly identified declaration lines and trainer fields. If the base spec is ambiguous, generation fails rather than silently producing a wrong training spec.