Model Development
Model Training & Evaluation
Model Training & Evaluation turns governed dataset versions into tracked, comparable, recoverable training jobs and explicit evaluation evidence across local, cloud, and on-prem compute. This is where dataset lineage becomes model behavior that teams can measure, compare, and eventually promote.
What This Surface Owns
Model Training & Evaluation owns run launch, compute coordination, evaluation execution, runtime observability, and execution history.
- Launch training from immutable dataset snapshots and explicit configs.
- Manage heterogeneous compute environments and distributed execution.
- Track job state, logs, metrics, checkpoints, and runtime metadata.
- Preserve the exact execution environment needed for reproducibility.
Promotion into durable model identity still belongs to the Model Registry, but this surface owns the work and evidence that make promotion possible.
Compute Abstraction
Cloud compute
- Launch runs on managed cloud GPU environments when capacity and speed matter most.
- Scale up for sweeps, large batch experimentation, or release-critical validation.
- Keep cloud allocation tied to the same run identity and provenance as every other environment.
On-prem and private cluster compute
- Use Kubernetes, Slurm, or internal schedulers without breaking lifecycle traceability.
- Preserve cost, queue time, hardware class, and environment assumptions as run metadata.
- Useful for teams with existing internal ML infrastructure.
Hybrid execution
- Debug locally, then burst to cloud or on-prem clusters for larger sweeps.
- Maintain one run model across heterogeneous compute rather than separate “local” and “production” training systems, as the sketch after this list illustrates.
- Good fit for teams balancing fast iteration with cost control.
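One way to keep a single run model across these environments is to treat compute as structured metadata attached to the run rather than as a separate system of record. A minimal Python sketch, with every class and field name hypothetical rather than part of any actual platform API:

```python
from dataclasses import dataclass
from enum import Enum

class ComputeKind(Enum):
    LOCAL = "local"
    CLOUD = "cloud"
    ON_PREM = "on_prem"

@dataclass(frozen=True)
class ComputeTarget:
    """One compute environment, described uniformly regardless of backend."""
    kind: ComputeKind
    hardware_class: str        # e.g. "h100-8x" (illustrative class name)
    scheduler: str             # e.g. "kubernetes", "slurm", "managed-cloud"
    queue: str | None = None   # on-prem queue name, if any

@dataclass
class RunComputeRecord:
    """Compute facts preserved as run metadata: cost, queue time, hardware."""
    run_id: str
    target: ComputeTarget
    cost_usd: float | None = None
    queue_time_s: float | None = None
```

Because every run carries the same record shape, a local debugging run and a cloud sweep run stay comparable entries in one execution history.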
Run Definition
Every training run should be defined by explicit inputs rather than ambient context; the sketch following the list below shows one way to make those inputs concrete.
Required run inputs
- dataset snapshot or set of dataset snapshots
- training configuration and hyperparameters
- code revision or training package version
- dependency/runtime surface
- hardware target or compute class
- evaluation pack if required
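As a sketch of what “explicit inputs rather than ambient context” can look like in code, assuming a Python submission layer (all names hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunDefinition:
    """Explicit inputs that fully define a training run; no ambient context."""
    dataset_snapshots: tuple[str, ...]  # immutable snapshot IDs
    config_ref: str                     # training configuration and hyperparameters
    code_revision: str                  # git SHA or training package version
    runtime: str                        # dependency/runtime surface, e.g. an image digest
    compute_class: str                  # hardware target or compute class
    evaluation_pack: str | None = None  # required only for promotion-bound runs

    def validate(self) -> None:
        # A run missing any required input is not a platform-grade record.
        missing = [name for name, value in (
            ("dataset_snapshots", self.dataset_snapshots),
            ("config_ref", self.config_ref),
            ("code_revision", self.code_revision),
            ("runtime", self.runtime),
            ("compute_class", self.compute_class),
        ) if not value]
        if missing:
            raise ValueError(f"run definition incomplete: missing {missing}")
```

Rejecting incomplete definitions at submission time is what keeps a historical run recreatable from its record alone.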
Why this matters
If a run cannot be recreated without reading Slack threads, shell history, and notebook cells, it is not a platform-grade training record.
Execution Capabilities
Distributed training
- Multi-GPU and multi-node execution where the training stack requires it.
- Explicit record of world size, distribution strategy, checkpoint cadence, and failure semantics (sketched after this list).
- The same lifecycle tracking applies to simple single-node jobs and larger distributed setups alike.
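A minimal sketch of recording those distribution facts explicitly at launch time, again with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DistributedSpec:
    """Distribution facts recorded with the run, not reconstructed afterward."""
    world_size: int              # total workers, e.g. 16 (2 nodes x 8 GPUs)
    strategy: str                # e.g. "ddp" or "fsdp", whatever the stack uses
    checkpoint_every_steps: int  # checkpoint cadence
    on_worker_failure: str       # e.g. "restart_from_last_checkpoint"

# Example: a two-node, eight-GPU-per-node job with explicit failure semantics.
spec = DistributedSpec(world_size=16, strategy="fsdp",
                       checkpoint_every_steps=500,
                       on_worker_failure="restart_from_last_checkpoint")
```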
Checkpoint management
- Persist intermediate checkpoints and final outputs with run identity attached.
- Keep checkpoint lineage explicit so later promotion into the Model Registry is safe and attributable (see the sketch below).
- Recover from interruptions without breaking run history.
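One possible shape for persisting checkpoints with run identity and lineage attached; the storage layout and helper below are illustrative, not a real API:

```python
import hashlib
import json
import pathlib
import time

def persist_checkpoint(run_id: str, step: int, payload: bytes,
                       parent: str | None, root: pathlib.Path) -> str:
    """Store a checkpoint keyed by run identity, with explicit lineage.

    Returns a checkpoint ID that resolves back to exactly one run and one
    training step, which is what makes later registry promotion safe.
    """
    digest = hashlib.sha256(payload).hexdigest()[:12]
    ckpt_id = f"{run_id}:step{step}:{digest}"
    ckpt_dir = root / run_id
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    (ckpt_dir / f"step{step}.bin").write_bytes(payload)
    (ckpt_dir / f"step{step}.json").write_text(json.dumps({
        "checkpoint_id": ckpt_id,
        "run_id": run_id,      # run identity attached to every artifact
        "step": step,
        "parent": parent,      # previous checkpoint, for interruption recovery
        "created_at": time.time(),
    }))
    return ckpt_id
```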
Runtime observability
- Surface logs, metrics, resource usage, and failure state in real time.
- Allow engineers to understand a live run without logging into a separate compute environment, as illustrated after this list.
- Preserve post-run inspection so failures are diagnosable later.
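A deliberately simplified illustration of that observability surface, assuming the executor appends one JSON event per line to a shared log (a stand-in for a real streaming backend; the path and event fields are hypothetical):

```python
import json
from typing import Iterator

def iter_run_events(event_log_path: str) -> Iterator[dict]:
    """Yield structured run events: logs, metrics, resource samples, failures.

    Each event carries the same run_id used elsewhere in the lifecycle, so
    live inspection and post-run diagnosis read the same record stream.
    """
    with open(event_log_path) as f:
        for line in f:
            yield json.loads(line)

# Example: watch for failures without shelling into the compute node.
for event in iter_run_events("runs/run-123/events.jsonl"):
    if event.get("type") == "failure":
        print(event["run_id"], event.get("message"))
```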
Sweep execution
- Launch parameterized variations from one base configuration.
- Keep run grouping, cost, and metric comparison structured for later experiment analysis (see the sketch after this list).
- Useful for data ablations, learning-rate sweeps, architecture comparison, or target-specific tuning.
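A sketch of sweep expansion from one base configuration; the function and grouping key are hypothetical:

```python
import itertools
import uuid

def expand_sweep(base_config: dict, grid: dict[str, list]) -> list[dict]:
    """Expand one base config into a grid of runs sharing a sweep ID."""
    sweep_id = f"sweep-{uuid.uuid4().hex[:8]}"
    runs = []
    for values in itertools.product(*grid.values()):
        cfg = dict(base_config)
        cfg.update(zip(grid.keys(), values))  # one parameter variation
        cfg["sweep_id"] = sweep_id            # shared grouping key
        runs.append(cfg)
    return runs

# Example: a learning-rate x batch-size grid yields four grouped runs.
runs = expand_sweep({"model": "folding_net"},
                    {"lr": [1e-4, 3e-4], "batch_size": [32, 64]})
```

The shared sweep_id is what keeps cost and metric comparison structured for later experiment analysis instead of dissolving into separate scripts.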
Example Run Submission
```
rfabric train submit \
  --dataset folding_v12 \
  --config configs/folding_sweep.yaml \
  --compute aws-h100-8x \
  --evaluation-pack manipulation_release_v4
```

What matters is not the command shape. It is that the run is defined by stable platform objects: dataset version, config package, compute target, and evaluation context.
Relationship To Neighboring Surfaces
Upstream
- **Dataset Finalizer** provides immutable training-ready datasets and manifests.
- **Workflow Engine** can trigger runs as part of larger lifecycle automation.
Adjacent
- **Experiment Tracker** groups runs into hypotheses and preserves decision context.
- **Model Registry** receives selected checkpoints and turns them into governed model objects.
- **Evaluation & Release** provides the benchmark surface a candidate must satisfy before rollout.
Downstream
- **Artifact Builder** only packages models that emerge from governed training and promotion flow.
- **Telemetry** can later inform which training runs and data compositions actually improved field behavior.
Why This Matters Architecturally
The Training Orchestrator is where dataset lineage first becomes model lineage.
- It binds immutable dataset state to explicit run execution.
- It preserves enough runtime detail to reproduce or compare training outcomes.
- It keeps compute diversity from fragmenting the lifecycle system of record.
- It provides the execution substrate that Experiment Tracker and Model Registry depend on.
Without a strong training orchestrator, teams may still produce models, but the platform cannot make their creation process reliable or comparable.
Why Teams Care
Reproducibility
Runs remain recoverable and understandable long after they finish.
Speed
Teams can move from finalized datasets to tracked runs quickly across different compute environments.
Comparability
Sweeps and architecture changes remain structured instead of dissolving into separate scripts and notebooks.
Operational continuity
Training outputs can flow cleanly into evaluation, registry, packaging, and deployment without manual glue.