Model Development
Model Training & Evaluation
Model Training & Evaluation turns governed dataset versions into tracked, comparable, recoverable training jobs and explicit evaluation evidence across local, cloud, and on-prem compute. This is where dataset lineage becomes model behavior that teams can measure, compare, and eventually promote.
What This Surface Owns
Model Training & Evaluation owns run launch, compute coordination, evaluation execution, runtime observability, and execution history.
- Launch training from immutable dataset snapshots and explicit configs.
- Manage heterogeneous compute environments and distributed execution.
- Track job state, logs, metrics, checkpoints, and runtime metadata.
- Preserve the exact execution environment needed for reproducibility.
Promotion into durable model identity still belongs to the Model Registry, but this surface owns the work and evidence that make promotion possible.
Compute Abstraction
Cloud compute
- Launch runs on managed cloud GPU environments when capacity and speed matter most.
- Scale up for sweeps, large batch experimentation, or release-critical validation.
- Keep cloud allocation tied to the same run identity and provenance as every other environment.
On-prem and private cluster compute
- Use Kubernetes, Slurm, or internal schedulers without breaking lifecycle traceability.
- Preserve cost, queue time, hardware class, and environment assumptions as run metadata.
- Useful for teams with existing internal ML infrastructure.
Hybrid execution
- Debug locally, then burst to cloud or on-prem clusters for larger sweeps.
- Maintain one run model across heterogeneous compute rather than separate “local” and “production” training systems, as the sketch after this list illustrates.
- Good fit for teams balancing fast iteration with cost control.
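One way to keep a single run model across these environments is to treat compute as structured metadata attached to the run rather than as a separate system of record. A minimal Python sketch, with every class and field name hypothetical rather than part of any actual platform API:

```python
from dataclasses import dataclass
from enum import Enum

class ComputeKind(Enum):
    LOCAL = "local"
    CLOUD = "cloud"
    ON_PREM = "on_prem"

@dataclass(frozen=True)
class ComputeTarget:
    """One compute environment, described uniformly regardless of backend."""
    kind: ComputeKind
    hardware_class: str        # e.g. "h100-8x" (illustrative class name)
    scheduler: str             # e.g. "kubernetes", "slurm", "managed-cloud"
    queue: str | None = None   # on-prem queue name, if any

@dataclass
class RunComputeRecord:
    """Compute facts preserved as run metadata: cost, queue time, hardware."""
    run_id: str
    target: ComputeTarget
    cost_usd: float | None = None
    queue_time_s: float | None = None
```

Because every run carries the same record shape, a local debugging run and a cloud sweep run stay comparable entries in one execution history.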
Run Definition
Every training run should be defined by explicit inputs rather than ambient context; the sketch following the list below shows one way to make those inputs concrete.
Required run inputs
- dataset snapshot or set of dataset snapshots
- training configuration and hyperparameters
- code revision or training package version
- dependency/runtime surface
- hardware target or compute class
- evaluation pack if required
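As a sketch of what “explicit inputs rather than ambient context” can look like in code, assuming a Python submission layer (all names hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunDefinition:
    """Explicit inputs that fully define a training run; no ambient context."""
    dataset_snapshots: tuple[str, ...]  # immutable snapshot IDs
    config_ref: str                     # training configuration and hyperparameters
    code_revision: str                  # git SHA or training package version
    runtime: str                        # dependency/runtime surface, e.g. an image digest
    compute_class: str                  # hardware target or compute class
    evaluation_pack: str | None = None  # required only for promotion-bound runs

    def validate(self) -> None:
        # A run missing any required input is not a platform-grade record.
        missing = [name for name, value in (
            ("dataset_snapshots", self.dataset_snapshots),
            ("config_ref", self.config_ref),
            ("code_revision", self.code_revision),
            ("runtime", self.runtime),
            ("compute_class", self.compute_class),
        ) if not value]
        if missing:
            raise ValueError(f"run definition incomplete: missing {missing}")
```

Rejecting incomplete definitions at submission time is what keeps a historical run recreatable from its record alone.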
Why this matters
If a run cannot be recreated without reading Slack threads, shell history, and notebook cells, it is not a platform-grade training record.
Execution Capabilities
Distributed training
- Multi-GPU and multi-node execution where the training stack requires it.
- Explicit record of world size, distribution strategy, checkpoint cadence, and failure semantics (sketched after this list).
- The same lifecycle tracking applies to simple single-node jobs and larger distributed setups alike.
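A minimal sketch of recording those distribution facts explicitly at launch time, again with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DistributedSpec:
    """Distribution facts recorded with the run, not reconstructed afterward."""
    world_size: int              # total workers, e.g. 16 (2 nodes x 8 GPUs)
    strategy: str                # e.g. "ddp" or "fsdp", whatever the stack uses
    checkpoint_every_steps: int  # checkpoint cadence
    on_worker_failure: str       # e.g. "restart_from_last_checkpoint"

# Example: a two-node, eight-GPU-per-node job with explicit failure semantics.
spec = DistributedSpec(world_size=16, strategy="fsdp",
                       checkpoint_every_steps=500,
                       on_worker_failure="restart_from_last_checkpoint")
```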
Checkpoint management
- Persist intermediate checkpoints and final outputs with run identity attached.
- Keep checkpoint lineage explicit so later promotion into the Model Registry is safe and attributable (see the sketch below).
- Recover from interruptions without breaking run history.
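One possible shape for persisting checkpoints with run identity and lineage attached; the storage layout and helper below are illustrative, not a real API:

```python
import hashlib
import json
import pathlib
import time

def persist_checkpoint(run_id: str, step: int, payload: bytes,
                       parent: str | None, root: pathlib.Path) -> str:
    """Store a checkpoint keyed by run identity, with explicit lineage.

    Returns a checkpoint ID that resolves back to exactly one run and one
    training step, which is what makes later registry promotion safe.
    """
    digest = hashlib.sha256(payload).hexdigest()[:12]
    ckpt_id = f"{run_id}:step{step}:{digest}"
    ckpt_dir = root / run_id
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    (ckpt_dir / f"step{step}.bin").write_bytes(payload)
    (ckpt_dir / f"step{step}.json").write_text(json.dumps({
        "checkpoint_id": ckpt_id,
        "run_id": run_id,      # run identity attached to every artifact
        "step": step,
        "parent": parent,      # previous checkpoint, for interruption recovery
        "created_at": time.time(),
    }))
    return ckpt_id
```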
Runtime observability
- Surface logs, metrics, resource usage, and failure state in real time.
- Allow engineers to understand a live run without logging into a separate compute environment, as illustrated after this list.
- Preserve post-run inspection so failures are diagnosable later.
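A deliberately simplified illustration of that observability surface, assuming the executor appends one JSON event per line to a shared log (a stand-in for a real streaming backend; the path and event fields are hypothetical):

```python
import json
from typing import Iterator

def iter_run_events(event_log_path: str) -> Iterator[dict]:
    """Yield structured run events: logs, metrics, resource samples, failures.

    Each event carries the same run_id used elsewhere in the lifecycle, so
    live inspection and post-run diagnosis read the same record stream.
    """
    with open(event_log_path) as f:
        for line in f:
            yield json.loads(line)

# Example: watch for failures without shelling into the compute node.
for event in iter_run_events("runs/run-123/events.jsonl"):
    if event.get("type") == "failure":
        print(event["run_id"], event.get("message"))
```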
Sweep execution
- Launch parameterized variations from one base configuration.
- Keep run grouping, cost, and metric comparison structured for later experiment analysis (see the sketch after this list).
- Useful for data ablations, learning-rate sweeps, architecture comparison, or target-specific tuning.
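A sketch of sweep expansion from one base configuration; the function and grouping key are hypothetical:

```python
import itertools
import uuid

def expand_sweep(base_config: dict, grid: dict[str, list]) -> list[dict]:
    """Expand one base config into a grid of runs sharing a sweep ID."""
    sweep_id = f"sweep-{uuid.uuid4().hex[:8]}"
    runs = []
    for values in itertools.product(*grid.values()):
        cfg = dict(base_config)
        cfg.update(zip(grid.keys(), values))  # one parameter variation
        cfg["sweep_id"] = sweep_id            # shared grouping key
        runs.append(cfg)
    return runs

# Example: a learning-rate x batch-size grid yields four grouped runs.
runs = expand_sweep({"model": "folding_net"},
                    {"lr": [1e-4, 3e-4], "batch_size": [32, 64]})
```

The shared sweep_id is what keeps cost and metric comparison structured for later experiment analysis instead of dissolving into separate scripts.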
Example Run Submission
```
rfabric train submit \
  --dataset folding_v12 \
  --config configs/folding_sweep.yaml \
  --compute aws-h100-8x \
  --evaluation-pack manipulation_release_v4
```

What matters is not the command shape. It is that the run is defined by stable platform objects: dataset version, config package, compute target, and evaluation context.
Relationship To Neighboring Surfaces
Upstream
- **Dataset Finalizer** provides immutable training-ready datasets and manifests.
- **Workflow Engine** can trigger runs as part of larger lifecycle automation.
Adjacent
- **Experiment Tracker** groups runs into hypotheses and preserves decision context.
- **Model Registry** receives selected checkpoints and turns them into governed model objects.
- **Evaluation & Release** provides the benchmark surface a candidate must satisfy before rollout.
Downstream
- **Artifact Builder** only packages models that emerge from governed training and promotion flow.
- **Telemetry** can later inform which training runs and data compositions actually improved field behavior.
Why This Matters Architecturally
The Training Orchestrator is where dataset lineage first becomes model lineage.
- It binds immutable dataset state to explicit run execution.
- It preserves enough runtime detail to reproduce or compare training outcomes.
- It keeps compute diversity from fragmenting the lifecycle system of record.
- It provides the execution substrate that Experiment Tracker and Model Registry depend on.
Without a strong training orchestrator, teams may still produce models, but the platform cannot make their creation process reliable or comparable.
Why Teams Care
Reproducibility
Runs remain recoverable and understandable long after they finish.
Speed
Teams can move from finalized datasets to tracked runs quickly across different compute environments.
Comparability
Sweeps and architecture changes remain structured instead of dissolving into separate scripts and notebooks.
Operational continuity
Training outputs can flow cleanly into evaluation, registry, packaging, and deployment without manual glue.