Data Foundation
Dataset Curation & Versioning
Dataset Curation & Versioning decides which data should shape learning and evaluation, under which policy, and in what final immutable form. It combines quality scoring, selection, failure mining, coverage balancing, and human review so teams can build smaller, cleaner, more representative datasets with better downstream outcomes.
What This Surface Owns
Dataset Curation & Versioning owns selection, ranking, composition logic, and the freeze boundary that turns collected and labeled data into governed dataset versions.
- Score demonstrations and episodes for quality and usefulness.
- Search and retrieve data semantically and structurally.
- Identify duplication, gaps, drift, and hard negatives.
- Build reproducible selection policies for datasets and benchmark packs.
It turns large collections into intentional training and release inputs.
Quality Scoring
Learned quality models
- Reward-model or stage-aware scoring for manipulation and long-horizon tasks.
- Policy- or task-specific scoring that can rank demonstrations beyond binary success/failure.
- Batch scoring pipelines that keep provenance of model version, thresholds, and feature sources.
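As a loose illustration, a batch scoring job might emit one record per episode carrying the provenance fields above; the schema and values below are hypothetical, not a real platform API.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class QualityScoreRecord:
    """One scoring result per episode, with enough provenance to reproduce the decision (hypothetical schema)."""
    episode_id: str
    score: float
    scorer_model: str           # e.g. a reward-model checkpoint identifier
    scorer_version: str         # version of the scoring pipeline itself
    threshold: float            # threshold in force when the score was produced
    feature_sources: list[str]  # derived assets the scorer consumed

record = QualityScoreRecord(
    episode_id="ep-000123",
    score=0.91,
    scorer_model="reward-model-ckpt-0412",
    scorer_version="scoring-pipeline-2.3",
    threshold=0.85,
    feature_sources=["wrist-cam-embeddings", "proprio-summary"],
)
print(json.dumps(asdict(record), indent=2))  # ready to attach to curation state
```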
Heuristic and rules-based signals
- Episode length, idle time, motion smoothness, terminal state, unexpected torque, or contact anomalies.
- Fast filters for obvious low-value or broken demonstrations.
- Useful as complements to learned quality signals, not replacements for them.
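A minimal sketch of such a rules-based pre-filter, assuming per-episode summary statistics are already computed; the field names and thresholds are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class EpisodeStats:
    """Summary statistics assumed to be precomputed per episode (hypothetical schema)."""
    duration_s: float
    idle_fraction: float      # fraction of time with no commanded motion
    jerk_rms: float           # motion-smoothness proxy
    reached_terminal: bool    # did the episode end in a valid terminal state

def heuristic_quality(ep: EpisodeStats) -> float:
    """Cheap rules-based score in [0, 1]; a pre-filter, not a replacement for learned scoring."""
    score = 1.0
    if not ep.reached_terminal:
        score -= 0.5
    if ep.idle_fraction > 0.3:   # mostly idle episodes are low value
        score -= 0.2
    if ep.jerk_rms > 5.0:        # threshold is illustrative only
        score -= 0.2
    if ep.duration_s < 2.0:      # suspiciously short demonstrations
        score -= 0.3
    return max(score, 0.0)
```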
Human review augmentation
- Manual overrides for critical edge cases or customer-important scenarios.
- Reviewer approval or rejection attached to the same curation state.
- Combined human and machine signals can produce better selection than either alone.
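One trivial way to combine the two signals, assuming reviewer verdicts are stored alongside machine scores (hypothetical names): an explicit reviewer decision overrides the model, otherwise the model score stands.

```python
from typing import Optional

def effective_quality(model_score: float, reviewer_verdict: Optional[bool]) -> float:
    """Reviewer approval or rejection overrides the learned score; otherwise use the model."""
    if reviewer_verdict is True:    # explicit approval pins the episode above any threshold
        return 1.0
    if reviewer_verdict is False:   # explicit rejection removes it from selection
        return 0.0
    return model_score              # no human signal: fall back to the machine score
```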
Semantic Retrieval And Search
Embedding-based search
- Generate embeddings over frames, episodes, or semantic summaries.
- Search by natural language, nearest neighbor, or mixed metadata + semantic queries.
- Useful for fast failure mining and scenario discovery in very large corpora.
Query-by-example
- Start from a representative frame, episode, intervention, or failure packet.
- Retrieve similar events by visual, temporal, or trajectory similarity.
- Build high-value comparison or retraining sets quickly.
Structured filter composition
- Mix semantic search with hard filters like robot type, site, hardware revision, temperature range, operator, task variant, or deployment cohort.
- Keep curation expressive enough for real operational questions, not only generic similarity search.
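A loose sketch of composing hard metadata filters with embedding similarity, assuming per-episode embeddings and a parallel metadata record for each episode; all names here are hypothetical.

```python
import numpy as np

def search_episodes(query_embedding: np.ndarray,
                    episode_embeddings: np.ndarray,   # shape (N, D), one row per episode
                    metadata: list[dict],             # parallel list of episode metadata
                    robot_type: str,
                    site: str,
                    top_k: int = 50) -> list[dict]:
    """Apply hard metadata filters first, then rank survivors by cosine similarity (illustrative only)."""
    keep = [i for i, m in enumerate(metadata)
            if m.get("robot_type") == robot_type and m.get("site") == site]
    if not keep:
        return []
    cand = episode_embeddings[keep]
    sims = cand @ query_embedding / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(query_embedding) + 1e-9)
    order = np.argsort(-sims)[:top_k]
    return [metadata[keep[i]] | {"similarity": float(sims[i])} for i in order]
```

The same shape of query supports query-by-example: swap the text-derived query embedding for the embedding of a representative frame, episode, or failure packet.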
Coverage Construction
The best dataset is not the dataset with the most data or the highest mean score. It is the one with the right coverage.
Duplicate and near-duplicate control
- Detect repeated sessions and oversampled operating regimes.
- Reduce wasted training volume without discarding real variability.
- Preserve representative examples while removing low-value repetition.
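One simple pattern, sketched under the assumption that episode-level embeddings exist, is a greedy pass that keeps an episode only when it is not too similar to anything already kept; the threshold is illustrative.

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, similarity_threshold: float = 0.97) -> list[int]:
    """Greedy near-duplicate filter: return indices of episodes retained because they
    are not too similar to any already-kept episode."""
    # Normalize rows so the dot product is cosine similarity.
    normed = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)
    kept: list[int] = []
    for i in range(len(normed)):
        if not kept:
            kept.append(i)
            continue
        sims = normed[kept] @ normed[i]
        if sims.max() < similarity_threshold:
            kept.append(i)   # sufficiently novel relative to everything kept so far
    return kept
```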
Failure and hard-negative mining
- Extract misses, recoveries, contact failures, intervention-heavy episodes, and out-of-distribution states.
- Build targeted improvement sets for the next training cycle.
- Promote the same scenarios into benchmark and regression packs when relevant.
Coverage balancing
- Balance by task, site, robot revision, operator, environment, and customer context.
- Avoid overfitting to the dominant collection regime.
- Track what remains underrepresented after each curation pass.
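A minimal sketch of capping the dominant stratum and reporting what remains underrepresented, assuming each episode record carries the relevant metadata field; names and the single-key balancing are illustrative, since a real pass would balance across several dimensions at once.

```python
from collections import Counter
import random

def balance_by_stratum(episodes: list[dict], key: str, max_share: float = 0.4,
                       seed: int = 0) -> list[dict]:
    """Cap the dominant stratum (e.g. key='site') at max_share of the balanced set."""
    rng = random.Random(seed)
    counts = Counter(ep[key] for ep in episodes)
    dominant, _ = counts.most_common(1)[0]
    dominant_eps = [ep for ep in episodes if ep[key] == dominant]
    others = [ep for ep in episodes if ep[key] != dominant]
    # Largest dominant count that still leaves it at <= max_share of the result.
    cap = int(max_share * len(others) / (1.0 - max_share))
    if len(dominant_eps) > cap:
        dominant_eps = rng.sample(dominant_eps, cap)
    remaining = Counter(ep[key] for ep in others + dominant_eps)
    print("post-balance counts:", dict(remaining))  # surfaces what is still underrepresented
    return others + dominant_eps
```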
Correction-weighted learning data
- Upweight intervention and recovery episodes when they are disproportionately valuable.
- Support DAgger-style or supervised correction loops without requiring teams to manage them manually.
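A rough sketch of upweighting intervention and recovery episodes at sampling time, assuming a boolean flag on each episode record; the field name and weight are hypothetical.

```python
import random

def sample_training_batch(episodes: list[dict], batch_size: int,
                          correction_weight: float = 3.0, seed: int = 0) -> list[dict]:
    """Draw a batch in which intervention/recovery episodes are correction_weight times
    more likely to be sampled than ordinary demonstrations."""
    rng = random.Random(seed)
    weights = [correction_weight if ep.get("has_intervention") else 1.0 for ep in episodes]
    return rng.choices(episodes, weights=weights, k=batch_size)
```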
Policy-Based Dataset Construction
The most important output of curation is not just the selected data. It is the reproducible policy that selected it.
Curation policy examples
- Keep only episodes above quality threshold `0.85`
- Require completed annotation schema `pick-place-v4`
- Retain all intervention recoveries from staging fleet
- Downsample near-duplicates from the dominant site
- Reserve certain failure classes for held-out evaluation only
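A curation policy like the one above can be captured as data rather than as ad-hoc queries, which is what makes it reproducible and comparable. The sketch below expresses the example policy as a hypothetical config object; the schema, field names, and tag values are illustrative, not a real API.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CurationPolicy:
    """Declarative, versionable selection policy (hypothetical schema)."""
    name: str
    min_quality_score: float
    required_annotation_schema: str
    always_keep_tags: list[str] = field(default_factory=list)             # e.g. intervention recoveries
    downsample_near_duplicates_from: list[str] = field(default_factory=list)
    held_out_failure_classes: list[str] = field(default_factory=list)     # evaluation only, never trained on

pick_place_policy = CurationPolicy(
    name="pick-place-v4-train",
    min_quality_score=0.85,
    required_annotation_schema="pick-place-v4",
    always_keep_tags=["intervention-recovery:staging-fleet"],
    downsample_near_duplicates_from=["dominant-site"],
    held_out_failure_classes=["grasp-slip", "bin-collision"],  # illustrative class names
)
```

Because the policy is immutable data, it can be stored with the dataset version it produced and re-executed later to rebuild or audit that dataset.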
Why policy matters
- Teams can reproduce an old dataset exactly.
- Teams can compare outcomes across curation strategies.
- Release regressions can be traced back to differences in selection logic, not just vague “better data” claims.
Relationship To Other Surfaces
Upstream
- **Data Processing Pipeline** provides derived assets, embeddings, and validation state.
- **Annotation & Labeling** provides semantic structure and review completeness.
- **Telemetry** and **HITL Operations** provide failure and intervention signals that become high-value curation inputs.
Downstream
- **Dataset Finalizer** consumes selection policy and approved episode membership.
- **Evaluation & Release** consumes curated benchmark and replay sets.
- **Training** benefits from cleaner, more representative, and more targeted data.
Why This Matters Architecturally
Curation is where the platform converts raw collection scale into actual learning leverage.
- It ties together search, quality scoring, labeling, interventions, and release evidence.
- It ensures that dataset construction is policy-driven rather than ad hoc.
- It turns field failures into future coverage.
- It gives the Unified Data Model and Workflow Engine a concrete way to preserve why one dataset differed from another.
Without a strong curation surface, the system collects and labels data but still leaves the highest-leverage decision unstructured.
Why Teams Care
Quality
Better selection often matters more than another round of architecture tinkering.
Speed
Semantic retrieval and policy-based filtering reduce manual scrubbing dramatically.
Reproducibility
Teams can explain exactly why a dataset contained what it contained.
Compounding improvement
Release failures, interventions, and field anomalies become inputs to better future datasets instead of dead-end incidents.