Monitoring, Alerting & Maintenance

Monitoring, Alerting & Maintenance give the platform real visibility into deployed behavior. They capture health, performance, rollout behavior, intervention signals, and operational anomalies across fleets, then turn that evidence into actionable lifecycle input for incident response, maintenance, evaluation, and future training.

What This Surface Owns

This surface owns collection, summarization, alerting, maintenance context, and interpretation of operational signals from deployed robots.

  • Collect runtime, hardware, task, and autonomy metrics from fleet members.
  • Detect anomalies, regressions, and drift across robots, sites, and release cohorts.
  • Support real-time monitoring and retrospective analysis.
  • Feed operational evidence back into release, curation, and maintenance workflows.
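
Concretely, most of these signals can share a common envelope that carries fleet identity, rollout context, and the metric payload together. The sketch below is illustrative only; the field names are assumptions, not the platform's actual telemetry schema.

```python
# Illustrative sketch only: field names are assumptions, not the platform's
# actual telemetry schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class TelemetryRecord:
    """One telemetry sample as emitted by a robot agent."""
    robot_id: str                 # fleet-unique robot identity
    site: str                     # deployment site / customer location
    release: str                  # software release running at sample time
    cohort: str                   # rollout cohort (e.g. "canary", "baseline")
    category: str                 # "hardware", "autonomy", "task", ...
    metrics: dict[str, Any] = field(default_factory=dict)
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# A single autonomy sample tagged with its rollout context.
sample = TelemetryRecord(
    robot_id="robot-0142",
    site="plant-a",
    release="2.7.1",
    cohort="canary",
    category="autonomy",
    metrics={"model_latency_ms": 41.8, "interventions": 1},
)
```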

This is not only a dashboard surface. It is the field evidence and maintenance coordination layer of the platform.

Telemetry Categories

System and hardware health

  • CPU, GPU, memory, storage, thermal state, sensor status, network state, battery, and hardware-specific health signals.
  • Detect degradation and instability before they become mission failures or safety problems.
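
A minimal sketch of how such health signals could be screened for degradation before they turn into failures; the thresholds and field names are illustrative assumptions, not platform policy.

```python
# Hedged sketch: thresholds and field names are illustrative defaults, not
# platform policy.
def degradation_flags(health: dict) -> list[str]:
    """Flag health signals that are degrading but are not yet hard failures."""
    flags = []
    if health.get("cpu_temp_c", 0) > 85:        # approaching thermal limit
        flags.append("thermal_margin_low")
    if health.get("battery_pct", 100) < 15:     # battery nearly exhausted
        flags.append("battery_low")
    if health.get("disk_used_pct", 0) > 90:     # storage nearly full
        flags.append("storage_pressure")
    if not health.get("sensors_ok", True):      # any sensor reporting a fault
        flags.append("sensor_fault")
    return flags


print(degradation_flags({"cpu_temp_c": 91, "battery_pct": 34, "disk_used_pct": 95}))
# -> ['thermal_margin_low', 'storage_pressure']
```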

Inference and autonomy metrics

These are the metrics that connect deployment health to learning quality.

  • model latency
  • frame drop rate
  • policy confidence or uncertainty signals
  • intervention frequency
  • recovery behavior
  • task completion rate
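
As a rough illustration, a window of episode records could be reduced to these metrics along the following lines; the field names and window choice are assumptions.

```python
# Rough illustration: reducing a window of episode records to the metrics
# above. Field names and the window choice are assumptions.
from statistics import quantiles


def autonomy_summary(episodes: list[dict]) -> dict:
    """Summarize a window of episode records into autonomy health metrics."""
    latencies = [e["model_latency_ms"] for e in episodes]
    hours = sum(e["duration_s"] for e in episodes) / 3600
    interventions = sum(e["interventions"] for e in episodes)
    completed = sum(1 for e in episodes if e["completed"])
    return {
        "latency_p95_ms": quantiles(latencies, n=20)[18],     # 95th percentile
        "interventions_per_hour": interventions / max(hours, 1e-9),
        "task_completion_rate": completed / len(episodes),
    }


episodes = [
    {"model_latency_ms": 38.0, "duration_s": 600, "interventions": 0, "completed": True},
    {"model_latency_ms": 55.0, "duration_s": 540, "interventions": 2, "completed": False},
    {"model_latency_ms": 41.0, "duration_s": 620, "interventions": 0, "completed": True},
]
print(autonomy_summary(episodes))
```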

Task and operational performance

  • throughput
  • success and failure by scenario
  • cycle time
  • downtime reasons
  • customer or site-specific KPIs
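
A small aggregation sketch for the per-scenario view, under the assumption that task records carry a scenario label, a success flag, and a cycle time.

```python
# Sketch of a per-scenario rollup; "scenario", "success", and "cycle_time_s"
# are illustrative field names.
from collections import defaultdict


def scenario_kpis(tasks: list[dict]) -> dict[str, dict]:
    """Roll task records up into per-scenario success rate and mean cycle time."""
    by_scenario: dict[str, list[dict]] = defaultdict(list)
    for t in tasks:
        by_scenario[t["scenario"]].append(t)
    return {
        scenario: {
            "attempts": len(items),
            "success_rate": sum(t["success"] for t in items) / len(items),
            "mean_cycle_time_s": sum(t["cycle_time_s"] for t in items) / len(items),
        }
        for scenario, items in by_scenario.items()
    }


tasks = [
    {"scenario": "pick_bin_a", "success": True, "cycle_time_s": 12.4},
    {"scenario": "pick_bin_a", "success": False, "cycle_time_s": 19.0},
    {"scenario": "dock", "success": True, "cycle_time_s": 44.2},
]
print(scenario_kpis(tasks))
```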

Rollout and cohort metrics

  • canary vs baseline performance
  • site-specific regression patterns
  • hardware revision differences
  • release line performance over time
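
One hedged way to flag a canary regression is a simple two-proportion comparison of success rates between the canary and baseline cohorts, as sketched below; the significance threshold is an illustrative choice.

```python
# Hedged sketch: a two-proportion comparison of success rates between a canary
# cohort and its baseline. The z threshold is an illustrative choice.
from math import sqrt


def canary_regressed(canary_ok: int, canary_n: int,
                     base_ok: int, base_n: int, z_crit: float = 2.0) -> bool:
    """Return True if the canary success rate is significantly below baseline."""
    p_c, p_b = canary_ok / canary_n, base_ok / base_n
    p_pool = (canary_ok + base_ok) / (canary_n + base_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / canary_n + 1 / base_n))
    z = (p_b - p_c) / se if se else 0.0
    return z > z_crit            # baseline significantly better -> regression


# 870/1000 canary successes vs 940/1000 baseline successes -> regression flagged.
print(canary_regressed(870, 1000, 940, 1000))   # True
```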

Monitoring Surfaces

Real-time fleet view

  • Fleet-wide online/offline state and current release distribution.
  • Alerting for severe failures, anomalous drift, or intervention spikes.
  • Useful for live operations teams and early rollout observation.

Robot and cohort drilldown

  • Inspect one robot, one site, one fleet segment, or one release cohort in detail.
  • Compare behavior across rollout cohorts instead of relying only on fleet-wide averages.
  • Critical for understanding regressions that affect only one environment.

Incident evidence

  • Bundle telemetry slices, deployment context, operator notes, and replay links into compact incident packets.
  • Make post-mortem and cross-team diagnosis efficient.
  • Support promotion of incidents into curation and evaluation pipelines later.
  • Retain those packets according to region-specific, customer-specific, and regulatory policy when operational evidence is sensitive.
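
An illustrative shape for such a packet is sketched below; the field names, including the retention-policy tag, are assumptions rather than a defined platform schema.

```python
# Illustrative shape of an incident evidence packet; the field names,
# including the retention tag, are assumptions rather than a defined schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentPacket:
    """Compact bundle of evidence gathered around one incident."""
    incident_id: str
    robot_id: str
    release: str                      # deployment context at the time
    cohort: str
    telemetry_slice: list[dict]       # raw samples around the incident window
    operator_notes: list[str]
    replay_links: list[str]           # pointers into replay / episode storage
    retention_policy: str             # e.g. region / customer / regulatory class
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


packet = IncidentPacket(
    incident_id="inc-2024-0831",
    robot_id="robot-0142",
    release="2.7.1",
    cohort="canary",
    telemetry_slice=[{"model_latency_ms": 210.0, "interventions": 3}],
    operator_notes=["operator paused the task after repeated grasp failures"],
    replay_links=["replay://robot-0142/2024-08-31T10:05Z"],
    retention_policy="eu-customer-sensitive",
)
```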

Alerting And Response

Threshold-based alerting

  • latency ceilings
  • thermal limits
  • intervention spikes
  • availability and uptime drops
  • success-rate regressions
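
Rules like these lend themselves to a small declarative table evaluated against the latest metrics. The rule set below is an example, not shipped defaults.

```python
# Example rule table; these limits are illustrative, not shipped defaults.
RULES = [
    # (metric, comparison, limit, alert name)
    ("model_latency_ms",       "gt", 150.0, "latency_ceiling_breached"),
    ("cpu_temp_c",             "gt",  95.0, "thermal_limit"),
    ("interventions_per_hour", "gt",   4.0, "intervention_spike"),
    ("uptime_pct",             "lt",  98.0, "availability_drop"),
    ("task_success_rate",      "lt",   0.9, "success_rate_regression"),
]


def evaluate_rules(metrics: dict, rules=RULES) -> list[str]:
    """Return the names of all alerts whose threshold is crossed."""
    fired = []
    for metric, op, limit, name in rules:
        value = metrics.get(metric)
        if value is None:
            continue
        if (op == "gt" and value > limit) or (op == "lt" and value < limit):
            fired.append(name)
    return fired


print(evaluate_rules({"model_latency_ms": 180.0, "task_success_rate": 0.84}))
# -> ['latency_ceiling_breached', 'success_rate_regression']
```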

Anomaly detection

  • Detect deviations from baseline cohort behavior.
  • Useful when absolute thresholds are not enough or when environment-specific drift matters.
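
A minimal baseline-relative check might compare a robot's reading against its cohort's recent distribution, as in the sketch below; the 3-sigma cutoff is an illustrative choice.

```python
# Minimal sketch of baseline-relative anomaly detection: a reading is compared
# against its cohort's recent distribution. The 3-sigma cutoff is illustrative.
from statistics import mean, stdev


def is_anomalous(value: float, cohort_history: list[float], sigma: float = 3.0) -> bool:
    """Flag a reading that deviates from the cohort baseline by > sigma stddevs."""
    if len(cohort_history) < 2:
        return False                       # not enough baseline to judge
    mu, sd = mean(cohort_history), stdev(cohort_history)
    if sd == 0:
        return value != mu
    return abs(value - mu) / sd > sigma


cohort_latency_ms = [40.1, 42.3, 39.8, 41.0, 43.2, 40.6]
print(is_anomalous(44.0, cohort_latency_ms))    # False: within normal spread
print(is_anomalous(95.0, cohort_latency_ms))    # True: far outside the baseline
```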

Workflow integration

  • Open incident records automatically.
  • Trigger maintenance triage when hardware health crosses policy thresholds.
  • Pause rollout or recommend rollback when canary cohorts fail validation.
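
A hedged sketch of that routing: alerts fan out to whichever incident, maintenance, or rollout workflows are registered for them. The handler names are placeholders for whatever systems are in place.

```python
# Hedged sketch of alert-to-workflow routing; handler names are placeholders
# for whatever incident, maintenance, and rollout systems are in place.
def open_incident(alert: dict) -> None:
    print(f"incident opened: {alert['name']}")


def trigger_maintenance(alert: dict) -> None:
    print(f"maintenance triage queued for {alert['robot_id']}")


def pause_rollout(alert: dict) -> None:
    print(f"rollout paused for cohort {alert['cohort']}")


ROUTES = {
    "thermal_limit":           [open_incident, trigger_maintenance],
    "success_rate_regression": [open_incident, pause_rollout],
    "intervention_spike":      [open_incident],
}


def dispatch(alert: dict) -> None:
    """Fan an alert out to every workflow registered for its name."""
    for handler in ROUTES.get(alert["name"], [open_incident]):
        handler(alert)


dispatch({"name": "success_rate_regression", "robot_id": "robot-0142", "cohort": "canary"})
```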

Closing The Loop To Development

Telemetry becomes much more valuable when it is connected to the rest of the lifecycle.

Release feedback

  • Post-deploy behavior validates or invalidates the release evidence that justified the rollout.
  • Canary regressions become part of the release record.

Data feedback

  • Novel failure conditions, intervention-heavy episodes, or environment drift can be promoted into review and curation queues.
  • High-value incidents can seed benchmark and replay packs.
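
A promotion rule for that flow might look like the following sketch; the intervention threshold and tag names are assumptions.

```python
# Illustrative promotion rule: intervention-heavy or novel-failure episodes
# are queued for review and curation. Threshold and tag names are assumptions.
NOVEL_TAGS = {"novel_failure", "env_drift"}


def select_for_curation(episodes: list[dict], max_interventions: int = 2) -> list[dict]:
    """Pick episodes worth promoting into review and curation queues."""
    promoted = []
    for ep in episodes:
        heavy_intervention = ep.get("interventions", 0) > max_interventions
        novel = bool(NOVEL_TAGS & set(ep.get("tags", [])))
        if heavy_intervention or novel:
            promoted.append(ep)
    return promoted


episodes = [
    {"id": "ep-1", "interventions": 0, "tags": []},
    {"id": "ep-2", "interventions": 4, "tags": []},
    {"id": "ep-3", "interventions": 1, "tags": ["env_drift"]},
]
print([e["id"] for e in select_for_curation(episodes)])   # ['ep-2', 'ep-3']
```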

Training feedback

This is the compounding loop that makes deployment improve future models rather than only deliver them.

  • Teams can compare not just which model won in evaluation, but which model actually reduced intervention and failure in production.
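
One way to frame that comparison between two releases of the same policy is sketched below, using illustrative field names.

```python
# Sketch of a field comparison between two releases of the same policy:
# did the new one actually reduce interventions and failures? Field names
# are illustrative.
def production_delta(old: list[dict], new: list[dict]) -> dict:
    """Compare interventions per hour and failure rate between two releases."""
    def rates(eps: list[dict]) -> tuple[float, float]:
        hours = sum(e["duration_s"] for e in eps) / 3600
        per_hour = sum(e["interventions"] for e in eps) / max(hours, 1e-9)
        failure_rate = sum(1 for e in eps if not e["completed"]) / len(eps)
        return per_hour, failure_rate

    old_int, old_fail = rates(old)
    new_int, new_fail = rates(new)
    return {
        "interventions_per_hour": {"old": old_int, "new": new_int},
        "failure_rate": {"old": old_fail, "new": new_fail},
        "improved": new_int < old_int and new_fail < old_fail,
    }


# Usage: production_delta(episodes_from_release_a, episodes_from_release_b)
```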

Relationship To Neighboring Surfaces

Inputs

  • **Robot agents** and runtime services emit telemetry through the Platform API.
  • **Fleet Manager** provides target identity and segmentation.
  • **Deployment and Update** provide release and cohort context.

Outputs

  • **Maintenance System** receives diagnostic triggers and service context.
  • **Evaluation & Release** receives post-deploy evidence and new regression scenarios.
  • **Data Explorer** can display incidents and replay-aligned operational evidence.
  • **Human-in-the-Loop Operations** can use telemetry to escalate and contextualize interventions.

Why This Matters Architecturally

Telemetry is what makes the platform relevant after deployment.

  • It turns fleets into observable systems instead of black boxes.
  • It lets rollout policy act on evidence instead of hope.
  • It makes field behavior available to curation, evaluation, and retraining loops.
  • It ties deployment outcomes back to models, artifacts, and datasets through the Unified Data Model.

Without this surface, operations data stays fragmented and the platform cannot truly close the loop.

Why Teams Care

Operational confidence

Teams know what is happening before customers or operators tell them something went wrong.

Better release quality

Rollouts can be judged by real-world cohort behavior, not only pre-deploy benchmarks.

Faster diagnosis

Incident packets and cohort comparison reduce mean time to understand regressions.

Compounding learning

Production evidence becomes new data and evaluation coverage instead of dead-end telemetry.