Monitoring, Alerting & Maintenance

Monitoring, Alerting & Maintenance give the platform real visibility into deployed behavior. They capture health, performance, rollout behavior, intervention signals, and operational anomalies across fleets, then turn that evidence into actionable lifecycle input for incident response, maintenance, evaluation, and future training.

What This Surface Owns

This surface owns collection, summarization, alerting, maintenance context, and interpretation of operational signals from deployed robots.

  • Collect runtime, hardware, task, and autonomy metrics from fleet members.
  • Detect anomalies, regressions, and drift across robots, sites, and release cohorts.
  • Support real-time monitoring and retrospective analysis.
  • Feed operational evidence back into release, curation, and maintenance workflows.
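
Concretely, most of these signals can share a common envelope that carries fleet identity, rollout context, and the metric payload together. The sketch below is illustrative only; the field names are assumptions, not the platform's actual telemetry schema.

```python
# Illustrative sketch only: field names are assumptions, not the platform's
# actual telemetry schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class TelemetryRecord:
    """One telemetry sample as emitted by a robot agent."""
    robot_id: str                 # fleet-unique robot identity
    site: str                     # deployment site / customer location
    release: str                  # software release running at sample time
    cohort: str                   # rollout cohort (e.g. "canary", "baseline")
    category: str                 # "hardware", "autonomy", "task", ...
    metrics: dict[str, Any] = field(default_factory=dict)
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# A single autonomy sample tagged with its rollout context.
sample = TelemetryRecord(
    robot_id="robot-0142",
    site="plant-a",
    release="2.7.1",
    cohort="canary",
    category="autonomy",
    metrics={"model_latency_ms": 41.8, "interventions": 1},
)
```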

This is not only a dashboard surface. It is the field evidence and maintenance coordination layer of the platform.

Telemetry Categories

System and hardware health

  • CPU, GPU, memory, storage, thermal state, sensor status, network state, battery, and hardware-specific health signals.
  • Detect degradation and instability before they become mission failures or safety problems.
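
A minimal sketch of how such health signals could be screened for degradation before they turn into failures; the thresholds and field names are illustrative assumptions, not platform policy.

```python
# Hedged sketch: thresholds and field names are illustrative defaults, not
# platform policy.
def degradation_flags(health: dict) -> list[str]:
    """Flag health signals that are degrading but are not yet hard failures."""
    flags = []
    if health.get("cpu_temp_c", 0) > 85:        # approaching thermal limit
        flags.append("thermal_margin_low")
    if health.get("battery_pct", 100) < 15:     # battery nearly exhausted
        flags.append("battery_low")
    if health.get("disk_used_pct", 0) > 90:     # storage nearly full
        flags.append("storage_pressure")
    if not health.get("sensors_ok", True):      # any sensor reporting a fault
        flags.append("sensor_fault")
    return flags


print(degradation_flags({"cpu_temp_c": 91, "battery_pct": 34, "disk_used_pct": 95}))
# -> ['thermal_margin_low', 'storage_pressure']
```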

Inference and autonomy metrics

These are the metrics that connect deployment health to learning quality.

  • model latency
  • frame drop rate
  • policy confidence or uncertainty signals
  • intervention frequency
  • recovery behavior
  • task completion rate
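
As a rough illustration, a window of episode records could be reduced to these metrics along the following lines; the field names and window choice are assumptions.

```python
# Rough illustration: reducing a window of episode records to the metrics
# above. Field names and the window choice are assumptions.
from statistics import quantiles


def autonomy_summary(episodes: list[dict]) -> dict:
    """Summarize a window of episode records into autonomy health metrics."""
    latencies = [e["model_latency_ms"] for e in episodes]
    hours = sum(e["duration_s"] for e in episodes) / 3600
    interventions = sum(e["interventions"] for e in episodes)
    completed = sum(1 for e in episodes if e["completed"])
    return {
        "latency_p95_ms": quantiles(latencies, n=20)[18],     # 95th percentile
        "interventions_per_hour": interventions / max(hours, 1e-9),
        "task_completion_rate": completed / len(episodes),
    }


episodes = [
    {"model_latency_ms": 38.0, "duration_s": 600, "interventions": 0, "completed": True},
    {"model_latency_ms": 55.0, "duration_s": 540, "interventions": 2, "completed": False},
    {"model_latency_ms": 41.0, "duration_s": 620, "interventions": 0, "completed": True},
]
print(autonomy_summary(episodes))
```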

Task and operational performance

  • throughput
  • success and failure by scenario
  • cycle time
  • downtime reasons
  • customer or site-specific KPIs
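
A small aggregation sketch for the per-scenario view, under the assumption that task records carry a scenario label, a success flag, and a cycle time.

```python
# Sketch of a per-scenario rollup; "scenario", "success", and "cycle_time_s"
# are illustrative field names.
from collections import defaultdict


def scenario_kpis(tasks: list[dict]) -> dict[str, dict]:
    """Roll task records up into per-scenario success rate and mean cycle time."""
    by_scenario: dict[str, list[dict]] = defaultdict(list)
    for t in tasks:
        by_scenario[t["scenario"]].append(t)
    return {
        scenario: {
            "attempts": len(items),
            "success_rate": sum(t["success"] for t in items) / len(items),
            "mean_cycle_time_s": sum(t["cycle_time_s"] for t in items) / len(items),
        }
        for scenario, items in by_scenario.items()
    }


tasks = [
    {"scenario": "pick_bin_a", "success": True, "cycle_time_s": 12.4},
    {"scenario": "pick_bin_a", "success": False, "cycle_time_s": 19.0},
    {"scenario": "dock", "success": True, "cycle_time_s": 44.2},
]
print(scenario_kpis(tasks))
```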

Rollout and cohort metrics

  • canary vs baseline performance
  • site-specific regression patterns
  • hardware revision differences
  • release line performance over time
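
One hedged way to flag a canary regression is a simple two-proportion comparison of success rates between the canary and baseline cohorts, as sketched below; the significance threshold is an illustrative choice.

```python
# Hedged sketch: a two-proportion comparison of success rates between a canary
# cohort and its baseline. The z threshold is an illustrative choice.
from math import sqrt


def canary_regressed(canary_ok: int, canary_n: int,
                     base_ok: int, base_n: int, z_crit: float = 2.0) -> bool:
    """Return True if the canary success rate is significantly below baseline."""
    p_c, p_b = canary_ok / canary_n, base_ok / base_n
    p_pool = (canary_ok + base_ok) / (canary_n + base_n)
    se = sqrt(p_pool * (1 - p_pool) * (1 / canary_n + 1 / base_n))
    z = (p_b - p_c) / se if se else 0.0
    return z > z_crit            # baseline significantly better -> regression


# 870/1000 canary successes vs 940/1000 baseline successes -> regression flagged.
print(canary_regressed(870, 1000, 940, 1000))   # True
```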

Monitoring Surfaces

Real-time fleet view

  • Fleet-wide online/offline state and current release distribution.
  • Alerting for severe failures, anomalous drift, or intervention spikes.
  • Useful for live operations teams and early rollout observation.

Robot and cohort drilldown

  • Inspect one robot, one site, one fleet segment, or one release cohort in detail.
  • Compare behavior across rollout cohorts instead of relying only on fleet-wide averages.
  • Critical for understanding regressions that affect only one environment.

Incident evidence

  • Bundle telemetry slices, deployment context, operator notes, and replay links into compact incident packets.
  • Make post-mortem and cross-team diagnosis efficient.
  • Support promotion of incidents into curation and evaluation pipelines later.
  • Retain those packets according to region-specific, customer-specific, and regulatory policy when operational evidence is sensitive.
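
An illustrative shape for such a packet is sketched below; the field names, including the retention-policy tag, are assumptions rather than a defined platform schema.

```python
# Illustrative shape of an incident evidence packet; the field names,
# including the retention tag, are assumptions rather than a defined schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentPacket:
    """Compact bundle of evidence gathered around one incident."""
    incident_id: str
    robot_id: str
    release: str                      # deployment context at the time
    cohort: str
    telemetry_slice: list[dict]       # raw samples around the incident window
    operator_notes: list[str]
    replay_links: list[str]           # pointers into replay / episode storage
    retention_policy: str             # e.g. region / customer / regulatory class
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


packet = IncidentPacket(
    incident_id="inc-2024-0831",
    robot_id="robot-0142",
    release="2.7.1",
    cohort="canary",
    telemetry_slice=[{"model_latency_ms": 210.0, "interventions": 3}],
    operator_notes=["operator paused the task after repeated grasp failures"],
    replay_links=["replay://robot-0142/2024-08-31T10:05Z"],
    retention_policy="eu-customer-sensitive",
)
```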

Alerting And Response

Threshold-based alerting

  • latency ceilings
  • thermal limits
  • intervention spikes
  • availability and uptime drops
  • success-rate regressions
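
Rules like these lend themselves to a small declarative table evaluated against the latest metrics. The rule set below is an example, not shipped defaults.

```python
# Example rule table; these limits are illustrative, not shipped defaults.
RULES = [
    # (metric, comparison, limit, alert name)
    ("model_latency_ms",       "gt", 150.0, "latency_ceiling_breached"),
    ("cpu_temp_c",             "gt",  95.0, "thermal_limit"),
    ("interventions_per_hour", "gt",   4.0, "intervention_spike"),
    ("uptime_pct",             "lt",  98.0, "availability_drop"),
    ("task_success_rate",      "lt",   0.9, "success_rate_regression"),
]


def evaluate_rules(metrics: dict, rules=RULES) -> list[str]:
    """Return the names of all alerts whose threshold is crossed."""
    fired = []
    for metric, op, limit, name in rules:
        value = metrics.get(metric)
        if value is None:
            continue
        if (op == "gt" and value > limit) or (op == "lt" and value < limit):
            fired.append(name)
    return fired


print(evaluate_rules({"model_latency_ms": 180.0, "task_success_rate": 0.84}))
# -> ['latency_ceiling_breached', 'success_rate_regression']
```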

Anomaly detection

  • Detect deviations from baseline cohort behavior.
  • Useful when absolute thresholds are not enough or when environment-specific drift matters.
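
A minimal baseline-relative check might compare a robot's reading against its cohort's recent distribution, as in the sketch below; the 3-sigma cutoff is an illustrative choice.

```python
# Minimal sketch of baseline-relative anomaly detection: a reading is compared
# against its cohort's recent distribution. The 3-sigma cutoff is illustrative.
from statistics import mean, stdev


def is_anomalous(value: float, cohort_history: list[float], sigma: float = 3.0) -> bool:
    """Flag a reading that deviates from the cohort baseline by > sigma stddevs."""
    if len(cohort_history) < 2:
        return False                       # not enough baseline to judge
    mu, sd = mean(cohort_history), stdev(cohort_history)
    if sd == 0:
        return value != mu
    return abs(value - mu) / sd > sigma


cohort_latency_ms = [40.1, 42.3, 39.8, 41.0, 43.2, 40.6]
print(is_anomalous(44.0, cohort_latency_ms))    # False: within normal spread
print(is_anomalous(95.0, cohort_latency_ms))    # True: far outside the baseline
```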

Workflow integration

  • Open incident records automatically.
  • Trigger maintenance triage when hardware health crosses policy thresholds.
  • Pause rollout or recommend rollback when canary cohorts fail validation.
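
A hedged sketch of that routing: alerts fan out to whichever incident, maintenance, or rollout workflows are registered for them. The handler names are placeholders for whatever systems are in place.

```python
# Hedged sketch of alert-to-workflow routing; handler names are placeholders
# for whatever incident, maintenance, and rollout systems are in place.
def open_incident(alert: dict) -> None:
    print(f"incident opened: {alert['name']}")


def trigger_maintenance(alert: dict) -> None:
    print(f"maintenance triage queued for {alert['robot_id']}")


def pause_rollout(alert: dict) -> None:
    print(f"rollout paused for cohort {alert['cohort']}")


ROUTES = {
    "thermal_limit":           [open_incident, trigger_maintenance],
    "success_rate_regression": [open_incident, pause_rollout],
    "intervention_spike":      [open_incident],
}


def dispatch(alert: dict) -> None:
    """Fan an alert out to every workflow registered for its name."""
    for handler in ROUTES.get(alert["name"], [open_incident]):
        handler(alert)


dispatch({"name": "success_rate_regression", "robot_id": "robot-0142", "cohort": "canary"})
```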

Closing The Loop To Development

Telemetry becomes much more valuable when it is connected to the rest of the lifecycle.

Release feedback

  • Post-deploy behavior validates or invalidates the release evidence that justified the rollout.
  • Canary regressions become part of the release record.

Data feedback

  • Novel failure conditions, intervention-heavy episodes, or environment drift can be promoted into review and curation queues.
  • High-value incidents can seed benchmark and replay packs.
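
A promotion rule for that flow might look like the following sketch; the intervention threshold and tag names are assumptions.

```python
# Illustrative promotion rule: intervention-heavy or novel-failure episodes
# are queued for review and curation. Threshold and tag names are assumptions.
NOVEL_TAGS = {"novel_failure", "env_drift"}


def select_for_curation(episodes: list[dict], max_interventions: int = 2) -> list[dict]:
    """Pick episodes worth promoting into review and curation queues."""
    promoted = []
    for ep in episodes:
        heavy_intervention = ep.get("interventions", 0) > max_interventions
        novel = bool(NOVEL_TAGS & set(ep.get("tags", [])))
        if heavy_intervention or novel:
            promoted.append(ep)
    return promoted


episodes = [
    {"id": "ep-1", "interventions": 0, "tags": []},
    {"id": "ep-2", "interventions": 4, "tags": []},
    {"id": "ep-3", "interventions": 1, "tags": ["env_drift"]},
]
print([e["id"] for e in select_for_curation(episodes)])   # ['ep-2', 'ep-3']
```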

Training feedback

This is the compounding loop that makes deployment improve future models rather than only deliver them.

  • Teams can compare not just which model won in evaluation, but which model actually reduced intervention and failure in production.
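
One way to frame that comparison between two releases of the same policy is sketched below, using illustrative field names.

```python
# Sketch of a field comparison between two releases of the same policy:
# did the new one actually reduce interventions and failures? Field names
# are illustrative.
def production_delta(old: list[dict], new: list[dict]) -> dict:
    """Compare interventions per hour and failure rate between two releases."""
    def rates(eps: list[dict]) -> tuple[float, float]:
        hours = sum(e["duration_s"] for e in eps) / 3600
        per_hour = sum(e["interventions"] for e in eps) / max(hours, 1e-9)
        failure_rate = sum(1 for e in eps if not e["completed"]) / len(eps)
        return per_hour, failure_rate

    old_int, old_fail = rates(old)
    new_int, new_fail = rates(new)
    return {
        "interventions_per_hour": {"old": old_int, "new": new_int},
        "failure_rate": {"old": old_fail, "new": new_fail},
        "improved": new_int < old_int and new_fail < old_fail,
    }


# Usage: production_delta(episodes_from_release_a, episodes_from_release_b)
```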

Relationship To Neighboring Surfaces

Inputs

  • **Robot agents** and runtime services emit telemetry through the Platform API.
  • **Fleet Manager** provides target identity and segmentation.
  • **Deployment and Update** provide release and cohort context.

Outputs

  • **Maintenance System** receives diagnostic triggers and service context.
  • **Evaluation & Release** receives post-deploy evidence and new regression scenarios.
  • **Data Explorer** can display incidents and replay-aligned operational evidence.
  • **Human-in-the-Loop Operations** can use telemetry to escalate and contextualize interventions.

Why This Matters Architecturally

Telemetry is what makes the platform relevant after deployment.

  • It turns fleets into observable systems instead of black boxes.
  • It lets rollout policy act on evidence instead of hope.
  • It makes field behavior available to curation, evaluation, and retraining loops.
  • It ties deployment outcomes back to models, artifacts, and datasets through the Unified Data Model.

Without this surface, operations data stays fragmented and the platform cannot truly close the loop.

Why Teams Care

Operational confidence

Teams know what is happening before customers or operators tell them something went wrong.

Better release quality

Rollouts can be judged by real-world cohort behavior, not only pre-deploy benchmarks.

Faster diagnosis

Incident packets and cohort comparison reduce mean time to understand regressions.

Compounding learning

Production evidence becomes new data and evaluation coverage instead of dead-end telemetry.