Skip to content

Monitoring

Ordinaut is designed for production environments and exposes key metrics for monitoring and alerting. A full observability stack can be launched using the docker-compose.observability.yml file in the ops/ directory.

Prometheus Metrics

The system exposes a Prometheus-compatible endpoint at /metrics. Key metrics include:

Core Metrics

  • orchestrator_tasks_created_total: Counter for created tasks, labeled by schedule_kind and agent_id.
  • orchestrator_runs_total: Counter for task runs, labeled by status, task_id, and agent_id.
  • orchestrator_task_duration_seconds: Histogram of complete task execution times.
  • orchestrator_step_duration_seconds: Histogram of individual pipeline step execution times.

System & Worker Health

  • orchestrator_due_work_queue_depth: Gauge showing the number of pending jobs.
  • orchestrator_scheduler_lag_seconds: Gauge measuring the delay between scheduled time and actual execution.
  • orchestrator_active_workers: Gauge showing the number of active workers and their state.
  • orchestrator_worker_heartbeat_total: Counter for worker heartbeats.

API & Security

  • orchestrator_http_request_duration_seconds: Histogram of API request latency.
  • orchestrator_security_events_total: Counter for security events, labeled by event_type and severity.
  • orchestrator_authentication_attempts_total: Counter for authentication attempts, labeled by result.

Grafana Dashboards

The ops/ directory contains provisioning files for Grafana, which can be used to visualize the Prometheus metrics. It is recommended to set up dashboards to monitor:

  • System Health: An overview of API response times, error rates, and component health.
  • Task & Run Analysis: Tracking of created vs. completed tasks, success/failure rates, and top failing tasks.
  • Worker Performance: Monitoring queue depth, processing latency, and worker saturation.

Logging

All services produce structured (JSON) logs with correlation IDs (task_id, run_id), allowing for easy filtering and analysis. The provided observability stack includes Loki for centralized log aggregation, which can be queried from Grafana.