Prometheus Stack for OpenStack
Introduction
Prometheus is the de facto standard for cloud-native monitoring. Deployed with Kolla-Ansible, it collects metrics from OpenStack, Ceph, and the underlying infrastructure to provide end-to-end observability.
Prerequisites
- A working HA OpenStack cluster
- Phase 5 - High Availability
- Enough storage for the metrics (~50 GB recommended)
Learning points
Monitoring architecture
graph TB
    subgraph External
        ops[Ops Team<br/>Monitors the infrastructure]
        notify[Notifications<br/>Email, Slack, PagerDuty]
    end
    subgraph Monitoring["Monitoring Stack"]
        prometheus[Prometheus<br/>Metrics collection<br/>TSDB storage<br/>Port 9090]
        alertmanager[Alertmanager<br/>Alert handling<br/>Notification routing<br/>Port 9093]
        grafana[Grafana<br/>Dashboards<br/>Visualization<br/>Port 3000]
    end
    subgraph Exporters
        node[Node Exporter<br/>System metrics<br/>CPU, RAM, Disk]
        cadvisor[cAdvisor<br/>Container metrics]
        os_exp[OpenStack Exporter<br/>Service metrics]
        ceph_exp[Ceph Exporter<br/>Ceph metrics]
        haproxy_exp[HAProxy Exporter<br/>Load balancer metrics]
    end
    subgraph OpenStack
        services[Services<br/>Nova, Neutron...]
        infra[Infrastructure<br/>DB, MQ, Cache]
    end
    ops -->|Dashboards HTTPS| grafana
    grafana -->|Query PromQL| prometheus
    prometheus -->|Alerts| alertmanager
    alertmanager -->|Send| notify
    prometheus -->|Scrape| node
    prometheus -->|Scrape| cadvisor
    prometheus -->|Scrape| os_exp
    prometheus -->|Scrape| ceph_exp
    prometheus -->|Scrape| haproxy_exp
    node -->|Collect| infra
    os_exp -->|API| services
Enabling with Kolla
Activation avec Kolla¶
# /etc/kolla/globals.yml
enable_prometheus: "yes"
enable_grafana: "yes"
# Exporters
enable_prometheus_node_exporter: "yes"
enable_prometheus_cadvisor: "yes"
enable_prometheus_haproxy_exporter: "yes"
enable_prometheus_openstack_exporter: "yes"
enable_prometheus_mysqld_exporter: "yes"
enable_prometheus_rabbitmq_exporter: "yes"
enable_prometheus_memcached_exporter: "yes"
# Alertmanager
enable_prometheus_alertmanager: "yes"
# Data retention
prometheus_retention_period: "30d"
prometheus_storage_size: "50Gi"
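The 30-day retention and ~50 GB figures can be sanity-checked with the sizing rule from the Prometheus storage documentation: disk space ≈ retention in seconds × ingested samples per second × bytes per sample (typically 1 to 2 bytes after compression). The sketch below applies that rule. The function name `estimate_tsdb_bytes` and the ingestion rate of 10,000 samples/s are assumptions chosen for illustration, not measurements from this cluster.

```shell
#!/usr/bin/env bash
# Estimate TSDB disk usage in bytes.
# Args: retention in days, ingested samples per second, bytes per sample.
estimate_tsdb_bytes() {
  local retention_days="$1" samples_per_sec="$2" bytes_per_sample="$3"
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}

# Assumed workload: 30 days of retention, 10,000 samples/s, 2 bytes per sample.
bytes="$(estimate_tsdb_bytes 30 10000 2)"
echo "Estimated TSDB size: $(( bytes / 1000000000 )) GB"   # prints 51 GB
```

On a running server, the real ingestion rate can be read with `rate(prometheus_tsdb_head_samples_appended_total[5m])` and substituted for the assumed value.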
Deployment
# Deploy the monitoring stack
kolla-ansible -i ~/multinode deploy --tags prometheus,grafana
# Or reconfigure if it is already deployed
kolla-ansible -i ~/multinode reconfigure --tags prometheus,grafana
# Check the containers
docker ps | grep -E "(prometheus|grafana|exporter)"
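The `docker ps | grep` check above shows what is running but not what is missing. A minimal sketch of the inverse check follows; the helper name `missing_containers` is hypothetical, and the container names in the usage comment are assumptions that should be adjusted to what your deployment actually creates.

```shell
#!/usr/bin/env bash
# Print every expected container name that is absent from the list on stdin.
# stdin: one running container name per line.
# args:  the container names that should be running.
missing_containers() {
  local running name
  running="$(cat)"
  for name in "$@"; do
    # -x matches the whole line, -F treats the name as a literal string.
    grep -qxF -- "$name" <<<"$running" || echo "$name"
  done
}

# Usage on a monitoring node (container names are assumptions):
#   docker ps --format '{{.Names}}' | \
#     missing_containers prometheus_server prometheus_alertmanager grafana
```

An empty output means every expected container was found, which makes the function easy to use in a cron job or a CI gate.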
Prometheus configuration
# /etc/kolla/config/prometheus/prometheus.yml (customizations)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'openstack-prod'
    environment: 'production'
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - /etc/prometheus/rules/*.yml
scrape_configs:
  # Job configuration generated by Kolla
  # Add custom jobs here
Scrape jobs
# Example of a generated scrape configuration
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'controller-1:9100'
          - 'controller-2:9100'
          - 'controller-3:9100'
          - 'compute-1:9100'
          - 'compute-2:9100'
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - 'controller-1:8080'
          - 'controller-2:8080'
          - 'controller-3:8080'
  - job_name: 'openstack-exporter'
    static_configs:
      - targets: ['controller-1:9180']
  - job_name: 'haproxy'
    static_configs:
      - targets:
          - 'controller-1:9101'
          - 'controller-2:9101'
          - 'controller-3:9101'
  - job_name: 'rabbitmq'
    static_configs:
      - targets:
          - 'controller-1:15692'
          - 'controller-2:15692'
          - 'controller-3:15692'
  - job_name: 'mysqld'
    static_configs:
      - targets:
          - 'controller-1:9104'
          - 'controller-2:9104'
          - 'controller-3:9104'
Accessing Prometheus
# Via HAProxy (VIP)
curl http://10.0.0.10:9090
# Web interface
# http://10.0.0.10:9090
# Query API
curl -s 'http://10.0.0.10:9090/api/v1/query?query=up' | jq .
# Target status
curl -s 'http://10.0.0.10:9090/api/v1/targets' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
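The `query=up` example works as a raw URL because `up` contains no special characters. Selectors with braces, quotes, or operators need URL encoding, which `curl -G --data-urlencode` handles. The wrapper below is a small sketch: `prom_query` and the `PROM_URL` variable are hypothetical names, and the default address is simply the VIP used in this guide.

```shell
#!/usr/bin/env bash
# Run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is a hypothetical variable; it defaults to the VIP used in this guide.
PROM_URL="${PROM_URL:-http://10.0.0.10:9090}"

prom_query() {
  local query="$1"
  # -G sends the data as a GET query string; --data-urlencode escapes
  # characters such as {, }, ", and = that appear in PromQL selectors.
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=${query}"
}

# Usage: list the instances whose node exporter is down.
#   prom_query 'up{job="node-exporter"} == 0' | jq '.data.result[].metric.instance'
```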
Essential PromQL queries
# Health of the exporter jobs (use plain `up` to cover every job)
up{job=~".*exporter.*"}
# CPU usage per host (%)
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory used (bytes)
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes
# Disk used (%)
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100
# Active containers
count(container_last_seen{job="cadvisor", name!=""})
# HAProxy requests per second
sum(rate(haproxy_frontend_http_requests_total[5m])) by (frontend)
# Nova API latency (95th percentile); buckets are summed by `le` before computing the quantile
histogram_quantile(0.95, sum by (le) (rate(openstack_api_request_duration_seconds_bucket{service="nova"}[5m])))
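The CPU query above relies on `node_cpu_seconds_total{mode="idle"}` being a counter of seconds spent idle: `rate()` turns it into idle seconds per second (a fraction between 0 and 1), and subtracting that share from 100 gives the busy percentage. The sketch below reproduces the arithmetic for a single core with two hypothetical samples. It deliberately ignores the counter-reset handling and extrapolation that the real `rate()` performs, and `cpu_busy_percent` is an illustrative name.

```shell
#!/usr/bin/env bash
# Reproduce the arithmetic behind `100 - rate(idle[5m]) * 100` for one CPU core.
# Args: idle counter at window start, idle counter at window end, window in seconds.
cpu_busy_percent() {
  awk -v idle_start="$1" -v idle_end="$2" -v window="$3" 'BEGIN {
    idle_rate = (idle_end - idle_start) / window   # idle seconds gained per second
    printf "%.1f\n", 100 - idle_rate * 100         # share of the window spent non-idle
  }'
}

# Hypothetical samples: the idle counter grew by 240 s over a 300 s window.
cpu_busy_percent 1000 1240 300   # prints 20.0
```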
Metrics flow diagram
sequenceDiagram
participant Target as Target<br/>(Node, Container)
participant Exporter as Exporter<br/>(node_exporter)
participant Prom as Prometheus
participant Alert as Alertmanager
participant Graf as Grafana
Note over Target,Exporter: CPU, RAM, Disk<br/>Network, etc.
Target->>Exporter: System metrics
Prom->>Exporter: HTTP GET /metrics<br/>(every 15s)
Exporter-->>Prom: Prometheus format<br/>(text/plain)
Prom->>Prom: Store in TSDB
alt Alert condition met
    Prom->>Alert: Send alert
    Alert->>Alert: Group, dedupe
    Alert->>Alert: Route to receiver
end
Graf->>Prom: PromQL query
Prom-->>Graf: Time series data
Graf->>Graf: Render dashboard
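The `/metrics` response in the diagram is plain text: comment lines starting with `#` (HELP and TYPE), then one sample per line in the form `name{labels} value`. The sketch below pulls the values of one metric out of such a response. `metric_values` is a hypothetical helper, and it assumes samples carry no trailing timestamp, which holds for node_exporter but not for every exporter.

```shell
#!/usr/bin/env bash
# Print the value of every sample of one metric from Prometheus text-format input.
# stdin: the body of a /metrics response; arg: the metric name to look for.
# Assumption: sample lines have no trailing timestamp, so the value is the last field.
metric_values() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)          # drop the label set to get the bare name
      if (metric == name) print $NF    # the sample value is the last field
    }'
}

# Usage (host and port taken from the node-exporter job in this guide):
#   curl -s http://controller-1:9100/metrics | metric_values node_load1
```

Reading an exporter directly this way is a quick check when a target shows as "up" in Prometheus but a dashboard panel stays empty.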
Practical examples
Checking the stack
#!/bin/bash
# check-monitoring.sh
echo "=== Prometheus Status ==="
# -f makes curl exit non-zero on HTTP errors, so a 4xx/5xx response is reported as FAIL
curl -sf -o /dev/null http://10.0.0.10:9090/-/healthy && echo "Prometheus: OK" || echo "Prometheus: FAIL"
echo -e "\n=== Targets Health ==="
curl -s 'http://10.0.0.10:9090/api/v1/targets' | \
  jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"' | sort | uniq -c
echo -e "\n=== Alertmanager Status ==="
curl -sf -o /dev/null http://10.0.0.10:9093/-/healthy && echo "Alertmanager: OK" || echo "Alertmanager: FAIL"
echo -e "\n=== Grafana Status ==="
curl -s http://10.0.0.10:3000/api/health | jq .
echo -e "\n=== Active Alerts ==="
curl -s 'http://10.0.0.10:9093/api/v2/alerts' | jq '.[].labels.alertname'
Prometheus Federation (multi-cluster)
# Federate several Prometheus servers
scrape_configs:
  - job_name: 'federate-site-2'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets:
          - 'prometheus-site-2.example.com:9090'
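Before wiring a federation job, it can help to query the remote `/federate` endpoint by hand and see how many series a given `match[]` selector returns; a catch-all selector such as `{job=~".+"}` pulls every series and can make each scrape heavy. The helper below is a sketch for that preview. `federate_preview` is a hypothetical name and the selector in the usage comment is only an illustration.

```shell
#!/usr/bin/env bash
# Preview the series a remote Prometheus would expose to a federating server.
# Args: base URL of the remote Prometheus, then one or more match[] selectors.
federate_preview() {
  local base_url="$1"; shift
  local args=() selector
  for selector in "$@"; do
    # Each selector becomes its own URL-encoded match[] parameter.
    args+=(--data-urlencode "match[]=${selector}")
  done
  curl -sS -G "${base_url}/federate" "${args[@]}"
}

# Usage (the selector is an illustrative assumption): count the lines returned.
#   federate_preview http://prometheus-site-2.example.com:9090 '{job="node-exporter"}' | wc -l
```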
Long-term storage
# Remote write to Thanos Receive (Mimir expects a different path, /api/v1/push)
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 10000
      batch_send_deadline: 5s
Resources
Checkpoint
- Prometheus deployed and reachable (port 9090)
- All exporters in the "up" state
- Alertmanager reachable (port 9093)
- Grafana reachable (port 3000)
- PromQL queries working
- Retention configured (30 days)
- Enough storage for the metrics