
Prometheus Stack for OpenStack

Introduction

Prometheus is the standard solution for cloud-native monitoring. Deployed with Kolla-Ansible, it collects metrics from OpenStack, Ceph, and the underlying infrastructure to provide full observability.

Prerequisites

Learning objectives

Monitoring Architecture

graph TB
    subgraph External
        ops[Ops Team<br/>Monitors the infrastructure]
        notify[Notifications<br/>Email, Slack, PagerDuty]
    end

    subgraph Monitoring["Monitoring Stack"]
        prometheus[Prometheus<br/>Metrics collection<br/>TSDB storage<br/>Port 9090]
        alertmanager[Alertmanager<br/>Alert handling<br/>Notification routing<br/>Port 9093]
        grafana[Grafana<br/>Dashboards<br/>Visualization<br/>Port 3000]
    end

    subgraph Exporters
        node[Node Exporter<br/>System metrics<br/>CPU, RAM, Disk]
        cadvisor[cAdvisor<br/>Container metrics]
        os_exp[OpenStack Exporter<br/>Service metrics]
        ceph_exp[Ceph Exporter<br/>Ceph metrics]
        haproxy_exp[HAProxy Exporter<br/>LB metrics]
    end

    subgraph OpenStack
        services[Services<br/>Nova, Neutron...]
        infra[Infrastructure<br/>DB, MQ, Cache]
    end

    ops -->|Dashboards HTTPS| grafana
    grafana -->|Query PromQL| prometheus
    prometheus -->|Alerts| alertmanager
    alertmanager -->|Send| notify

    prometheus -->|Scrape| node
    prometheus -->|Scrape| cadvisor
    prometheus -->|Scrape| os_exp
    prometheus -->|Scrape| ceph_exp
    prometheus -->|Scrape| haproxy_exp

    node -->|Collect| infra
    os_exp -->|API| services

Enabling with Kolla

# /etc/kolla/globals.yml

enable_prometheus: "yes"
enable_grafana: "yes"

# Exporters
enable_prometheus_node_exporter: "yes"
enable_prometheus_cadvisor: "yes"
enable_prometheus_haproxy_exporter: "yes"
enable_prometheus_openstack_exporter: "yes"
enable_prometheus_mysqld_exporter: "yes"
enable_prometheus_rabbitmq_exporter: "yes"
enable_prometheus_memcached_exporter: "yes"

# Alertmanager
enable_prometheus_alertmanager: "yes"

# Data retention (passed to Prometheus as extra command-line flags)
prometheus_cmdline_extras: "--storage.tsdb.retention.time=30d --storage.tsdb.retention.size=50GB"
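
Before deploying, the new options can be sanity-checked with the usual Kolla-Ansible workflow. This is only a sketch; the ~/multinode inventory path matches the deployment commands below.

# Validate the options before touching the cluster
kolla-ansible -i ~/multinode prechecks --tags prometheus,grafana

# Render the configuration files without restarting any container
kolla-ansible -i ~/multinode genconfig --tags prometheus,grafana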

Deployment

# Deploy the monitoring stack
kolla-ansible -i ~/multinode deploy --tags prometheus,grafana

# Or reconfigure if already deployed
kolla-ansible -i ~/multinode reconfigure --tags prometheus,grafana

# Check the containers
docker ps | grep -E "(prometheus|grafana|exporter)"
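
A quick way to confirm the services came up on a controller is to check that the expected ports are listening and that the containers are healthy; the port list below simply follows the architecture diagram.

# The monitoring ports should be listening on a controller node
ss -tlnp | grep -E ':(9090|9093|3000|9100)\b'

# Container names and statuses for the monitoring stack
docker ps --format '{{.Names}}\t{{.Status}}' | grep -E 'prometheus|grafana'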

Prometheus Configuration

# /etc/kolla/config/prometheus/prometheus.yml (customizations)

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'openstack-prod'
    environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Scrape jobs are generated by Kolla
  # Add custom jobs here
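
After a reconfigure, the merged configuration can be validated with promtool before relying on it. A minimal sketch: the container name prometheus_server and the in-container path follow Kolla's usual conventions and should be adjusted if your deployment differs.

# Validate the running configuration file
# (container name prometheus_server is the Kolla default; adjust if needed)
docker exec prometheus_server promtool check config /etc/prometheus/prometheus.yml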

Scrape jobs

# Example of the generated scrape configuration

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'controller-1:9100'
          - 'controller-2:9100'
          - 'controller-3:9100'
          - 'compute-1:9100'
          - 'compute-2:9100'

  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - 'controller-1:8080'
          - 'controller-2:8080'
          - 'controller-3:8080'

  - job_name: 'openstack-exporter'
    static_configs:
      - targets: ['controller-1:9180']

  - job_name: 'haproxy'
    static_configs:
      - targets:
          - 'controller-1:9101'
          - 'controller-2:9101'
          - 'controller-3:9101'

  - job_name: 'rabbitmq'
    static_configs:
      - targets:
          - 'controller-1:15692'
          - 'controller-2:15692'
          - 'controller-3:15692'

  - job_name: 'mysqld'
    static_configs:
      - targets:
          - 'controller-1:9104'
          - 'controller-2:9104'
          - 'controller-3:9104'
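
Each target can also be queried directly to confirm it exposes metrics before Prometheus scrapes it. A quick sketch using the example hostnames above:

# Node exporter on a controller
curl -s http://controller-1:9100/metrics | head -n 5

# OpenStack exporter (may take a few seconds, it queries the OpenStack APIs)
curl -s http://controller-1:9180/metrics | grep -c '^openstack_'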

Accessing Prometheus

# Via HAProxy (VIP)
curl http://10.0.0.10:9090

# Web interface
# http://10.0.0.10:9090

# API Query
curl -s 'http://10.0.0.10:9090/api/v1/query?query=up' | jq .

# Targets status
curl -s 'http://10.0.0.10:9090/api/v1/targets' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
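
The HTTP API also exposes range queries, which is handy for pulling history from scripts. A sketch against the same VIP, using GNU date for the timestamps:

# One hour of the "up" metric at 60s resolution
curl -sG 'http://10.0.0.10:9090/api/v1/query_range' \
    --data-urlencode 'query=up' \
    --data "start=$(date -d '1 hour ago' +%s)" \
    --data "end=$(date +%s)" \
    --data 'step=60s' | jq '.data.result | length'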

Essential PromQL queries

# Service health
up{job=~".*exporter.*"}

# CPU usage per host
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory used
node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes

# Disk usage (%)
(node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100

# Running containers (cAdvisor)
count(container_last_seen{job="cadvisor", name!=""})

# HAProxy requests per second
sum(rate(haproxy_frontend_http_requests_total[5m])) by (frontend)

# Nova API latency (p95) - requires an exporter exposing request duration histograms
histogram_quantile(0.95, rate(openstack_api_request_duration_seconds_bucket{service="nova"}[5m]))
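
These expressions can be run from the CLI as well; --data-urlencode takes care of the special characters in the PromQL. A sketch against the VIP used above:

# CPU usage per host, straight from the API
curl -s 'http://10.0.0.10:9090/api/v1/query' \
    --data-urlencode 'query=100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)' \
    | jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"'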

Metrics flow diagram

sequenceDiagram
    participant Target as Target<br/>(Node, Container)
    participant Exporter as Exporter<br/>(node_exporter)
    participant Prom as Prometheus
    participant Alert as Alertmanager
    participant Graf as Grafana

    Note over Target,Exporter: CPU, RAM, Disk<br/>Network, etc.
    Target->>Exporter: System metrics

    Prom->>Exporter: HTTP GET /metrics<br/>(every 15s)
    Exporter-->>Prom: Prometheus format<br/>(text/plain)

    Prom->>Prom: Store in TSDB

    alt Alert condition met
        Prom->>Alert: Send alert
        Alert->>Alert: Group, dedupe
        Alert->>Alert: Route to receiver
    end

    Graf->>Prom: PromQL query
    Prom-->>Graf: Time series data
    Graf->>Graf: Render dashboard
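
The "Alert condition met" branch above only fires if rule files matching the rule_files glob are loaded. Below is a minimal example rule; the drop directory and filename are assumptions to check against your Kolla-Ansible version (only the rule syntax itself is standard Prometheus), and promtool can validate it either way.

# Assumption: custom rule files placed under /etc/kolla/config/prometheus/
# are copied into the container's rules directory by Kolla-Ansible;
# verify the expected filename pattern for your release.
cat > /etc/kolla/config/prometheus/custom.rules.yml <<'EOF'
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} ({{ $labels.job }}) is down"
EOF

# Validate the rule syntax (promtool is also available inside the prometheus container)
promtool check rules /etc/kolla/config/prometheus/custom.rules.yml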

Practical examples

Checking the stack

#!/bin/bash
# check-monitoring.sh

echo "=== Prometheus Status ==="
curl -sf -o /dev/null http://10.0.0.10:9090/-/healthy && echo "Prometheus: OK" || echo "Prometheus: FAIL"

echo -e "\n=== Targets Health ==="
curl -s 'http://10.0.0.10:9090/api/v1/targets' | \
    jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"' | sort | uniq -c

echo -e "\n=== Alertmanager Status ==="
curl -sf -o /dev/null http://10.0.0.10:9093/-/healthy && echo "Alertmanager: OK" || echo "Alertmanager: FAIL"

echo -e "\n=== Grafana Status ==="
curl -s http://10.0.0.10:3000/api/health | jq .

echo -e "\n=== Active Alerts ==="
curl -s 'http://10.0.0.10:9093/api/v2/alerts' | jq '.[].labels.alertname'

Prometheus Federation (multi-cluster)

# Federate multiple Prometheus servers
scrape_configs:
  - job_name: 'federate-site-2'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'
    static_configs:
      - targets:
          - 'prometheus-site-2.example.com:9090'
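
The federation endpoint can be tested by hand before adding the job; curl -G with --data-urlencode avoids having to escape the match[] selector.

# Should return raw metrics from the remote Prometheus
curl -sG 'http://prometheus-site-2.example.com:9090/federate' \
    --data-urlencode 'match[]={job=~".+"}' | head -n 10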

Long-term storage

# Remote write configuration to Thanos/Mimir
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"
    queue_config:
      max_samples_per_send: 10000
      batch_send_deadline: 5s
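
After a reload, the running configuration can be inspected through the status API to confirm the remote_write block was picked up. A sketch; the container name prometheus_server is an assumption based on Kolla's naming.

# Confirm remote_write is present in the running configuration
curl -s http://10.0.0.10:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A3 'remote_write'

# Look for errors pushing samples to the remote endpoint
docker logs prometheus_server 2>&1 | grep -i 'remote'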

Resources

Checkpoint

  • Prometheus deployed and reachable (port 9090)
  • All exporters reporting "up"
  • Alertmanager reachable (port 9093)
  • Grafana reachable (port 3000)
  • PromQL queries working
  • Retention configured (30 days)
  • Enough storage for the metrics