
Logging with Loki

Introduction

Loki is a cloud-native log-aggregation system designed to integrate with Prometheus and Grafana. It collects, stores, and lets you search the logs of every OpenStack service.

Prerequisites

  • Grafana installed
  • Storage space for the logs
  • Familiarity with OpenStack log patterns

Learning Objectives

Logging Architecture

graph TB
    subgraph Logging["Logging Stack"]
        loki[Loki<br/>Log storage<br/>Label indexing<br/>Port 3100]
        promtail[Promtail<br/>DaemonSet<br/>Collects logs<br/>Adds labels]
        grafana[Grafana<br/>Exploration<br/>Dashboards]
    end

    subgraph Sources["Log sources"]
        docker[Docker Logs<br/>/var/lib/docker/containers]
        journal[Systemd Journal<br/>journald]
        files[Log Files<br/>/var/log/*]
    end

    subgraph OpenStack["OpenStack Services"]
        nova[Nova<br/>nova-*.log]
        neutron[Neutron<br/>neutron-*.log]
        keystone[Keystone<br/>keystone.log]
        ceph_logs[Ceph<br/>ceph-*.log]
    end

    promtail -->|Read| docker
    promtail -->|Read| journal
    promtail -->|Read| files
    promtail -->|Push logs HTTP| loki
    grafana -->|Query LogQL| loki

    nova -.-> files
    neutron -.-> files
    keystone -.-> files
    ceph_logs -.-> files

Enabling in Kolla

# /etc/kolla/globals.yml

# Loki is not shipped with Kolla by default;
# deploy it manually or via Helm.

# Alternative: Elasticsearch + Kibana (legacy)
enable_central_logging: "yes"
enable_elasticsearch: "yes"
enable_kibana: "yes"

Deploying Loki + Promtail

# docker-compose-loki.yml
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.3
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.3
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yaml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    # -config.expand-env is required for ${HOSTNAME} in the scrape labels
    command: -config.file=/etc/promtail/config.yaml -config.expand-env=true
    restart: unless-stopped

volumes:
  loki-data:
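With the stack above running, Loki accepts logs on its push API. A minimal sketch of building a payload for `POST /loki/api/v1/push` (the endpoint and port come from the compose file; the `smoke-test` labels are invented for illustration):

```python
import json
import time

def build_push_payload(labels, lines):
    """Build a JSON body for Loki's POST /loki/api/v1/push endpoint.

    Loki expects a list of streams, each keyed by a label set, with
    values given as [timestamp_in_nanoseconds_as_string, line] pairs.
    """
    now_ns = str(time.time_ns())
    return json.dumps({
        "streams": [{
            "stream": labels,
            "values": [[now_ns, line] for line in lines],
        }]
    })

payload = build_push_payload({"job": "smoke-test", "host": "deploy1"},
                             ["loki push smoke test"])
print(payload)
# Send it with, e.g.:
#   curl -H 'Content-Type: application/json' \
#        -d "$PAYLOAD" http://localhost:3100/loki/api/v1/push
```

Pushing a line like this and then querying `{job="smoke-test"}` is a quick end-to-end check that ingestion works.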

Loki Configuration

# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 7 days
  retention_period: 168h  # 7 days, enforced by the compactor below
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

chunk_store_config:
  max_look_back_period: 168h  # 7 days

# Note: table_manager retention only applies to legacy table-based
# stores; with boltdb-shipper, deletion is handled by the compactor,
# driven by limits_config.retention_period above.

Promtail Configuration

# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Docker containers
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*log

    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
            time: time
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: output
      - labels:
          stream:
      # Extract the container ID from the file path
      - regex:
          expression: '/var/lib/docker/containers/(?P<container_id>[^/]+)/.*'
      - labels:
          container_id:

  # OpenStack logs
  - job_name: openstack
    static_configs:
      - targets:
          - localhost
        labels:
          job: openstack
          host: ${HOSTNAME}
          __path__: /var/log/kolla/*/*.log

    pipeline_stages:
      - regex:
          expression: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (?P<pid>\d+) (?P<level>\w+) (?P<component>\S+) \[(?P<request_id>[^\]]*)\] (?P<message>.*)$'
      - labels:
          level:
          component:
      - timestamp:
          source: timestamp
          format: '2006-01-02 15:04:05.000000'

  # Ceph logs
  - job_name: ceph
    static_configs:
      - targets:
          - localhost
        labels:
          job: ceph
          host: ${HOSTNAME}
          __path__: /var/log/ceph/*.log

  # System logs
  - job_name: syslog
    journal:
      max_age: 12h
      labels:
        job: syslog
        host: ${HOSTNAME}
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
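The oslo.log regex in the `openstack` job above is easy to get subtly wrong; it can be checked offline against a sample line before reloading Promtail (the sample line below is invented for illustration):

```python
import re

# Same pattern as the promtail `regex` stage for OpenStack (oslo.log) lines.
OSLO_RE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) '
    r'(?P<pid>\d+) (?P<level>\w+) (?P<component>\S+) '
    r'\[(?P<request_id>[^\]]*)\] (?P<message>.*)$'
)

sample = ("2024-01-01 12:00:00.123 4242 ERROR nova.compute.manager "
          "[req-abc123] Instance failed to spawn")

m = OSLO_RE.match(sample)
assert m is not None, "pattern did not match the sample line"
print(m.group("level"), m.group("component"), m.group("request_id"))
```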

LogQL Queries

# Logs from a specific service
{job="openstack", component="nova-api"}

# Error logs only
{job="openstack"} |= "ERROR"

# Logs matching a pattern
{job="openstack"} |~ "failed|error|exception"

# Logs for a specific instance
{job="openstack"} |= "instance_id=abc123"

# Count errors per service
sum(count_over_time({job="openstack", level="ERROR"}[5m])) by (component)

# Top 10 components by error count
# (grouping must use a label; `message` is not one)
topk(10, sum(count_over_time({job="openstack", level="ERROR"}[1h])) by (component))

# Request latency (extracted from the log line)
{job="openstack", component="nova-api"}
  | regexp `took (?P<duration>\d+\.\d+) seconds`
  | duration > 1
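The last query's `regexp`-then-filter pattern can be prototyped offline before running it in Grafana; a sketch with invented log lines:

```python
import re

# Same pattern as the LogQL `regexp` stage above.
DURATION_RE = re.compile(r"took (?P<duration>\d+\.\d+) seconds")

logs = [
    'POST /servers took 0.42 seconds',
    'POST /servers took 3.14 seconds',
    'GET /flavors took 1.50 seconds',
]

# Equivalent of: | regexp `took (?P<duration>\d+\.\d+) seconds` | duration > 1
slow = [line for line in logs
        if (m := DURATION_RE.search(line)) and float(m.group("duration")) > 1]
print(slow)  # only the lines slower than 1 second
```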

Grafana Dashboard for Logs

{
  "title": "OpenStack Logs",
  "panels": [
    {
      "title": "Log Volume",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(count_over_time({job=\"openstack\"}[5m])) by (component)",
          "legendFormat": "{{component}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(count_over_time({job=\"openstack\", level=\"ERROR\"}[5m])) by (component)",
          "legendFormat": "{{component}}"
        }
      ]
    },
    {
      "title": "Logs Explorer",
      "type": "logs",
      "targets": [
        {
          "expr": "{job=\"openstack\"} | line_format \"{{.component}}: {{.message}}\""
        }
      ]
    }
  ]
}
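Dashboard JSON like the fragment above is easy to break by hand-editing; a small sketch that checks every panel still carries a query expression (the fragment here is abridged from the dashboard above):

```python
import json

dashboard = json.loads("""
{
  "title": "OpenStack Logs",
  "panels": [
    {"title": "Log Volume", "type": "timeseries",
     "targets": [{"expr": "sum(count_over_time({job=\\"openstack\\"}[5m])) by (component)"}]},
    {"title": "Logs Explorer", "type": "logs",
     "targets": [{"expr": "{job=\\"openstack\\"}"}]}
  ]
}
""")

# Every panel should have at least one target with an expression.
problems = [p["title"] for p in dashboard["panels"]
            if not any(t.get("expr") for t in p.get("targets", []))]
print(problems)  # [] when the dashboard is well-formed
```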

Log-based Alerts

# log-alerts.yml: LogQL alerting rules, evaluated by Loki's ruler
# (not by Prometheus; these expressions are LogQL, not PromQL)
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(count_over_time({job="openstack", level="ERROR"}[5m])) by (component) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in {{ $labels.component }}"
          description: "{{ $value }} errors in the last 5 minutes"

      - alert: ServiceCrashLoop
        expr: |
          count_over_time({job="openstack"} |= "Starting" [10m]) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service appears to be crash looping"

      - alert: AuthenticationFailures
        expr: |
          sum(count_over_time({job="openstack", component="keystone"} |= "Authentication failed"[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Multiple authentication failures detected"

Flow Diagram

sequenceDiagram
    participant Src as Container/Service
    participant Docker as Docker<br/>(stdout/stderr)
    participant Promtail
    participant Loki
    participant Grafana
    actor Ops as Operator

    Src->>Docker: Write log
    Docker->>Docker: Store in<br/>/var/lib/docker/containers

    Promtail->>Docker: Tail logs
    Promtail->>Promtail: Parse (regex/json)<br/>Add labels
    Promtail->>Loki: Push (batched)

    Loki->>Loki: Index labels<br/>Store chunks

    Ops->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log entries
    Grafana-->>Ops: Display results

Practical Examples

Searching for Problems

# Via the Loki API
curl -G -s "http://loki:3100/loki/api/v1/query_range" \
    --data-urlencode 'query={job="openstack"} |= "ERROR"' \
    --data-urlencode 'start=1704067200' \
    --data-urlencode 'end=1704153600' \
    | jq '.data.result[].values[] | .[1]'

# With logcli (the Loki CLI)
logcli query '{job="openstack", component="nova-api"}' --limit=100

# Recent errors
logcli query '{job="openstack", level="ERROR"}' --since=1h
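The same `query_range` call can be scripted; a sketch using only the Python standard library to build the request URL (host, time range, and limit are placeholders):

```python
from urllib.parse import urlencode

def query_range_url(base, query, start, end, limit=100):
    """Build a Loki /loki/api/v1/query_range URL (Unix-second timestamps)."""
    params = urlencode({"query": query, "start": start,
                        "end": end, "limit": limit})
    return f"{base}/loki/api/v1/query_range?{params}"

url = query_range_url("http://loki:3100",
                      '{job="openstack"} |= "ERROR"',
                      1704067200, 1704153600)
print(url)
# Fetch it with urllib.request.urlopen(url) once Loki is reachable.
```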

Correlating Logs and Metrics

# In Grafana, use the same time-range variable in both panels
# Panel 1: metric
rate(http_requests_total{status="500"}[5m])

# Panel 2: matching logs
{job="openstack", component="nova-api"} |= "500"

Retention and Archiving

# Retention configuration
limits_config:
  retention_period: 720h  # 30 days

# For long-term archiving,
# use object storage (S3/Swift)
storage_config:
  aws:
    s3: s3://access_key:secret_key@region/bucket_name


Checkpoint

  • Loki deployed and reachable
  • Promtail collecting Docker logs
  • OpenStack logs carrying the right labels
  • Loki data source added in Grafana
  • Working logs dashboard
  • LogQL queries mastered
  • Log-based alerts configured
  • Retention configured