# Logging with Loki
## Introduction

Loki is a cloud-native log-aggregation system designed to integrate with Prometheus and Grafana. It collects and stores the logs of every OpenStack service and makes them searchable; unlike full-text engines, it indexes only a small set of labels, which keeps storage and operations lightweight.
## Prerequisites

- Grafana installed
- Storage space for the logs
- Familiarity with OpenStack log formats
## Key topics

### Logging architecture
```mermaid
graph TB
    subgraph Logging["Logging Stack"]
        loki[Loki<br/>Log storage<br/>Label indexing<br/>Port 3100]
        promtail[Promtail<br/>DaemonSet<br/>Log collection<br/>Adds labels]
        grafana[Grafana<br/>Exploration<br/>Dashboards]
    end
    subgraph Sources["Log sources"]
        docker[Docker Logs<br/>/var/lib/docker/containers]
        journal[Systemd Journal<br/>journald]
        files[Log Files<br/>/var/log/*]
    end
    subgraph OpenStack["OpenStack Services"]
        nova[Nova<br/>nova-*.log]
        neutron[Neutron<br/>neutron-*.log]
        keystone[Keystone<br/>keystone.log]
        ceph_logs[Ceph<br/>ceph-*.log]
    end
    promtail -->|Read| docker
    promtail -->|Read| journal
    promtail -->|Read| files
    promtail -->|Push logs HTTP| loki
    grafana -->|Query LogQL| loki
    nova -.-> files
    neutron -.-> files
    keystone -.-> files
    ceph_logs -.-> files
```
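The "push" arrow above corresponds to Loki's HTTP push endpoint (`POST /loki/api/v1/push`). As a sketch of what Promtail actually sends, the snippet below builds the payload shape by hand: labels become the stream selector, and each value is a `[nanosecond-timestamp, log-line]` pair. The label values and the sample log line are illustrative, not taken from a real deployment.

```python
import json
import time

def build_push_payload(labels: dict, lines: list[str]) -> dict:
    """Build a Loki push payload for a single stream."""
    now_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    return {
        "streams": [
            {
                "stream": labels,                              # indexed labels
                "values": [[now_ns, line] for line in lines],  # [ts, line] pairs
            }
        ]
    }

payload = build_push_payload(
    {"job": "openstack", "component": "nova-api"},
    ["2024-01-01 12:00:00.000 100 INFO nova.api [req-1] GET /servers"],
)
print(json.dumps(payload, indent=2))
```

POSTing this JSON to `http://loki:3100/loki/api/v1/push` with a `Content-Type: application/json` header is exactly what Promtail does in batches.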
### Enabling with Kolla

```yaml
# /etc/kolla/globals.yml
# Loki is not shipped with Kolla by default; deploy it manually or via Helm.
# Kolla's built-in central logging uses the legacy Elasticsearch + Kibana stack:
enable_central_logging: "yes"
enable_elasticsearch: "yes"
enable_kibana: "yes"
```
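Recent Kolla-Ansible releases (2023.1 "Antelope" and later) replaced the Elasticsearch/Kibana pair with OpenSearch. A sketch for those releases, assuming the current flag names; verify them against your release's documentation:

```yaml
# /etc/kolla/globals.yml — newer releases, OpenSearch-based central logging
enable_central_logging: "yes"
enable_opensearch: "yes"
enable_opensearch_dashboards: "yes"
```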
### Deploying Loki + Promtail

```yaml
# docker-compose-loki.yml
version: '3.8'

services:
  loki:
    image: grafana/loki:2.9.3
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yaml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.3
    volumes:
      - ./promtail-config.yaml:/etc/promtail/config.yaml
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    # -config.expand-env=true lets the config reference ${HOSTNAME}
    command: -config.file=/etc/promtail/config.yaml -config.expand-env=true
    restart: unless-stopped

volumes:
  loki-data:
```
### Loki configuration

```yaml
# loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 5m
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2020-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/cache
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h  # 7 days
  retention_period: 168h            # 7 days; used by the compactor below
  ingestion_rate_mb: 16
  ingestion_burst_size_mb: 24

compactor:
  working_directory: /loki/compactor
  shared_store: filesystem
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

chunk_store_config:
  max_look_back_period: 168h  # 7 days

table_manager:
  # The table manager's retention applies only to legacy index stores;
  # with boltdb-shipper, retention is handled by the compactor above.
  retention_deletes_enabled: true
  retention_period: 168h  # 7 days
```
### Promtail configuration

```yaml
# promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # Docker containers
  - job_name: docker
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          host: ${HOSTNAME}
          __path__: /var/lib/docker/containers/*/*log
    pipeline_stages:
      - json:
          expressions:
            output: log
            stream: stream
            time: time
      - timestamp:
          source: time
          format: RFC3339Nano
      - output:
          source: output
      - labels:
          stream:
      # Extract the container ID
      - regex:
          expression: '/var/lib/docker/containers/(?P<container_id>[^/]+)/.*'
      - labels:
          container_id:
  # OpenStack logs
  - job_name: openstack
    static_configs:
      - targets:
          - localhost
        labels:
          job: openstack
          host: ${HOSTNAME}
          __path__: /var/log/kolla/*/*.log
    pipeline_stages:
      - regex:
          expression: '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (?P<pid>\d+) (?P<level>\w+) (?P<component>\S+) \[(?P<request_id>[^\]]*)\] (?P<message>.*)$'
      - labels:
          level:
          component:
      - timestamp:
          source: timestamp
          format: '2006-01-02 15:04:05.000000'
  # Ceph logs
  - job_name: ceph
    static_configs:
      - targets:
          - localhost
        labels:
          job: ceph
          host: ${HOSTNAME}
          __path__: /var/log/ceph/*.log
  # System logs (systemd journal)
  - job_name: syslog
    journal:
      max_age: 12h
      labels:
        job: syslog
        host: ${HOSTNAME}
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
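The regex in the `openstack` job is the key piece of that pipeline: it turns oslo.log lines into `level` and `component` labels. A quick sanity check of the same expression in Python, run against a hand-written sample line (not a real log):

```python
import re

# Same pattern as the Promtail pipeline stage above
OSLO_LOG_RE = re.compile(
    r'^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) '
    r'(?P<pid>\d+) (?P<level>\w+) (?P<component>\S+) '
    r'\[(?P<request_id>[^\]]*)\] (?P<message>.*)$'
)

sample = ("2024-01-01 12:00:00.123 4567 ERROR nova.compute.manager "
          "[req-abc123 admin admin] Instance failed to spawn")

m = OSLO_LOG_RE.match(sample)
assert m is not None
# level and component become Loki labels; the rest stays in the log body
print(m.group("level"), m.group("component"))
```

If a service's log lines do not match this layout (Ceph, for instance, uses its own format), the regex stage simply leaves them unlabeled, which is why the `ceph` job above has no pipeline.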
### LogQL queries

```logql
# Logs from a specific service
{job="openstack", component="nova-api"}

# Error logs only
{job="openstack"} |= "ERROR"

# Logs matching a pattern
{job="openstack"} |~ "failed|error|exception"

# Logs for a specific instance
{job="openstack"} |= "instance_id=abc123"

# Count errors per service
sum(count_over_time({job="openstack", level="ERROR"}[5m])) by (component)

# Top 10 components by error count
topk(10, sum(count_over_time({job="openstack", level="ERROR"}[1h])) by (component))

# Request latency (extracted from the log line)
{job="openstack", component="nova-api"}
  | regexp `took (?P<duration>\d+\.\d+) seconds`
  | duration > 1
```
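The last query only filters on the extracted `duration`; the same extraction can feed a metric aggregation via `unwrap`. A sketch, assuming the log lines really contain `took N.NN seconds` as in the regexp above:

```logql
# 95th-percentile request duration per component, from the extracted field
quantile_over_time(0.95,
  {job="openstack", component="nova-api"}
    | regexp `took (?P<duration>\d+\.\d+) seconds`
    | unwrap duration [5m]
) by (component)
```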
### Grafana dashboard for logs

```json
{
  "title": "OpenStack Logs",
  "panels": [
    {
      "title": "Log Volume",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(count_over_time({job=\"openstack\"}[5m])) by (component)",
          "legendFormat": "{{component}}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(count_over_time({job=\"openstack\", level=\"ERROR\"}[5m])) by (component)",
          "legendFormat": "{{component}}"
        }
      ]
    },
    {
      "title": "Logs Explorer",
      "type": "logs",
      "targets": [
        {
          "expr": "{job=\"openstack\"} | line_format \"{{.component}}: {{.message}}\""
        }
      ]
    }
  ]
}
```
### Log-based alerts

These rules are written in LogQL, so they are evaluated by Loki's ruler component, not by Prometheus (which cannot evaluate LogQL).

```yaml
# /loki/rules/fake/log-alerts.yml — loaded by the Loki ruler
# ("fake" is the tenant directory used when auth_enabled is false)
groups:
  - name: log-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(count_over_time({job="openstack", level="ERROR"}[5m])) by (component) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in {{ $labels.component }}"
          description: "{{ $value }} errors in the last 5 minutes"
      - alert: ServiceCrashLoop
        expr: |
          count_over_time({job="openstack"} |= "Starting" [10m]) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service appears to be crash looping"
      - alert: AuthenticationFailures
        expr: |
          sum(count_over_time({job="openstack", component="keystone"} |= "Authentication failed"[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Multiple authentication failures detected"
```
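For Loki to load such rules and fire notifications, its ruler has to be configured. A minimal sketch to add to `loki-config.yaml`, assuming Alertmanager is reachable at `http://alertmanager:9093` and the rule files are mounted under `/loki/rules` (both are assumptions for this example):

```yaml
# Addition to loki-config.yaml — minimal ruler setup (Loki 2.9.x keys)
ruler:
  storage:
    type: local
    local:
      directory: /loki/rules    # one subdirectory per tenant ("fake" here)
  rule_path: /loki/rules-temp   # scratch space for the ruler
  alertmanager_url: http://alertmanager:9093
  enable_api: true
```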
### Flow diagram

```mermaid
sequenceDiagram
    participant Src as Container/Service
    participant Docker as Docker<br/>(stdout/stderr)
    participant Promtail
    participant Loki
    participant Grafana
    actor Ops as Operator
    Src->>Docker: Write log
    Docker->>Docker: Store in<br/>/var/lib/docker/containers
    Promtail->>Docker: Tail logs
    Promtail->>Promtail: Parse (regex/json)<br/>Add labels
    Promtail->>Loki: Push (batched)
    Loki->>Loki: Index labels<br/>Store chunks
    Ops->>Grafana: Query logs
    Grafana->>Loki: LogQL query
    Loki-->>Grafana: Log entries
    Grafana-->>Ops: Display results
```
## Practical examples

### Searching for problems

```bash
# Via the Loki HTTP API
curl -G -s "http://loki:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={job="openstack"} |= "ERROR"' \
  --data-urlencode 'start=1704067200' \
  --data-urlencode 'end=1704153600' \
  | jq '.data.result[].values[] | .[1]'

# With logcli (the Loki CLI); it reads the server address from LOKI_ADDR
export LOKI_ADDR=http://loki:3100
logcli query '{job="openstack", component="nova-api"}' --limit=100

# Recent errors
logcli query '{job="openstack", level="ERROR"}' --since=1h
```
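The same `query_range` call is easy to script. A sketch in Python that builds the URL and flattens the JSON response into plain log lines; the endpoint and response shape follow Loki's HTTP API, while the host name is an assumption:

```python
import urllib.parse

def build_query_url(base: str, logql: str, start: int, end: int) -> str:
    """Build a /loki/api/v1/query_range URL with an encoded LogQL query."""
    params = urllib.parse.urlencode(
        {"query": logql, "start": start, "end": end}
    )
    return f"{base}/loki/api/v1/query_range?{params}"

def extract_lines(response: dict) -> list[str]:
    """Flatten a parsed query_range response into its log lines."""
    return [
        value[1]  # each value is a [timestamp_ns, line] pair
        for stream in response["data"]["result"]
        for value in stream["values"]
    ]

url = build_query_url(
    "http://loki:3100", '{job="openstack"} |= "ERROR"', 1704067200, 1704153600
)
# The HTTP fetch itself is left out; feed the parsed JSON body to extract_lines().
```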
### Correlating logs and metrics

```
# In Grafana, put both panels on the same time range.
# Panel 1: the metric (PromQL)
rate(http_requests_total{status="500"}[5m])

# Panel 2: the matching logs (LogQL)
{job="openstack", component="nova-api"} |= "500"
```
### Retention and archiving

```yaml
# Retention configuration
limits_config:
  retention_period: 720h  # 30 days

# For long-term archiving, use object storage (S3 or Swift):
storage_config:
  aws:
    s3: s3://access_key:secret_key@region/bucket_name
```
## Resources
## Checkpoint

- Loki deployed and reachable
- Promtail collecting the Docker logs
- OpenStack logs carrying the right labels
- Loki data source added in Grafana
- Logs dashboard working
- LogQL queries mastered
- Log-based alerts configured
- Retention configured