Troubleshooting

Introduction

Effective OpenStack troubleshooting requires a structured methodology and a solid working knowledge of the logs, metrics, and diagnostic tools. This section provides procedures for resolving common problems.

Prerequisites

  • Admin access to the OpenStack services
  • Logging configured
  • Monitoring in place
  • Understanding of the OpenStack architecture
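
A quick pre-flight check can confirm that the command-line tools behind these prerequisites are actually installed on the node before a diagnosis starts. This is a minimal sketch: `missing_tools` is a hypothetical helper, and the tool list in the usage note is an assumption based on the scripts in this section.

```shell
#!/bin/bash
# missing_tools: print every command from the argument list that is not on PATH.
# Hypothetical helper; prints nothing when all tools are available.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Usage: `missing_tools openstack docker curl ssh` prints one line per missing tool, so an empty output means the node is ready.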

Learning points

Diagnostic methodology

flowchart TD
    Start([Start]) --> Symptom[Identify the symptom<br/>API error, VM down, slowness...]

    Symptom --> Collect[Collect information]

    subgraph "Collection"
        Collect --> Logs[Logs of the affected services]
        Logs --> Metrics[CPU, RAM, network metrics]
        Metrics --> State[Service status]
        State --> Changes[Recent changes]
    end

    Changes --> Hypotheses[Formulate hypotheses]

    Hypotheses --> Test1[Test hypothesis 1]
    Test1 --> Resolved1{Problem<br/>resolved?}
    Resolved1 -->|Yes| Doc1[Document the solution]
    Doc1 --> End1([Stop])

    Resolved1 -->|No| Test2[Test hypothesis 2]
    Test2 --> Resolved2{Problem<br/>resolved?}
    Resolved2 -->|Yes| Doc2[Document the solution]
    Doc2 --> End2([Stop])

    Resolved2 -->|No| Escalate[Escalate or investigate further]
    Escalate --> Consult[Consult documentation/community]
    Consult --> End3([Stop])
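
The test-then-escalate loop in the flowchart can be sketched as a small shell function. This is an illustration only: `try_hypotheses` is a hypothetical name, and each hypothesis is assumed to be a command or function that exits 0 when it confirms the cause.

```shell
#!/bin/bash
# try_hypotheses: run each check in order and stop at the first one that succeeds.
# Prints the name of the confirmed hypothesis, or "escalate" when none succeeds.
try_hypotheses() {
    local check
    for check in "$@"; do
        if "$check"; then
            echo "$check"
            return 0
        fi
    done
    echo "escalate"
    return 1
}
```

Ordering the checks from cheapest to most expensive keeps the common cases fast and leaves escalation as the explicit fallback.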

Log architecture

graph TB
    subgraph "OpenStack Services"
        keystone[Keystone<br/>/var/log/kolla/keystone/]
        nova[Nova<br/>/var/log/kolla/nova/]
        neutron[Neutron<br/>/var/log/kolla/neutron/]
        cinder[Cinder<br/>/var/log/kolla/cinder/]
    end

    subgraph "Infrastructure"
        mariadb[MariaDB<br/>/var/log/kolla/mariadb/]
        rabbitmq[RabbitMQ<br/>/var/log/kolla/rabbitmq/]
        haproxy[HAProxy<br/>/var/log/kolla/haproxy/]
    end

    subgraph "Logging Stack"
        promtail[Promtail<br/>Log collection]
        loki[(Loki<br/>Storage)]
        grafana[Grafana<br/>Visualization]
    end

    keystone -->|logs| promtail
    nova -->|logs| promtail
    neutron -->|logs| promtail
    cinder -->|logs| promtail
    mariadb -->|logs| promtail
    rabbitmq -->|logs| promtail
    haproxy -->|logs| promtail
    promtail -->|push| loki
    loki -->|query| grafana
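
Once logs reach Loki, they can be queried from the command line as well as from Grafana. The sketch below builds a LogQL line-filter query; the `service` label and the `LOKI_URL` variable are assumptions, since label names depend on the Promtail configuration.

```shell
#!/bin/bash
# loki_error_query: build a LogQL query that selects one service's log stream
# and keeps only the lines containing the given text.
# The "service" label is an assumption; use the labels your Promtail config sets.
loki_error_query() {
    printf '{service="%s"} |= "%s"\n' "$1" "$2"
}

# Example call against the Loki HTTP API (LOKI_URL is a placeholder):
# curl -G -s "$LOKI_URL/loki/api/v1/query_range" \
#     --data-urlencode "query=$(loki_error_query nova ERROR)" \
#     --data-urlencode "limit=50"
```

Without explicit `start` and `end` parameters, `query_range` covers the last hour.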

Essential diagnostic commands

#!/bin/bash
# diagnostic-openstack.sh

echo "=== OpenStack Diagnostic ==="

# 1. Service status
echo -e "\n[1] Services Status"
openstack service list
openstack compute service list --long
openstack network agent list
openstack volume service list

# 2. Endpoint status
echo -e "\n[2] Endpoints"
openstack endpoint list --interface public

# 3. MariaDB cluster status
echo -e "\n[3] MariaDB Galera"
docker exec mariadb mysql -e "
    SHOW STATUS LIKE 'wsrep_cluster_size';
    SHOW STATUS LIKE 'wsrep_cluster_status';
    SHOW STATUS LIKE 'wsrep_local_state_comment';
"

# 4. RabbitMQ status
echo -e "\n[4] RabbitMQ"
docker exec rabbitmq rabbitmqctl cluster_status
docker exec rabbitmq rabbitmqctl list_queues name messages consumers

# 5. Containers
# Kolla container names (nova_api, mariadb, ...) do not contain "kolla",
# so the image column is included for the filter to match on
# (assumes the image names contain "kolla" or "ceph").
echo -e "\n[5] Docker Containers"
docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}" | grep -E "(kolla|ceph)"

# 6. System resources
echo -e "\n[6] System Resources"
free -h
df -h /var/lib/docker
uptime

# 7. Recent errors
echo -e "\n[7] Recent Errors"
grep "ERROR" /var/log/kolla/*/*.log 2>/dev/null | tail -20
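
The Galera status values from step 3 are easier to act on when unhealthy ones are flagged automatically. The sketch below assumes tab-separated `name value` output (what `mysql -N -B` produces) and treats `Primary` and `Synced` as the healthy values; `galera_problems` is a hypothetical helper.

```shell
#!/bin/bash
# galera_problems: read "name<TAB>value" status pairs on stdin and print
# every wsrep value that differs from the healthy one. Prints nothing when healthy.
galera_problems() {
    awk '
        $1 == "wsrep_cluster_status"      && $2 != "Primary" { print $1 "=" $2 }
        $1 == "wsrep_local_state_comment" && $2 != "Synced"  { print $1 "=" $2 }
    '
}

# Example (credentials are assumed to be configured inside the container):
# docker exec mariadb mysql -N -B -e "SHOW STATUS LIKE 'wsrep_%'" | galera_problems
```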

Common problems and solutions

Problem: VM does not start

#!/bin/bash
# debug-vm-boot.sh

VM_ID=$1
if [ -z "$VM_ID" ]; then
    echo "Usage: $0 <vm_id>" >&2
    exit 1
fi

echo "=== Debugging VM: $VM_ID ==="

# 1. VM status
echo -e "\n[1] VM Status"
openstack server show "$VM_ID" -f yaml

# 2. Nova logs for this VM (Nova logs reference the instance UUID)
echo -e "\n[2] Nova Logs"
grep -- "$VM_ID" /var/log/kolla/nova/nova-compute.log | tail -50

# 3. Console log
echo -e "\n[3] Console Log"
openstack console log show "$VM_ID" --lines 100

# 4. Check the compute node
# libvirt lists domains by instance name (instance-XXXXXXXX), not by UUID.
# If virsh is not installed on the host, run it inside the libvirt container instead.
COMPUTE=$(openstack server show "$VM_ID" -f value -c OS-EXT-SRV-ATTR:host)
INSTANCE_NAME=$(openstack server show "$VM_ID" -f value -c OS-EXT-SRV-ATTR:instance_name)
echo -e "\n[4] Compute Node: $COMPUTE"
ssh "$COMPUTE" "virsh list --all | grep '$INSTANCE_NAME'"
ssh "$COMPUTE" "virsh dominfo '$INSTANCE_NAME' 2>/dev/null | head -20"

# 5. Check the network
echo -e "\n[5] Network Ports"
openstack port list --server "$VM_ID"

# 6. Check the volumes
echo -e "\n[6] Volumes"
openstack server show "$VM_ID" -f value -c volumes_attached

# Possible solutions
cat << EOF

=== Possible solutions ===
1. If "No valid host": check quotas and compute resources
   openstack hypervisor stats show

2. If network error: check the Neutron agents
   openstack network agent list

3. If volume error: check Cinder
   openstack server show $VM_ID -c volumes_attached
   openstack volume service list

4. Force a rebuild if the VM is stuck:
   openstack server rebuild $VM_ID --image <image_id>
EOF
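
The log search in step 2 only finds matches when the argument is the instance UUID, because Nova logs reference instances by UUID rather than by name. A small guard, sketched here with the hypothetical helper `is_uuid`, can catch that mistake before the script runs.

```shell
#!/bin/bash
# is_uuid: succeed when the argument has the 8-4-4-4-12 hexadecimal UUID layout.
is_uuid() {
    printf '%s\n' "$1" |
        grep -Eqi '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

# Example guard at the top of a script:
# is_uuid "$VM_ID" || { echo "Expected an instance UUID, got: $VM_ID" >&2; exit 1; }
```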

Problem: Slow API or timeouts

#!/bin/bash
# debug-api-slow.sh

echo "=== API Performance Debug ==="

# 1. Response time test
echo -e "\n[1] API Response Times"
for service in identity:5000 compute:8774 network:9696 image:9292; do
    name=${service%%:*}
    port=${service##*:}
    elapsed=$(curl -s -o /dev/null -w "%{time_total}" "https://cloud.example.com:$port/")
    echo "$name: ${elapsed}s"
done

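# 2. HAProxy stats
# The stats path is normally a UNIX socket, so it is queried with "show stat"
# instead of being read with cat. Assumes socat is available in the container.
echo -e "\n[2] HAProxy Backend Status"
docker exec haproxy sh -c 'echo "show stat" | socat stdio /var/lib/haproxy/stats' | grep "BACKEND"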

# 3. Database connections
echo -e "\n[3] MariaDB Connections"
docker exec mariadb mysql -e "
    SHOW STATUS LIKE 'Threads_connected';
    SHOW STATUS LIKE 'Max_used_connections';
    SHOW PROCESSLIST;
"

# 4. RabbitMQ queues
echo -e "\n[4] RabbitMQ Queue Depth"
docker exec rabbitmq rabbitmqctl list_queues name messages | sort -k2 -n -r | head -10

# 5. System metrics
echo -e "\n[5] System Load"
for host in controller-{1,2,3}; do
    echo "$host:"
    ssh "$host" "uptime; free -h | grep Mem"
done

# Solutions
cat << EOF

=== Possible solutions ===
1. If the number of DB connections is high:
   - Check for slow queries: SHOW FULL PROCESSLIST;
   - Increase max_connections
   - Optimize the queries

2. If the RabbitMQ queue depth is high:
   - Check the consumers
   - Increase the number of workers
   - Purge obsolete queues

3. If the system load is high:
   - Identify the process: top -c
   - Check I/O: iostat -x 1 5
   - Scale horizontally
EOF
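
The response times from step 1 are fractional seconds, which the shell's integer-only `[ ]` comparisons cannot handle. The sketch below delegates the comparison to `awk`; `is_slow` is a hypothetical helper and the 1-second threshold in the usage note is an arbitrary assumption.

```shell
#!/bin/bash
# is_slow: succeed when a response time in seconds is strictly above a threshold.
# Both values may be fractional; awk performs the floating-point comparison.
is_slow() {
    awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 > limit + 0) }'
}
```

Usage inside the timing loop: `is_slow "$elapsed" 1.0 && echo "WARNING: $name is slow"`.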

Problem: Loss of network connectivity

#!/bin/bash
# debug-network.sh

VM_ID=$1
if [ -z "$VM_ID" ]; then
    echo "Usage: $0 <vm_id>" >&2
    exit 1
fi
echo "=== Network Debug for VM: $VM_ID ==="

# 1. VM ports
echo -e "\n[1] VM Ports"
openstack port list --server "$VM_ID" -f yaml

# 2. Details of the first port
PORT_ID=$(openstack port list --server "$VM_ID" -f value -c ID | head -1)
echo -e "\n[2] Port Details"
openstack port show "$PORT_ID"

# 3. Network namespace
NETWORK_ID=$(openstack port show "$PORT_ID" -f value -c network_id)
echo -e "\n[3] Network Namespace"
ssh network-node "ip netns list | grep $NETWORK_ID"

# 4. Check the agents
echo -e "\n[4] Network Agents"
openstack network agent list

# 5. OVS flows
echo -e "\n[5] OVS Flows"
COMPUTE=$(openstack server show "$VM_ID" -f value -c OS-EXT-SRV-ATTR:host)
ssh "$COMPUTE" "ovs-vsctl show"
ssh "$COMPUTE" "ovs-ofctl dump-flows br-int | head -20"

# 6. Security groups
# The ID list may be printed with brackets, quotes and commas; strip them before xargs
echo -e "\n[6] Security Groups"
openstack port show "$PORT_ID" -f value -c security_group_ids | \
    tr -d "[]'," | xargs -n 1 openstack security group rule list

# 7. ICMP test from the DHCP namespace, using the port's first IPv4 address
echo -e "\n[7] Connectivity Test"
VM_IP=$(openstack port show "$PORT_ID" -f value -c fixed_ips | grep -oP '\d+\.\d+\.\d+\.\d+' | head -1)
ssh network-node "ip netns exec qdhcp-$NETWORK_ID ping -c 3 $VM_IP"

# Solutions
cat << EOF

=== Possible solutions ===
1. If an agent is down: restart the service
   docker restart neutron_openvswitch_agent

2. If a security group is blocking traffic:
   openstack security group rule create --ingress --protocol icmp <sg>

3. If OVS is the problem (last resort: deleting br-int disconnects every VM on the node):
   ovs-vsctl --if-exists del-br br-int && ovs-vsctl add-br br-int
   systemctl restart openvswitch

4. If DHCP is the problem:
   docker restart neutron_dhcp_agent
EOF
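
The connectivity test needs a single IP address, while the `fixed_ips` field of a port mixes addresses with subnet IDs and may list several entries. A minimal sketch of the extraction, using the hypothetical helper `first_ipv4` (IPv4 only, `grep -o` assumed available):

```shell
#!/bin/bash
# first_ipv4: print the first dotted-quad IPv4 address found on stdin.
# Prints nothing when the input contains no IPv4 address.
first_ipv4() {
    grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' | head -n 1
}

# Example:
# openstack port show "$PORT_ID" -f value -c fixed_ips | first_ipv4
```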

Troubleshooting flow diagram

flowchart TD
    Start([Start]) --> Error[VM status = ERROR]

    Error --> CheckAPI[Check nova-api logs]
    CheckAPI --> ErrorVisible{Error<br/>visible?}
    ErrorVisible -->|Yes| Analyze[Analyze the error]
    ErrorVisible -->|No| CheckScheduler[Check nova-scheduler logs]

    Analyze --> NoValidHost{NoValidHost?}
    CheckScheduler --> NoValidHost

    NoValidHost -->|Yes| CheckHV[Check hypervisor stats<br/>Check project quotas<br/>Check available flavors]
    NoValidHost -->|No| NetworkError{Network<br/>error?}

    CheckHV --> NetworkError
    NetworkError -->|Yes| CheckNetwork[Check Neutron agents<br/>Check port creation<br/>Check DHCP]
    NetworkError -->|No| VolumeError{Volume<br/>error?}

    CheckNetwork --> VolumeError
    VolumeError -->|Yes| CheckVolume[Check Cinder services<br/>Check Ceph free space<br/>Check attachment]
    VolumeError -->|No| ImageError{Image<br/>error?}

    CheckVolume --> ImageError
    ImageError -->|Yes| CheckImage[Check the image exists<br/>Check the download<br/>Check the checksum]
    ImageError -->|No| CheckCompute[Check nova-compute logs]

    CheckImage --> CheckCompute
    CheckCompute --> CheckLibvirt[Check libvirt logs]

    CheckLibvirt --> Resolved{Resolved?}
    Resolved -->|Yes| Document[Document the solution]
    Resolved -->|No| Escalate[Escalate to level 2]

    Document --> End([Stop])
    Escalate --> End
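
The decision order of this diagram can be sketched as a function that maps an error message to the branch to investigate first. The match strings are illustrative assumptions rather than a complete list of Nova error messages, and `classify_fault` is a hypothetical helper.

```shell
#!/bin/bash
# classify_fault: map an error message to a branch of the flow above.
# Branches are tested in the same order as the diagram; "compute" is the fallback
# that leads to the nova-compute and libvirt logs.
classify_fault() {
    case "$1" in
        *"No valid host"*|*NoValidHost*)     echo "scheduling" ;;
        *[Nn]eutron*|*[Pp]ort*|*[Nn]etwork*) echo "network" ;;
        *[Cc]inder*|*[Vv]olume*)             echo "volume" ;;
        *[Gg]lance*|*[Ii]mage*)              echo "image" ;;
        *)                                   echo "compute" ;;
    esac
}

# Example:
# classify_fault "$(openstack server show "$VM_ID" -f value -c fault)"
```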

Advanced diagnostic tools

# === Rally - Benchmarking and diagnostics ===
# Install Rally
pip install rally-openstack

# Create a deployment
rally deployment create --fromenv --name openstack

# Run tests
rally task start /opt/rally/boot-and-delete.yaml

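# === OSProfiler - Distributed tracing ===
# Enable in each service's configuration
# [profiler]
# enabled = true
# hmac_keys = secret
# connection_string = redis://localhost:6379

# Trace a request (the key must match hmac_keys)
openstack --os-profile secret server list
# Read the trace back from the same storage backend
osprofiler trace show --connection-string redis://localhost:6379 <trace_id>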

# === Tempest - Functional tests ===
pip install tempest
tempest init mytest
cd mytest
tempest run --regex 'tempest.api.compute.servers'

Logs to watch

# Critical log files per service

keystone:
  files:
    - /var/log/kolla/keystone/keystone.log
    - /var/log/kolla/keystone/keystone-access.log
  patterns:
    - "ERROR"
    - "AuthorizationFailure"
    - "TokenNotFound"

nova:
  files:
    - /var/log/kolla/nova/nova-api.log
    - /var/log/kolla/nova/nova-scheduler.log
    - /var/log/kolla/nova/nova-compute.log
  patterns:
    - "NoValidHost"
    - "BuildAbortException"
    - "InstanceNotFound"

neutron:
  files:
    - /var/log/kolla/neutron/neutron-server.log
    - /var/log/kolla/neutron/neutron-openvswitch-agent.log
    - /var/log/kolla/neutron/neutron-l3-agent.log
  patterns:
    - "Agent is not reachable"
    - "Port not found"
    - "NetworkNotFound"

cinder:
  files:
    - /var/log/kolla/cinder/cinder-api.log
    - /var/log/kolla/cinder/cinder-volume.log
    - /var/log/kolla/cinder/cinder-scheduler.log
  patterns:
    - "VolumeNotFound"
    - "ImageNotAuthorized"
    - "NoValidBackend"

mariadb:
  files:
    - /var/log/kolla/mariadb/mariadb.log
  patterns:
    - "WSREP"
    - "deadlock"
    - "too many connections"

rabbitmq:
  files:
    - /var/log/kolla/rabbitmq/rabbit@*.log
  patterns:
    - "connection refused"
    - "queue overflow"
    - "network partition"
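
A pattern list like this is most useful when it can be checked quickly across several files. The sketch below counts occurrences of one fixed-string pattern per log file; `count_pattern` is a hypothetical helper, and unreadable files are skipped silently.

```shell
#!/bin/bash
# count_pattern: print "<count> <file>" for each readable file, where count is the
# number of lines containing the pattern (matched as a fixed string, not a regex).
count_pattern() {
    local pattern="$1" file
    shift
    for file in "$@"; do
        [ -r "$file" ] || continue
        printf '%s %s\n' "$(grep -cF -- "$pattern" "$file")" "$file"
    done
}

# Example:
# count_pattern NoValidHost /var/log/kolla/nova/*.log
```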

Practical examples

Support bundle collection script

#!/bin/bash
# collect-support-bundle.sh

BUNDLE_DIR="/tmp/support-$(date +%Y%m%d-%H%M%S)"
mkdir -p $BUNDLE_DIR

echo "Collecting support information..."

# 1. System info
echo "=== System ===" > $BUNDLE_DIR/system.txt
uname -a >> $BUNDLE_DIR/system.txt
cat /etc/os-release >> $BUNDLE_DIR/system.txt
free -h >> $BUNDLE_DIR/system.txt
df -h >> $BUNDLE_DIR/system.txt
uptime >> $BUNDLE_DIR/system.txt

# 2. OpenStack info
echo "=== OpenStack ===" > $BUNDLE_DIR/openstack.txt
openstack service list >> $BUNDLE_DIR/openstack.txt 2>&1
openstack compute service list >> $BUNDLE_DIR/openstack.txt 2>&1
openstack network agent list >> $BUNDLE_DIR/openstack.txt 2>&1

# 3. Container status
docker ps -a > $BUNDLE_DIR/docker-ps.txt

# 4. Logs (last 1000 lines each)
mkdir -p $BUNDLE_DIR/logs
for service in keystone nova neutron cinder; do
    if [ -d "/var/log/kolla/$service" ]; then
        tail -1000 /var/log/kolla/$service/*.log > $BUNDLE_DIR/logs/$service.log 2>/dev/null
    fi
done

# 5. Configs (sanitized)
mkdir -p $BUNDLE_DIR/config
cp /etc/kolla/globals.yml $BUNDLE_DIR/config/
sed -i 's/password.*/password: REDACTED/g' $BUNDLE_DIR/config/globals.yml

# 6. Package
tar -czvf ${BUNDLE_DIR}.tar.gz -C /tmp $(basename $BUNDLE_DIR)
rm -rf $BUNDLE_DIR

echo "Support bundle: ${BUNDLE_DIR}.tar.gz"
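
The redaction step in the script only covers keys that contain the word `password`. A slightly broader filter is sketched below; the list of key suffixes (`password`, `secret`, `key`, `token`) is an assumption about which names hold sensitive values, and `redact_secrets` is a hypothetical helper.

```shell
#!/bin/bash
# redact_secrets: filter YAML on stdin and mask the value of every key whose name
# ends in password, secret, key or token. Other lines pass through unchanged.
redact_secrets() {
    sed -E 's/^([[:space:]]*[A-Za-z0-9_]*(password|secret|key|token)[[:space:]]*:).*/\1 REDACTED/'
}

# Example:
# redact_secrets < /etc/kolla/globals.yml > "$BUNDLE_DIR/config/globals.yml"
```

Filtering while copying avoids ever writing the unredacted file into the bundle directory.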

Checkpoint

  • Diagnostic scripts created
  • Logs centralized and accessible
  • Alerts on critical errors
  • Procedures documented per problem type
  • Support team trained
  • Knowledge base populated
  • Procedures tested regularly
  • Escalation path defined