Troubleshooting
Introduction
Troubleshooting OpenStack effectively requires a structured methodology and solid familiarity with the logs, metrics, and diagnostic tools. This section provides procedures for resolving common problems.
Prerequisites
- Admin access to the OpenStack services
- Logging configured
- Monitoring in place
- Understanding of the OpenStack architecture
Key topics
Diagnostic methodology
flowchart TD
Start([Start]) --> Symptom[Identify the symptom<br/>API error, VM down, slowness...]
Symptom --> Collect[Collect information]
subgraph "Collection"
Collect --> Logs[Logs of the affected services]
Logs --> Metrics[CPU, RAM, and network metrics]
Metrics --> State[Service state]
State --> Changes[Recent changes]
end
Changes --> Hypotheses[Formulate hypotheses]
Hypotheses --> Test1[Test hypothesis 1]
Test1 --> Resolved1{Problem<br/>solved?}
Resolved1 -->|Yes| Doc1[Document the solution]
Doc1 --> End1([Stop])
Resolved1 -->|No| Test2[Test hypothesis 2]
Test2 --> Resolved2{Problem<br/>solved?}
Resolved2 -->|Yes| Doc2[Document the solution]
Doc2 --> End2([Stop])
Resolved2 -->|No| Escalate[Escalate or investigate further]
Escalate --> Consult[Consult documentation/community]
Consult --> End3([Stop])
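The hypothesis loop in the diagram can be sketched as a small shell helper. This is only an illustration: the `check_*` functions are hypothetical placeholders that you would replace with real checks, each returning 0 when its hypothesis explains the symptom.

```bash
#!/bin/bash
# Hypothetical placeholder checks (not from the original): each one should
# return 0 when its hypothesis is confirmed, non-zero otherwise.
check_service_down() { return 1; }   # e.g. inspect the service state
check_disk_full()    { return 1; }   # e.g. inspect disk usage

# Try each hypothesis in order. Print the first confirmed one,
# or "escalate" when none of them explains the symptom.
diagnose() {
    local check
    for check in "$@"; do
        if "$check"; then
            echo "confirmed: $check"
            return 0
        fi
    done
    echo "escalate"
    return 1
}
```

A call such as `diagnose check_service_down check_disk_full` walks the hypotheses in the order given, which keeps the most likely causes first.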
Log architecture
graph TB
subgraph "OpenStack Services"
keystone[Keystone<br/>/var/log/kolla/keystone/]
nova[Nova<br/>/var/log/kolla/nova/]
neutron[Neutron<br/>/var/log/kolla/neutron/]
cinder[Cinder<br/>/var/log/kolla/cinder/]
end
subgraph "Infrastructure"
mariadb[MariaDB<br/>/var/log/kolla/mariadb/]
rabbitmq[RabbitMQ<br/>/var/log/kolla/rabbitmq/]
haproxy[HAProxy<br/>/var/log/kolla/haproxy/]
end
subgraph "Logging Stack"
promtail[Promtail<br/>Log collection]
loki[(Loki<br/>Storage)]
grafana[Grafana<br/>Visualization]
end
keystone -->|logs| promtail
nova -->|logs| promtail
neutron -->|logs| promtail
cinder -->|logs| promtail
mariadb -->|logs| promtail
rabbitmq -->|logs| promtail
haproxy -->|logs| promtail
promtail -->|push| loki
loki -->|query| grafana
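Once the logs reach Loki, they can also be queried from a shell through Loki's HTTP API rather than through Grafana. The sketch below assumes a Loki instance reachable at `LOKI_URL` and a `service` label attached by Promtail; both are assumptions that depend on your own configuration.

```bash
#!/bin/bash
# Assumptions (not from the original): LOKI_URL and the "service" label
# depend on how Promtail and Loki are configured in your deployment.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Build a LogQL selector with a line filter, e.g. {service="nova"} |= "ERROR"
build_logql() {
    local service=$1 pattern=$2
    printf '{service="%s"} |= "%s"' "$service" "$pattern"
}

# Fetch up to 50 matching lines through Loki's query_range endpoint,
# using Loki's default time window.
query_loki() {
    local query
    query=$(build_logql "$1" "$2")
    curl -sG "$LOKI_URL/loki/api/v1/query_range" \
        --data-urlencode "query=$query" \
        --data-urlencode "limit=50"
}
```

For example, `query_loki nova ERROR` would return the matching Nova log lines as JSON, provided the label scheme matches.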
Essential diagnostic commands
#!/bin/bash
# diagnostic-openstack.sh
echo "=== OpenStack Diagnostic ==="
# 1. Service status
echo -e "\n[1] Services Status"
openstack service list
openstack compute service list --long
openstack network agent list
openstack volume service list
# 2. Endpoint status
echo -e "\n[2] Endpoints"
openstack endpoint list --interface public
# 3. MariaDB Galera cluster status
echo -e "\n[3] MariaDB Galera"
docker exec mariadb mysql -e "
SHOW STATUS LIKE 'wsrep_cluster_size';
SHOW STATUS LIKE 'wsrep_cluster_status';
SHOW STATUS LIKE 'wsrep_local_state_comment';
"
# 4. RabbitMQ status
echo -e "\n[4] RabbitMQ"
docker exec rabbitmq rabbitmqctl cluster_status
docker exec rabbitmq rabbitmqctl list_queues name messages consumers
# 5. Containers
echo -e "\n[5] Docker Containers"
# Kolla names containers after the service (nova_api, mariadb, ...),
# so filtering on "kolla" would hide most of them: list everything.
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"
# 6. System resources
echo -e "\n[6] System Resources"
free -h
df -h /var/lib/docker
uptime
# 7. Recent errors
echo -e "\n[7] Recent Errors"
grep "ERROR" /var/log/kolla/*/*.log 2>/dev/null | tail -20
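The three Galera variables printed in step 3 can be interpreted automatically. The helper below is a minimal sketch: it assumes the tab-separated `name value` output that `mysql -e` produces when piped, and it treats the expected cluster size (3 by default) as a parameter you set for your own deployment.

```bash
#!/bin/bash
# Read "name<TAB>value" status lines on stdin and report node health.
# The expected cluster size is an assumption: pass your own node count.
check_galera() {
    local expected_size=${1:-3}
    awk -v want="$expected_size" '
        $1 == "wsrep_cluster_size"        { size = $2 }
        $1 == "wsrep_cluster_status"      { status = $2 }
        $1 == "wsrep_local_state_comment" { state = $2 }
        END {
            # Healthy: full membership, primary component, node synced
            if (size == want && status == "Primary" && state == "Synced") {
                print "OK"
            } else {
                printf "DEGRADED size=%s status=%s state=%s\n", size, status, state
                exit 1
            }
        }'
}
```

It could be fed with `docker exec mariadb mysql -e "SHOW STATUS LIKE 'wsrep_%';" | check_galera 3`; the header line and unrelated variables are ignored.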
Common problems and solutions
Problem: VM does not start
#!/bin/bash
# debug-vm-boot.sh
VM_ID=$1
if [ -z "$VM_ID" ]; then
    echo "Usage: $0 <vm_id>"
    exit 1
fi
echo "=== Debugging VM: $VM_ID ==="
# 1. VM status
echo -e "\n[1] VM Status"
openstack server show "$VM_ID" -f yaml
# 2. Nova logs for this VM
echo -e "\n[2] Nova Logs"
grep "$VM_ID" /var/log/kolla/nova/nova-compute.log | tail -50
# 3. Console log
echo -e "\n[3] Console Log"
openstack console log show "$VM_ID" --lines 100
# 4. Check the compute node
COMPUTE=$(openstack server show "$VM_ID" -f value -c OS-EXT-SRV-ATTR:host)
# libvirt lists domains by instance name (instance-XXXXXXXX), not by server UUID
INSTANCE_NAME=$(openstack server show "$VM_ID" -f value -c OS-EXT-SRV-ATTR:instance_name)
echo -e "\n[4] Compute Node: $COMPUTE"
# If libvirt runs in a container, prefix virsh with the matching docker exec
ssh "$COMPUTE" "virsh list --all | grep $INSTANCE_NAME"
ssh "$COMPUTE" "virsh dominfo $INSTANCE_NAME 2>/dev/null | head -20"
# 5. Check the network
echo -e "\n[5] Network Ports"
openstack port list --server "$VM_ID"
# 6. Check the volumes
echo -e "\n[6] Volumes"
openstack server show "$VM_ID" -f value -c volumes_attached
# Possible solutions
cat << EOF
=== Possible solutions ===
1. If "No valid host": check quotas and compute resources
   openstack hypervisor stats show
2. If network error: check the Neutron agents
   openstack network agent list
3. If volume error: check Cinder
   openstack server show $VM_ID -c volumes_attached
   openstack volume service list
4. Force a rebuild if the VM is stuck:
   openstack server rebuild $VM_ID --image <image_id>
EOF
Problem: Slow API or timeouts
#!/bin/bash
# debug-api-slow.sh
echo "=== API Performance Debug ==="
# 1. Measure response times
echo -e "\n[1] API Response Times"
for service in identity:5000 compute:8774 network:9696 image:9292; do
    name=${service%%:*}
    port=${service##*:}
    elapsed=$(curl -s -o /dev/null -w "%{time_total}" "https://cloud.example.com:$port/")
    echo "$name: ${elapsed}s"
done
# 2. HAProxy stats
echo -e "\n[2] HAProxy Backend Status"
# The stats path is a UNIX socket, so it has to be queried, not read with cat.
# Assumes socat is available in the container and the path matches haproxy.cfg.
docker exec haproxy sh -c 'echo "show stat" | socat stdio /var/lib/haproxy/stats' | grep "BACKEND"
# 3. Database connections
echo -e "\n[3] MariaDB Connections"
docker exec mariadb mysql -e "
SHOW STATUS LIKE 'Threads_connected';
SHOW STATUS LIKE 'Max_used_connections';
SHOW PROCESSLIST;
"
# 4. RabbitMQ queues
echo -e "\n[4] RabbitMQ Queue Depth"
docker exec rabbitmq rabbitmqctl list_queues name messages | sort -k2,2nr | head -10
# 5. System metrics
echo -e "\n[5] System Load"
for host in controller-{1,2,3}; do
    echo "$host:"
    ssh "$host" "uptime; free -h | grep Mem"
done
# Solutions
cat << EOF
=== Possible solutions ===
1. If DB connections are high:
   - Check for slow queries: SHOW FULL PROCESSLIST;
   - Increase max_connections
   - Optimize the queries
2. If RabbitMQ queue depth is high:
   - Check the consumers
   - Increase the number of workers
   - Purge obsolete queues
3. If system load is high:
   - Identify the process: top -c
   - Check I/O: iostat -x 1 5
   - Scale horizontally
EOF
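To act on the RabbitMQ advice above, it helps to separate queues that are merely busy from queues that nobody consumes. The following sketch filters the `name messages consumers` output of `rabbitmqctl list_queues`; the threshold of 100 messages is an arbitrary assumption to tune for your environment.

```bash
#!/bin/bash
# Read "name messages consumers" lines on stdin and print the queues that
# exceed the backlog threshold, or that hold messages without any consumer.
# Header and banner lines are skipped because their columns are not numeric.
flag_queues() {
    local threshold=${1:-100}
    awk -v max="$threshold" '
        NF == 3 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ {
            if ($2 + 0 > max + 0)        print $1 ": backlog " $2
            else if ($2 > 0 && $3 == 0)  print $1 ": no consumer"
        }'
}
```

A possible invocation is `docker exec rabbitmq rabbitmqctl list_queues name messages consumers | flag_queues 100`.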
Problem: Loss of network connectivity
#!/bin/bash
# debug-network.sh
VM_ID=$1
if [ -z "$VM_ID" ]; then
    echo "Usage: $0 <vm_id>"
    exit 1
fi
echo "=== Network Debug for VM: $VM_ID ==="
# 1. VM ports
echo -e "\n[1] VM Ports"
openstack port list --server "$VM_ID" -f yaml
PORT_ID=$(openstack port list --server "$VM_ID" -f value -c ID | head -1)
# 2. Port details
echo -e "\n[2] Port Details"
openstack port show "$PORT_ID"
# 3. Network namespace
NETWORK_ID=$(openstack port show "$PORT_ID" -f value -c network_id)
echo -e "\n[3] Network Namespace"
ssh network-node "ip netns list | grep $NETWORK_ID"
# 4. Check the agents
echo -e "\n[4] Network Agents"
openstack network agent list
# 5. OVS flows
echo -e "\n[5] OVS Flows"
COMPUTE=$(openstack server show "$VM_ID" -f value -c OS-EXT-SRV-ATTR:host)
ssh "$COMPUTE" "ovs-vsctl show"
ssh "$COMPUTE" "ovs-ofctl dump-flows br-int | head -20"
# 6. Security groups
echo -e "\n[6] Security Groups"
# Extract the UUIDs: the value formatter may print the list with brackets and quotes
openstack port show "$PORT_ID" -f value -c security_group_ids | \
    grep -oE '[0-9a-f-]{36}' | \
    xargs -I {} openstack security group rule list {}
# 7. ICMP test from the network namespace
echo -e "\n[7] Connectivity Test"
VM_IP=$(openstack port show "$PORT_ID" -f value -c fixed_ips | grep -oE '([0-9]+\.){3}[0-9]+' | head -1)
ssh network-node "ip netns exec qdhcp-$NETWORK_ID ping -c 3 $VM_IP"
# Solutions
cat << EOF
=== Possible solutions ===
1. If an agent is down: restart the service
   docker restart neutron_openvswitch_agent
2. If a security group is blocking traffic:
   openstack security group rule create --ingress --protocol icmp <sg>
3. If OVS is the problem (last resort: recreating br-int drops all flows on the node):
   ovs-vsctl --if-exists del-br br-int && ovs-vsctl add-br br-int
   systemctl restart openvswitch
4. If DHCP is the problem:
   docker restart neutron_dhcp_agent
EOF
Troubleshooting flow diagram
flowchart TD
Start([Start]) --> Error[VM status = ERROR]
Error --> CheckAPI[Check nova-api logs]
CheckAPI --> ErrorVisible{Error<br/>visible?}
ErrorVisible -->|Yes| Analyze[Analyze the error]
ErrorVisible -->|No| CheckScheduler[Check nova-scheduler logs]
Analyze --> NoValidHost{NoValidHost?}
CheckScheduler --> NoValidHost
NoValidHost -->|Yes| CheckHV[Check hypervisor stats<br/>Check project quotas<br/>Check available flavors]
NoValidHost -->|No| NetworkError{Network<br/>error?}
CheckHV --> NetworkError
NetworkError -->|Yes| CheckNetwork[Check Neutron agents<br/>Check port creation<br/>Check DHCP]
NetworkError -->|No| VolumeError{Volume<br/>error?}
CheckNetwork --> VolumeError
VolumeError -->|Yes| CheckVolume[Check Cinder services<br/>Check Ceph free space<br/>Check attachment]
VolumeError -->|No| ImageError{Image<br/>error?}
CheckVolume --> ImageError
ImageError -->|Yes| CheckImage[Check the image exists<br/>Check the download<br/>Check the checksum]
ImageError -->|No| CheckCompute[Review nova-compute logs]
CheckImage --> CheckCompute
CheckCompute --> CheckLibvirt[Check libvirt logs]
CheckLibvirt --> Resolved{Resolved?}
Resolved -->|Yes| Document[Document the solution]
Resolved -->|No| Escalate[Escalate to level 2]
Document --> End([Stop])
Escalate --> End
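The branching in this diagram can be approximated by a first-pass triage of the server's fault message. The sketch below is a heuristic only: the keywords are assumptions chosen for illustration, not an exhaustive list of Nova error strings, and they are tested in the same order as the diagram.

```bash
#!/bin/bash
# Map a fault message to the area to investigate first, following the
# order of the diagram: scheduler, network, volume, image, then compute.
# The keyword patterns are illustrative assumptions, not official strings.
classify_fault() {
    case "$1" in
        *"No valid host"*|*NoValidHost*) echo "scheduler" ;;
        *[Nn]eutron*|*[Nn]etwork*)       echo "network" ;;
        *[Cc]inder*|*[Vv]olume*)         echo "volume" ;;
        *[Gg]lance*|*[Ii]mage*)          echo "image" ;;
        *)                               echo "compute" ;;
    esac
}
```

One way to use it is `classify_fault "$(openstack server show "$VM_ID" -f value -c fault)"`, then jump to the matching branch of the diagram.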
Advanced diagnostic tools
# === Rally - Benchmarking and diagnostics ===
# Install Rally
pip install rally-openstack
# Create a deployment
rally deployment create --fromenv --name openstack
# Run tests
rally task start /opt/rally/boot-and-delete.yaml
# === OSProfiler - Distributed tracing ===
# Enable in the services
# [profiler]
# enabled = true
# hmac_keys = secret
# connection_string = redis://localhost:6379
# Trace a request
openstack --os-profile secret server list
# Read the trace back from the same backend as connection_string above
osprofiler trace show --html <trace_id> --connection-string redis://localhost:6379
# === Tempest - Functional tests ===
pip install tempest
tempest init mytest
cd mytest
tempest run --regex 'tempest.api.compute.servers'
Logs to monitor
# Critical log files per service
keystone:
  files:
    - /var/log/kolla/keystone/keystone.log
    - /var/log/kolla/keystone/keystone-access.log
  patterns:
    - "ERROR"
    - "AuthorizationFailure"
    - "TokenNotFound"
nova:
  files:
    - /var/log/kolla/nova/nova-api.log
    - /var/log/kolla/nova/nova-scheduler.log
    - /var/log/kolla/nova/nova-compute.log
  patterns:
    - "NoValidHost"
    - "BuildAbortException"
    - "InstanceNotFound"
neutron:
  files:
    - /var/log/kolla/neutron/neutron-server.log
    - /var/log/kolla/neutron/neutron-openvswitch-agent.log
    - /var/log/kolla/neutron/neutron-l3-agent.log
  patterns:
    - "Agent is not reachable"
    - "Port not found"
    - "NetworkNotFound"
cinder:
  files:
    - /var/log/kolla/cinder/cinder-api.log
    - /var/log/kolla/cinder/cinder-volume.log
    - /var/log/kolla/cinder/cinder-scheduler.log
  patterns:
    - "VolumeNotFound"
    - "ImageNotAuthorized"
    - "NoValidBackend"
mariadb:
  files:
    - /var/log/kolla/mariadb/mariadb.log
  patterns:
    - "WSREP"
    - "deadlock"
    - "too many connections"
rabbitmq:
  files:
    - /var/log/kolla/rabbitmq/rabbit@*.log
  patterns:
    - "connection refused"
    - "queue overflow"
    - "network partition"
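The patterns listed above can be checked against a log file with plain `grep`. The helper below is a minimal sketch that counts fixed-string matches per pattern; the function name is hypothetical.

```bash
#!/bin/bash
# Count the lines of a log file that contain each pattern.
# Prints one "pattern<TAB>count" line per pattern, matching literally (-F).
count_patterns() {
    local file=$1
    shift
    local pattern
    for pattern in "$@"; do
        printf '%s\t%s\n' "$pattern" "$(grep -cF -- "$pattern" "$file")"
    done
}
```

For instance, `count_patterns /var/log/kolla/nova/nova-compute.log NoValidHost BuildAbortException InstanceNotFound` gives a quick view of which Nova error dominates.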
Practical examples
Support bundle collection script
#!/bin/bash
# collect-support-bundle.sh
BUNDLE_DIR="/tmp/support-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"
echo "Collecting support information..."
# 1. System info
echo "=== System ===" > "$BUNDLE_DIR/system.txt"
uname -a >> "$BUNDLE_DIR/system.txt"
cat /etc/os-release >> "$BUNDLE_DIR/system.txt"
free -h >> "$BUNDLE_DIR/system.txt"
df -h >> "$BUNDLE_DIR/system.txt"
uptime >> "$BUNDLE_DIR/system.txt"
# 2. OpenStack info
echo "=== OpenStack ===" > "$BUNDLE_DIR/openstack.txt"
openstack service list >> "$BUNDLE_DIR/openstack.txt" 2>&1
openstack compute service list >> "$BUNDLE_DIR/openstack.txt" 2>&1
openstack network agent list >> "$BUNDLE_DIR/openstack.txt" 2>&1
# 3. Container status
docker ps -a > "$BUNDLE_DIR/docker-ps.txt"
# 4. Logs (last 1000 lines of each file)
mkdir -p "$BUNDLE_DIR/logs"
for service in keystone nova neutron cinder; do
    if [ -d "/var/log/kolla/$service" ]; then
        tail -n 1000 "/var/log/kolla/$service"/*.log > "$BUNDLE_DIR/logs/$service.log" 2>/dev/null
    fi
done
# 5. Configs (sanitized)
mkdir -p "$BUNDLE_DIR/config"
cp /etc/kolla/globals.yml "$BUNDLE_DIR/config/"
sed -i 's/password.*/password: REDACTED/g' "$BUNDLE_DIR/config/globals.yml"
# 6. Package
tar -czvf "${BUNDLE_DIR}.tar.gz" -C /tmp "$(basename "$BUNDLE_DIR")"
rm -rf "$BUNDLE_DIR"
echo "Support bundle: ${BUNDLE_DIR}.tar.gz"
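The sanitizing step of the script can be made a little stricter. The filter below is a sketch, assuming lowercase YAML keys as used in Kolla configuration files: it redacts the value of any key whose name contains `password`, `secret`, or `token`, and leaves every other line untouched.

```bash
#!/bin/bash
# Redact the value of every "key: value" line whose key contains
# "password", "secret" or "token" (lowercase only). Reads stdin, writes stdout.
redact_secrets() {
    sed -E 's/^([[:space:]]*[A-Za-z0-9_]*(password|secret|token)[A-Za-z0-9_]*:).*/\1 REDACTED/'
}
```

It could replace the in-place `sed` call, for example with `redact_secrets < /etc/kolla/globals.yml > "$BUNDLE_DIR/config/globals.yml"`. Reviewing the result before sending a bundle outside the team remains advisable, since secrets can hide under other key names.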
Resources
Checkpoint
- Diagnostic scripts created
- Logs centralized and accessible
- Alerts on critical errors
- Procedures documented per problem type
- Support team trained
- Knowledge base populated
- Procedures tested regularly
- Escalation path defined