PRA/PCA - Disaster Recovery and Business Continuity Plan¶
Introduction¶
The PRA (Plan de Reprise d'Activité, disaster recovery plan) and the PCA (Plan de Continuité d'Activité, business continuity plan) define the procedures for maintaining and restoring OpenStack services after a major incident. These plans are essential for government infrastructures.
Prerequisites¶
- OpenStack HA configured
- Working backups
- Secondary site or backup cloud
- Up-to-date documentation
Key points¶
PRA/PCA architecture¶
```mermaid
graph TB
    subgraph SiteA["Primary Site - Datacenter A"]
        subgraph CtrlA["Control Plane"]
            api_a["OpenStack APIs<br/>Active"]
            db_a[(MariaDB Galera<br/>Master)]
        end
        subgraph ComputeA["Compute"]
            nova_a["Nova Compute<br/>100 VMs"]
        end
        subgraph StorageA["Storage"]
            ceph_a[(Ceph Cluster<br/>Primary)]
        end
    end
    subgraph SiteB["Secondary Site - Datacenter B"]
        subgraph CtrlB["Control Plane"]
            api_b["OpenStack APIs<br/>Standby"]
            db_b[(MariaDB<br/>Async replica)]
        end
        subgraph StorageB["Storage"]
            ceph_b[(Ceph Cluster<br/>RBD mirroring)]
        end
    end
    subgraph Cloud["Cloud Backup - AWS/OVH"]
        s3[(S3 Storage<br/>Backups)]
        dr["DR Resources<br/>Cold standby"]
    end
    db_a -->|MySQL async replication| db_b
    ceph_a -->|RBD mirroring| ceph_b
    ceph_a -->|Daily backup| s3
```
Service levels (SLA)¶
```mermaid
graph TB
    subgraph Tier1["Tier 1 - Critical 🔴<br/>RTO: 15 min | RPO: 0 (sync)"]
        t1_1["Keystone API"]
        t1_2["Nova API"]
        t1_3["MariaDB"]
    end
    subgraph Tier2["Tier 2 - Important 🟡<br/>RTO: 1 h | RPO: 15 min"]
        t2_1["Neutron"]
        t2_2["Cinder"]
        t2_3["Glance"]
    end
    subgraph Tier3["Tier 3 - Standard 🟢<br/>RTO: 4 h | RPO: 1 h"]
        t3_1["Heat"]
        t3_2["Horizon"]
        t3_3["Monitoring"]
    end
```
Legend: RTO = Recovery Time Objective | RPO = Recovery Point Objective
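These RPO targets only matter if they are checked continuously. A minimal sketch of such a check, with the tier-to-RPO mapping taken from the table above (the function names and the seconds-based encoding are illustrative):

```bash
#!/bin/bash
# rpo-check.sh - compare a data-protection age against a tier's RPO target.

# Tier-to-RPO mapping in seconds, mirroring the SLA tiers above.
rpo_for_tier() {
  case "$1" in
    1) echo 0 ;;       # Tier 1: RPO 0 (synchronous)
    2) echo 900 ;;     # Tier 2: RPO 15 min
    3) echo 3600 ;;    # Tier 3: RPO 1 h
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# rpo_ok TIER AGE_SECONDS -> prints a verdict, non-zero exit on violation.
rpo_ok() {
  local target age
  target=$(rpo_for_tier "$1") || return 2
  age="$2"
  if [ "$age" -le "$target" ]; then
    echo "Tier $1: OK (age ${age}s <= RPO ${target}s)"
  else
    echo "Tier $1: VIOLATION (age ${age}s > RPO ${target}s)"
    return 1
  fi
}
```

Fed with `$(( $(date +%s) - last_backup_epoch ))`, the same helper works for backup age or for replication lag.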
Risk matrix¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Server failure | Medium | Low | HA, automatic failover |
| Rack failure | Low | Medium | Anti-affinity, replication |
| Datacenter outage | Very low | Critical | DR site, geo-replication |
| Cyberattack | Medium | Critical | Segmentation, offline backups |
| Human error | High | Variable | RBAC, auditing, backups |
| Data corruption | Low | Critical | Checksums, snapshots |
MariaDB replication configuration¶
```ini
# /etc/mysql/conf.d/replication.cnf - Primary site
[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
expire_logs_days = 7
# GTID to simplify failover
gtid_mode = ON
enforce_gtid_consistency = ON
# Semi-synchronous replication
rpl_semi_sync_master_enabled = 1
rpl_semi_sync_master_timeout = 10000

# Secondary site
[mysqld]
server-id = 2
relay_log = relay-bin
log_slave_updates = ON
read_only = ON
gtid_mode = ON
enforce_gtid_consistency = ON
rpl_semi_sync_slave_enabled = 1
```

```sql
-- Replication setup, on the replica
CHANGE MASTER TO
  MASTER_HOST='db-primary.site-a.local',
  MASTER_USER='replication',
  MASTER_PASSWORD='secret',
  MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G
```
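Day to day, what matters on this setup is the replica's health and lag. A sketch of a check that parses `SHOW SLAVE STATUS\G`; the sample output is embedded for illustration — in production `STATUS` would come from `mysql -e 'SHOW SLAVE STATUS\G'` on the replica, and the 60 s threshold is an assumption to align with the tier's RPO:

```bash
#!/bin/bash
# check-replication.sh - parse SHOW SLAVE STATUS\G and flag unhealthy replication.
# Sample output for illustration; replace with:
#   STATUS=$(mysql -e 'SHOW SLAVE STATUS\G')
STATUS='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 12'

# Extract one "Field: value" pair from the \G output.
get_field() {
  echo "$STATUS" | awk -F': ' -v f="$1" '$1 ~ f {print $2}' | tr -d ' '
}

IO=$(get_field "Slave_IO_Running")
SQL=$(get_field "Slave_SQL_Running")
LAG=$(get_field "Seconds_Behind_Master")

if [ "$IO" != "Yes" ] || [ "$SQL" != "Yes" ]; then
  echo "CRITICAL: replication threads stopped (IO=$IO SQL=$SQL)"
elif [ "$LAG" -gt 60 ]; then
  echo "WARNING: replication lag ${LAG}s exceeds threshold"
else
  echo "OK: replication healthy, lag ${LAG}s"
fi
```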
Ceph RBD mirroring configuration¶
```bash
#!/bin/bash
# setup-rbd-mirroring.sh

# === On the primary cluster ===
# Enable the rbd application on the pool, then per-image mirroring
ceph osd pool application enable volumes rbd
rbd mirror pool enable volumes image
# Create the peer bootstrap token
rbd mirror pool peer bootstrap create --site-name site-a volumes > /tmp/bootstrap-token

# === On the secondary cluster ===
# Import the token
rbd mirror pool peer bootstrap import --site-name site-b --direction rx-only volumes < /tmp/bootstrap-token

# === Enable mirroring per image ===
# On the primary; pick ONE mode per image:
rbd mirror image enable volumes/volume-xxx snapshot   # snapshot-based (periodic)
# rbd mirror image enable volumes/volume-xxx journal  # journal-based, for an RPO close to 0

# Check the status
rbd mirror pool status volumes
rbd mirror image status volumes/volume-xxx
```
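The status commands above are also the basis for automated monitoring: on the secondary, a healthy image reports a state of `up+replaying`. A sketch of such a check, with sample output embedded for illustration (in production `STATUS` would come from `rbd mirror image status`):

```bash
#!/bin/bash
# check-rbd-mirror.sh - verify that a mirrored image is actively replaying.
# Sample output for illustration; replace with:
#   STATUS=$(rbd mirror image status volumes/<image>)
STATUS='volume-xxx:
  state:       up+replaying
  description: replaying'

# Pull the state field out of the status output.
STATE=$(echo "$STATUS" | awk '/state:/ {print $2}')

case "$STATE" in
  up+replaying) echo "OK: mirror state is $STATE" ;;
  *) echo "CRITICAL: unexpected mirror state '$STATE'" ;;
esac
```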
Automatic failover script¶
```bash
#!/bin/bash
# failover-site.sh
set -e

SITE_A="10.0.0.0/24"
SITE_B="10.1.0.0/24"
VIP="192.168.1.10"
CURRENT_SITE=""

# Detect the active site
check_site() {
    if ping -c 1 10.0.0.11 > /dev/null 2>&1; then
        CURRENT_SITE="A"
    elif ping -c 1 10.1.0.11 > /dev/null 2>&1; then
        CURRENT_SITE="B"
    else
        echo "ERROR: No site reachable"
        exit 1
    fi
    echo "Current active site: $CURRENT_SITE"
}

# Fail over to Site B
failover_to_b() {
    echo "=== Initiating failover to Site B ==="
    # 1. Promote the MariaDB replica
    echo "[1/5] Promoting MariaDB replica..."
    ssh db-replica.site-b "mysql -e 'STOP SLAVE; RESET SLAVE ALL;'"
    # 2. Promote the Ceph images
    echo "[2/5] Promoting Ceph images..."
    ssh ceph-admin.site-b "rbd mirror pool promote --force volumes"
    # 3. Update the configuration
    echo "[3/5] Updating configuration..."
    ssh controller.site-b "sed -i 's/site-a/site-b/g' /etc/kolla/globals.yml"
    # 4. Start the services
    echo "[4/5] Starting services..."
    ssh controller.site-b "kolla-ansible -i /etc/kolla/inventory deploy"
    # 5. Move the VIP
    echo "[5/5] Migrating VIP..."
    ssh lb.site-b "ip addr add $VIP/24 dev eth0"
    # DNS update (if applicable)
    # update_dns $VIP
    echo "=== Failover complete ==="
}

# Fail back to Site A
failback_to_a() {
    echo "=== Initiating failback to Site A ==="
    # 1. Sync data back to Site A
    echo "[1/5] Syncing data back to Site A..."
    # Resync MariaDB: Site A first becomes a replica of Site B
    ssh db-primary.site-a "mysql -e 'CHANGE MASTER TO MASTER_HOST=\"db.site-b\"...'"
    # Resync Ceph: demote the old primary so it catches up from Site B
    ssh ceph-admin.site-a "rbd mirror pool demote volumes"
    # Wait for the resync to complete
    sleep 300
    # Swap roles back: demote B first, then promote A
    ssh ceph-admin.site-b "rbd mirror pool demote volumes"
    ssh ceph-admin.site-a "rbd mirror pool promote volumes"
    # 2-5. Same steps as the failover, in the Site A direction
    echo "=== Failback complete ==="
}

# Main
case "$1" in
    check)
        check_site
        ;;
    failover)
        check_site
        if [ "$CURRENT_SITE" == "A" ]; then
            failover_to_b
        else
            echo "Already on Site B"
        fi
        ;;
    failback)
        check_site
        if [ "$CURRENT_SITE" == "B" ]; then
            failback_to_a
        else
            echo "Already on Site A"
        fi
        ;;
    *)
        echo "Usage: $0 {check|failover|failback}"
        exit 1
        ;;
esac
```
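Before a script like this promotes Site B, it is worth guarding against promoting a replica that was badly out of date, and against accidental invocation. A sketch of such a guard, assuming the last known replication lag is passed in; the `RPO_MAX` threshold and the confirmation keyword are illustrative:

```bash
#!/bin/bash
# failover-guard.sh - sanity checks before promoting the secondary site.
RPO_MAX=900   # seconds; Tier 2 RPO target from the SLA table

# guard LAST_KNOWN_LAG -> non-zero exit if failover should be refused.
guard() {
  local lag="$1"
  if [ "$lag" -gt "$RPO_MAX" ]; then
    echo "REFUSED: last known replication lag ${lag}s exceeds RPO ${RPO_MAX}s"
    return 1
  fi
  echo "OK: lag ${lag}s within RPO, failover may proceed"
}

# Interactive confirmation (skipped when FORCE=1), to keep a human in the loop.
confirm() {
  [ "$FORCE" = "1" ] && return 0
  read -r -p "Type FAILOVER to confirm: " ans
  [ "$ans" = "FAILOVER" ]
}
```

Promoting a lagging replica turns RPO loss into silent data divergence once the old primary comes back, which is why the guard refuses rather than warns.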
Detailed PRA procedure¶
```mermaid
flowchart TB
    subgraph Detection["Detection"]
        start([Start])
        alert["Monitoring alert"]
        eval["Impact assessment"]
        check1{Primary site reachable?}
        diag["Failure diagnosis"]
        check2{Repairable within RTO?}
        repair["Repair"]
        trigger["Trigger the PRA"]
        start --> alert --> eval --> check1
        check1 -->|yes| diag --> check2
        check1 -->|no| trigger
        check2 -->|yes| repair --> stop1([Stop])
        check2 -->|no| trigger
    end
    subgraph Activation["PRA activation"]
        notify["Notify teams"]
        crisis["Activate the crisis team"]
        comm["User communication"]
        trigger --> notify --> crisis --> comm
    end
    subgraph Failover["Failover"]
        db_promo["Promote DB replica"]
        ceph_promo["Promote Ceph images"]
        ctrl_start["Start control plane"]
        vip["Move VIP/DNS"]
        comm --> db_promo --> ceph_promo --> ctrl_start --> vip
    end
    subgraph Validation["Validation"]
        tests["Functional tests"]
        verify["Verify data"]
        check3{Services OK?}
        open["Open user access"]
        escalade["Escalate"]
        planb["Plan B (cloud backup)"]
        vip --> tests --> verify --> check3
        check3 -->|yes| open
        check3 -->|no| escalade --> planb
    end
    subgraph PostPRA["Post-PRA"]
        monitoring["Intensive monitoring"]
        rca["Root cause analysis"]
        failback["Plan the failback"]
        docs["Update documentation"]
        stop2([Stop])
        open --> monitoring --> rca --> failback --> docs --> stop2
        planb --> monitoring
    end
```
PRA runbook¶
````markdown
# PRA Runbook - Loss of Primary Site
## Trigger criteria
- Primary site unreachable for > 15 minutes
- Loss of > 50% of critical services
- Decision by the on-call manager
## Contacts
| Role | Name | Phone |
|------|------|-------|
| Infrastructure lead | Jean Dupont | +33 6 XX XX XX XX |
| DBA | Marie Martin | +33 6 XX XX XX XX |
| Security | Paul Bernard | +33 6 XX XX XX XX |
## Procedure
### Phase 1: Activation (0-15 min)
1. Confirm the incident
```bash
./check-site-status.sh
```
2. Notify the team (PagerDuty/Slack)
3. Open an incident ticket
### Phase 2: Failover (15-45 min)
1. Access the DR site
```bash
ssh admin@dr-controller.site-b
```
2. Run the failover
```bash
sudo /opt/scripts/failover-site.sh failover
```
3. Check the services
```bash
openstack service list
openstack compute service list
```
### Phase 3: Validation (45-60 min)
1. API tests
```bash
openstack token issue
openstack server list --all-projects
```
2. Tests with pilot users
3. Open access
### Phase 4: Communication
- Email to users
- Update the status page
- Incident report
````
PRA/PCA testing¶
```bash
#!/bin/bash
# test-pra.sh - Annual PRA test
LOG_FILE="/var/log/pra-test-$(date +%Y%m%d).log"
exec > >(tee -a "$LOG_FILE") 2>&1
echo "=== PRA Test - $(date) ==="

# 1. Database failover test
echo -e "\n[1] Test DB Failover"
# Simulate a primary outage
ssh db-primary "systemctl stop mariadb"
sleep 30
# Check that the replica can be promoted
ssh db-replica "mysql -e 'SHOW MASTER STATUS'"
# Restore
ssh db-primary "systemctl start mariadb"

# 2. Ceph failover test
echo -e "\n[2] Test Ceph Mirroring"
# Create a test volume
openstack volume create --size 1 test-pra
VOLUME_ID=$(openstack volume show test-pra -f value -c id)
# Write data to it
openstack server create --volume "$VOLUME_ID" --flavor m1.tiny --image cirros test-pra-vm --wait
# Write test file...
# Check replication
ssh ceph-dr "rbd mirror image status volumes/volume-$VOLUME_ID"
# Cleanup
openstack server delete test-pra-vm --wait
openstack volume delete test-pra

# 3. Full failover test (test environment only)
echo -e "\n[3] Test Full Failover (staging only)"
if [ "$ENVIRONMENT" == "staging" ]; then
    ./failover-site.sh failover
    sleep 60
    ./validate-services.sh
    ./failover-site.sh failback
fi

# 4. Backup restoration test
echo -e "\n[4] Test Backup Restoration"
# In an isolated environment
docker run -d --name test-restore -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 mariadb:10.6
sleep 30   # give the container time to initialize
LATEST=$(ls -t /backup/mariadb/*.sql.gz | head -1)
gunzip -c "$LATEST" | docker exec -i test-restore mysql -u root
docker exec test-restore mysql -e "SELECT COUNT(*) FROM nova.instances"
docker rm -f test-restore

echo -e "\n=== PRA Test Complete ==="
```
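One number the annual test must produce is the measured RTO, to compare against the SLA targets. A sketch of that computation from two log timestamps (the timestamps here are illustrative; GNU `date` is assumed):

```bash
#!/bin/bash
# measure-rto.sh - derive the measured RTO from two timestamps in a PRA test log.
# Sample timestamps for illustration; in a real run they come from the
# trigger and service-validation lines of /var/log/pra-test-*.log.
T_TRIGGER="2024-06-01 10:00:00"
T_RESTORED="2024-06-01 10:38:30"

# GNU date converts both to epoch seconds; the difference is the measured RTO.
rto_seconds=$(( $(date -d "$T_RESTORED" +%s) - $(date -d "$T_TRIGGER" +%s) ))
printf 'Measured RTO: %d min %d s\n' $((rto_seconds / 60)) $((rto_seconds % 60))

# Compare against the Tier 2 target (1 hour) from the SLA table.
if [ "$rto_seconds" -le 3600 ]; then
  echo "RTO target met"
else
  echo "RTO target MISSED"
fi
```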
Grafana PRA dashboard¶
```json
{
  "title": "PRA Status Dashboard",
  "panels": [
    {
      "title": "Replication Lag",
      "type": "gauge",
      "targets": [
        {"expr": "mysql_slave_status_seconds_behind_master"}
      ],
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 60, "color": "yellow"},
          {"value": 300, "color": "red"}
        ]
      }
    },
    {
      "title": "Ceph Mirror Status",
      "type": "stat",
      "targets": [
        {"expr": "ceph_rbd_mirror_image_state"}
      ]
    },
    {
      "title": "Site Status",
      "type": "table",
      "targets": [
        {"expr": "up{job=~\"site-.*\"}"}
      ]
    },
    {
      "title": "RTO/RPO Compliance",
      "type": "timeseries",
      "targets": [
        {"expr": "backup_age_seconds"},
        {"expr": "replication_lag_seconds"}
      ]
    }
  ]
}
```
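A dashboard shows state but does not page anyone; the same metrics can drive alerting. A sketch of Prometheus alerting rules matching the dashboard queries above — the rule names, `for` durations, and the 900 s threshold (the Tier 2 RPO) are illustrative:

```yaml
# pra-alerts.yml - illustrative Prometheus alerting rules
groups:
  - name: pra
    rules:
      - alert: MariaDBReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MariaDB replication lag exceeds the 15 min RPO target"
      - alert: SiteDown
        expr: up{job=~"site-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "A site exporter is unreachable - evaluate the PRA trigger criteria"
```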
Practical examples¶
Annual PRA test checklist¶
```markdown
## Annual PRA Test - Checklist
### Preparation
- [ ] Define the test scope
- [ ] Notify the teams
- [ ] Prepare the test environment
- [ ] Document the initial state
### Execution
- [ ] DB failover test
- [ ] Ceph failover test
- [ ] Service failover test
- [ ] Backup restoration test
- [ ] Crisis communication test
### Metrics
- [ ] Detection time: ___ min
- [ ] Decision time: ___ min
- [ ] Failover time: ___ min
- [ ] Total time (actual RTO): ___ min
- [ ] Data loss (actual RPO): ___ min
### Post-test
- [ ] Test report
- [ ] Corrective actions
- [ ] Update procedures
- [ ] Team training
```
Resources¶
Checkpoint¶
- DR architecture documented
- MariaDB replication configured
- Ceph RBD mirroring active
- Failover scripts tested
- PRA runbook written
- PRA tests scheduled (annual)
- Teams trained
- SLA/RTO/RPO defined and validated