PRA/PCA - Disaster Recovery and Business Continuity Plan¶
Introduction¶
The PRA (Plan de Reprise d'Activité, disaster recovery plan) and the PCA (Plan de Continuité d'Activité, business continuity plan) define the procedures for maintaining and restoring OpenStack services after a major incident. These plans are essential for government infrastructures.
Prerequisites¶
- OpenStack HA configured
- Working backups
- Secondary site or backup cloud
- Up-to-date documentation
Key points¶
PRA/PCA architecture¶
```mermaid
graph TB
    subgraph SiteA["Primary Site - Datacenter A"]
        subgraph CtrlA["Control Plane"]
            api_a["OpenStack APIs<br/>Active"]
            db_a[(MariaDB Galera<br/>Master)]
        end
        subgraph ComputeA["Compute"]
            nova_a["Nova Compute<br/>100 VMs"]
        end
        subgraph StorageA["Storage"]
            ceph_a[(Ceph Cluster<br/>Primary)]
        end
    end
    subgraph SiteB["Secondary Site - Datacenter B"]
        subgraph CtrlB["Control Plane"]
            api_b["OpenStack APIs<br/>Standby"]
            db_b[(MariaDB<br/>Async replica)]
        end
        subgraph StorageB["Storage"]
            ceph_b[(Ceph Cluster<br/>RBD mirroring)]
        end
    end
    subgraph Cloud["Cloud Backup - AWS/OVH"]
        s3[(S3 Storage<br/>Backups)]
        dr["DR Resources<br/>Cold standby"]
    end
    db_a -->|MySQL async replication| db_b
    ceph_a -->|RBD mirroring| ceph_b
    ceph_a -->|Daily backup| s3
```
Service levels (SLA)¶
```mermaid
graph TB
    subgraph Tier1["Tier 1 - Critical 🔴<br/>RTO: 15 min | RPO: 0 (sync)"]
        t1_1["Keystone API"]
        t1_2["Nova API"]
        t1_3["MariaDB"]
    end
    subgraph Tier2["Tier 2 - Important 🟡<br/>RTO: 1 h | RPO: 15 min"]
        t2_1["Neutron"]
        t2_2["Cinder"]
        t2_3["Glance"]
    end
    subgraph Tier3["Tier 3 - Standard 🟢<br/>RTO: 4 h | RPO: 1 h"]
        t3_1["Heat"]
        t3_2["Horizon"]
        t3_3["Monitoring"]
    end
```
Legend: RTO = Recovery Time Objective | RPO = Recovery Point Objective
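These RPO targets only matter if they are checked continuously. A minimal sketch of such a check, with the tier-to-RPO mapping taken from the table above (the function names and the seconds-based encoding are illustrative):

```bash
#!/bin/bash
# rpo-check.sh - compare a data-protection age against a tier's RPO target.

# Tier-to-RPO mapping in seconds, mirroring the SLA tiers above.
rpo_for_tier() {
  case "$1" in
    1) echo 0 ;;       # Tier 1: RPO 0 (synchronous)
    2) echo 900 ;;     # Tier 2: RPO 15 min
    3) echo 3600 ;;    # Tier 3: RPO 1 h
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# rpo_ok TIER AGE_SECONDS -> prints a verdict, non-zero exit on violation.
rpo_ok() {
  local target age
  target=$(rpo_for_tier "$1") || return 2
  age="$2"
  if [ "$age" -le "$target" ]; then
    echo "Tier $1: OK (age ${age}s <= RPO ${target}s)"
  else
    echo "Tier $1: VIOLATION (age ${age}s > RPO ${target}s)"
    return 1
  fi
}
```

Fed with `$(( $(date +%s) - last_backup_epoch ))`, the same helper works for backup age or for replication lag.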
Risk matrix¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Server failure | Medium | Low | HA, automatic failover |
| Rack failure | Low | Medium | Anti-affinity, replication |
| Datacenter outage | Very low | Critical | DR site, geo-replication |
| Cyberattack | Medium | Critical | Segmentation, offline backups |
| Human error | High | Variable | RBAC, auditing, backups |
| Data corruption | Low | Critical | Checksums, snapshots |
MariaDB replication configuration¶
```ini
# /etc/mysql/conf.d/replication.cnf - Primary site
[mysqld]
server-id = 1
log_bin = mysql-bin
binlog_format = ROW
expire_logs_days = 7
# GTID to simplify failover
gtid_mode = ON
enforce_gtid_consistency = ON
# Semi-synchronous replication
rpl_semi_sync_master_enabled = 1
rpl_semi_sync_master_timeout = 10000

# Secondary site
[mysqld]
server-id = 2
relay_log = relay-bin
log_slave_updates = ON
read_only = ON
gtid_mode = ON
enforce_gtid_consistency = ON
rpl_semi_sync_slave_enabled = 1
```

```sql
-- Replication setup, on the replica
CHANGE MASTER TO
  MASTER_HOST='db-primary.site-a.local',
  MASTER_USER='replication',
  MASTER_PASSWORD='secret',
  MASTER_AUTO_POSITION=1;
START SLAVE;
SHOW SLAVE STATUS\G
```
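Day to day, what matters on this setup is the replica's health and lag. A sketch of a check that parses `SHOW SLAVE STATUS\G`; the sample output is embedded for illustration — in production `STATUS` would come from `mysql -e 'SHOW SLAVE STATUS\G'` on the replica, and the 60 s threshold is an assumption to align with the tier's RPO:

```bash
#!/bin/bash
# check-replication.sh - parse SHOW SLAVE STATUS\G and flag unhealthy replication.
# Sample output for illustration; replace with:
#   STATUS=$(mysql -e 'SHOW SLAVE STATUS\G')
STATUS='             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 12'

# Extract one "Field: value" pair from the \G output.
get_field() {
  echo "$STATUS" | awk -F': ' -v f="$1" '$1 ~ f {print $2}' | tr -d ' '
}

IO=$(get_field "Slave_IO_Running")
SQL=$(get_field "Slave_SQL_Running")
LAG=$(get_field "Seconds_Behind_Master")

if [ "$IO" != "Yes" ] || [ "$SQL" != "Yes" ]; then
  echo "CRITICAL: replication threads stopped (IO=$IO SQL=$SQL)"
elif [ "$LAG" -gt 60 ]; then
  echo "WARNING: replication lag ${LAG}s exceeds threshold"
else
  echo "OK: replication healthy, lag ${LAG}s"
fi
```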
Ceph RBD mirroring configuration¶
```bash
#!/bin/bash
# setup-rbd-mirroring.sh

# === On the primary cluster ===
# Enable the rbd application on the pool, then per-image mirroring
ceph osd pool application enable volumes rbd
rbd mirror pool enable volumes image
# Create the peer bootstrap token
rbd mirror pool peer bootstrap create --site-name site-a volumes > /tmp/bootstrap-token

# === On the secondary cluster ===
# Import the token
rbd mirror pool peer bootstrap import --site-name site-b --direction rx-only volumes < /tmp/bootstrap-token

# === Enable mirroring per image ===
# On the primary; pick ONE mode per image:
rbd mirror image enable volumes/volume-xxx snapshot   # snapshot-based (periodic)
# rbd mirror image enable volumes/volume-xxx journal  # journal-based, for an RPO close to 0

# Check the status
rbd mirror pool status volumes
rbd mirror image status volumes/volume-xxx
```
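The status commands above are also the basis for automated monitoring: on the secondary, a healthy image reports a state of `up+replaying`. A sketch of such a check, with sample output embedded for illustration (in production `STATUS` would come from `rbd mirror image status`):

```bash
#!/bin/bash
# check-rbd-mirror.sh - verify that a mirrored image is actively replaying.
# Sample output for illustration; replace with:
#   STATUS=$(rbd mirror image status volumes/<image>)
STATUS='volume-xxx:
  state:       up+replaying
  description: replaying'

# Pull the state field out of the status output.
STATE=$(echo "$STATUS" | awk '/state:/ {print $2}')

case "$STATE" in
  up+replaying) echo "OK: mirror state is $STATE" ;;
  *) echo "CRITICAL: unexpected mirror state '$STATE'" ;;
esac
```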
Automatic failover script¶
```bash
#!/bin/bash
# failover-site.sh
set -e

SITE_A="10.0.0.0/24"
SITE_B="10.1.0.0/24"
VIP="192.168.1.10"
CURRENT_SITE=""

# Detect the active site
check_site() {
    if ping -c 1 10.0.0.11 > /dev/null 2>&1; then
        CURRENT_SITE="A"
    elif ping -c 1 10.1.0.11 > /dev/null 2>&1; then
        CURRENT_SITE="B"
    else
        echo "ERROR: No site reachable"
        exit 1
    fi
    echo "Current active site: $CURRENT_SITE"
}

# Fail over to Site B
failover_to_b() {
    echo "=== Initiating failover to Site B ==="
    # 1. Promote the MariaDB replica
    echo "[1/5] Promoting MariaDB replica..."
    ssh db-replica.site-b "mysql -e 'STOP SLAVE; RESET SLAVE ALL;'"
    # 2. Promote the Ceph images
    echo "[2/5] Promoting Ceph images..."
    ssh ceph-admin.site-b "rbd mirror pool promote --force volumes"
    # 3. Update the configuration
    echo "[3/5] Updating configuration..."
    ssh controller.site-b "sed -i 's/site-a/site-b/g' /etc/kolla/globals.yml"
    # 4. Start the services
    echo "[4/5] Starting services..."
    ssh controller.site-b "kolla-ansible -i /etc/kolla/inventory deploy"
    # 5. Move the VIP
    echo "[5/5] Migrating VIP..."
    ssh lb.site-b "ip addr add $VIP/24 dev eth0"
    # DNS update (if applicable)
    # update_dns $VIP
    echo "=== Failover complete ==="
}

# Fail back to Site A
failback_to_a() {
    echo "=== Initiating failback to Site A ==="
    # 1. Sync data back to Site A
    echo "[1/5] Syncing data back to Site A..."
    # Resync MariaDB: Site A first becomes a replica of Site B
    ssh db-primary.site-a "mysql -e 'CHANGE MASTER TO MASTER_HOST=\"db.site-b\"...'"
    # Resync Ceph: demote the old primary so it catches up from Site B
    ssh ceph-admin.site-a "rbd mirror pool demote volumes"
    # Wait for the resync to complete
    sleep 300
    # Swap roles back: demote B first, then promote A
    ssh ceph-admin.site-b "rbd mirror pool demote volumes"
    ssh ceph-admin.site-a "rbd mirror pool promote volumes"
    # 2-5. Same steps as the failover, in the Site A direction
    echo "=== Failback complete ==="
}

# Main
case "$1" in
    check)
        check_site
        ;;
    failover)
        check_site
        if [ "$CURRENT_SITE" == "A" ]; then
            failover_to_b
        else
            echo "Already on Site B"
        fi
        ;;
    failback)
        check_site
        if [ "$CURRENT_SITE" == "B" ]; then
            failback_to_a
        else
            echo "Already on Site A"
        fi
        ;;
    *)
        echo "Usage: $0 {check|failover|failback}"
        exit 1
        ;;
esac
```
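Before a script like this promotes Site B, it is worth guarding against promoting a replica that was badly out of date, and against accidental invocation. A sketch of such a guard, assuming the last known replication lag is passed in; the `RPO_MAX` threshold and the confirmation keyword are illustrative:

```bash
#!/bin/bash
# failover-guard.sh - sanity checks before promoting the secondary site.
RPO_MAX=900   # seconds; Tier 2 RPO target from the SLA table

# guard LAST_KNOWN_LAG -> non-zero exit if failover should be refused.
guard() {
  local lag="$1"
  if [ "$lag" -gt "$RPO_MAX" ]; then
    echo "REFUSED: last known replication lag ${lag}s exceeds RPO ${RPO_MAX}s"
    return 1
  fi
  echo "OK: lag ${lag}s within RPO, failover may proceed"
}

# Interactive confirmation (skipped when FORCE=1), to keep a human in the loop.
confirm() {
  [ "$FORCE" = "1" ] && return 0
  read -r -p "Type FAILOVER to confirm: " ans
  [ "$ans" = "FAILOVER" ]
}
```

Promoting a lagging replica turns RPO loss into silent data divergence once the old primary comes back, which is why the guard refuses rather than warns.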
Detailed PRA procedure¶
```mermaid
flowchart TB
    subgraph Detection["Detection"]
        start([Start])
        alert["Monitoring alert"]
        eval["Impact assessment"]
        check1{Primary site reachable?}
        diag["Failure diagnosis"]
        check2{Repairable within RTO?}
        repair["Repair"]
        trigger["Trigger the PRA"]
        start --> alert --> eval --> check1
        check1 -->|yes| diag --> check2
        check1 -->|no| trigger
        check2 -->|yes| repair --> stop1([Stop])
        check2 -->|no| trigger
    end
    subgraph Activation["PRA activation"]
        notify["Notify teams"]
        crisis["Activate the crisis team"]
        comm["User communication"]
        trigger --> notify --> crisis --> comm
    end
    subgraph Failover["Failover"]
        db_promo["Promote DB replica"]
        ceph_promo["Promote Ceph images"]
        ctrl_start["Start control plane"]
        vip["Move VIP/DNS"]
        comm --> db_promo --> ceph_promo --> ctrl_start --> vip
    end
    subgraph Validation["Validation"]
        tests["Functional tests"]
        verify["Verify data"]
        check3{Services OK?}
        open["Open user access"]
        escalade["Escalate"]
        planb["Plan B (cloud backup)"]
        vip --> tests --> verify --> check3
        check3 -->|yes| open
        check3 -->|no| escalade --> planb
    end
    subgraph PostPRA["Post-PRA"]
        monitoring["Intensive monitoring"]
        rca["Root cause analysis"]
        failback["Plan the failback"]
        docs["Update documentation"]
        stop2([Stop])
        open --> monitoring --> rca --> failback --> docs --> stop2
        planb --> monitoring
    end
```
PRA runbook¶
````markdown
# PRA Runbook - Loss of Primary Site
## Trigger criteria
- Primary site unreachable for > 15 minutes
- Loss of > 50% of critical services
- Decision by the on-call manager
## Contacts
| Role | Name | Phone |
|------|------|-------|
| Infrastructure lead | Jean Dupont | +33 6 XX XX XX XX |
| DBA | Marie Martin | +33 6 XX XX XX XX |
| Security | Paul Bernard | +33 6 XX XX XX XX |
## Procedure
### Phase 1: Activation (0-15 min)
1. Confirm the incident
```bash
./check-site-status.sh
```
2. Notify the team (PagerDuty/Slack)
3. Open an incident ticket
### Phase 2: Failover (15-45 min)
1. Access the DR site
```bash
ssh admin@dr-controller.site-b
```
2. Run the failover
```bash
sudo /opt/scripts/failover-site.sh failover
```
3. Check the services
```bash
openstack service list
openstack compute service list
```
### Phase 3: Validation (45-60 min)
1. API tests
```bash
openstack token issue
openstack server list --all-projects
```
2. Tests with pilot users
3. Open access
### Phase 4: Communication
- Email to users
- Update the status page
- Incident report
````
PRA/PCA testing¶
```bash
#!/bin/bash
# test-pra.sh - Annual PRA test
LOG_FILE="/var/log/pra-test-$(date +%Y%m%d).log"
exec > >(tee -a "$LOG_FILE") 2>&1
echo "=== PRA Test - $(date) ==="

# 1. Database failover test
echo -e "\n[1] Test DB Failover"
# Simulate a primary outage
ssh db-primary "systemctl stop mariadb"
sleep 30
# Check that the replica can be promoted
ssh db-replica "mysql -e 'SHOW MASTER STATUS'"
# Restore
ssh db-primary "systemctl start mariadb"

# 2. Ceph failover test
echo -e "\n[2] Test Ceph Mirroring"
# Create a test volume
openstack volume create --size 1 test-pra
VOLUME_ID=$(openstack volume show test-pra -f value -c id)
# Write data to it
openstack server create --volume "$VOLUME_ID" --flavor m1.tiny --image cirros test-pra-vm --wait
# Write test file...
# Check replication
ssh ceph-dr "rbd mirror image status volumes/volume-$VOLUME_ID"
# Cleanup
openstack server delete test-pra-vm --wait
openstack volume delete test-pra

# 3. Full failover test (test environment only)
echo -e "\n[3] Test Full Failover (staging only)"
if [ "$ENVIRONMENT" == "staging" ]; then
    ./failover-site.sh failover
    sleep 60
    ./validate-services.sh
    ./failover-site.sh failback
fi

# 4. Backup restoration test
echo -e "\n[4] Test Backup Restoration"
# In an isolated environment
docker run -d --name test-restore -e MARIADB_ALLOW_EMPTY_ROOT_PASSWORD=1 mariadb:10.6
sleep 30   # give the container time to initialize
LATEST=$(ls -t /backup/mariadb/*.sql.gz | head -1)
gunzip -c "$LATEST" | docker exec -i test-restore mysql -u root
docker exec test-restore mysql -e "SELECT COUNT(*) FROM nova.instances"
docker rm -f test-restore

echo -e "\n=== PRA Test Complete ==="
```
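One number the annual test must produce is the measured RTO, to compare against the SLA targets. A sketch of that computation from two log timestamps (the timestamps here are illustrative; GNU `date` is assumed):

```bash
#!/bin/bash
# measure-rto.sh - derive the measured RTO from two timestamps in a PRA test log.
# Sample timestamps for illustration; in a real run they come from the
# trigger and service-validation lines of /var/log/pra-test-*.log.
T_TRIGGER="2024-06-01 10:00:00"
T_RESTORED="2024-06-01 10:38:30"

# GNU date converts both to epoch seconds; the difference is the measured RTO.
rto_seconds=$(( $(date -d "$T_RESTORED" +%s) - $(date -d "$T_TRIGGER" +%s) ))
printf 'Measured RTO: %d min %d s\n' $((rto_seconds / 60)) $((rto_seconds % 60))

# Compare against the Tier 2 target (1 hour) from the SLA table.
if [ "$rto_seconds" -le 3600 ]; then
  echo "RTO target met"
else
  echo "RTO target MISSED"
fi
```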
Grafana PRA dashboard¶
```json
{
  "title": "PRA Status Dashboard",
  "panels": [
    {
      "title": "Replication Lag",
      "type": "gauge",
      "targets": [
        {"expr": "mysql_slave_status_seconds_behind_master"}
      ],
      "thresholds": {
        "steps": [
          {"value": 0, "color": "green"},
          {"value": 60, "color": "yellow"},
          {"value": 300, "color": "red"}
        ]
      }
    },
    {
      "title": "Ceph Mirror Status",
      "type": "stat",
      "targets": [
        {"expr": "ceph_rbd_mirror_image_state"}
      ]
    },
    {
      "title": "Site Status",
      "type": "table",
      "targets": [
        {"expr": "up{job=~\"site-.*\"}"}
      ]
    },
    {
      "title": "RTO/RPO Compliance",
      "type": "timeseries",
      "targets": [
        {"expr": "backup_age_seconds"},
        {"expr": "replication_lag_seconds"}
      ]
    }
  ]
}
```
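A dashboard shows state but does not page anyone; the same metrics can drive alerting. A sketch of Prometheus alerting rules matching the dashboard queries above — the rule names, `for` durations, and the 900 s threshold (the Tier 2 RPO) are illustrative:

```yaml
# pra-alerts.yml - illustrative Prometheus alerting rules
groups:
  - name: pra
    rules:
      - alert: MariaDBReplicationLagHigh
        expr: mysql_slave_status_seconds_behind_master > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "MariaDB replication lag exceeds the 15 min RPO target"
      - alert: SiteDown
        expr: up{job=~"site-.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "A site exporter is unreachable - evaluate the PRA trigger criteria"
```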
Practical examples¶
Annual PRA test checklist¶
```markdown
## Annual PRA Test - Checklist
### Preparation
- [ ] Define the test scope
- [ ] Notify the teams
- [ ] Prepare the test environment
- [ ] Document the initial state
### Execution
- [ ] DB failover test
- [ ] Ceph failover test
- [ ] Service failover test
- [ ] Backup restoration test
- [ ] Crisis communication test
### Metrics
- [ ] Detection time: ___ min
- [ ] Decision time: ___ min
- [ ] Failover time: ___ min
- [ ] Total time (actual RTO): ___ min
- [ ] Data loss (actual RPO): ___ min
### Post-test
- [ ] Test report
- [ ] Corrective actions
- [ ] Update procedures
- [ ] Team training
```
Resources¶
Checkpoint¶
- DR architecture documented
- MariaDB replication configured
- Ceph RBD mirroring active
- Failover scripts tested
- PRA runbook written
- PRA tests scheduled (annual)
- Teams trained
- SLA/RTO/RPO defined and validated