It happened. What now?
It is a nightmare for every cloud provider. Even if you take every preventive measure, it can still happen. It is called a disaster. You serve a lot of customers, and you have to protect their interests, because they are service providers to their own customers too. They are answerable to their customers and stakeholders. You have to take specific actions when your cloud is down. Not every engineer needs to investigate the root cause. Follow the best practices below when you find yourself in this kind of situation.
Get into the battlefield
Well, get your cavalry onto the battlefield with Special Weapons And Tactics (SWAT). Split the team into separate operating cells. One team should check the monitoring systems to find the reason for the alerts. Another team should log in to the particular system that the monitoring system points to.
Monitoring Cell
The Monitoring team should dig deep into the monitoring tools to find the exact reason for the failure. Its only job is to analyse and find the reason as soon as possible. Once you find a probable cause, report it to the Engineering cell immediately. Possible reasons include the following.
- You could be hit by a DDoS attack.
- You could have lost the BGP session with your neighbour routers.
- Your control plane rack could have burned out.
- The message broker could be malfunctioning.
- The database cluster could be down.
- The API server could be flooded with requests.
- There could be a datacenter outage.
- The fabric could be down.
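The list above could be kept as a small lookup table, so that the Monitoring cell can hand a probable cause to the Engineering cell quickly. This is only a minimal sketch in Python: the alert names (`bgp_session_down` and so on) are hypothetical and would need to be replaced with whatever your monitoring tool actually emits.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical alert names mapped to the probable causes listed above.
# These keys are assumptions; use the names your monitoring tool emits.
PROBABLE_CAUSES = {
    "inbound_traffic_spike": "DDoS attack",
    "bgp_session_down": "Lost BGP session with a neighbour router",
    "control_plane_unreachable": "Control plane rack failure",
    "broker_queue_stalled": "Message broker malfunction",
    "db_cluster_unhealthy": "Database cluster down",
    "api_request_flood": "Flood on the API server",
    "site_power_or_cooling": "Datacenter outage",
    "fabric_links_down": "Fabric down",
}

UNKNOWN = "Unknown, needs manual investigation"


@dataclass(frozen=True)
class Finding:
    alert: str
    probable_cause: str


def triage(active_alerts: Iterable[str]) -> List[Finding]:
    """Translate active alerts into probable causes for the Engineering cell."""
    return [Finding(a, PROBABLE_CAUSES.get(a, UNKNOWN)) for a in active_alerts]


if __name__ == "__main__":
    for finding in triage(["bgp_session_down", "disk_latency_high"]):
        print(f"{finding.alert}: {finding.probable_cause}")
```

An alert that is not in the table is reported as unknown rather than guessed at, so the Engineering cell knows it still needs a human to look at it.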
Engineering Cell
The Engineering cell logs in to the environment, if access is still available. If the faulty component has already been identified, start examining it for the cause. If not, check with the Monitoring cell and, in parallel, build your own path. Do not spend a lot of time trying to fix it yourself. You can ask your vendor for L3 support to bring the system back. Your desk and mobile phones should be in Airplane mode.
- Check your core services.
- Check your database functions.
- Check the number of API calls.
- Check the message broker cluster status.
- Check the bond status for your compute cluster.
- Check the traffic on all the VLANs.
- Check the accessibility of your shared volumes, and so on through any number of troubleshooting steps.
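A checklist like this can be run as a set of small probes so that one broken check does not stop the others. The sketch below is a hedged illustration only: `run_checks` and the placeholder probes are hypothetical names, and each lambda stands in for a real check against your own environment.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every check and record OK, FAILED, or the error it raised."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "OK" if check() else "FAILED"
        except Exception as exc:  # a broken check must not stop the rest
            results[name] = f"ERROR: {exc}"
    return results


def unreachable_broker() -> bool:
    # Placeholder probe that simulates a check which cannot even run.
    raise ConnectionError("broker not reachable")


if __name__ == "__main__":
    # Placeholder probes; replace them with real checks for your environment.
    checks = {
        "core services": lambda: True,
        "database": lambda: False,
        "message broker cluster": unreachable_broker,
    }
    for name, status in run_checks(checks).items():
        print(f"{name}: {status}")
```

Keeping the three outcomes apart matters during an outage: a check that fails and a check that cannot run at all usually point to different problems.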
Customer Success Cell
Disaster happens. Inform your customers by email, through your status page, and by phone calls to your VIP customers. Explain the situation in a polite and gentle way. Apologise for the inconvenience. Provide an ETA if you already have one. Never provide false information. Understand that they are technical people like you. False information may damage your credibility later.
- Inform your customers proactively, before they start complaining.
- Provide frequent updates about the progress.
- Provide an ETA once you are clear about it.
- Be transparent.
- Update your social media channels.
- Customers will be angry, but you should always stay calm.
- Your survival depends on them.
- Get frequent updates from the Engineering cell without disturbing them.
- Provide feedback to the Administrative cell at frequent intervals.
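The rule about ETAs can be built into the message template itself, so that an update never contains an ETA that nobody has confirmed. This is a minimal sketch under assumptions: the function name `build_status_update`, its fields, and the wording of the message are all hypothetical and should be adapted to your own status page.

```python
from datetime import datetime, timezone
from typing import Optional


def build_status_update(
    summary: str,
    impact: str,
    eta: Optional[datetime] = None,
    now: Optional[datetime] = None,
) -> str:
    """Compose a customer-facing update; an ETA is included only if known."""
    now = now or datetime.now(timezone.utc)
    lines = [
        f"Status update ({now:%Y-%m-%d %H:%M} UTC)",
        f"What is happening: {summary}",
        f"Impact: {impact}",
    ]
    if eta is not None:
        lines.append(f"Estimated time of recovery: {eta:%Y-%m-%d %H:%M} UTC")
    else:
        # Say so openly instead of inventing an ETA.
        lines.append("We do not have an ETA yet. We will share one as soon as we do.")
    lines.append("We apologise for the inconvenience.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_status_update("API requests are failing", "All customers"))
```

The same text can then be reused for email, the status page, and social channels, which keeps the message consistent across all of them.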
Administrative Cell
Your management is also worried about the incident and will keep calling for updates. The Administrative cell should make sure that the top executives are aware of the issue and that they contact only the Administrative cell. They should not be allowed to contact any other cell.
- Be transparent with them.
- Never provide an ETA if you don't have one.
- Never provide false information; it may be passed on to their friends among the customers.
- Prepare an interim incident report for the stakeholders.
- Open a Severity 1 ticket with the vendor. Call their top officials and ask for help.
You brought it back
The environment is back. What should you do now?
- Update the status page.
- Inform your customers through available channels.
- Advise them to check their workloads and update you.
- Provide a brief summary of the issue.
- Once customers confirm that their services are working, ask them to close the incident ticket.
- Open a service request for the RCA.
- Follow the SLA for compensation if there is an SLA breach.
- Try to provide a virtual credit, if management agrees.
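Whether the SLA was breached comes down to simple arithmetic on the downtime. The sketch below shows one common way to compute it, assuming availability is measured per billing period in minutes. The 99.9% target and the 90 minutes of downtime are made-up example values; the real target and the compensation rules come from your own SLA.

```python
def monthly_availability(total_minutes: int, downtime_minutes: float) -> float:
    """Return availability for the period as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be between 0 and total_minutes")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def is_sla_breached(availability_pct: float, target_pct: float) -> bool:
    """An SLA is breached when measured availability falls below the target."""
    return availability_pct < target_pct


if __name__ == "__main__":
    minutes_in_30_days = 30 * 24 * 60  # 43,200 minutes
    availability = monthly_availability(minutes_in_30_days, downtime_minutes=90)
    # The 99.9% target is an assumed example, not taken from any real SLA.
    print(f"Availability: {availability:.3f}%")
    print("SLA breached:", is_sla_breached(availability, target_pct=99.9))
```

With these example numbers, 90 minutes of downtime in a 30-day period falls below a 99.9% target, while 30 minutes does not.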
Post-incident measures
You should hold a post-incident meeting to brief the whole team about the issue. This should be documented for future reference and audit purposes.
- Why did this happen?
- How can it be prevented in the future?
- Document the reason for the outage well.
- Take immediate necessary action to fix the issue.
- Never apply a permanent fix.
- Ask for an emergency MDT if it's needed.