- Public Cloud Upgrade.
- Preparations
- Stage your Upgrade
- Monitoring
- Day01 - Infra Upgrade
- Day02 - Tungsten Fabric (MW01)
- Day03 - VCP Components (MW02)
- Day04 - Dataplane - ZONE-01 Compute nodes (MW03)
- Day05 - Dataplane - ZONE-02 Compute nodes (MW04)
- QA
- Statistics
- References
Public Cloud Upgrade.
What is the scariest task a cloud administrator may face? A Public Cloud Upgrade. I previously wrote an article about how to plan for a Public Cloud Upgrade. Well, that was theory only. This article describes how we did it in practice.
Preparations
Change Request
It's an internal process. A change request should be created and submitted for approval by multiple stakeholders. The change request should contain the following information:
- What is this upgrade
- Major enhancements
- Customer Impact
- Implementation Strategy
- Backup procedure
- Roll back Procedure
Customer Engagement
Inform premium customers through their CSM. They need to prepare themselves, and they may need some breathing time to protect against losing service to their own customers. Prepare a five-slide presentation to share with your large customers, if any.
Executive management
Even though they have given you the green light, they may have a lot of questions. Their major concerns will be "What will be the impact?" and "What is the rollback plan?". All these questions are addressed in the CR, but you need to prepare yourself to answer them. They are technical as well as curious.
Resource planning
You are going to do the upgrade at night, and your whole team will be present during the activity. At the same time, you need to make sure that the team is not exhausted. It will be a long night. Also, someone should be available during the daytime to address customer questions and concerns, if any. You can be 100% sure that not all customers will have taken the announcement seriously.
Frequent Updates
Provide frequent updates on your status page about the activity and how you are progressing. Once each day's activity is finished, update the channel.
Stage your Upgrade
We tried the following activities on our staging environment. Staging is a small environment, but at the architecture level it is exactly the same as production: it uses the same reclass model, the same software versions and the same database schema. The playbook we applied to production was created based on this staging experience. We tried the below on staging.
- Apply the procedure on Staging
- Roll back when a bug is identified
- Repeat the procedure
- Complete a seamless upgrade three times
- Create the final playbook
Monitoring
During the upgrade, you should have a full picture of each micro component. We deployed tiny virtual machines on all the compute nodes to run multiple checks, such as the following:
- Continuously ping an external website to make sure the VM can reach the Internet
- Run the above test both with and without a public IP, because the two use different routes
- Check connectivity between two regions
- Check reachability of the public IP from the Internet
- Check the metadata
- Check an external service, for example with git pull
All the above data is pulled into a Grafana dashboard.
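As an illustration only (this is not the tooling we ran), a probe of this kind can be sketched in a few lines of Python. The sketch swaps ICMP ping for a plain TCP connect, because a TCP connect needs no special privileges. The target hosts are assumptions: example.com stands in for "an external website", and 169.254.169.254 is the usual OpenStack metadata address, which may differ in your cloud. The function names tcp_reachable and run_probes are made up for this example.

```python
import socket
import time
from typing import Callable, Dict


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def run_probes(probes: Dict[str, Callable[[], bool]]) -> Dict[str, dict]:
    """Run every probe once and record its result and duration."""
    results = {}
    for name, probe in probes.items():
        started = time.monotonic()
        try:
            ok = bool(probe())
        except Exception:
            # A crashing probe is reported as a failure of that probe only.
            ok = False
        results[name] = {"ok": ok, "seconds": time.monotonic() - started}
    return results


if __name__ == "__main__":
    # Hypothetical targets; replace them with the endpoints you monitor.
    probes = {
        "internet_https": lambda: tcp_reachable("example.com", 443),
        "metadata": lambda: tcp_reachable("169.254.169.254", 80),
    }
    for name, result in run_probes(probes).items():
        status = "OK" if result["ok"] else "FAIL"
        print(f"{name} {status} {result['seconds']:.2f}s")
```

Run from cron or a loop on each tiny VM, the per-probe result and duration are the kind of values a dashboard can plot. How they are shipped to Grafana is left out here.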
Day01 - Infra Upgrade
Infrastructure nodes are where our Virtual Control Plane (VCP) resides, along with a few ad hoc virtual machines. The VCP runs in a highly available fashion. For example, the keystone service runs on three virtual machines, each on a different hypervisor (infra node). Upgrading those infra nodes is not that difficult: move the VIP to another node, destroy the domain, upgrade the packages and reboot. All went well.
- Stop the domains
- Populate the new repositories
- Run the upgrade
- Reboot the Nodes
Day02 - Tungsten Fabric (MW01)
One of the core pillars of a public cloud is its SDN. We use an industry-leading SDN, Tungsten Fabric. It is a complicated architecture with a lot of components. One of the main components is the network controllers: three servers working in HA mode. These controllers peer with passive layer ASR routers over BGP. Upgrading these nodes is the most challenging part. If you lose a BGP peer for any reason, or even if a small flap occurs, thousands of advertised routes are lost and all customers are impacted.
The procedure we followed:
- Move all traffic to two controllers
- Upgrade the isolated one
- Run QA and bring the traffic to the upgraded node
- Upgrade the remaining two
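The ordering of these steps is the important part, and it can be sketched as a small Python function. This is an illustration under assumptions, not our playbook: the four callables (isolate, upgrade, qa_passes, shift_traffic_to) are hypothetical hooks that you would back with your own tooling, and the node names are made up.

```python
from typing import Callable, List, Sequence


def upgrade_controllers(
    controllers: Sequence[str],
    isolate: Callable[[str], None],
    upgrade: Callable[[str], None],
    qa_passes: Callable[[str], bool],
    shift_traffic_to: Callable[[str], None],
) -> List[str]:
    """Upgrade one isolated controller first, then the remaining ones.

    Returns the upgraded controllers in order. Raises RuntimeError as soon
    as QA fails, so the controllers not yet upgraded are left untouched.
    """
    canary, rest = controllers[0], list(controllers[1:])

    isolate(canary)  # traffic now flows through the other controllers only
    upgrade(canary)
    if not qa_passes(canary):
        raise RuntimeError(f"QA failed on {canary}; remaining nodes untouched")
    shift_traffic_to(canary)  # the upgraded node takes the traffic

    upgraded = [canary]
    for node in rest:
        upgrade(node)
        if not qa_passes(node):
            raise RuntimeError(f"QA failed on {node}; stopping")
        upgraded.append(node)
    return upgraded


if __name__ == "__main__":
    # Dry run with print statements standing in for the real actions.
    upgrade_controllers(
        ["ctl01", "ctl02", "ctl03"],  # hypothetical node names
        isolate=lambda n: print("isolate", n),
        upgrade=lambda n: print("upgrade", n),
        qa_passes=lambda n: True,
        shift_traffic_to=lambda n: print("shift traffic to", n),
    )
```

Stopping at the first QA failure is the design choice that matters here: a bad first node costs one isolated controller, not the BGP sessions the customers depend on.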
Upgrading the other components, such as the Analytics nodes and ad hoc components, is straightforward. We can do them one by one, since all of them run in HA mode.
Day03 - VCP Components (MW02)
The VCP (Virtual Control Plane) upgrade is next. All the VCP components run in HA mode, as described before.
We deployed new virtual machines instead of upgrading the services in place. That means removing the current virtual machine from the cluster while keeping quorum, then deploying a new virtual machine with the latest version. The idea is to reduce the upgrade time, and it is safe to do when you have an HA deployment. The important thing to take care of is the backup. Before destroying the domain, take a backup and put it on a mounted NFS share. If something goes wrong, just copy it back and rebuild in minutes.
All our infrastructure nodes had a 1 TB NFS volume mounted. Backups were written to these mount points.
- Move the VIP
- Stop the domain
- Take the backup
- Destroy the domain
- Deploy the new VM
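Two checks guard this procedure: the cluster must still have quorum after one member is removed, and the backup must exist before the domain is destroyed. The Python sketch below shows both under stated assumptions. It assumes a simple majority quorum, and that a backup is a directory copy onto the NFS mount. The function names and paths are hypothetical, and a real backup of a database service would normally use that service's own dump tool.

```python
import shutil
from pathlib import Path
from typing import List, Tuple


def quorum_size(cluster_size: int) -> int:
    """Smallest majority of a cluster: 2 of 3, 3 of 5, and so on."""
    return cluster_size // 2 + 1


def safe_to_remove_one(cluster_size: int, healthy_members: int) -> bool:
    """True if a majority remains after one healthy member is removed."""
    return healthy_members - 1 >= quorum_size(cluster_size)


def _manifest(root: Path) -> List[Tuple[str, int]]:
    """Relative path and size of every file under root, sorted."""
    return sorted(
        (str(p.relative_to(root)), p.stat().st_size)
        for p in root.rglob("*")
        if p.is_file()
    )


def backup_before_destroy(source: Path, backup_root: Path, name: str) -> Path:
    """Copy source to backup_root/name and verify the copy before returning.

    shutil.copytree refuses to overwrite an existing target, so an earlier
    backup with the same name is never silently replaced.
    """
    target = backup_root / name
    shutil.copytree(source, target)
    if _manifest(source) != _manifest(target):
        raise RuntimeError(f"backup of {source} is incomplete; do not destroy")
    return target
```

With three members, safe_to_remove_one(3, 3) is true and safe_to_remove_one(3, 2) is false, which is why the VMs are replaced strictly one at a time.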
Day04 - Dataplane - ZONE-01 Compute nodes (MW03)
Upgrading the compute nodes requires rebooting all virtual machines. After the kernel upgrade the server needs a reboot, and during this reboot all the virtual machines residing on that compute node reboot as well. Some customers took the necessary action and shut off their virtual machines to avoid trouble for their applications. Four engineers worked in parallel on the compute node upgrade.
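Splitting the node list between engineers working in parallel is simple arithmetic, sketched below in Python. This is only an illustration: the round-robin split, the node names and the engineer names are assumptions, not a description of how we assigned the work.

```python
from typing import Dict, List, Sequence


def assign_batches(
    nodes: Sequence[str], engineers: Sequence[str]
) -> Dict[str, List[str]]:
    """Deal nodes out round-robin so each engineer gets a near-equal share."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    batches: Dict[str, List[str]] = {engineer: [] for engineer in engineers}
    for index, node in enumerate(nodes):
        batches[engineers[index % len(engineers)]].append(node)
    return batches


if __name__ == "__main__":
    # Hypothetical names: ten compute nodes shared between four engineers.
    nodes = [f"cmp{n:03d}" for n in range(1, 11)]
    for engineer, batch in assign_batches(nodes, ["eng1", "eng2", "eng3", "eng4"]).items():
        print(engineer, batch)
```

A round-robin split keeps the batch sizes within one node of each other. In practice you may prefer to split by rack or by customer so that related reboots happen together.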
Day05 - Dataplane - ZONE-02 Compute nodes (MW04)
Same procedure as for ZONE-01.
QA
Quality assurance is a key part of the upgrade. We ran quality checks after each stage, using multiple Ansible playbooks created for this purpose. They run different checks depending on the component. For example, when upgrading the VCP, the playbook deploys multiple virtual machines, creates and attaches volumes, deletes the virtual machines, and so on. Similarly, during the SDN upgrade it creates a network, creates a port and attaches it, checks AAP (allowed address pairs), and so on.
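Our checks were Ansible playbooks, but the pattern behind them (create resources in order, then always delete them again, newest first) can be shown in a short Python sketch. It is an illustration only: the step names and the create and delete callables are hypothetical placeholders for real API calls.

```python
from contextlib import ExitStack
from typing import Callable, List, Tuple

# One QA step: (name, create action, delete action).
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_lifecycle_check(steps: List[Step]) -> List[str]:
    """Run the create actions in order, then delete everything, newest first.

    Returns the step names when every create succeeded. If a create raises,
    the resources created so far are still deleted and the error propagates.
    """
    created: List[str] = []
    with ExitStack() as cleanup:
        for name, create, delete in steps:
            create()
            # Registered only after a successful create, so cleanup never
            # tries to delete something that was not created.
            cleanup.callback(delete)
            created.append(name)
    return created


if __name__ == "__main__":
    # Print statements stand in for real calls against the cloud API.
    steps = [
        ("vm", lambda: print("create vm"), lambda: print("delete vm")),
        ("volume", lambda: print("create volume"), lambda: print("delete volume")),
    ]
    run_lifecycle_check(steps)
```

Deleting in reverse order matters for dependent resources: a volume has to be removed before the virtual machine it is attached to, and a port before its network.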
We use the below tools for QA
Statistics
- 1000s of man-hours
- 100s of commands
- IaaC
- Multiple seamless upgrades in staging
- Sleepless nights
- Bugs & Fixes
- Dedication
- Emotions
- Anger
- Fun
- Success
- Rewards