How We Upgraded Our Public Cloud

Public Cloud Upgrade.
Preparations
Stage your Upgrade
Monitoring
Day01 - Infra Upgrade
Day02 - Tungsten Fabric (MW01)
Day03 - VCP Components (MW02)
Day04 - Dataplane - ZONE-01 Compute nodes (MW03)
Day05 - Dataplane - ZONE-02 Compute nodes (MW04)
QA
Statistics
References

Public Cloud Upgrade.

What is the scariest task a cloud administrator may face? A public cloud upgrade. I wrote an article earlier about how to plan for one, but that was theory only. This article tells you how we did it in practice.

Preparations

Change Request

This is an internal process. A change request (CR) should be created and submitted for approval by multiple stakeholders. The change request should contain the following information:

What is this upgrade
Major enhancements
Customer Impact
Implementation Strategy
Backup procedure
Roll back Procedure

Customer Engagement

Inform premium customers through their CSM so that they can prepare themselves. They may need some breathing time to protect their own customers from losing service. Prepare a five-slide presentation for your large customers, if any.

Executive management

Even after they have given you the green light, they may have a lot of questions. Their major concerns will be what the impact is and what the rollback plan is. All of these questions are addressed in the CR, but you need to be prepared to answer them yourself. They are technical as well as curious.

Resource planning

You are going to do the upgrade at night, and your whole team will be present during the activity. At the same time, you need to make sure the team does not get exhausted. It will be a long night. Also, someone should be available during the daytime to address customer questions or concerns. You can be 100% sure that not all customers took the announcement seriously.

Frequent Updates

Provide frequent updates on your status page about the activity and how you are progressing. Once each day's activity is finished, update the channel.

Stage your Upgrade

We first tried the following activities in our staging environment. Staging is a small environment, but it is identical to production at the architecture level: it uses the same reclass model, the same software versions and the same database schema. The playbook we applied to production was created from this staging experience. We did the following on staging.

Apply the procedure on staging
Roll back when a bug is identified
Repeat the procedure
Complete a seamless upgrade three times
Create the final playbook

Monitoring

During the upgrade, you should have a full picture of every small component. We deployed tiny virtual machines on all the compute nodes to run multiple checks, such as the following

Continuously ping an external website to make sure the VM is able to reach the Internet
Run the above test both with and without a public IP, because the two use different routes
Check connectivity between the two regions
Check reachability of the public IP from the Internet
Check the metadata
Check an external service, for example with git pull

All of the above data is pulled into a Grafana dashboard.
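As an illustration only, a probe like the ones described above could be sketched in Python as follows. The target host, the repository URL and the metric names are hypothetical placeholders, the metadata URL assumes the standard OpenStack metadata endpoint, and how the results reach Grafana is not shown because the article does not describe it.

```python
import subprocess
from typing import Callable, Dict, List


def command_ok(cmd: List[str], timeout: float = 10.0) -> bool:
    """Return True when the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A timeout or a missing binary counts as a failed check.
        return False
    return result.returncode == 0


# Hypothetical checks: hosts and URLs are placeholders, not the real targets.
CHECKS: Dict[str, Callable[[], bool]] = {
    "internet_ping": lambda: command_ok(["ping", "-c", "1", "example.com"]),
    "metadata": lambda: command_ok(
        ["curl", "-sf", "http://169.254.169.254/openstack/latest/meta_data.json"]
    ),
    "git_remote": lambda: command_ok(
        ["git", "ls-remote", "https://example.com/some/repo.git"]
    ),
}


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, int]:
    """Run every check and map its name to 1 (pass) or 0 (fail)."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = 1 if check() else 0
        except Exception:
            # A crashing check must not stop the remaining checks.
            results[name] = 0
    return results
```

Calling run_checks(CHECKS) on a schedule inside each test VM would give one 0/1 value per check, which is a convenient shape for plotting.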

Day01 - Infra Upgrade

{No Impact}

The infrastructure nodes are where our Virtual Control Plane (VCP) resides, along with a few ad hoc virtual machines. The VCP runs in a highly available setup: for example, the keystone service runs on three virtual machines, and each of the three runs on a different hypervisor (infra node). Upgrading those infra nodes is not that difficult. Move the VIP to another node, destroy the domain, upgrade the packages and reboot. All went well.

Stop the domains
Populate the new repositories
Run the upgrade
Reboot the Nodes
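The steps above can be sketched as a rolling loop that touches one infra node at a time and stops as soon as a node does not come back healthy. This is a minimal sketch, not our actual playbook: the step names, run_step and is_healthy are hypothetical hooks that you would back with your own tooling, and the explicit VIP step is taken from the paragraph above.

```python
from typing import Callable, Iterable, List

# Step names are illustrative labels for the procedure described above.
STEPS = (
    "move_vip_away",
    "stop_domains",
    "populate_repositories",
    "run_upgrade",
    "reboot",
)


def rolling_upgrade(
    nodes: Iterable[str],
    run_step: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Upgrade nodes strictly one after another and return the upgraded ones.

    run_step(node, step) performs one step on one node (hypothetical hook).
    is_healthy(node) is the gate that must pass before the next node is touched.
    """
    upgraded = []
    for node in nodes:
        for step in STEPS:
            run_step(node, step)
        if not is_healthy(node):
            raise RuntimeError(
                f"{node} is unhealthy after the upgrade; not touching the next node"
            )
        upgraded.append(node)
    return upgraded
```

Because only one hypervisor is down at any time, the HA services on the other infra nodes keep serving requests.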

Day02 - Tungsten Fabric (MW01)

{No Impact}

One of the core pillars of a public cloud is its SDN. We use Tungsten Fabric, one of the best SDN solutions in the industry. It has a complicated architecture with a lot of components. One of the main components is the set of network controllers: three servers working in HA mode. These controllers peer with the passive layer ASR routers over BGP. Upgrading these nodes is the most challenging part. If you lose a BGP peer for any reason, even a small flap will withdraw thousands of advertised routes and all customers will be impacted.

The procedure we followed

Move all traffic to two controllers
Upgrade the isolated one
Run the QA and bring traffic back to the upgraded node
Upgrade the remaining two
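The procedure above amounts to a canary-style order with a QA gate after every controller. The Python sketch below illustrates that idea under stated assumptions: drain, upgrade, qa_passes and restore are hypothetical hooks, the peer names are made up, and upgrading the remaining two controllers one at a time is an assumption of this sketch. The peers_established helper only compares reported states against the BGP "Established" state; how you collect those states depends on your routers and controllers.

```python
from typing import Callable, Dict, Iterable, List


def peers_established(peer_states: Dict[str, str], expected_peers: Iterable[str]) -> bool:
    """Return True only if every expected BGP peer reports the Established state."""
    return all(peer_states.get(peer) == "Established" for peer in expected_peers)


def upgrade_controllers(
    controllers: List[str],
    drain: Callable[[str], None],
    upgrade: Callable[[str], None],
    qa_passes: Callable[[str], bool],
    restore: Callable[[str], None],
) -> List[str]:
    """Upgrade controllers one by one; the first one acts as the canary.

    Traffic is only restored to a node after its QA gate passes, and a
    failed gate stops the run before any further controller is touched.
    """
    upgraded = []
    for node in controllers:
        drain(node)      # move traffic to the other controllers
        upgrade(node)    # upgrade the isolated controller
        if not qa_passes(node):
            raise RuntimeError(f"QA failed on {node}; remaining controllers untouched")
        restore(node)    # bring traffic back to the upgraded node
        upgraded.append(node)
    return upgraded
```

A qa_passes hook could, for example, call peers_established with the states read from the upgraded controller before any traffic is moved back.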

Upgrading the other components, such as the Analytics nodes and ad hoc components, is straightforward. We can do them one by one since all are in HA mode.

Day03 - VCP Components (MW02)

{No Impact}

The VCP (Virtual Control Plane) upgrade is next. All the VCP components run in HA mode, as described before.

We deployed new virtual machines instead of doing an in-place upgrade of the services. This means destroying the current virtual machine in the cluster while keeping the quorum, then deploying a new virtual machine with the latest version. The idea is to reduce the upgrade time, and it is safe to do when you have an HA deployment. The important thing to take care of is the backup. Before destroying the domain, you need to take a backup and put it on a mounted NFS share. If something goes wrong, just copy it back and rebuild in minutes.

All our infrastructure nodes had a 1-terabyte NFS volume mounted. Backups were written to these mount points.
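Since the rebuild depends entirely on the backup, it is worth verifying the copy before destroying anything. The following Python sketch copies a file to a backup directory and compares SHA-256 checksums. It is an illustration, not our actual tooling: the paths in the usage note are hypothetical, and a real backup of a domain may involve more than a single file.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and fail loudly if the copy differs."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"checksum mismatch for {target}; do not destroy the domain")
    return target
```

For example, backup_and_verify(Path("/var/lib/libvirt/images/ctl01.qcow2"), Path("/mnt/nfs/backups")) would only return once the copy on the (hypothetical) NFS mount matches the original.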

Move the VIP
Stop the domain
Take the backup
Destroy the domain
Deploy the new VM
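"Keep the quorum" can be turned into a simple pre-check before the destroy step. The sketch below assumes a plain majority rule (more than half of the configured members must stay up). Real clustered services each have their own quorum rules, so treat this as an illustration only.

```python
def quorum_size(total_members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if total_members < 1:
        raise ValueError("a cluster needs at least one member")
    return total_members // 2 + 1


def safe_to_remove_one(total_members: int, healthy_members: int) -> bool:
    """True if taking one healthy member down still leaves a majority."""
    return healthy_members - 1 >= quorum_size(total_members)
```

With three members, all healthy, one can be destroyed and redeployed (two of three remain). If one member is already down, the check fails and the next destroy should wait until the cluster has recovered.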

Day04 - Dataplane - ZONE-01 Compute nodes (MW03)

{VM Reboot}

Upgrading the compute nodes makes a reboot of all virtual machines necessary. After the kernel is upgraded, the server requires a reboot, and during this reboot all the virtual machines residing on that compute node are rebooted as well. Some customers shut off their virtual machines in advance to avoid any trouble for their applications. Four engineers worked in parallel on the compute node upgrade.
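One simple way to divide the compute nodes of a zone between engineers working in parallel is a round-robin split. This is only a sketch of the idea, with made-up node names. The article does not say how the nodes were actually assigned.

```python
from typing import List, Sequence


def split_round_robin(nodes: Sequence[str], workers: int = 4) -> List[List[str]]:
    """Distribute nodes over the given number of workers, one node at a time."""
    if workers < 1:
        raise ValueError("at least one worker is required")
    batches: List[List[str]] = [[] for _ in range(workers)]
    for index, node in enumerate(nodes):
        batches[index % workers].append(node)
    return batches
```

For six nodes cmp01 to cmp06 and four engineers, the first two engineers get two nodes each and the other two get one each.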

Day05 - Dataplane - ZONE-02 Compute nodes (MW04)

{VM Reboot}

Same procedure as for ZONE-01.

QA

Quality assurance is a key part of the upgrade. We ran quality checks after each stage, using multiple Ansible playbooks that we created for this purpose. They run multiple checks depending on the component. For example, when we upgrade the VCP, the script deploys multiple virtual machines, creates and attaches volumes, deletes virtual machines and so on. Similarly, during the SDN upgrade, it creates a network, creates a port and attaches it, checks AAP, etc.
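Our checks were Ansible playbooks, but the create, check and delete pattern they follow can be illustrated in a few lines of Python. In this sketch every step is a hypothetical (name, action, cleanup) triple. The actions would call your cloud API, and cleanups run in reverse order even when a step fails, so a failed check does not leave test resources behind.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None], Optional[Callable[[], None]]]


def run_smoke_test(steps: List[Step]) -> Tuple[bool, List[Tuple[str, str]]]:
    """Run steps in order, stop at the first failure, then clean up in reverse."""
    report: List[Tuple[str, str]] = []
    cleanups: List[Tuple[str, Callable[[], None]]] = []
    ok = True
    for name, action, cleanup in steps:
        try:
            action()
        except Exception as exc:
            report.append((name, f"failed: {exc}"))
            ok = False
            break
        report.append((name, "ok"))
        if cleanup is not None:
            cleanups.append((name, cleanup))
    # Undo only what was created, newest resource first.
    for name, cleanup in reversed(cleanups):
        try:
            cleanup()
        except Exception as exc:
            report.append((f"cleanup of {name}", f"failed: {exc}"))
            ok = False
    return ok, report
```

A VCP check would pass steps such as "create VM", "create volume" and "attach volume", each paired with its matching delete or detach call.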

We use the below tools for QA

Statistics

1000s of man-hours
100s of commands
IaaC
multiple seamless upgrades in Staging
sleepless nights
Bugs & Fixes
Dedication
Emotions
Anger
Fun
Success
Rewards
