Preface

Upgrading a public cloud to the next stable release is a nightmare for the operations team. We use OpenStack as our cloud software. As we know, OpenStack is an abstraction layer: it consists of a number of distributed software components that collaborate with each other to deliver the required cloud services. Upgrading from one version of OpenStack to another can take a great deal of effort.

Do you need an upgrade?

This is a very important question. A new version of the product is available, but are you going to benefit from it? Does this upgrade give you any value? Hold multiple brainstorming sessions to get a clear answer. If the answer is yes, go to the next step.

Plan the upgrade

  • Read the release notes for the target version. Learn about the new, updated and deprecated features. Find out the incompatibilities.
  • Consider the impact on end users.
  • Document everything you learn.
  • Discuss the pros and cons with all stakeholders.

Methods of upgrades

Parallel Cloud

This is the safest method for a public cloud, and also the most costly. In this method, you build an identical cloud in parallel. Live migrate your resources to the new cloud, upgrade the original cloud, and bring the workloads back.

Advantages

  • Safe
  • Easy to rollback
  • No downtime for customer workloads

Disadvantages

  • Investment
  • Idle resources after the upgrade

Rolling upgrade

This approach involves upgrading each component of the system, piece by piece, eventually giving you a cloud running on the new version. While this is more complex, each component should see only minimal downtime, and by using the resilience built into OpenStack we should be able to achieve zero downtime overall, although some actions may take slightly longer than usual. In recent versions the control plane is decoupled from the data plane, so we can upgrade the components one by one, which minimises the downtime.

Advantages

  • Potentially easier rollback
  • Potentially less downtime for each component
  • Works well when deploying Nova in a VM
  • Easier to test, as the system is in a known state (or VM image)

Do an upgrade in Staging

As usual, you have to build a staging environment identical to your production environment. I said identical: keep the exact component versions. If you have a lot of NFV workloads, deploy a few in the staging environment to make sure that these workloads will survive the upgrade. If you face any issues during the staging upgrade, note them down, fix them, and repeat the upgrade until the issues disappear. You should have a clean, fully working upgrade in staging before upgrading production.
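
One way to check the "identical" requirement is to compare the component versions of the two environments before you start. The snippet below is only a minimal sketch: the function name `version_drift` is hypothetical, the version numbers are made-up sample values, and how you collect the real versions (package manager, configuration management, and so on) is left to your own tooling.

```python
def version_drift(production, staging):
    """Return {component: (production_version, staging_version)} for every mismatch.

    A component that is missing from one environment shows up with None
    on that side.
    """
    drift = {}
    for component in sorted(set(production) | set(staging)):
        prod_version = production.get(component)
        staging_version = staging.get(component)
        if prod_version != staging_version:
            drift[component] = (prod_version, staging_version)
    return drift


# Made-up sample data, for illustration only.
production = {"nova": "17.0.13", "neutron": "12.1.1", "keystone": "13.0.4"}
staging = {"nova": "17.0.13", "neutron": "12.1.0"}

drift = version_drift(production, staging)
for component, (prod_version, staging_version) in drift.items():
    print(f"{component}: production={prod_version} staging={staging_version}")
```

An empty result means the two version maps match; anything else should be fixed before the staging upgrade is treated as representative.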

Backup

Take a backup of all the components. Not only the critical ones: the whole system should be backed up. Allocate enough space for the backup.

  • MySQL
  • Cassandra
  • ZooKeeper
  • All configuration files
  • Salt / Ansible playbooks.
  • Take a snapshot of all the virtual machines in your control plane.
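
To make "allocate enough space" concrete, you can add up a size estimate for each item in the list and compare it with the free space on the backup target. This is a rough sketch under assumptions: the function name `backup_fits`, the 1.5x headroom factor and all the sizes are hypothetical placeholders, not measurements. Only `shutil.disk_usage` is a real standard-library call.

```python
import shutil

GIB = 1024 ** 3


def backup_fits(estimated_sizes, free_bytes, headroom=1.5):
    """Return (fits, required_bytes) for the given size estimates.

    `headroom` is a safety factor on top of the raw estimate; 1.5 is an
    arbitrary assumption, tune it for your environment.
    """
    required = int(sum(estimated_sizes.values()) * headroom)
    return required <= free_bytes, required


# Placeholder estimates in bytes; replace them with your own numbers.
estimated_sizes = {
    "mysql": 40 * GIB,
    "cassandra": 120 * GIB,
    "zookeeper": 1 * GIB,
    "configuration_files": 1 * GIB,
    "playbooks": 1 * GIB,
    "control_plane_snapshots": 200 * GIB,
}

# "." stands in for the backup target path.
free_bytes = shutil.disk_usage(".").free
fits, required = backup_fits(estimated_sizes, free_bytes)
print(f"required: {required / GIB:.1f} GiB, free: {free_bytes / GIB:.1f} GiB, fits: {fits}")
```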

Upgrade plan

  • Day01: Upgrade the control plane (expect traffic interruption, because we are upgrading the network nodes)
  • Day02: Upgrade compute nodes - ZONE-01
  • Day03: Upgrade compute nodes - ZONE-02

Upgrade

Day01

  • This assumes that the virtual control plane has been deployed in HA mode.
  • We are going to upgrade the controller-01 node first.
  • Take a fresh snapshot of the controller-01 VM.
  • If the VIP is hosted on this node, move it to a different controller.
  • This keeps the API available during the upgrade.
  • Stop all services on the controller-01 node.
  • Roll out the upgrade packages.
  • Reboot the machine.
  • Make sure that all your services are back online.
  • If not, roll back the upgrade.
  • Deploy a couple of virtual machines using this controller.
  • Move to the next controller.
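
The Day01 steps can be sketched as a loop over the controllers. This is an illustration of the ordering only, not a real deployment tool: `upgrade_controllers`, the `ops` object and every method on it are hypothetical names standing in for your own tooling (Salt, Ansible, hypervisor API), and `DryRunOps` just records the calls so the flow can be followed without touching a real node.

```python
def upgrade_controllers(controllers, ops):
    """Upgrade controllers one at a time and stop at the first failure.

    Returns a list of (controller, outcome) tuples.
    """
    results = []
    for node in controllers:
        ops.snapshot(node)                # fresh snapshot before touching the node
        if ops.hosts_vip(node):
            ops.move_vip_away(node)       # keep the API reachable
        ops.stop_services(node)
        ops.install_packages(node)
        ops.reboot(node)
        if not ops.services_healthy(node):
            ops.rollback(node)
            results.append((node, "rolled back"))
            break                         # do not continue with a broken control plane
        ops.smoke_test(node)              # deploy a couple of VMs through this controller
        results.append((node, "upgraded"))
    return results


class DryRunOps:
    """Hypothetical stand-in that records each step instead of executing it."""

    def __init__(self, vip_host, unhealthy=()):
        self.vip_host = vip_host
        self.unhealthy = set(unhealthy)
        self.log = []

    def hosts_vip(self, node):
        return node == self.vip_host

    def move_vip_away(self, node):
        self.vip_host = None
        self.log.append(("move_vip_away", node))

    def services_healthy(self, node):
        return node not in self.unhealthy

    def __getattr__(self, name):
        # Every other step (snapshot, reboot, ...) is simply recorded.
        def step(node):
            self.log.append((name, node))
        return step


ops = DryRunOps(vip_host="controller-01", unhealthy={"controller-03"})
results = upgrade_controllers(["controller-01", "controller-02", "controller-03"], ops)
for node, outcome in results:
    print(node, outcome)
```

Stopping at the first rollback is a design choice in this sketch: it leaves the remaining controllers on the old version so that you can investigate before going further.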

Day02

  • Roll out the upgrade
  • Reboot the hypervisor
  • Make sure the customer workloads return to their last status
  • Test and validate the new cloud
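
A simple way to confirm that the workloads kept their last status is to record the status of every instance on the hypervisor before the reboot and compare it with the status afterwards. The sketch below assumes that you have already collected both views as dictionaries, for example from the output of `openstack server list`. The function name `status_changes`, the instance names and the `MISSING` marker are hypothetical.

```python
def status_changes(before, after):
    """Return {instance: (old_status, new_status)} for instances whose status changed.

    Instances that are no longer reported after the reboot are marked "MISSING".
    """
    changes = {}
    for instance, old_status in before.items():
        new_status = after.get(instance, "MISSING")
        if new_status != old_status:
            changes[instance] = (old_status, new_status)
    return changes


# Made-up sample data, for illustration only.
before = {"vm-a": "ACTIVE", "vm-b": "SHUTOFF", "vm-c": "ACTIVE"}
after = {"vm-a": "ACTIVE", "vm-b": "SHUTOFF", "vm-c": "ERROR"}

changes = status_changes(before, after)
for instance, (old_status, new_status) in changes.items():
    print(f"{instance}: {old_status} -> {new_status}")
```

An empty result means that every workload came back in the state it was in before the reboot, including the ones that were intentionally shut off.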

Customer Feedback

  • Deploy multiple scenarios on the new cloud.
  • Follow up with your premium customers about the service status.
  • Call your NFV customers and make sure that traffic is stable.
  • Be vigilant on your monitoring tools.
  • Alert the L1 & L2 teams to any suspicious activity on the monitors.

Further Reading