No outage WLC Upgrade?

The most annoying thing about upgrading WLCs, and probably one of the key reasons I see so many WLCs running old versions, is that you have to plan an outage (I'll leave hitting new bugs aside). This becomes especially hard in environments that have to deal with any of the following:

  • Multiple Sites on the same WLC
    • FlexConnect setups mainly (but could be dark fibre to the controller from remote sites)
    • With potentially different maintenance windows for each site
  • Sites requiring to be operational 24/7
    • Healthcare
    • Warehousing
    • Mining
    • Others

Now there have always been options to reduce the outage time, like pre-downloading the new code version, but that doesn’t always help. I have seen issues with the Cisco x700 series APs where, even after a successful pre-download to the AP, the AP goes back to the controller during the upgrade and downloads the code again (the double download). And then there are the models without enough room on flash (although they are slowly disappearing).

With the 9800 controller, Cisco released a feature called N+1 Hitless Rolling AP Upgrade, along with some other “hitless upgrades” like In Service Software Upgrade (ISSU), which is supported in the 17.x track.

For this post I’m going to share my experience of how truly hitless the N+1 Hitless Rolling AP Upgrade was in a warehouse environment.

First, what is the N+1 Hitless Rolling AP Upgrade? Well, it’s designed to upgrade access points in a staggered manner, moving them to the secondary WLC as they are upgraded. The secondary WLC has to be in the same mobility group as the primary WLC.

Unfortunately, this means that if there is no secondary coverage for your clients, there will be an outage (for how long? I will detail that later).

As we were piloting this upgrade, we arranged a maintenance “outage” window with the site just to be safe. The site was a large food distribution centre running approx. 200 APs in FlexConnect mode.

Plan:

  1. Upload the new version to the primary and secondary WLCs – I did this via the web UI
  2. Pre-download the image using site tags, so that not every AP downloads from the WLC across the WAN
wlc#install add file bootflash:<File name>
wlc#show install summary

wlc#clear ap predownload statistics
wlc#ap image predownload site-tag <site tag> start

wlc#show ap image
wlc#show ap master list
  3. Upgrade the secondary WLC
wlc#install add file bootflash:<file name> activate commit
wlc#show version | include Version
  4. Trigger the start of the rolling AP upgrade
wlc(config)# ap upgrade staggered 15
wlc#ap image upgrade destination <Secondary WLC Name> <Secondary WLC IP>

There is also the option to have the APs automatically fall back once the primary WLC has been upgraded, using the command below. If you use it, you can skip step 6.

wlc#ap image upgrade destination <Secondary WLC Name> <Secondary WLC IP> fallback
  5. Upgrade the primary WLC
wlc#install activate

Accept the reload prompt, then commit the install after the reload

wlc#install commit

wlc#show version | i Version
  6. Fail the APs back to the primary
wlc#ap image move destination <Primary WLC Name> <Primary WLC IP>
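
To get a feel for what the staggered setting in the plan means at this scale, here is a minimal Python sketch that estimates how many upgrade iterations a percentage-based stagger would need. It assumes each iteration takes a fixed share of the total AP count. The controller chooses its own candidate APs per iteration, so the real count may differ. The function name `estimate_iterations` is made up for illustration.

```python
import math


def estimate_iterations(total_aps: int, percent_per_iteration: int) -> int:
    """Rough number of iterations if each one upgrades a fixed share of APs.

    Assumption: batch size is simply percent_per_iteration of total_aps,
    rounded down, with a floor of one AP per iteration.
    """
    if total_aps <= 0:
        return 0
    if not 0 < percent_per_iteration <= 100:
        raise ValueError("percent_per_iteration must be between 1 and 100")
    batch = max(1, (total_aps * percent_per_iteration) // 100)
    return math.ceil(total_aps / batch)


if __name__ == "__main__":
    # Approx. 200 APs with the 15% stagger used in the plan.
    print(estimate_iterations(200, 15))
```

With roughly 200 APs and a 15% stagger, that works out to batches of about 30 APs.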

Testing in place during steps 4-6

As this was the first time conducting this “Hitless” upgrade, we needed some way to determine whether there was client impact. To do this, we had multiple pickers and forklift operators carrying out normal operations throughout the test to confirm the user impact, if any.

I then strategically deployed some WLAN Pis as wireless clients, along with some other wireless clients, at static locations throughout the warehouse. I had a continuous ping running to these devices from a wired source.
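
A continuous ping is easiest to judge by its longest run of consecutive losses, not by the total lost. The following is a minimal Python sketch of that summary, assuming the ping results have already been collected as a list of booleans (`True` for a reply, `False` for a timeout). The function names and the sample data are hypothetical and are not output from the actual test.

```python
def longest_loss_run(replies: list[bool]) -> int:
    """Return the longest streak of consecutive lost pings."""
    longest = current = 0
    for ok in replies:
        if ok:
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def summarise(replies: list[bool]) -> dict[str, int]:
    """Summarise a ping series: sent, lost, and longest consecutive loss."""
    return {
        "sent": len(replies),
        "lost": replies.count(False),
        "longest_run": longest_loss_run(replies),
    }


if __name__ == "__main__":
    # Hypothetical series: 5 replies, 2 timeouts while roaming, 3 replies.
    sample = [True] * 5 + [False, False] + [True] * 3
    print(summarise(sample))
```

A short streak points to a roam between APs, while a streak lasting minutes points to a client with no second AP to move to.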

Findings

Firstly, the pre-downloads using the FlexConnect master (which you no longer have to select manually) work really well. The physical site was split into 2 site tags containing 9120AXI and 9120AXE model APs. The master took about 15 minutes to download from the WLC over a 50MB WAN, but then within 5-10 minutes all the remaining APs had completed their pre-download.
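
The WAN saving from the master approach is easy to put a number on. The Python sketch below counts how many image transfers cross the WAN with and without a master. It assumes one master per site tag, which in turn assumes both 9120 models share the same image, and it assumes an even 100/100 split of the roughly 200 APs. The site tag names and the split are invented for illustration.

```python
def wan_downloads(aps_per_site_tag: dict[str, int], use_master: bool) -> int:
    """Count image transfers that cross the WAN from the WLC.

    Assumption: with a master, one AP per non-empty site tag pulls the
    image over the WAN and the rest copy it locally from that master.
    """
    if use_master:
        return sum(1 for count in aps_per_site_tag.values() if count > 0)
    return sum(aps_per_site_tag.values())


if __name__ == "__main__":
    # Hypothetical split of approx. 200 APs across the 2 site tags.
    site = {"site-tag-a": 100, "site-tag-b": 100}
    print(wan_downloads(site, use_master=False))
    print(wan_downloads(site, use_master=True))
```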

Now, the operational areas of this site have a crazy coverage design: every area has a minimum of 2 APs at -72dBm, with the primary being -67dBm or better, so it is an ideal candidate for the N+1 rolling upgrade. That said, roaming enhancements like 802.11k/v/r are either not supported or not enabled for most of the clients.
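
If you want to check your own site against that kind of coverage rule before attempting a rolling upgrade, something like the following Python sketch could be run over survey data. It assumes you have, for each survey location, a list of the RSSI values (in dBm) of the APs heard there. The function name and the sample readings are hypothetical.

```python
def meets_coverage_rule(
    rssi_dbm: list[float],
    primary_min: float = -67.0,
    secondary_min: float = -72.0,
) -> bool:
    """True if the strongest AP is at primary_min or better and a second
    AP is at secondary_min or better at this location."""
    strongest = sorted(rssi_dbm, reverse=True)
    if len(strongest) < 2:
        return False  # single-AP coverage: clients here have nowhere to roam
    return strongest[0] >= primary_min and strongest[1] >= secondary_min


if __name__ == "__main__":
    # Hypothetical survey readings per location.
    survey = {
        "aisle-12": [-60.0, -70.0, -80.0],
        "dock-3": [-65.0],
    }
    for location, readings in survey.items():
        print(location, meets_coverage_rule(readings))
```

Any location that fails the check is one where clients should be expected to drop while their only usable AP reloads.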

During the upgrade, the most ping loss I saw from any of the static stations was 1-2 pings. The pickers and forklift operators did not report any interruption of service.

The one thing I did find was that the few areas at this site with only single-AP coverage (due to design requirements) had an outage of approx. 5 minutes whilst the AP reloaded to apply the new version. I would like to see the AP reload time improved to <30 seconds, ideally with <10 seconds from connection to forwarding traffic.

This new upgrade method opens up the possibility of conducting WLC upgrades during business hours instead of getting up at zero dark thirty to perform them. Ever since this feature was first announced, I have been calling on Cisco to put their money where their mouth is and actually do the upgrade during the keynote at CLUS, after telling the audience they are going to do it, to show that they believe in the technology. So far they have not been willing to; maybe one day.
