vSphere Load Based Teaming and Converged Network Adapters

Obvious after a little thought but perhaps not upfront, vSphere Distributed Switch Load Based Teaming (LBT aka “route based on physical NIC load”) is unsuitable in certain configurations when using converged network adapters (CNAs) such as the Cisco VIC.

A CNA is a physical Ethernet network adapter card that can be “carved up” into virtual Ethernet adapters and virtual Fibre Channel over Ethernet (FCoE) adapters. The virtual interfaces appear as discrete system devices on the PCI bus, have their own MAC or WWxN addresses and, as far as the hypervisor is concerned, are all physical adapters.

Depending on configuration, the full bandwidth of the physical network adapter is generally available to be consumed by all of its virtual adapters. For example, a 10Gbps CNA that has been carved up into four virtual ethernet adapters will present four adapters each with a link speed of 10Gbps, and each of those virtual adapters is capable of consuming the entire 10Gbps to the detriment of the others (QoS is typically used to mitigate this but is irrelevant to this discussion).

vSphere Load Based Teaming is an uplink load balancing algorithm available with the vSphere Distributed Switch and is similar to “Route based on originating port ID” with the addition of software intelligence to dynamically re-home traffic to different uplinks in response to any individual uplink exceeding 75% utilisation over a 30-second period. The hypervisor determines if a link is at 75% utilisation by comparing the traffic it sees on the uplink with its reported physical link speed.
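
As a rough illustration of that check, here is a minimal sketch in Python. It is not VMware's implementation: the function name and the one-sample-per-second model are assumptions, and only the 75% threshold and 30-second window come from the description above.

```python
# Hypothetical sketch of the LBT trigger described above; not VMware code.
LBT_THRESHOLD = 0.75      # 75% utilisation, from the text
LBT_WINDOW_SECONDS = 30   # evaluation window, from the text


def uplink_is_congested(samples_gbps, link_speed_gbps):
    """Return True if mean observed throughput over the window exceeds
    75% of the link speed the adapter reports to the hypervisor."""
    mean_gbps = sum(samples_gbps) / len(samples_gbps)
    return mean_gbps / link_speed_gbps > LBT_THRESHOLD


# Assumed sampling model: one throughput sample per second for 30 seconds.
busy = [8.0] * LBT_WINDOW_SECONDS
quiet = [3.0] * LBT_WINDOW_SECONDS

print(uplink_is_congested(busy, 10.0))   # True: 80% of a 10Gbps link
print(uplink_is_congested(quiet, 10.0))  # False: 30% of a 10Gbps link
```

Note that the only inputs are the traffic seen on that one uplink and its reported link speed, which is exactly where the trouble with CNAs starts.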

When using a CNA, the actual utilisation of the adapter is the sum of the utilisation of each of its virtual adapters. However, LBT can only observe, and therefore only considers, the utilisation of each individual virtual adapter and not the actual utilisation of the parent physical adapter.

Where Observed != Actual

Let’s look at a simplified example where a dual-port 10Gbps CNA has four virtual adapters (VNICs) defined to provide networking for two virtual switches: one standard switch for vmkernel traffic and one distributed switch for virtual machine traffic, leveraging LBT on each distributed port group. Similar designs are common and stem from the days of multiple 1Gbps ethernet connections.

LBT CNA img1

In this case, there is traffic on the physical adapter (PNIC) that LBT is unaware of and thus it cannot make intelligent load balancing decisions. If a large amount of vMotion traffic is saturating vmnic4 and thus PNIC2, vmnic2’s link will also be saturated, but LBT will not act to rebalance traffic as it is unable to observe that activity.
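
To put some numbers on it, here is a small Python sketch of the gap between what LBT observes and what the physical port actually carries. The throughput figures are invented for illustration, and I'm assuming vmnic2 is the distributed switch uplink that shares PNIC2 with vmnic4.

```python
# Hypothetical figures for the scenario above; adapter names follow the text.
LINK_SPEED_GBPS = 10.0
THRESHOLD = 0.75  # LBT's 75% trigger

# Throughput in Gbps for the two virtual adapters sharing physical port PNIC2.
pnic2_vnics = {
    "vmnic2": 2.0,  # assumed distributed switch uplink, visible to LBT
    "vmnic4": 7.5,  # standard switch uplink carrying vMotion, invisible to LBT
}

# What LBT sees: only the distributed switch uplink.
observed = pnic2_vnics["vmnic2"] / LINK_SPEED_GBPS
# What the physical port carries: the sum of all of its virtual adapters.
actual = sum(pnic2_vnics.values()) / LINK_SPEED_GBPS

print(f"observed by LBT: {observed:.0%}")
print(f"actual on PNIC2: {actual:.0%}")
print("LBT would rebalance:", observed > THRESHOLD)
print("port is past the threshold:", actual > THRESHOLD)
```

With these numbers LBT sees a lightly loaded uplink and does nothing, while the physical port underneath it is well past the point where rebalancing should have kicked in.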

Make Observed = Actual

If LBT is desirable and there are no other constraints, such as separation of vmkernel and virtual machine traffic at the vswitch level, it can be made to work by ensuring that the traffic observed by LBT is equivalent to the actual utilisation of the physical link. That means defining only a single virtual adapter for every physical adapter in the system, for example:

LBT CNA img2

With FCoE ?

So we’ve solved the issue with virtual ethernet adapters, but what if we’re using virtual FCoE adapters to pass virtual machine storage traffic for either VMFS datastores or raw device mappings (RDMs)? Again, there is traffic traversing the physical adapters which LBT is unable to observe.

LBT CNA img3

There is one situation where this configuration might work, and that is where the FCoE virtual adapters are only being used to SAN boot ESXi and not for virtual machine storage traffic. In this situation, traffic on the VHBAs should be close enough to zero as to make LBT effective.

LBT CNA img4

Conclusion

So, in general LBT shouldn’t be used with CNAs. However, I can think of a couple of use cases where it is appropriate to do so:

  1. As in the redesigned example above, where only a single virtual adapter per physical port is configured. In this configuration, the hypervisor is seeing the entire utilisation of the uplink and so can make intelligent load balancing decisions.
  2. FCoE boot from SAN.

 

Buzzword Mitigation – Cloud to Butt Plus

Things can get pretty serious in the IT community, especially as new and vaguely defined buzzwords take hold. News articles are written by churnalists with an at-best tenuous grasp on the significance of the buzzwords, used in contexts implying widely differing definitions even within the same article.

One of the most egregious offenders of late is “cloud”. While I think industry has matured its definition of cloud lately, you still see the term used to describe everything from full SaaS offerings all the way down to a VM and a PowerShell script. All of which means that for most of the content on the internet, use of the word “cloud” is effectively meaningless and interchangeable with pretty much anything.

And that realisation is the genesis of the deliciously juvenile “Cloud to Butt Plus” Chrome plugin that one of my colleagues introduced me to a few months ago.

Put simply, the plugin:

Replaces the text ‘the cloud’ with ‘my butt’, as well as ‘cloud’ with ‘butt’ in certain contexts.

The results are both puerile and enlightened at the same time. Witness.

2014-08-06 14_44_53-Search _ ITworld

2014-08-06 14_40_14-HP trims its cloud offer for lighter use _ ITworld

2014-08-06 14_46_26-Search _ ITworld

 

My CCNA Data Center Exam Experience(s)

I recently sat and passed the two constituent exams of the Cisco Certified Network Associate – Data Center certification.

This certification covers networking, storage and virtualisation technology broadly and several of Cisco’s data centre lineup specifically including Nexus and MDS switching and UCS compute.

While I have historically been a server, storage and virtualisation guy, converged storage and networking plus the more recent adoption of software defined networking, network virtualisation, network function virtualisation and/or whatever it is that we’re calling it this week suggest that, moving forward, virtualisation engineers will need a solid grounding in networking to stay relevant.

Heck, the team that I’m now working in is called “Data Centre Platforms” and is a converged team with remit across SAN, storage, DC networking, firewalling, load balancing, hypervisor, server hardware and infrastructure automation. So convergence is not only coming soon, it’s already happening; and it’s not only a technology change, it’s an organisational change as well.

In this sense, the CCNA Data Centre is a useful certification for a data centre platforms professional. Many of the topics covered are generic enough that they provide a solid grounding in networking, storage and virtualisation technology relevant beyond the Cisco stack. Unfortunately, other areas of the exams dive far, far too deep into product specifics — the dreaded speeds & feeds — to be of any real use to anybody, even someone that lives and breathes Cisco.

The Exams

There are two exams required to achieve the certification:

  • 640-911 Introducing Cisco Data Center Networking
  • 640-916 Introducing Cisco Data Center Technologies

The exams are both 90 minutes long and contain about 70 questions each. The questions are a mix of multiple choice and lab emulators. Once you have completed a question, you cannot go back. If you do not know the answer, you cannot skip the question and come back to it later. This has implications both good and bad for time management.

Time Management

Unlike any other IT exam I’ve ever sat, I had absolutely no problem with time management and cannot make any specific recommendations to other potential candidates around this. I finished both exams in between thirty and forty minutes, or less than half of the allocated time. This isn’t because I am a genius and breezed through to a perfect score. It is because for the majority of questions I either knew the answer immediately, in which case it took only a few seconds, or I didn’t and spending five or ten minutes thinking about it wouldn’t have helped. A few maths and lab questions were the exceptions here.

Exam Preparation

I’ve been exposed to networking technology and Cisco products for a number of years and didn’t feel that attending the official instructor-led training would be a good investment of time and money. Instead, I leant on my experience with a FlexPod implementation, some online self-paced video training, and to a much lesser extent the UCS emulator for study preparation.

The Blueprints

As with all exams, the first resource I went to was the blueprints. These were my first Cisco exams and I found the blueprints to be unhelpfully terse. Let’s look at the Unified Computing section from the 640-916 Introducing Cisco Data Center Technologies exam:

5.0 Unified Computing

5.1 Describe and verify discovery operation
5.2 Describe, configure, and verify connectivity
5.3 Perform initial set up
5.4 Describe the key features of UCSM

UCS is an incredibly powerful and incredibly complex beast. While a blueprint should be bullet-points, I found these to be unhelpful.

The various spelling and grammatical mistakes also gave the impression that they were an afterthought, not the basis from which the exam content was derived. If someone could “Describe initiator target” for me, I’d be most obliged. Perhaps it has something to do with “zoningt”.

3.0   Storage Networking

3.1 Describe initiator target
3.2 Verify SAN switch operationst
3.3 Describe basic SAN connectivityt
3.5 Describe the different storage array connectivityt
3.6 Verify name server logint
3.7 Describe, Configure and verify zoningt
3.8 Perform initial set upt
3.9 Describe, Configure and verify VSAN

And now it’s time for blog.lukedudney’s very first prize giveaway! The author of the first comment to correctly describe the difference between “server virtualization” and “Server Virtualization” will receive a cheque for ONE MILLION DOLLARS**

4.0   DC Virtualization

4.1 Describe device virtualization
4.2 Describe server virtualization
4.3 Describe Server Virtualization
4.4 Describe Nexus 1000v
4.5 Verify initial set up and operation for Nexus 1k

Video Training

If you’re looking to sit the exam, I highly recommend Chris Wahl’s excellent video training series on Pluralsight, which comprehensively covers the blueprint topics. Chris’s professional but friendly and passionate style of presentation really resonated with me, which surely helps with absorbing the information. There were moments where I could tell he was genuinely geeking out over the technology, and the VOLTRON references were also a nice touch 😉

My Wife, the FEXpert

Beware of any blueprint topic that includes the words “Describe” and “product”. These are the demon topics that expect comprehensive rote learning of the minutiae of the Cisco product family specs, including such Googleable and irrelevant factoids as port counts and speeds, feature support, and physical connectors.

My approach to these topics was to produce flash-cards that my wife tested me on over breakfast, and I am certain that through the process she (an accountant by trade) retained a more comprehensive knowledge of Cisco FEX specifications than I did. Of course, both of us had completely forgotten all of this information by the next day.

It’s disappointing that vendors still feel that rote learning of product specs is a useful and acceptable aspect of certification exams, but they’re all guilty of it to a certain degree. Of course these exams aren’t purely academic and there is also a high level of marketing involved, but even so it seems a waste of time for both the candidate and vendor to base certification on such transient and uselessly specific product details.

UCS Platform Emulator

Cisco provide a functional UCS Manager interface in the form of a virtual appliance that can be deployed into VMware Workstation (and other similar products, I’m sure). This emulator lets you interact with the UCS GUI in ways that should cover the blueprint topic “5.4 Describe the key features of UCSM”. Other than exposure to the GUI, the emulator isn’t really going to help you much, as a lot of the exam topics are about the functions of the various components within the UCS that are not obvious just from playing with the GUI.

FCoE is a Thing That Exists in the World

Fortunately I’ve had experience with the build and operations of a FlexPod environment, and even more fortunately, that environment included FCoE. Several of the exam topics covering Nexus and MDS deal specifically with FCoE, a technology to which I and probably 95% of IT professionals have never had and will never have any exposure. If you’ve got the opportunity to work with a FlexPod or VBLOCK, then this exposure will be invaluable to you in the exam.

ACE and WAAS-who?

Bless these guys, they’re still in the running and only just made the cut with 1% of the total exam score. My advice — give them a quick glance, but don’t bother rote learning the configuration maximums and so on. You likely won’t come across the technologies in the real world.

 

All in all, the CCNA Data Center certification process was a positive one. Cisco have done really well in creating exams that test for competence across a wide range of hitherto segregated data centre disciplines. The next Cisco exams I sit will be the two UCS certifications DCUCI and DCUCD towards my CCNP Data Center. If I need any additional help, well, it helps that my wife is now an expert!

– L

 **cheques will not be honoured.

First of the Cisco UCS M4 Gen Servers Released

Quite a few months late on the announcement, but thought I’d put together a quick note that Cisco have released the first of the M4 generation of C-series (rack mount) and B-series (blade) servers from their Unified Computing System.

There’s a lot of the usual: new CPUs, more memories, etc. Besides all that, though, a couple of notables in the release stood out for me.

1. The Physical Design

The new systems are, to put it bluntly, really, really, really ridiculously good looking. The black accents, metal grills and an aggressively prodigious use of right angles on the blades make these servers look more at home in a Mad Max film than a contemporary data centre.

Cisco B460 M4

The C460 has also come quite a long way since its Fisher-Price M1/M2 years

 

UCS C460 M2
UCS C460 M1 c. 2010
UCS C460 M4
UCS C460 M4 c. 2014

 2. We Form Like VOLTRON

This is probably the niftiest technical feature out of the new lineup. The B260 M4 is a full-width, two-socket blade that, when it gets together and loves another B260 M4 blade in a very special way, unifies into a 4-socket B460 M4. This is achieved not with dynotherms and infracells but with a “Scalability Connector”, which attaches to the front of the blades. I guess it’s Unified Computing in more ways than one.

While I’m aware that other blade vendors have had expansion blades on the market for quite a while, I’m not sure that any of them provide the ability to double your memory, CPU sockets, I/O bandwidth, and expansion slots all at the same time. No word yet on whether additional accessories such as scented candles and Barry White CDs are required to encourage a successful pairing.

More Memories

Scalable up to 6.0 terabytes of main memory when 64GB DIMMs are available, these seem aimed at memory-intensive workloads for now, as Cisco already offer a four-socket blade in a single full-width form factor in the previous-generation B420 M3. However, with support for the 15-core Intel E7 v2 48xx and 88xx series CPUs, you can squeeze up to 60 cores onto the B460 M4, up from 48 in the B420 M3. For my money, if you’re looking to take up two full-width bays in your blade chassis, you’re probably better off going for a C-series rack-mount option. Or, depending on your requirements and timeframe, hang out for a B420 M4 (or whatever its M4 4-socket, single full-width successor may be).

More M4s

This is the first batch of the UCS M4 generation, supporting 4-socket CPU models. Given Intel’s annual developer forum IDF14 is a few weeks away in September, we can probably expect some of the announced chips including the E5 v3 Haswells to make their way into Cisco’s existing 2-socket models such as the B200 and C220 (and other server vendor line-ups) very soon, though I won’t speculate on specific timeframes.

More Cores

Back to that Scalability Connector, I wonder also if engineering are working on the capability to combine two E7-88xx four-socket systems to max out in an 8-way behemoth. That’s 120 cores and 240 threads on a single system. The CPUs support it, and it would certainly be far more attractive to buy a 4-socket system to meet your immediate requirements but still have the option to later expand into an 8-socket configuration if and when the requirement for the additional horsepower arrives. At that end of town, cost is prohibitive enough without also having to cater for future growth requirements for your application in your day-1 capex.
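
For anyone wanting to check my arithmetic, here is a quick Python sketch of the core and thread counts quoted in this post. The helper is hypothetical; it assumes Hyper-Threading at two threads per core, and that the B420 M3's 48 cores come from four 12-core CPUs.

```python
def core_count(sockets, cores_per_socket, threads_per_core=2):
    """Return (cores, threads) for a system.

    Assumes Hyper-Threading is enabled at two threads per core.
    """
    cores = sockets * cores_per_socket
    return cores, cores * threads_per_core


print(core_count(4, 12))  # B420 M3, assuming 12-core CPUs: (48, 96)
print(core_count(4, 15))  # B460 M4 with 15-core E7 v2: (60, 120)
print(core_count(8, 15))  # the speculative 8-socket system: (120, 240)
```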

That said, with the obscene number of cores Intel are cramming onto a single die these days (we’re up to 18 on the 2-socket E5-2699 v3 @ 2.3GHz, in case you’re not following along at home), most workloads can probably be catered for with a 2-socket system, especially in general purpose virtualisation.

 

vCNS/vShield Edge 5.5.2 SSL-VPN+ and Mac OSX bug

I’m sure this issue doesn’t affect a lot of people, but it’s caused me considerable grief and there is precisely zero information available on the web, so here we go:

 

Symptoms

  1. Affects Mac OSX clients connecting to an SSL VPN+ service on a vShield Edge 5.5.2 appliance. 5.1.2 is not affected.
  2. The Edge system VM exhibits high CPU usage, with one or more cores flatlining at 100%.
  3. Client VPN connectivity either completely fails (disconnects, with no reconnect) or is unable to pass traffic.
  4. The ‘sslvpn’ process within the Edge gateway flatlines at 100% CPU (or multiples thereof, I assume depending on the number of sessions triggering the bug), identified with “show process monitor” on the Edge console.
  5. Eventually the ‘sslvpn’ process dies, all sessions drop, and the sslvpn process restarts logging the following:

DEBUG :: C_ServiceControl :: GetServerStatus: sslvpn is DOWN
config: [daemon.err] ERROR :: C_ServiceControl :: sslvpn not running, start sslvpn again

After this the sslvpn process restarts and CPU usage returns to normal until the bug is triggered again.
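
If you want to know how often you're hitting this, counting the restart message is a cheap indicator. Below is a minimal Python sketch; it assumes you already have the Edge's log lines available as text (for example from a syslog server), and the function name is my own. Only the marker string comes from the log excerpt above.

```python
import re

# Marker taken from the log excerpt above. How the lines are collected
# (syslog forwarding, copy and paste, etc.) is left as an assumption.
RESTART_MARKER = re.compile(r"sslvpn not running, start sslvpn again")


def count_sslvpn_restarts(lines):
    """Count how many times the sslvpn service was restarted."""
    return sum(1 for line in lines if RESTART_MARKER.search(line))


sample = [
    "DEBUG :: C_ServiceControl :: GetServerStatus: sslvpn is DOWN",
    "config: [daemon.err] ERROR :: C_ServiceControl :: sslvpn not running, start sslvpn again",
]
print(count_sslvpn_restarts(sample))  # 1
```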

 

 Fix

This issue is fixed in vCNS 5.5.2.1; however, it was a late addition to the release so it is not mentioned in the release notes. I will post an update or a link to the KB article when I see one.

 

My VCAP-CIA Exam Experience

In December I sat and passed my VMware Certified Advanced Professional – Cloud Infrastructure Administration / VCAP-CIA exam. I have recently sat and passed two other VCAP exams – Datacenter Design (VCAP5-DCD) and Datacenter Administration (VCAP5-DCA). Obviously, being a live lab exam, the exam environment is very similar to the DCA but focuses on the vCloud suite of products. The exam is based on vCloud 5.1 and you are provided with the product documentation.

Tips

1. Expect UI latency and jitter

This is a bit of a negative way to start, but it really defined the whole experience for me so it’s good to be prepared for it. The latency and jitter were appalling. Trying to interact with the UI was at best incredibly frustrating and at worst almost impossible. Sometimes clicking on a UI element or typing would take 2-3 seconds to register, sometimes 6-7, sometimes more, so it’s impossible to even anticipate the lag and work with it.

During that time, I was sitting there waiting for feedback, not knowing whether what I typed next would land in a text entry field or trigger some browser or application shortcut. Good luck arrowing back to fix a typo. At one point the whole environment froze for 2-3 minutes before the connections timed out and had to be re-established. This not only significantly reduced my effectiveness at the exam, it really soured the experience for me. Just sitting there watching the individual tiles progressively render across the screen, you feel helpless and frustrated.

I was (as always) aiming for a perfect score, but only achieved 344 out of a maximum of 500. Good enough to pass, but I’ll never know what my score could have been. I took the exam in Perth, Western Australia, and assume that at least part of the problem is my geographic distance from where the lab is hosted, but VMware assure me that the problem is random and can affect one testing centre just as much as any other anywhere in the world. I remain unconvinced, as I’ve had the same problem three out of three times: in both of my DCA exams and now again in the CIA. On my first DCA attempt, I assumed there was something severely wrong with the exam setup and the local testing centre personnel agreed with me, so we aborted. The second time was no better.

If this happens to you my advice is to just deal with it and try to finish the exam. Once you’re done, lodge an incident with Pearson and with VMware, who should offer a free re-sit if the problem resulted in exam failure.

Update 2014-07-10 Joshua Andrews @SOSTech_WP has replied on Twitter with the following advice: 

FYI if you have bad performance lodge a support request during the exam – PVue can add time plus it bolsters your case…. Also, if you leave a .txt file on the desktop for any exam but DCA510 we can get it later.

2. Time management

This might seem obvious but time is definitely your enemy in the exam. There are 210 minutes in which to perform 32 lab activities, so you can schedule about 6.5 minutes per question. Many questions were incredibly easy, taking only a minute or so, but quite a few took a lot longer than that.

I used the time before the exam to write the numbers 1-32 down the page on my notepad, with room for writing notes. As I passed each question, I’d either cross it off or for partially completed questions write a note on what was left to do in case I had time to come back later. You are able to go back and forward between the questions, so it is possible to skip over questions and come back to them later if you need to. Unfortunately you can only go back or forward — there is no index — and due to the latency issues described above this takes about 15-20 seconds per question so going back to the start could take up to ten minutes… which you don’t have up your sleeve!
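
Here is the time budget worked through as a quick Python sketch. The exam length, question count and 15-20 second navigation delay are from my experience above; the helper function is just for illustration.

```python
EXAM_MINUTES = 210
QUESTIONS = 32

# Average time available per question if none is spent navigating.
per_question = EXAM_MINUTES / QUESTIONS
print(f"{per_question:.1f} minutes per question")


def navigation_minutes(steps, seconds_per_step):
    """Minutes lost stepping through questions one at a time."""
    return steps * seconds_per_step / 60


# Going from the last question back to the first is 31 steps.
best = navigation_minutes(QUESTIONS - 1, 15)
worst = navigation_minutes(QUESTIONS - 1, 20)
print(f"{best:.1f} to {worst:.1f} minutes to get back to the start")
```

In other words, a single trip back to question 1 costs more than the average time available for an entire question.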

3. Lab it lab it lab it

There really is no substitute for using the technology in the wild, to achieve specific real-world business outcomes. I think this prepared me incredibly well for the exam, but I still spent a lot of time in my lab rehearsing actions and learning how to achieve unfamiliar tasks. For my lab I relied exclusively on the excellent AutoLab available at http://professionalvmware.com/2012/05/vsphere-5-autolab/. Using AutoLab and VMware Workstation let me build and destroy a vCloud environment automatically, as well as install and configure additional vCloud cells and prepare ESXi hosts with the vCloud agent and VXLAN module manually as necessary.

4. Use The Blueprint

This should go without saying but I will anyway – the blueprint is king! I focused all of my study around each of the knowledge items in the blueprint. I took the approach that I have taken on my other two VCAPs by creating a simple spreadsheet of each blueprint item including a column to progressively grade myself on my ability:

  • A: No problem
  • B: Needs work
  • C: No idea

This allows me to study iteratively, slowly turning all of those “C”s into “B”s as I learn and perform the tasks in my lab. Eventually, I can filter out the “A”s until I’m left with a handful of “B”s and, god forbid, maybe a few “C”s to study.
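
I did this in a spreadsheet, but the same idea can be sketched in a few lines of Python. The items below are topics mentioned in this post; the grades are made up for illustration.

```python
# Hypothetical tracker rows: blueprint item plus a self-assessed grade.
# A: No problem, B: Needs work, C: No idea.
blueprint = [
    {"item": "Generate vCloud Director response files", "grade": "A"},
    {"item": "Set up vCloud Director transfer storage space", "grade": "B"},
    {"item": "vApp fencing", "grade": "B"},
    {"item": "Chargeback", "grade": "C"},
]


def still_to_study(rows):
    """Filter out the "A"s and list the weakest items first."""
    remaining = [row for row in rows if row["grade"] != "A"]
    # "C" sorts after "B", so reverse to put "No idea" items at the top.
    return sorted(remaining, key=lambda row: row["grade"], reverse=True)


for row in still_to_study(blueprint):
    print(row["grade"], row["item"])
```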

I was still caught out by topics that I thought I knew back to front, but really struggled with during the exam, for example vApp fencing. I was also unable to really spend any time studying Chargeback as this product is not used in my production environment, and requires extra setup that hasn’t been automated yet. My study was restricted to VMware’s demo video which really wasn’t enough, but luckily it doesn’t comprise a large proportion of the blueprint.

5. Don’t rely on the provided documentation

Yes, you have at your fingertips all of the installation, configuration and administration guides you need to use the products. Given days or weeks, you could probably make use of them to complete the exam from a position of ignorance. However you really do not have time to reference any of the documentation, and given that it is in a different area of the interface, you’re wasting precious moments by context switching back and forth. Even if there was no UI latency, I’d still recommend against it – you don’t want to be learning how to do something while the timer is ticking.

6. Do you even UNIX?

While most of the blueprint is driving the vCloud application itself, it runs on a RedHat or CentOS operating system and many of the tasks require getting dirty on the command line. I won’t give anything away by saying that these and more blueprint items require some Linux experience:

  • Generate vCloud Director response files
  • Add vCloud cells to an existing installation using response files
  • Set up vCloud Director transfer storage space

 

I found a couple of questions didn’t include enough information to understand what the solution should look like. Unfortunately, as you’re prohibited from taking notes out of the exam room, I have forgotten what these were so haven’t reported them. It could also be that my own knowledge failed me. I am really impressed with the vCloud products and enjoy using them on a day to day basis, and studying for this exam let me brush up on a number of concepts that I either hadn’t covered before or only briefly. Overall studying for and sitting the exam was a very positive experience that has left me better equipped to perform my duties. However, I am very unimpressed that the latency issues I’ve now experienced three times in the past six months are yet to be addressed.

The exam result took eighteen days to be processed, but this was over the US Thanksgiving weekend. Eventually I received the email:

Luke,

Congratulations! You have completed all of the required tasks and have now earned the following certification:

VMware Certified Advanced Professional – Cloud Infrastructure Administration

 

Next up – VCAP-CID!