


VAX Virtualization Explored under Charon


 “Hey… you know that plant of ours in Europe… the one with all of the downtime?”

“Sure…”

“Did you know it runs on a 30-year-old VAX that we’re sourcing parts for off of Ebay?”

“Really?! … I guess that makes 4 plants that I know of in the exact same situation!”

That conversation, or one very much like it, is being had at thousands of publicly traded companies and government organizations around the world. If you’re a sysadmin, a VMware resource, or a developer who got their start anytime in the x86 era, you’ll be forgiven if the closest you’ve come to hardware from Digital Equipment Corporation (DEC)/HP is maybe an Alpha/NT box somewhere along the line. You’d also be forgiven for assuming that VAX hardware from the 1970s doesn’t still run manufacturing lines that produce millions of dollars in products a year.

But that’s exactly what’s happening.

… And so is the Ebay part of the equation.

To hear the Alpha folks talk, those old platforms were bulletproof and would run forever. Perhaps not in exactly the same way that large swaths of the banking industry still run on COBOL, but it’s an apt comparison. The biggest difference is that code doesn’t literally rust away, while hardware eventually does. The DEC/HP Alpha hardware is engineered to something like Apollo-era reliability standards… but while they stopped flying Saturn Vs 40 years ago, these VAX machines are still churning away. Anyway, there’s a joke that goes something like this… you know how some sysadmins used to brag about their *nix system uptimes being measured in years (before Heartbleed and Shellshock)?

Well, VAX folks brag about uptimes measured in decades.

Crazy, isn’t it?

You might be sitting there asking yourself how we got to this situation. In simple business terms… if it ain’t broke (and you can’t make any extra margin by fixing it), don’t fix it!

I know lots of IT folks have a tendency to think in 1-3 year time spans. I get it. We like technology, the latest gadgets, and we sometimes have an unfortunate tendency to argue about technica obscura. But that’s only really because “technology moves so fast,” right? Yes, there’s Moore’s law, and the cloud, and mobility, and all of that stuff. Yes, technology does move fast. But business… business doesn’t really care about how fast technology moves beyond the context of whether it can benefit them. In other words, you use assets for as long as you can extract value from them.

That’s just good business.

What’s the objective of this project?

The primary objective is to mitigate risk – the risk that a critical hardware failure will occur and take production off-line for an indeterminate amount of time. Secondary objectives include modernizing the solution, improving disaster recovery capabilities, eliminating proprietary or unsupported code, and cleaning up any hidden messes that might have collected over the years.

Put differently, the question really is – can we virtualize it and buy some more time, or do we need to re-engineer the solution?

Starting with a quick overview of the project in question… The CLI looks vaguely familiar, but requires a bit of translation (or a VAX/VMS resource) to interact with it. Starting lsnrctl returns an Oracle database version… which, unfortunately, several searches return precisely zero results for. Under-documented versions of Oracle are always a favorite. Backups to tape are reportedly functioning, and there’s also a Windows reporting client GUI (binary only, of course) from a long-defunct vendor. The good news this time around… the platform is apparently functional and in a relatively “good” working state. The bad news… there is no support contract for anything. Not for the hardware, not for Oracle, and certainly not for the Windows reporting client. In this case, the legacy VAX is basically a magical black box that just runs and gives the customer the data they need. And at this point, all institutional knowledge beyond running very specific command sets has been lost – which isn’t atypical for 20-30 year old platforms.

Which brings us to the question – virtualize, or re-engineer?

Virtualizing a VAX

To start with, most VAX/VMS operating systems are designed for specific CPU types, so virtualizing directly using something like VMware or Hyper-V is a non-starter. But those CPU architectures and configurations are pretty old now. Like, 20-30 years old. That makes them candidates for brute-force emulation. And there are a few choices of emulator out there… including open-source options like SIMH and TS10, as well as commercial solutions like NuVAX and Charon. After doing a bit of research, it was pretty clear that there is only one leading commercial offering for my use case… Charon, from a company called Stromasys. While there may be merit in exploring open-source alternatives further, the reality is that the open-source community for VAX system development isn’t active in the same sense the Linux OS community is active. So if you do go down the open-source path, keep in mind that some of the solutions aren’t going to be able to do what you might think of as simple and obvious things… like, say, boot OpenVMS. Which is pretty limiting.

Charon Overview

Aside from the Greek mythology reference to the ferryman who transported the dead across the river Styx, Charon is also a brand name for a group of products (CHARON-AXP, CHARON-VAX) that emulate several CPU architectures, covering most of the common DEC platforms. You know… things like OpenVMS, VAX, AlphaServer, MicroVAX 3100, and other legacy systems. Why the name Charon? Like the mythological boatman who, for a price, keeps the dead from being trapped on the wrong side of the river (e.g. old failing hardware), Charon transports the legacy platform unchanged between the two worlds (legacy and modern). In the same manner that running a P2V conversion on, say, Windows NT lets you run a 20-year-old Windows asset under vSphere ESXi, Charon lets you run your legacy VAX workloads unchanged on modern hardware. In other words, you can kind of think of Charon like a P2V platform for your legacy VAX/VMS systems. Of course, that’s a wildly inaccurate way to think about it, but that’s effectively the result you get.

How does Charon Work?

Charon is an emulator… its secret sauce is that it does the hard work of converting instructions written for a legacy hardware architecture so that they run on an x86/x64 CPU, and it does so quickly and reliably. Because Charon enables you to run your environment unchanged on the new hardware, not only do you get to avoid the costly effort of re-engineering your solution, but you can also usually avoid the painful effort of reinstalling your applications, databases, etc. Under the hood, what Charon is essentially doing is creating a new hardware abstraction layer (HAL) that sits on top of your x86/x64-compatible physical or virtual hardware. The Charon emulator creates a model of the DEC/HP Alpha hardware and I/O devices. Once you have the Charon emulator installed, you have an exact working model on which you can install your DEC/HP/VMS operating system and applications. Charon systems then execute the same binary code that the old physical hardware did. Here’s what the whole solution stack looks like mashed together:

[Diagram: the full Charon solution stack, from the legacy OS and applications down to the physical hardware]

Yes, lots of layers. But even so, because of the performance difference between the legacy platform and modern hardware, you typically get a performance boost in the process.
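
To make the brute-force emulation idea a bit more concrete, below is a toy fetch-decode-execute loop in Python. This is purely illustrative – the opcodes and structure are invented and have nothing to do with Charon’s actual internals – but it shows the basic trick every emulator relies on: read the guest’s instructions one at a time and reproduce their effect on the host.

    # Toy emulator sketch (illustrative only -- not Charon's design).
    # Each "instruction" is a tuple: (opcode, operand_a, operand_b).
    def run(program, memory, registers):
        pc = 0                                # guest program counter
        while pc < len(program):
            opcode, a, b = program[pc]
            if opcode == "LOAD":              # registers[a] = memory[b]
                registers[a] = memory[b]
            elif opcode == "ADD":             # registers[a] += registers[b]
                registers[a] += registers[b]
            elif opcode == "STORE":           # memory[b] = registers[a]
                memory[b] = registers[a]
            elif opcode == "HALT":
                break
            pc += 1
        return memory

    # "Guest" program: memory[2] = memory[0] + memory[1]
    mem = [2, 3, 0]
    run([("LOAD", 0, 0), ("LOAD", 1, 1), ("ADD", 0, 1),
         ("STORE", 0, 2), ("HALT", 0, 0)], mem, [0, 0])
    print(mem)                                # -> [2, 3, 5]

Real emulators like Charon obviously go far beyond this – full CPU and I/O device models, plus a lot of performance work – but conceptually this is the layer that lets decades-old VAX binaries run on an x86 host.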

What do I need?

Assuming you have a running legacy asset that’s compatible with Charon, all you need is a destination server.  In my case, the customer had an existing vSphere environment, and existing backup/recovery capabilities, so all that was really needed was an ESXi host to run a new VM on, and the licensing for Charon.

The process at 30,000 feet looks like this:

  1. Add a new vSphere (5.5x) host
  2. Deploy a Windows 2008 R2 (or Linux) VM from a template
  3. Use image backups to move your system to the VM
  4. Restore databases from backup
  5. Telnet into your Charon instance

At a high-level, it really is that simple.

How challenging is the installation? 

If you skim the documentation before installing, it shouldn’t be an issue. Assuming you have access to the legacy host, you can take an inventory of the legacy platform in order to get the right Charon licenses… you basically need to grab a list of things like CPU architecture, OS version, tape drive type, etc. (e.g. SHO SYS, SHO DEV, SHO LIC, SHO MEM, SHO CLU, etc.). After that, you’ll be ready to step through the installation. This isn’t a next/next/finish setup, but once you’ve got the USB dongle set up and created a VM based on the recommended hardware specifications, you’re well on your way.
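
If the legacy host happens to be reachable over telnet, you can capture that inventory into a file instead of transcribing it from the console. The sketch below uses Python’s telnetlib (standard library, though removed in Python 3.13); the address, account, and prompt strings are placeholders you’d adjust for your system, and none of this is required by Charon – a console session and a notepad work just as well.

    # Hypothetical inventory capture over telnet -- host, credentials, and
    # prompts below are placeholders; adjust for your own environment.
    from telnetlib import Telnet

    COMMANDS = [b"SHOW SYSTEM", b"SHOW DEVICE", b"SHOW LICENSE",
                b"SHOW MEMORY", b"SHOW CLUSTER"]

    with Telnet("192.0.2.10", 23, timeout=15) as tn, open("vax_inventory.txt", "wb") as log:
        tn.read_until(b"Username: ")
        tn.write(b"SYSTEM\r\n")                          # placeholder account
        tn.read_until(b"Password: ")
        tn.write(b"not-my-real-password\r\n")
        tn.read_until(b"$ ")                             # wait for the DCL prompt
        for cmd in COMMANDS:
            tn.write(cmd + b"\r\n")
            log.write(tn.read_until(b"$ ", timeout=10))  # capture up to the next prompt
        tn.write(b"LOGOUT\r\n")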

Restoring the data from the legacy hardware onto the new VM can be a bit more involved. In a perfect world, you’d be able to restore directly from tape into the new Windows VM – assuming you have the right tape drive, good backups to tape, etc. Short of that, you’ll need to back up and restore the legacy drives into the new environment. So you’re going to take image backups of each drive, and then upload the backups to your new VM. More specifically, do a backup from drive A0 to A1, then A1 to A2, etc. Upload the A0 backup to your new VM and restore the data. Proceed like that until you’ve completed all of the restores. In this manner, you’ll be able to preserve your operating system, database installation, and any other applications without going through the time-consuming installation and configuration process. As a result, you avoid troublesome things like version mismatches, missing media, poor documentation, etc. After the backups are restored, Charon is able to take those restored files that exist on the parent VM and boot them as local storage – and you’re off and running.
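
One small addition of my own that isn’t part of the Charon procedure: since those image backups are doing all of the heavy lifting, it’s worth checksumming each backup file before and after the upload to the new VM, so a corrupted transfer doesn’t surface later as a mysterious restore or boot failure. If your staging copy lives on a Windows or Linux box, a few lines of Python will do it (on the VMS side you’d use whatever checksum tooling is available there); the filename below is a placeholder.

    # Compute a SHA-256 of a backup image; run it on both ends of the copy
    # and compare the hashes.
    import hashlib

    def sha256_of(path, chunk=1024 * 1024):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    print(sha256_of("DKA100.BCK"))    # placeholder backup/saveset filename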

What does Charon look like?

After you’ve installed Charon, the management interface is accessible via the system tray.

[Screenshot: the Charon management interface, accessed from the system tray]

If you’re thinking that’s pretty bare-bones, then you’re right.  Once you’ve installed and configured Charon, there’s just not a lot to do from the management interface.

How do I login to the console?

In order to access the legacy OS console and CLI, you simply fire up your favorite telnet client and point it at the IP address of your Charon system.

[Screenshot: the legacy console under Charon, accessed via telnet]

Which should resemble the old physical console.

Issues Encountered 

While the 30,000-foot process outlined earlier in the article is essentially what was followed, the biggest problem we ran into is probably exactly what you’d guess it to be: the Oracle database. Unsupported and under-documented as it was, we hit several problems restoring successfully from tape. While not a problem with Charon, the reality is that very old and unsupported platforms can have problems that go undetected for years. While this was resolved within the planned budget, it was still inconvenient. And it should serve as a reminder that, to the extent it’s possible to have a support contract in place for critical components, you should. At the same time, that’s not always the boots-on-the-ground situation.

The verdict?

We successfully mitigated the risk associated with the failing hardware in the environment, which was our primary objective. Using Charon, we were able to pull the legacy environment forward, running it on new, supported hardware. Between re-installing the legacy OS under Charon and restoring our application data and backups via tape, we were able to meet some of the secondary objectives as well. As a Windows 2008 R2 VM running on a dedicated vSphere host in the customer’s datacenter, we have something modernized (at least to some extent) that plugs into the existing backup infrastructure. With Veeam Backup and Replication and a standard backup policy with standard RPO and RTO objectives, we have something that the client has a high degree of confidence in.

vSphere: VM Stuck during Power down at 95%

Occasionally I’ve run into a VM that gets stuck at 95% while powering down (or during a vMotion). I know the issue isn’t unheard of, but I didn’t run into it until working with a few ESXi 4.0.0 208167 servers. So – if you have a virtual machine that hangs while shutting down, you’re certain you’re not just waiting for it to finish powering down, and you’ve already tried to “power off” from the client but the power-off command is stuck at 95%, you may have to manually kill the hung VM.

[Screenshot: VM power-off task stuck at 95%]

  1. Log in to the host with the hung machine via SSH (enable SSH if you haven’t already)
  2. Do a /sbin/services.sh restart (or service mgmt-vmware restart on classic ESX)… which is the same thing as restarting the management agents from the ESXi console
  3. This command will restart the agents that are installed in /etc/init.d/ … including hostd, ntpd, sfcbd, sfcbd-watchdog, slpd and wsmand (and the HA agent if you have it)
  4. When you do this, the VI/vSphere client will lose connectivity as those services restart, but VMs that are running will not be affected
  5. After the services have restarted, you can re-connect via the VI client.
  6. Via SSH, go to the right datastore directory (such as /vmfs/volumes/DatastoreName/VMname), and delete (rm) the *.vswp file (the swap file).
  7. If you can’t delete it, and you’re getting an error message to the effect of “cannot remove: device or resource busy”, go find the processes associated with the VM.
  8. ps auxfww | grep "vmname"
  9. kill -9 ProcessIDNumber
  10. After doing so, remove the orphaned VM from inventory… just right-click the “unknown” VM and select “Remove from Inventory”, being careful not to delete it.
  11. Then delete the *.log and *.0* files. If you don’t, re-adding the VM may cause the interface to hang, and you’ll have to go through some of this all over again.
  12. Add the VM back to the inventory, and you should be able to start the VM.

I have run into a situation once where a host reboot was the only way to solve the problem. But other than that, this seems to be quite effective. The short version – see steps 8 & 9, sketched below.
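
For the curious, here’s a rough sketch of steps 8 and 9 driven from an admin workstation rather than an interactive SSH session. It assumes the third-party paramiko library, and the host, credentials, and VM name are placeholders; it only prints the matching processes and leaves the actual kill -9 to you once you’ve confirmed the right PID (grep can match more than you intend, and the PID column position varies between ESX/ESXi versions).

    # Hypothetical helper for steps 8 & 9: list processes matching the hung VM's name.
    # Requires: pip install paramiko. Host, credentials, and VM name are placeholders.
    import paramiko

    HOST, USER, PASSWORD, VMNAME = "esxi01.example.com", "root", "not-my-real-password", "vmname"

    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(HOST, username=USER, password=PASSWORD)

    _, stdout, _ = ssh.exec_command('ps auxfww | grep "%s" | grep -v grep' % VMNAME)
    print(stdout.read().decode())     # eyeball the output and note the hung VM's PID

    # Once you're sure of the PID, finish step 9 by hand (or uncomment and fill in):
    # ssh.exec_command("kill -9 <ProcessIDNumber>")

    ssh.close()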

Building a 2-Node ESXi Cluster with Centralized Storage for $2,500

What started as a simple goal… replacing my vSphere 4.1 whitebox with something that more closely resembles a production environment… became a design requirement for a multi-node ESXi lab cluster that can do HA, vMotion, DRS, and most of the other good stuff – but without having to resort to nested ESXi. And I wanted an iSCSI storage array that was fast. Not Synology NAS “kind of” fast under certain conditions… but something with SSD-like performance, and a bunch of space. And I didn’t want to spend more than $2,500 for everything. In other words, what I really wanted was something akin to a 2-node ESXi cluster with a SAN backing it with 10TB of “fast” disk I/O. Almost something like a Dell VRTX, but for my home lab. And I wanted to spend less than one-tenth of what it might cost with Enterprise-grade gear. As I iterated through hardware configurations, checking the vSphere whitebox forums, contrasting against the HCL, and running out of budget quickly, it became clear that the only way to do this was to figure out the storage piece first.

Key Design Decision

So the question became: do I really want to bother with having a “real” iSCSI storage array? If not, I could opt for multiple local SSD drives behind a hardware RAID and reconsider the need for a dedicated storage array entirely – or make some other compromises. Sure, I would have had more IOPS than I would have known what to do with, and yeah, maybe I could have resigned myself to living in a nested ESXi world, but no. As it turned out, someone had already done some good heavy lifting on this topic and had built a fast homemade iSCSI storage array with hardware RAID for $1,500. No VAAI of course, but still… not bad for $1,500. More than that though, it left $1,000 to put together a couple of hosts – so plenty, right?

Critical Path item – Inexpensive Storage

If the storage array as a build component was the biggest overall challenge, then raw storage was the critical-path item in terms of procurement. In order to come out with about 11TB of usable capacity, I needed 14 x 1TB drives in a RAID6, and I needed to spend less than $700. Depending on when you read this article, that may be less of a challenge than it was in early 2014, but at the time $50 per TB on a 7200RPM drive was hard to come by. Harder still was finding Hitachi Ultrastar 7200RPM drives with 32MB of cache – they’re Enterprise-grade and not always around on Ebay. After a couple of months bidding on large lots, I eventually found 14 for $50 each. Given no schedule constraints, I could perhaps have ended up with 2X as much storage for a reasonable cost premium – but 11TB of usable space in a RAID6 configuration exceeded my need, and kept me on pace to acquire most of the hardware on schedule.
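
For anyone double-checking the math on that capacity figure: RAID6 gives up two drives’ worth of space to parity, and drive vendors quote decimal terabytes while the OS reports binary ones, which is how 14 x 1TB lands at “about 11TB” usable. A quick sanity check (the drive count and size are just this build’s numbers):

    # RAID6 usable capacity: (N - 2) drives' worth of space survives as usable.
    drives, size_tb = 14, 1.0                     # this build: 14 x 1TB Hitachi Ultrastars
    usable_tb = (drives - 2) * size_tb            # 12 TB in vendor-decimal terabytes
    usable_tib = usable_tb * 1e12 / 2**40         # what the OS will report, in TiB
    print("%.0f TB decimal = %.1f TiB" % (usable_tb, usable_tib))   # 12 TB = 10.9 TiB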

Controller

[Image: LSI MegaRAID 84016E controller]

The LSI MegaRAID 84016E is, albeit last-generation, a SAS/SATA II workhorse of a controller card. It supports up to 16 drives; RAID levels 0, 1, 5, 6, 10, 50, and 60 at 3Gb/s per port; a battery backup module; and online capacity expansion – and it’s dirt cheap on Ebay and nearly always available. For my use case, it compared favorably against going the local SSD route, or using an LSI 9260-4i, which costs 4X as much. If you’re copying the build, be sure to pick up four Mini SAS (SFF-8087) male to SATA 7-pin female cables so that you can plug your SATA drives into the LSI 84016E.

Storage Array: Everything Else

This section could almost be called “everything else,” because by now we’ve spent about 31% of the budget and many of the remaining decisions are mostly inconsequential. Still, I opted for a Rosewill RSV-L4500, because I knew it would fit all 14 drives (and it does, and the design isn’t bad at all – easy enough for me to work in). I did add a spare SSD drive as an OS and Level-2 cache drive for PrimoCache (though the SSD was scavenged from another box). For CPU, memory, and power supply, I went with an ASRock 970 Extreme3 ($65), an AMD FX-8320 8-core CPU (because it was on sale for about $105), 32GB of the lowest-cost DDR3-1600 ECC RAM I could find, and a 750W Corsair HX750 power supply to feed all of those drives. For the network interfaces, I had a spare Intel 2-port NIC lying around that I had picked up in a lot sale a couple of years ago.

Making the Storage Array Useful

[Image: Rosewill rackmount server chassis]

There are a number of ways you can go with this: Nexenta, Microsoft Storage Spaces, OpenIndiana, or FreeNAS, to name a few. But I really wanted to take a look at StarWind’s iSCSI SAN software combined with PrimoCache. Short version… you install Windows 2008R2/2012R2, configure the LSI controller software (MegaRAID), add StarWind’s iSCSI SAN software and carve out some storage to expose to ESXi, then add PrimoCache and configure it to use as much RAM as you can spare for the Level-1 cache (26GB) and as much SSD space as you can for the Level-2 cache (120GB). In my usage scenario, this seems to work pretty well – keeping the disks from bogging down under I/O.

vSphere ESXi Cluster

With the remaining budget, I still needed to build two vSphere ESXi boxes. As I started looking, I found it really challenging to come up with a lower-cost (and still good) build than the ASRock 970 Extreme3, an AMD FX-8320 3.6GHz 8-core processor, 32GB of RAM, an ATI Rage XL 8MB, a Logisys PS550E12BK power supply, a spare Intel Pro/1000, and a 16GB USB stick. So I built two more machines – installing vSphere ESXi 5.1 on the USB sticks, mounting the iSCSI volumes that I exposed from StarWind, and building out my vCenter box and templates.

Compromises to hit budget

In order to stay around $2,500, there were a few small compromises I had to make. The first was the hard drives… I couldn’t wait for the 2TB Hitachi Ultrastar drives. While not significant for my use case, it is nonetheless noteworthy. Secondly, I ran out of budget to put a full 32GB of RAM in both of the hosts, so I have a total of 48GB across the two nodes instead of the 64GB I had hoped for. Finally, for the hosts, I bought the lowest-cost cases I could find – $25 each. Aside from scavenging a few parts that I had lying around (an SSD drive, a few Intel Pro/1000s), I managed to come in on budget.

Bottom Line

The project met all of my goals – a home-lab, multi-node ESXi cluster with a dedicated iSCSI storage array that resembles a production environment – all on a budget of around $2,500. I’m able to vMotion my VMs around, DRS is functioning, and Veeam Backup and Replication is working. Better still, I can tear down and rebuild the environment pretty quickly now. I didn’t really run into any show-stoppers per se, or real problems with the build. If there’s interest, I’ll post some additional information about the lab in the future. A big thanks to Don over at The Home Server blog for his work on Building a Homemade SAN on the Cheap, particularly in validating that you can actually buy decent drives in large lots on Ebay at a discount, as well as for the motherboard recommendation, which was critical in hitting the budget.
