Robert McLaws: Windows Edition

Blogging about Windows since before Vista became a bad word

Back Online (Mostly), And My Hyper-V Nightmare

Rumors of my demise have been greatly exaggerated.

Well folks, it's been a pretty crappy 10 days. Late on the night of January 27th, I signed into one of my virtual web servers to get a new client site reconfigured. To my horror, the server was completely offline. My datacenter discovered that the hard drive had failed, the first time that had happened to one of my servers in 5 years.

I was able to get the drive transferred to my working server, and started the process of recovering the data. I hadn't made a backup since moving to Hyper-V (dumb, yes I am well aware of that), so I had to get the virtual hard drives (VHDs) back. Well, as the drive failed and Hyper-V came crashing to a halt, it removed all traces of the Virtual Machine that hosted all of my client sites. In addition, it deleted the snapshot of the server that runs Windows-Now, hence some of the broken images that are coming up at the moment (more on that shortly).

The next several days were extremely frustrating, as I learned a very hard lesson about deploying beta software in the wild. For example, in Hyper-V, "snapshots" are completely misleading. They are not full backups of the VHD file, like one might expect. No, instead, they are like hybrid differencing/undo disks... which would be all well and good if the documentation explained that, but it doesn't. The problem with that is, the changes to the VHDs are not committed unless you explicitly do so. So in the event of a catastrophic failure, you're basically screwed, because you think all the changes are on the original VHD, but they're actually split up and hidden away in an AVHD file in some ungodly GUID directory.
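To make that concrete, here's a toy model of how a differencing disk behaves. This is only a sketch under my own assumptions: the class names (`BaseDisk`, `DifferencingDisk`) and the block-number dictionary are made up for illustration and have nothing to do with the real VHD/AVHD on-disk format or any Hyper-V API. It just shows why the parent file doesn't contain your recent changes until a merge happens.

```python
class BaseDisk:
    """Toy stand-in for a parent VHD: block number -> bytes (hypothetical model)."""

    def __init__(self, blocks=None):
        self.blocks = dict(blocks or {})

    def read(self, n):
        # Blocks that were never written read back as a zero byte.
        return self.blocks.get(n, b"\x00")


class DifferencingDisk:
    """Toy stand-in for an AVHD: holds only blocks changed since the snapshot."""

    def __init__(self, parent):
        self.parent = parent
        self.changed = {}

    def write(self, n, data):
        # Writes land in the child only; the parent is left untouched.
        self.changed[n] = data

    def read(self, n):
        # Prefer the child's copy of a block, otherwise fall through to the parent.
        if n in self.changed:
            return self.changed[n]
        return self.parent.read(n)

    def merge(self):
        # Commit every changed block into the parent, then empty the child.
        self.parent.blocks.update(self.changed)
        self.changed.clear()


base = BaseDisk({0: b"boot", 1: b"old site"})
snap = DifferencingDisk(base)
snap.write(1, b"new site")
snap.write(2, b"client data")
```

In this model, reading through `snap` returns the new data, but reading `base` directly still returns the old data until `snap.merge()` runs. If all you recover after a crash is the parent, everything written after the snapshot is missing from it.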

Fortunately, I figured out that you can rename some files and re-merge the hard drives together offline, which saved my butt for a couple of the VHDs. (I'll explain this process in a later post.) A week after the servers went down, and 140 man-hours later, I had my secondary server running everything... for about an hour. When attempting to restart the VM that runs this site, the entire hard drive on that server went kaput. Yep, you read right, I lost another drive. The drive geometry was totally out of whack. I went to recover the data, and again, Hyper-V files were missing. These were Seagate Barracuda ES drives too, which have always been rock-solid for me in the past.

I feel I need to stress this point: I hadn't lost a single hard drive in 5 years, and then I lost 2 inside of a week. The wonderful people at ServerBeach have all but ruled out faulty hardware as the cause, which leads me squarely to the virtualization solution I was using.

Where We're At Now
My server environment is now nearly completely operational. I'm running everything on the Windows Server 2008 RTM bits, along with a new RAID array for each physical box, and I'm also backing up my VHDs and my web files using JungleDisk, which is a really awesome tool and super cheap. They also have a decent WHS add-in that I believe is a must-have for any WHS user.

So all of the important sites are back up, and I'll be restoring the lesser sites in the next 24 hours. And now I might actually be able to get some decent sleep tonight.

But now I'm concerned about the decision that I made to move to Hyper-V. I don't think it is as robust as Microsoft would lead you to believe, and several critical design decisions have made Hyper-V VMs *FAR* less portable than Virtual PC / Virtual Server VMs. The word of record is that Hyper-V will ship in 6 months, and that gives me cause for great concern. I was told by someone on the Hyper-V team that these issues were not Hyper-V's fault, that they would have seen it in hardware testing. But why did 2 drives that were less than 90 days old fail within a week of each other, on separate machines? Why did it happen the second time while I was starting a Hyper-V VM? And why were Hyper-V files the only ones that vanished off the disks like they never existed?

I sure hope the Hyper-V team contacts me to investigate this. Because I lost over $10K in productivity over this code, I'd really hate to get blown off, and later find that it *is* a bug, that just happened to take down someone a lot more important than me. I'd really rather that $10K at least mean *something*.

And yes, I know... that's what I get for using Beta software... I should probably know better.

BTW, the team at ServerBeach has been nothing short of incredible, and I couldn't have fixed everything without them. After the 50 or so tickets I had to open up, I'm surprised they're still letting me be a client.

Now if you'll excuse me... my bed misses me... and vice versa.


Comments

  • Bryan said:

    So glad it was only a hard drive, or two, and not the site shutting down!  Sounds like a nightmare.

    February 7, 2008 2:43 AM
  • Michael Ott said:

    Glad to hear you are finally back online. I too thought you had thrown in the towel. I don't envy your recovery nightmare though. Backups, backups, backups!

    Keep up the good work.

    Regards.

    Mike.

    February 7, 2008 3:09 AM
  • It'll be interesting to see what becomes of this... Please keep us updated. :)

    February 7, 2008 8:05 AM
  • No. This time it wasn't me. But not because I'm smarter than this guy, but because I don't the kind of

    February 7, 2008 11:46 AM
  • Micronet said:

    What a nightmare. I wrote about it in my blog here. http://www.networkworld.com/community/node/24806

    Will be watching your blog to see how/if Microsoft responds.

    February 7, 2008 12:39 PM
  • February 7, 2008 1:39 PM
  • Dijutal_Phreak said:

    I can't believe you didn't back up or even run RAID on your servers. Shame on you!

    February 9, 2008 10:06 AM
  • Ken Schaefer said:

    Hi,

    So you had a disk failure. And you were running beta code in production. And you didn't understand the technology. And you didn't have a DR strategy. And somehow this is now Microsoft's fault and they should be contacting you about this? Wow.

    February 12, 2008 4:58 PM
  • Tobie Fysh said:

    But all will be okay, because you had some _good_ offline backups, you know, the ones you tested every 6 months to make sure all was as it should be and you had a copy of the website running on non-beta code too..... you know, because you would only be running beta code as a comparison against fully released code..... yea...... that'll be why all is good........   :-)

    February 16, 2008 5:40 AM
  • excuse said:
    March 10, 2008 9:01 AM