Rumors of my demise have been greatly exaggerated.
Well folks, it's been a pretty crappy 10 days. Late on the night of January 27th, I signed into one of my virtual web servers to get a new client site reconfigured. To my horror, the server was completely offline. My datacenter discovered that the hard drive had failed, the first time that had happened to one of my servers in 5 years.
I was able to get the drive transferred to my working server, and started the process of recovering the data. I hadn't taken a backup since moving to Hyper-V (dumb, yes, I am well aware of that), so I had to get the virtual hard drives (VHDs) back. Well, as the drive failed and Hyper-V came crashing to a halt, it removed all traces of the Virtual Machine that hosted all of my client sites. In addition, it deleted the snapshot of the server that runs Windows-Now, hence some of the broken images that are coming up at the moment (more on that shortly).
The next several days were extremely frustrating, as I learned a very hard lesson about deploying beta software in the wild. For example, in Hyper-V, "snapshots" are completely misleading. They are not full backups of the VHD file, like one might expect. No, instead, they are like hybrid differencing/undo disks... which would be all well and good if the documentation explained that, but it doesn't. The real problem is that the changes to the VHDs are not committed unless you explicitly do so. So in the event of a catastrophic failure, you're basically screwed, because you think all the changes are on the original VHD, but they're actually split up and hidden away in an AVHD file in some ungodly GUID directory.
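If you want to see whether you have uncommitted changes sitting outside your base VHDs, something like the following rough sketch might help. It only assumes what I described above: that the snapshot data lives in files with an .avhd extension somewhere under the VM storage folder. The function name and the folder you pass in are hypothetical, so point it at wherever your own VMs live.

```python
from pathlib import Path


def find_uncommitted_disks(root):
    """Return (path, size_in_bytes) for every .avhd file under root, largest first."""
    root = Path(root)
    found = [
        (p, p.stat().st_size)
        for p in root.rglob("*")  # walk every subdirectory, GUID-named or not
        if p.is_file() and p.suffix.lower() == ".avhd"
    ]
    # Biggest files first: those hold the most changes that never reached the base VHD.
    return sorted(found, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    import sys

    # Hypothetical usage: pass the VM storage folder as the first argument.
    start = sys.argv[1] if len(sys.argv) > 1 else "."
    for path, size in find_uncommitted_disks(start):
        print(f"{size:>15,d}  {path}")
```

This just lists files. It doesn't merge or touch anything, so it's safe to run against a recovered drive before you decide what to do next.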
Fortunately, I figured out that you can rename some files and re-merge the hard drives together offline, which saved my butt for a couple of the VHDs. (I'll explain this process in a later post.) A week after the servers went down, and 140 man-hours later, I had my secondary server running everything... for about an hour. When attempting to restart the VM that runs this site, the entire hard drive on that server went kaput. Yep, you read right, I lost another drive. The drive geometry was totally out of whack. I went to recover the data, and again, Hyper-V files were missing. These were Seagate Barracuda ES drives too, which have always been rock-solid for me in the past.
I feel I need to stress this point: I hadn't lost a single hard drive in 5 years, and then I lost 2 inside of a week. The wonderful people at ServerBeach have all but ruled out faulty hardware, which leads me squarely to the virtualization solution I was using.
Where We're At Now
My server environment is now nearly completely operational. I'm running everything on the Windows Server 2008 RTM bits, along with a new RAID array for each physical box. I'm also backing up my VHDs and my web files using JungleDisk, which is a really awesome tool and super cheap. They also have a decent WHS add-in that I believe is a must-have for any WHS user.
So all of the important sites are back up, and I'll be restoring the lesser sites in the next 24 hours. And now I might actually be able to get some decent sleep tonight.
But now I'm concerned about the decision that I made to move to Hyper-V. I don't think it is as robust as Microsoft would lead you to believe, and several critical design decisions have made Hyper-V VMs *FAR* less portable than Virtual PC / Virtual Server VMs. The word of record is that Hyper-V will ship in 6 months, and that gives me cause for great concern. I was told by someone on the Hyper-V team that these issues were not Hyper-V's fault, that they would have seen it in hardware testing. But why did 2 drives that were less than 90 days old fail within a week of each other, on separate machines? Why did it happen the second time while I was starting a Hyper-V VM? And why were Hyper-V files the only ones that vanished off the disks like they never existed?
I sure hope the Hyper-V team contacts me to investigate this. Because I lost over $10K in productivity over this code, I'd really hate to get blown off, and later find that it *is* a bug, that just happened to take down someone a lot more important than me. I'd really rather that $10K at least mean *something*.
And yes, I know... that's what I get for using Beta software... I should probably know better.
BTW, the team at ServerBeach has been nothing short of incredible, and I couldn't have fixed everything without them. After the 50 or so tickets I had to open up, I'm surprised they're still letting me be a client.
Now if you'll excuse me... my bed misses me... and vice versa.