
VMware nightmares

After a great couple of days away I was called urgently to work - one of my client's networks was down and my colleague was stuck. The vSphere Client couldn't see the hosts, the hosts couldn't see their datastores, and the network had zero stability. Yay. Just what I wanted to come home to.

The servers are running ESXi 4.1, which needs a few updates, but the network's stability has always been a real issue for us. We took it over from some other chaps and solved a long series of issues with the servers, particularly the Iomega NAS. Throw in a database issue on one of the guest servers that kept the disk IO at absolute peak all the time and things were pretty tricky. All those things have been resolved, more or less, and the network had been fully functional for some months. So why did it change? Several reasons, and I hope you take away some ideas for yourself.

The NAS appeared to have dropped a disk, and although it has a RAID10 set up, the failure paralysed the system (not at all like the FreeNAS I've talked about before). The whole thing fell over and we shut it down before reseating the disk and powering back up. The NAS detected the disk and began a rebuild. The VMware hosts couldn't see the datastore though, and the server managing the virtual environment periodically disconnected from everything.

Initially we thought it was a network issue and restarted everything. The hosts were slow to come up, with errors in the logs indicating an inability to see the iSCSI disks. We could connect to the NAS via its web interface and ping it, and it looked quite happy on the network, so we couldn't understand what was happening. An added complication was that we have a new HP NAS on the network, and while we were able to migrate several of the guests to it, we've had problems getting them started. I don't know why, but the host's CPU goes through the roof every time we try to start them. I thought we might have an all-paths-down bug, and most of the documentation suggests calling VMware and letting them sort it out. At 8pm this isn't so great a plan, and with the client losing production time and money we had to solve it ourselves.

So with all these errors and problems left and right I was at a loss. Eventually we took inspiration from The IT Crowd and turned it all off, counted to 10 and started it back up. Would you believe that it took 3 reboots of the management server before the vSphere Client would connect to anything - especially considering it was running locally? The iSCSI shares from the NAS finally became available - there was a service on the Iomega NAS that was failing silently. It was claiming that all was good when it really wasn't. Once those shares were available we were able to reconnect to the datastores and boot the guest machines. The management server was still disconnecting from the hosts constantly, but we were at least able to get things going. There is still quite a bit to do there but the servers are finally running.
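The lesson I took from the silent failure is that a ping and a working web interface say nothing about whether the iSCSI service itself is answering. Here is a minimal sketch of the kind of check I'd run next time, assuming the NAS serves iSCSI on the standard TCP port 3260. The function name and the address in the example are made up for illustration, and a successful connect only shows that something is listening on the port, not that the target will actually accept a login.

```python
import socket

ISCSI_PORT = 3260  # standard iSCSI target port; an assumption about how the NAS is configured


def tcp_port_open(host: str, port: int = ISCSI_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout.

    Hypothetical helper: it proves only that a listener accepted the
    connection, which is still more than a ping can tell you.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `tcp_port_open("192.0.2.10")` (a placeholder address) from the management server would return False if the port is closed or unreachable, even if the box still pings and serves its web pages.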

Going forward I think we'll use XenServer and NFS shares. Simpler, fully functional and quick. Easy to back up and expand disks. Adios VMware, I reckon.

The final thought is "Have you turned it off and back on again?" Something in that for all of us :-)
