After a great couple of days away I was called urgently to work - one of my client's networks was down and my colleague was stuck. The vSphere client couldn't see the hosts, which couldn't see their datastores, and the network had zero stability. Yay. Just what I wanted to come home to.
The servers are running ESXi 4.1, which needs a few updates, but the network's stability has always been a real issue for us. We took it over from some other chaps and solved a long series of issues with the servers, particularly the Iomega NAS. Throw in a database issue on one of the guest servers that kept the disk IO at absolute peak all the time and things were pretty tricky. All those things have been resolved, more or less, and the network had been fully functional for some months. So why did it change? Several reasons, and I hope you take away some ideas for yourself from this.
The NAS appeared to have dropped a disk, and although it has a RAID 10 setup, it paralysed the system (not at all like the FreeNAS I've talked about before). The whole thing fell over and we shut it down before reseating the disk and powering back up. The NAS detected the disk and began a rebuild. The VMware hosts couldn't see the datastore though, and the server managing the virtual environment periodically disconnected from everything. Initially we thought it was a network issue and restarted everything. The hosts were slow to come up, with errors in the logs indicating an inability to see the iSCSI disks. We could connect to the NAS via its web interface and ping it, and it looked quite happy on the network, so we couldn't understand what was happening. An added complication was that we have a new HP NAS on the network, and while we were able to migrate several of the hosts to it, we've had problems getting them started. Don't know why, but the host's CPU goes through the roof every time we try to start them. I thought we might have hit an All Paths Down (APD) bug, and most of the documentation suggests calling VMware and letting them sort it out. At 8pm this isn't so great a plan, and with the client losing production time and money we had to solve it ourselves.
So with all these errors and problems left and right I was at a loss. Eventually we took inspiration from The IT Crowd and turned it all off, counted to 10 and started it back up. Would you believe that it took three reboots of the management server before the vSphere client would connect to anything - especially considering it was running locally! The iSCSI shares from the NAS finally became available - there was a service issue on the Iomega NAS that was failing silently. It was reporting that all was good when it really wasn't. Once those shares were available we were able to reconnect to the datastores and boot the guest machines. The management server was still disconnecting from the hosts constantly, but we were able to at least get things going. There is still quite a bit to do there, but the servers are finally running.
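What tripped us up was that ping and the web interface both looked fine while the iSCSI service itself was dead. A minimal sketch of a check that tests the service rather than the box, assuming the NAS listens on the standard iSCSI port 3260 (the function name and defaults are my own for illustration, not something we ran on the night):

```python
import socket

ISCSI_PORT = 3260  # IANA-registered default port for iSCSI targets


def tcp_port_open(host: str, port: int = ISCSI_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A successful connect only shows that something is listening on the port;
    it does not prove the iSCSI target is serving LUNs correctly.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False
```

Calling `tcp_port_open("192.0.2.10")` (a placeholder address) returns False if nothing accepts connections on port 3260, even when the same box still answers ping. It is a coarse test, but it separates "the NAS is on the network" from "the NAS is serving iSCSI".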
Going forward I think we'll use XenServer and NFS shares. Simpler, fully functional and quick. Easy to back up and expand disks. Adios VMware, I reckon.
The final thought is "Have you turned it off and back on again?" Something in that for all of us :-)