So, the Comcast tech is here in the building, and he says that the WA market doesn’t use modems other than the one I’ve got for business accounts, so the fix I was hoping for was a bust. He says there’s a lot of RF noise and he’s working on that, and he’ll swap out for a different modem of the same type. I was told that the routing problem we’re seeing is a firmware issue, so I don’t hold out a lot of hope for this being a good solution.
So I’m on to Plan F: Abandon the affected IP addresses and ask for my money back. I’m going to start the process of moving said services onto a different IP. Since there isn’t a physical move involved, there should be no appreciable outages (other than when the modem decides to crap the bed again).
Puget Sound Atheism and vis.nu Networks are the only affected subsystems: basically, everything brought into the corporate substrate from The Great Convergence. It’s all one set of servers now, but it still speaks on three addresses.
I’ll just move them onto the same IP as Tacoma Telematics, and all should be well. This is also a temporary solution, as there’s likely another move coming up.
It’s a situation of everything happening at once.
The hypervisor is fixed, the router is running smoothly, and I’m even making progress on my terminal server. I’ve upgraded the hypervisor routing tools, and they’re working more efficiently than before. But vis.nu and PSA would still stop talking to the world now and again, and flushing caches didn’t fix anything this time. I started to suspect that it was something upstream from here.
I placed a laptop outside of the rack, and it can see PSA and vis.nu when the rest of the world can’t, no problem. Next step is the CPE, so I rebooted that. Lo and behold: everything came back up. I’m on the horn with Comcast, just got handed to level 2, and I’m working on doing some long-term testing.
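For anyone who wants to repeat the inside-versus-outside test without standing next to the rack, here is a minimal Python sketch of the same reasoning. It is an illustration, not the tooling I actually used: the function names, the choice of a TCP connect as the probe, and the default port are all assumptions. Run the probe once from a machine on the local network and once from somewhere out on the internet, then feed both results to `locate_fault`.

```python
import socket


def is_reachable(host, port=80, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: a TCP connect stands in for "can see the site".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def locate_fault(lan_ok, wan_ok):
    """Guess where the break is from a local probe and a remote probe."""
    if lan_ok and wan_ok:
        return "no fault seen"
    if lan_ok and not wan_ok:
        # The servers answer locally, so the break is past the rack.
        return "upstream of the rack (CPE or ISP)"
    return "server or local network"
```

If the local probe succeeds while the remote one fails, the servers themselves are fine and the next suspect is whatever sits between the rack and the world, which is how I ended up rebooting the CPE.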
At this point I’m kind of hoping the problem comes back, so they’ll replace the router. But if I were a betting man, I wouldn’t put money on it.
Update 21:23: Finally got a solid answer out of Comcast after about 12 hours of fighting. A new firmware push introduced a bug where the CPE would drop the last two IP addresses on /28 networks. They’re sending a tech between 8:00 and 10:00 tomorrow to install a modem with a different firmware, which should fix the problem. For now, I’ve moved the Comcast router onto my remote rebooter so that if/when this happens again tonight, I can reboot the modem without having to live at the office.
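To make "the last two IP addresses on a /28" concrete, here is a small Python sketch using the standard `ipaddress` module. The block `203.0.113.0/28` is a documentation-only range standing in for my real one, and I'm assuming "last two" means the last two usable host addresses (not counting the broadcast address), since Comcast didn't spell that out. The function name is made up for the example.

```python
import ipaddress


def last_usable_hosts(cidr, count=2):
    """Return the last `count` usable host addresses in a network, as strings."""
    network = ipaddress.ip_network(cidr)
    # hosts() skips the network and broadcast addresses on a /28.
    hosts = list(network.hosts())
    return [str(h) for h in hosts[-count:]]


# Placeholder block from the documentation range, not my actual allocation.
print(last_usable_hosts("203.0.113.0/28"))
# prints ['203.0.113.13', '203.0.113.14']
```

A /28 has 16 addresses and 14 usable hosts, so under that assumption the bug would take out two of the fourteen, which fits with only a couple of the sites going dark while everything else stayed up.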
The Tacoma production server is on the upgraded packages, including the new server. The reason the kernel was being strange was my fault. After working some things out, the new kernel is booting fine and the VMs are running. I’m going to nap in the back room now (too bleary-eyed to drive) and then get back to it.
For some reason, ganesha is losing both the PSA and vis.nu IPs now and again. A reboot fixes it… I’m hoping it’s part of our continuing issues with the present Xen hypervisor, but I’m not holding out a lot of hope. I’m keeping a close eye on it.
Meanwhile, I’ve got helium (my incredibly loud development server that I stopped using when I moved out of the Washington Building and had to have the rack on the same floor as me) fired up with my fresh Xen package. I’m load testing the Linux PVMs while installing a Windows HVM, and no crashing so far. If that holds out for the next few hours, I’ll be taking hydrogen (my production server) down in order to upgrade to the new hypervisor. Shouldn’t take long, and I can back out of the upgrade if something breaks.
Update 08:38: The new copy of Xen is installed, but it didn’t take with the same kernel I’d been using on my test hardware. The older 2.6 kernel seems to work, so I’ve put that back on and loaded the virtual machines that way. I’m still talking to Xen folks, and when I’ve got a solution I’ll give it a shot. Probably not until tonight.