About two hours ago I was happily puttering around in a shell on the server that hosts most of my digital existence. Until, after a routine software update, I was suddenly presented with some disturbing sights:
    [pizza@stuffed ~]$ yum
    Bus error
    [pizza@stuffed ~]$ dmesg
    -bash: /bin/dmesg: input/output error
It seems that something had gone very, very wrong. I tromped home to find the console full of disk error messages. But that shouldn't have brought the system down; the OS was installed on a pair of drives set up in a RAID1 mirror. There's no excuse for a read failure -- if one drive failed, the other should have picked up the slack and kept the system working.
Well, *should*. I made the mistake of setting up the RAID1 mirror using the "dmraid" tools, which piggyback on the motherboard's "fakeraid" metadata. Unfortunately, it seems that this mode of operation doesn't handle failures worth a damn. And there's no easy way to migrate from the "dmraid" stuff to the very mature and robust native Linux "mdraid" tools.
The mdraid tools have one major disadvantage though -- they are much more difficult to boot from, and from the motherboard's perspective, if the primary drive fails, the array becomes unbootable. So by using dmraid instead of mdraid, I ended up trading one (minor) failure case for a much more serious one.
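For what it's worth, the failure handling mdraid provides is exactly the part dmraid fell down on here. A rough sketch of what dealing with a flaky mirror member looks like under mdadm -- the device names (/dev/md0, /dev/sda1) are placeholders, and all of this requires root and a real array:

    # See the current state of all md arrays and any ongoing resync.
    cat /proc/mdstat

    # Mark the suspect drive's partition as failed, then pull it from
    # the mirror. The array keeps running degraded on the good drive.
    mdadm /dev/md0 --fail /dev/sda1
    mdadm /dev/md0 --remove /dev/sda1

    # After replacing the drive, add the new partition back; mdadm
    # rebuilds the mirror in the background.
    mdadm /dev/md0 --add /dev/sda1

The catch, as noted above, isn't the array management -- it's that the BIOS knows nothing about the md superblocks, so getting the bootloader to come up off the surviving drive is where things get fiddly.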
How does one work around this quandary? By going for a real hardware RAID controller. The ones with onboard processors that do all the heavy lifting. The ones that hide the messy details from the OS. The ones that are designed to JustWork(tm) and NotFail(tm).
The ironic thing is that this server already has a hardware RAID controller in it, a 3Ware 9550SXU-8LP, but it's maxed out with two 4-drive RAID5 arrays totalling about 7TB. This server's predecessors both had 3Ware RAID controllers -- and in my decade-long experience with 3Ware controllers, I've not had a single controller-related failure, ever. They JustWork(tm), handling too-numerous-to-count drive failures cleanly and transparently.
So, I just ordered another 3Ware 9550SXU-4LP controller to plug my OS array into. It's used so I got it for cheapcheapcheap, and like the one I already have it's two generations older than their current top-line models, but it'll be more than good enough for a simple RAID1 array. Migrating the dmraid array over to the new controller will be an interesting proposition even without the flaky drive complicating things, but since downtime is unavoidable thanks to dmraid's failings, I might as well make sure things are done right.
If I hadn't picked up another hardware controller, I'd have migrated to mdraid and booted off of a USB stick. That solution has worked great at work, and those USB sticks are trivial to back up and replicate in case of failures.
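For anyone tempted by that route, the mdraid half of it is straightforward. A hedged sketch, with made-up device names (/dev/sda1, /dev/sdb1) standing in for the real mirror members -- this is destructive and needs root:

    # Build a RAID1 mirror from two partitions. Using 1.0 metadata puts
    # the superblock at the end of the device, which keeps the start of
    # the partition looking like a plain filesystem to old bootloaders.
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          --metadata=1.0 /dev/sda1 /dev/sdb1

    # Record the array so it gets assembled automatically at boot.
    mdadm --detail --scan >> /etc/mdadm.conf

    # Watch the initial sync progress.
    cat /proc/mdstat

The USB stick then just carries the bootloader and kernel, so a dead primary drive can't take out the boot path -- and cloning the stick is a plain dd away.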
In other news, the tally from this weekend is 5581 RAW photos taking up some 62 gigs of space. Been too busy to post anything, but as my backlog empties out, you'll be seeing more here again.