Before we dig into part II of the NAS rebuild, I'd like to entertain you with a tale of how I accidentally destroyed my newly-rebuilt NAS, and how I proceeded to recover it completely.
First, some background
After the FreeBSD rebuild, I decided to expand the NAS by switching to a new case. I sourced a brand new Array R2 by Fractal Design from a local supplier, and salvaged two spare 1TB drives I had lying around in an unused 2-bay NAS.
The transplant of motherboard and disks went without issues, and the new case was heaps quieter. Unfortunately, as I found out, it is not possible to add more physical disks to an existing RAID-Z array. I tried a few methods and, during one of the attempts, I accidentally added a standalone disk to the ZFS pool. This had, I believe, the adverse effect of cancelling out the performance gains of pooling 4 disks together, because reads and writes could now also land on the lone disk that was part of the pool but not of the RAID-Z array. This is best explained by looking at the output below:
```
nas% zpool status
  pool: pool0
 state: ONLINE
  scan: scrub repaired 0 in 43h35m with 0 errors on Tue Jul 17 12:13:06 2012
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
          ada2      ONLINE       0     0     0

errors: No known data errors
```
As you can see, ada2 is part of pool0 but sits outside the raidz1-0 array. Although some data may live in the fast RAID-Z array, there's a chance that reads or writes land on the much slower single disk as well, dragging the whole pool down. Not to mention that a lone disk is a single point of failure, as you'll see ahead.
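For the record, the blunder probably looked something like this at the command line. This is a reconstruction, not my actual shell history: `zpool add` with a bare disk stripes it into the pool as a standalone top-level vdev, and ZFS warns about the mismatched replication level unless you force it.

```shell
# Reconstructed, not my actual command history.
# Adding a bare disk stripes it alongside the RAID-Z vdev;
# ZFS warns about the mismatched replication level unless forced:
zpool add -f pool0 ada2

# Growing capacity the "proper" way means adding a whole new
# RAID-Z vdev instead (hypothetical device names):
# zpool add pool0 raidz ada6 ada7 ada8
```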
But on with the story…
It all started last weekend, when I was playing a movie (Total Recall, the 1990 one) over the network on my HTPC. For some reason, XBMC was stuttering and re-buffering quite a lot. Both my NAS and my HTPC (running XBMC) are connected to the same Gigabit-capable Netgear WNDR3700 (running DD-WRT), so I assumed the network wasn't the bottleneck. Heck, I can play the same movie over WiFi to my iMac in the other room! At least I could.
After getting fed up with the stutters, I decided to focus my attention on fixing my earlier blunder with the array setup. The first thing I tried was to offline the disk (offlining virtually unplugs a disk, before you do it physically). That, unfortunately, didn't work, because ada2 is the only disk in its top-level vdev, as explained above, and you cannot offline a disk that has no replicas (copies). A good safety feature of ZFS, but an annoyance in this particular case.
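The failed attempt, roughly (device name from my pool above; the exact error wording may vary between ZFS versions):

```shell
# Try to virtually unplug the stray disk:
zpool offline pool0 ada2
# ZFS refuses, because ada2 has no redundancy to fall back on:
#   cannot offline ada2: no valid replicas
```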
Next I attempted to physically remove the disk (with the computer off, obviously). And that's when things got… interesting.
For reasons that went beyond my understanding of ZFS at the time, removing that drive had the side effect of making the ZFS pool unbootable. So even though the USB drive was still working, the root filesystem wasn't being mounted, as the ZFS pool wasn't available. Re-attaching the drive did nothing to improve the situation. I started to worry. Did I just accidentally wipe all my data? Could I have been that dumb?
Thankfully, I had learned from my mistakes and knew better than to store my critical data only on the NAS. With solutions like SkyDrive, Dropbox and others, keeping important data solely on local disks is asking for trouble. In this scenario, if I'd had to rebuild my array from scratch, only media files would have been lost: that is, video and audio files. A lesson to be learned here, folks!
After a couple of days (and many expletives) of trial and error, I decided to re-format the USB stick and start fresh, already accepting the losses. So I followed part I of this guide again from the top and worked my way down, up to the point where I needed to create the array. With all the original drives attached, I tried my luck and ran a zpool import pool0 instead of a zpool create. And whaddya know?!? It worked!
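In other words, the difference between recovery and data loss came down to one command. A sketch of that moment of truth:

```shell
# DON'T: this would have created a brand new, empty pool over the disks
# zpool create pool0 raidz ada1 ada2 ada3 ada4 ada5

# DO: ask ZFS to find and re-import the existing pool intact
# (-f may be needed if the pool was last used by the old install)
zpool import pool0
```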
The array came back up exactly as shown at the beginning of this post. I let ZFS scrub the drives (which took almost 2 days), then plugged in an external 2TB drive and copied all the data off the array before rebuilding it. This time, however, I used RAID-Z2 across all six disks, which dedicates two disks' worth of space to parity and survives two simultaneous disk failures. And that's how it's been running since:
```
nas% zpool status
  pool: pool0
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0    ONLINE       0     0     0
            ada1    ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0
            ada4    ONLINE       0     0     0
            ada5    ONLINE       0     0     0

errors: No known data errors
```
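For completeness, the rebuild boiled down to roughly these steps (the backup mount point is hypothetical; adapt paths to your setup):

```shell
zpool scrub pool0       # verify every block first (took ~2 days here)
zpool status pool0      # check scrub progress until it finishes

# copy everything off to the external 2TB drive
rsync -a /pool0/ /mnt/backup/

# then rebuild as RAID-Z2 across all six disks
zpool destroy pool0
zpool create pool0 raidz2 ada0 ada1 ada2 ada3 ada4 ada5
```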
Not all that glitters is gold
Although the data was safe and sound (+ points to ZFS), the stuttering issue still persists. I am almost certain that ZFS is not the culprit, and have turned my focus instead to the network: either the router or the FreeBSD driver for the NAS's network card is to blame.
I was about to test my router theory by using a spare Gigabit switch I had, but the switch's power supply decided to release its magic smoke right when I plugged it in (talk about luck!). So I'm back to the drawing board, testing network card and kernel settings in FreeBSD to find the bottleneck.
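If you're chasing a similar bottleneck, raw TCP throughput is a good first measurement, since it takes the disks and ZFS out of the equation entirely. A sketch using iperf (available in FreeBSD ports; the hostname is hypothetical):

```shell
# On the NAS, start a listening server:
iperf -s

# On the HTPC, run a 30-second throughput test against it:
iperf -c nas.local -t 30

# A gigabit link should sustain somewhere near 940 Mbit/s of TCP payload;
# significantly less points at the NIC, driver, cabling or switch.
```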
Thanks for your patience while waiting for this guide to be finished. I thought I'd share a cautionary tale with a (somewhat) happy ending of how ZFS, in the end, kept my data safe from the worst enemy possible: myself. :)