As detailed in my previous entry from May, Academic Technology at UK performed a very aggressive large-scale infrastructure update this summer. Or at least we tried…. The ‘Cylinder of Destiny’ worked just fine, though it took plenty of hard work to arrive at the correct client settings, get all the clients bound to AD, etc. But thanks to Josh and the rest of the gang, we now have a fully Augmented ADODi setup.
XSAN, however, didn’t go so smoothly. Which I guess should have been expected since A> that was the supposedly simple part, and 2> the ADODi setup only ran into a couple of minor bumps.
The system consists of about 12 Intel Xserves, some new, some existing, an existing pair of QLogic 5600s with 2 interconnects, an existing Cisco 3500 gig-e switch, and 3 new Promise Vtrak e610fd/610j combos, each filled with 32 1TB Seagate enterprise-class SATA drives. One would expect that, at worst, this setup would perform on par with our existing XSAN 1.4 installation with 6 G5 Xserves/1 Intel Xserve, an older FC switch & Cisco, and 5 Xserve RAID units. But you would be wrong.
Immediately out of the gate we had major problems. Chris put together the initial configs on both the RAIDs and XSAN itself, and while it wasn’t exactly set up the way I would have done it, it was definitely reasonable, if not closer to Apple’s guidelines…. But when we tried to copy some data over to it, we immediately saw speeds matching the performance of a G3 B&W’s original 6G internal drive. After many, many hours of configuring and reconfiguring things, even trying direct-attach of the Vtrak, we came to the conclusion that one of the 3 units was faulty. We reached this conclusion after reconfiguring the device as a RAID0 array of 16 drives, all in one cabinet. Even in this config, and direct-attached, we saw write ‘spikes’ of nearly 600M/sec (sequential 0 writes via dd if=/dev/zero bs=1m of=/Volumes/bigraid ), but mostly we were left with 150-200M/sec average IO. This was way low…
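For anyone who wants to repeat that kind of sequential zero-write test without dd, here is a minimal Python sketch. It is not what we ran; the function name, the default sizes, and the fsync-before-stopping-the-clock choice are my own assumptions for illustration, and the result is reported in decimal MB/sec.

```python
import os
import time

def sequential_zero_write(path, total_mib=256, block_mib=1):
    """Write total_mib MiB of zeros to path in block_mib-MiB blocks; return MB/s."""
    block = bytes(block_mib * 1024 * 1024)  # buffer of zero bytes, like reading /dev/zero
    blocks = total_mib // block_mib
    start = time.monotonic()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device before timing stops
    elapsed = max(time.monotonic() - start, 1e-9)  # guard against a zero interval
    written = blocks * len(block)
    return written / elapsed / 1e6  # decimal megabytes per second
```

Calling something like `sequential_zero_write("/Volumes/bigraid/zero_test.bin", total_mib=1024)` (a hypothetical file on the volume under test) returns one averaged figure, so it will smooth over the kind of spikes described above rather than show them.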
Promise support is horrible, btw. While they initially responded quickly, they provided little assistance, and communication dropped off almost immediately. I should mention that we bought these units from another reseller, not via AAPL. When we put together the original purchase, it was not clear what the difference was between off-the-shelf Promise gear and what Apple was pushing. I had used an off-the-shelf 310e with XSAN in Jan; in fact, that unit was ordered before Apple publicly announced the Promise. Combine that with the fact that we are encouraged to buy things from particular vendors, a substantial price difference, the ability to install 1T drives, and the fact that we might still be waiting on the AAPL unit to be delivered, and you can understand our decision. Promise all-but refused (refuses) to discuss our issues with us, but we are still fairly sure that something is wrong with this unit. We left it out of the loop and rebuilt some volumes on the other two Vtraks.
At first things seemed slightly better. Some initial tests were ‘promising’, so to speak. We saw write speeds of 150-180M/sec with 14-drive RAID6 arrays. That was still not what we had hoped for out of the new gear, but anything north of 50M/sec would suffice for our install. But our jubilation didn’t last. We rebuilt the rest of the volumes and started copying data over from the current system. Checking back in on the copy late that night, I suspected trouble. After about 8 hours of copying, I expected to see at least 10M/sec worth of data, or say, 250G, since we were copying from a mostly idle XSAN 1.4 over gig-e. But I saw well less than 100…in fact less than 50. No worries, just off to a slow start, or maybe a path issue in the rsync; even at 5M/sec she’ll be done by the next afternoon or so (we were moving about 1T).
But that wasn’t to be. A full 24 hours later the copy wasn’t even up to what should have been 8 hours’ worth. A quick check showed the copy had slowed down, spending most of its time moving a couple hundred K per sec, and occasionally spiking up to a few Meg/sec. Definitely not good.
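The back-of-the-envelope numbers above are just rate times time. A small sketch of that arithmetic follows, assuming decimal units (1G = 1000M, 1T = 1000G) and a perfectly steady rate, neither of which a real copy will honor. The function names are made up for illustration.

```python
def expected_gb(rate_mb_per_s, hours):
    """Data moved at a steady rate, in decimal GB."""
    return rate_mb_per_s * hours * 3600 / 1000

def hours_to_copy(total_gb, rate_mb_per_s):
    """Hours needed to move total_gb at a steady rate."""
    return total_gb * 1000 / rate_mb_per_s / 3600

def effective_rate(gb_moved, hours):
    """Average MB/s implied by gb_moved over the given number of hours."""
    return gb_moved * 1000 / (hours * 3600)

print(expected_gb(10, 8))               # 288.0
print(round(effective_rate(50, 8), 2))  # 1.74
print(hours_to_copy(1000, 5))           # hours to move ~1T at a steady 5M/sec
```

Read that way, seeing less than 50G after 8 hours implies an average under about 1.7M/sec, which is in line with the couple-hundred-K-with-occasional-spikes behavior we found when we looked.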
We tried various other methods and many of them seemed to work for a while, but always crashed down to ~1M/sec or less. We rebooted clients, controllers, etc. We rebuilt volumes with different settings, we wrote scripts to distribute the copy, etc. We copied to a large FW drive using ‘ditto’ from the old SAN, which averaged over 25M/sec, then tried to copy back with ditto, cvcp, tar, rsync, etc. All would begin at 50-80M/sec and, sometimes in minutes, sometimes in hours, be dragged down to 1-2M/sec. WTF?
After a while, we determined that rebooting would bring the speeds back. Sometimes we could reboot the MDC or just fail over the volume to another MDC and speeds would jump; other times we had to reboot the client doing the transfer, or move the transfer to another client. This isn’t good, but no worries: still 3 days to go-live!
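The start-fast-then-collapse pattern is easier to pin down if the copy itself logs its rate as it goes. The following is a rough sketch of that idea, not one of the scripts we actually wrote: it copies a single file in chunks and records one MB/sec sample per window of data, so you can see when the rate falls off. The function names, window size, and the 5M/sec floor are all assumptions for illustration.

```python
import time

def copy_with_rate_log(src, dst, window_bytes=64 * 1024 * 1024, chunk_bytes=1024 * 1024):
    """Copy src to dst; return a list of MB/s samples, one per window of data."""
    samples = []
    moved_in_window = 0
    window_start = time.monotonic()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_bytes)
            if not chunk:
                break
            fout.write(chunk)
            moved_in_window += len(chunk)
            if moved_in_window >= window_bytes:
                elapsed = max(time.monotonic() - window_start, 1e-9)
                samples.append(moved_in_window / elapsed / 1e6)
                moved_in_window = 0
                window_start = time.monotonic()
    if moved_in_window:  # account for the final partial window
        elapsed = max(time.monotonic() - window_start, 1e-9)
        samples.append(moved_in_window / elapsed / 1e6)
    return samples

def first_slow_window(samples, floor_mb_per_s=5.0):
    """Index of the first sample below the floor, or None if none is."""
    for i, rate in enumerate(samples):
        if rate < floor_mb_per_s:
            return i
    return None
```

Knowing which window first drops below the floor gives a timestamp to line up against MDC failovers and client reboots, which is the correlation we were stuck eyeballing.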
We decided to leave the existing WinXX lab and web users where they were on the old SAN, and moved the smaller community of Mac users onto the new SAN, since they needed to be hosted on the ADODi infrastructure. However, that didn’t last either, as login times shot up to 5+ mins (if you could log in at all). We briefly tried a 10.5.3 client on the SAN, but it seemed to have most of the same SAN speed issues over time (probably needed all MDCs at 10.5.3 also), and worse still had some AppleFileService (aka afpd) bug where it would consume 500-700% CPU while delivering very little data.
In the end the Mac users are hosted on a 10.5.4 ADODi machine, but their storage comes from a direct-attached XServe RAID.
Since then I’ve tried two 10.5.5 developer releases on a single machine which is an MDC. After failing over the appropriate volume to that MDC, things are still bad, if not worse…slow speeds overall and performance diminishing over time…
We’ve filed an ‘issue’ with Apple, but do not have an XSAN maintenance contract. Our SE is going to try to assist us with both our Promise issue and the XSAN speed issue, which at this point is the real show stopper.