One for the RAID Experts
I'm hoping a kind person can save the day for me.
We replaced our fileserver a while ago, summer last year I think, and we've been having a few issues with it. I noticed when I was setting it up that sometimes, seemingly at random, the lights on the RAID disks would flash sequentially, as though it was doing some kind of running maintenance. It's a 10x 1TB SATA disk, RAID 6 array. I just presumed it was doing some parity maintenance or something. I didn't take too much notice as it didn't seem to affect the usability of the array, so I carried on.
We use Firefox in the office (stick with me here!), which tends to use AppData quite a lot. We have AppData redirected to our fileserver, and since this new fileserver we've noticed that Firefox would sometimes hang: become unresponsive for a minute or so and then be fine. It occurred to me that this might be related to the RAID array but, again, all seemed fine so I took little notice. Yes, this probably is the point where I should have investigated further.
Anyway, we had a permission problem on the fileserver recently (it randomly added one of the admins as being from "Parent Object" even though they weren't listed), so I was fixing that by removing and re-applying inherit parent permissions. It turns out, this is a sure-fire way to reproduce the RAID hang scenario. So I downloaded the LSI MegaRAID Storage Manager to see if I could find any info. I am getting a lot of these:
ID = 113
SEQUENCE NUMBER = 51941
TIME = 11-10-2013 06:59:34
LOCALIZED MESSAGE = Controller ID: 0 Unexpected sense: PD = Int. Port 0 - 3:1:3Information unit CRC error detected, CDB = 0x28 0x00 0x48 0xff 0x22 0x00 0x00 0x00 0x80 0x00 , Sense = 0x70 0x00 0x0b 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x47 0x03 0x00 0x00 0x00 0x00
ID = 113
SEQUENCE NUMBER = 51940
TIME = 11-10-2013 01:53:06
LOCALIZED MESSAGE = Controller ID: 0 Unexpected sense: PD = Int. Port 0 - 3:1:0Unknown Sense Code, CDB = 0x28 0x00 0x00 0xa8 0xfe 0x88 0x00 0x01 0x00 0x00 , Sense = 0x70 0x00 0x0b 0x00 0x00 0x00 0x00 0x28 0x00 0x00 0x00 0x00 0x4b 0x04 0x00 0x00 0x00 0x00 0x00 0x28 0x52 0x04 0x01 0x00 0x50 0x03 0x00 0x57 0x00 0xfe 0xdb 0x00 0x50 0x06 0x05 0xb0 0x00 0x02 0x72 0xbf 0x00 0x01 0x0c 0x00 0x00 0x00 0x00 0x00
There are more of the CRC errors, but a significant number of both: 24 of them yesterday.
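For anyone reading the two entries above, the CDB and sense bytes can be decoded by hand. Below is a minimal Python sketch, assuming the standard SCSI READ(10) CDB layout and fixed-format sense data (response code 0x70). The lookup tables only cover the codes seen in this log, and the function names are illustrative, not part of any LSI tool.

```python
# Sketch: decode the CDB and sense bytes from a MegaRAID "Unexpected sense" event.
# Assumes a READ(10) CDB and fixed-format sense data; the tables below cover
# only the codes that appear in this thread.

SENSE_KEYS = {0x0B: "ABORTED COMMAND"}
ASC_ASCQ = {
    (0x47, 0x03): "INFORMATION UNIT iuCRC ERROR DETECTED",
    (0x4B, 0x04): "NAK RECEIVED",
}


def parse_hex_bytes(text):
    """Turn '0x28 0x00 ...' into a list of integers."""
    return [int(token, 16) for token in text.split()]


def decode_read10(cdb):
    """Return (starting LBA, block count) from a 10-byte READ(10) CDB."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(bytes(cdb[2:6]), "big")      # bytes 2-5, big-endian
    blocks = int.from_bytes(bytes(cdb[7:9]), "big")   # bytes 7-8, big-endian
    return lba, blocks


def decode_fixed_sense(sense):
    """Return (sense key name, ASC/ASCQ description) from fixed-format sense."""
    if sense[0] & 0x7F not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F                             # low nibble of byte 2
    code = (sense[12], sense[13])                     # ASC, ASCQ
    return SENSE_KEYS.get(key, "unknown"), ASC_ASCQ.get(code, "unknown")


# First event from the log above (PD 3:1:3).
cdb = parse_hex_bytes("0x28 0x00 0x48 0xff 0x22 0x00 0x00 0x00 0x80 0x00")
sense = parse_hex_bytes(
    "0x70 0x00 0x0b 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 "
    "0x00 0x00 0x47 0x03 0x00 0x00 0x00 0x00"
)
lba, blocks = decode_read10(cdb)
key, detail = decode_fixed_sense(sense)
print(f"READ(10) lba={lba:#x} blocks={blocks} -> {key}: {detail}")
```

Run the same way, the second event (the one the log labels "Unknown Sense Code") has ASC/ASCQ 0x4b/0x04, which the SCSI code tables list as NAK RECEIVED. Both are reads aborted with a transport-level error. That may point towards cabling, backplane or controller port rather than bad sectors, but that is an inference, not something the log states.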
When the server exhibits this "RAID hang", Performance Monitor shows approximately 100% Disk Time and an Average Disk Queue Length of 1. It normally sits at or around those values for about 2 minutes, though it can be shorter, then carries on as normal. The progress bar (for the file permissions) tends to just pause at this point. The progress pause and the 100% Disk Time normally, but do not always, correspond.
I then thought that it might be a Windows issue rather than RAID. My first thought was to check Search Indexing, which, for some reason, I didn't switch on when setting up the server. I have been going through staff folders disabling indexing on the off chance that this is the cause, but I'm not holding my breath. It turns out that that process is another good way to cause the hang.
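To check how well the pauses line up with the counter, the Performance Monitor samples could be exported and scanned for sustained runs of high disk time. Below is a rough Python sketch, assuming the samples have already been loaded as (seconds, percent disk time) pairs. The 95% threshold, the 30-second minimum, the function name and the sample values are all made up for illustration, not anything Performance Monitor defines.

```python
def find_stalls(samples, threshold=95.0, min_seconds=30):
    """Return (start, end) times of runs where disk time stays >= threshold
    for at least min_seconds. `samples` is a time-ordered list of
    (seconds, percent_disk_time) pairs."""
    stalls = []
    run_start = None   # time of the first sample in the current run
    last_high = None   # time of the latest sample in the current run
    for t, pct in samples:
        if pct >= threshold:
            if run_start is None:
                run_start = t
            last_high = t
        else:
            if run_start is not None and last_high - run_start >= min_seconds:
                stalls.append((run_start, last_high))
            run_start = None
    # Close a run that is still open when the samples end.
    if run_start is not None and last_high - run_start >= min_seconds:
        stalls.append((run_start, last_high))
    return stalls


# Made-up samples at 15-second intervals, purely for illustration.
example = [(0, 10.0), (15, 100.0), (30, 100.0), (45, 99.0),
           (60, 100.0), (75, 20.0), (90, 100.0), (105, 15.0)]
print(find_stalls(example))
```

The stall windows found this way could then be compared against the timestamps of the controller events to see whether each hang coincides with a burst of sense errors.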
Anyway, I'll stop there because I'll probably just babble on incoherently. Any thoughts and help would be very welcome!!
Server Specs are:
Fujitsu RX300 S6
Windows Server 2008 R2 Standard SP1
Dual Xeon E5645 (6-core 2.4GHz)
RAID 1 Array: 2x 136GB 6Gbps SAS (System)
RAID 6 Array: 10x 1TB 3Gbps SATA (Data)
RAID Controller: "LSI RAID Ctrl SAS 6G 5/6 512MB (S2616)"
3Gbps teamed NIC
Thanks very much!
What kind of RAID controller, if any, are you running, and what kind of drives? Are you using the latest drivers/firmware for the RAID card or drives?
Thanks for the reply.
The RAID controller is the one that came in the RX300 S6 chassis, which according to a PDF on their site is based on an LSI SAS2008. Device Manager lists "LSI RAID Ctrl SAS 6G 5/6 512MB (S2616)". Drives are all Fujitsu ones: two 136GB 6Gbps SAS drives in RAID 1 for the system, ten 1TB 3Gbps SATA drives in RAID 6 for the data.
Haven't had a chance to update firmware yet, am going to try in October half term.
A good raid controller will do a read scan of all the sectors on a regular basis, so you find out if the disks are dead *before* you rely on them during a failure.
The reason it's hanging is probably that your 1TB drives are not optimized for RAID. That's not just a marketing thing: RAID-optimized drives have modified timeouts for reading dead sectors and such. This stops (or is supposed to stop) the array from hanging.
Ok, thanks for the info.
I've (somewhat worryingly) discovered that these Fujitsu drives are actually Seagate "Constellation" drives. I don't suppose you know if there are different firmwares for RAID and non-RAID, do you? Would be great if I could just "switch them". I will investigate anyway.
Thanks for the help.
Would seem the answer is no, no new firmware for the drives. :(
Is the RAID controller running the latest firmware?
I'm not exactly sure, I've managed to find the controller chip (LSI SAS2008) but I've not managed to find anything on the Fujitsu site for firmware downloads yet. :( I think it's a Fujitsu-designed controller using the LSI chip so I'm presuming it'd be them providing drivers but haven't had any luck finding it yet - still trying!
The Constellations are the enterprise versions, so should be suitable for RAID. The controller really should be marking the ones with CRC errors as failing or failed. I've found SATA less predictable with drive failures than SAS.
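To see whether the errors cluster on particular drives, the event log could be tallied per physical disk. Below is a minimal Python sketch, assuming a plain-text export in the same "KEY = value" layout as the entries quoted earlier in the thread. The regular expression and function name are illustrative guesses, not part of MegaRAID Storage Manager, and the sample text is the two quoted entries with the byte dumps trimmed.

```python
import re
from collections import Counter

# Captures the enclosure:slot style identifier (e.g. "3:1:3") that follows
# "PD = <port> - " in an "Unexpected sense" message.
PD_PATTERN = re.compile(r"Unexpected sense: PD = .*? - (\d+:\d+:\d+)")


def count_sense_events_by_pd(log_text):
    """Count 'Unexpected sense' events per physical disk identifier."""
    counts = Counter()
    for line in log_text.splitlines():
        if not line.startswith("LOCALIZED MESSAGE"):
            continue
        match = PD_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two entries quoted in the thread, with CDB and sense bytes trimmed.
sample = """\
ID = 113
SEQUENCE NUMBER = 51941
TIME = 11-10-2013 06:59:34
LOCALIZED MESSAGE = Controller ID: 0 Unexpected sense: PD = Int. Port 0 - 3:1:3Information unit CRC error detected, CDB = 0x28
ID = 113
SEQUENCE NUMBER = 51940
TIME = 11-10-2013 01:53:06
LOCALIZED MESSAGE = Controller ID: 0 Unexpected sense: PD = Int. Port 0 - 3:1:0Unknown Sense Code, CDB = 0x28
"""

for pd, n in count_sense_events_by_pd(sample).most_common():
    print(pd, n)
```

If one or two identifiers dominate the count, those drives or their slots are the first suspects. An even spread across all ten would suggest looking at something shared, such as the controller or backplane.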
Ok, that's good to know DMcCoy, thanks.
It would seem the firmware on the controller is 4 versions old. It's running firmware from 2010 even though the server isn't nearly that old, maybe the controller was old stock. Anyway, will try updating the controller firmware when I get a chance.
Thank you all for your support. :)
Make sure you have a full backup before updating the raid controller!
Lol, good point. I will! Thanks!
Definitely a job for Oct half term I think!