
Intel Matrix Storage Manager - RAID 5 - FAILED

idata
Employee
2,697 Views

I have a client that had a server go down on Friday. Here is what I know so far:

1. The server was running fine, but a constant loud beep was occurring. The customer rebooted the server, and now we get 'Operating System Not Found'.

2. When I got there, I determined that the first physical disk of the RAID 5 had failed. I was prompted to do a recovery, to which I answered 'Y'. It really didn't do much, so I exited the program and tried booting again. Still 'Operating System Not Found'.

3. I re-seated all of the hard drives (hot plug) - Still no help.

4. I purchased a new physical disk 0 (larger than the original 2 drives) and went back into the storage manager BIOS, but I am not prompted to recover.

5. The new disk shows up as the first disk (non member), and the other two are showing as member disks.

RAID5 - FAILED.

Is there any way to recover from this? What could have happened? I would hate to delete the RAID and start all over, but I am not sure I have much of a choice right now. I thought the whole purpose of a RAID 5 was that one disk could fail and everything would still be safe. I have this same config running at about 50 other clients, with no problems (YET). Is there anything I can do to prevent this from happening in the future?

Anyone with any miracle suggestions?

Thanks in advance!

 

Ashish
4 Replies
idata
Employee
1,424 Views

Hi Ashish,

Basically, I think that if the RAID status is shown as RAID5 - Failed in the option ROM (OROM) BIOS, there is no longer a usable RAID configuration. A Failed status indicates that a disk drive that is part of the virtual disk has failed and is no longer usable.

Note: If the drives are shown as failed, do not re-use a failed drive.
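
To illustrate why a RAID 5 can survive one missing member (DEGRADED) but not two (FAILED), here is a minimal Python sketch of XOR parity on a single stripe. It only illustrates the general RAID 5 principle and is not the controller's actual implementation; the block contents and the helper name xor_blocks are made up for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a three-disk RAID 5: two data blocks plus one parity block.
# The contents are arbitrary sample bytes.
data_a = b"OSDATA01"
data_b = b"OSDATA02"
parity = xor_blocks(data_a, data_b)

# DEGRADED: one member is missing, so its block can be recomputed
# from the two surviving blocks.
rebuilt_a = xor_blocks(data_b, parity)
assert rebuilt_a == data_a
```

With two members missing, only one block of the stripe is left. XOR has nothing to combine it with, so the missing data cannot be recomputed from the array itself.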

 

When using RAID, it is always advisable to use drives of the same model with the same firmware revision.

 

Before doing any tests:
=======================
It is best to have a backup before performing any test, as RAID is not a backup. Anything can happen in practice.

1. Reboot the system into the RAID option ROM BIOS.

Aim: to see whether the RAID status can be shown as DEGRADED. If the array is showing as failed, follow these steps. Check the RAID status after each step and move on to the next one only if the array still shows as failed:

a) Disconnect the first drive and check the RAID status.

b) Reconnect the first drive, disconnect the second drive, and check the RAID status.

c) Reconnect the second drive, disconnect the third drive, and check the RAID status.

 

 

2. If the array still shows as failed in all three cases above, try the same test again on different hardware.

Connect the drives to another identical motherboard with exactly the same board revision and firmware revision, using the same SATA ports as in the original machine.

Aim: to see whether the RAID status can be shown as DEGRADED. If the array is showing as failed, follow these steps. Check the RAID status after each step and move on to the next one only if the array still shows as failed:

a) Disconnect the first drive and check the RAID status.

b) Reconnect the first drive, disconnect the second drive, and check the RAID status.

c) Reconnect the second drive, disconnect the third drive, and check the RAID status.

3. If the RAID status is now shown as DEGRADED, perform a verified backup.

If you think the hard drives have a problem, use a hard drive diagnostic tool to check them. Never test drives while a RAID is rebuilding.

4. If the drives are not online or recognised, it can be due to:

 

Failed physical drive.

 

Excessive number of hard drive grown defects or hard drive block redirection events.

 

Data bus errors.

 

Power interruptions or an unexpected reboot.

 

Hard drive thermal issues.

5. Check power supply and power connections.

 

Check cables for proper installation, type, and length.

 

Check cable routing, reseat cable connectors.

 

Replace if faulty.

 

Review the firmware version of each hard drive and check the drive manufacturer's website for a newer firmware revision.

 

Caution: It may not be possible to recover from a failed array.
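
If a drive is visible to the smartmontools utility `smartctl` (for example when attached to a plain SATA port, which is an assumption here), the attribute table printed by `smartctl -A` can hint at some of the causes listed under step 4. Below is a minimal Python sketch, assuming the classic text table layout. The sample report is made up for illustration, and treating any non-zero raw value as suspicious is a simplification, not a vendor threshold.

```python
# Attributes that loosely map to the causes listed under step 4.
WATCHED = {
    "Reallocated_Sector_Ct": "grown defects / block redirection",
    "Current_Pending_Sector": "unreadable sectors awaiting reallocation",
    "UDMA_CRC_Error_Count": "data bus or cable errors",
}


def flag_attributes(report: str) -> dict:
    """Return watched attributes whose raw value is above zero."""
    flagged = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


# Hypothetical sample in the style of `smartctl -A` output (not real data).
SAMPLE_REPORT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       12
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       3
"""

for name, raw in flag_attributes(SAMPLE_REPORT).items():
    print(f"{name} = {raw} ({WATCHED[name]})")
```

A rising CRC error count usually points at cabling or the backplane rather than the disk itself, which is why it is worth checking alongside step 5.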

Hope this helps a bit, Ashish.

Kind Regards,

 

Aryan.
idata
Employee
1,424 Views

Thanks very much for the suggestions. I had tried most of that before posting, but I guess I was hoping for a miracle. I ended up deleting the RAID and re-creating it. Hopefully, this doesn't happen again. Thanks again for the ideas.

Ashish

idata
Employee
1,424 Views

An update and more help, please...

So, I replaced drive 0 with a new 320 GB Western Digital SATA drive, rebuilt the entire server, and everything was good, except that the hot plug cage for the RAID 5 still beeped and had a red light on. I called the company that built the server (Systemax), but they really did not have a suggestion short of replacing the cage, so they sent a replacement. By the time I got there to replace the cage, the red light had gone off, and the system was stable for 3 weeks. So, we sent it back, assuming we were finally done.

Fast forward to last Friday: the client calls me and says the beeping has started again, the red light is on, and the screen says 'Operating System Not Found'. Unbelievable. He reboots the machine, and the hard drives all show good, but the RAID shows as 'FAILED'. It does boot into the OS and starts its own rebuild. After about 30 minutes, the system reboots automatically back to the 'Operating System Not Found' screen.

So, this time I go in and hit Ctrl-I to enter the RAID utility. It shows the RAID as FAILED again and all drives show good, but it asks me if I want to rebuild, to which I answer 'Y'. After rebooting, it starts rebuilding again. This time it stays on all night, completes the rebuild, and then reboots again into 'Operating System Not Found'. What the heck do I do now? The company is sending me another hot plug cage. I don't see that helping, but maybe it will? HELP!

idata
Employee
1,424 Views

Hi Ashish,

Sorry to hear that you have a problem with your system again.

I would say that if the system has failed more than twice, there is an underlying cause.

I will point out different possibilities, but you will need to troubleshoot it yourself.

1. First question: are the hard drives on the Tested Hardware List for the board? This is very important, because drives that have not been tested sometimes create a big incompatibility issue.

2. If the drives are on the list, the next question is whether the firmware on the drives is the same as the firmware the board manufacturer mentioned in the tested list.

If the firmware is not the same, contact the drive manufacturer to find out whether there is any known issue with that firmware. If possible, use the same firmware that was on the tested list.

3. Have you updated the firmware of the board, raid controller and the hot swap backplane?

4. Have you tried a different SATA cable and a different SATA power cable? If possible, try another PSU.

5. Did you check that the cables between the backplane and the board are connected correctly? On an Intel backplane you don't need to connect all 3 cables (IPMB, SES, SGPIO) together.

Check the board manufacturer's website for the cable management between the board, RAID controller, and backplane.

6. Don't use the backplane (drive cage). Run the test outside the chassis to make sure everything works fine without the cage.

7. If you are using Intel Matrix Storage Manager, have you checked the RAID event log for errors?

8. If everything else seems to work fine, try replacing the board as a precaution.
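
The checks in points 1 and 2 can be sketched as a small Python helper. Everything here is hypothetical: the Drive record, the check_drives function, and the model and firmware strings are placeholders, and the real tested hardware list has to come from the board manufacturer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    port: int
    model: str
    firmware: str


def check_drives(drives, tested):
    """Compare drives against a tested list (model -> set of firmware revisions)."""
    findings = []
    for d in drives:
        if d.model not in tested:
            findings.append(f"port {d.port}: model {d.model} is not on the tested list")
        elif d.firmware not in tested[d.model]:
            findings.append(
                f"port {d.port}: firmware {d.firmware} differs from the tested revisions"
            )
    # Mixed models or firmware revisions in one array are also worth flagging.
    if len({(d.model, d.firmware) for d in drives}) > 1:
        findings.append("array mixes drive models or firmware revisions")
    return findings


# Placeholder data for illustration only.
tested_list = {"EXAMPLE-MODEL-A": {"FW-1.0"}}
array = [
    Drive(port=0, model="EXAMPLE-MODEL-B", firmware="FW-2.0"),
    Drive(port=1, model="EXAMPLE-MODEL-A", firmware="FW-1.0"),
    Drive(port=2, model="EXAMPLE-MODEL-A", firmware="FW-1.0"),
]

for finding in check_drives(array, tested_list):
    print(finding)
```

An empty result only means the drives match the list. It does not rule out the cabling, backplane, or board issues covered in the other points.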

Hope this may help you a bit.

Kind Regards,

Aryan.
