Maker Pro

Oxidisation of Seagate & WDC PCBs

Franc Zabkar

I came across a reference to this Russian forum thread in a WDC forum:
http://maccentre.ru/board/viewtopic.php?t=70953&start=15

Here is Google's translator:
http://translate.google.com/translate_t?sl=ru&tl=en

The thread discusses oxidisation of contact pads in current Seagate
and Western Digital hard drives. The drives were used in typical
office and home environments, and are about a year old. The thread has
several detailed photos. All except the older tinned PCB appear to
show evidence of serious corrosion.

Is this the fallout from RoHS? Surely it's not the result of some cost
saving measure?

- Franc Zabkar
 
Arno

In comp.sys.ibm.pc.hardware.storage Franc Zabkar said:
I came across a reference to this Russian forum thread in a WDC forum:
http://maccentre.ru/board/viewtopic.php?t=70953&start=15
The thread discusses oxidisation of contact pads in current Seagate
and Western Digital hard drives. The drives were used in typical
office and home environments, and are about a year old. The thread has
several detailed photos. All except the older tinned PCB appear to
show evidence of serious corrosion.
Is this the fallout from RoHS? Surely it's not the result of some cost
saving measure?

The silver ones are not oxidised. Silver reacts with sulphur,
not oxygen. It is normal and cannot really be prevented. It is
also not a problem on contacts that are not used, as the
process stops by itself once a thin coating has formed.

The golden ones look like the same thing to me. Maybe they
used a high silver content gold here. Sorry, I am not a
chemist. But my parents used to deal in silver jewellery
and the look is characteristic.

I suspect air pollution as the root cause. As I said, it is
not a problem in this case; the sulphurisation process
will not eat through the traces. If anything, they are better
protected by it.

It would be a problem on the connectors though. But they will
have better and thicker gold anyway.

Arno
 
Franc Zabkar

Maybe not. There are other known culprits, like the drywall (gypsum
board, sheetrock... whatever it's called in your region) that
outgasses hydrogen sulphide. Some US construction of a few years ago
is so bad with this toxic and corrosive gas emission that demolition
of nearly-new construction is called for.

Corrosion of nearby copper is one of the symptoms of the nasty
product.

It's not just Russia that has this problem. The same issue comes up
frequently at the HDD Guru forums.

- Franc Zabkar
 
Franc Zabkar

Does this mean we should apply contact protector, such as DeoxIT, to
the PCBs to prevent corrosion?

One of the sticky threads at the HDD Guru forums recommends that the
preamp contacts on WD drives be scrubbed clean with a soft white
pencil eraser whenever they come in for data recovery.

- Franc Zabkar
 
Sergey Kubushyn

In sci.electronics.repair Franc Zabkar said:
It's not just Russia that has this problem. The same issue comes up
frequently at the HDD Guru forums.

I'm right here in the US and I had 3 of 3 WD 1TB drives fail at the same
time in RAID1, thus making the entire array dead. It is not that you can
simply buff that dark stuff off and you're good to go. The drive itself
tries to recover from failures by rewriting service info (remapping etc.)
but the connection is unreliable and it trashes the entire disk beyond
repair. Then you have that infamous "click of death"... BTW, it is not
just WD; others are also that bad.

They had good old gold plated male/female headers on older drives and those
were reliable. Newer drives have, sorry for the expression, "gold plated"
pads and springy contacts from the drive heads. That would have saved them
something like $0.001 per drive compared with those headers, and they took
that road. The gold plating was also of the cheapest variety possible,
probably immersion, so it wouldn't last long. The newest drives from Seagate
also have that construction but the pads look tin plated, no gold. Don't
know how long that will last.

What we are looking at is an example of a brilliant design with a touch of
genius--it DOES last long enough that the drives work past their warranty
period, and at the same time it will NOT last long enough to keep working
very long past the manufacturer's warranty. I don't know if it is just
greed/incompetence or a deliberate design feature, but if it is the latter,
my kudos to their engineers for a job well done :(
 
Arno

One of the sticky threads at the HDD Guru forums recommends that the
preamp contacts on WD drives be scrubbed clean with a soft white
pencil eraser whenever they come in for data recovery.

That sounds like BS to me. A soft pencil eraser cannot remove silver
sulfide, it is quite resilient. There are special silver cleaning
cloths that will do the trick.

Still, I doubt that this is a problem. It should not crawl between
working contacts, only unused ones.

Arno
 
Arno

I'm right here in the US and I had 3 of 3 WD 1TB drives fail at the same
time in RAID1, thus making the entire array dead. It is not that you can
simply buff that dark stuff off and you're good to go. The drive itself
tries to recover from failures by rewriting service info (remapping etc.)
but the connection is unreliable and it trashes the entire disk beyond
repair. Then you have that infamous "click of death"... BTW, it is not
just WD; others are also that bad.

It is extremely unlikely for a slow chemical process to achieve this
level of synchronicity. So unlikely, in fact, that it would be fair to
call it impossible.

Your array died from a different cause that would affect all drives
simultaneously, such as a power spike.
They had good old gold plated male/female headers on older drives and those
were reliable. Newer drives have, sorry for the expression, "gold plated"
pads and springy contacts from the drive heads. That would have saved them
something like $0.001 per drive compared with those headers, and they took
that road. The gold plating was also of the cheapest variety possible,
probably immersion, so it wouldn't last long. The newest drives from Seagate
also have that construction but the pads look tin plated, no gold. Don't
know how long that will last.

Tin lasts pretty long, unless you unplug/replug connectors. That is its
primary weakness.
What we are looking at is an example of a brilliant design with a touch of
genius--it DOES last long enough that the drives work past their warranty
period, and at the same time it will NOT last long enough to keep working
very long past the manufacturer's warranty. I don't know if it is just
greed/incompetence or a deliberate design feature, but if it is the latter,
my kudos to their engineers for a job well done :(

I think you are on the wrong trail here. Contact mechanics and
chemistry are well understood and have been studied for longer than
modern electronics has existed. So has metal plating technology in general.

Arno
 
Arno

In comp.sys.ibm.pc.hardware.storage Jeff Liebermann said:
On Thu, 08 Apr 2010 17:11:39 +1000, Franc Zabkar

I've NEVER had a
drive failure that was directly attributed to such contact corrosion.
It's usually something else that kills the drive.

I think people are jumping to conclusions, because the discoloration
is what they can see (and think they understand). There is a posting
in this thread from a person who has had a 3-way RAID1 fail and
attributes it to the contact discoloration. Now, with a slow chemical
process, the required level of synchronicity is so unlikely that
calling it impossible is fair.
Nope. If the contacts were tin-silver, 5% lead, or one of the other
low lead alloys, the corrosion would probably be white or light gray
in color. The dark black suggests there's at least some lead involved
or possibly dissimilar contact material.

Actually, pure silver also sulphidises in this way. The
look is very characteristic. I think this is silver plating
we see. It is typically not a problem on contacts that
are in use; it does not crawl between contact points.

I suspect that in the observed instances this is a purely
aesthetic problem and has no impact on HDD performance
or reliability whatsoever.

Arno
 
Sergey Kubushyn

In sci.electronics.repair Arno said:
It is extremely unlikely for a slow chemical process to achieve this
level of synchronicity. So unlikely, in fact, that it would be fair to
call it impossible.

Your array died from a different cause that would affect all drives
simultaneously, such as a power spike.

Yes, they did not die from contact oxidation at that very same moment. I
cannot even tell whether they all died the same month--that array might
have been running in degraded mode with one drive dead, then after some
time a second drive died but it was still running on the one remaining
drive. Only when the last one crossed the Styx did the entire array go
dead. I don't use Windows, so my machines are never turned off unless
there is a real need for it. And they are rarely updated once they are up
and running, so there are no reboots. Typical uptime is more than a year.

I don't know though how I could miss a degradation alert if there was any.

All 3 drives in the array simply failed to start after reboot. There were
some media errors reported before reboot but all drives somehow worked. Then
the system got rebooted and all 3 drives failed with the same "click of
death."

The mechanism here is not that the oxidation itself killed the drives. It
never happens that way. It was the main cause of the failure, but the
drives actually committed suicide, the way the immune system kills its own
body when it overreacts to some kind of haemorrhagic fever.

The probable sequence is something like this:

- Drives run for a long time with the majority of the files never
accessed, so it doesn't matter whether the part of the disk where
they are stored is bad or not

- When the system is rebooted, RAID array assembly is performed

- While this assembly is being performed, a number of sectors on a
drive are found to be defective and the drive tries to remap them

- Such action involves rewriting service information

- Read/write operations are unreliable because of the failing head
contacts, so the service areas become filled with garbage

- Once the vital service information is damaged, the drive is
essentially dead because its controller cannot read the data it
needs to even start the disk

- The only hope for the controller to recover is to repeat the read
in the hope that it might somehow succeed. This is that infamous
"click of death" sound: the drive trying to read the info again and
again. There is no way it can recover because that data is trashed.

- Drives do NOT fail while they run; the failure happens on the next
reboot. The damage that kills the drives on that reboot happened
way before the reboot, though.

That suicide can also happen when some old file that has not been accessed
for ages is read. That attempt triggers the suicide chain.
 
JW

It's a technique that has been used on edge connectors for many years.

Yup, and it works. I learned the technique when servicing Multibus I
systems, and still use it to this day.
 
Arno

Yes, they did not die from contact oxidation at that very same moment. I
cannot even tell whether they all died the same month--that array might
have been running in degraded mode with one drive dead, then after some
time a second drive died but it was still running on the one remaining
drive. Only when the last one crossed the Styx did the entire array go dead.

Ah, I see. I did misunderstand that. It may still be something
else, but with that, the contacts are a possible explanation.
I don't use Windows, so my machines are never turned off unless there
is a real need for it. And they are rarely updated once they are
up and running, so there are no reboots. Typical uptime is more than a
year.

So your disks worked and then refused to restart? Or you are running
a RAID1 without monitoring?
I don't know though how I could miss a degradation alert if there was any.

Well, if it is Linux with mdadm, it only sends one email per
degradation event in the default settings.
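For what it is worth, a monitoring setup might look something like the
sketch below; the mail address, polling interval and the crude cron check
are only illustrative, and it assumes a working local mail setup:

  # mdadm mails DegradedArray/Fail events to the address configured in
  # /etc/mdadm.conf:
  #   MAILADDR root
  # Run the monitor daemon, polling the arrays every 30 minutes:
  mdadm --monitor --scan --daemonise --delay=1800

  # Crude extra check, e.g. from a daily cron job: an "_" in one of the
  # [UU] status strings in /proc/mdstat means an array is missing a member.
  grep -q '_' /proc/mdstat && echo "md array degraded" | mail -s "RAID alert" root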
All 3 drives in the array simply failed to start after reboot. There were
some media errors reported before reboot but all drives somehow worked. Then
the system got rebooted and all 3 drives failed with the same "click of
death."
The mechanism here is not that the oxidation itself killed the drives. It
never happens that way. It was the main cause of the failure, but the
drives actually committed suicide, the way the immune system kills its own
body when it overreacts to some kind of haemorrhagic fever.
The probable sequence is something like this:
- Drives run for a long time with the majority of the files never
accessed, so it doesn't matter whether the part of the disk where
they are stored is bad or not

I run a long SMART selftest on all my drives (RAID or not) every
14 days to prevent that. Works well.
- When the system is rebooted, RAID array assembly is performed
- While this assembly is being performed, a number of sectors on a
drive are found to be defective and the drive tries to remap them
- Such action involves rewriting service information
- Read/write operations are unreliable because of the failing head
contacts, so the service areas become filled with garbage
- Once the vital service information is damaged, the drive is
essentially dead because its controller cannot read the data it
needs to even start the disk
- The only hope for the controller to recover is to repeat the read
in the hope that it might somehow succeed. This is that infamous
"click of death" sound: the drive trying to read the info again and
again. There is no way it can recover because that data is trashed.
- Drives do NOT fail while they run; the failure happens on the next
reboot. The damage that kills the drives on that reboot happened
way before the reboot, though.
That suicide can also happen when some old file that has not been accessed
for ages is read. That attempt triggers the suicide chain.

Yes, that makes sense. However you should do surface scans on
RAIDed disks regularly, e.g. by long SMART selftests. This will
catch weak sectors early and other degradation as well.
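By hand that is just the following (the device name is only an example):

  # Start a long (extended) self-test; the drive stays usable while it runs
  smartctl -t long /dev/sda

  # A few hours later, check the self-test log and the overall health verdict
  smartctl -l selftest /dev/sda
  smartctl -H /dev/sda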

Arno
 
Arno

In comp.sys.ibm.pc.hardware.storage Jeff Liebermann said:
On Sat, 10 Apr 2010 22:33:49 +0000 (UTC), Sergey Kubushyn
That's the real problem with RAID using identical drives. When one
drive dies, the others are highly likely to follow. I had that
experience in about 2003 with a Compaq something Unix server running
SCSI RAID 1+0 (4 drives). One drive failed, and I replaced it with a
backup drive, which worked. The drive failure was repeated a week
later when a 2nd drive failed. When I realized what was happening, I
ran a complete tape backup, replaced ALL the drives, and restored from
the backup. That was just in time, as both remaining drives were
dead when I tested them a few weeks later. I've experienced similar
failures since then, and have always recommended replacing all the
drives, if possible (which is impractical for large arrays).

For high reliability requirements it is also a good idea to use
different brands of drives, to get better distributed times between
failures. Some people have reported the effect you see.

A second thing that can cause this effect is when the disks are not
regularly surface scanned. I run a long SMART selftest on all disks,
also the RAIDed ones, every 14 days for this. The remaining disks are
under more stress during an array rebuild, especially if they have weak
sectors. This additional load can cause the remaining drives to
fail a lot faster, in the worst case during the array rebuild.

Arno
 
Arno

In comp.sys.ibm.pc.hardware.storage Mike Tomlinson said:
It's a technique that has been used on edge connectors for many years.

It works with a harder eraser, and it works for tin contacts with
a soft one. But it does not work for silver contacts; you need
to have at least some sand in the eraser for that.

Arno
 
Sergey Kubushyn

In sci.electronics.repair Arno said:
Ah, I see. I did misunderstand that. May still be something
else but the contacts are a possible explanation with that.

I don't think it is something else but everything is possible...
So your disks worked and then refused to restart? Or you are running
a RAID1 without monitoring?

They failed during the weekly full backup. One of the file reads failed and
they entered that infinite loop of restarting themselves and retrying. The
root filesystem was also on that RAID1 array, so there was no choice other
than to reboot. And on that reboot all 3 drives failed to start with the
same "click of death" syndrome.
Well, if it is Linux with mdadm, it only sends one email per
degradation event in the default settings.

Yep, I probably missed it when shoveling through mountains of spam.
I run a long SMART selftest on all my drives (RAID or not) every
14 days to prevent that. Works well.

Yes, that makes sense. However you should do surface scans on
RAIDed disks regularly, e.g. by long SMART selftests. This will
catch weak sectors early and other degradation as well.

I know, but I simply didn't think all 3 drives could fail... I thought I had
enough redundancy because I put not 2 but 3 drives in that RAID1... And I
did have something like a test in the regular weekly full backup that reads
all the files (not the entire disk media, but at least all the files on it),
and it was that backup that triggered the disk suicide.
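A plain read-only pass does much the same job as that backup; something as
crude as the sketch below, with /data standing in for the array's mount
point, forces every file to be read so that unreadable sectors under live
data show up as I/O errors:

  # Read every file under /data and throw the bytes away
  find /data -type f -print0 | xargs -0 cat > /dev/null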

Anyway, lesson learned and I'm taking additional measures now. It was not a
very good experience losing some of my work...

BTW, I took a look at brand new WDC WD5000YS-01MPB1 drives, right out of the
sealed bags with silica gel, and all 4 of those had their contacts already
oxidized with a lot of black stuff. That makes me very suspicious that the
conspiracy theory might not be all that crazy--that oxidation seems to be
pre-applied by the manufacturer.
 
Rod Speed

Sergey said:
I don't think it is something else but everything is possible...


They failed during the weekly full backup. One of the file reads failed
and they entered that infinite loop of restarting themselves and
retrying. The root filesystem was also on that RAID1 array, so there was
no choice other than to reboot. And on that reboot all 3 drives
failed to start with the same "click of death" syndrome.


Yep, I probably missed it when shoveling through mountains of spam.


I know, but I simply didn't think all 3 drives could fail... I thought I
had enough redundancy because I put not 2 but 3 drives in that
RAID1... And I did have something like a test in the regular weekly
full backup that reads all the files (not the entire disk media, but
at least all the files on it), and it was that backup that triggered
the disk suicide.

Anyway, lesson learned and I'm taking additional measures now. It was
not a very good experience losing some of my work...

BTW, I took a look at brand new WDC WD5000YS-01MPB1 drives, right out
of the sealed bags with silica gel, and all 4 of those had their
contacts already oxidized with a lot of black stuff. That makes me
very suspicious that the conspiracy theory might not be all that
crazy--that oxidation seems to be pre-applied by the manufacturer.

MUCH more likely that someone fucked up in the factory.
 
Arno

I know, but I simply didn't think all 3 drives could fail... I thought I had
enough redundancy because I put not 2 but 3 drives in that RAID1... And I
did have something like a test in the regular weekly full backup that reads
all the files (not the entire disk media, but at least all the files on it),
and it was that backup that triggered the disk suicide.
Anyway, lesson learned and I'm taking additional measures now. It was not a
very good experience losing some of my work...

Yes, I can imagine. I have my critical stuff also on a 3-way RAID1,
but with long SMART selftests every 2 weeks and 3 different drives,
two from WD and one from Samsung. One additional advantage of the
long SMART selftest is that with smartd you will get a warning
email on every failing test, i.e. one every two weeks. For additional
warning you can also run a daily short test, e.g. with a schedule like
the one sketched below.
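For example, a smartd.conf line along the following lines (device name and
mail address are placeholders) runs a short test every day at 02:00 and a
long test on the 1st and 15th of each month at 03:00, i.e. roughly every
two weeks, and mails a warning whenever something fails:

  # /etc/smartd.conf
  # -a   monitor health status, attributes and the error/self-test logs
  # -s   regular expression selecting scheduled self-tests (format T/MM/DD/d/HH)
  # -m   address for warning mails
  /dev/sda -a -s (S/../.././02|L/../(01|15)/./03) -m root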
BTW, I took a look at brand new WDC WD5000YS-01MPB1 drives, right out of the
sealed bags with silica gel, and all 4 of those had their contacts already
oxidized with a lot of black stuff. That makes me very suspicious that the
conspiracy theory might not be all that crazy--that oxidation seems to be
pre-applied by the manufacturer.

Urgh. These bags are airtight. No way the problem happened on your
side, then. My two-week-old WD5000AADS-00S9B0 looks fine on the top
of the PCB. I think I will have a look underneath later.

Arno
 
Sergey Kubushyn

In sci.electronics.repair Arno said:
Yes, I can imagine. I have my critical stuff also on a 3-way RAID1,
but with long SMART selftests every 2 weeks and 3 different drives,
two from WD and one from Samsung. One additional advantage of the
long SMART selftest is that with smartd you will get a warning
email on every failing test, i.e. one every two weeks. For additional
warning you can also run a daily short test.

No matter what you do, you cannot prevent the occasional disaster :( One
MUST remember that "backup" is not a noun but a verb in the imperative.
Urgh. These bags are airtight. No way the problem happened on your
side, then. My two-week-old WD5000AADS-00S9B0 looks fine on the top
of the PCB. I think I will have a look underneath later.

Those 4 were fine on the top of the PCB. The black stuff was underneath, on
the pads that contact the springy head pins.
 
Arno

No matter what you do, you cannot prevent the occasional disaster :( One
MUST remember that "backup" is not a noun but a verb in the imperative.
Indeed.
Those 4 were fine on the top of the PCB. The black stuff was underneath, on
the pads that contact the springy head pins.

Mine is fine on both sides. However, there is quite a bit of contact
area that looks and feels silver-plated to me, most notably around
the screws and, on the bottom, the contacts to the head assembly.

Arno
 