ECC Errors

 
jtrooney

External


Since: Dec 13, 2005
Posts: 1



(Msg. 1) Posted: Tue Dec 13, 2005 10:12 am
Post subject: ECC Errors
Archived from groups: alt.comp.hardware

I have an Intel server board running 4x1GB sticks of ECC memory with 2
Intel Xeon processors. The system randomly dies every couple of
months. The odd part is that it happens when the system's load is at a
minimum. I have gone through and tested each stick of memory in each DIMM
slot on the board and was led to believe that one of the DIMM slots
was bad. I replaced the motherboard and am still receiving errors
with memtest. The errors that I receive are uncorrected ECC errors. I
guess any idea as to where to go from here would be a great help.
Thanks in advance

--
Jeff Rooney
jtrooney.DeleteThis@nexdlevel.com

Paul4

External


Since: Jul 27, 2004
Posts: 2307



(Msg. 2) Posted: Tue Dec 13, 2005 4:55 pm
Post subject: Re: ECC Errors
Archived from groups: per prev. post

In article <1134497574.538910.11250.TakeThisOut@z14g2000cwz.googlegroups.com>,
jtrooney.TakeThisOut@gmail.com wrote:

> I have an Intel server board running 4x1GB sticks of ECC memory with 2
> Intel Xeon processors. The system randomly dies every couple of
> months. The odd part is that it happens when the system's load is at a
> minimum. I have gone through and tested each stick of memory in each DIMM
> slot on the board and was led to believe that one of the DIMM slots
> was bad. I replaced the motherboard and am still receiving errors
> with memtest. The errors that I receive are uncorrected ECC errors. I
> guess any idea as to where to go from here would be a great help.
> Thanks in advance
>
> --
> Jeff Rooney
> jtrooney.TakeThisOut@nexdlevel.com

As I understand it, ordinary ECC is SECDED (single error correction,
double error detection). If the BIOS has an option to enable "background
scrubbing", then the system will methodically go through the main
memory, when the system is not under load, and do test read cycles
on the memory. If a correctable error is found, the hardware will
try to repair it. The benefit of scrubbing is that if there is
some background level of single-bit errors in the memory,
they won't "accumulate" into uncorrectable multi-bit errors. (Now,
memory errors aren't always characterized
as being single-bit errors -- corruption of entire words in memory
is possible, if there is an electrical disturbance during a write
operation. Like an electric floor polisher bumping the equipment
rack in the server room.)
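
To make the SECDED behaviour described above concrete, here is a minimal Python sketch using an extended Hamming(8,4) code: 4 data bits, 3 Hamming parity bits, and one overall parity bit. It only illustrates the principle. Real ECC DIMMs typically protect 64 data bits with 8 check bits, and the names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions of all set bits
    overall = sum(code) % 2                # 0 while the overall parity still holds
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flipped bits: treat as a single error and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return "uncorrectable", None
    nibble = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, nibble
```

Flipping any one bit of a codeword returned by `encode` is repaired by `decode`; flipping any two bits is reported as uncorrectable, which is the kind of error the original poster is seeing.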

A second form of error detection/correction is "chipkill", but
I don't know if chipkill is a candidate for scrubbing or not.
Chipkill focusses on groups of data at the nibble level (4 bits),
and since x4 width memory chips are popular on registered DIMMs,
is a good fit for such memory modules. If you have a registered
DIMM with x4 chips on it, it is even possible for the computer
to keep running, if an entire chip dies.

Now, that being said, if you can run memtest on one stick of memory
and still get errors, then scrubbing is not going to fix it. Your
problem is too severe.

RAM stability can be influenced by memory timing, memory clock
rate, memory voltage, and temperature. If you get a copy of
CPUZ (www.cpuid.com), you can see what timing the computer is
currently using. CPUZ also has a "report" generator, and it
can dump the contents of the SPD chip on the DIMM, if you need
to see what memory timings have been set in the DIMM's SPD
chip, to be used as defaults.

If this server is using registered memory, operation should be
"bulletproof". The electrical performance of registered busses
is so much better than how desktops work, that you shouldn't be
seeing anything like this. And with only one DIMM installed
for testing purposes, there is no excuse for errors.

If memtest86+ is always returning errors at the same memory
addresses, then the RAM could be bad. I've had memory with
stuck bits before, so it does happen.
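
One way to check for that pattern is to note the failing addresses from each memtest86+ pass and see which ones keep coming back. The sketch below assumes the addresses have been transcribed by hand into Python lists; the function name `recurring_addresses` and the sample addresses are hypothetical, not something memtest86+ produces in this form.

```python
from collections import Counter


def recurring_addresses(passes, min_passes=2):
    """Return the sorted addresses that failed in at least min_passes passes."""
    counts = Counter()
    for failing in passes:
        counts.update(set(failing))  # count each address at most once per pass
    return sorted(addr for addr, n in counts.items() if n >= min_passes)


# Hypothetical results from three passes, for illustration only.
passes = [
    [0x1F3A0040, 0x00002000],
    [0x1F3A0040],
    [0x1F3A0040, 0x00003000],
]
for addr in recurring_addresses(passes):
    print(hex(addr))
```

An address that shows up in every pass points towards a bad cell or stuck bit; failures scattered over different addresses each time point more towards timing, voltage, or temperature.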

On "enthusiast" boards, there would be the option to increase
the memory voltage. For example, DDR chips rated at DDR333 or
slower have an industry-standard supply voltage of 2.5V. At DDR400, the
spec used industry-wide is 2.6V. If the board supports it,
setting the voltage on the DIMMs to 2.7 or 2.75V will sometimes
improve a background error rate. The Intel board should already
be using at least 2.6V, if the design staff had half a brain.
(You didn't mention the technology used on the board, so
this could be 3.3V for SDRAM, 2.5V for DDR, 1.8V for DDR2 and
so on.)

In eons past, engineers used to design boards without the
benefit of simulation tools. Such boards would crash once
a day, and the engineers were powerless to improve them.
Design tools have improved a lot since then, as have understandings
of how this stuff should work. Certainly an Intel designed
board, should be well clear of those bad design methods. That
leaves bad (budget) memory, a problem with power supply,
or some other environmental factor, as possible contributors.

Can you "borrow" some sticks from another computer?
I'd be curious if every module you stuff in the system,
fails memtest86+.

Does the server board have a "hardware monitor"? That is
the ability to monitor key voltages on the system. An
example of a freeware tool for accessing the hardware monitor,
would be MBM5 from mbm.livewiredev.com . But, for a server
board, you are more likely to need to use whatever tool
was bundled with the motherboard when you bought it. That is
because the hardware monitor chip would likely not be a
mainstream implementation, and will not be similar to the
desktop boards that MBM5 supports.

With the hardware monitor, even at the BIOS level, you can
look to see if the +3.3V, +5V, +12V and so on are within
5% of their nominal values. If your 5V was below 4.75V,
you might want to get a multimeter and verify by hand the
quality of the power delivered to the board. Same for the
other voltages. Power supplies and disks are the two
weakest links in a computer. Followed by flaky unbranded
memory chips that die a year after you buy them...
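
The 5% check above is simple arithmetic, and a small Python sketch can do it for a set of readings. The function name `rails_out_of_tolerance` and the sample readings are made up for illustration; real values would come from the BIOS hardware monitor page or a multimeter.

```python
def rails_out_of_tolerance(readings, tolerance=0.05):
    """Return {nominal: measured} for every rail outside the tolerance band.

    readings maps each rail's nominal voltage to the measured value.
    """
    bad = {}
    for nominal, measured in readings.items():
        if abs(measured - nominal) > tolerance * abs(nominal):
            bad[nominal] = measured
    return bad


# Made-up sample readings, for illustration only.
sample = {3.3: 3.28, 5.0: 4.70, 12.0: 11.9}
print(rails_out_of_tolerance(sample))
```

With these sample numbers only the 5V rail is flagged, since 4.70V is below the 4.75V limit.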

Paul

All times are: Pacific Time (US & Canada)