Older Intel Atom C2000 Series Server Chips May Stop Working After a While, and There’s no Fix

It takes time and effort to debug hardware and software and get a product right, but some bugs are hard to reproduce or only appear over time. It appears some Intel Atom C2000 series processors for microservers may stop working after about 18 months, with the likelihood of failure increasing over time, because clock signals stop functioning.

Atom C2000 Block Diagram

This is documented in the Intel Atom Processor C2000 Product Family Specification Update, with erratum AVR54 explaining the issue:

AVR54. System May Experience Inability to Boot or May Cease Operation

Problem: The SoC LPC_CLKOUT0 and/or LPC_CLKOUT1 signals (Low Pin Count bus clock outputs) may stop functioning.
Implication: If the LPC clock(s) stop functioning the system will no longer be able to boot.
Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.
Status: For the steppings affected, see Table 1, “Errata Summary Table” on page 9.

The table on page 9 shows that stepping “B0” suffers from this problem. The issue affects existing motherboards and servers based on Atom C2000, and companies like Cisco will provide replacements:

Recently, Cisco became aware of an issue related to a component manufactured by one supplier that affects some Cisco products. In some units, we have seen the clock signal component degrade over time. Although the Cisco products with this component are currently performing normally, we expect product failures to increase over the years, beginning after the unit has been in operation for approximately 18 months. Once the component has failed, the system will stop functioning, will not boot, and is not recoverable. This component is also used by other companies.

We have identified all Cisco products that have this component and worked with the supplier to quickly put a fix in place. All products shipping currently do not have this issue. To support our customers and partners, Cisco will proactively provide replacement products under warranty or covered by any valid services contract dated as of November 16, 2016, which have this component. Due to the age-based nature of the failure and the volume of replacements, we will be prioritizing orders based on the products’ time in operation.

The good news is that a new revision of the chip fixes the issue for new processors, but there’s no fix for older ones. So if you own any such system and it has suddenly stopped working or become unstable, this may be the reason. You may also want to check whether you can get a replacement while the system is still under warranty, whether it currently works or not.
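On an x86 Linux machine, one way to see which stepping a processor reports is to read /proc/cpuinfo. The snippet below is only a rough sketch: the helper names `parse_cpuinfo` and `looks_like_b0` are made up for illustration, and the numeric values (family 6, model 77, stepping 8 for “B0”) are assumptions that do not come from the erratum text quoted above, so they should be checked against Intel’s specification update before drawing any conclusion.

```python
# ASSUMPTION: Atom C2000 parts report CPU family 6, model 77, and the
# "B0" stepping appears as stepping 8. These numbers are not taken from
# the article; verify them against Intel's documentation.
ASSUMED_FAMILY = 6
ASSUMED_MODEL = 77
ASSUMED_B0_STEPPING = 8


def parse_cpuinfo(text):
    """Return family, model, stepping and name of the first CPU listed
    in the given /proc/cpuinfo text (x86 Linux field names)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {
        "family": int(fields["cpu family"]),
        "model": int(fields["model"]),
        "stepping": int(fields["stepping"]),
        "name": fields.get("model name", ""),
    }


def looks_like_b0(cpu):
    """True if the parsed CPU matches the ASSUMED identifiers above."""
    return (
        cpu["family"] == ASSUMED_FAMILY
        and cpu["model"] == ASSUMED_MODEL
        and cpu["stepping"] == ASSUMED_B0_STEPPING
    )
```

On a live system you would pass the contents of /proc/cpuinfo to `parse_cpuinfo` and then call `looks_like_b0` on the result. A match only suggests the board may carry an affected stepping; the vendor’s advisory remains the authoritative source.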

Thanks to Mike for the tip.

Sander

“Workaround: A platform level change has been identified and may be implemented as a workaround for this erratum.”

What does that mean? Is that a software/firmware workaround for existing hardware, or does the workaround mean replacing the Atom with a new Atom?

JotaMG

Most probably means exchanging some of the hardware, but “platform level change” is a good PR workaround… hehe

Gaetano

… and, being a SoC, this will mean “replace the motherboard/cpuboard” :-/

At my work, I’ve forwarded this information to my IP/Transport manager; several Cisco products are involved, for instance some of the Nexus 9000 family, ASA5500, ISR4300 and so on.

Fossxplorer

@CNX: should “Intel Celeron C2000” be “Intel Atom C2000”?

theguyuk

Curious what the cost in parts and labour is for replacing one, and also what the expected in-use life of such a server is.

Nobody of Import

@cnxsoft

That generally means that you’re getting a new board in most cases. The phrase is PR-speak to cover for the fact that the vendors, if this stuff is under warranty, are going to be EATING this.

@theguyuk

Cost is probably the BoM cost of the board itself. Since the device is soldered ONTO the board, like most SoCs are, it means replacement of the whole board. In most cases, this equates to pitching the affected device into the recycle bin and shipping you a new board/machine. It’d be like if someone fubared something on the Minnowboard SoC or a Pi’s, with the same predictable result.

Nobody of Import

And on that note…I’ll observe that you want Intel for WHAT reasons? >;-D

sandbender

@theguyuk
Cisco indicated in their advisory that the systems generally start failing around 18 months. I have an ASRock C2750D4I that has the affected stepping and it’s been chugging along as a light duty home NAS for about a year now. I’m interested to see how this plays out for the little guys without Cisco support contracts, because ASRock only has a 1 year warranty. Hopefully Intel will make good on it like they did with the Pentium FP bug. If not, I’ll soon be able to justify that D1541 to the wife 😉

theguyuk

@sandbender
Opening a system and changing a CPU or motherboard affects the failure rate because disturbing connections means more chance of errors. Then you have data security issues.

For a home NAS, they could replace it or offer you money off an upgrade. (Not everyone can open it, take the bits out and replace them, as not everyone is hardware trained or interested.)

Cisco being involved means, I guess, business contracts.

Have it replaced, is it warranted, for how long? Are you getting new for old or repaired refurbished replacements? You can go on and on.

Motherboards that heat up and cool down become more susceptible to shock- or bending-induced faults. Same with connectors; some firms used to glue connectors together to reduce transport-induced faults (things have got better).

Sfinx

“platform level change” means you must switch to the new CPU family. AMD for example 😉

sandbender

@theguyuk
I think you misunderstood. I have a custom NAS built around a C2750D4I motherboard, and I was just wondering about possible RMAs for the board itself, not a whole NAS. Apparently SuperMicro is already offering a fix for their boards if requested, even if the board hasn’t failed yet. The fault is very specific and not something you could easily fake or ascribe to a different fault like cracking traces on the board (only two CLK pins die on the actual chip, so you can’t just overvolt it and claim it failed because of this issue).

@cnxsoft
After reading around on this some more, it seems that the 18 months quoted by Cisco is very “proactive”. OVH and others with hundreds of boards deployed are reporting a very low failure rate, even for equipment they’ve had deployed since the release of the SoCs (3 years). Most of the vendors like Cisco and Synology are covered by NDAs with Intel that prevent them from disclosing the actual failure rate. StH has a little more detail: https://www.servethehome.com/intel-atom-c2000-series-bug-quiet/.
