A few days ago we posted a news item reporting that some RTX 3080 owners are experiencing issues: games freeze or crash and return to the desktop. News is now building up on the actual (possible) root cause of it all.
I really wanted to dig in a little deeper before posting this, but after sinking my teeth into it, the theory is absolutely sound. The CTD (crash to desktop) issues are reported for several RTX 3080 cards, including the Zotac Trinity, the MSI Ventus 3X OC and EVGA cards. Likely most brands will stumble into the problem. New reports, however, also indicate that Founders Edition card owners sometimes see the issue. Should you experience the problem, you can temporarily tackle it by reducing the GPU clock speed with an offset of 50 to 100 MHz, possibly accompanied by a slight underclock. Of course, this quick temporary fix is at your own risk and requires the necessary knowledge. Download the latest version of Afterburner here.
What's going on?
So what is the cause of the CTD issue? Well, there are some theories. The crash to desktop seems to apply to GPU workloads that run in a very high 2,010 MHz to 2,040 MHz range, and that's not your standard clock frequency. However, games that are not very GPU bound can result in higher GPU boost frequencies. Our German colleague Igor Wallossek has put forward a number of possible reasons for the problem; one, however, jumps out in particular.
You need to understand, and I've explained this before, that manufacturers have had their hands tied during the development stages of Ampere-based graphics cards. Only NVIDIA-supplied test software was available for stress testing, simply to prevent benchmark leaks (that did work well though!). The problem is that the NVIDIA test software is a fixed workload, emulating a game. When the test finishes, the AIB sees either a 'pass' or a 'fail' result. So deep testing with other real-world applications was a no-go at that development stage. And some applications do allow the GPU to boost to very high frequencies. The problem does not seem to occur below a boost frequency of roughly 1950 MHz.
On the reference card, six capacitors are located directly under the chip, in the voltage circuit (NVVDD and MSVDD voltages). The components located there are multilayer ceramic chip capacitors (MLCCs) and POSCAPs (conductive polymer tantalum solid capacitors).
Update: after further investigation of the components, what is referred to as POSCAPs everywhere are in fact SP-CAPs (a barely distinguishable component that serves the same function and is often referred to as if it were the same). None of the RTX 3000 cards would use actual POSCAPs.
Palit GeForce RTX 3080 GamingPRO OC shows four SP-CAPs (red) and two MLCC clusters (20 MLCCs, green)
The MLCCs (in green) are better at filtering high frequencies, so video cards with more MLCCs experience fewer problems than cards that rely on SP-CAPs/POSCAPs (red). The small, more difficult to solder MLCCs are also a lot more expensive. Some manufacturers have opted to use fewer or no MLCCs at all, and therein lies the problem. Manufacturers can choose this themselves. NVIDIA's own Founders Edition uses four SP-CAPs. Currently the AIB partners are very quiet, but if all this information turns out to be correct, then AIBs will have to revise their designs and release boards with a fix in place. For the current boards out there, a quick solution would be to lower the boost frequency by perhaps 50 MHz, sidestepping the issue.
MSI GeForce RTX 3080 Gaming X Trio shows five SP-CAPs and one MLCC cluster (10 MLCCs)
MSI GeForce RTX 3090 Gaming X Trio shows four SP-CAPs and two MLCC clusters (20 MLCCs)
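To make these layouts a bit more tangible, here is a minimal Python sketch of the six capacitor positions. It assumes that every position holds either one SP-CAP or one cluster of ten MLCCs, which is our reading of the (10) and (20) counts in the captions; the class and field names are made up for illustration and do not come from any vendor documentation.

```python
from dataclasses import dataclass

POSITIONS = 6            # capacitor positions behind the GPU on the reference design
MLCCS_PER_CLUSTER = 10   # assumption inferred from the (10) / (20) caption counts


@dataclass(frozen=True)
class CapLayout:
    """Hypothetical description of one card's capacitor configuration."""
    card: str
    spcaps: int  # number of positions populated with an SP-CAP

    def __post_init__(self):
        if not 0 <= self.spcaps <= POSITIONS:
            raise ValueError(f"spcaps must be between 0 and {POSITIONS}")

    @property
    def mlcc_clusters(self) -> int:
        # Every position that is not an SP-CAP is assumed to be an MLCC cluster.
        return POSITIONS - self.spcaps

    @property
    def mlccs(self) -> int:
        return self.mlcc_clusters * MLCCS_PER_CLUSTER


# The three boards shown in the photos.
LAYOUTS = [
    CapLayout("Palit GeForce RTX 3080 GamingPRO OC", spcaps=4),
    CapLayout("MSI GeForce RTX 3080 Gaming X Trio", spcaps=5),
    CapLayout("MSI GeForce RTX 3090 Gaming X Trio", spcaps=4),
]


def summarize(layouts):
    """Return one human-readable line per card."""
    return [
        f"{l.card}: {l.spcaps} SP-CAPs, {l.mlcc_clusters} MLCC cluster(s), {l.mlccs} MLCCs"
        for l in layouts
    ]


if __name__ == "__main__":
    for line in summarize(LAYOUTS):
        print(line)
```

The point of the little model is only that the SP-CAP count and the MLCC count trade off against each other within the same six positions; it says nothing about which mix is electrically sufficient.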
In short: specific implementations with a POSCAP/SP-CAP design are suspected of creating instability, specifically at particularly high boost clocks. That in turn results in in-game driver crashes and the dreaded CTD (crash to desktop). The fix: reconfigure the POSCAPs/SP-CAPs and add MLCCs.
One AIB has confirmed this:
Hi all,
Recently there has been some discussion about the EVGA GeForce RTX 3080 series. During our mass production QC testing we discovered a full 6 POSCAPs solution cannot pass the real world applications testing. It took almost a week of R&D effort to find the cause and reduce the POSCAPs to 4 and add 20 MLCC caps prior to shipping production boards, this is why the EVGA GeForce RTX 3080 FTW3 series was delayed at launch. There were no 6 POSCAP production EVGA GeForce RTX 3080 FTW3 boards shipped.
But, due to the time crunch, some of the reviewers were sent a pre-production version with 6 POSCAP’s, we are working with those reviewers directly to replace their boards with production versions.
EVGA GeForce RTX 3080 XC3 series with 5 POSCAPs + 10 MLCC solution is matched with the XC3 spec without issues.
Thanks
EVGA
We'll keep an eye on this situation. Again, should you experience this, downclock a bit to see if that makes a difference. We're sure that AIBs will release new BIOSes where needed, and we're also sure that some board designs will get revised. Also, I read in the forums somewhere that people were wondering whether NVIDIA checks AIB PCBs. I can answer that, as it has been policy for many years: yes, all boards are validated by NVIDIA and need to be approved before the manufacturers can mass-produce them.
Is all this 100% certain to be the issue?
I wish I knew for sure; at this point nothing is certain, but the AIB report above was pretty clear and sure about it. I have yet to experience even one crash on any of my samples at hand, and that is the honest truth. Currently, we're also seeing reports of similar CTD behavior on ASUS cards (using 100% MLCCs) and Founders Edition cards. That could be a placebo effect, but it is odd. It also does not seem to be a PSU issue: some users who experienced these issues bought a new PSU, and the problem returned. As stated, in the end there is a quick fix: prevent the boost clock from rising above 2 GHz. This would hardly affect your framerates TBH, as the frequency domain your 3080 sits in with a nice GPU-bound title is most often the 1900 MHz domain, so it's only the unusually high frequencies that are at stake. Probably there will be BIOS/driver updates soon, as that is the quickest and most solid fix.
Update 2: word right now is that NVIDIA is working on a new driver and has already provided it to AIBs/AICs for testing, though we have no idea what the driver actually changes. I've been making some rounds at board partners to check the status of this topic. There is also still doubt among AIBs that POSCAPs/SP-CAPs are responsible, and thus some AIBs think a capacitor (re)configuration is not needed. But everybody is working and heavily testing, and what is needed right now is time. Meanwhile, we've asked NVIDIA for a response. To be continued, but we have hopes that the issues can be solved with a driver and/or firmware revision.
Update 3: NVIDIA mentioned that POSCAP vs MLCC is not necessarily the issue.
NVIDIA yesterday released driver revision GeForce 456.55 WHQL (download); its description states: "The new Game Ready Driver also improves stability in certain games on RTX 30 Series GPUs." What NVIDIA did not mention is that it already holds a fix to help with the CTD issues:
Tim@NVIDIA - NVIDIA posted a driver this morning that improves stability. Regarding partner board designs, our partners regularly customize their designs and we work closely with them in the process. The appropriate number of POSCAP vs. MLCC groupings can vary depending on the design and is not necessarily indicative of quality.
So that means that NVIDIA is trying to fix the issue at driver level at this time. We've collected AIB responses here in this post.
Update 4: It seems the driver update really works (luckily). NVIDIA has capped the peak frequency, which no longer hits that 2100 MHz domain. We're currently discussing what effect this has on overclocking, as that is where the 'fix' would hurt the most.
Update 5: We've been examining the pre- and post-driver status to observe what NVIDIA has been doing. Today I tested many games, like DOOM Eternal, Strange Brigade, Control, Battlefield V and the extremely high-FPS Resident Evil, all without crashes at the three resolutions tested. Apparently it's titles like Horizon Zero Dawn that seem to be affected the most, specifically at Quad HD resolution.
Reports from the web are that the driver does fix the issue at hand for most if not all people. So what did NVIDIA do?
Earlier today we tested, analyzed (and confirmed) that NVIDIA has been tweaking the clock frequencies and voltages. Our homegrown AfterBurner can analyze this and help here. Below you can compare the VF curves of the 456.38 and the new 456.55 driver; the curve is now slightly different and clearly shifted to precisely 2000 MHz in the upper range. So NVIDIA has taken the edge off the frequency, and a slightly lower voltage seems to be applied as well. The plot below is based on the FE card, not even an AIB card. So NVIDIA is applying this driver-wide, and for its own Founders Edition cards as well.
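For the curious, the effect on the upper end of the curve can be mimicked with a few lines of Python. This is only a sketch of a frequency ceiling: the voltage/frequency points are invented for illustration (they are not read from either driver), the function name is hypothetical, and the slightly lower voltage we observed is not modeled.

```python
def cap_vf_curve(curve, ceiling_mhz=2000):
    """Clamp every frequency point of a VF curve to a ceiling.

    `curve` is a list of (voltage_mV, frequency_MHz) tuples.
    Points at or below the ceiling are returned unchanged.
    """
    return [(mv, min(mhz, ceiling_mhz)) for mv, mhz in curve]


# Illustrative points only, NOT measured values from 456.38 or 456.55.
example_curve = [
    (900, 1905),
    (950, 1950),
    (1000, 1995),
    (1050, 2040),
    (1081, 2070),
]

capped_curve = cap_vf_curve(example_curve)

if __name__ == "__main__":
    for (mv, before), (_, after) in zip(example_curve, capped_curve):
        note = "  <- capped" if after < before else ""
        print(f"{mv} mV: {before} MHz -> {after} MHz{note}")
```

Everything below the ceiling is left alone, which matches what we see in practice: only the very top of the range changes.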
During testing we also re-ran the benchmarks, and the effects are close to zero: at 100 FPS you'd perhaps see a 1 FPS differential, but that can just as easily be attributed to random variation. The reason there is so little performance decrease is simple: not many games drive the GPU all the way to the end of the spectrum at, say, 2050 MHz. That's isolated to very few titles, as most games are GPU bound and hover in the 1900 MHz domain.
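The arithmetic behind that is easy to sketch in Python. The function below assumes, pessimistically, that framerate scales linearly with GPU clock (in practice it usually scales less), and the 2000 MHz ceiling plus the 1900 and 2050 MHz example clocks are simply the round figures used in this article, not measurements.

```python
def fps_after_cap(fps, typical_mhz, cap_mhz=2000):
    """Worst-case framerate after a clock ceiling is applied.

    Assumption: FPS scales linearly with GPU clock, which overstates
    the loss for most real games.
    """
    effective_mhz = min(typical_mhz, cap_mhz)
    return fps * effective_mhz / typical_mhz


if __name__ == "__main__":
    # A GPU-bound title hovering around 1900 MHz never reaches the ceiling.
    print(fps_after_cap(100, 1900))            # 100.0
    # A light title that used to boost to 2050 MHz loses at most ~2.4%.
    print(round(fps_after_cap(100, 2050), 2))  # 97.56
```

So even under the harshest scaling assumption, a title has to boost beyond the ceiling before it loses anything at all, and then only a couple of percent.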
We think it's fixed at driver level this way, but this of course leaves open the topic of AIB OC products and tweaking stability. Granted, I have been stating in our reviews that Ampere seemed very hard to tweak. That picture fits well with what we have seen and read in the past couple of days. Typically in the past you had ~10% headroom for tweaking; these days it's just a few percent. I think the margins are so small these days (this goes for processors as well) that if something goes wrong, it falls outside that margin and immediately manifests as the behavior we have seen in the past couple of days.
Should you still experience CTDs and have a GeForce RTX 3080, we'd love to hear from you in the comment thread below. But at this point, it seems stabilized with the 456.55 driver band-aid.