The Joys of Upgrading to Windows 11 - PC Diagnostics and Recovery
As a developer, I know that code takes time to mature and stabilise so I'd been putting off Windows 11 for some time. Not to mention I don't have as much time to tinker with hardware as I did during my student days. Windows 10 was working great, everything was stable. If something went wrong I'd have to spend a lot of time fixing the machine.
However, after hearing people talk about it and sometimes being asked for tech support, I thought I might as well upgrade too so I'd know my way around the new operating system.
Unfortunately, it didn't all go so well. Perhaps it's because I upgraded the BIOS on my Gigabyte X570 Aorus Elite beforehand, from the factory 34q to 37d. There were major vulnerabilities with Ryzen CPUs that the upgrade would mitigate so I thought I might as well do that at the same time.
So, let me share that experience here and maybe it might be of use to someone who ends up with similar problems.
*Stock photo by Laura Cleffmann via Unsplash.
Windows 11 - The Upgrade and Experience
If you're lacking TPM hardware, you can actually disable the requirement: download the ISO, then select the option in Rufus when you're making a bootable USB. AMD Ryzen processors apparently have it built in, so my board already had the option under "Miscellaneous", but it's disabled by default on older versions of the BIOS.
First off, let's talk about the OS itself. Before the problems started happening, I did get to play around with the new operating system for several sessions.
After the first boot into Windows 11 I immediately noticed the new centred taskbar. I thought I'd be able to drag it to the side to give myself more vertical space like in previous versions of Windows, but was surprised to find you can't. Even with the "StuckRects3" registry hack it refused to move. It's a bit of a deal breaker for me. Vertical space on a 16:9 screen is already limited unless you use more than one screen or a high-res screen.
The rest of the OS UI feels similar to what Metro was on Windows 8, which was basically an alternative interface layered on top of the Windows 7 UI. In this case, it's Windows 10 underneath 11.
Blue Screens of Death Loop
A couple of hours in I suddenly had a blue screen with the error code 87 and the system couldn't automatically recover from it. It was stuck in a loop of blue screens until it eventually stopped and said I needed a recovery disk to repair it manually myself.
Backing Up Data When Windows is Broken
Backups should really have been done before the major OS upgrade, but if you find you can't get into Windows and still need to get files off the system drive, you can create a live USB drive.
Rufus allows you to make one easily. If you prefer the familiar Windows, you can download a Windows ISO from the official Microsoft site then use Rufus to create a "Windows to Go".
Or like me, you can download a Linux distro such as Linux Mint instead, which is much lighter and speedier to run. Then you can just plug in a USB drive and drag and drop files like you'd usually do.
So, I booted up with the Windows 11 USB I used to upgrade, then chose Repair, Troubleshoot, Advanced options, Command Prompt to get diagnosing.
In this case, C is where my copy of Windows 11 was installed, and it should be the same for most people. If you're not sure, you can see what drives are currently mounted by running diskpart and then entering list volume.
Day 1 - System File Restore
The first command I ran was the following to fix any file corruption due to disk faults.
chkdsk /f C:
The chkdsk scan didn't take long on my 2TB PCIe 4.0 M.2 SSD. There were a few file issues which the tool repaired. The drive was only 2 years old and SMART wasn't reporting any issues either but I ran a bad sector scan anyway:
chkdsk /r C:
Sure enough, just as SMART reported, there were no bad sectors, which should be the case with flash drives. So the next thing was to check for corrupted or missing system files, which is the job of Windows' System File Checker (sfc).
Everything was reportedly fine according to the tool, so after Googling a bit I read about the DISM tool, which you can run with:
dism /online /cleanup-image /restorehealth
This didn't work because "online" means the copy of Windows you're currently logged into, and I was on the recovery disk. After some more searching I found you could specify the installation you wanted to fix by running:
dism /image:C: /cleanup-image /restorehealth
At this point, the DISM tool still wouldn't work because the "source files could not be found" for repair. After some more searching I found the source files were actually inside a Windows Image file and you have to tell the tool which edition you're trying to patch up. You run the following to find the index of the edition of Windows you're after.
dism /get-wiminfo /wimfile:I:\sources\install.wim
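If you'd rather not scan the list by eye, picking the index out of that output can be sketched in a few lines of Python. This is only a rough sketch: the sample text below is illustrative (the edition names and their positions are assumptions, and your ISO may list them differently), and find_image_index is a hypothetical helper, not part of DISM.

```python
# Illustrative sample only: DISM prints "Index : N" / "Name : ..." pairs,
# but these particular editions and index numbers are assumptions.
SAMPLE_OUTPUT = """\
Index : 1
Name : Windows 11 Home
Description : Windows 11 Home

Index : 6
Name : Windows 11 Pro
Description : Windows 11 Pro
"""


def find_image_index(wiminfo_text, edition_name):
    """Return the index whose Name line equals edition_name, or None."""
    current_index = None
    for line in wiminfo_text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Index":
            current_index = int(value)
        elif key == "Name" and value == edition_name:
            return current_index
    return None


print(find_image_index(SAMPLE_OUTPUT, "Windows 11 Pro"))
```

The match is on the exact edition name, so "Windows 11 Pro" won't accidentally pick up a longer name such as "Windows 11 Pro N".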
I is the drive letter for my USB drive and I was using Windows 11 Pro so the index was 6 for me. Now to patch up any corrupted system files I can run:
dism /image:C: /cleanup-image /restorehealth /source:wim:I:\sources\install.wim:6 /limitaccess
But wait... Now the progress bar was stuck and wouldn't even move 0.1%. It turns out the default space DISM uses for storing and patching files isn't big enough for it to work with. To fix that, you have to create a directory somewhere such as on the C drive itself then point the tool to it.
dism /image:C: /cleanup-image /restorehealth /scratchdir:C:\temp_scratchfolder /source:wim:I:\sources\install.wim:6 /limitaccess
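That's a lot of switches to get right by hand, so here's a small Python sketch of how the pieces fit together. The function name and parameters are my own for illustration; the drive letters, index and scratch folder are the values from my machine and will likely differ on yours.

```python
def build_dism_repair_command(image_drive, wim_path, index, scratch_dir):
    """Assemble an offline /restorehealth command from its parts.

    Hypothetical helper: it only builds the string, it doesn't run DISM.
    Paths containing spaces would need quoting, which isn't handled here.
    """
    return " ".join([
        "dism",
        f"/image:{image_drive}",            # the broken Windows installation
        "/cleanup-image",
        "/restorehealth",
        f"/scratchdir:{scratch_dir}",       # roomier working folder
        f"/source:wim:{wim_path}:{index}",  # known-good files plus edition index
        "/limitaccess",                     # don't try Windows Update
    ])


print(build_dism_repair_command(
    "C:", r"I:\sources\install.wim", 6, r"C:\temp_scratchfolder"))
```

The only parts that change between machines are the four arguments; the rest of the command stays the same.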
The file scan itself takes less than 10 minutes but again, nothing was wrong according to DISM.
Day 2 - Rolling Back and Driver Updates
With Windows 11 continuing to give me the blue screen of death, I thought I'd just go back to Windows 10. Unfortunately, that didn't go well either and the same sort of thing happened. Sometimes setup didn't even reach the "install" screen before a blue screen appeared.
Some more searching suggested it could possibly be driver issues and AMD hardware support has always been poor with Windows. I wondered if it had something to do with the stock drivers that came with the Windows 10 installation.
I never did install Windows 10 from scratch. I just upgraded for free from my copy of Windows 7, so most of the drivers were already there. It was possible the stock drivers didn't like my hardware. For the record it's:
Gigabyte X570 Aorus Elite
Ryzen 9 3950X 16 Core (Water cooled)
64GB 3200MHz DDR4 Corsair Vengeance LPX RAM (4x 16GB modules)
EVGA GeForce GTX 1080
2TB PCIe 4.0 M.2 SSD
I made several attempts to install Windows 10 22H2 which ended up with blue screens so I thought I'd try older versions such as 21H1 and 19H1 because other people had issues with 22H2. Microsoft doesn't officially give you the option to download from them but if you use the Rufus tool (v3.21 at the time of writing), you can click on the dropdown arrow next to Select to download older ISOs from their servers.
You must enable auto-updates in Rufus' settings first before the dropdown arrow appears.
Unfortunately, older versions of Windows 10 did not fare much better. Still thinking it was the stock Windows setup drivers, I retried several times and eventually got the OS installed. At that point, I immediately updated the AMD chipset drivers with Gigabyte's own.
Everything seemed stable for a while but the blue screens started to kick in again.
Day 3 - Hardware Testing
I tried reverting the BIOS upgrade and re-installing Windows 11 but again, I was stuck with the same blue screen loop.
After doing some more searching, there were suggestions it could be a hardware fault, possibly with storage or RAM. Most of the blue screens were ntfs.sys related, which suggests the system drive breaking down, but given the disk scan results from before there shouldn't be anything wrong with storage.
On the other hand, there were signs of failing RAM:
- The blue screens have different error messages; they're not the same each time.
- Blue screens occur inconsistently.
- Downloads break midway with "network error".
- Every attempt at installing new software fails, claiming the installation is corrupted.
So it was worth testing out the RAM. I was out of options software-wise anyway.
Windows has its own RAM tester named Windows Memory Diagnostic. I tried that first and, surprisingly, "hardware problems were detected". The downside with Windows tools is that results are mostly logged inside proprietary files, and in this case you're supposed to use Event Viewer to read them.
If you can't even get into Windows, how can you read the results?
So I decided to continue with MemTest86 instead. There's also a MemTest86+ version that runs with a more dated UI, so I chose to go with the former. Again, you'll need a spare USB drive to install the tool on and boot from.
Default test options are fine but the tool itself suggests running in parallel mode so that it provides a more realistic load for multi-core processors. It's also just slightly faster than the default single core tests.
One pass for one of my 16GB 3200MHz DDR4 RAM modules takes about 1 hour in single core mode. Parallel mode is around 10 minutes faster. It's recommended to go through the full 4 passes so the RAM has undergone some proper stress.
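To give an idea of how long a full round of testing takes, here's the arithmetic as a quick Python sketch. It assumes each module is tested on its own, one after the other, and that the time simply scales with the number of passes, which is only an approximation; memtest_minutes is just a name I made up.

```python
def memtest_minutes(modules, passes, minutes_per_pass):
    """Rough total test time, assuming modules are tested one at a time."""
    return modules * passes * minutes_per_pass


# 4 modules, 4 passes each; about 60 min per pass single core,
# about 50 min per pass in parallel mode (figures from the text above).
single = memtest_minutes(4, 4, 60)
parallel = memtest_minutes(4, 4, 50)

print(f"single core: {single // 60} h {single % 60} min")
print(f"parallel:    {parallel // 60} h {parallel % 60} min")
```

So on those assumptions, testing all four modules properly works out to around 16 hours in single core mode, or a bit over 13 hours in parallel mode.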
Interestingly enough, I noticed two of my Corsair Vengeance RAM modules had identical serial numbers, which some say may suggest they're counterfeit... maybe they were the issue. Unfortunately, after all the scans were complete, none of the modules had errors.
So again, I was baffled.
Next I thought I'd try installing Windows 10 22H2 with a single module of RAM since they were working apparently. The install went through smoothly and I could use Windows without any blue screens! However, as soon as I plugged in the other 3 modules the blue screen loop started...
Searching around for similar issues related to the number of modules installed, I discovered some people were getting blue screens when their RAM was running in dual channel mode, due to bad modules or incompatible RAM. In my case, the modules were all of the same spec and brand and had tested fine, so compatibility couldn't be the issue.
A single stick of RAM was stable so the channel mode might have been the issue. To test, I installed another module so that they were running in dual channel mode. You'll have to check your motherboard manual as you may have to use certain slots to run two modules in this mode. You may think you can have three modules and keep dual channel mode but not all boards support "flex" mode that juggles between single and dual channel mode.
Anyway, after installing two modules, dual channel mode was running and everything was still stable. Next I plugged in the remainder of the RAM and at this point the blue screens started to happen again.
Just to verify it wasn't the RAM modules themselves that were the issue (even though the tests were fine), I tried running the system again using the other two modules but it was still fine.
So it looks like there's an issue with the motherboard or BIOS when I have 4 RAM modules installed.
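The process of elimination above can be laid out as a small Python sketch. The module labels A to D are hypothetical (I didn't label my sticks, and which single stick I tried first is an assumption); the stable/unstable results are the ones described above.

```python
# (modules installed, was the system stable?)
# Labels A-D are made up for illustration.
results = [
    ({"A"}, True),                  # single module
    ({"A", "B"}, True),             # first pair, dual channel
    ({"C", "D"}, True),             # the other pair
    ({"A", "B", "C", "D"}, False),  # all four installed
]


def suspect_modules(results):
    """Modules that never appeared in any stable configuration."""
    all_modules = set().union(*(mods for mods, _ in results))
    cleared = set().union(*(mods for mods, stable in results if stable))
    return all_modules - cleared


def unstable_counts(results):
    """Module counts at which the system blue screened."""
    return sorted({len(mods) for mods, stable in results if not stable})


print(suspect_modules(results))
print(unstable_counts(results))
```

Every module shows up in at least one stable configuration, so no single stick is left as a suspect, and the only failing configuration is the one with all four installed.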
Now I run the system on Windows 10 22H2 with half the RAM I initially had, which is enough, but I'm still baffled how an attempt to upgrade to Windows 11 22H2 caused so many issues. The Windows 7 to 10 upgrade was a lot smoother, but then again I never did update the BIOS back then. Even so, I don't quite see how the BIOS update would have broken something. If it had, the machine should have stopped working right away, or at least worked again once I reverted.
In any case, I'm just glad my machine's up and running again; still with 32GB of RAM in dual channel mode much like my last build. 3 days of diagnosing was enough really.