How to Identify and Resolve Azure VM OS Blue Screen Issues after Windows Update Failures

Issue: When trying to connect to the VM, we see a blue screen. This started happening after Windows updates failed.

Overview
Blue screens are generally caused by problems with the computer's hardware or its drivers. Sometimes they are caused by issues in the Windows kernel or by mismatched registry entries.
When a blue screen occurs, Windows automatically writes a memory dump file (for example, a "minidump") that contains information about the crash and saves it to disk. We can inspect these dump files to help identify the cause of the blue screen.

Root Cause Analysis:
After going through the memory dump file in detail, I found that a few registry entries had been replaced by the recent failed Windows updates. As a result, the OS could not load with the existing system boot entries, which caused the OS failure.

How to fix the issue and start the VM successfully:

Since this is a VM, we can't boot into Safe Mode or Safe Mode with Networking to troubleshoot. Instead, we have to perform the following steps.
  1. Take a snapshot of the failed OS disk of the VM.
  2. Create a new disk from that snapshot.
  3. Attach that disk to any other running VM as a secondary (data) disk for troubleshooting.
  4. After going through the memory dump file, I found some clues about the likely cause. I then found some detailed online articles and followed their instructions, running the commands below against the attached disk (mounted as F:):
reg load HKLM\BROKENSYSTEM f:\windows\system32\config\SYSTEM.hiv

REM Enable Serial Console
bcdedit /store f:\boot\bcd /set {bootmgr} displaybootmenu yes 
bcdedit /store f:\boot\bcd /set {bootmgr} timeout 10 
bcdedit /store f:\boot\bcd /set {bootmgr} bootems yes 
bcdedit /store f:\boot\bcd /ems {<BOOT LOADER IDENTIFIER>} ON 
bcdedit /store f:\boot\bcd /emssettings EMSPORT:1 EMSBAUDRATE:115200

REM Suggested configuration to enable OS Dump
REG ADD "HKLM\BROKENSYSTEM\ControlSet001\Control\CrashControl" /v CrashDumpEnabled /t REG_DWORD /d 2 /f 
REG ADD "HKLM\BROKENSYSTEM\ControlSet001\Control\CrashControl" /v DumpFile /t REG_EXPAND_SZ /d "%SystemRoot%\MEMORY.DMP" /f 
REG ADD "HKLM\BROKENSYSTEM\ControlSet001\Control\CrashControl" /v NMICrashDump /t REG_DWORD /d 1 /f 
REG ADD "HKLM\BROKENSYSTEM\ControlSet002\Control\CrashControl" /v CrashDumpEnabled /t REG_DWORD /d 2 /f 
REG ADD "HKLM\BROKENSYSTEM\ControlSet002\Control\CrashControl" /v DumpFile /t REG_EXPAND_SZ /d "%SystemRoot%\MEMORY.DMP" /f 
REG ADD "HKLM\BROKENSYSTEM\ControlSet002\Control\CrashControl" /v NMICrashDump /t REG_DWORD /d 1 /f 
reg unload HKLM\BROKENSYSTEM

  5. Back in Registry Editor, the BROKENSYSTEM hive looked abnormal, with a lot of entries missing. Checking the file system showed that we had loaded the wrong hive file (SYSTEM.hiv instead of SYSTEM), so we did the following with the correct one:
     a. Load the hive f:\windows\system32\config\SYSTEM.
     b. Navigate to ControlSet001\Control\CrashControl and ControlSet002\Control\CrashControl.
     c. Set the AutoReboot value to 0 and the CrashDumpEnabled value to 1.
     d. Tried to disable 'debug', but this failed because the command was wrong.
     e. Unload the hive.
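
Step 5 can be sketched as the commands below, run from an elevated PowerShell prompt on the recovery VM. This is only a minimal sketch: it assumes the broken OS disk is mounted as F: (as in the earlier commands), that both ControlSet001 and ControlSet002 exist on that disk, and that the two values are stored as REG_DWORD. The variable names are made up for illustration.

```powershell
# Assumption: the broken OS disk is attached to the recovery VM as F:
$hivePath = 'F:\windows\system32\config\SYSTEM'   # SYSTEM, not SYSTEM.hiv
$mountKey = 'HKLM\BROKENSYSTEM'                   # temporary key for the loaded hive

# a. Load the offline SYSTEM hive under the temporary key
reg load $mountKey $hivePath

# b/c. Set the crash-control values in each control set
# Assumption: ControlSet001 and ControlSet002 both exist on this disk
foreach ($controlSet in 'ControlSet001', 'ControlSet002') {
    $key = "$mountKey\$controlSet\Control\CrashControl"
    reg add $key /v AutoReboot /t REG_DWORD /d 0 /f
    reg add $key /v CrashDumpEnabled /t REG_DWORD /d 1 /f
}

# e. Unload the hive so the changes are written back to the disk
reg unload $mountKey
```

If the unload fails with an access error, close Registry Editor (or navigate away from the BROKENSYSTEM key) and run the last command again.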

  6. We learned that a Windows update had been installed before the RDP failure / no-boot issue occurred, so we decided to run the command below to revert the image to the state before the update was installed.
dism.exe /image:F:\ /cleanup-image /revertpendingactions
 
  7. After the dism operation finishes successfully, detach the fixed disk (bluescreen) from the recovery VM. Then use PowerShell to swap the fixed disk in as the OS disk of the original VM.
$name = 'AZ2TFSEVLDT1'
$resourceGroupName = 'bgitfswrg'
$diskname='bluescreen'
$diskResourceInstanceID="<Resource instance ID of the fixed disk>"
 
# Get the VM details
$vm = Get-AzureRmVM -ResourceGroupName $resourceGroupName -Name $name
 
# Point the VM's OS disk at the fixed disk and apply the change
Set-AzureRmVMOSDisk -VM $vm -Name $diskname -ManagedDiskId $diskResourceInstanceID | Update-AzureRmVM
 
  8. The VM then boots up successfully.
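
Steps 1 to 3 (snapshot, new disk, attach to a recovery VM) were done without any commands shown. As a rough sketch, they could be scripted with the same AzureRM module used in step 7. The VM names, snapshot name, and LUN below are placeholders and assumptions rather than values from the original incident. The sketch also assumes the VM uses managed disks and that the recovery VM is in the same resource group and region.

```powershell
# Hypothetical values for illustration; replace with your own
$resourceGroupName = '<resource group>'
$brokenVmName      = '<name of the VM that blue-screens>'
$recoveryVmName    = '<name of a healthy VM in the same region>'
$snapshotName      = 'bluescreen-snapshot'
$diskName          = 'bluescreen'

# Step 1: snapshot the OS disk of the failing VM
$brokenVm = Get-AzureRmVM -ResourceGroupName $resourceGroupName -Name $brokenVmName
$snapshotConfig = New-AzureRmSnapshotConfig `
    -SourceUri $brokenVm.StorageProfile.OsDisk.ManagedDisk.Id `
    -Location $brokenVm.Location -CreateOption Copy
$snapshot = New-AzureRmSnapshot -ResourceGroupName $resourceGroupName `
    -SnapshotName $snapshotName -Snapshot $snapshotConfig

# Step 2: create a new managed disk from the snapshot
$diskConfig = New-AzureRmDiskConfig -Location $snapshot.Location `
    -SourceResourceId $snapshot.Id -CreateOption Copy
$disk = New-AzureRmDisk -ResourceGroupName $resourceGroupName `
    -DiskName $diskName -Disk $diskConfig

# Step 3: attach the new disk to the recovery VM as a data disk
# Assumption: LUN 1 is free on the recovery VM
$recoveryVm = Get-AzureRmVM -ResourceGroupName $resourceGroupName -Name $recoveryVmName
$recoveryVm = Add-AzureRmVMDataDisk -VM $recoveryVm -Name $diskName `
    -ManagedDiskId $disk.Id -CreateOption Attach -Lun 1
Update-AzureRmVM -ResourceGroupName $resourceGroupName -VM $recoveryVm
```

Working on a copy made from a snapshot leaves the original OS disk untouched, so the VM can be returned to its previous state if the repair makes things worse.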
