Findings in Windows events during this kind of scenario:
1. The NetLogon service is paused.
2. Unable to roll back an operation on the NTDS database.
3. An attempt to write to edb.log failed.
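When the event logs are exported to text, the three findings above can be searched for in one pass. The following is a minimal sketch only: the pattern strings are assumptions loosely based on the wording of the findings, not the exact message text of any real event, and the function names are hypothetical.

```python
import re

# Hypothetical patterns derived from the three findings above.
# The exact wording in a real event log may differ, so adjust as needed.
SYMPTOM_PATTERNS = {
    "netlogon_paused": re.compile(r"net\s*logon.*paused", re.IGNORECASE),
    "rollback_failed": re.compile(r"unable to roll\s*back", re.IGNORECASE),
    "edb_log_write_failed": re.compile(r"write.*edb\.log.*failed", re.IGNORECASE),
}


def find_symptoms(lines):
    """Return a dict mapping each symptom name to the lines that match it."""
    hits = {name: [] for name in SYMPTOM_PATTERNS}
    for line in lines:
        for name, pattern in SYMPTOM_PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line)
    return hits


def all_symptoms_present(lines):
    """True only if every one of the three symptoms matched at least once."""
    hits = find_symptoms(lines)
    return all(hits[name] for name in hits)
```

If all three symptoms appear close together in time, that matches the scenario described here. A match on only one of them is weaker evidence and should be checked against the full event details.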
Directory Service Logs:
ESX Logs:
The events below were observed in the VM logs at the same point in time, prior to the NetLogon service pause.
Why this behavior?
Based on the events, Active Directory is encountering problems with read and write operations on the NTDS database.
The observed sequence of events indicates possible AD database corruption. After multiple failed attempts to update the directory database, users can no longer log on to AD, and as a proactive measure AD pauses the NetLogon service. As a result, users and machines are unable to authenticate and log on to the server or domain.
Suspected Causes:
Possible causes include:
· Database corruption
· The snapshot process causing a performance hit and freezing the system, especially disk I/O
· Antivirus scanning of the database and its associated files
This issue can also occur after an unsuccessful P2V conversion of the DC, or when the DC is restored from a snapshot.
Suggestions and Recommendations:
- Perform an offline defragmentation of the AD database.
- Check with the application team whether any specific tasks are running that interfere with the snapshot backup process and render the system unresponsive.
- Confirm the antivirus scan timings, and confirm that the NTDS folder and other AD-related folders are excluded from the scan selection list.
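The antivirus exclusion check in the last item can be done by comparing the configured exclusion list against the folders that need to be excluded. This is a rough sketch under stated assumptions: the two folder paths are the usual default locations and may differ on a given DC, the exclusion list is supplied by hand (nothing here reads it from an antivirus product), and the function names are hypothetical.

```python
from pathlib import PureWindowsPath

# Assumed default locations. A DC may keep the database, logs, or SYSVOL
# elsewhere, so replace these with the paths used in the actual environment.
REQUIRED_FOLDERS = [
    r"C:\Windows\NTDS",
    r"C:\Windows\SYSVOL",
]


def is_covered(folder, exclusions):
    """True if folder equals, or sits under, one of the excluded paths."""
    target = PureWindowsPath(folder)
    for item in exclusions:
        excluded = PureWindowsPath(item)
        # PureWindowsPath compares case-insensitively, as Windows paths do.
        if target == excluded or excluded in target.parents:
            return True
    return False


def missing_exclusions(exclusions, required=REQUIRED_FOLDERS):
    """Return the required folders that no exclusion entry covers."""
    return [folder for folder in required if not is_covered(folder, exclusions)]
```

For example, `missing_exclusions([r"c:\windows\ntds"])` reports that the SYSVOL folder is still being scanned. An empty result means every required folder is covered by at least one exclusion entry.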
I recommend creating another DC (VM) and moving all roles to the new DC, then demoting the old DC and, if required, promoting it as a DC again. This avoids having to perform an offline defragmentation, repair, or restore of the database.