This week one of my domain controllers developed a curious problem. I don't like curious problems, especially ones that rear their heads after the server reboots.
The error was an NTDS General event 2103, which indicates that the AD database "was restored using an unsupported procedure and Net Logon service has been paused". Some research turned up KB Article 875495, which lists event 2103 and three other events related to a condition known as USN Rollback.
This DC is running Windows 2003 SP2, so based on the article, I should be seeing at least the more serious NTDS Replication 2095 event as well, due to a hotfix in SP1 that made the error logging somewhat more verbose. But I'm not. This makes it more curious. Am I in a rollback state or not?
KB 875495 also lists some possible causes of this state, some of which can occur in a virtual environment, which is the case for this DC. It points to another article, KB 888794, which lists a number of considerations for hosting DCs as VMs. However, our environment met all the requirements, including one related to write caching on disks, as our host machine has battery-backed disk caching. So I ruled out that we had actively caused a potential rollback.
Repadmin has a switch (/showutdvec) that can be used to determine USN status by displaying the up-to-dateness vector USN for all DCs that replicate a common naming context. If the direct replication partners have a higher USN for the DC in question than that DC has for itself, that's considered evidence of a USN rollback. My DC did not have this problem, as it had a USN higher than its partners had recorded for it. So at this point I couldn't confirm or deny a true USN rollback issue; however, it seemed the DC "thought" it was having this problem. Maybe I could figure out why the DC was in this limbo.
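To make that comparison concrete, here is a minimal Python sketch of the check. It does not run or parse repadmin; it assumes you have already read the USN values off the /showutdvec output by hand. The function name, DC names and USN numbers are all made up for illustration.

```python
from typing import Dict, List


def rollback_suspects(own_usn: int, partner_views: Dict[str, int]) -> List[str]:
    """Return the partners that have recorded a higher USN for this DC
    than the DC reports for itself (the sign of a possible USN rollback).

    own_usn       -- the USN the DC in question shows for itself
    partner_views -- partner name -> USN that partner holds for the DC
    """
    return sorted(
        partner for partner, usn in partner_views.items() if usn > own_usn
    )


# Hypothetical values, not taken from a real environment.
healthy = rollback_suspects(5000, {"DC2": 4990, "DC3": 5000})
suspect = rollback_suspects(5000, {"DC2": 5200, "DC3": 4990})

print(healthy)  # no partner is ahead of the DC
print(suspect)  # DC2 has seen a higher USN than the DC claims
```

An empty list matches what I saw on my DC: no partner was ahead of it, so the USNs alone gave no evidence of a rollback.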
So I returned to the original article to look for specific causes. One line reads, "Starting an AD domain controller whose AD database file was restored (copied) into place by using an imaging program such as Norton Ghost."
Thinking back, the conversion of this DC from physical to virtual did not go as smoothly as I would have hoped. I remembered I had to resolve some issue where I was getting an error in the logs related to the directory database file not being where the OS expected it, even though the path on the server hadn't changed during the conversion. It was odd at the time, but the posted fix seemed to clear the issue and I'd moved on.
I'm guessing that was the start of my issues: maybe the P2V process made the OS think the database was a different copy even though it wasn't. The result was that the server thought it had been rolled back, but the USNs never reflected a problem. So I decided it was better to be safe than sorry and assume this "limbo" condition was not how I wanted to leave things.
The resolution for USN rollback is a forced removal of the domain controller from AD. Since this is a DC in a child domain that's being phased out, very few changes happen in that domain, so I wasn't concerned about possibly losing changes that may have been made on that DC. It held only one FSMO role, which was easily seized by the other DC.
My decision now is between bringing up a replacement DC for this domain next week or just running one DC for the time being and trying to speed up the remaining tasks that need to be done before we can remove the child domain altogether.
But that's for another day!
Is this the only way?
This was the only solution I found. When it comes to domain controllers, I don't spend a lot of time trying to "fix" them. In many cases, it's simply easier to remove a misbehaving one and bring a fresh one up with a new name and use NTDSUTIL to remove any legacy references to the failed one.