Ask the Directory Services Team

Does your logon hang after a password change on Windows 8.1/2012 R2/Windows 10?


Hi, Linda Taylor here, Senior Escalation Engineer from the Directory Services team in the UK.

I have been working on this issue which seems to be affecting many of you globally on Windows 8.1, Windows Server 2012 R2 and Windows 10, so I thought it would be a good idea to explain the issue and workarounds while we continue to work on a proper fix here.

The symptom is that after a password change, logon hangs forever on the Welcome screen.


How annoying….

The underlying issue is a deadlock between several components including DPAPI and the redirector.

For full details on the issue, workarounds and related fixes, check out my post on the ASKPFEPLAT blog here: http://blogs.technet.com/b/askpfeplat/archive/2016/01/11/does-your-win-8-1-2012-r2-win10-logon-hang-after-a-password-change.aspx

This is now fixed in the following updates:

Windows 8.1, 2012 R2, 2012 install:

For Windows 10 TH2 build 1511 install:

I hope this helps,

Linda


Previewing Server 2016 TP4: Temporary Group Memberships


Disclaimer: Windows Server 2016 is still in a Technical Preview state – the information contained in this post may become inaccurate in the future as the product continues to evolve.

Hello, Ryan Ries here again with some juicy new Active Directory hotness. Windows Server 2016 is right around the corner, and it’s bringing a ton of new features and improvements with it. Today we’re going to talk about one of the new things you’ll be seeing in Active Directory, which you might see referred to as “expiring links,” or what I like to call “temporary group memberships.”

One of the challenges that every security-conscious Active Directory administrator has faced is how to deal with contractors, vendors, temporary employees and anyone else who needs temporary access to resources within your Active Directory environment. Let’s pretend that your Information Security team wants to perform an automated vulnerability scan of all the devices on your network, and to do this, they will need a service account with Domain Administrator privileges for 5 business days. Because you are a wise AD administrator, you don’t like the idea of this service account that will be authenticating against every device on the network having Domain Administrator privileges, but the CTO of the company says that you have to give the InfoSec team what they want.

(Trust me, this stuff really happens.)

So you strike a compromise, claiming that you will grant this service account temporary membership in the Domain Admins group for 5 days while the InfoSec team conducts their vulnerability scan. Now you could just manually remove the service account from the group after 5 days, but you are a busy admin and you know you’re going to forget to do that. You could also set up a scheduled task to run after 5 days that runs a script that removes the service account from the Domain Admins group, but let’s explore a couple of more interesting options.
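
For what it's worth, the scheduled-task route might look something like this rough sketch. The task name is made up, and 'InfoSecSvcAcct' is the example service account used later in this post; the task fires once, five days from now, and removes the account from Domain Admins.

  # Sketch only: register a one-shot task that removes the temporary Domain Admins membership.
  $command = "Remove-ADGroupMember -Identity 'Domain Admins' -Members 'InfoSecSvcAcct' -Confirm:`$false"
  $action  = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument "-NoProfile -Command `"$command`""
  $trigger = New-ScheduledTaskTrigger -Once -At (Get-Date).AddDays(5)
  Register-ScheduledTask -TaskName 'Remove temp DA for InfoSec' -Action $action -Trigger $trigger -RunLevel Highest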

The Old Way

One old-school way of accomplishing this is through the use of dynamic objects in 2003 and later. Dynamic objects are automatically deleted (leaving no tombstone behind) after their entryTTL expires. Using this knowledge, our plan is to create a security group called “Temp DA for InfoSec” as a dynamic object with a TTL (time-to-live) of 5 days. Then we’re going to put the service account into the temporary security group. Then we are going to add the temporary security group to the Domain Admins group. The service account is now a member of Domain Admins because of the nested group membership, and once the temporary security group automatically disappears in 5 days, the nested group membership will be broken and the service account will no longer be a member of Domain Admins.

Creating dynamic objects is not as simple as just right-clicking in AD Users & Computers and selecting “New > Dynamic Object,” but it’s still pretty easy if you use ldifde.exe and a simple text file. Below is an example:

Figure 1: Creating a Dynamic Object with ldifde.exe.

dn: cn=Temp DA For InfoSec,ou=Information Security,dc=adatum,dc=com
changeType: add
objectClass: group
objectClass: dynamicObject
entryTTL: 432000
sAMAccountName: Temp DA For InfoSec

In the text file, just supply the distinguished name of the security group you want to create, and make sure it has both the group objectClass and the dynamicObject objectClass. I set the entryTTL to 432000 in the screen shot above, which is 5 days in seconds. Import the object into AD using the following command:
  ldifde -i -f dynamicGroup.txt

Now if you go look at the newly-created group in AD Users & Computers, you’ll see that it has an entryTTL attribute that is steadily counting down to 0:

Figure 2: Dynamic Security Group with an expiry date.
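
If you prefer PowerShell over the GUI, you can watch the countdown there too. This is just a sketch; entryTTL is a constructed attribute, so it has to be requested explicitly:

  Get-ADObject -Identity 'cn=Temp DA For InfoSec,ou=Information Security,dc=adatum,dc=com' `
      -Properties entryTTL | Select-Object Name, entryTTL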

You can create all sorts of objects as Dynamic Objects by the way, not just groups. But enough about that. We came here to see how the situation has improved in Windows Server 2016. I think you’ll like it better than the somewhat convoluted Dynamic Objects solution I just described.

The New Hotness (Windows Server 2016 Technical Preview 4, version 1511.10586.122)

For our next trick, we’ll need to enable the Privileged Access Management Feature in our Windows Server 2016 forest. This is an AD optional feature, much like the AD Recycle Bin, and just like the AD Recycle Bin, once you enable it in your forest, you can’t turn it off. This feature also requires a Windows Server 2016 or “Windows Threshold” forest functional level:

Figure 3: This AD Optional Feature requires a Windows Server 2016 or “Windows Threshold” Forest Functional Level.

It’s easy to enable with PowerShell:
Enable-ADOptionalFeature 'Privileged Access Management Feature' -Scope ForestOrConfigurationSet -Target adatum.com

Now that you’ve done this, you can start setting time limits on group memberships directly. It’s so easy:
Add-ADGroupMember -Identity 'Domain Admins' -Members 'InfoSecSvcAcct' -MemberTimeToLive (New-TimeSpan -Days 5)

Now isn’t that a little easier and more straightforward? Our InfoSec service account now has temporary membership in the Domain Admins group for 5 days. And if you want to view the time remaining in a temporary group membership in real time:
Get-ADGroup 'Domain Admins' -Property member -ShowMemberTimeToLive

Figure 4: Viewing the time-to-live on a temporary group membership.

So that’s cool, but in addition to convenience, there is a real security benefit to this feature that we’ve never had before. I’d be remiss not to mention that with the new Privileged Access Management feature, when you add a temporary group membership like this, the domain controller will actually constrain the Kerberos TGT lifetime to the shortest TTL that the user currently has. What that means is that if a user account only has 5 minutes left in its Domain Admins membership when it logs on, the domain controller will give that account a TGT that’s only good for 5 more minutes before it has to be renewed, and when it is renewed, the PAC (privilege attribute certificate) will no longer contain that group membership! You can see this in action using klist.exe:

Figure 5: My Kerberos ticket is only good for about 8 minutes because of my soon-to-expire group membership.

Awesome.

Lastly, it’s worth noting that this is just one small aspect of the upcoming Privileged Access Management feature in Windows Server 2016. There’s much more to it, like shadow security principals, bastion forests, new integrations with Microsoft Identity Manager, and more. Read more about what’s new in Windows Server 2016 here.

Until next time,

Ryan “Domain Admin for a Minute” Ries

Are your DCs too busy to be monitored?: AD Data Collector Set solutions for long report compile times or report data deletion


Hi all, Herbert Mauerer here. In this post we’re back to talk about the built-in AD Diagnostics Data collector set available for Active Directory Performance (ADPERF) issues and how to ensure a useful report is generated when your DCs are under heavy load.

Why are my domain controllers so busy you ask? Consider this: Active Directory stands in the center of the Identity Management for many customers. It stores the configuration information for many critical line of business applications. It houses certificate templates, is used to distribute group policy and is the account database among many other things. All sorts of network-based services use Active Directory for authentication and other services.

As mentioned there are many applications which store their configuration in Active Directory, including the details of the user context relative to the application, plus objects specifically created for the use of these applications.

There are also applications that use Active Directory as a store to synchronize directory data. There are products like Forefront Identity Manager (and now Microsoft Identity Manager) where synchronizing data is the only purpose. I will not discuss whether these applications are meta-directories or virtual directories, or what class our Office 365 DirSync belongs to…

One way or the other, the volume and complexity of Active Directory queries has a constant trend of increasing, and there is no end in sight.

So what are my Domain Controllers doing all day?

We get this question a lot from our customers. It often seems as if the AD admins are the last to know what kind of load is put onto the domain controllers by scripts, applications and synchronization engines, and they are often not made aware of even significant application changes.

But even small changes can have a drastic effect on the DC performance. DCs are resilient, but even the strongest warrior may fall against an overwhelming force.  Think along the lines of “death by a thousand cuts”.  Consider applications or scripts that run non-optimized or excessive queries on many, many clients during or right after logon and it will feel like a distributed DoS. In this scenario, the domain controller may get bogged down due to the enormous workload issued by the clients. This is one of the classic scenarios when it comes to Domain Controller performance problems.

What resources exist today to help you troubleshoot AD Performance scenarios?

We have already discussed the overall topic in this blog, and today many customer requests start with the complaint that the response times are bad and the LSASS CPU time is high. There also is a blog post specifically on the toolset we’ve had since Windows Server 2008. We also updated and brought back the Server Performance Advisor toolset. This toolset is now more targeted at trend analysis and base-lining.  If a video is more your style, Justin Turner revealed our troubleshooting process at Ignite.

The reports generated by this data collection are hugely useful for understanding what is burdening the Domain Controllers. There are fewer cases where DCs are responding slowly, but there is no significant utilization seen. We released a blog on that scenario and also gave you a simple method to troubleshoot long-running LDAP queries at our sister site.  So what’s new with this post?

The AD Diagnostic Data Collector set report “report.html” is missing or compile time is very slow

In recent months, we have seen an increasing number of customers with incomplete Data Collector Set reports. Most of the time, the “report.html” file is missing:

This is a folder where the creation of the report.html file was successful:


This folder has exceeded the limits for reporting:


Notice the report.html file is missing in the second folder example. Also take note that the ETL and BLG files are bigger. What’s the reason for this?

Here is what we uncovered about the Data Collector Set report generation process:

  • When the data collection ends, the process “tracerpt.exe” is launched to create a report for the folder where the data was collected.
  • “tracerpt.exe” runs with “below normal” priority so it does not get full CPU attention especially if LSASS is busy as well.
  • “tracerpt.exe” runs with one worker thread only, so it cannot take advantage of more than one CPU core.
  • “tracerpt.exe” accumulates RAM usage as it runs.
  • “tracerpt.exe” has six hours to complete a report. If it is not done within this time, the report is terminated.
  • The default settings of the system AD data collector delete the biggest data set first when the collection exceeds the 1 gigabyte limit. The biggest single file in the reports is typically “Active Directory.etl”.  The report.html file will not get created if this file does not exist.

I worked with a customer recently with a pretty well-equipped Domain Controller (24 server-class CPUs, 256 GB RAM). The customer was kind enough to run a few tests for various report sizes, and found the following metrics:

  • Until the time-out of six hours is hit, “tracerpt.exe” consumes up to 12 GB of RAM.
  • During this time, one CPU core was allocated 100%. If a DC is in a high-load condition, you may want to increase the base priority of “tracerpt.exe” to get the report to complete (see the sketch after this list). This comes at the expense of CPU time, potentially impacting the server’s primary purpose and, in turn, its clients.
  • The biggest data set that could be completed within the six hours had an “Active Directory.etl” of 3 GB.
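
If you decide to bump the priority as mentioned above, a hedged sketch from an elevated PowerShell prompt might look like this (weigh the extra CPU against LSASS before doing it on a busy DC):

  # Sketch: raise the priority of any running tracerpt.exe instances so the report compiles faster.
  Get-Process tracerpt -ErrorAction SilentlyContinue |
      ForEach-Object { $_.PriorityClass = 'AboveNormal' }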

If you have lower-spec or busier machines, you shouldn’t expect the same results as this example (on a lower-spec machine with a 3 GB ETL file, the report.html file would likely fail to compile within the six-hour window).

What a bummer, how do you get Performance Logging done then?

Fortunately, there are a number of parameters for a Data Collector Set that come to the rescue. Before you can use any of them, you first need to create a custom Data Collector Set. You can then play with a variety of settings, based on the purpose of the collection.

In Performance Monitor you can create a custom set on the “User Defined” folder by right-clicking it, to bring up the New -> Data Collector Set option in the context menu:


This launches a wizard that prompts you for a number of parameters for the new set.

The first thing it wants is a name for the new set:


The next step is to select a template. It may be one of the built-in templates or one exported from another computer as an XML file you select through the “Browse” button. In our case, we want to create a clone of “Active Directory Diagnostics”:


The next step is optional; it specifies the storage location for the reports. You may want to select a volume with more space or lower IO load than the default volume:


There is one more page in the wizard, but there is no reason to make any more changes here. You can click “Finish” on this page.

The default settings are fine for an idle DC, but if you find your ETL files are too large, your reports are not generated, or it takes too long to process the data, you will likely want to make the following configuration changes.

For a real “Big Data Collector Set”, we first want to make some important changes to the storage strategy of the set. These settings are available under “Data Manager”:


The most relevant settings are “Resource Policy” and “Maximum Root Path Size”. I recommend starting with the settings as shown below:


Notice, I’ve changed the Resource policy from “Delete largest” to “Delete oldest”. I’ve also increased the Maximum root path size from 1024 to 2048 MB.  You can run some reports to learn what the best size settings are for you. You might very well end up using 10 GB or more for your reports.

The second crucial parameter for your custom sets is the run interval for the data collection. It is five minutes by default. You can adjust that in the properties of the collector in the “Stop Condition” tab. In many cases shortening the data collection is a viable step if you see continuous high load:


You should avoid going shorter than two minutes, as this is the maximum LDAP query duration by default. (If you have LDAP queries that reach this threshold, they would not show up in a report that is less than two minutes in length.) In fact, I would suggest the minimum interval be set to three minutes.

One very attractive option is automatically restarting the data collection when it exceeds a certain size. You need to use common sense when you look at the resulting set of reports, e.g. the ratio of long-running queries shown in each report reflects only that collection interval. But it is definitely better than no report.

If you expect to exceed the 1 GB limit often, you certainly should adjust the total size of collections (Maximum root path size) in the “Data Manager”.

So how do I know how big the collection is while running it?

You can take a look at the folder of the data collection in Explorer, but you will notice it is pretty lazy updating it with the current size of the collection:


Explorer only updates the folder if you are doing something with the files. It sounds strange, but attempting to delete a file will trigger an update:


Now that makes more sense…

If you see the log is growing beyond your expectations, you can manually stop it before the stop condition hits the threshold you have configured:


Of course, you can also start and stop the reporting from a command line using the logman instructions in this post.
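
As a rough sketch, the command-line equivalents look like this; "ADDS Perf Data" stands in for whatever you named your custom set (running logman query with no arguments lists the exact names it knows about):

  logman query                    # list all data collector sets and their status
  logman start "ADDS Perf Data"   # start the custom set manually
  logman stop "ADDS Perf Data"    # stop it before the stop condition is reached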

Room for improvement

We are aware there is room for improvement in getting bigger data sets reported in a shorter time. The good news is that most of these special configuration changes won’t be needed once your DCs are running Windows Server 2016. We will talk about that in a future post.

Thanks for reading.

Herbert

Setting up Virtual Smart card logon using Virtual TPM for Windows 10 Hyper-V VM Guests

Hello everyone, my name is Raghav and I’m a Technical Advisor for one of the Microsoft Active Directory support teams. This is my first blog post, and today I’ll share with you how to configure a Hyper-V environment to enable virtual smart card logon to VM guests by leveraging a new Windows 10 feature: virtual Trusted Platform Module (TPM).

Here’s a quick overview of the terminology discussed in this post:
  • Smart cards are physical authentication devices, which improve on the concept of a password by requiring that users actually have their smart card device with them to access the system, in addition to knowing the PIN, which provides access to the smart card.
  • Virtual smart cards (VSCs) emulate the functionality of traditional smart cards, but instead of requiring the purchase of additional hardware, they utilize technology that users already own and are more likely to have with them at all times. Theoretically, any device that can provide the three key properties of smart cards (non-exportability, isolated cryptography, and anti-hammering) can be commissioned as a VSC, though the Microsoft virtual smart card platform is currently limited to the use of the Trusted Platform Module (TPM) chip onboard most modern computers. This blog will mostly concern TPM virtual smart cards.
    For more information, read Understanding and Evaluating Virtual Smart Cards.
  • Trusted Platform Module - (As Christopher Delay explains in his blog) TPM is a cryptographic device that is attached at the chip level to a PC, Laptop, Tablet, or Mobile Phone. The TPM securely stores measurements of various states of the computer, OS, and applications. These measurements are used to ensure the integrity of the system and software running on that system. The TPM can also be used to generate and store cryptographic keys. Additionally, cryptographic operations using these keys take place on the TPM preventing the private keys of certificates from being accessed outside the TPM.
  • Virtualization-based security – The following Information is taken directly from https://technet.microsoft.com/en-us/itpro/windows/keep-secure/windows-10-security-guide
    • One of the most powerful changes to Windows 10 is virtual-based security. Virtual-based security (VBS) takes advantage of advances in PC virtualization to change the game when it comes to protecting system components from compromise. VBS is able to isolate some of the most sensitive security components of Windows 10. These security components aren’t just isolated through application programming interface (API) restrictions or a middle-layer: They actually run in a different virtual environment and are isolated from the Windows 10 operating system itself.
    • VBS and the isolation it provides is accomplished through the novel use of the Hyper-V hypervisor. In this case, instead of running other operating systems on top of the hypervisor as virtual guests, the hypervisor supports running the VBS environment in parallel with Windows and enforces a tightly limited set of interactions and access between the environments. Think of the VBS environment as a miniature operating system: It has its own kernel and processes. Unlike Windows, however, the VBS environment runs a micro-kernel and only two processes, called trustlets:
  • Local Security Authority (LSA) enforces Windows authentication and authorization policies. LSA is a well-known security component that has been part of Windows since 1993. Sensitive portions of LSA are isolated within the VBS environment and are protected by a new feature called Credential Guard.
  • Hypervisor-enforced code integrity verifies the integrity of kernel-mode code prior to execution. This is a part of the Device Guard feature.
VBS provides two major improvements in Windows 10 security: a new trust boundary between key Windows system components and a secure execution environment within which they run. A trust boundary between key Windows system components is enabled though the VBS environment’s use of platform virtualization to isolate the VBS environment from the Windows operating system. Running the VBS environment and Windows operating system as guests on top of Hyper-V and the processor’s virtualization extensions inherently prevents the guests from interacting with each other outside the limited and highly structured communication channels between the trustlets within the VBS environment and Windows operating system.
VBS acts as a secure execution environment because the architecture inherently prevents processes that run within the Windows environment – even those that have full system privileges – from accessing the kernel, trustlets, or any allocated memory within the VBS environment. In addition, the VBS environment uses TPM 2.0 to protect any data that is persisted to disk. Similarly, a user who has access to the physical disk is unable to access the data in an unencrypted form.
VBS requires a system that includes:
  • Windows 10 Enterprise Edition
  • A 64-bit processor
  • UEFI with Secure Boot
  • Second-Level Address Translation (SLAT) technologies (for example, Intel Extended Page Tables [EPT], AMD Rapid Virtualization Indexing [RVI])
  • Virtualization extensions (for example, Intel VT-x, AMD RVI)
  • I/O memory management unit (IOMMU) chipset virtualization (Intel VT-d or AMD-Vi)
  • TPM 2.0
Note: TPM 1.2 and 2.0 provide protection for encryption keys that are stored in the firmware. TPM 1.2 is not supported on Windows 10 RTM (Build 10240); however, it is supported in Windows 10, Version 1511 (Build 10586) and later.
Among other functions, Windows 10 uses the TPM to protect the encryption keys for BitLocker volumes, virtual smart cards, certificates, and the many other keys that the TPM is used to generate. Windows 10 also uses the TPM to securely record and protect integrity-related measurements of select hardware.
Now that we have the terminology clarified, let’s talk about how to set this up.

Setting up Virtual TPM
First we will ensure we meet the basic requirements on the Hyper-V host.
On the Hyper-V host, launch msinfo32 and confirm the following values:

The BIOS Mode should state “UEFI”.

Secure Boot State should be On.
Next, we will enable VBS on the Hyper-V host.
  1. Open up the Local Group Policy Editor by running gpedit.msc.
  2. Navigate to the following setting: Computer Configuration, Administrative Templates, System, Device Guard. Double-click Turn On Virtualization Based Security, set the policy to Enabled, and click OK.
Now we will enable Isolated User Mode on the Hyper-V host.
1. Go to Run, type appwiz.cpl, and in the left pane click Turn Windows Features on or off.
Check Isolated User Mode, click OK, and then reboot when prompted.
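
If you prefer to script this instead of clicking through appwiz.cpl, the optional feature can likely be enabled from an elevated PowerShell prompt. This is only a sketch; the feature name IsolatedUserMode is my assumption for this build, so confirm it with Get-WindowsOptionalFeature -Online first.

  Enable-WindowsOptionalFeature -Online -FeatureName IsolatedUserMode   # feature name is an assumption
  Restart-Computer                                                      # reboot when prompted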
This completes the initial steps needed for the Hyper-V host.
Now we will enable support for virtual TPM on your Hyper-V VM guest
Note: Support for Virtual TPM is only included in Generation 2 VMs running Windows 10.
To enable this on your Windows 10 generation 2 VM, open the VM settings and review the configuration under the Hardware, Security section. Enable Secure Boot and Enable Trusted Platform Module should both be selected.
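
The same settings can be applied with the Hyper-V PowerShell module on the host. This is a hedged sketch ("Win10-VM" is a placeholder VM name, and the VM should be off); a local key protector has to exist before the virtual TPM can be enabled:

  $vm = 'Win10-VM'                                        # placeholder name for the Gen 2 guest
  Set-VMFirmware     -VMName $vm -EnableSecureBoot On
  Set-VMKeyProtector -VMName $vm -NewLocalKeyProtector    # required before enabling the vTPM
  Enable-VMTPM       -VMName $vm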
That completes the virtual TPM part of the configuration. We will now move on to the virtual smart card configuration.
Setting up Virtual Smart Card
In the next section, we create a certificate template so that we can request a certificate that has the required parameters needed for Virtual Smart Card logon.
These steps are adapted from the following TechNet article: https://technet.microsoft.com/en-us/library/dn579260.aspx
Prerequisites and Configuration for Certificate Authority (CA) and domain controllers
  • Active Directory Domain Services
  • Domain controllers must be configured with a domain controller certificate to authenticate smartcard users. The following article covers Guidelines for enabling smart card logon: http://support.microsoft.com/kb/281245
  • An Enterprise Certification Authority running on Windows Server 2012 or Windows Server 2012 R2. Again, Chris’s blog covers neatly on how to setup a PKI environment.
  • Active Directory must have the issuing CA in the NTAuth store in order to authenticate users to Active Directory.
Create the certificate template
1. On the CA console (certsrv.msc), right-click Certificate Templates and select Manage.
2. Right-click the Smartcard Logon template and then click Duplicate Template
3. On the Compatibility tab, set the compatibility settings as below
4. On the Request Handling tab, in the Purpose section, select Signature and smartcard logon from the drop down menu
5. On the Cryptography tab, select the Requests must use one of the following providers radio button and then select the Microsoft Base Smart Card Crypto Provider option.
Optionally, you can use a Key Storage Provider (KSP) instead: under Provider Category, select Key Storage Provider, then select the Requests must use one of the following providers radio button and select the Microsoft Smart Card Key Storage Provider option.
6. On the General tab, specify a name, such as TPM Virtual Smart Card Logon. Set the validity period to the desired value and choose OK.

7. Back in the CA console, navigate to Certificate Templates. Right-click Certificate Templates and select New, then Certificate Template to Issue. Select the new template you created in the prior steps.

Note that it usually takes some time for this certificate to become available for issuance.

Create the TPM virtual smart card

Next we’ll create a virtual Smart Card on the Virtual Machine by using the Tpmvscmgr.exe command-line tool.

1. On the Windows 10 Gen 2 Hyper-V VM guest, open an Administrative Command Prompt and run the following command:
tpmvscmgr.exe create /name myVSC /pin prompt /adminkey random /generate
You will be prompted for a PIN. Enter at least eight characters and confirm the entry. (You will need this PIN in later steps.)

Enroll for the Virtual Smart Card certificate on the virtual machine.
1. In certmgr.msc, right-click Certificates, click All Tasks, then Request New Certificate.
2. On the certificate enrollment page, select the new template you created earlier.
3. It will prompt for the PIN associated with the Virtual Smart Card. Enter the PIN and click OK.
4. If the request completes successfully, the Certificate Installation Results page is displayed.
5. On the virtual machine, select Sign-in options, choose Security device, and enter the PIN.
That completes the steps on how to deploy Virtual Smart Cards using a virtual TPM on virtual machines.  Thanks for reading!

Raghav Mahajan

The Version Store Called, and They’re All Out of Buckets


Hello, Ryan Ries back at it again with another exciting installment of esoteric Active Directory and ESE database details!

I think we need to have another little chat about something called the version store.

The version store is an inherent mechanism of the Extensible Storage Engine and a commonly seen concept among databases in general. (ESE is sometimes referred to as Jet Blue. Sometimes old codenames are so catchy that they just won’t die.) Therefore, the following information should be relevant to any application or service that uses an ESE database (such as Exchange,) but today I’m specifically focused on its usage as it pertains to Active Directory.

The version store is one of those details that the majority of customers will never need to think about. The stock configuration of the version store for Active Directory will be sufficient to handle any situation encountered by 99% of AD administrators. But for that 1% out there with exceptionally large and/or busy Active Directory deployments, (or for those who make “interesting” administrative choices,) the monitoring and tuning of the version store can become a very important topic. And quite suddenly too, as replication throughout your environment grinds to a halt because of version store exhaustion and you scramble to figure out why.

The purpose of this blog post is to provide up-to-date (as of the year 2016) information and guidance on the version store, and to do it in a format that may be more palatable to many readers than sifting through reams of old MSDN and TechNet documentation that may or may not be accurate or up to date. I can also offer more practical examples than you would probably get from straight technical documentation. There has been quite an uptick lately in the number of cases we’re seeing here in Support that center around version store exhaustion. While the job security for us is nice, knowing this stuff ahead of time can save you from having to call us and spend lots of costly support hours.

Version Store: What is it?

As mentioned earlier, the version store is an integral part of the ESE database engine. It’s an area of temporary storage in memory that holds copies of objects that are in the process of being modified, for the sake of providing atomic transactions. This allows the database to roll back transactions in case it can’t commit them, and it allows other threads to read from a copy of the data while it’s in the process of being modified. All applications and services that utilize an ESE database use version store to some extent. The article “How the Data Store Works” describes it well:

“ESE provides transactional views of the database. The cost of providing these views is that any object that is modified in a transaction has to be temporarily copied so that two views of the object can be provided: one to the thread inside that transaction and one to threads in other transactions. This copy must remain as long as any two transactions in the process have different views of the object. The repository that holds these temporary copies is called the version store. Because the version store requires contiguous virtual address space, it has a size limit. If a transaction is open for a long time while changes are being made (either in that transaction or in others), eventually the version store can be exhausted. At this point, no further database updates are possible.”

When Active Directory was first introduced, it was deployed on machines with a single x86 processor with less than 4 GB of RAM supporting NTDS.DIT files that ranged between 2MB and a few hundred MB. Most of the documentation you’ll find on the internet regarding the version store still has its roots in that era and was written with the aforementioned hardware in mind. Today, things like hardware refreshes, OS version upgrades, cloud adoption and an improved understanding of AD architecture are driving massive consolidation in the number of forests, domains and domain controllers in them, DIT sizes are getting bigger… all while still relying on default configuration values from the Windows 2000 era.

The number-one killer of version store is long-running transactions. Transactions that tend to be long-running include, but are not limited to:

– Deleting a group with 100,000 members
– Deleting any object, not just a group, with 100,000 or more forward/back links to clean
– Modifying ACLs in Active Directory on a parent container that propagate down to many thousands of inheriting child objects
– Creating new database indices
– Having underpowered or overtaxed domain controllers, causing transactions to take longer in general
– Anything that requires boat-loads of database modification
– Large SDProp and garbage collection tasks
– Any combination thereof

I will show some examples of the errors that you would see in your event logs when you experience version store exhaustion in the next section.

Monitoring Version Store Usage

To monitor version store usage, leverage the Performance Monitor (perfmon) counter:

'\\dc01\Database ==> Instances(lsass/NTDSA)\Version buckets allocated'

(Figure 1: The ‘Version buckets allocated’ perfmon counter.)
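
If you would rather sample it from PowerShell than stare at the perfmon UI, something like this sketch works locally (add -ComputerName for a remote DC):

  Get-Counter -Counter '\Database ==> Instances(lsass/NTDSA)\Version buckets allocated' `
      -SampleInterval 5 -Continuous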

The version store divides the amount of memory that it has been given into “buckets,” or “pages.” Version store pages need not (and in AD, they do not) equal the size of database pages elsewhere in the database. We’ll get into the exact size of these buckets in a minute.

During typical operation, when the database is not busy, this counter will be low. It may even be zero if the database really just isn’t doing anything. But when you perform one of those actions that I mentioned above that qualify as “long-running transactions,” you will trigger a spike in the version store usage. Here is an example of me deleting a group that contains 200,000 members, on a DC running 2012 R2 with 1 64bit CPU:

(Figure 2: Deleting a group containing 200k members on a 2012 R2 DC with 1 64bit CPU.)

The version store spikes to 5332 buckets allocated here, seconds after I deleted the group, but as long as the DC recovers and falls back down to nominal levels, you’ll be alright. If it stays high or even maxed out for extended periods of time, then no more database transactions for you. This includes no more replication. This is just an example using the common member/memberOf relationship, but any linked-value attribute relationship can cause this behavior. (I’ve talked a little about linked value attributes before here.) There are plenty of other types of objects that may invoke this same kind of behavior, such as deleting an RODC computer object, after which its msDS-RevealedUsers links must be processed, and so on.

I’m not saying that deleting a group with fewer than 200K members couldn’t also trigger version store exhaustion if there are other transactions taking place on your domain controller simultaneously or other extenuating circumstances. I’ve seen transactions involving as few as 70K linked values cause major problems.

After you delete an object in AD, and the domain controller turns it into a tombstone, each domain controller has to process the linked-value attributes of that object to maintain the referential integrity of the database. It does this in “batches,” usually 1000 or 10,000 depending on Windows version and configuration. This was only very recently documented here. Since each “batch” of 1000 or 10,000 is considered a single transaction, a smaller batch size will tend to complete faster and thus require less version store usage. (But the overall job will take longer.)

An interesting curveball here is that having the AD Recycle Bin enabled will defer this action by an msDs-DeletedObjectLifetime number of days after an object is deleted, since that’s the appeal behind the AD Recycle Bin – it allows you to easily restore deleted objects with all their links intact. (More detail on the AD Recycle Bin here.)

When you run out of version storage, no other database transactions can be committed until the transaction or transactions that are causing the version store exhaustion are completed or rolled back. At this point, most people start rebooting their domain controllers, and this may or may not resolve the immediate issue for them depending on exactly what’s going on. Another thing that may alleviate this issue is offline defragmentation of the database. (Or reducing the links batch size, or increasing the version store size – more on that later.) Again, we’re usually looking at 100+ gigabyte DITs when we see this kind of issue, so we’re essentially talking about pushing the limits of AD. And we’re also talking about hours of downtime for a domain controller while we do that offline defrag and semantic database analysis.

Here, Active Directory is completely tapping out the version store. Notice the plateau once it has reached its max:

(Figure 3: Version store being maxed out at 13078 buckets on a 2012 R2 DC with 1 64bit CPU.)

So it has maxed out at 13,078 buckets.

When you hit this wall, you will see events such as these in your event logs:

Log Name: Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Date: 5/16/2016 5:54:52 PM
Event ID: 1519
Task Category: Internal Processing
Level: Error
Keywords: Classic
User: S-1-5-21-4276753195-2149800008-4148487879-500
Computer: DC01.contoso.com
Description:
Internal Error: Active Directory Domain Services could not perform an operation because the database has run out of version storage.

And also:

Log Name: Directory Service
Source: NTDS ISAM
Date: 5/16/2016 5:54:52 PM
Event ID: 623
Task Category: (14)
Level: Error
Keywords: Classic
User: N/A
Computer: DC01.contoso.com
Description:
NTDS (480) NTDSA: The version store for this instance (0) has reached its maximum size of 408Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.

The peculiar “408Mb” figure that comes along with that last event leads us into the next section…

How big is the Version Store by default?

The “How the Data Store Works” article that I linked to earlier says:

“The version store has a size limit that is the lesser of the following: one-fourth of total random access memory (RAM) or 100 MB. Because most domain controllers have more than 400 MB of RAM, the most common version store size is the maximum size of 100 MB.”

Incorrect.

And then you have other articles that have even gone to print, such as this one, that say:

“Typically, the version store is 25 percent of the physical RAM.”

Extremely incorrect.

What about my earlier question about the bucket size? Well if you consulted this KB article you would read:

“The value for the setting is the number of 16KB memory chunks that will be reserved.”

Nope, that’s wrong.

Or if I go to the MSDN documentation for ESE:

“JET_paramMaxVerPages
This parameter reserves the requested number of version store pages for use by an instance.

Each version store page as configured by this parameter is 16KB in size.”

Not true.

The pages are not 16KB anymore on 64bit DCs. And the only time that the “100MB” figure was ever even close to accurate was when domain controllers were 32bit and had 1 CPU. But today, domain controllers are 64bit and have lots of CPUs. Both the version store bucket size and the default number of version store buckets double when your domain controller is 64bit rather than 32bit. The figure also scales a little bit based on how many CPUs are in your domain controller.

So without further ado, here is how to calculate the actual number of buckets that Active Directory will allocate by default:

(2 * (3 * (15 + 4 + 4 * #CPUs)) + 6400) * PointerSize / 4

Pointer size is 4 if you’re using a 32bit processor, and 8 if you’re using a 64bit processor.

And secondly, version store pages are 16KB if you’re on a 32bit processor, and 32KB if you’re on a 64bit processor. So using a 64bit processor effectively quadruples the default size of your AD version store. To convert number of buckets allocated into bytes for a 32bit processor:

(((2 * (3 * (15 + 4 + 4 * 1)) + 6400) * 4 / 4) * 16KB) / 1MB

And for a 64bit processor:

(((2 * (3 * (15 + 4 + 4 * 1)) + 6400) * 8 / 4) * 32KB) / 1MB

So using the above formulae, the version store size for a single-core, 64bit DC would be ~408MB, which matches that event ID 623 we got from ESE earlier. It also conveniently matches 13078 * 32KB buckets, which is where we plateaued with our perfmon counter earlier.
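
If you want to play with the math yourself, here is a small PowerShell sketch of the same formula. The function name and parameters are just for illustration; anything below the hard-coded minimum of 6400 pages is rounded up, as described later in this post.

  function Get-EstimatedVersionStoreMB {
      param(
          [int]$CpuCount = 1,
          [ValidateSet(32,64)][int]$Architecture = 64,
          [int]$MaxVerPages = 6400      # "EDB max ver pages (increment over the minimum)"
      )
      $pointerSize = if ($Architecture -eq 64) { 8 } else { 4 }
      $bucketKB    = if ($Architecture -eq 64) { 32 } else { 16 }
      $pages       = [Math]::Max($MaxVerPages, 6400)
      $buckets     = (2 * (3 * (15 + 4 + 4 * $CpuCount)) + $pages) * $pointerSize / 4
      [Math]::Round(($buckets * $bucketKB) / 1024, 1)
  }

  Get-EstimatedVersionStoreMB -CpuCount 1 -Architecture 64                     # ~408.6 MB
  Get-EstimatedVersionStoreMB -CpuCount 1 -Architecture 32                     # ~102.2 MB
  Get-EstimatedVersionStoreMB -CpuCount 1 -Architecture 64 -MaxVerPages 9600   # ~608.6 MB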

If you had a 4-core, 64bit domain controller, the formula would come out to ~412MB, and you will see this line up with the event log event ID 623 on that machine. When a 4-core, Windows 2008 R2 domain controller with default configuration runs out of version store:

Log Name:      Directory Service
Source:        NTDS ISAM
Date:          5/15/2016 1:18:25 PM
Event ID:      623
Task Category: (14)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      dc02.fabrikam.com
Description:
NTDS (476) NTDSA: The version store for this instance (0) has reached its maximum size of 412Mb. It is likely that a long-running transaction is preventing cleanup of the version store and causing it to build up in size. Updates will be rejected until the long-running transaction has been completely committed or rolled back.

The version store size for a single-core, 32bit DC is ~102MB. This must be where the original “100MB” adage came from. But as you can see now, that information is woefully outdated.

The 6400 number in the equation comes from the fact that 6400 is the absolute, hard-coded minimum number of version store pages/buckets that AD will give you. Turns out that’s about 100MB, if you assumed 16KB pages, or 200MB if you assume 32KB pages. The interesting side-effect from this is that the documented “EDB max ver pages (increment over the minimum)” registry entry, which is the supported way of increasing your version store size, doesn’t actually have any effect unless you set it to some value greater than 6400 decimal. If you set that registry key to something less than 6400, then it will just get overridden to 6400 when AD starts. But if you set that registry entry to, say, 9600 decimal, then your version store size calculation will be:

(((2 *(3 * (15 + 4 + 4 * 1)) + 9600) * 8 / 4) * 32KB) / 1MB = 608.6MB

For a 64bit, 1-core domain controller.
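
For reference, setting that registry entry from PowerShell might look like the sketch below. The value name is the one quoted from the KB above; the NTDS Parameters key path is where this setting is commonly documented to live, so treat the path as an assumption and verify it against the KB for your OS before touching production DCs.

  $ntdsParams = 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters'   # assumed path - verify against the KB
  Set-ItemProperty -Path $ntdsParams -Name 'EDB max ver pages (increment over the minimum)' -Value 9600 -Type DWord
  Restart-Service NTDS -Force   # the value is read at service start (or just reboot the DC)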

So let’s set those values on a DC, then run up the version store, and let’s get empirical up in here:

(Figure 4: Version store exhaustion at 19478 buckets on a 2012 R2 DC with 1 64bit CPU.)

(19478 * 32KB) / 1MB = 608.7MB

And wouldn’t you know it, the event log now reads:

(Figure 5: The event log from the previous version store exhaustion, showing the effect of setting the “EDB max ver pages (increment over the minimum)” registry value to 9600.)

Here’s a table that shows version store sizes based on the “EDB max ver pages (increment over the minimum)” value and common CPU counts:

Buckets | 1 CPU | 2 CPUs | 4 CPUs | 8 CPUs | 16 CPUs
6400 (the default) | x64: 410 MB, x86: 103 MB | x64: 412 MB, x86: 103 MB | x64: 415 MB, x86: 104 MB | x64: 421 MB, x86: 105 MB | x64: 433 MB, x86: 108 MB
9600 | x64: 608 MB, x86: 152 MB | x64: 610 MB, x86: 153 MB | x64: 613 MB, x86: 153 MB | x64: 619 MB, x86: 155 MB | x64: 631 MB, x86: 158 MB
12800 | x64: 808 MB, x86: 202 MB | x64: 810 MB, x86: 203 MB | x64: 813 MB, x86: 203 MB | x64: 819 MB, x86: 205 MB | x64: 831 MB, x86: 208 MB
16000 | x64: 1008 MB, x86: 252 MB | x64: 1010 MB, x86: 253 MB | x64: 1013 MB, x86: 253 MB | x64: 1019 MB, x86: 255 MB | x64: 1031 MB, x86: 258 MB
19200 | x64: 1208 MB, x86: 302 MB | x64: 1210 MB, x86: 303 MB | x64: 1213 MB, x86: 303 MB | x64: 1219 MB, x86: 305 MB | x64: 1231 MB, x86: 308 MB

Sorry for the slight rounding errors – I just didn’t want to deal with decimals. As you can see, the number of CPUs in your domain controller only has a slight effect on the version store size. The processor architecture, however, makes all the difference. Good thing absolutely no one uses x86 DCs anymore, right?

Now I want to add a final word of caution.

I want to make it clear that we recommend changing the “EDB max ver pages (increment over the minimum)” value only when necessary, that is, when the event ID 623s start appearing. (If it ain’t broke, don’t fix it.) I also want to reiterate the warnings that appear in the support KB: you must not set this value arbitrarily high, you should increase it in small (50 MB or 100 MB) increments, and if setting the value to 19200 buckets still does not resolve your issue, you should contact Microsoft Support. If you are going to change this value, it is advisable to change it consistently across all domain controllers, but you must also carefully consider the processor architecture and available memory on each DC before you change this setting. The version store requires a contiguous allocation of memory – precious real estate – and raising the value too high can prevent lsass from being able to perform other work. Once the problem has subsided, you should return this setting to its default value.

In my next post on this topic, I plan on going into more detail on how one might actually troubleshoot the issue and track down the reason behind why the version store exhaustion is happening.

Conclusions

There is a lot of old documentation out there that has misled many an AD administrator on this topic. It was essentially accurate at the time it was written, but AD has evolved since then. I hope that with this post I was able to shed more light on the topic than you probably ever thought was necessary. It’s an undeniable truth that more and more of our customers continue to push the limits of AD beyond that which was originally conceived. I also want to remind the reader that the majority of the information in this article is AD-specific. If you’re thinking about Exchange or Certificate Services or Windows Update or DFSR or anything else that uses an ESE database, then you need to go figure out your own application-specific details, because we don’t use the same page sizes or algorithms as those guys.

I hope this will be valuable to those who find themselves asking questions about the ESE version store in Active Directory.

With love,

Ryan “Buckets of Fun” Ries


Deploying Group Policy Security Update MS16-072 \ KB3163622


My name is Ajay Sarkaria & I work with the Windows Supportability team at Microsoft. There have been many questions on deploying the newly released security update MS16-072.

This post was written to provide guidance and answer the questions administrators may have about deploying the newly released security update, MS16-072, which addresses a vulnerability. The vulnerability could allow elevation of privilege if an attacker launches a man-in-the-middle (MiTM) attack against the traffic passing between a domain controller and the target machine on domain-joined Windows computers.

The table below summarizes the KB article number for the relevant Operating System:

Article # | Title | Context / Synopsis
MSKB 3163622 | MS16-072: Security Updates for Group Policy: June 14, 2016 | Main article for MS16-072
MSKB 3159398 | MS16-072: Description of the security update for Group Policy: June 14, 2016 | MS16-072 for Windows Vista / Windows Server 2008, Windows 7 / Windows Server 2008 R2, Windows Server 2012, Windows 8.1 / Windows Server 2012 R2
MSKB 3163017 | Cumulative update for Windows 10: June 14, 2016 | MS16-072 for Windows 10 RTM
MSKB 3163018 | Cumulative update for Windows 10 Version 1511 and Windows Server 2016 Technical Preview 4: June 14, 2016 | MS16-072 for Windows 10 1511 + Windows Server 2016 TP4
MSKB 3163016 | Cumulative Update for Windows Server 2016 Technical Preview 5: June 14, 2016 | MS16-072 for Windows Server 2016 TP5
TN: MS16-072 | Microsoft Security Bulletin MS16-072 – Important | Overview of changes in MS16-072
What does this security update change?

The most important aspect of this security update is to understand the behavior changes affecting the way User Group Policy is applied on a Windows computer. MS16-072 changes the security context with which user group policies are retrieved. Traditionally, when a user group policy is retrieved, it is processed using the user’s security context.

After MS16-072 is installed, user group policies are retrieved by using the computer’s security context. This by-design behavior change protects domain joined computers from a security vulnerability.

When a user group policy is retrieved using the computer’s security context, the computer account will now need “read” access to retrieve the group policy objects (GPOs) needed to apply to the user.

Traditionally, all group policies were read if the user had read access, either directly or through membership in a domain group such as Authenticated Users.

What do we need to check before deploying this security update?

As discussed above, by default “Authenticated Users” have “Read” and “Apply Group Policy” on all Group Policy Objects in an Active Directory Domain.

Below is a screenshot from the Default Domain Policy:

If permissions on the Group Policy Objects in your Active Directory domain have not been modified from the defaults, and Kerberos authentication is working fine in your Active Directory forest (i.e. there are no Kerberos errors visible in the System event log on client computers while they access domain resources), there is nothing else you need to verify before you deploy the security update.

In some deployments, administrators may have removed the “Authenticated Users” group from some or all Group Policy Objects (Security filtering, etc.)

In such cases, you will need to make sure of the following before you deploy the security update:

  1. Check if “Authenticated Users” group read permissions were removed intentionally by the admins. If not, then you should probably add those back. For example, if you do not use any security filtering to target specific group policies to a set of users, you could add “Authenticated Users” back with the default permissions as shown in the example screenshot above.
  2. If the “Authenticated Users” permissions were removed intentionally (security filtering, etc), then as a result of the by-design change in this security update (i.e. to now use the computer’s security context to retrieve user policies), you will need to add the computer account retrieving the group policy object (GPO) to “Read” Group Policy (and not “Apply group policy“).

    Example Screenshot:

In the example screenshot above, let’s say an administrator wants “User-Policy” (the name of the Group Policy Object) to apply only to the user named “MSFT Ajay” and to no other user; the screenshot shows how the Group Policy would have been filtered to achieve that. “Authenticated Users” has been removed intentionally in this example scenario.

Notice that no other user or group has “Read” or “Apply Group Policy” permissions apart from the default Domain Admins and Enterprise Admins. These groups do not have “Apply Group Policy” by default, so the GPO would not apply to members of those groups; it would apply only to the user “MSFT Ajay”.

What will happen if there are Group Policy Objects (GPOs) in an Active Directory domain that are using security filtering as discussed in the example scenario above?

Symptoms when you have security filtering Group Policy Objects (GPOs) like the above example and you install the security update MS16-072:

  • Printers or mapped drives assigned through Group Policy Preferences disappear.
  • Shortcuts to applications on users’ desktops are missing.
  • Group policies that use security filtering are no longer processed.
  • You may see the following change in gpresult: Filtering: Not Applied (Unknown Reason)
What is the Resolution?

Simply adding the “Authenticated Users” group with the “Read” permissions on the Group Policy Objects (GPOs) should be sufficient. Domain Computers are part of the “Authenticated Users” group. “Authenticated Users” have these permissions on any new Group Policy Objects (GPOs) by default. Again, the guidance is to add just “Read” permissions and not “Apply Group Policy” for “Authenticated Users”

What if adding Authenticated Users with Read permissions is not an option?

If adding “Authenticated Users” with just “Read” permissions is not an option in your environment, then you will need to add the “Domain Computers” group with “Read” Permissions. If you want to limit it beyond the Domain Computers group: Administrators can also create a new domain group and add the computer accounts to the group so you can limit the “Read Access” on a Group Policy Object (GPO). However, computers will not pick up membership of the new group until a reboot. Also keep in mind that with this security update installed, this additional step is only required if the default “Authenticated Users” Group has been removed from the policy where user settings are applied.
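
If you manage this with the GroupPolicy PowerShell module (RSAT), the change described above might look like this sketch; "User-Policy" is the example GPO name from the earlier scenario, and GpoRead grants "Read" without "Apply Group Policy":

  Set-GPPermission -Name 'User-Policy' -TargetName 'Authenticated Users' -TargetType Group -PermissionLevel GpoRead
  # Or, if adding Authenticated Users is not an option:
  Set-GPPermission -Name 'User-Policy' -TargetName 'Domain Computers' -TargetType Group -PermissionLevel GpoRead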

Example Screenshots:

In the above scenario, after you install the security update, the user group policy is retrieved using the computer’s (local system’s) security context. Because a domain-joined computer is a member of the “Domain Computers” security group by default, the client computer will be able to retrieve the user policies that need to be applied to the user, and they will be processed successfully.

How to identify GPOs with issues:

In case you have already installed the security update and need to identify Group Policy Objects (GPOs) that are affected, the easy way is to run gpupdate /force on a Windows client computer and then run gpresult /h new-report.html. Open new-report.html and review it for errors such as: “Reason Denied: Inaccessible, Empty or Disabled”.
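
If you have the GroupPolicy PowerShell module available, a rough sketch like this can also give you a quick list of candidate GPOs (GPOs that intentionally use security filtering will show up too, so review the list before changing anything):

  Import-Module GroupPolicy
  Get-GPO -All | Where-Object {
      -not (Get-GPPermission -Guid $_.Id -All |
            Where-Object { $_.Trustee.Name -eq 'Authenticated Users' })
  } | Select-Object DisplayName, Id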

What if there are a lot of GPOs?

A script is available that can detect all Group Policy Objects (GPOs) in your domain that are missing “Read” permissions for “Authenticated Users”.
You can get the script from here: https://gallery.technet.microsoft.com/Powershell-script-to-cc281476

Pre-Reqs:

  • The script can run only on Windows 7 and above Operating Systems which have the RSAT or GPMC installed or Domain Controllers running Windows Server 2008 R2 and above
  • The script works in a single domain scenario.
  • The script will detect all GPOs in your domain (Not Forest) which are missing “Authenticated Users” permissions & give the option to add “Authenticated Users” with “Read” Permissions (Not Apply Group Policy). If you have multiple domains in your Active Directory Forest, you will need to run this for each domain.
    • Domain Computers are part of the Authenticated Users group
  • The script can only add permissions to the Group Policy Objects (GPOs) in the same domain as the context of the current user running the script. In a multi domain forest, you must run it in the context of the Domain Admin of the other domain in your forest.

Sample Screenshots when you run the script:

In the first sample run, the script detects all Group Policy Objects (GPOs) in your domain that are missing the Read permission for “Authenticated Users”.


If you hit “Y”, the script adds “Authenticated Users” with Read permissions to the GPOs it found.


What if there are AGPM managed Group Policy Objects (GPOs)?

Follow the steps below to add “Authenticated Users” with Read Permissions:

To change the permissions for all managed GPOs and add the Authenticated Users Read permission, follow these steps:

Re-import all Group Policy Objects (GPOs) from production into the AGPM database. This ensures AGPM has the latest copy of the production GPOs.


Add the READ permission for either “Authenticated Users” or “Domain Computers” using the Production Delegation tab: select the security principal, grant the “Read” role, and then click “OK”.

clip_image006

Grant the selected security principal the “Read” role.

clip_image008

Delegation tab depicting Authenticated Users having the READ permissions.

clip_image010

Select and deploy the GPOs again.
Note: To modify permissions on multiple AGPM-managed GPOs, use Shift+click or Ctrl+click to select several GPOs at a time, then deploy them in a single operation. Ctrl+A does not select all policies.

clip_image012

clip_image014

The targeted GPOs now have the new permissions when viewed in AD:

clip_image016

Below are some frequently asked questions (FAQs) we have seen:

Q1) Do I need to install the fix on only client OS? OR do I also need to install it on the Server OS?

A1) It is recommended that you patch all Windows and Windows Server computers running Windows Vista, Windows Server 2008 and newer operating systems, regardless of SKU or role, in your entire domain environment. These updates only change behavior from a client standpoint (as in "client-server distributed system architecture"), but all computers in a domain are "clients" to SYSVOL and Group Policy, even the Domain Controllers (DCs) themselves.

Q2) Do I need to enable any registry settings to enable the security update?

A2) No. The new behavior takes effect as soon as you install the MS16-072 security update; however, you need to check the permissions on your Group Policy Objects (GPOs) as explained above.

Q3) What will change in regard to how group policy processing works after the security update is installed?

A3) Prior to MS16-072, the connection to the Windows domain controller (DC) to retrieve user policy was made under the user's security context. With this security update installed, Windows Group Policy clients instead make this connection under the local system's security context, thereby forcing Kerberos authentication.

Q4) We already have the security update MS15-011 & MS15-014 installed which hardens the UNC paths for SYSVOL & NETLOGON & have the following registry keys being pushed using group policy:

  • RequirePrivacy=1
  • RequireMutualAuthentication=1
  • RequireIntegrity=1

Should the UNC Hardening security update with the above registry settings not take care of this vulnerability when processing group policy from the SYSVOL?

A4) No. UNC Hardening alone will not protect against this vulnerability. In order to protect against this vulnerability, one of the following scenarios must apply: UNC Hardened access is enabled for SYSVOL/NETLOGON as suggested, and the client computer is configured to require Kerberos FAST Armoring

– OR –

UNC Hardened Access is enabled for SYSVOL/NETLOGON, and this particular security update (MS16-072 \ KB3163622) is installed

Q5) If we have security filtering on Computer objects, what change may be needed after we install the security update?

A5) Nothing will change in regard to how Computer Group Policy retrieval and processing works

Q6) We are using security filtering for user objects and after installing the update, group policy processing is not working anymore

A6) As noted above, the security update changes the way user group policy settings are retrieved. The reason group policy processing fails after the update is installed is that the default "Authenticated Users" group may have been removed from the Group Policy Object (GPO). The computer account now needs "Read" permissions on the GPO; you can add the "Domain Computers" group with "Read" permissions on the GPO so that the client can retrieve the list of GPOs to download for the user.

Example Screenshot as below:

Q7) Will installing this security update impact cross forest user group policy processing?

A7) No, this security update will not impact cross forest user group policy processing. When a user from one forest logs onto a computer in another forest and the group policy setting “Allow Cross-Forest User Policy and Roaming User Profiles” is enabled, the user group policy during the cross forest logon will be retrieved using the user’s security context.

Q8) Is there a need to specifically add "Domain Computers" to make user group policy processing work, or is adding "Authenticated Users" with just Read permissions sufficient?

A8) Adding "Authenticated Users" with Read permissions is sufficient. If you already have "Authenticated Users" added with at least Read permissions on a GPO, there is no further action required. "Domain Computers" are by default part of the "Authenticated Users" group, so user group policy processing will continue to work. You only need to add "Domain Computers" to a GPO with Read permissions if you do not want "Authenticated Users" to have "Read".
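If you want to double-check a specific GPO from PowerShell, the GroupPolicy module will show you whether "Authenticated Users" currently has any permission entry on it (the GPO name below is only an example); if nothing is returned, that GPO needs attention:

Get-GPPermission -Name 'Default Domain Policy' -All |
    Where-Object { $_.Trustee.Name -eq 'Authenticated Users' }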

Thanks,

Ajay Sarkaria

Supportability Program Manager – Windows

Edits:
6/29/16 – added script link and prereqs
7/11/16 – added information about AGPM

Access-Based Enumeration (ABE) Concepts (part 1 of 2)


Hello everyone, Hubert from the German Networking Team here.  Today I want to revisit a topic that I wrote about in 2009: Access-Based Enumeration (ABE)

This is the first part of a 2-part Series. This first part will explain some conceptual things around ABE.  The second part will focus on diagnostic and troubleshooting of ABE related problems.  This post will be updated when the second part goes live.

Access-Based Enumeration has existed since Windows Server 2003 SP1 and has not changed in any significant form since my blog post in 2009. However, what has significantly changed is its popularity.

With its integration into V2 (2008 Mode) DFS Namespaces and the increasing demand for data privacy, it became a tool of choice for many architects. However, the same strict limitations and performance impact it had in Windows Server 2003 still apply today. With this post, I hope to shed some more light here as these limitations and the performance impact are either unknown or often ignored. Read on to gain a little insight and background on ABE so that you:

  1. Understand its capabilities and limitations
  2. Gain the background knowledge needed for my next post on how to troubleshoot ABE

Two things to keep in mind:

  • ABE is not a security feature (it’s more of a convenience feature)
  • There is no guarantee that ABE will perform well under all circumstances. If performance issues come up in your deployment, disabling ABE is a valid solution.

So without any further ado let’s jump right in:

What is ABE and what can I do with it?
From the TechNet topic:

“Access-based enumeration displays only the files and folders that a user has permissions to access. If a user does not have Read (or equivalent) permissions for a folder, Windows hides the folder from the user’s view. This feature is active only when viewing files and folders in a shared folder; it is not active when viewing files and folders in the local file system.”

Note that ABE has to check the user’s permissions at the time of enumeration and filter out files and folders they don’t have Read permissions to. Also note that this filtering only applies if the user is attempting to access the share via SMB versus simply browsing the same folder structure in the local file system.
For example, let’s assume you have an ABE-enabled file server share with 500 files and folders, but a certain user only has read permissions to 5 of those folders. The user is only able to view 5 folders when accessing the share over the network. If the user logs on to this server and browses the local file system, they will see all of the files and folders.

In addition to file server shares, ABE can also be used to filter the links in DFS Namespaces.
With V2 namespaces, DFSN gained the ability to store permissions for each DFSN link and to apply those permissions to the local file system of each DFSN server.
ABE then uses those NTFS permissions to filter directory enumerations against the DFSN root share, removing DFSN links from the results sent to the client.

Therefore, ABE can be used to either hide sensitive information in the link/folder names, or to increase usability by hiding hundreds of links/folders the user does not have access to.

How does it work?
The filtering happens on the file server at the time of the request.
Any Object (File / Folder / Shortcut / Reparse Point / etc.) where the user has less than generic read permissions is omitted in the response by the server.
Generic Read means:

  • List Folder / Read Data
  • Read Attributes
  • Read Extended Attributes
  • Read Permissions

If you take any of these permissions away, ABE will hide the object.
So you could create a scenario (for example, by removing only the "Read Permissions" permission) where the object is hidden from the user, but they could still open and read the file or folder if they know its name.

That brings us to the next important conceptual point we need to understand:

ABE does not do access control.
It only filters the response to a Directory Enumeration. The access control is still done through NTFS.
Aside from that, ABE only works when the access happens through the Server service (that is, the file server). Local access to the file system is not affected by ABE. Restated:

“Access-based enumeration does not prevent users from obtaining a referral to a folder target if they already know the DFS path of the folder with targets. Permissions set using Windows Explorer or the Icacls command on namespace roots or folders without targets control whether users can access the DFS folder or namespace root. However, they do not prevent users from directly accessing a folder with targets. Only the share permissions or the NTFS file system permissions of the shared folder itself can prevent users from accessing folder targets.” Recall what I said earlier, “ABE is not a security feature”. TechNet

ABE does not do any caching.
Every request causes a filter calculation. There is no cache. ABE will repeat the exact same work for identical directory enumerations by the same user.

ABE cannot predict the permissions or the result.
It has to do the calculations for each object in every level of your folder hierarchy every time it is accessed.
If you use inheritance on the folder structure, a user will have the same permissions and thus the same filter result from ABE throughout the entire folder structure. Still, ABE has to calculate this result, consuming CPU cycles in the process.
If you enable ABE on such a folder structure you are just wasting CPU cycles without any gain.

With those basics out of the way, let’s dive into the mechanics behind the scenes:

How the filtering calculation works

  1. When a QUERY_DIRECTORY request (https://msdn.microsoft.com/en-us/library/cc246551.aspx) or its SMB1 equivalent arrives at the server, the server will get a list of objects within that directory from the filesystem.
  2. With ABE enabled, this list is not immediately sent out to the client, but instead passed over to the ABE for processing.
  3. ABE will iterate through EVERY object in this list and compare the user's permissions with the object's ACL (see the sketch after this list).
  4. The objects where the user does not have generic read access are removed from the list.
  5. After ABE has completed its processing, the client receives the filtered list.
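To make the per-object check concrete, here is a rough PowerShell illustration of the filtering described in the steps above. This is only a conceptual sketch of my own, not how the file server driver actually implements it: it looks only at explicit Allow ACEs for a single identity, ignores group membership and deny ordering, and the folder path and account name are placeholders.

function Test-GenericRead {
    param([string]$Path, [string]$Identity)
    # FileSystemRights.Read is the composite of List Folder/Read Data,
    # Read Attributes, Read Extended Attributes and Read Permissions.
    $readBits = [System.Security.AccessControl.FileSystemRights]::Read
    foreach ($ace in (Get-Acl $Path).Access) {
        if ($ace.IdentityReference.Value -eq $Identity -and
            $ace.AccessControlType -eq 'Allow' -and
            (($ace.FileSystemRights -band $readBits) -eq $readBits)) {
            return $true
        }
    }
    return $false
}

# The "filtered list" a client would receive for one QUERY_DIRECTORY of this folder:
Get-ChildItem 'D:\Shares\Data' -Force |
    Where-Object { Test-GenericRead -Path $_.FullName -Identity 'CONTOSO\alice' }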

This yields two effects:

  • This comparison is an active operation and thus consumes CPU Cycles.
  • This comparison takes time, and this time is passed down to the user, because the results are only sent once the comparisons for the entire directory are complete.

This brings us directly to the core point of this Blog:
In order to successfully use ABE in your environment you have to manage both effects.
If you don’t, ABE can cause a wide spread outage of your File services.

The first effect can cause a complete saturation of your CPUs (all cores at 100%).
This not only increases the response times of the file server to the point where the server stops accepting new connections, or clients kill their connections after not getting a response for several minutes; it can also prevent you from establishing a Remote Desktop connection to the server to make any changes (like disabling ABE, for instance).

The second effect can increase the response times of your file server (even if it is otherwise idle) to a level that users will no longer accept.
The comparison for a single directory enumeration by a single user can keep one CPU in your server busy for quite some time, making it more likely that new incoming requests overlap with already running ABE calculations. This eventually results in a backlog, adding further to the delays experienced by your clients.

To illustrate this let’s roll some numbers:
A little disclaimer:
The following calculation reflects what I have seen; your results may differ, as there are many moving pieces in play here. In other words, your mileage may vary. That aside, the numbers here are not entirely off but stem from real production environments. Performance of disk, CPU and other workloads plays into these numbers as well.
The calculation and numbers are thus for illustration purposes only. Don't use them to calculate your server's performance capabilities.

Let’s assume you have a DFS Namespace with 10,000 links that is hosted on DFS Servers that have 4 CPUs with 3.5 GHz (also assuming RSS is configured correctly and all 4 CPUs are used by the File service: https://blogs.technet.microsoft.com/networking/2015/07/24/receive-side-scaling-for-the-file-servers/ ).
We usually expect single digit millisecond response times measured at the fileserver to achieve good performance (network latency obviously adds to the numbers seen on the client).
In our scenario above (10,000 links, ABE, 3.5 GHz CPUs) it is not unusual for a single enumeration of the namespace to take 500 ms.

CPU cores and speed | DFS Namespace Links | RSS configured per recommendations | ABE enabled? | Response time
4 @ 3.5 GHz         | 10,000              | Yes                                | No           | <10 ms
4 @ 3.5 GHz         | 10,000              | Yes                                | Yes          | 300 – 500 ms

That means a single CPU can handle up to 2 Directory Enumerations per Second. Multiplied by 4 CPUs the server can handle 8 User Requests per Second. Any more than those 8 requests and we push the Server into a backlog.
Backlog in this case means new requests are stuck in the Processor Queue behind other requests, therefore multiplying the wait time.
This can reach dimensions where the client (and the user) is waiting for minutes and the client eventually decides to kill the TCP connection, and in case of DFSN, fail over to another server.

Anyone remotely familiar with Fileserver Scalability probably instantly recognizes how bad and frightening those numbers are.  Please keep in mind, that not every request sent to the server is a QUERY_DIRECTORY request, and all other requests such as Write, Read, Open, Close etc. do not cause an ABE calculation (however they suffer from an ABE-induced lack of CPU resources in the same way).
Furthermore, the Windows File Service Client caches the directory enumeration results if SMB2 or SMB3 is used (https://technet.microsoft.com/en-us/library/ff686200(v=ws.10).aspx ).
There is no such Cache for SMB1. Thus SMB1 Clients will send more Directory Enumeration Requests than SMB2 or SMB3 Clients (particularly if you keep the F5 key pressed).
It should now be obvious that you should use SMB2/3 versus SMB1 and ensure you leave the caches enabled if you use ABE on your servers.

As you might have realized by now, there is no easy or reliable way to predict the CPU demand of ABE. If you are developing a completely new environment you usually cannot forecast the proportion of QUERY_DIRECTORY requests in relation to the other requests or the frequency of the same.

Recommendations!
The most important recommendation I can give you is:

Do not enable ABE unless you really need to.
Let’s take the Users Home shares as an example:
Usually there is no user browsing manually through this structure; instead, the users get a mapped drive pointing to their folder, so the usability aspect does not count.  Additionally, most users will know (or can find out from the Office address book) the names or aliases of their colleagues, so there is no sensitive information to hide here.  For ease of management, most home shares live in big namespaces or server shares, which makes them very unfit to be used with ABE.  In many cases the user has full control (or at least write permissions) inside their own home share.  Why should I waste my CPU cycles to filter the requests inside someone's home share?

Considering all those points, I would be intrigued to learn about a telling argument to enable ABE on User Home Shares or Roaming Profile Shares.  Please sound off in the comments.

If you have a data structure where you really need to enable ABE, your file service concept needs to facilitate these four requirements:

You need Scalability.
You need the ability to increase the number of CPUs doing the ABE calculations in order to react to increasing numbers (directory sizes, number of clients, usage frequency) and thus performance demand.
The easiest way to achieve this is to do ABE Filtering exclusively in DFS Domain Namespaces and not on the Fileservers.
That way you can easily add more CPUs by simply adding further namespace servers in the sites where they are required.
Also keep in mind that you should have some redundancy, and that another server might not be able to take the full additional load of a failing server on top of its own.

You need small chunks
The number of objects that ABE needs to check for each calculation is the single most important factor for the performance requirement.
Instead of having a single big 10,000 link namespace (same applies to directories on file servers) build 10 smaller 1,000 link-namespaces and combine them into a DFS Cascade.
That way, ABE only needs to filter 1,000 objects for every request.
Just redo the example calculation above with 250 ms, 100 ms, 50 ms or even less.
You will notice that you are suddenly able to reach very decent numbers in terms of requests per second.
The other nice side effect is that you will do fewer calculations, as a user will usually follow only one branch of the directory tree and thus does not trigger ABE calculations for the other branches.

You need Separation of Workloads.
Having your SQL Server run on the same machine as your ABE server can hurt the performance of both workloads.
Having ABE run on your domain controller exposes the domain controller role to the risk of being starved of CPU cycles and thus no longer servicing domain logons.

You need to test and monitor your performance
In many cases you are deploying a new file service concept into an existing environment.
Thus you can get some numbers regarding QUERY_DIRECTORY requests, from the existing DFS / Fileservers.
Build up your Namespace / Shares as you envisioned and use the File Server Capacity Tool (https://msdn.microsoft.com/en-us/library/windows/hardware/dn567658(v=vs.85).aspx ) to simulate the expected load against it.
Monitor the SMB Service Response Times, the Processor utilization and Queue length and the feel on the client while browsing through the structures.
This should give you an idea on how many servers you will need, and if it is required to go for a slimmer design of the data structures.
Keep monitoring those values through the lifecycle of your file server deployment in order to scale up in time.
Any deployment of new software, clients or the normal increase in data structure size could throw off your initial calculations and test results.
This point should imho be outlined very clearly in any concept documentation.

This concludes the first part of this Blog Series.
I hope you found it worthwhile and got an understanding of how to successfully design a file service with ABE.
Now to round off your knowledge, or if you need to troubleshoot a Performance Issue on an ABE-enabled Server, I strongly encourage you to read the second part of this Blog Series. This post will be updated as soon as it’s live.

With best regards,
Hubert

Access-Based Enumeration (ABE) Troubleshooting (part 2 of 2)


Hello everyone! Hubert from the German Networking Team here again with part two of my little Blog Post Series about Access-Based Enumeration (ABE). In the first part I covered some of the basic concepts of ABE. In this second part I will focus on monitoring and troubleshooting Access-based enumeration.
We will begin with a quick overview of Windows Explorer’s directory change notification mechanism (Change Notify), and how that mechanism can lead to performance issues before moving on to monitoring your environment for performance issues.

Change Notify and its impact on DFSN servers with ABE

Let’s say you are viewing the contents of a network share while a file or folder is added to the share remotely by someone else. Your view of this share will be updated automatically with the new contents of the share without you having to manually refresh (press F5) your view.
Change Notify is the mechanism that makes this work in all SMB Protocols (1,2 and 3).
The way it works is quite simple:

  1. The client sends a CHANGE_NOTIFY request to the server indicating the directory or file it is interested in. Windows Explorer (as an application on the client) does this by default for the directory that is currently in focus.
  2. Once there is a change to the file or directory in question, the server will respond with a CHANGE_NOTIFY Response, indicating that a change happened.
  3. This causes the client to send a QUERY_DIRECTORY request (in case it was a directory or DFS Namespace) to the server to find out what has changed.

    QUERY_DIRECTORY is the thing we discussed in the first post that causes ABE filter calculations. Recall that it is these filter calculations that result in CPU load and client-side delays.
    Let’s look at a common scenario:
  4. During login, your users get a mapped drive pointing at a share in a DFS Namespace.
  5. This mapped drive causes the clients to connect to your DFSN Servers
  6. The client sends a Change Notification (even if the user hasn’t tried to open the mapped drive in Windows Explorer yet) for the DFS Root.

    Nothing more happens until there is a change on the server side. Administrative work, such as adding and removing links, typically happens during business hours, whenever the administrators find the time or whenever the script that does it runs.

    Back to our scenario. Let’s have a server-side change to illustrate what happens next:
  7. We add a Link to the DFS Namespace.
  8. Once the DFSN Server picks up the new link in the namespace from Active directory, it will create the corresponding reparse point in its local file system.
    If you do not use Root Scalability Mode (RSM) this will happen almost at the same time on all of the DFS Servers in that namespace. With RSM the changes will usually be applied by the different DFS servers over the next hour (or whatever your SyncInterval is set to).
  9. These changes trigger CHANGE_NOTIFY responses to be sent out to any client that indicated interest in changes to the DFS Root on that server. This usually applies to hundreds of clients per DFS server.
  10. This causes hundreds of Clients to send QUERY_DIRECTORY requests simultaneously.

What happens next strongly depends on the size of your namespace (larger namespaces lead to longer duration per ABE calculation) and the number of Clients (aka Requests) per CPU of the DFSN Server (remember the calculation from the first part?)

As your Server does not have hundreds of CPUs there will definitely be some backlog. The numbers above decide how big this backlog will be, and how long it takes for the server to work its way back to normal. Keep in mind that while pedaling out of the backlog situation, your server still has to answer other, ongoing requests that are unrelated to our Change Notify Event.
Suffice it to say, this backlog and the CPU demand associated with it can also have a negative impact on other tasks.  For example, if you use this DFSN server to make a bunch of changes to your namespace, these changes will appear to take forever, simply because the executing server is starved of CPU cycles. The same holds true if you run other workloads on the same server or want to RDP into the box.

So! What can you do about it?
As is common with an overloaded server, there are a few different approaches you could take:

  • Distribute the load across more servers (and CPU cores)
  • Make changes outside of business hours
  • Disable Change Notify in Windows Explorer

Approach: Distribute the load / scale up
Method: An expensive way to handle the excessive load is to throw more servers and CPU cores at the DFS infrastructure. In theory, you could increase the number of servers and CPUs to a level where you can handle such peak loads without any issues, but that can be a very expensive approach.

Approach: Make changes outside business hours
Method: Depending on your organization's structure, your business needs, SLAs and other requirements, you could simply make planned administrative changes to your namespaces outside the main business hours, when fewer clients are connected to your DFSN servers.

Approach: Disable Change Notify in Windows Explorer
Method: You can set the NoRemoteChangeNotify and NoRemoteRecursiveEvents registry values (see https://support.microsoft.com/en-us/kb/831129) to prevent Windows Explorer from sending Change Notification requests. This is, however, a client-side setting that disables this functionality (change notify) not just for DFS shares but for any file server the client works with; you then have to actively press F5 to see changes to a folder or a share in Windows Explorer. This might or might not be a big deal for your users.

Monitoring ABE

As you may have realized by now, ABE is not a fire and forget technology—it needs constant oversight and occasional tuning. We’ve mainly discussed the design and “tuning” aspect so far. Let’s look into the monitoring aspect.

Using Task Manager / Process Explorer

This is a bit tricky, unfortunately, as any load caused by ABE shows up in Task Manager inside the System process (as do many other things on the server). In order to correlate high CPU utilization in the System process to ABE load, you need to use a tool such as Process Explorer and configure it to use public symbols. With this configured properly, you can drill deeper inside the System process and see the different threads and the component names. Note that ABE and the file server both use functions in srv.sys and srv2.sys, so strictly speaking it is not possible to differentiate between them just by the component names. However, if you are troubleshooting a performance problem on an ABE-enabled server where most of the threads in the System process are sitting in functions from srv.sys and srv2.sys, then it is very likely due to expensive ABE filter calculations. This is, aside from disabling ABE, the best approach to reliably prove that your problem is caused by ABE.

Using Network trace analysis

Looking at CPU utilization shows us the server-side problem. We must use other measures to determine what the client-side impact is; one approach is to take a network trace and analyze the SMB/SMB2 service response times. You may, however, end up having to capture the trace on a mirrored switch port. To make analysis of this a bit easier, Message Analyzer has an SMB Service Performance chart you can use.

clip_image002

You get there by using a New Viewer, like below.

smbserviceperf

Wireshark also has a feature that provides you with statistics under Statistics -> Service Response Times -> SMB2. Ignore the values for ChangeNotify (it is normal for them to be several seconds or even minutes). All other response times translate into delays for the clients. If you see values over a second, you can consider your file service not only slow but outright broken.
While you have that trace in front of you, you can also look for SMB/TCP Connections that are terminated abnormally by the Client as the server failed to respond to the SMB Requests in time. If you have any of those, then you have clients unable to connect to your file service, likely throwing error messages.

Using Performance Monitor

If your server is running Windows Server 2012 or newer, the following performance counters are available:

Object            | Counter                  | Instance
SMB Server Shares | Avg. sec/Data Request    | <Share that has ABE enabled>
SMB Server Shares | Avg. sec/Read            | ''
SMB Server Shares | Avg. sec/Request         | ''
SMB Server Shares | Avg. sec/Write           | ''
SMB Server Shares | Avg. Data Queue Length   | ''
SMB Server Shares | Avg. Read Queue Length   | ''
SMB Server Shares | Avg. Write Queue Length  | ''
SMB Server Shares | Current Pending Requests | ''

clip_image005

Most notable here is the Avg. sec/Request counter, as this contains the response time to the QUERY_DIRECTORY requests (Wireshark displays them as Find requests). The other values will suffer from a lack of CPU cycles in varying ways, but all indicate delays for the clients. As mentioned in the first part: we expect single-digit-millisecond response times from non-ABE file servers that are performing well. For ABE-enabled servers (more precisely, shares) the values for QUERY_DIRECTORY / Find requests will always be higher due to the inevitable length of the ABE calculation.

When you have reached a state where all SMB requests other than QUERY_DIRECTORY are consistently answered in less than 10 ms, and QUERY_DIRECTORY consistently in less than 50 ms, you have a very well performing server with ABE.
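If you prefer to sample these counters from PowerShell rather than Performance Monitor, something like the following works on Windows Server 2012 or later; the sample interval and sample count are arbitrary choices for illustration:

$counters = '\SMB Server Shares(*)\Avg. sec/Request',
            '\SMB Server Shares(*)\Current Pending Requests',
            '\Processor(_Total)\% Processor Time',
            '\System\Processor Queue Length'
# Twelve samples, five seconds apart; pipe to Export-Counter if you want to keep them for trending.
Get-Counter -Counter $counters -SampleInterval 5 -MaxSamples 12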

Other Symptoms

There are other symptoms of ABE problems that you may observe; however, none of them is very telling on its own without the information from the points above.

At first glance, high CPU utilization and a high processor queue length are indicators of an ABE problem; however, they are also indicators of other CPU-related performance issues. Not to mention there are cases where you encounter ABE performance problems without saturating all your CPUs.

The Server Work Queues\Active Threads (NonBlocking) counter will usually rise to its maximum allowed limit (MaxThreadsPerQueue), and the Server Work Queues\Queue Length will increase as well. Both indicate that the file server is busy, but on their own they don't tell you how bad the situation is. However, there are scenarios where the file server will not use up all the worker threads allowed, due to a bottleneck somewhere else such as in the disk subsystem or the CPU cycles available to it.

Should you choose to set up long-term monitoring (which you should) in order to spot trends, track the following:

  • Number of objects per directory or number of DFS links
  • Peak number of user requests (Performance Counter: Requests/sec)
  • Peak server response time to Find requests (Performance Counter: Avg. sec/Request)
  • Peak CPU utilization and peak processor queue length

If you collect those values every day (or at a shorter interval), you get a pretty good picture of how much headroom your servers have left at the moment and whether there are trends that you need to react to.

Feel free to add more information to your monitoring to get a better picture of the situation.  For example, gather information on how many DFS servers were active on any given day for a certain site, so you can tell whether unusually high numbers of user requests on the other servers are the result of a server downtime.

ABELevel

Some of you might have heard about the registry key ABELevel. The ABELevel value specifies the maximum level of the folders on which the ABE feature is enabled. While the title of the KB sounds very promising, and the hotfix is presented as a “Resolution”, the hotfix and registry value have very little practical application.  Here’s why:
ABELevel is a system-wide setting and does not differentiate between the different shares on the same server. If you host several shares, you are unable to filter them to different depths, as the setting forces you to go with the deepest folder hierarchy, which results in unnecessary filter calculations for the other shares.

Usually the widest directories are on the upper levels—those levels that you need to filter.  Disabling the filtering for the lower level directories doesn’t yield much of a performance gain, as those small directories don’t have much impact on server performance, while the big top-level directories do.  Furthermore, the registry value doesn’t make any sense for DFS Namespaces as you have only one folder level there and you should avoid filtering on your fileservers anyway.

While we are talking about Updates

Here is one that you should install:
High CPU usage and performance issues occur when access-based enumeration is enabled in Windows 8.1 or Windows 7 – https://support.microsoft.com/en-us/kb/2920591

Furthermore you should definitely review the Lists of recommended updates for your server components:
DFS
https://support.microsoft.com/en-us/kb/968429 (2008 / 2008 R2)
https://support.microsoft.com/en-us/kb/2951262 (2012 / 2012 R2)

File Services
https://support.microsoft.com/en-us/kb/2473205 (2008 / 2008 R2)
https://support.microsoft.com/en-us/kb/2899011 (2012 / 2012 R2)

Well then, this concludes this small (my first) blog series.
I hope you found reading it worthwhile and got some input for your infrastructures out there.

With best regards
Hubert


Troubleshooting failed password changes after installing MS16-101


Hi!

Linda Taylor here, Senior Escalation Engineer in the Directory Services space.

I have spent the last month working with customers worldwide who experienced password change failures after installing the updates under the MS16-101 security bulletin KBs (listed below), as well as working with the product group to get those addressed and documented in the public KB articles under the known issues section. It has been busy!

In this post I will aim to provide you with a quick “cheat sheet” of known issues and needed actions as well as ideas and troubleshooting techniques to get there.

Let’s start by understanding the changes.

The following 6 articles describe the changes in MS16-101 as well as a list of Known issues. If you have not yet applied MS16-101 I would strongly recommend reading these and understanding how they may affect you.

        3176492 Cumulative update for Windows 10: August 9, 2016
        3176493 Cumulative update for Windows 10 Version 1511: August 9, 2016
        3176495 Cumulative update for Windows 10 Version 1607: August 9, 2016
        3178465 MS16-101: Security update for Windows authentication methods: August 9, 2016
        3167679 MS16-101: Description of the security update for Windows authentication methods: August 9, 2016
        3177108 MS16-101: Description of the security update for Windows authentication methods: August 9, 2016

The good news is that this month’s updates address some of the known issues with MS16-101.

The bad news is that not all the issues are caused by a code defect in MS16-101; in some cases the right solution is to make your environment more secure by ensuring that the password change can happen over Kerberos and does not need to fall back to NTLM. That may include opening TCP ports used by Kerberos, fixing other Kerberos problems like missing SPNs, or changing your application code to pass in a valid domain name.

Let’s start with the basics…

Symptoms:

After applying MS16-101 fixes listed above, password changes may fail with the error code

“The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.”
Or
“The system cannot contact a domain controller to service the authentication request. Please try again later.”

This text maps to the error codes below:

Hexadecimal | Decimal    | Symbolic                  | Friendly
0xc0000388  | 1073740920 | STATUS_DOWNGRADE_DETECTED | The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.
0x80074f1   | 1265       | ERROR_DOWNGRADE_DETECTED  | The system detected a possible attempt to compromise security. Please make sure that you can contact the server that authenticated you.

Question: What does MS16-101 do and why would password changes fail after installing it?

Answer: As documented in the listed KB articles, the security updates that are provided in MS16-101 disable the ability of the Microsoft Negotiate SSP to fall back to NTLM for password change operations in the case where Kerberos fails with the STATUS_NO_LOGON_SERVERS (0xc000005e) error code.
In this situation, the password change will now fail (post MS16-101) with the above mentioned error codes (ERROR_DOWNGRADE_DETECTED / STATUS_DOWNGRADE_DETECTED).
Important: Password RESET is not affected by MS16-101 at all in any scenario. Only password change using the Negotiate package is affected.

So, now you understand the change, let’s look at the known issues and learn how to best identify and resolve those.

Summary and Cheat Sheet

To make it easier to follow I have matched the ordering of known issues in this post with the public KB articles above.

First, when troubleshooting a failed password change post MS16-101 you will need to understand HOW and WHERE the password change is happening and if it is for a domain account or a local account. Here is a cheat sheet.

Summary of scenarios and a quick reference of the action needed for each.

Scenario / Known issue 1: Domain password change fails via CTRL+ALT+DEL and shows an error like this:

clip_image001

Text: "System detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you."

Action needed: Troubleshoot using this guide and fix Kerberos.

Scenario / Known issue 2: Domain password change fails via application code with an INCORRECT/UNEXPECTED error code when a password that does not meet password complexity is entered.

For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION; after installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash.

Note: In these cases the password change works fine when a correct new password that complies with the password policy is entered.

Action needed: Install the October fixes in the table below.

Scenario / Known issue 3: Local user account password change fails via CTRL+ALT+DEL or application code.

Action needed: Install the October fixes in the table below.

Scenario / Known issue 4: Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.

Action needed: None. This is by design.

Scenario / Known issue 5: Domain password change fails via application code when a good password is entered.

This is the case where, if you pass a server name to NetUserChangePassword, the password change will fail post MS16-101. It previously worked because it relied on NTLM; NTLM is insecure and Kerberos is always preferred, therefore passing a domain name here is the way forward.

One thing to note here is that most of the ADSI and C#/.NET ChangePassword APIs end up calling NetUserChangePassword under the hood. Therefore, passing invalid domain names to these APIs will also fail. I have provided a detailed walkthrough example in this post with log snippets.

Action needed: Troubleshoot using this guide and fix the code to use Kerberos.

Scenario / Known issue 6: After you install the MS16-101 update, you may encounter 0xC0000022 NTLM authentication errors.

Action needed: To resolve this issue, see KB3195799 NTLM authentication fails with 0xC0000022 error for Windows Server 2012, Windows 8.1, and Windows Server 2012 R2 after update is applied.

Scenario / Known issue 7: After you install the security updates that are described in MS16-101, remote programmatic changes of a local user account password, and password changes across untrusted forests, fail with the STATUS_DOWNGRADE_DETECTED error as documented in this post.

This happens because the operation relies on NTLM fallback, since there is no Kerberos without a trust. NTLM fallback is forbidden by MS16-101.

Action needed: For this scenario you will need to install the October fixes in the table below and set the registry key NegoAllowNtlmPwdChangeFallback documented in the KBs below, which allows the NTLM fallback to happen again and unblocks this scenario.

http://support.microsoft.com/kb/3178465
http://support.microsoft.com/kb/3167679
http://support.microsoft.com/kb/3177108
http://support.microsoft.com/kb/3176492
http://support.microsoft.com/kb/3176495
http://support.microsoft.com/kb/3176493

Note: you may also consider using this registry key in an emergency for known issue 5 when it takes time to update the application code. However, please read the above articles carefully and only consider this a short-term solution for scenario 5.


Table of fixes for the known issues above, released 2016.10.11, taken from the MS16-101 security bulletin:

OS               | Fix needed
Vista / W2K8     | Re-install 3167679, re-released 2016.10.11
Win7 / W2K8 R2   | Install 3192391 (security only) or 3185330 (monthly rollup that includes security fixes)
WS12             | Install 3192393 (security only) or 3185332 (monthly rollup that includes security fixes)
Win8.1 / WS12 R2 | Install 3192392 (security only) or 3185331 (monthly rollup that includes security fixes)
Windows 10       | For 1511: 3192441 Cumulative update for Windows 10 Version 1511: October 11, 2016; for 1607: 3194798 Cumulative update for Windows 10 Version 1607 and Windows Server 2016: October 11, 2016

Troubleshooting

As I mentioned, this post is intended to support the documentation of the known issues in the MS16-101 KB articles and to provide help and guidance for troubleshooting. It should help you identify which known issue you are experiencing as well as provide resolution suggestions for each case.

I have also included a troubleshooting walkthrough of some of the more complex example cases. We will start with the problem definition, and then look at the available logs and tools to identify a suitable resolution. The idea is to teach "how to fish", because there can be many different scenarios, and hopefully you can apply these techniques and use the log files documented here to help resolve the issues when needed.

Once you know the scenario that you are using for the password change, the next step is usually to collect some data on the server or client where the password change is occurring. For example, if you have a web server running a password change application and doing password changes on behalf of users, you will need to collect the logs there. If in doubt, collect the logs from all involved machines and then look for the right one doing the password change using the snippets in the examples. Here are the helpful logs.

DATA COLLECTION

The same logs will help in all the scenarios.

LOGS

1. SPNEGO debug log / lsass.log

To enable this log run the following commands from an elevated admin CMD prompt to set the below registry keys:

reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v SPMInfoLevel /t REG_DWORD /d 0xC03E3F /f
reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v LogToFile /t REG_DWORD /d 1 /f
reg add HKLM\SYSTEM\CurrentControlSet\Control\LSA /v NegEventMask /t REG_DWORD /d 0xF /f


  • This will log Negotiate debug output to the %windir%\system32\lsass.log.
  • There is no need for reboot. The log is effective immediately.
  • Lsass.log is a text file that is easy to read with a text editor such as Wordpad.
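When you have finished collecting data, you can turn the logging off again by removing the same values (a minimal sketch; same key as above):

reg delete HKLM\SYSTEM\CurrentControlSet\Control\LSA /v SPMInfoLevel /f
reg delete HKLM\SYSTEM\CurrentControlSet\Control\LSA /v LogToFile /f
reg delete HKLM\SYSTEM\CurrentControlSet\Control\LSA /v NegEventMask /f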

2. Netlogon.log:

This log has been around for many years and is useful for troubleshooting DC LOCATOR traffic. It can be used together with a network trace to understand why the STATUS_NO_LOGON_SERVERS is being returned for the Kerberos password change attempt.

· To enable Netlogon debug logging run the following command from an elevated CMD prompt:

            nltest /dbflag:0x26FFFFFF

· The resulting log is found in %windir%\debug\netlogon.log & netlogon.bak

· There is no need for reboot. The log is effective immediately. See also 109626 Enabling debug logging for the Net Logon service

· The Netlogon.log (and Netlogon.bak) is a text file.

           Open the log with any text editor (I like good old Notepad.exe)

3. Collect a Network trace during the password change issue using the tool of your choice.
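If you have no capture tool installed, the built-in netsh trace command is one option (the trace file path and maximum size below are just examples); run it elevated, reproduce the failing password change, and then stop the trace. The resulting ETL file can be opened in Message Analyzer.

netsh trace start capture=yes tracefile=C:\temp\pwdchange.etl maxsize=512
netsh trace stop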

Scenarios, Explanations and Walkthroughs:

When reading this you should keep in mind that you may be seeing more than one scenario. The best thing to do is to start with one, fix that and see if there are any other problems left.

1. Domain password change fails via CTRL+ALT+DEL

This is most likely a Kerberos DC locator failure of some kind where the password changes were relying on NTLM before installing MS16-101 and are now failing. This is the simplest and easiest case to resolve using basic Kerberos troubleshooting methods.

Solution: Fix Kerberos.

Some tips from cases which we saw:

1. Use the Network trace to identify if the necessary communication ports are open. This was quite a common issue. So start by checking this.

         In order for Kerberos password changes to work, communication over TCP port 464 needs to be open between the client doing the password change and the domain controller.

Note on RODC: Read-only domain controllers (RODCs) can service password changes if the user is allowed by the RODCs password replication policy. Users who are not allowed by the RODC password policy require network connectivity to a read/write domain controller (RWDC) in the user account domain to be able to change the password.

           To check whether TCP port 464 is open, follow these steps (also documented in KB3167679):

             a. Create an equivalent display filter for your network monitor parser. For example:

                            ipv4.address== <ip address of client> && tcp.port==464

             b. In the results, look for the “TCP:[SynReTransmit” frame.

If you find these, then investigate firewall and open ports. It is often useful to take a simultaneous trace from the client and the domain controller and check if the packets are arriving at the other end.
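As a quick first check, you can also test basic TCP connectivity to port 464 from the affected client with PowerShell (Windows 8 / Windows Server 2012 or later; the DC name is just an example):

Test-NetConnection -ComputerName dc1.contoso.com -Port 464

If TcpTestSucceeded comes back False, look at firewalls and routing between the client and that DC.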

2. Make sure that the target Kerberos names are valid.

  • IP addresses are not valid Kerberos names
  • Kerberos supports short names and fully qualified domain names, like CONTOSO or Contoso.com

3. Make sure that service principal names (SPNs) are registered correctly.

For more information on troubleshooting Kerberos see https://blogs.technet.microsoft.com/askds/2008/05/14/troubleshooting-kerberos-authentication-problems-name-resolution-issues/ or https://technet.microsoft.com/en-us/library/cc728430(v=ws.10).aspx
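A quick way to confirm that the client can locate a KDC for the target domain at all (the domain name is just an example) is:

nltest /dsgetdc:contoso.com /kdc /force

If this fails, DC Locator cannot find a KDC for that domain, Kerberos returns STATUS_NO_LOGON_SERVERS, and with MS16-101 installed the password change can no longer fall back to NTLM.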

2. Domain password change fails via application code with an INCORRECT/UNEXPECTED Error code when a password which does not meet password complexity is entered.

For example, before installing MS16-101, such a password change may have returned a status like STATUS_PASSWORD_RESTRICTION. After installing MS16-101 it returns STATUS_DOWNGRADE_DETECTED, causing your application to behave in an unexpected way or even crash.

Note: In this scenario, password change succeeds when correct new password is entered that complies with the password policy.

Cause:

This issue is caused by a code defect in ADSI whereby the status returned from Kerberos was not returned to the user by ADSI correctly.
Here is a more detailed explanation of this one for the geek in you:

Before MS16-101 behavior:

  1. An application calls the ChangePassword method using the ADSI LDAP provider. (Setting and changing passwords with the ADSI LDAP provider is documented here.) Under the hood this calls Negotiate/Kerberos to change the password using a valid realm name. Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

  2. A second ChangePassword call is made via the NetUserChangePassword API, intentionally passing the <dcname> as the realm name. This uses Negotiate and retries Kerberos. Kerberos fails with STATUS_NO_LOGON_SERVERS because a DC name is not a valid realm name.

  3. Negotiate then retries over NTLM, which either succeeds or returns the same previous failure status.

The password change fails if a bad password was entered, and the NTLM error code is returned back to the application. If a valid password was entered, everything works: the first ChangePassword call passes in a good name, and if Kerberos works, the password change operation succeeds and you never enter step 3.

Post-MS16-101 behavior / why it fails with MS16-101 installed:

  1. An application calls the ChangePassword method using the ADSI LDAP provider. This calls Negotiate for the password change with a valid realm name. Kerberos returns STATUS_PASSWORD_RESTRICTION or another failure code.

  2. A second ChangePassword call is made via NetUserChangePassword with a <dcname> as the realm name, which fails over Kerberos with STATUS_NO_LOGON_SERVERS and triggers NTLM fallback.

  3. Because NTLM fallback is blocked by MS16-101, the error STATUS_DOWNGRADE_DETECTED is returned to the calling application.

Solution: Easy. Install the October update which will fix this issue. The fix lies in adsmsext.dll included in the October updates.

Again, here are the updates you need to install, taken from the MS16-101 security bulletin:

OS               | Fix needed
Vista / W2K8     | Re-install 3167679, re-released 2016.10.11
Win7 / W2K8 R2   | Install 3192391 (security only) or 3185330 (monthly rollup that includes security fixes)
WS12             | Install 3192393 (security only) or 3185332 (monthly rollup that includes security fixes)
Win8.1 / WS12 R2 | Install 3192392 (security only) or 3185331 (monthly rollup that includes security fixes)
Windows 10       | For 1511: 3192441 Cumulative update for Windows 10 Version 1511: October 11, 2016; for 1607: 3194798 Cumulative update for Windows 10 Version 1607 and Windows Server 2016: October 11, 2016

3. Local user account password change fails via CTRL+ALT+DEL or application code.

Installing the October updates listed above should also resolve this.

MS16-101 had a defect where Negotiate did not correctly determine that the password change was local and would try to find a DC using the local machine as the domain name.

This failed and NTLM fallback was no longer allowed post MS16-101. Therefore, the password changes failed with STATUS_DOWNGRADE_DETECTED.

Example:

One such scenario I saw, where password changes of local user accounts via CTRL+ALT+DEL failed with the message "The system detected a possible attempt to compromise security. Please ensure that you can contact the server that authenticated you.", was when the following group policy is set and you try to change the password of a local account:

Policy: Computer Configuration \ Administrative Templates \ System \ Logon \ "Assign a default domain for logon"
Path: HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\DefaultLogonDomain
Setting: DefaultLogonDomain
Data Type: REG_SZ
Value: "." (less quotes). The period or "dot" designates the local machine name.

Cause: In this case, post MS16-101 Negotiate incorrectly determined that the account is not local and tried to discover a DC using \\<machinename> as the domain and failed. This caused the password change to fail with the STATUS_DOWNGRADE_DETECTED error.

Solution: Install October fixes listed in the table at the top of this post.

4. Passwords for disabled and locked-out user accounts cannot be changed using the Negotiate method.

MS16-101 intentionally disabled changing the passwords of locked-out or disabled user accounts via Negotiate; this is by design.

Important: Password reset is not affected by MS16-101 at all in any scenario, only password change. Therefore, any application that performs a password reset is unaffected by MS16-101.

Another important thing to note is that MS16-101 only affects applications using Negotiate. Therefore, it is possible to change locked-out and disabled account passwords using other methods, such as LDAPS.

For example, the PowerShell cmdlet Set-ADAccountPassword will continue to work for locked out and disabled account password changes as it does not use Negotiate.
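For instance, a password change with the ActiveDirectory module looks roughly like this (the account name is a placeholder, and you are prompted for the old and new passwords as secure strings):

Import-Module ActiveDirectory
$old = Read-Host 'Current password' -AsSecureString
$new = Read-Host 'New password' -AsSecureString
Set-ADAccountPassword -Identity TestUser -OldPassword $old -NewPassword $new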

5. Troubleshooting Domain password change failure via application code when a good password is entered.

This is one of the most difficult scenarios to identify and troubleshoot, and therefore I have provided a more detailed example here, including sample code, the cause and the solution.

In summary, the solution in these cases is almost always to correct the application code, which may be passing in an invalid domain name such that Kerberos fails with STATUS_NO_LOGON_SERVERS.

Scenario:

An application is using the System.DirectoryServices.AccountManagement namespace to change a user's password.
https://msdn.microsoft.com/en-us/library/system.directoryservices.accountmanagement(v=vs.110).aspx

After installing MS16-101, password changes fail with STATUS_DOWNGRADE_DETECTED. Here is an example of a failing .NET code snippet invoked from PowerShell, which worked before MS16-101:

<snip>

Add-Type -AssemblyName System.DirectoryServices.AccountManagement
$ct = [System.DirectoryServices.AccountManagement.ContextType]::Domain
$ctoptions = [System.DirectoryServices.AccountManagement.ContextOptions]::SimpleBind -bor [System.DirectoryServices.AccountManagement.ContextOptions]::ServerBind
$pc = New-Object System.DirectoryServices.AccountManagement.PrincipalContext($ct, "contoso.com", "OU=Accounts,DC=Contoso,DC=Com", $ctoptions)
$idType = [System.DirectoryServices.AccountManagement.IdentityType]::SamAccountName
$up = [System.DirectoryServices.AccountManagement.UserPrincipal]::FindByIdentity($pc, $idType, "TestUser")
$up.ChangePassword("oldPassword!123", "newPassword!123")

<snip>

Data Analysis

There are 2 possibilities here:
(a) The application code is passing an incorrect domain name parameter causing Kerberos password change to fail to locate a DC.
(b) The application code is good and the Kerberos password change fails for another reason, such as a blocked port, a DNS issue, or a missing SPN.

Let’s start with (a) The application code is passing an incorrect domain name/parameter causing Kerberos password change to fail to locate a DC.

(a) Data Analysis Walkthrough Example based on a real case:

1. Start with Lsass.log (SPNEGO trace)

If you are troubleshooting a password change failure after MS16-101, look for the following text in lsass.log, which indicates that Kerberos failed and NTLM fallback was forbidden by MS16-101:

Failing Example:

[ 9/13 10:23:36] 492.2448> SPM-WAPI: [11b0.1014] Dispatching API (Message 0)
[ 9/13 10:23:36] 492.2448> SPM-Trace: [11b0] LpcDispatch: dispatching ChangeAccountPassword (1a)
[ 9/13 10:23:36] 492.2448> SPM-Trace: [11b0] LpcChangeAccountPassword()
[ 9/13 10:23:36] 492.2448> SPM-Helpers: [11b0] LsapCopyFromClient(0000005EAB78C9D8, 000000DA664CE5E0, 16) = 0
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword:
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: NegoExtender
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: Kerberos
[ 9/13 10:23:36] 492.2448> SPM-Warning: Failed to change password for account Test: 0xc000005e
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, attempting: NTLM
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, NTLM failed: not allowed to change domain passwords
[ 9/13 10:23:36] 492.2448> SPM-Neg: NegChangeAccountPassword, returning: 0xc0000388

  • 0xc000005E is STATUS_NO_LOGON_SERVERS
  • 0xc0000388 is STATUS_DOWNGRADE_DETECTED

If you see this, it means Kerberos failed to locate a domain controller in the domain and fallback to NTLM is not allowed by MS16-101. Next, look at the Netlogon.log and the network trace to understand why.

2. Network trace

Look at the network trace and filter the traffic based on the client IP, DNS and any authentication related traffic.
You may see the client is requesting a Kerberos ticket using an invalid SPN like:


Source | Destination | Description
Client | DC1         | KerberosV5:TGS Request Realm: CONTOSO.COM Sname: ldap/contoso.com  {TCP:45, IPv4:7}
DC1    | Client      | KerberosV5:KRB_ERROR – KDC_ERR_S_PRINCIPAL_UNKNOWN (7)  {TCP:45, IPv4:7}

So here the client tried to get a ticket for the ldap/contoso.com SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN because this SPN is not registered anywhere.

  • This is expected. A valid LDAP SPN looks like ldap/DC1.contoso.com, for example.

Next let’s check the Netlogon.log

3. Netlogon.log:

Open the log with any text editor (I like good old Notepad.exe) and check the following:

  • Is a valid domain name being passed to DC locator?

Invalid names such as \\servername.contoso.com or an IP address such as \\x.y.z.w will cause DC Locator to fail, and thus the Kerberos password change returns STATUS_NO_LOGON_SERVERS. Once that happens, NTLM fallback is not allowed and the password change fails.

If you find this issue, examine the application code and make the necessary changes to ensure a correctly formatted domain name is passed to whichever ChangePassword API is being used.

Example of failure in Netlogon.log:

[MISC] [PID] DsGetDcName function called: client PID=1234, Dom:\\contoso.com Acct:(null) Flags: IP KDC
[MISC] [PID] DsGetDcName function returns 1212 (client PID=1234): Dom:\\contoso.com Acct:(null) Flags: IP KDC

\\contoso.com is not a valid domain name. (contoso.com is a valid domain name)

This error translates to:

Hex      Decimal   Error                      Description                                            Source
0x4bc    1212      ERROR_INVALID_DOMAINNAME   The format of the specified domain name is invalid.    winerror.h

So what happened here?

The application code passed an invalid TargetName to Kerberos. It used the domain name as a server name, which is why we see the SPN ldap/contoso.com.

The client tried to get a ticket for this SPN and failed with KDC_ERR_S_PRINCIPAL_UNKNOWN because this SPN is not registered anywhere. As noted above, this is expected; a valid LDAP SPN looks like ldap/DC1.contoso.com.

The application code then retried the password change and passed in \\contoso.com as the domain name. Anything beginning with \\ is not a valid domain name, and neither is an IP address, so DC Locator fails to locate a DC when given this name. We can see this in the Netlogon.log and the network trace.

Conclusion and Solution

If the domain name is invalid here, examine the code snippet which is doing the password change to understand why the wrong name is passed in.

The fix in these cases is to change the code so that a valid domain name is passed to Kerberos, allowing the password change to happen over Kerberos rather than NTLM. NTLM is not secure; if Kerberos is possible, it should be the protocol used.

SOLUTION

The solution here was to remove “ContextOptions.ServerBind | ContextOptions.SimpleBind” and allow the code to use the default (Negotiate). The issue was caused by combining a Domain context with ServerBind. Negotiate with a Domain context is the combination that works and is successfully able to use Kerberos.

Working code:

<snip>
Add-Type -AssemblyName System.DirectoryServices.AccountManagement
$ct = [System.DirectoryServices.AccountManagement.ContextType]::Domain
$pc = New-Object System.DirectoryServices.AccountManagement.PrincipalContext($ct, "contoso.com", "OU=Accounts,DC=Contoso,DC=Com")
$idType = [System.DirectoryServices.AccountManagement.IdentityType]::SamAccountName
$up = [System.DirectoryServices.AccountManagement.UserPrincipal]::FindByIdentity($pc, $idType, "TestUser")
$up.ChangePassword("oldPassword!123", "newPassword!123")

<snip>

Why does this code work before MS16-101 and fail after?

ContextOptions are documented here: https://msdn.microsoft.com/en-us/library/system.directoryservices.accountmanagement.contextoptions(v=vs.110).aspx

Specifically: “This parameter specifies the options that are used for binding to the server. The application can set multiple options that are linked with a bitwise OR operation.”

Passing in a domain name such as contoso.com with the ContextOptions ServerBind or SimpleBind causes the client to attempt to use an SPN like ldap/contoso.com, because it expects the name that is passed in to be a server name.

This is not a valid SPN and does not exist, so the request fails and Kerberos returns STATUS_NO_LOGON_SERVERS.
Before MS16-101, in this scenario, the Negotiate package would fall back to NTLM, attempt the password change using NTLM and succeed.
Post-MS16-101, this fallback is not allowed and Kerberos is enforced.

(b) The application code is good, but Kerberos fails to locate a DC for another reason

If you see a correct domain name and SPNs in the above logs, then Kerberos is failing for some other reason, such as blocked TCP ports. In this case, go back to Scenario 1 to troubleshoot why Kerberos failed to locate a domain controller.
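A couple of quick checks from the affected client can help narrow this down. This is just a sketch; the DC name and domain below are placeholders, not values from the case:

# Is TCP 88 (Kerberos) reachable on a DC, and do the Kerberos SRV records resolve?
Test-NetConnection -ComputerName dc1.contoso.com -Port 88
Resolve-DnsName -Name _kerberos._tcp.contoso.com -Type SRV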

You may also have both (a) and (b); traces and logs are the best tools to identify which.

Scenario 6: After you install the MS16-101 update, you may encounter 0xC0000022 NTLM authentication errors.

I will not go into detail on this scenario, as it is well described in KB3195799, “NTLM authentication fails with 0xC0000022 error for Windows Server 2012, Windows 8.1, and Windows Server 2012 R2 after update is applied”.

That’s all for today! I hope you find this useful. I will update this post if any new information arises.

Linda Taylor | Senior Escalation Engineer | Windows Directory Services
(A well established member of the content police.)

Using Debugging Tools to Find Token and Session Leaks


Hello AskDS readers and Identity aficionados. Long time no blog.

Ryan Ries here, and today I have a relatively “hardcore” blog post that will not be for the faint of heart. However, it’s about an important topic.

The behavior surrounding security tokens and logon sessions has recently changed on all supported versions of Windows. IT professionals – developers and administrators alike – should understand what this new behavior is, how it can affect them, and how to troubleshoot it.

But first, a little background…

Figure 1 - Tokens

Figure 1 – Tokens

Windows uses security tokens (or access tokens) extensively to control access to system resources. Every thread running on the system uses a security token, and may own several at a time. Threads inherit the security tokens of their parent processes by default, but they may also use special security tokens that represent other identities in an activity known as impersonation. Since security tokens are used to grant access to resources, they should be treated as highly sensitive, because if a malicious user can gain access to someone else’s security token, they will be able to access resources that they would not normally be authorized to access.

Note: Here are some additional references you should read first if you want to know more about access tokens:

If you are an application developer, your application or service may want to create or duplicate tokens for the legitimate purpose of impersonating another user. A typical example would be a server application that wants to impersonate a client to verify that the client has permissions to access a file or database. The application or service must be diligent in how it handles these access tokens by releasing/destroying them as soon as they are no longer needed. If the code fails to call the CloseHandle function on a token handle, that token can then be “leaked” and remain in memory long after it is no longer needed.
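To make that concrete, here is a small, hypothetical PowerShell/P-Invoke sketch of the duplicate-then-close pattern (a real service would normally do this in native code or via the .NET impersonation classes); the finally block is the part that leaky code forgets:

Add-Type -Namespace Win32 -Name NativeMethods -MemberDefinition @'
[DllImport("advapi32.dll", SetLastError = true)]
public static extern bool OpenProcessToken(IntPtr ProcessHandle, uint DesiredAccess, out IntPtr TokenHandle);
[DllImport("advapi32.dll", SetLastError = true)]
public static extern bool DuplicateToken(IntPtr ExistingTokenHandle, int ImpersonationLevel, out IntPtr DuplicateTokenHandle);
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool CloseHandle(IntPtr hObject);
'@

$token = [IntPtr]::Zero
$dup   = [IntPtr]::Zero
try {
    # TOKEN_DUPLICATE (0x0002) | TOKEN_QUERY (0x0008) on our own process token, purely for illustration
    $null = [Win32.NativeMethods]::OpenProcessToken((Get-Process -Id $PID).Handle, 0x000A, [ref]$token)
    $null = [Win32.NativeMethods]::DuplicateToken($token, 2, [ref]$dup)   # 2 = SecurityImpersonation
    # ... impersonate / perform access checks here ...
}
finally {
    # Release every token handle that was created - skipping this is what leaks logon sessions
    if ($dup   -ne [IntPtr]::Zero) { $null = [Win32.NativeMethods]::CloseHandle($dup) }
    if ($token -ne [IntPtr]::Zero) { $null = [Win32.NativeMethods]::CloseHandle($token) }
}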

And that brings us to Microsoft Security Bulletin MS16-111.

Here is an excerpt from that Security Bulletin:

Multiple Windows session object elevation of privilege vulnerabilities exist in the way that Windows handles session objects.

A locally authenticated attacker who successfully exploited the vulnerabilities could hijack the session of another user.
To exploit the vulnerabilities, the attacker could run a specially crafted application.
The update corrects how Windows handles session objects to prevent user session hijacking.

Those vulnerabilities were fixed with that update, and I won’t further expound on the “hacking/exploiting” aspect of this topic. We’re here to explore this from a debugging perspective.

This update is significant because it changes how the relationship between tokens and logon sessions is treated across all supported versions of Windows going forward. Applications and services that erroneously leak tokens have always been with us, but the penalty paid for leaking tokens is now greater than before. After MS16-111, when security tokens are leaked, the logon sessions associated with those security tokens also remain on the system until all associated tokens are closed… even after the user has logged off the system. If the tokens associated with a given logon session are never released, then the system now also has a permanent logon session leak as well. If this leak happens often enough, such as on a busy Remote Desktop/Terminal Server where users are logging on and off frequently, it can lead to resource exhaustion on the server, performance issues and denial of service, ultimately causing the system to require a reboot to be returned to service.

Therefore, it’s more important than ever to be able to identify the symptoms of token and session leaks, track down token leaks on your systems, and get your application vendors to fix them.

How Do I Know If My Server Has Leaks?

As mentioned earlier, this problem affects heavily-utilized Remote Desktop Session Host servers the most, because users are constantly logging on and logging off the server. The issue is not limited to Remote Desktop servers, but symptoms will be most obvious there.

Figuring out that you have logon session leaks is the easy part. Just run qwinsta at a command prompt:

Figure 2 - qwinsta

Figure 2 – qwinsta

Pay close attention to the session ID numbers, and notice the large gap between session 2 and session 152. This is the clue that the server has a logon session leak problem. The next user that logs on will get session 153, the next user will get session 154, the next user will get session 155, and so on. But the session IDs will never be reused. We have 150 “leaked” sessions in the screenshot above, where no one is logged on to those sessions, no one will ever be able to log on to those sessions ever again (until a reboot,) yet they remain on the system indefinitely. This means each user who logs onto the system is inadvertently leaving tokens lying around in memory, probably because some application or service on the system duplicated the user’s token and didn’t release it. These leaked sessions will forever be unusable and soak up system resources. And the problem will only get worse as users continue to log on to the system. In an optimal situation where there were no leaks, sessions 3-151 would have been destroyed after the users logged out and the resources consumed by those sessions would then be reusable by subsequent logons.
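If you would rather not eyeball the output, here is a rough PowerShell sketch (it assumes the default English qwinsta output and uses an arbitrary gap threshold) that pulls out the session IDs and flags large gaps:

# Grab the numeric session IDs from qwinsta (skipping the header and the RDP listener) and report gaps.
$ids = qwinsta | Select-Object -Skip 1 | Where-Object { $_ -notmatch 'Listen' } | ForEach-Object {
    $m = [regex]::Match($_, '\s(\d+)\s')     # first stand-alone number on the line is the session ID
    if ($m.Success) { [int]$m.Groups[1].Value }
} | Sort-Object

for ($i = 1; $i -lt $ids.Count; $i++) {
    $gap = $ids[$i] - $ids[$i - 1]
    if ($gap -gt 10) { "Gap of $gap between session $($ids[$i - 1]) and session $($ids[$i])" }
}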

How Do I Find Out Who’s Responsible?

Now that you know you have a problem, next you need to track down the application or service that is responsible for leaking access tokens. When an access token is created, the token is associated to the logon session of the user who is represented by the token, and an internal reference count is incremented. The reference count is decremented whenever the token is destroyed. If the reference count never reaches zero, then the logon session is never destroyed or reused. Therefore, to resolve the logon session leak problem, you must resolve the underlying token leak problem(s). It’s an all-or-nothing deal. If you fix 10 token leaks in your code but miss 1, the logon session leak will still be present as if you had fixed none.

Before we proceed: I would recommend debugging this issue on a lab machine, rather than on a production machine. If you have a logon session leak problem on your production machine, but don’t know where it’s coming from, then install all the same software on a lab machine as you have on the production machine, and use that for your diagnostic efforts. You’ll see in just a second why you probably don’t want to do this in production.

The first step to tracking down the token leaks is to enable token leak tracking on the system.

Modify this registry setting:

HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Kernel
    SeTokenLeakDiag = 1 (DWORD)

The registry setting won’t exist by default unless you’ve done this before, so create it. It also did not exist prior to MS16-111, so don’t expect it to do anything if the system does not have MS16-111 installed. This registry setting enables extra accounting on token issuance that you will be able to detect in a debugger, and there may be a noticeable performance impact on busy servers. Therefore, it is not recommended to leave this setting in place unless you are actively debugging a problem. (i.e. don’t do it in production exhibit A.)
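For reference, one way to create it from an elevated PowerShell prompt (and to remove it again once you are done debugging):

# Enable the extra token accounting described above (remember to remove it afterwards)
$key = 'HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager\Kernel'
New-ItemProperty -Path $key -Name 'SeTokenLeakDiag' -PropertyType DWord -Value 1 -Force

# When you are finished debugging:
# Remove-ItemProperty -Path $key -Name 'SeTokenLeakDiag'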

Prior to the existence of this registry setting, token leak tracing of this kind used to require using a checked build of Windows. And Microsoft seems to not be releasing a checked build of Server 2016, so… good timing.

Next, you need to configure the server to take a full or kernel memory dump when it crashes. (A live kernel debug may also be an option, but that is outside the scope of this article.) I recommend using DumpConfigurator to configure the computer for complete crash dumps. A kernel dump should be enough to see most of what we need, but get a Complete dump if you can.

Figure 3 - DumpConfigurator

Figure 3 – DumpConfigurator

Then reboot the server for the settings to take effect.

Next, you need users to log on and off the server, so that the logon session IDs continue to climb. Since you’re doing this in a lab environment, you might want to use a script to automatically logon and logoff a set of test users. (I provided a sample script for you here.) Make sure you’ve waited 10 minutes after the users have logged off to verify that their logon sessions are permanently leaked before proceeding.

Finally, crash the box. Yep, just crash it. (i.e. don’t do it in production exhibit B.) On a physical machine, this can be done by hitting Right-Ctrl+Scroll+Scroll if you configured the appropriate setting with DumpConfigurator earlier. If this is a Hyper-V machine, you can use the following PowerShell cmdlet on the Hyper-V host:

Debug-VM -VM (Get-VM RDS1) -InjectNonMaskableInterrupt

You may have at your disposal other means of getting a non-maskable interrupt to the machine, such as an out-of-band management card (iLO/DRAC, etc.,) but the point is to deliver an NMI to the machine, and it will bugcheck and generate a memory dump.

Now transfer the memory dump file (C:\Windows\Memory.dmp usually) to whatever workstation you will use to perform your analysis.

Note: Memory dumps may contain sensitive information, such as passwords, so be mindful when sharing them with strangers.

Next, install the Windows Debugging Tools on your workstation if they’re not already installed. I downloaded mine for this demo from the Windows Insider Preview SDK here. But they also come with the SDK, the WDK, WPT, Visual Studio, etc. The more recent the version, the better.

Next, download the MEX Debugging Extension for WinDbg. Engineers within Microsoft have been using the MEX debugger extension for years, but only recently has a public version of the extension been made available. The public version is stripped-down compared to the internal version, but it’s still quite useful. Unpack the file and place mex.dll into your C:\Debuggers\winext directory, or wherever you installed WinDbg.

Now, ensure that your symbol path is configured correctly to use the Microsoft public symbol server within WinDbg:

Figure 4 - Example Symbol Path in WinDbg

Figure 4 – Example Symbol Path in WinDbg

The example symbol path above tells WinDbg to download symbols from the specified URL, and store them in your local C:\Symbols directory.
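If you prefer setting it with the .sympath command instead of the GUI, a typical value (assuming you want the cache in C:\Symbols, as in the figure) looks like this:

.sympath srv*C:\Symbols*https://msdl.microsoft.com/download/symbols
.reload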

Finally, you are ready to open your crash dump in WinDbg:

Figure 5 - Open Crash Dump from WinDbg

Figure 5 – Open Crash Dump from WinDbg

After opening the crash dump, the first thing you’ll want to do is load the MEX debugging extension that you downloaded earlier, by typing the command:

Figure 6 - .load mex

Figure 6 – .load mex

The next thing you probably want to do is start a log file. It will record everything that goes on during this debugging session, so that you can refer to it later in case you forgot what you did or where you left off.

Figure 7 - !logopen

Figure 7 – !logopen

Another useful command that is among the first things I always run is !DumpInfo, abbreviated !di, which simply gives some useful basic information about the memory dump itself, so that you can verify at a glance that you’ve got the correct dump file, which machine it came from and what type of memory dump it is.

Figure 8 - !DumpInfo

Figure 8 – !DumpInfo

You’re ready to start debugging.

At this point, I have good news and I have bad news.

The good news is that there already exists a super-handy debugger extension that lists all the logon session kernel objects, their associated token reference counts, what process was responsible for creating the token, and even the token creation stack, all with a single command! It’s !kdexts.logonsession, and it is awesome.

The bad news is that it doesn’t work… not with public symbols. It only works with private symbols. Here is what it looks like with public symbols:

Figure 9 - !kdexts.logonsession - public symbols lead to lackluster output

Figure 9 – !kdexts.logonsession – public symbols lead to lackluster output

As you can see, most of the useful stuff is zeroed out.

Since public symbols are all you have unless you work at Microsoft, (and we wish you did,) I’m going to teach you how to do what !kdexts.logonsession does, manually. The hard way. Plus some extra stuff. Buckle up.

First, you should verify whether token leak tracking was turned on when this dump was taken. (That was the registry setting mentioned earlier.)

Figure 10 - SeTokenLeakTracking = <no type information>

Figure 10 – x nt!SeTokenLeakTracking = <no type information>

OK… That was not very useful. We’re getting <no type information> because we’re using public symbols. But this symbol corresponds to the SeTokenLeakDiag registry setting that we configured earlier, and we know that’s just 0 or 1, so we can just guess what type it is:

Figure 11 - db nt!SeTokenLeakTracking L1

Figure 11 – db nt!SeTokenLeakTracking L1

The db command means “dump bytes.” (dd, or “dump DWORDs,” would have worked just as well.) You should have a symbol for nt!SeTokenLeakTracking if you configured your symbol path properly, and the L1 tells the debugger to just dump the first byte it finds. It should be either 0 or 1. If it’s 0, then the registry setting that we talked about earlier was not set properly, and you can basically just discard this dump file and get a new one. If it’s 1, you’re in business and may proceed.

Next, you need to locate the logon session lists.

Figure 12 - dp nt!SepLogonSessions L1

Figure 12 – dp nt!SepLogonSessions L1

Like the previous step, dp means “display pointer,” then the name of the symbol, and L1 to just display a single pointer. The 64-bit value on the right is the pointer, and the 64-bit value on the left is the memory address of that pointer.

Now we know where our lists of logon sessions begin. (Lists, plural.)

The SepLogonSessions pointer points to not just a list, but an array of lists. These lists are made up of _SEP_LOGON_SESSION_REFERENCES structures.

Using the dps command (display contiguous pointers) and specifying the beginning of the array that we got from the last step, we can now see where each of the lists in the array begins:

Figure 13 - dps 0xffffb808`3ea02650 – displaying pointers that point to the beginning of each list in the array

Figure 13 – dps 0xffffb808`3ea02650 – displaying pointers that point to the beginning of each list in the array

If there were not very many logon sessions on the system when the memory dump was taken, you might notice that not all the lists are populated:

Figure 14 - Some of the logon session lists are empty because not very many users had logged on in this example

Figure 14 – Some of the logon session lists are empty because not very many users had logged on in this example

The array doesn’t fill up contiguously, which is a bummer. You’ll have to skip over the empty lists.

If we wanted to walk just the first list in the array (we’ll talk more about dt and linked lists in just a minute,) it would look something like this:

Figure 15 - Walking the first list in the array and using !grep to filter the output

Figure 15 – Walking the first list in the array and using !grep to filter the output

Notice that I used the !grep command to filter the output for the sake of brevity and readability. It’s part of the Mex debugger extension. I told you it was handy. If you omit the !grep AccountName part, you would get the full, unfiltered output. I chose “AccountName” arbitrarily as a keyword because I knew that was a word that was unique to each element in the list. !grep will only display lines that contain the keyword(s) that you specify.

Next, if we wanted to walk through the entire array of lists all at once, it might look something like this:

Figure 16 - Walking through the entire array of lists!

Figure 16 – Walking through the entire array of lists!

OK, I realize that I just went bananas there, but I’ll explain what just happened step-by-step.

When you are using the Mex debugger extension, you have access to many new text parsing and filtering commands that can truly enhance your debugging experience. When you look at a long command like the one I just showed, read it from right to left. The commands on the right are fed into the command to their left.

So from right to left, let’s start with !cut -f 2 dps ffffb808`3ea02650

We already showed what the dps <address> command did earlier. The !cut -f 2 command filters that command’s output so that it only displays the second part of each line separated by whitespace. So essentially, it will display only the pointers themselves, and not their memory addresses.

Like this:

Figure 17 - Using !cut to select just the second token in each line of output

Figure 17 – Using !cut to select just the second token in each line of output

Then that is “piped” line-by-line into the next command to the left, which was:

!fel -x “dt nt!_SEP_LOGON_SESSION_REFERENCES @#Line -l Next”

!fel is an abbreviation for !foreachline.

This command instructs the debugger to execute the given command for each line of output supplied by the previous command, where the @#Line pseudo-variable represents the individual line of output. For each line of output that came from the dps command, we are going to use the dt command with the -l parameter to walk that list. (More on walking lists in just a second.)

Next, we use the !grep command to filter all of that output so that only a single unique line is shown from each list element, as I showed earlier.

Finally, we use the !count -q command to suppress all of the output generated up to that point, and instead only tell us how many lines of output it would have generated. This should be the total number of logon sessions on the system.
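Put back together, the whole chain from Figure 16 looks something like this (reconstructed from the pieces described above; the list-head address will be different in your dump):

!count -q !grep AccountName !fel -x "dt nt!_SEP_LOGON_SESSION_REFERENCES @#Line -l Next" !cut -f 2 dps ffffb808`3ea02650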

And 380 was in fact the exact number of logon sessions on the computer when I collected this memory dump. (Refer to Figure 16.)

Alright… now let’s take a deep breath and a step back. We just walked an entire array of lists of structures with a single line of commands. But now we need to zoom in and take a closer look at the data structures contained within those lists.

Remember, ffffb808`3ea02650 was the very beginning of the entire array.

Let’s examine just the very first _SEP_LOGON_SESSION_REFERENCES entry of the first list, to see what such a structure looks like:

Figure 18 - dt _SEP_LOGON_SESSION_REFERENCES* ffffb808`3ea02650

Figure 18 – dt _SEP_LOGON_SESSION_REFERENCES* ffffb808`3ea02650

That’s a logon session!

Let’s go over a few of the basic fields in this structure. (Skipping some of the more advanced ones.)

  • Next: This is a pointer to the next element in the list. You might notice that there’s a “Next,” but there’s no “Previous.” So, you can only walk the list in one direction. This is a singly-linked list.
  • LogonId: Every logon gets a unique one. For example, “0x3e7” is always the “System” logon.
  • ReferenceCount: This is how many outstanding token references this logon session has. This is the number that must reach zero before the logon session can be destroyed. In our example, it’s 4.
  • AccountName: The user who does or used to occupy this session.
  • AuthorityName: Will be the user’s Active Directory domain, typically. Or the computer name if it’s a local account.
  • TokenList: This is a doubly or circularly-linked list of the tokens that are associated with this logon session. The number of tokens in this list should match the ReferenceCount.

The following is an illustration of a doubly-linked list:

Figure 19 - Doubly or circularly-linked list

Figure 19 – Doubly or circularly-linked list

“Flink” stands for Forward Link, and “Blink” stands for Back Link.

So now that we understand that the TokenList member of the _SEP_LOGON_SESSION_REFERENCES structure is a linked list, here is how you walk that list:

Figure 20 - dt nt!_LIST_ENTRY* 0xffffb808`500bdba0+0x0b0 -l Flink

Figure 20 – dt nt!_LIST_ENTRY* 0xffffb808`500bdba0+0x0b0 -l Flink

The dt command stands for “display type,” followed by the symbol name of the type that you want to cast the following address to. The reason why we specified the address 0xffffb808`500bdba0 is because that is the address of the _SEP_LOGON_SESSION_REFERENCES object that we found earlier. The reason why we added +0x0b0 after the memory address is because that is the offset from the beginning of the structure at which the TokenList field begins. The -l parameter specifies that we’re trying to walk a list, and finally you must specify a field name (Flink in this case) that tells the debugger which field to use to navigate to the next node in the list.

We walked a list of tokens and what did we get? A list head and 4 data nodes, 5 entries total, which lines up with the ReferenceCount of 4 tokens that we saw earlier. One of the nodes won’t have any data – that’s the list head.

Now, for each entry in the linked list, we can examine its data. We know the payloads that these list nodes carry are tokens, so we can use dt to cast them as such:

Figure 21 - dt _TOKEN*0xffffb808`4f565f40+8+8 - Examining the first token in the list

Figure 21 – dt _TOKEN*0xffffb808`4f565f40+8+8 – Examining the first token in the list

The reason for the +8+8 on the end is because that’s the offset of the payload. It’s just after the Flink and Blink as shown in Figure 19. You want to skip over them.

We can see that this token is associated to SessionId 0x136/0n310. (Remember I had 380 leaked sessions in this dump.) If you examine the UserAndGroups member by clicking on its DML (click the link,) you can then use !sid to see the SID of the user this token represents:

Figure 22 - Using !sid to see the security identifier in the token

Figure 22 – Using !sid to see the security identifier in the token

The token also has a DiagnosticInfo structure, which is super-interesting, and is the coolest thing that we unlocked when we set the SeTokenLeakDiag registry setting on the machine earlier. Let’s look at it:

Figure 23 - Examining the DiagnosticInfo structure of the first token

Figure 23 – Examining the DiagnosticInfo structure of the first token

We now have the process ID and the thread ID that was responsible for creating this token! We could examine the ImageFileName, or we could use the ProcessCid to see who it is:

Figure 24 - Using !mex.tasklist to find a process by its PID

Figure 24 – Using !mex.tasklist to find a process by its PID

Oh… Whoops. Looks like this particular token leak is lsass’s fault. You’re just going to have to let the *ahem* application vendor take care of that one.

Let’s move on to a different token leak. We’re moving on to a different memory dump file as well, so the memory addresses are going to be different from here on out.

I created a special token-leaking application specifically for this article. It looks like this:

Figure 25 - RyansTokenGrabber.exe

Figure 25 – RyansTokenGrabber.exe

It monitors the system for users logging on, and as soon as they do, it duplicates their token via the DuplicateToken API call. I purposely never release those tokens, so if I collect a memory dump of the machine while this is running, then evidence of the leak should be visible in the dump, using the same steps as before.

Using the same debugging techniques I just demonstrated, I verified that I have leaked logon sessions in this memory dump as well, and each leaked session has an access token reference that looks like this:

Figure 26 - A _TOKEN structure shown with its attached DiagnosticInfo

Figure 26 – A _TOKEN structure shown with its attached DiagnosticInfo

And then by looking at the token’s DiagnosticInfo, we find that the guilty party responsible for leaking this token is indeed RyansTokenGrabber.exe:

Figure 27 - The process responsible for leaking this token

Figure 27 – The process responsible for leaking this token

By this point you know who to blame, and now you can go find the author of RyansTokenGrabber.exe, and show them the stone-cold evidence that you’ve collected about how their application is leaking access tokens, leading to logon session leaks, causing you to have to reboot your server every few days, which is a ridiculous and inconvenient thing to have to do, and you shouldn’t stand for it!

We’re almost done, but I have one last trick to show you.

If you examine the StackTrace member of the token’s DiagnosticInfo, you’ll see something like this:

Figure 28 - DiagnosticInfo.CreateTrace

Figure 28 – DiagnosticInfo.CreateTrace

This is a stack trace. It’s a snapshot of all the function calls that led up to this token’s creation. These stack traces grew upwards, so the function at the top of the stack was called last. But the function addresses are not resolving. We must do a little more work to figure out the names of the functions.

First, clean up the output of the stack trace:

Figure 29 - Using !grep and !cut to clean up the output

Figure 29 – Using !grep and !cut to clean up the output

Now, using all the snazzy new Mex magic you’ve learned, see if you can unassemble (that’s the u command) each address to see if it resolves to a function name:

Figure 30 - Unassemble instructions at each address in the stack trace

Figure 30 – Unassemble instructions at each address in the stack trace

The output continues beyond what I’ve shown above, but you get the idea.

The function on top of the trace will almost always be SepDuplicateToken, but could also be SepCreateToken or SepFilterToken, and whether one creation method was used versus another could be a big hint as to where in the program’s code to start searching for the token leak. You will find that the usefulness of these stacks will vary wildly from one scenario to the next, as things like inlined functions, lack of symbols, unloaded modules, and managed code all influence the integrity of the stack. However, you (or the developer of the application you’re using) can use this information to figure out where the token is being created in this program, and fix the leak.

Alright, that’s it. If you’re still reading this, then… thank you for hanging in there. I know this wasn’t exactly a light read.

And lastly, allow me to reiterate that this is not just a contrived, unrealistic scenario; There’s a lot of software out there on the market that does this kind of thing. And if you happen to write such software, then I really hope you read this blog post. It may help you improve the quality of your software in the future. Windows needs application developers to be “good citizens” and avoid writing software with the ability to destabilize the operating system. Hopefully this blog post helps someone out there do just that.

Until next time,
Ryan “Too Many Tokens” Ries

Active Directory Experts: apply within


Hi all! Justin Turner here from the Directory Services team with a brief announcement: We are hiring!

Would you like to join the U.S. Directory Services team and work on the most technically challenging and interesting Active Directory problems? Do you want to be the next Ned Pyle or Linda Taylor?

Then read more…

We are an escalation team based out of Irving, Texas; Charlotte, North Carolina; and Fargo, North Dakota. We work with enterprise customers, helping them resolve the most critical Active Directory infrastructure problems as well as enabling them to get the best of Microsoft Windows and Identity-related technologies. The work we do is no ordinary support – we work with a huge variety of customer environments, and rarely are two problems the same.

You will need strong AD knowledge and strong troubleshooting skills, along with great collaboration, teamwork and customer service skills.

If this sounds like you, please apply here:

Irving, Texas (Las Colinas):
https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=2&jid=290399&jlang=en&pp=ss

Charlotte, North Carolina:
https://careers.microsoft.com/jobdetails.aspx?ss=&pg=0&so=&rw=1&jid=290398&jlang=en&pp=ss

U.S. citizenship is required for these positions.


Justin

Introducing Lingering Object Liquidator v2


Greetings again AskDS!

Ryan Ries here. Got something exciting to talk about.

You might be familiar with the original Lingering Object Liquidator tool that was released a few years ago.

Today, we’re proud to announce version 2 of Lingering Object Liquidator!

Because Justin’s blog post from 2014 covers the fundamentals of what lingering objects are so well, I don’t think I need to go over it again here. If you need to know what lingering objects in Active Directory are, and why you want to get rid of them, then please go read that post first.

The new version of the Lingering Object Liquidator tool began its life as an attempt to address some of the long-standing limitations of the old version. For example, the old version would just stop the entire scan when it encountered a single domain controller that was unreachable. The new version will just skip the unreachable DC and continue scanning the other DCs that are reachable. There are multiple other improvements in the tool as well, such as multithreading and more exhaustive logging.

Before we take a look at the new tool, there are some things you should know:

1) Lingering Object Liquidator – neither the old version nor the new version – is supported by CSS Support Engineers. A small group of us (including yours truly) have provided this tool as a convenience to you, but it comes with no guarantees. If you find a problem with the tool, or have a feature request, drop a line to the public AskDS email address, or submit feedback to the Windows Server UserVoice forum, but please don’t bother Support Engineers with it on a support case.

2) Don’t immediately go into your production Active Directory forest and start wildly deleting things just because they show up as lingering objects in the tool. Please carefully review and consider any AD objects that are reported to be lingering objects before deleting.

3) The tool may report some false positives for deleted objects that are very close to the garbage collection age. To mitigate this issue, you can manually initiate garbage collection on your domain controllers before using this tool. (We may make the tool do this automatically in the future.)

4) The tool will continue to evolve and improve based on your feedback! Contact the AskDS alias or the UserVoice forum linked to in #1 above with any questions, concerns, bug reports or feature requests.

Graphical User Interface Elements

Let’s begin by looking at the graphical user interface. Below is a legend that explains each UI element:

Lingering Object Liquidator v2

Lingering Object Liquidator v2

A) “Help/About” label. Click this and a page should open up in your default web browser with extra information and detail regarding Lingering Object Liquidator.

B) “Check for Updates” label. Click this and the tool will check for a newer version than the one you’re currently running.

C) “Detect AD Topology” button. This is the first button that should be clicked in most scenarios. The AD Topology must be generated first, before proceeding on to the later phases of lingering object detection and removal.

D) “Naming Context” drop-down menu. (Naming Contexts are sometimes referred to as partitions.) Note that this drop-down menu is only available after AD Topology has been successfully discovered. It contains each Active Directory naming context in the forest. If you know precisely which Active Directory Naming context that you want to scan for lingering objects, you can select it from this menu. (Note: The Schema partition is omitted because it does not support deletion, so in theory it cannot contain lingering objects.) If you do not know which naming contexts may contain lingering objects, you can select the “[Scan All NCs]” option and the tool will scan each Naming Context that it was able to discover during the AD Topology phase.

E) “Reference DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The reference DC is the “known-good” DC against which you will compare other domain controllers for lingering objects. If a domain controller contains AD objects that do not exist on the Reference DC, they will be considered lingering objects. If you select the “[Scan Entire Forest]” option, then the tool will (arbitrarily) select one global catalog from each domain in the forest that is known to be reachable. It is recommended that you wisely choose a known-good DC yourself, because the tool doesn’t necessarily know “the best” reference DC to pick. It will pick one at random.

F) “Target DC” drop-down menu. Note that this drop-down menu is only available after AD Topology has been successfully discovered. The Target DC is the domain controller that is suspected of containing lingering objects. The Target DC will be compared against the Reference DC, and each object that exists on the Target DC but not on the Reference DC is considered a lingering object. If you aren’t sure which DC(s) contain lingering objects, or just want to scan all domain controllers, select the “[Target All DCs]” option from the drop-down menu.

G) “Detect Lingering Objects” button. Note that this button is only available after AD Topology has been successfully discovered. After you have made the appropriate selections in the three aforementioned drop-down menus, click the Detect Lingering Objects button to run the scan. Clicking this button only runs a scan; it does not delete anything. The tool will automatically detect and avoid certain nonsensical situations, such as the user specifying the same Reference and Target DCs, or selecting a Read-Only Domain Controller (RODC) as a Reference DC.

H) “Select All” button. Note that this button does not become available until after lingering objects have been detected. Clicking it merely selects all rows from the table below.

I) “Remove Selected Lingering Objects” button. This button will attempt to delete all lingering objects that have been detected by the detection process. You can select a range of items from the list using the shift key and the arrow keys. You can select and unselect specific items by holding down the control key and clicking on them. If you want to just select all items, click the “Select All” button.

J) “Removal Method” radio buttons. These are mutually exclusive. You can choose which of the two supported methods you want to use to remove the lingering objects that have been detected. The “removeLingeringObject” method refers to the rootDSE modify operation, which can be used to “spot-remove” individual lingering objects. In contrast, the DsReplicaVerifyObjects method will remove all lingering objects all at once. This intention is reflected in the GUI by all lingering objects automatically being selected when the DsReplicaVerifyObjects method is chosen.

K) “Import” button. This imports a previously-exported list of lingering objects.

L) “Export” button. This exports the selected lingering objects to a file.

M) The “Target DC Column.” This column tells you which Target DC was seen to have contained the lingering object.

N) The “Reference DC Column.” This column tells you which Reference DC was used to determine that the object in question was lingering.

O) The “Object DN Column.” This column contains the distinguished name of the lingering object.

P) The “Naming Context Column.” This column contains the Naming Context that the lingering object resides in.

Q) The “Lingering Object ListView”. This “ListView” works similarly to a spreadsheet. It will display all lingering objects that were detected. You can think of each row as a lingering object. You can click on the column headers to sort the rows in ascending or descending order, and you can resize the columns to fit your needs. NOTE: If you right-click on the lingering object listview, the selected lingering objects (if any) will be copied to your clipboard.

R) The “Status” box. The status box contains diagnostics and operational messages from the Lingering Object Liquidator tool. Everything that is logged to the status box in the GUI is also mirrored to a text log file.

User-configurable Settings

 

The user-configurable settings in Lingering Object Liquidator are alluded to in the Status box when the application first starts.

Key: HKLM\SOFTWARE\LingeringObjectLiquidator
Value: DetectionTimeoutPerDCSeconds
Type: DWORD
Default: 300 seconds

This setting affects the “Detect Lingering Objects” scan. Lingering Object Liquidator establishes event log “subscriptions” to each Target DC that it needs to scan. The tool then waits for the DC to log an event (Event ID 1942 in the Directory Service event log) signaling that lingering object detection has completed for a specific naming context. Only once a certain number of those events (depending on your choices in the “Naming Contexts” drop-down menu,) have been received from the remote domain controller, does the tool know that particular domain controller has been fully scanned. However, there is an overall timeout, and if the tool does not receive the requisite number of Event ID 1942s in the allotted time, the tool “gives up” and proceeds to the next domain controller.

Key: HKLM\SOFTWARE\LingeringObjectLiquidator
Value: ThreadCount
Type: DWORD
Default: 8 threads

This setting sets the maximum number of threads to use during the “Detect Lingering Objects” scan. Using more threads may decrease the overall time it takes to complete a scan, especially in very large environments.
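For example, to raise both values from an elevated PowerShell prompt (the numbers below are just illustrative, and the key does not exist until you create it):

# Create the key and the two user-configurable values documented above (example values)
$key = 'HKLM:\SOFTWARE\LingeringObjectLiquidator'
New-Item -Path $key -Force | Out-Null
New-ItemProperty -Path $key -Name 'DetectionTimeoutPerDCSeconds' -PropertyType DWord -Value 600 -Force
New-ItemProperty -Path $key -Name 'ThreadCount' -PropertyType DWord -Value 16 -Force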

Tips

The domain controllers must allow the network connectivity required for remote event log management for Lingering Object Liquidator to work. You can enable the required Windows Firewall rules using the following line of PowerShell:

Get-NetFirewallRule -DisplayName "Remote Event Log*" | Enable-NetFirewallRule

Check the download site often for new versions! (There’s also a handy “check for updates” option in the tool.)

 

Final Words

We provide this tool because we at AskDS want your Active Directory lingering object removal experience to go as smoothly as possible. If you find any bugs or have any feature requests, please drop a note to our public contact alias.

Download: http://aka.ms/msftlol

 

Release Notes

v2.0.19:
– Initial release to the public.

v2.0.21:
– Added new radio buttons that allow the user more control over which lingering object removal method they want to use – the DsReplicaVerifyObjects method or removeLingeringObject method.
– Fixed issue with Export button not displaying the full path of the export file.
– Fixed crash when unexpected or corrupted data is returned from event log subscription.

ESE Deep Dive: Part 1: The Anatomy of an ESE database


hi!

Get your crash helmets on and strap into your seatbelts for a JET engine / ESE database special…

This is Linda Taylor, Senior AD Escalation Engineer from the UK here again. And WAIT…… I also somehow managed to persuade Brett Shirley to join me in this post. Brett is a Principal Software Engineer in the ESE Development team so you can be sure the information in this post is going to be deep and confusing but really interesting and useful and the kind you cannot find anywhere else :- )
BTW, Brett used to write blogs before he grew up and got very busy. And just for fun, you might find this old  “Brett” classic entertaining. I have never forgotten it. :- )
Back to today’s post…this will be a rather more grown up post, although we will talk about DITs but in a very scientific fashion.

In this post, we will start from the ground up and dive deep into the overall file format of an ESE database, including practical skills with esentutl such as how to look at raw database pages. And as the title suggests, this is Part 1, so there will be more!

What is an ESE database?

Let’s start basic. The Extensible Storage Engine (ESE), also known as JET Blue, is a database engine from Microsoft that does not speak SQL. And Brett also says… for those with a historical bent, or from academia, who remember ‘before SQL’ rather than ‘NoSQL’, ESE is modelled after the ISAMs (indexed sequential access methods) that were in vogue in the mid-70s. ;-p
If you work with Active Directory (which you must do if you are reading this post 🙂 then you will (I hope!) know that it uses an ESE database. The respective binary is esent.dll (or, because Brett loves Exchange, ese.dll for the Exchange Server install). Applications like Active Directory are all ESE clients and use the JET APIs to access the ESE database.

1

This post will dive deep into the Blue parts above. The ESE side of things. AD is one huge client of ESE, but there are many other Windows components which use an ESE database (and non-Microsoft software too), so your knowledge in this area is actually very applicable for those other areas. Some examples are below:

2

Tools

There are several built-in command line tools for looking into an ESE database and related files. 

  1. esentutl. This is a tool that ships in Windows Server by default for use with Active Directory, Certificate Authority and any other built-in ESE databases. This is what we will be using in this post, and it can be used to look at any ESE database.

  2. eseutil. This is the Exchange version of the same tool and is typically installed in the Microsoft\Exchange\V15\Bin sub-directory of the Program Files directory.

  3. ntdsutil. This is a tool specifically for managing AD or AD LDS databases and cannot be used with generic ESE databases (such as the one produced by the Certificate Authority service). It is installed by default when you add the AD DS or AD LDS role.

For read operations such as dumping file or log headers it doesn’t matter which tool you use. But for operations which write to the database you MUST use the matching tool for the application and version (for instance, it is not safe to run esentutl /r from Windows Server 2016 on a Windows Server 2008 DB). Further, throughout this article, if you are looking at an Exchange database instead, you should use eseutil.exe instead of esentutl.exe. For AD and AD LDS, always use ntdsutil or esentutl; they have different capabilities, so I use a mixture of both. And Brett says that if you think you can NOT keep the read operations straight from the write operations, play it safe and match the versions and application.

During this post, we will use an AD database as our victim example. We may use other ones, like ADLDS for variety in later posts.

Database logical format – Tables

Let’s start with the logical format. From a logical perspective, an ESE database is a set of tables which have rows and columns and indices.

Below is a visual of the list of tables from an AD database in Windows Server 2016. Different ESE databases will have different table names and use those tables in their own ways.

3

In this post, we won’t go into detail about DNTs, PDNTs and how to analyze an AD database dump taken with LDP, because that is AD specific and here we are looking at the ESE level. Also, there are other blogs and sources where this has already been explained, for example here on AskPFEPlat. However, if such a post is wanted, tell me and I will endeavor to write one!!

It is also worth noting that all ESE databases have a table called MSysObjects and a table called MSysObjectsShadow, which is a backup of MSysObjects. These are also known as “the catalog” of the database, and they store metadata about the client’s schema of the database – i.e.

  1. All the tables and their table names and where their associated B+ trees start in the database and other miscellaneous metadata.

  2. All the columns for each table and their names (of course), the type of data stored in them, and various schema constraints.

  3. All the indexes on the tables and their names, and where their associated B+ trees start in the database.

This is the boot-strap information for ESE to be able to service client requests for opening tables to eventually retrieve rows of data.

Database physical format

From a physical perspective, an ESE database is just a file on disk. It is a collection of fixed size pages arranged into B+ tree structures. Every database has its page size stamped in the header (and it can vary between different clients, AD uses 8 KB). At a high level it looks like this:

4

The first “page” is the Header (H).

The second “page” is a Shadow Header (SH) which is a copy of the header.

However, in ESE “page number” (also frequently abbreviated “pgno”) has a very specific meaning (and often shows up in ESE events) and the first NUMBERED page of the actual database is page number / pgno 1 but is actually the third “page” (if you are counting from the beginning :-).

From here on out though, we will not consider the header and shadow header proper pages, and page number 1 will be the third page, at byte offset = <page size> * 2 = 8192 * 2 (for AD databases).
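A quick bit of arithmetic makes the layout obvious (the page size here assumes the 8 KB AD / AD LDS page):

# Byte offset in the file where a given pgno starts; the header and shadow header occupy
# the first two <page size> chunks, so pgno 1 is the third "page" in the file.
$pageSize = 8192
$pgno     = 1
($pgno + 1) * $pageSize    # 16384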

If you don’t know the page size, you can dump the database header with esentutl /mh.

Here is a dump of the header for an NTDS.DIT file – the AD database:

1

The page size is the cbDbPage value. AD and AD LDS use a page size of 8 KB. Other databases use different page sizes.

A caveat is that to be able to do this, the database must not be in use. So, you’d have to stop the NTDS service on the DC or run esentutl on an offline copy of the database.

But the good news is that in WS2016 and above we can now dump a LIVE DB header with the /vss switch! The command you need would be “esentutl /mh ntds.dit /vss” (note: must be run as administrator).
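For example (the path below assumes a default AD install; run from an elevated prompt):

# Offline copy (NTDS service stopped, or a copied .dit file):
esentutl /mh C:\Windows\NTDS\ntds.dit

# Live database on Windows Server 2016 and later:
esentutl /mh C:\Windows\NTDS\ntds.dit /vss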

All these numbered database pages logically are “owned” by various B+ trees where the actual data for the client is contained … and all these B+ trees have a “type of tree” and all of a tree’s pages have a “placement in the tree” flag (Root, or Leaf or implicitly Internal – if not root or leaf).

Ok, Brett, that was “proper” tree and page talk –  I think we need some pictures to show them…

Logically the ownership / containing relationship looks like this:

5

More about B+ Trees

The pages are in turn arranged into B+ trees, where the top page is known as the ‘Root’ page and the bottom pages are ‘Leaf’ pages, where all the data is kept.  Something like this (note this particular example does not show ‘Internal’ B+ tree pages):

6

  • The upper / parent page has partial keys indicating that all entries with 4245 + A* can be found in pgno 13, and all entries with 4245 + E* can be found in pgno 14, etc.

  • Note this is a highly simplified representation of what ESE does … it’s a bit more complicated.

  • This is not specific to ESE; many database engines have either B trees or B+ trees as a fundamental arrangement of data in their database files.

The Different trees

You should know that there are different types of B+ trees inside the ESE database that are needed for different purposes. These are:

  1. Data / Primary Trees – hold the table’s primary records which are used to store data for regular (and small) column data.

  2. Long Value (LV) Trees – used to store long values. In other words, large chunks of data which don’t fit into the primary record.

  3. Index trees – these are B+Trees used to store indexes.

  4. Space Trees – these are used to track what pages are owned and free / available as new pages for a given B+ tree.  Each of the previous three types of B+ Tree (Data, LV, and index), may (if the tree is large) have a set of two space trees associated with them.

Storing large records

Each row of a table is limited to 8 KB (or whatever the page size is) in Active Directory and AD LDS – that is, each record has to fit into a single 8 KB database page. But you are probably aware that you can fit a LOT more than 8 KB into an AD object or an Exchange e-mail! So how do we store large records?

Well, we have different types of columns as illustrated below:

7

Tagged columns can be split out into what we call the Long Value Tree. So in the tagged column we store a simple 4 byte number that’s called a LID (Long Value ID) which then points to an entry in the LV tree. So we take the large piece of data, break it up into small chunks and prefix those with the key for the LID and the offset.

So, if every part of the record was a LID / pointer to an LV, then essentially we can fit 1300 LV pointers onto the 8 KB page. By the way, this is what creates the 1300-attribute limit in AD. It’s all down to the ESE page size.

Now you can also start to see that when you are looking at a whole AD object you may read pages from various trees to get all the information about your object. For example, for a user with many attributes and group memberships you may have to get data from a page in the ”datatable” \ Primary tree + “datatable” \ LV tree + sd_table \ Primary tree + link_table \ Primary tree.

Index Trees

An index is used for a couple of purposes: firstly, to list the records in an intelligent order, such as by surname in alphabetical order; and secondly, to cut down the number of records, which sometimes greatly helps speed up searches (especially when the ‘selectivity is high’ – meaning few entries match).

Below is a visual illustration (with the B+ trees turned on their side to make the diagram easier) of a primary index – the DNT index in the AD database, i.e. the Data Tree – and a secondary index on dNSHostName. You can see that the secondary index only contains the records which have a dNSHostName populated. It is smaller.

8

You can also see that in the secondary index, the primary key is the data portion (the name) and then the data is the actual Key that links us back to the REAL record itself.

Inside a Database page

Each database page has a fixed header. And the header has a checksum as well as other information like how much free space is on that page and which B-tree it belongs to.

Then we have these things called TAGS (or nodes), which store the data.

A node can be many things, such as a record in a database table or an entry in an index.

The TAGS are actually out of order on the page, but order is established by the tag array at the end.

  • TAG 0 = Page External Header

This contains variable sized special information on the page, depending upon the type of B-tree and type of page in B tree (space vs. regular tree, and root vs. leaf).

  • TAG 1,2,3, etc are all “nodes” or lines, and the order is tracked.

The key & data is specific to the B Tree type.

And TAG 1 is actually node 0!!! So here is a visual picture of what an ESE database page looks like:

9

It is possible to calculate this key if you have an object’s primary key. In AD this is a DNT.

The formula for that (if you are ever crazy enough to need it) would be:

  • Start with 0x7F, and if it is a signed INT append a 0x80000000 and then OR in the number

  • For example 4248 –> in hex 1098 –> as key 7F80001098 (note 5 bytes).

  • Note: Key buffer uses big endian, not little endian (like x86/amd64 arch).

  • If it was a 64-bit int, just insert zeros in the middle (9 byte key).

  • If it is an unsigned INT, start with 0x7F and just append the number.

  • Note: Long Value (LID) trees and ESE’s Space Trees (pgno) are special, no 0x7F (4 byte keys).

  • And finally other non-integers column types, such as String and Binary types, have a different more complicated formatting for keys.

Why is this useful? Because, for example, you can take the DNT of an object, calculate its key, and then seek to its page using esentutl.exe’s dump page (/m) functionality with the /k option.
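As a quick illustration of the signed-integer recipe above, here is a tiny (hypothetical) PowerShell helper that reproduces the 4248 –> 7F80001098 example:

# Sketch only: builds the ESE key string for a positive signed 32-bit primary key (e.g. an AD DNT)
function Get-EseKeyFromDnt {
    param([int]$Dnt)
    # prefix 0x7F, then OR the value into 0x80000000 and write it out big-endian as 8 hex digits
    '7F' + ('{0:X8}' -f (0x80000000 -bor $Dnt))
}

Get-EseKeyFromDnt 4248    # 7F80001098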

The Nodes also look different (containing different data) depending on the ESE B+tree type. Below is an illustration of the different nodes in a Space tree, a Data Tree, a LV tree and an Index tree.

10

The green are the keys. The dark blue is data.

What does a REAL page look like?

You can use esentutl to dump pages of the database if you are investigating some corruption for example.

Before we can dump a page, we want to find a page of interest (picking a random page could give you just a blank page)… so first we need some info about the table schema. To start, you can dump all the tables and their associated root page numbers like this:

[Screenshot: esentutl /mm output filtered with findstr, listing each table with its objidFDP and pgnoFDP]

Note: we have filtered the output with findstr again to get a nice view of just the tables and their pgnoFDP and objidFDP. Findstr.exe is case sensitive, so use the exact case or add the /i switch.

objidFDP identifies this table in the catalog metadata. When looking at a database page we can use its objidFDP to tell which table this page belongs to.

pgnoFDP is the page number of the Father Data Page – the very top page of that B+ tree, also known as the root page.  If you run esentutl /mm <dbname> on its own you will see a huge list of every table and B-tree (except internal “space” trees) including all the indexes.

So, in this example page 31 is the root page of the datatable here.

Dumping a page

You can dump a page with esentutl using /m and /p. Below is an example of dumping page 31 from the database – the root page of the “datatable” table as above.

[Screenshots: esentutl dump of page 31 – the root page of the datatable – showing the page header and its nodes]

The objidFDP is the number indicating which B-tree the page belongs to. And the cbFree tells us how much of this page is free. (cb = count of bytes). Each database page has a double header checksum – one ECC (Error Correcting Code) checksum for single bit data correction, and a higher fidelity XOR checksum to catch all other errors, including 3 or more bit errors that the ECC may not catch.  In addition, we compute a logged data checksum from the page data, but this is not stored in the header, and only utilized by the Exchange 2016 Database Divergence Detection feature.
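As a quick aside, here is a toy Python illustration of the idea behind an XOR-style checksum – this is not ESE's actual algorithm, just a demonstration that flipping any single bit in a page changes the XOR of its words:

```python
def toy_xor_checksum(page: bytes, word_size: int = 4) -> int:
    """Toy illustration only - XOR the page together word by word.
    Any flipped bit anywhere in the page changes the result."""
    result = 0
    for i in range(0, len(page) - len(page) % word_size, word_size):
        result ^= int.from_bytes(page[i:i + word_size], "little")
    return result

page = bytearray(8192)                 # an empty 8KB page
print(hex(toy_xor_checksum(page)))     # 0x0
page[100] ^= 0x01                      # flip one bit
print(hex(toy_xor_checksum(page)))     # no longer 0x0
```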

You can see this is a root page and it has 3 nodes (4 TAGS – remember, TAG 1 is node 0, also known as line 0! 🙂) and it is nearly empty! (cbFree = 8092 bytes, so only 100 bytes used for these 3 nodes + page header + external header).

The objidFDP tells us which B-Tree this page belongs to.

And notice the PageFlushType, which is related to the JET Flush Map file we could talk about in another post later.

The nodes here point to pages lower down in the tree. We could dump a next-level page (pgno 1438) and see the pages getting deeper and more spread out, with more nodes.

[Screenshots: esentutl dump of the next-level page (pgno 1438) showing its header and nodes]

So you can see this page has 294 nodes, which again all point to other pages! It is also a ParentOfLeaf, meaning these pgno / page numbers actually point to leaf pages (with the final data on them).

Are you bored yet? 😃

Or are you enjoying this like a geek? Either way, we are nearly done with the page internals and the tree climbing here.

If you navigate further down, you will eventually reach a page with some data on it. For example, let's dump page 69, which TAG 6 is pointing to:

[Screenshots: esentutl dump of page 69, a leaf page containing actual data]

So this one has some data on it (as indicated by the “Leaf page” indicator under the fFlags). 

Finally, you can also dump the data – the contents of a node (i.e. TAG) – with the /n switch, like this:

[Screenshot: esentutl /n node dump showing the raw column data for one record]

Remember: the /n option takes a pgno:node (line) specifier, which means that the :3 here dumped TAG 4 from the previous screen. And note that trying to dump “/n69:4” would actually fail.

This /n will dump all the raw data on the page along with the information of columns and their contents and types. The output also needs some translation because it gives us the columnID (711 in the above example) and not the attribute name in AD (or whatever your database may be). The application developer would then be able to translate those column IDs to some meaningful information. For AD and ADLDS, we can translate those to attribute names using the source code.

Finally, there really should be no need to do this in real life, other than in a situation where you are debugging a database problem. However, we hope this provided a good and ‘realistic’ demo to help understand and visualize the structure of an ESE database and how the data is stored inside it!

Stay tuned for more parts …. which Brett says will be significantly more useful to everyday administrators! 😉

The End!

Linda & Brett

TLS Handshake errors and connection timeouts? Maybe it’s the CTL engine….


Hi There! 

Marius and Tolu from the Directory Services Escalation Team. 

Today, we’re going to talk about a little twist on some scenarios you may have come across at some point, where TLS connections fail or time out for a variety of reasons.

You’re probably already familiar with some of the usual suspects like cipher suite mismatches, certificate validation errors and TLS version incompatibility, to name a few.   

Here are just some examples for illustration (but there is a wealth of information out there)  

Recently we’ve seen a number of cases with a variety of symptoms affecting different customers which all turned out to have a common root cause.  

We’ve managed to narrow it down to an unlikely source: a built-in OS feature working in its default configuration.

We’re talking about the automatic root update and automatic disallowed roots update mechanisms based on CTLs.

Starting with Windows Vista, root certificates are updated on Windows automatically.

When a user on a Windows client visits a secure Web site (by using HTTPS/TLS), reads a secure email (S/MIME), or downloads an ActiveX control that is signed (code signing) and encounters a certificate which chains to a root certificate not present in the root store, Windows will automatically check the appropriate Microsoft Update location for the root certificate.  

If it finds it, it downloads it to the system. To the user, the experience is seamless; they don’t see any security dialog boxes or warnings and the download occurs automatically, behind the scenes.  

Additional information in:   

How Root Certificate Distribution Works 

During TLS handshakes, any certificate chains involved in the connection will need to be validated, and, from Windows Vista/2008 onwards, the automatic disallowed root update mechanism is also invoked to verify if there are any changes to the untrusted CTL (Certificate Trust List).

A certificate trust list (CTL) is a predefined list of items that are authenticated and signed by a trusted entity.

The mechanism is described in more detail in the following article:

An automatic updater of untrusted certificates is available for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2

It expands on the automatic root update mechanism technology (for trusted root certificates) mentioned earlier to let certificates that are compromised or are untrusted in some way be specifically flagged as untrusted.

Customers therefore benefit from periodic automatic updates to both trusted and untrusted CTLs.

So, after the preamble, what scenarios are we talking about today?

Here are some examples of issues we’ve come across recently.

 

1)

  • Your users may experience browser errors after several seconds when trying to browse to secure (https) websites behind a load balancer.
  • They might receive an error like “The page cannot be displayed. Turn on TLS 1.0, TLS 1.1, and TLS 1.2 in the Advanced settings and try connecting to https://contoso.com again. If this error persists, contact your site administrator.”
  • If they try to connect to the website via the IP address of the server hosting the site, the https connection works after showing a certificate name mismatch error.
  • All TLS versions ARE enabled when checking in the browser settings:

Internet Options


 

2)

  • You have a 3rd party appliance making TLS connections to a Domain Controller via LDAPs (secure LDAP over SSL) which may experience delays of up to 15 seconds during the TLS handshake.
  • The issue occurs randomly when connecting to any eligible DC in the environment targeted for authentication.
  • There are no intervening devices that filter or modify traffic between the appliance and the DCs.

2a)

  • A very similar scenario* to the above is in fact described in the following article by our esteemed colleague, Herbert:

Understanding ATQ performance counters, yet another twist in the world of TLAs 

Where he details:

Scenario 2 

DC supports LDAP over SSL/TLS 

A user sends a certificate on a session. The server needs to check for certificate revocation, which may take some time.*

This becomes problematic if network communication is restricted and the DC cannot reach the Certificate Distribution Point (CDP) for a certificate.

To determine if your clients are using secure LDAP (LDAPs), check the counter “LDAP New SSL Connections/sec”. 

If there are a significant number of sessions, you might want to look at CAPI-Logging.

 

3)

  • A 3rd party meeting server performing LDAPs queries against a Domain Controller may fail the TLS handshake on the first attempt after surpassing a pre-configured timeout (e.g. 5 seconds) on the application side.
  • Subsequent connection attempts are successful.

 

So, what’s the story? Are these issues related in any way?

Well, as it turns out, they do have something in common.

As we mentioned earlier, certificate chain validation occurs during TLS handshakes.

Again, there is plenty of documentation on this subject, such as

During certificate validation operations, the CTL engine gets periodically invoked to verify if there are any changes to the untrusted CTLs.

In the example scenarios we described earlier, if the default public URLs for the CTLs are unreachable, and there is no alternative internal CTL distribution point configured (more on this in a minute), the TLS handshake will be delayed until the WinHttp call to access the default CTL URL times out.

By default, this timeout is usually around 15 seconds, which can cause problems when load balancers or 3rd party applications are involved and have their own (more aggressive) timeouts configured.
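If you want to see the delay from the client’s side, here is a rough Python sketch that times a TLS handshake (the host name and port are placeholders – for example a DC on 636 for LDAPs, or a site on 443). A consistent delay of roughly 15 seconds is a strong hint that the server is stuck waiting on an unreachable CTL URL while building the certificate chain:

```python
import socket
import ssl
import time

# Rough sketch: time a TLS handshake. Host and port are placeholders
# (e.g. a DC on 636 for LDAPs, or a web site on 443). Certificate
# verification is disabled because we only care about timing here.
host, port = "dc01.contoso.com", 636
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

start = time.monotonic()
with socket.create_connection((host, port), timeout=30) as sock:
    with ctx.wrap_socket(sock, server_hostname=host):
        pass                                  # handshake happens during wrap
print(f"TLS handshake completed in {time.monotonic() - start:.1f} seconds")
```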

If we enable CAPI2 Diagnostic logging, we should be able to see evidence of when and why the timeouts are occurring.

We will see events like the following: 

Event ID 20 – Retrieve Third-Party Root Certificate from Network:

  

  • Trusted CTL attempt

Trusted CTL Attempt


 

  • Disallowed CTL attempt

 

Disallowed CTL Attempt


 

 

Event ID 53 error message details showing that we have failed to access the disallowed CTL:

 

Event ID 53


 

 The following article gives a more detailed overview of the CAPI2 diagnostics feature available on Windows systems, which is very useful when looking at any certificate validation operations occurring on the system:  

Troubleshooting PKI Problems on Windows Vista 

To help us confirm that the CTL updater engine is indeed affecting the TLS delays and timeouts we’ve described, we can temporarily disable it for both the trusted and untrusted CTLs and then attempt our TLS connections again.

 

To disable it:

  • Create a backup of this registry key (export and save a copy)

HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot

  • Then create the following DWORD registry values under the key

“EnableDisallowedCertAutoUpdate”=dword:00000000

“DisableRootAutoUpdate”=dword:00000001
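If you prefer to script this temporary change, below is a minimal Python sketch using the standard winreg module that sets exactly the two values above (run it elevated, and remember to back up and later restore the key as described):

```python
import winreg

# Temporarily disable the automatic trusted-root and disallowed-CTL updates.
# Back up HKLM\SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot first,
# and revert these values once testing is complete.
KEY = r"SOFTWARE\Policies\Microsoft\SystemCertificates\AuthRoot"
with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY, 0, winreg.KEY_SET_VALUE) as k:
    winreg.SetValueEx(k, "EnableDisallowedCertAutoUpdate", 0, winreg.REG_DWORD, 0)
    winreg.SetValueEx(k, "DisableRootAutoUpdate", 0, winreg.REG_DWORD, 1)
```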

 

After applying these steps, you should find that your previously failing TLS connections will no longer timeout. Your symptoms may vary slightly, but you should see speedier connection times, because we have eliminated the delay in trying and failing to reach the CTL URLs. 

So, what now? 

We should now REVERT the above registry changes by restoring the backup we created, and evaluate the following, more permanent solutions.

We previously stated that disabling the updater engine should only be a temporary measure to confirm the root cause of the timeouts in the above scenarios.  

 

  • For the untrusted CTL:             

 

  • For the trusted CTL:                 
  • For server systems you might consider deploying the trusted 3rd party CA certificates via GPO on an as-needed basis

Manage Trusted Root Certificates   

(particularly to avoid hitting the TLS protocol limitation described here:

SSL/TLS communication problems after you install KB 931125 )

 

  • For client systems, you should consider

Allowing access to the public allowed Microsoft CTL URL http://ctldl.windowsupdate.com/msdownload/update/v3/static/trustedr/en/authrootstl.cab (a quick reachability check is sketched after this list)

OR

Defining and maintaining an internal trusted CTL distribution point as outlined in Configure Trusted Roots and Disallowed Certificates 

OR

If you require a more granular control of which CAs are trusted by client machines, you can deploy the 3rd Party CA certificates as needed via GPO

Manage Trusted Root Certificates 
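As a quick client-side test for the first option (allowing access to the public CTL URL), a short Python sketch like this – the URL is the one quoted above, and the timeout value is arbitrary – shows whether the machine can actually download the trusted CTL cab:

```python
import urllib.request

# Can this machine reach the public trusted-CTL download URL?
url = ("http://ctldl.windowsupdate.com/msdownload/update/v3/static/"
       "trustedr/en/authrootstl.cab")
with urllib.request.urlopen(url, timeout=15) as resp:
    print(resp.status, resp.headers.get("Content-Length"))
```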

So there you have it. We hope you found this interesting, and now have an additional factor to take into account when troubleshooting TLS/SSL communication failures.

Deep Dive: Active Directory ESE Version Store Changes in Server 2019


Hey everybody. Ryan Ries here to help you fellow AD ninjas celebrate the launch of Server 2019.

Warning: As is my wont, this is a deep dive post. Make sure you’ve had your coffee before proceeding.

Last month at Microsoft Ignite, many exciting new features rolling out in Server 2019 were talked about. (Watch MS Ignite sessions here and here.)

But now I want to talk about an enhancement to on-premises Active Directory in Server 2019 that you won’t read or hear anywhere else. This specific topic is near and dear to my heart personally.

The intent of the first section of this article is to discuss how Active Directory’s sizing of the ESE version store has changed in Server 2019 going forward. The second section of this article will discuss some basic debugging techniques related to the ESE version store.

Active Directory, also known as NT Directory Services (NTDS), uses Extensible Storage Engine (ESE) technology as its underlying database.

One component of all ESE database instances is known as the version store. The version store is an in-memory temporary storage location where ESE stores snapshots of the database during open transactions. This allows the database to roll back transactions and return to a previous state in case the transactions cannot be committed. When the version store is full, no more database transactions can be committed, which effectively brings NTDS to a halt.

In 2016, the CSS Directory Services support team blog (also known as AskDS) published some previously undocumented (and some lightly-documented) internals regarding the ESE version store. Those new to the concept of the ESE version store should read that blog post first.

In the blog post linked to previously, it was demonstrated how Active Directory had calculated the size of the ESE version store since AD’s introduction in Windows 2000. When the NTDS service first started, a complex algorithm was used to calculate version store size. This algorithm included the machine’s native pointer size, number of CPUs, version store page size (based on an assumption which was incorrect on 64-bit operating systems), maximum number of simultaneous RPC calls allowed, maximum number of ESE sessions allowed per thread, and more.

Since the version store is a memory resource, it follows that the most important factor in determining the optimal ESE version store size is the amount of physical memory in the machine, and that – ironically – seems to have been the only variable not considered in the equation!

The way that Active Directory calculated the version store size did not age well. The original algorithm was written during a time when all machines running Windows were 32-bit, and even high-end server machines had maybe one or two gigabytes of RAM.

As a result, many customers have contacted Microsoft Support over the years for issues arising on their domain controllers that could be attributed to or at least exacerbated by an undersized ESE version store. Furthermore, even though the default ESE version store size can be augmented by the “EDB max ver pages (increment over the minimum)” registry setting, customers are often hesitant to use the setting because it is a complex topic that warrants heavier and more generous amounts of documentation than what has traditionally been available.

The algorithm is now greatly simplified in Server 2019:

When NTDS first starts, the ESE version store size is now calculated as 10% of physical RAM, with a minimum of 400MB and a maximum of 4GB.

The same calculation applies to physical machines and virtual machines. In the case of virtual machines with dynamic memory, the calculation will be based on the amount of “starting RAM” assigned to the VM.

The “EDB max ver pages (increment over the minimum)” registry setting can still be used, as before, to add additional buckets over the default calculation (even beyond 4GB if desired). The registry setting is in terms of “buckets,” not bytes. Version store buckets are 32KB each on 64-bit systems. (They are 16KB on 32-bit systems, but Microsoft no longer supports any 32-bit server OSes.) Therefore, if one adds 5000 buckets by setting the registry entry to 5000 (decimal), then 156MB will be added to the default version store size.

A minimum of 400MB was chosen for backwards compatibility, because with the old algorithm the default version store size for a DC with a single 64-bit CPU was ~410MB, regardless of how much memory it had. (There is no way to configure less than the minimum of 400MB, similar to previous Windows versions.) The advantage of the new algorithm is that the version store size now scales linearly with the amount of memory the domain controller has, when previously it did not.
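To make the new sizing rule concrete, here is a small Python sketch (the function name and the GB-based units are ours, purely for illustration) that reproduces the defaults in the table below and shows how extra registry buckets add to them:

```python
def server2019_version_store_gb(physical_ram_gb: float, extra_buckets: int = 0) -> float:
    """Sketch of the Server 2019 default described above: 10% of physical RAM,
    clamped to the 400MB..4GB range, plus any 32KB buckets added through the
    'EDB max ver pages (increment over the minimum)' registry value.
    The function name and GB units are illustrative, not from the OS."""
    base_gb = min(max(physical_ram_gb / 10, 0.4), 4.0)
    return base_gb + (extra_buckets * 32 * 1024) / (1024 ** 3)

print(server2019_version_store_gb(8))                   # 0.8  -> 800MB, as in the table
print(server2019_version_store_gb(48))                  # 4.0  -> capped at 4GB
print(round(server2019_version_store_gb(4, 5000), 3))   # 0.553 -> 400MB + ~156MB of buckets
```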

Defaults:

Physical Memory in the Domain Controller    Default ESE Version Store Size
1GB                                         400MB
2GB                                         400MB
3GB                                         400MB
4GB                                         400MB
5GB                                         500MB
6GB                                         600MB
8GB                                         800MB
12GB                                        1.2GB
24GB                                        2.4GB
48GB                                        4GB
128GB                                       4GB

 

This new calculation will result in larger default ESE version store sizes for domain controllers with greater than 4GB of physical memory when compared to the old algorithm. This means more version store space to process database transactions, and fewer cases of version store exhaustion. (Which means fewer customers needing to call us!)

Note: This enhancement currently only exists in Server 2019 and there are not yet any plans to backport it to older Windows versions.

Note: This enhancement applies only to Active Directory and not to any other application that uses an ESE database such as Exchange, etc.

 

ESE Version Store Advanced Debugging and Troubleshooting

 

This section will cover some basic ESE version store triage, debugging and troubleshooting techniques.

As covered in the AskDS blog post linked to previously, the performance counter used to see how many ESE version store buckets are currently in use is:

\\.\Database ==> Instances(lsass/NTDSA)\Version buckets allocated
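If you would rather watch that counter from a script than from Performance Monitor, a rough sketch follows – the counter path is the one above written in typeperf notation, so adjust the instance name if your system displays it differently:

```python
import subprocess

# Sample the version store usage every 15 seconds until interrupted.
counter = r"\Database(lsass/NTDSA)\Version buckets allocated"
subprocess.run(["typeperf", counter, "-si", "15"])
```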

 

Once that counter has reached its limit (~12,000 buckets or ~400MB by default), events will be logged to the Directory Services event log, indicating the exhaustion:

Figure 1: NTDS version store exhaustion.


 

The event can also be viewed graphically in Performance Monitor:

Figure 2: The plateau at 12,314 means that the performance counter "Version Buckets Allocated" cannot go any higher. The flat line represents a dead patient.


 

As long as the domain controller still has available RAM, try increasing the version store size using the previously mentioned registry setting. Increase it in gradual increments until the domain controller is no longer exhausting the ESE version store, or the server has no more free RAM, whichever comes first. Keep in mind that the more memory that is used for version store, the less memory will be available for other resources such as the database cache, so a sensible balance must be struck to maintain optimal performance for your workload. (i.e. no one size fits all.)

If the “Version Buckets Allocated” performance counter is still pegged at the maximum amount, then there is some further investigation that can be done using the debugger.

The eventual goal will be to determine the nature of the activity within NTDS that is primarily responsible for exhausting the domain controller of all its version store, but first, some setup is required.

First, generate a process memory dump of lsass on the domain controller while the machine is “in state” – that is, while the domain controller is at or near version store exhaustion. To do this, the “Create dump file” option can be used in Task Manager by right-clicking on the lsass process on the Details tab. Optionally, another tool such as Sysinternals’ procdump.exe can be used (with the -ma switch).

In case the issue is transient and only occurs when no one is watching, data collection can be configured on a trigger, using procdump with the -p switch.

Note: Do not share lsass memory dump files with unauthorized persons, as these memory dumps can contain passwords and other sensitive data.

 

It is a good idea to generate the dump after the Version Buckets Allocated performance counter has risen to an abnormally elevated level but before the version store has plateaued completely. The reason is that the database transaction responsible may be terminated once the exhaustion occurs, and the thread would then no longer be present in the memory dump. If the guilty thread is no longer alive once the memory dump is taken, troubleshooting will be much more difficult.

Next, gather a copy of %windir%\System32\esent.dll from the same Server 2019 domain controller. The esent.dll file contains a debugger extension, but it is highly dependent upon the correct Windows version, or else it could output incorrect results. It should match the same version of Windows as the memory dump file.

Next, download WinDbg from the Microsoft Store, or from this link.

Once WinDbg is installed, configure the symbol path for Microsoft’s public symbol server:

Figure 3: srv*c:\symbols*http://msdl.microsoft.com/download/symbol


 

Now load the lsass.dmp memory dump file, and load the esent.dll module that you had previously collected from the same domain controller:

Figure 4: .load esent.dll


 

Now the ESE database instances present in this memory dump can be viewed with the command !ese dumpinsts:

Figure 5: !ese dumpinsts - The only ESE instance present in an lsass dump on a DC should be NTDSA.


 

Notice that the current version bucket usage is 11,189 out of 12,802 buckets total. The version store in this memory dump is very nearly exhausted. The database is not in a particularly healthy state at this moment.

The command !ese param <instance> can also be used, specifying the same database instance gotten from the previous command, to see global configuration parameters for that ESE database instance. Notice that JET_paramMaxVerPages is set to 12800 buckets, which is 400MB worth of 32KB buckets:

Figure 6: !ese param <instance>


 

To see much more detail regarding the ESE version store, use the !ese verstore <instance> command, specifying the same database instance:

 

Figure 7: !ese verstore <instance>


 

The output of the command above shows us that there is an open, long-running database transaction, how long it’s been running, and which thread started it. This also matches the same information displayed in the Directory Services event log event pictured previously.

Neither the event log event nor the esent debugger extension were always quite so helpful; they have both been enhanced in recent versions of Windows.

In older versions of the esent debugger extension, the thread ID could be found in the dwTrxContext field of the PIB (command: !ese dump PIB 0x000001AD71621320), and the start time of the transaction could be found in m_trxidstack as a 64-bit file time. But now the debugger extension extracts that data automatically for convenience.

Switch to the thread that was identified earlier and look at its call stack:

Figure 8: The guilty-looking thread responsible for the long-running database transaction.


 

The four functions that are highlighted by a red rectangle in the picture above are interesting, and here’s why:

When an object is deleted on a domain controller, and that object has links to other objects, those links must also be deleted/cleaned by the domain controller. For example, when an Active Directory user becomes a member of a security group, a database link between the user and the group is created that represents that relationship. The same principle applies to all linked attributes in Active Directory. If the Active Directory Recycle Bin is enabled, then the link-cleaning process will be deferred until the deleted object surpasses its Deleted Object Lifetime – typically 60 or 180 days after being deleted. This is why, when the AD Recycle Bin is enabled, a deleted user can be easily restored with all of its group memberships still intact – because the user account object’s links are not cleaned until after its time in the Recycle Bin has expired.

The trouble begins when an object with many backlinks is deleted. Some security groups, distribution lists, RODC password replication policies, etc., may contain hundreds of thousands or even millions of members. Deleting such an object will give the domain controller a lot of work to do. As you can see in the thread call stack shown above, the domain controller had been busily processing links on a deleted object for 47 seconds and still wasn’t done. All the while, more and more ESE version store space was being consumed.

When the AD Recycle Bin is enabled, this can cause even more confusion, because no one remembers that they deleted that gigantic security group 6 months ago. A time bomb has been sitting in the AD Recycle Bin for months. But suddenly, AD replication grinds to a standstill throughout the domain and the admins are scrambling to figure out why.

The performance counter “\\.\DirectoryServices ==> Instances(NTDS)\Link Values Cleaned/sec” would also show increased activity during this time.

There are two main ways to fight this: increase the version store size with the “EDB max ver pages (increment over the minimum)” registry setting, decrease the batch size with the “Links process batch size” registry setting, or use a combination of both. Domain controllers process the deletion of these links in batches. The smaller the batch size, the shorter the individual database transactions will be, thus relieving pressure on the ESE version store.
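As a purely illustrative sketch (not NTDS code), the effect of the batch size on transaction length looks something like this – each batch is committed as its own transaction, so smaller batches hold version store resources for a shorter time:

```python
def clean_links_in_batches(link_count: int, batch_size: int) -> int:
    """Toy model: deleting an object's links in batches, where each batch is
    committed as its own database transaction (illustration only)."""
    transactions = 0
    remaining = link_count
    while remaining > 0:
        batch = min(batch_size, remaining)   # one transaction's worth of links
        # ... begin transaction, remove 'batch' link values, commit ...
        remaining -= batch
        transactions += 1
    return transactions

# A deleted group with 1,000,000 backlinks:
print(clean_links_in_batches(1_000_000, 10_000))   # 100 shorter transactions
print(clean_links_in_batches(1_000_000, 100_000))  # 10 longer transactions
```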

Though the default values are properly-sized for almost all Active Directory deployments and most administrators should never have to worry about them, the two previously-mentioned registry settings are supported and well-informed enterprise administrators are encouraged to tweak the values – within reason – to avoid ESE version store depletion. Contact Microsoft customer support before making any modifications if there is any uncertainty.

At this point, one could continue diving deeper, using various approaches (e.g. consider not only debugging process memory, but also consulting DS Object Access audit logs, object metadata from repadmin.exe, etc.) to find out which exact object with many thousands of links was just deleted, but in the end that’s a moot point. There’s nothing else that can be done with that information. The domain controller simply must complete the work of link processing.

In other situations, however, the same techniques shown previously will make it apparent that an incoming LDAP client performing inefficient queries is what’s exhausting the version store. Other times it will be DirSync clients; other times it may be something else. In those instances, there may be more you can do besides just tweaking the version store variables, such as tracking down and silencing the offending network client(s), optimizing LDAP queries, creating database indices, etc.

 

Thanks for reading,
– Ryan “Where’s My Ship-It Award” Ries


AskDS Is Moving!
