Active Directory Replication Status Tool has been re-released

Microsoft has just re-released the on-premises tool for monitoring AD replication, the “Active Directory Replication Status Tool” — it works again!

In case you are not aware of the tool itself and the situation that has been unfolding around it for the last couple of months, here’s a quick explanation:
This tool is a little GUI helper, useful to almost every AD admin — it quickly shows the replication status of an AD forest. It is close to “repadmin”, but easier to use because of the GUI. Microsoft decided to shut the tool down this February and force all its users to migrate to its new cloud monitoring solution, Microsoft Operations Management Suite. To achieve this, MS published an “updated” version of the tool whose only “enhancement” was a countdown timer that prevented the tool from running after February 1st.

The mistake MS made was not realizing that the two tools serve completely different purposes: you use OMS to continuously monitor your IT infrastructure, but that is not how you use the AD Replication Status Tool — you spin it up when you need to check AD replication right now, instantly: evaluating a new Active Directory forest, troubleshooting replication, etc. Also, Microsoft often doesn’t think about internet-disconnected environments — it is simply technically impossible to use Operations Management Suite in such infrastructures. And lastly, some organizations just don’t want to, or cannot by law, hand their sensitive data over to MS, especially via the Internet.

All these misunderstandings led to the creation of this thread on UserVoice, where many IT specialists expressed their anger and explained why OMS cannot replace the on-premises tool in many scenarios.

Finally, today, our voices have been heard — the on-premises tool is back and we can all lean back and relax, but…

They just won’t leave us so easily:

Anyway, I want to thank everyone on the Operational Insights team for understanding the needs of the IT community, and Ryan Ries for raising this issue and creating the original thread at UserVoice. And, of course, none of this would have been possible without you — thanks to all the sysadmins who raised their voices and voted, argued, and explained their needs to the OMS team.

How to allow users to join their computers into AD domain

Imagine that half of your company’s users are local administrators on their machines. They quite often reinstall their operating systems and ask the IT Service Desk to join the newly installed OSes to the corporate Active Directory domain. You usually have two options to help such users: either come to the user’s workstation and use your own credentials to join the PC to the domain, or recreate the workstation’s computer account while allowing the employee’s user account to join the computer to the domain by him- or herself. In the first case, a Service Desk employee has to walk to the user’s desk, which may be quite exhausting, especially for remote locations; in the second case, the computer’s group membership will probably be lost and the new account will have to be added to all the appropriate groups manually.
Is it possible to decrease the time and effort put into resolving such requests? Yes, absolutely!

The main trick is to assign the user the following permissions on his or her computer’s account:

  • Validated write to DNS host name
  • Validated write to service principal name
  • List the children of an object
  • Read
  • Read security information
  • List the object access
  • Control access right
  • Delete an object and all of its children
  • Delete
  • Write to the following properties:
    • sAMAccountName
    • displayName
    • description
    • Logon Information
    • Account Restrictions

A user with the abovementioned permissions will be able to join their PC to an AD domain without any assistance from the Service Desk.

I made a little script to automate assigning these permissions. Please look into the help section (or use the Get-Help cmdlet) to find out about its syntax and usage examples.
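The script linked above does the heavy lifting, but the core idea fits in a short sketch. This is an illustration, not the script itself: it grants the validated writes and a representative subset of the listed rights on a single computer object via the AD: PowerShell drive. ‘jdoe’ and ‘PC-042’ are placeholder names, and the GUIDs are the well-known schema GUIDs for the named validated writes and attributes.

# Minimal sketch: delegate the listed rights on one computer object.
# 'jdoe' and 'PC-042' are placeholders.
Import-Module ActiveDirectory

$user     = Get-ADUser 'jdoe'
$computer = Get-ADComputer 'PC-042'
$sid      = [System.Security.Principal.SecurityIdentifier]$user.SID

$path = "AD:\$($computer.DistinguishedName)"
$acl  = Get-Acl $path

# Validated writes (Validated-DNS-Host-Name, Validated-SPN)
'72e39547-7b18-11d1-adef-00c04fd8d5cd',
'f3a64788-5306-11d1-a9c5-0000f80367c1' | ForEach-Object {
    $acl.AddAccessRule([System.DirectoryServices.ActiveDirectoryAccessRule]::new(
        $sid, 'Self', 'Allow', [Guid]$_))
}

# GenericRead covers Read, Read security information, List children and
# List object; Delete/DeleteTree and ExtendedRight map to the rest of the list
$acl.AddAccessRule([System.DirectoryServices.ActiveDirectoryAccessRule]::new(
    $sid,
    [System.DirectoryServices.ActiveDirectoryRights]'GenericRead, Delete, DeleteTree, ExtendedRight',
    'Allow'))

# Writes to individual attributes, e.g. sAMAccountName and displayName
# (description and the Logon Information / Account Restrictions property
# sets follow the same pattern with their own GUIDs)
'3e0abfd0-126a-11d0-a060-00aa006c33ed',  # sAMAccountName
'bf967953-0de6-11d0-a285-00aa003049e2' | ForEach-Object {  # displayName
    $acl.AddAccessRule([System.DirectoryServices.ActiveDirectoryAccessRule]::new(
        $sid, 'WriteProperty', 'Allow', [Guid]$_))
}

Set-Acl $path -AclObject $acl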

If you use Windows 8.1/Server 2012 R2, you might need to install KB 3092002; without it, only members of the “Domain Admins” group will be able to execute the script successfully. This is due to a bug in the Set-Acl cmdlet. For Windows 10, the fix is included in the latest RSAT package.

If you are unsure how to use the script or experience any errors, please leave a comment below or contact me directly.

How to simplify SCDPM servers maintenance

Every DPM administrator who has ever tried to perform regular maintenance on a set of SCDPM servers (monthly updates, for example) knows that the only way to do it correctly, i.e. without interrupting backup jobs, is to shut the server down gracefully. To achieve this, you need to disable the active SCDPM agents connected to the server and wait until every running job has completed.

The only problem is that there is no quick way to get a list of every computer with an active agent connected to an SCDPM server. One may say I’m wrong here and there IS a quick way — just use the Get-DPMProductionServer cmdlet with a “ServerProtectionState -eq 'HasDatasourcesProtected'” filter. But that forgets about cluster nodes: if we protect a clustered resource, but not the cluster nodes themselves, the nodes will not appear in the output of Get-DPMProductionServer (with the abovementioned filter applied, of course). In addition, the output will contain clustered resources, which are useless for our task of stopping every active protection agent.

That’s why I want to present you with a solution that quickly gets a list of only the real computers with an active SCDPM agent installed. Just pass the names of your SCDPM servers to it (or pass nothing for localhost) and you’ll receive a collection of ProtectedServers in response. You may then pass that collection directly to the Enable-/Disable-DPMProductionServer cmdlets.
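The script itself is linked above; in rough strokes the approach looks like the sketch below. Get-DPMProductionServer, ServerProtectionState and Disable-DPMProductionServer are real, but the properties used to tell clustered resource entries from physical nodes (IsRG, ClusterName) are assumptions for illustration — inspect the objects returned in your environment with Get-Member and adjust.

# Rough sketch, NOT the actual script: collect real computers with an active
# agent, including cluster nodes, excluding clustered resource entries.
# IsRG and ClusterName are HYPOTHETICAL property names — verify with Get-Member.
param([string[]]$DPMServerName = @($env:COMPUTERNAME))

foreach ($dpm in $DPMServerName) {
    $all = Get-DPMProductionServer -DPMServerName $dpm

    # Entries that represent clustered resources rather than computers
    $resources = $all | Where-Object {
        $_.ServerProtectionState -eq 'HasDatasourcesProtected' -and $_.IsRG
    }

    # Stand-alone protected servers plus the nodes of protected clusters
    $all | Where-Object {
        ($_.ServerProtectionState -eq 'HasDatasourcesProtected' -and -not $_.IsRG) -or
        ($_.ClusterName -and $resources.ServerName -contains $_.ClusterName)
    }
}

# The resulting collection can go straight to Disable-DPMProductionServer
# before maintenance, and to Enable-DPMProductionServer afterwards.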

“setspn -x” is case-insensitive now

As you probably know, duplicate SPNs cause Kerberos authentication errors in AD DS domains. You may notice this by looking for KRB_AP_ERR_MODIFIED errors and Event ID 11 in the System logs. With Windows Server 2008, Microsoft released a largely improved version of setspn, which includes the “-x” switch to help you proactively monitor your infrastructure for duplicate SPNs. Combined with the “-f” switch, the setspn output contains duplicate SPNs not only from a single domain but from the whole AD DS forest. Many companies rely on the result of the “setspn -x -f” command as a data source for their monitoring systems.

Today I found that, all these years, almost nobody noticed that the “setspn -x” command compares SPNs case-sensitively, i.e. the following SPNs will be considered different and will not be shown in the output:

  • HOST/SERVERNAME
  • HOST/ServerName

Starting with Windows 10, Microsoft changed the behavior of setspn to be case-insensitive, and from now on every duplicate SPN will be displayed in the setspn output regardless of its case.

While Microsoft asserts that Windows treats SPNs case-insensitively, not every Microsoft product agrees: for example, Shane Young found that you must pay attention to the case of SPNs used by SharePoint accounts.

In conclusion, I suggest that every AD DS administrator check their infrastructure with the setspn tool shipped with Windows 10 at least once. It allows you to find TRULY EVERY duplicate SPN (I did find a couple myself ;).
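If you feed the result into a monitoring system, a thin wrapper is enough. Here is a sketch; the summary-line pattern is an assumption about the exact output wording, so match it against what your setspn build actually prints:

# Run the forest-wide, case-insensitive duplicate scan (Windows 10 setspn)
# and turn the result into a simple pass/fail for monitoring.
$result = & setspn.exe -x -f 2>&1

# ASSUMPTION: the summary line states how many duplicate groups were found;
# adjust the pattern to your setspn build's exact wording.
if ($result -match 'found 0 group') {
    Write-Output 'OK: no duplicate SPNs found'
} else {
    Write-Output 'WARNING: possible duplicate SPNs:'
    $result
}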

How to limit a number of PowerShell Jobs running simultaneously

Recently, I received a task that required me to run a particular command on several thousand servers. Since this command takes some time to execute, it is only logical to run it in parallel, and PowerShell background jobs are the first thing that comes to mind.
But the resources of my PC are limited — it cannot run more than a hundred jobs simultaneously. Unfortunately, PowerShell doesn’t have built-in functionality for limiting the number of background jobs yet.

Still, I’m not the first one to get stuck with this problem: the official “Hey, Scripting Guy!” blog has introduced a queue based on .NET Framework objects. But I couldn’t get that solution to work and needed something simpler. After all, all we need is:

  • a loop
  • a counter of how many jobs are active and running
  • a variable allowing the next job in the queue to start

Eventually, I came up with a piece of code like this:
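A minimal sketch of the idea — Test-Connection stands in for the actual long-running command, and servers.txt for the real server list:

# Throttling loop: keep at most $Limit jobs running, start the next one
# as soon as a slot frees up.
$Limit   = 100
$Targets = Get-Content .\servers.txt   # placeholder: one server name per line

foreach ($target in $Targets) {
    # The counter: how many jobs are active right now
    while ((Get-Job -State Running).Count -ge $Limit) {
        Start-Sleep -Seconds 1   # wait for a free slot
    }
    Start-Job -Name $target -ArgumentList $target -ScriptBlock {
        param($server)
        # ...the actual long-running command goes here...
        Test-Connection -ComputerName $server -Count 1 -Quiet
    }
}

# Drain the queue and collect the results
$results = Get-Job | Wait-Job | Receive-Job
Get-Job | Remove-Job
$results

The while loop is the counter from the list above; polling once a second keeps it cheap.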

My version is similar to a solution proposed on StackOverflow (which I found only AFTER completing my own, of course), but the SO version suffers from a bug where some items in the queue may be skipped.

While PS Jobs are easy to play with, Boe Prox claims that runspaces are much, much faster. Go check out how you can use them to accelerate your job queue on his blog.

Some other queueing techniques in PowerShell:

AuthenticationSilo claim is not issued

You set up an Active Directory Authentication Policy and use membership in an Authentication Policy Silo as an access control condition. Next, you set up an Authentication Policy Silo to use the abovementioned Authentication Policy for the appropriate principal types. You set the silo into “audit-only” mode.

In that case, the AuthenticationSilo claim is not issued for your security principals.

Why does this happen?

As described in section 3.1.1.11.2.18 “GetAuthSiloClaim” of the Active Directory Technical Specification, the AuthenticationSilo claim is issued only when the policies in an Authentication Silo are enforced:
/*
Check if user is assigned to an enforced silo.
*/
assignedSilo := pADPrincipal!msDS-AssignedAuthNPolicySilo
if (assignedSilo = NULL ||
    assignedSilo!msDS-AuthNPolicySiloEnforced = FALSE)
    return NULL
endif

Resolution

I’ve found no option to modify this behavior yet. Just keep it in mind while testing your Authentication Policy configuration.
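To check whether a silo you are testing is still in audit-only mode, you can query its Enforce flag. A quick sketch with the ActiveDirectory module (‘CorpSilo’ is a placeholder name):

Import-Module ActiveDirectory

# Audit-only silos report Enforce = False, so the AuthenticationSilo
# claim will not be issued for their members.
Get-ADAuthenticationPolicySilo -Identity 'CorpSilo' -Properties Enforce |
    Select-Object Name, Enforce

# Once testing is done, enforcing the silo makes the claim appear:
Set-ADAuthenticationPolicySilo -Identity 'CorpSilo' -Enforce $true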

«0x80070721 A security package specific error occurred» while using WMI between domains

Symptoms:

You have two IBM System x servers with Windows Server 2012 or newer and with the «IBM USB Remote NDIS Network Device» network interface enabled. Both of these servers reside in the same AD DS forest but in different domains. You try to set up a WMI session from one to the other (using wbemtest, for example). The WMI connection fails and you receive an error:

«Number: 0x80070721 Facility: Win32 Description: A security package specific error occurred.»
In the System log of the server that initiates the connection, we see:
Log Name: System
Source: Microsoft-Windows-Security-Kerberos
Date: 10/6/2013 9:13:09 PM
Event ID: 4
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: SRV1.alpha.example.com
Description:
The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server srv1$. The target name used was host/srv2.beta.example.com. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Ensure that the target SPN is only registered on the account used by the server. This error can also happen if the target service account password is different than what is configured on the Kerberos Key Distribution Center for that target service. Ensure that the service on the server and the KDC are both configured to use the same password. If the server name is not fully qualified, and the target domain (ALPHA.EXAMPLE.COM) is different from the client domain (BETA.EXAMPLE.COM), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

Why does this happen?

Starting with Windows Server 2012, when one machine connects to another computer’s WMI, Windows asks the remote system for the IP addresses of its network interfaces. The operating system then chooses which of them best suits the connection.
In the case of IBM System x, both servers have a network interface with the same IP address — 169.254.95.120. Windows chooses this IP address as the best one and tries to connect to it. But instead of the remote system, it connects to itself, and you see the error.

When you try to connect with wbemtest, it calls the WMI API. In the background, WMI uses DCOM to communicate with other servers. When DCOM establishes a session, it uses the «ServerAlive2» query to check the other server.
Here is what the network capture looks like:

14 14:38:55 10.01.2014 0.0111483 srv2.beta.example.com 135 (0x87) 91.103.70.14 55535 (0xD8EF) DCOM DCOM:IObjectExporter:ServerAlive2 Response {MSRPC:10, TCP:9, IPv4:8}
NetworkAddress: srv1
NetworkAddress: 169.254.95.120
NetworkAddress: 192.0.2.1

The same traffic when we connect from the other end:

38 14:37:12 10.01.2014 37.1871388 srv1.alpha.example.com 135 (0x87) 10.253.12.2 59278 (0xE78E) DCOM DCOM:IObjectExporter:ServerAlive2 Response {MSRPC:14, TCP:13, IPv4:8}
NetworkAddress: srv2
NetworkAddress: 169.254.95.120
NetworkAddress: 203.0.113.2

The application layer chooses the common IP address and loops the connection back to the local machine. DCOM gets a Kerberos ticket for the remote server but presents it to the local one, which is why we see KRB_AP_ERR_MODIFIED errors from Kerberos. So DCOM-based communication (WMI, for example) won’t work if the participants share a common IP address.

This problem is known to Microsoft and will not be fixed, since the behavior is by design.

How to mitigate it?

  1. Just disable the IMM USB network interface on one or both servers (see the sketch after this list). But beware: updating the IMM firmware from Windows or using ASU64 will enable this interface again. If you choose this option, I suggest setting up monitoring to alert you when the interface is left enabled.
  2. Change the IP address of one of the interfaces to something else. You can use almost any private address, but here are the recommendations from IBM. You can even use a DHCP server to achieve this (I hope you are using a separate VLAN for management interfaces, aren’t you?).
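Here is a sketch of the first option in PowerShell. The interface description string is an assumption — check the exact name with Get-NetAdapter on your hardware:

# Disable the IMM NDIS interface (description string may differ per model)
Get-NetAdapter -InterfaceDescription 'IBM USB Remote NDIS Network Device*' |
    Disable-NetAdapter -Confirm:$false

# A simple recurring check for monitoring: alert if a firmware update
# has silently re-enabled the interface.
$ndis = Get-NetAdapter -InterfaceDescription 'IBM USB Remote NDIS Network Device*' -ErrorAction SilentlyContinue
if ($ndis | Where-Object Status -eq 'Up') {
    Write-Warning "IMM NDIS interface is enabled again on $env:COMPUTERNAME"
}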

SCDPM: Fail to Modify Disk Allocation after Exchange DAG Switched

Symptoms:

Suppose you have an Exchange 2010 installation with one or more Database Availability Groups and two or more servers in each DAG. You set up backup for one of these DAGs using DPM 2010 UR? or newer (incl. 2012 R2 UR2). Later, you change the active status of a protected copy of a mailbox database (for example, you switch the active copy to another mailbox server). After that, for the database copies whose status has changed, you’ll receive the following error on the Review Disk Allocation page of the New/Modify Protection Group wizard in the DPM console:
"The operation failed because the data source VSS component {76fe1ac4-15f7-4bcd-987e-8e1acb462fb7} is missing.
Check to see that the protected data source is installed properly  and the VSS writer service is running.

ID: 915
Details: The operation completed successfully (0x0)”

If you try to add such a DB to a secondary DPM server, you’ll receive the same error at the disk size calculation step.

This problem is known to Microsoft and will not be fixed.

Why does this happen?

DPM stores information about protected resources in the tbl_IM_DataSource and tbl_IM_ProtectedObject tables in DPMDB. If you look into the ApplicationPath, LogicalPath or PhysicalPath cells, you’ll find an XML document describing the protected resource. Here is one for an Exchange mailbox database in a DAG:

DAGNODE2.example.com – the DAG node from which the database is protected.
MAILDB01 – the name of the protected DB.
Microsoft Exchange Server\Microsoft Information Store\Replica\DAGNODE2 – the path to the copy of the protected DB on the DAG node we protect. Mind the “Replica” element of the path – it means we protect a passive (not an active) copy of the DB. For an active copy, this part of the path is absent.

When you change the status of a mailbox database in the DAG, the actual LogicalPath changes, but DPM knows nothing about it and keeps the now-inconsistent data in DPMDB.
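You can see the stale paths for yourself by peeking into DPMDB (read-only!). A sketch assuming a local SQL Server instance, a database named DPMDB and Invoke-Sqlcmd available; the join column is an assumption, so check your schema first:

# Read-only look at the stored paths (do NOT modify DPMDB by hand).
Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'DPMDB' -Query @'
SELECT ds.DataSourceName, po.LogicalPath, po.PhysicalPath
FROM   dbo.tbl_IM_DataSource      AS ds
JOIN   dbo.tbl_IM_ProtectedObject AS po
       ON po.DataSourceId = ds.DataSourceId   -- join column: assumption
'@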

Resolution:

There are two workarounds (choose whichever suits you best):

On the DPM side:

  1. Stop protection of the problematic DB, retaining its data.
  2. Add the DB back into an appropriate protection group. DPM will update the tbl_IM_DataSource and tbl_IM_ProtectedObject tables.
  3. When the consistency check completes, you’ll be able to manage the allocated disk space for this DB and set up secondary protection for it.

On the Exchange side:

  1. Restore the active state of the problematic DB to what it was when you added it to DPM (a shell sketch follows this list):
    1. If the DB was in an active state – make it active again.
    2. If the DB was in a passive state – make it passive again.
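From the Exchange 2010 Management Shell, the switch looks like this sketch; MAILDB01 and DAGNODE2 are the placeholder names from the example above, and DAGNODE1 is a hypothetical second node:

# See which copy of the database is currently active
Get-MailboxDatabaseCopyStatus -Identity 'MAILDB01'

# Example: DPM protected the PASSIVE copy on DAGNODE2 (the "Replica" path
# above), so activate the copy on another node to return DAGNODE2's copy
# to the passive state DPM expects.
Move-ActiveMailboxDatabase -Identity 'MAILDB01' -ActivateOnServer 'DAGNODE1'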

If you need to, you can switch the DB’s state back after modifying the disk allocation / setting up secondary protection – the mismatch doesn’t interfere with synchronization or recovery point creation; it just makes it impossible to calculate the size of the DB.