Thursday, October 20, 2011

Playing IT Fast and Loose

It's been a long time since I've been at work from dusk 'til dawn. I'm not saying that I'm the reason we have such fabulous uptime; there are a lot of factors that play into it. We've got a well-rounded NetOps team, we try to buy decent hardware, we work to keep everything backed up and we don't screw with things when they are working. And we've been lucky for a long time.

It also helps that our business model doesn't require selling things to the public or answering to many external "customers". That puts us in the interesting position where it's almost okay if we are down for a day or two, as long as we can get things back to pretty close to where they were before they went down. It also sets us up to make some very interesting decisions come budget time. They aren't necessarily "wrong", but they can end up being awkward at times.

For example, we've been working over the last two years to virtualize our infrastructure. This makes lots of sense for us - our office space requirements are shrinking and our servers aren't heavily utilized individually, yet we tend to need lots of individual servers due to our line of business. When our virtualization project finally got rolling, we opted to use a small array of SAN devices from LeftHand (now HP). We've always used Compaq/HP equipment and we've been very happy with the dependability of the physical hardware. Hard drives are considered consumables and we do expect failures of those from time to time, but whole systems really biting the dust? Not so much.

Because of all the factors I've mentioned, we made the decision to NOT mirror our SAN array. Or do any network RAID. (That's right, you can pause for a moment while the IT gods strike me down.) We opted to use all the space we could for data and weighed that against the odds of a failure that would destroy the data on a SAN, rendering the entire RAID 0 array useless.
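To put rough numbers on that kind of trade-off, here is a back-of-the-envelope sketch in Python. The node counts and the 3% per-node failure chance are made-up placeholders, not figures from our environment, and the math assumes nodes fail independently. The point is only that without mirroring or network RAID, losing any one node takes the whole volume with it, so the odds compound with every node you add.

```python
def p_array_loss(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` independent nodes fails.

    With no mirroring or network RAID, a single fatal node failure takes
    the whole striped volume with it, so this is also the chance of
    losing the array.
    """
    return 1.0 - (1.0 - p_node) ** nodes


# Hypothetical inputs: each node has a 3% chance of a fatal failure per year.
for n in (1, 2, 4):
    print(f"{n} node(s): {p_array_loss(n, 0.03):.1%} chance of array loss per year")
```

Plug in whatever failure rate you are willing to defend at budget time; the shape of the curve is what matters, not my placeholder numbers.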

Early this week, we came really close. We had a motherboard fail on one of the SANs, taking down our entire VM infrastructure. This included everything except the VoIP phone system and two major applications that have not yet been virtualized. We were down for about 18 hours total, which included one business day.

Granted, we spent the majority of our downtime waiting for parts from HP and planning for the ultimate worst - restoring everything from backup. While we may think highly of HP hardware overall, we don't think very highly of their 4-hour response windows on Sunday nights.  Ultimately, over 99% of the data on the SAN survived the hardware failure and the VMs popped back into action as soon as the SAN came back online. We only had to restore one non-production server from backup after the motherboard replacement.

Today, our upper management complimented us on how we handled the issue and was pleased with how quickly we got everything working again.

Do I recommend not having redundancy on your critical systems? Nope.

But if your company management fully understands and agrees to the risks related to certain budgeting decisions, then as an IT Pro your job is to simply do the best you can with what you have and clearly define the potential results of certain failure scenarios.

Still, I'm thinking it might be a good time to hit Vegas, because Lady Luck was certainly on our side.

Monday, October 17, 2011

Migrating to Exchange 2010 (Part 2) - Certificates

Depending on your installation of Exchange 2010 and what internal and external services you want to provide, you'll likely need a new SSL certificate from a 3rd party provider. You probably already have a basic mail.company.com certificate, but that's just not going to cut it anymore. 

If you'll be supporting mailboxes on a previous version of Exchange or supporting Outlook Anywhere, you'll likely need additional host names on your certificate, like legacy.company.com and autodiscover.company.com. This will require a SAN (Subject Alternative Name) certificate.

Exchange supports different URLs for internal and external access. After a typical installation, your internal URLs will be set to the FQDN of the server name (server.company.com) and your external URLs will be set to whatever host name you specify during the install of the CAS server, like mail.company.com.

In order for us to get a shiny new SAN certificate, we had to revoke our existing mail.company.com certificate while we were waiting for the new one to be issued. This would cause some temporary certificate problems for anyone who tried to use Outlook Web Access, but since this was a weekend project and I had already declared the entire weekend a maintenance window, I wasn't too concerned about it.

Meanwhile, I moved all my users' mailboxes to the new server. All the Outlook clients were happy with the server's self-signed certificate, which was great, since our 3rd party certificate provider took a few days to finish issuing the new cert. Once the new certificate came, I loaded it onto the mail server and authorized it for IIS to use.

My OWA certificate errors disappeared, but shortly thereafter we started getting reports of Outlook 2007 complaining about the certificate having a different name than what it was expecting. This was because we didn't include the server name as part of the certificate, but all the internal URLs referenced the FQDN of the server's real name.
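A quick way to catch this kind of mismatch before your users do is to compare the host names in your service URLs against the names on the certificate. Here is a minimal sketch in Python; the function name and the sample lists are hypothetical, built from the example names used in this post, and it only handles exact matches and simple single-label wildcards.

```python
from urllib.parse import urlparse


def uncovered_hosts(urls, cert_names):
    """Return host names used in `urls` that the certificate does not list.

    Only exact matches and single-label wildcards (*.company.com) are handled.
    """
    names = [n.lower() for n in cert_names]
    missing = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if not host:
            continue  # skip anything that is not a full URL
        # A wildcard entry covers exactly one extra label on the left.
        wildcard = "*." + host.split(".", 1)[1] if "." in host else None
        if host not in names and wildcard not in names and host not in missing:
            missing.append(host)
    return missing


# Hypothetical values modeled on the example names used in this post.
cert_names = ["mail.company.com", "autodiscover.company.com", "legacy.company.com"]
internal_urls = [
    "https://server.company.com/autodiscover/autodiscover.xml",
    "https://server.company.com/ews/exchange.asmx",
    "https://mail.company.com/owa",
]
print(uncovered_hosts(internal_urls, cert_names))
```

Any host name this reports is one that either needs to go on the certificate or needs its URL pointed at a name the certificate already covers.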

Some of the internal URLs can be changed in the Exchange Management Console, but there are a few that are easily overlooked since you can only change them using PowerShell, particularly the URLs for Autodiscover and EWS (Exchange Web Services).

Set-ClientAccessServer -Identity CAS_Server_Name -AutodiscoverServiceInternalUri https://mail.company.com/autodiscover/autodiscover.xml
Set-WebServicesVirtualDirectory -Identity "CAS_Server_Name\EWS (Default Web Site)" -InternalUrl https://mail.company.com/ews/exchange.asmx

Then be sure to recycle your MSExchangeAutodiscoverAppPool in IIS.  You can read more about this issue in Microsoft's KB 940726.

Wednesday, October 12, 2011

Migrating to Exchange 2010 (Part 1)

Ah, upgrades and migrations. Nothing ever happens the same way it does in the lab! First off though, I do have to say that my upgrade/migration from Exchange 2003 to Exchange 2010 SP1 was successful and relatively transparent to my end users. Of course, we have a pretty small office and only one server, so there were not a lot of moving parts.

Before working in production, I did two lab-based migrations using some older copies of my Active Directory and Exchange servers - probably a tad too old, since I ran into totally different troubleshooting hurdles in production. Also, there were several things I couldn't completely test in our lab environment, like our BlackBerry BES implementation or inbound and outbound mail connectors. But hey, I love flying by the seat of my pants.

One of the benefits of being late to Exchange 2010 was that there was lots of information on the Internet when I went searching for solutions, and nothing was insurmountable.
My primary source of guidance was the Microsoft Exchange Deployment Assistant, which is an online checklist of steps to follow. It asks a few questions about your environment and then produces a "customized" checklist. I have a few caveats about it though.
  1. It assumes you are installing the various Exchange server roles on different machines or at different times. Since I was using the "typical" installation process, my CAS, Hub and Mailbox roles were being installed together.
  2. You must check off the completed steps in order. Sure, you can skip around and follow the instructions however you want, but if you like crossing things off a list as you go along and something early in the list is delayed, you can't check off any of the later tasks. For example, "Adding digital certificates on the CAS" is something that is listed very early in the checklist. I had to wait several days for my new SAN certificate to be issued but that didn't prevent me from moving forward with my migration. However, I couldn't play along with the checklist.
These are small gripes and if you are a stickler for documentation, you can print, email or copy/paste the instructions from the deployment assistant into your own project plan.

In the lab, the typical installation went along without a hitch. However, I was not blessed with such luck in production. The CAS and Hub Transport roles installed fine, but the installation choked on the Mailbox role with the following error.

Couldn't resolve the user or group "mydomain.local/Microsoft Exchange Security Groups/Discovery Management." If the user or group is a foreign forest principal, you must have either a two-way trust or an outgoing trust.
I found the solution in several places, but it was very nicely documented here on Peter Schmidt's blog.

Just to clarify, you are deleting the "DiscoverySearchMailbox" user from Active Directory, rerunning your install for the mailbox role and then rerunning "setup /prepareAD" to recreate the user you deleted. Interestingly, I can't see the Discovery Search Mailbox in my Recipient Configuration in production, but I can in my test lab. (Odd... maybe one day I'll figure that out.)

At this point, Exchange 2010 is humming along right next to my Exchange 2003 server and everything is happy and still working the way it did before, mostly because we have a Barracuda appliance that collects our inbound mail and delivers it to the Exchange 2003 server, so really nothing had changed.

I created a Receive Connector for the Barracuda, updated the Barracuda to deliver mail to the Exchange 2010 server, then created my new Send Connector as per the Deployment Assistant and removed the Send Connector on the Exchange 2003 server. Once I verified that inbound and outbound mail was still flowing, it was time to take a breather and regroup for the next round.

Coming up - Getting BlackBerry BES to work again, fixing certificate errors with Outlook 2007, creating an external relay for some legacy devices on my network and figuring out why I couldn't mount a new database after I created it. Stay tuned.

Thursday, October 6, 2011

Replication Warnings? - It could be just one Attribute.

Active Directory can be a funny beast. This week, I noticed a recurring replication error that didn't seem to be sorting itself out within a reasonable time frame. I was seeing NTDS Replication Warning 1083, referencing a specific user account:

Event Type: Warning
Event Source: NTDS Replication
Event Category: Replication
Event ID: 1083
Date:  10/3/2011
Time:  11:45:00 AM
User:  NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1
Description:
Active Directory could not update the following object with changes received from the domain controller at the following network address because Active Directory was busy processing information.

Object:
CN=Joe Smith,OU=Accounts,DC=mydomain,DC=org
Network address:
a5b5b72d-c74b-486a-9dfa-f6516f37b38b._msdcs.caclo.org

Following it was the informational event 1955 about a write conflict:

Event Type: Information
Event Source: NTDS Replication
Event Category: Replication
Event ID: 1955
Date:  10/3/2011
Time:  11:45:00 AM
User:  NT AUTHORITY\ANONYMOUS LOGON
Computer: DC1
Description:
Active Directory encountered a write conflict when applying replicated changes to the following object.

Object:
CN=Joe Smith,OU=Accounts,DC=mydomain,DC=org
Time in seconds: 0 

After some research I tried the following troubleshooting steps:

1) Moved the offending user to a different OU temporarily to see if the problem resolved. This essentially "tickles" AD into replicating that particular user. I received the same messages, but the object's distinguished name in the events had been updated to reflect the new OU.
2) Used the LDP tool to see if there were duplicate entries for this user somehow, but only one instance was found.
3) Used repadmin to look at the time stamps of various attributes on the account, particularly any with a time stamp close to the time that the replication warnings started appearing in the event log.

Repadmin was where I had the most luck.  You'll want to run the following command for Windows 2003 SP2 DCs:

repadmin /showobjmeta DC1 "CN=Joe Smith,OU=Accounts,DC=mydomain,DC=org"

This will return a list of attributes with timestamps.  In my case it was the attribute related to the last password change, which was the only one that had a timestamp of the same date when the errors began.  I reset the password on the account to "tickle" that particular attribute and the replication completed without any complaint.
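If the object has a long attribute list, narrowing it down to the entries stamped near the first warning can be scripted. The Python sketch below assumes you have already copied the attribute names and originating timestamps out of the repadmin output by hand. The function name, the sample attributes, their timestamps and the 24-hour window are all illustrative placeholders, not my actual output.

```python
from datetime import datetime, timedelta


def changed_near(attrs, first_warning, window=timedelta(hours=24)):
    """Return attribute names whose originating change falls within
    `window` of the first replication warning, closest first."""
    hits = [
        (abs(stamp - first_warning), name)
        for name, stamp in attrs
        if abs(stamp - first_warning) <= window
    ]
    return [name for _, name in sorted(hits)]


# Hypothetical data: (attribute, originating time) pairs typed in by hand.
attrs = [
    ("displayName", datetime(2010, 6, 14, 9, 2, 11)),
    ("description", datetime(2011, 8, 22, 16, 30, 5)),
    ("pwdLastSet", datetime(2011, 10, 3, 8, 15, 40)),
]
first_warning = datetime(2011, 10, 3, 11, 45, 0)
print(changed_near(attrs, first_warning))
```

Whatever comes back is your short list of attributes to "tickle" first.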

Some anecdotal stories on the Internet indicate that this attribute can cause trouble if replication occurs while an account happens to be locked out. In this case, the account was for a consultant who didn't log in very often, so the locked account went unnoticed for some time, causing the replication issue.
