Monday, January 3, 2011

SharePoint 2010 Migration Hell - Part 1

I started the upgrade from MOSS to SharePoint Server 2010 back in October. It was a database attach migration and the visual upgrade was not to be applied until 3 or 4 weeks later (time needed to train contributors). Everything seemed to go just fine after a couple of test runs on the new farm, so we proceeded with the migration. Other than typical User Profile Service stuff, everything seemed successful.

NOTE: This is all done in production as there isn't a test environment. Yes, I know, I NEED one, but there just hasn't been any time for my team of me, myself, and I to get to that project.

However, when I went to prepare for the visual upgrade, demons from Hell exploded from my farm. Seeing as I was short a Golden Child (apparently, Eddie Murphy didn't protect him after his return to Tibet), I had to actually address the issues (no, I can't make Pepsi can dancers).

The issue that first arose after trying to apply the visual upgrade was: "One or more field types are not installed properly. Go to the list settings page to delete these fields." This happened on any page I tried to edit. I followed it back to the Relationships List (http://social.technet.microsoft.com/Forums/en/sharepoint2010setup/thread/a30f0ffd-782f-4d4a-9c22-b92bac0fad4f). I tried the changes as stated, but that only seemed to fix a small part of the problem as the pages continued to display errors. At this point, I rolled back to v3.

After a process that would drive most people to drink, or at least to quit, I pulled out my shovel and started digging through the log files and Event Viewer. Eventually, I hit a wall (at the time, I wished that were a literal statement) and, after baffling our consultant, I called Microsoft Tech Support.

Well, this was about as useful as asking my 3-year-old for help. After playing phone tag for a week I finally got to talk to someone. At this point it was the day before Thanksgiving. After having a look at the log files, the tech, Mukesh, wanted to perform some steps in a short downtime. Management approved a 30-minute window in the middle of the day as that's all Mukesh said would be needed. All he wanted to do was detach the content databases from the main web app and reconnect them to a test web app that was created after the migration.

*SNAP* goes the Error Demon's whip.

The attach fails (I think it was because our master page has a custom web part built into it and we hadn't tweaked the test web app's web.config, but Mukesh wasn't listening). Ok, so reattach to main and we'll be back up.

*SNAP* At this point the Error Demon is laughing maniacally.

The reattach didn't bring the site back up. This is around the 45-minute mark of the 30-minute downtime. So Mukesh has me try many other things over the next few hours (AAMs, IIS host headers, hosts files, web.configs, etc.). Around 4:45pm (3.75 hours into the 0.5-hour window), he decides that it's a network issue and sends it into the Network team's queue. His lead told me that the Network team would contact me within the hour.

5:45 - I call the operator line. It was assigned to Network and I should receive a call "soon."

6:30 - Status is still "soon." Mukesh's lead was wrong about the 1-hour callback. That's only for Premier customers, not standard Enterprise Agreement customers.

7:15 - Operator can't see that there is an assigned tech. Calls Mukesh's team, but gets no answer. While on hold, I get a call from Sameer with the Network team. Operator transfers me at 7:45.

7:45 - Sameer wants to see load balancing info. After telling him that we don't use Windows Load Balancing, he says he can't help because it's not supported. He says the SharePoint team will have to work on it (you know, that same team from before that was of so much help). He leaves a message for the SharePoint team lead.

8:50 - Sameer calls back asking if I had heard from the SP team. After telling him I hadn’t, he said I would need to call the operator to get in contact with them.

9:05 - Operator Camille emails the manager on duty and notifies her supervisor of the issues to try to expedite the callback. She says she will contact me when she gets more info.

9:24 - Camille says the manager is assigning the ticket and that I should receive a call “soon.”

10:38 - I'm told that the SharePoint manager put me next in line, but no one knows when the next tech will be available.

10:55 - Call from tech support (Bharat). He finally figured out, after reading the logs (the same logs sent to Mukesh), that the Office Web Apps Cache Creation job was failing. Did a remove via PowerShell and the site started functioning again.

1:30am Thursday - walk out with a site back up, but no closer to a resolution on the original issue.

There was another run-in with tech support the next Wednesday. Basically, same process and similar resolution (had to run the upgrade to get the site collection to respond).

This turned out a bit longer than expected, but I wanted people to know about the whole experience, including the chaos that was tech support. Up to this point, the tech support stats stand at:

Planned time down: 1.5 hours
Total time down: 15 hours
Tickets opened: 2
Tickets closed: 0
Tickets refunded: 1 (but will try to get #2 refunded)
Migraines: numerous
Understanding of where THE postal worker was coming from (you know the guy who went "postal"): YES

The next entry will actually resolve the issue. I swear. No three-part blog.
