HowTo: Quiesced snapshots of Forefront TMG virtual machines

I was recently asked to look into a problem a client was having with his vSphere vDR backup routines. All guest machines were being successfully backed up apart from one.

The virtual machine that was failing was the proxy server, running MS Forefront TMG on a Windows 2008 R2 guest. The backup reported that a snapshot had failed with this error:

 "Failed to create snapshot for proxy, error -3960 ( cannot quiesce virtual machine)"

I started by taking a manual quiesced snapshot to test it outside of vDR:

Sure enough, this produced a similar error:

"Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine."

Naturally, I began by scouring the forum posts at http://communities.vmware.com, but found a whole range of posts regarding quiesced snapshots, all of which I found rather confusing. All the virtual machines on this system were built from the same template. All therefore had the same VMware Tools installed, in the same way, and only one was having this issue.

Some of the knowledge base articles looked interesting, but didn’t get me any closer to the solution either – http://kb.vmware.com/kb/1009073 and http://kb.vmware.com/kb/1007696

In the end, I decided to investigate the TMG services to see whether they were causing VSS to fail. When all TMG services were stopped, the snapshots worked with no errors! I then began selectively stopping them to see which one was causing the problem, and found my culprit: the ISASTGCTRL service. This service, described here by Marc Grote, is used to store the TMG configuration in AD LDS (Active Directory Lightweight Directory Services). When the service is running, snapshots fail, and when it is stopped, they succeed.

In order to allow quiesced snapshots to be taken, I had to create a freeze and thaw script procedure as follows:

Within the guest operating system, I created the folder C:\Program Files\VMware\VMware Tools\BackupScripts.d\. This folder is not created by default when VMware Tools is installed, but it is required if you want to add pre-snapshot and post-snapshot scripts as I did. Within this folder I created a text file called vss.bat with the following contents:

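@echo off
rem Called by VMware Tools with one argument: freeze, thaw or freezeFail.
rem The quotes stop the comparison failing when no argument is supplied.
if "%1" == "freeze" goto freeze
if "%1" == "thaw" goto thaw
if "%1" == "freezeFail" goto freezeFail
rem Unknown or missing argument: do nothing.
goto :eof

:freeze
rem Stop the TMG storage service before the snapshot is taken.
net stop "ISASTGCTRL" /Y
exit
:thaw
rem Restart the service once the snapshot has finished.
net start "ISASTGCTRL" /Y
exit
:freezeFail
rem Restart the service if the freeze step failed.
net start "ISASTGCTRL" /Y
exit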

Hence, when a quiesced snapshot is requested, VMware Tools first checks this folder for any scripts to run. Before the snapshot is taken, it passes the argument ‘freeze’ to the script, and when the snapshot is finished, it passes the argument ‘thaw’. In doing so, the script stops and then restarts my problematic service.
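Before relying on the scheduled backup, it may be worth running the script by hand inside the guest to confirm that it really stops and restarts the service. The following is only a rough sketch, assuming vss.bat sits in the folder described above and that it is run from an elevated command prompt. The SCRIPT variable is my own shorthand, not something VMware Tools uses.

```bat
@echo off
rem Manual check of the freeze/thaw script (a sketch, not part of the backup itself).
rem SCRIPT is a hypothetical helper variable pointing at the script created above.
set "SCRIPT=C:\Program Files\VMware\VMware Tools\BackupScripts.d\vss.bat"

rem Run the script in a child cmd so that the "exit" inside it
rem does not close this window.
cmd /c ""%SCRIPT%" freeze"
rem The service should now be reported as STOPPED.
sc query ISASTGCTRL | find "STATE"

cmd /c ""%SCRIPT%" thaw"
rem The service should now be reported as RUNNING.
sc query ISASTGCTRL | find "STATE"
```

If the second query does not show the service running again, the thaw step needs attention before the script is trusted with a real snapshot.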

Now when the scheduled VMware vDR process runs, the appliance is able to take a quiesced snapshot successfully.  Happy days.

Oops how embarrassing!

I often stumble upon an interesting blog or website, but am usually reluctant to add it to my favourites. My favourites list is full of clutter from broken links, retired sites, and URLs that are quicker just to type in.

What I need is a web service that provides a list of favourite sites, which saves me synchronising my favourites within my Mesh, and also allows me to share links I think are interesting, but not worth blogging about.

Enter Ma.gnolia.com (I have no idea why they write it like that).  Ma.gnolia is, well, let them tell you:

At Ma.gnolia, members save websites as bookmarks, just like in their browser. Except with a twist: they also “tag” them, assigning labels that make them easy to find again. So when you search for something, you use words that people choose and look only at websites that people think are worth saving. Suddenly you have access to a human-organized bookmark collection that numbers in the millions, but is as easy to use as a search engine.

With Ma.gnolia, that’s really all the work you have to do. Finding by tags makes organizing bookmarks a thing of the past. Since it’s a website, your Ma.gnolia bookmark collection can be reached by you and your friends from anywhere, any time. And don’t worry about web pages disappearing from your searches or even the web, as we make a saved copy of each page you bookmark where websites allow us to.

All very interesting, but one of the main reasons to use the service is so that you always have access to your favourites.  Unless they lose them of course.

A couple of days ago, that’s exactly what they did.  And they can’t get them back.  Here’s what they have to say (link):

Dear Ma.gnolia Community Member or Visitor,

Early on the West-coast morning of Friday, January 30th, Ma.gnolia experienced every web service’s worst nightmare: data corruption and loss. For Ma.gnolia, this means that the service is offline and members’ bookmarks are unavailable, both through the website itself and the API. As I evaluate recovery options, I can’t provide a certain timeline or prognosis as to to when or to what degree Ma.gnolia or your bookmarks will return; only that this process will take days, not hours.

I will of course keep you appraised here and in our Twitter account.

Most importantly, I apologize to all of you who have made Ma.gnolia a home for your bookmarks and community. I know that many of you rely on Ma.gnolia in your day to day work and play to safely host you bookmarks, keeping them available around the clock, and that this is a difficult disruption.

Sincerely,
Larry

Oh dear.

I’m especially surprised by the “as I evaluate recovery options” comment.  Surely every business understands their recovery options.  Don’t they?

When online presence is crucial (i.e. your main business function), as it is with web service providers, a fast recovery plan should have been in place. Replication of the data to a second location, with regular snapshots to protect against data corruption, is such an inexpensive protection strategy nowadays. Add to that the ease with which service providers can test recoverability, and this failure is a true schoolboy error.

The lesson to be learned for the rest of us is to take the DR plan into your own hands. Store multiple copies of the data you want to keep. Fortunately a very helpful blogger, Hutch Carpenter, posted a great idea to make this a simple process. Store your bookmarks at Diigo, and let Diigo copy them to Del.icio.us. See his site for a step-by-step guide.


DR tests too costly?

I was speaking with a colleague last week about UK businesses scaling back their budgets for disaster recovery and business continuity provision. It seems that while some firms have decided to take the risk of not having a plan at all, others are trying to find shortcuts to reduce their spending. By far the most obvious piece of the jigsaw to remove, for most businesses, is the test invocation.

Test invocations form a crucial part of all disaster recovery plans, but they are often the most expensive component of the solution. They are frequently overlooked at the outset of a business continuity plan, as service providers and manufacturers proclaim ‘ease of recovery’. Only when the first test is carried out does the extent of the hidden costs become apparent. Even simple tape restore testing can be time consuming and therefore expensive (and often outside the desired Recovery Time Objective or RTO). Worse still, if the test fails, further staff time must be dedicated to investigation and documentation updates. When job losses are on the horizon, and teams are running on empty, just sparing the staff to carry out the test may not be an option.

Some DR processes have an even higher cost due to bad design, and can only be carried out at the expense of uptime. Physical servers sometimes need to be moved, or shut down, to carry out all the environment or application testing. Some business continuity advisors get it right and ask service providers to ‘bundle’ test invocations into the service contract. That is fine as far as it goes, but it still frequently does not account for hidden costs such as staff time, transport, and documentation updates.

It seems fair then to reduce or postpone test invocations as part of a budget cutting directive, but at what cost? When times are good, and business is booming, cash flow is rarely a problem. IT budgets increase as stakeholders recognise the need for business continuity plans and related insurance strategies. In reality, during such times, the organisation may be able to recover from the impact of a couple of days’ IT downtime. Sure, some customers will switch to your competitors, and some of those will never come back, but your order book and cash flow will be strong enough to carry the business through. In contrast, during a recession, when order books are small and cash flow is tight, the same period of IT downtime, and the resultant loss of business, could be the straw that breaks the camel’s back. Hence, economic recession makes a working business continuity plan even more crucial.

Some service providers, like virtualDCS (but there are others), have engaged with their customers to find a solution to this dichotomy. It is possible, given the right approach, to leave the invocation process to the service provider. The service provider maintains a detailed documentation process, and provides both the equipment and the manpower to invoke the solution independently, with no impact on the client’s live running IT operation, or the team supporting it. Once the solution is fully invoked, the business can carry out specific application tests, before leaving the service provider to dismantle the invocation test again, and update the documentation.

This sounds like a shift to wholly outsourcing the disaster recovery solution to a service provider, and it is. It also sounds very expensive, but it isn’t. The recovery team at virtualDCS (I can’t speak for other service providers) performs test invocations every day of the year. Fortunately live invocations are rare, but test invocations happen on a regular basis. Because the test invocation is a routine action, and highly automated, the costs are kept small and, more importantly, included in the contract. With contracts starting at around £50/week for a server with 60GB of data, and an achievable Recovery Point Objective (RPO) of near zero, why would you do it yourself?
