Case of the high disk activity

A Blog by Ben Lavender

So it’s finally Friday here and I was driving to the office this morning when I thought about another idea for a tech-blog that I could share with our customers.

Now whenever I get stuck for ideas on where to begin with a technical issue, I always remember Mark Russinovich quoting David Solomon in one of his Sysinternals TechEd sessions: “When in doubt, run Process Monitor!” By the way, I suggest watching some of those sessions, as you’ll learn a whole lot about Windows internals. I’ll link in some stuff from him at the end of this blog.

So what was the issue? I was asked by my MD to look at a web server running Apache 2.2.22 on Windows Server 2008 R2. A certain process was hitting the disk quite heavily and slowing the site’s rendering speed. We already knew about the high disk write counts from a quick look using perfmon.exe.

As a starting tool on Windows I also use Performance Monitor (perfmon.exe) and Resource Monitor (perfmon.exe /res) and I particularly enjoy the Data Collector Sets (DCS) that you can use as filters, but I’m more of a Sysinternals guy when it comes to analysis.

*You can use the below perfmon.exe DCSs if you want to look deeper into Disk I/Os:

[Screenshot: perfmon.exe Data Collector Sets for disk I/O]

*Also generally looking at perfmon.exe /res “Resource Monitor” live screens will give a good idea of the activity:

[Screenshot: Resource Monitor live view]

Now to look deeper. I’ll show my way of using different tools at different stages to solve these issues, mostly with Sysinternals tools.

I started Process Explorer (procexp.exe) and added in the below fields:

[Screenshot: Process Explorer Select Columns dialog]

I then sorted by Disk Writes and looked at the associated process. Note that this isn’t an actual screenshot from the server, as I only saved the .pml file from the procmon.exe trace:

[Screenshot: Process Explorer sorted by Disk Writes]

At that point I knew from the procexp.exe results that one of the httpd.exe processes, a child of the primary httpd.exe process, was writing a lot of new files to the directory below:

[Screenshot: directory the httpd.exe child process was writing to]

Next I moved to Process Monitor (procmon.exe), probably the most widely used Sysinternals tool. As I knew which process to look for, I ran a trace for 5 minutes, stopped it, filtered by PID 544 and then added filters to include Operation=Is=WriteFile and Operation=Is=ReadFile. This proved that the process was writing .php files in batches to the \public directory, but there was still a lot of information to wade through. So I cleared those filters and set Path=Contains=\Apache2.2\htdocs\templates_c\public to see all results for that path. Up popped httpd.exe PID 544 again, along with the usual explorer.exe process.
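For anyone who prefers to crunch a trace outside the Process Monitor GUI, the same filtering idea can be sketched in Python against a trace saved in CSV format. This is only a rough sketch: it assumes the export carries Process Monitor’s default column headers (Process Name, PID, Operation, Path), the helper name count_writes is hypothetical, and the sample rows (including the C: drive letter and PID 1200) are made up for illustration rather than taken from the real trace.

```python
import csv
import io
from collections import Counter


def count_writes(csv_text, path_fragment):
    """Count WriteFile events per (process name, PID) under a path fragment.

    csv_text is the content of a Process Monitor CSV export (assumed to use
    the default column headers). path_fragment is matched case-insensitively.
    """
    counts = Counter()
    fragment = path_fragment.lower()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Operation"] != "WriteFile":
            continue  # same effect as the Operation=Is=WriteFile filter
        if fragment not in row["Path"].lower():
            continue  # same effect as the Path=Contains=... filter
        counts[(row["Process Name"], row["PID"])] += 1
    return counts


# Made-up sample rows in the assumed export layout (not the real trace).
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"10:00:01.0000000","httpd.exe","544","WriteFile","C:\Apache2.2\htdocs\templates_c\public\a1.php","SUCCESS","Offset: 0, Length: 512"
"10:00:01.1000000","httpd.exe","544","WriteFile","C:\Apache2.2\htdocs\templates_c\public\a2.php","SUCCESS","Offset: 0, Length: 512"
"10:00:01.2000000","httpd.exe","544","ReadFile","C:\Apache2.2\conf\httpd.conf","SUCCESS","Offset: 0, Length: 128"
"10:00:02.0000000","explorer.exe","1200","QueryDirectory","C:\Apache2.2\htdocs\templates_c\public","SUCCESS",""
'''

if __name__ == "__main__":
    for (name, pid), total in count_writes(SAMPLE, r"\templates_c\public").most_common():
        print(name, pid, total)
```

Sorting by count puts the noisiest writer at the top, which is the same question the Disk Writes column in Process Explorer answers, only scoped to one directory.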

You can see the quick succession of file creations at regular intervals, though the reason it was creating files in these directories was still unknown. To confirm our suspicion that httpd.exe PID 544 was creating the files, we stopped the httpd.exe service (not recommended if you’ve got a popular website, obviously) to see whether the file creations stopped, and they did.

So it was clearly this process that was creating the files, but why exactly? I decided to look at the call stack of one of the WriteFile events to see whether any .dlls being called were out of the ordinary or suspicious. Funnily enough, I found php5ts.dll:

[Screenshot: Process Monitor call stack of a WriteFile event showing php5ts.dll]

Notice I haven’t configured symbols, but the trace pointed to the suspect straight away. Looking at the trace modules, I can also see the particular .dll and its location:

[Screenshot: Process Monitor trace modules showing php5ts.dll and its location]

So now we knew that a PHP script was potentially creating the files. It might have been clearer if I’d configured symbols so I could see the functions, but I’d found enough info. We contacted the web developers and asked whether they’d configured any PHP scripts that would create batches of these files. They advised that there was such a script, which should have been disabled in the production environment but might not have been. This particular script creates log files in .php format every time a user visits the site, and of course it was enabled. Once it was disabled, the file creations stopped and the disk I/O returned to acceptable levels.

Cisco VPN Client – Decrypted: 0 woes

For a long time I have used Cisco VPN client on my Windows 7 computers.  I use it to provide IPSec VPN tunnels to Cisco ASA firewalls and it works well enough for me to not resort to ShrewSoft.

Until today.

I wasted about an hour trying to work out why my VPN session would establish but not decrypt any packets.  Sending encrypted packets was fine, but I got nothing back.  It didn’t matter which ASA I was connecting to, so I figured this was a client issue.

Long story short – the Cisco VPN client will do this if you have more than one IP address assigned to your local LAN interface.  I had added a second to configure an access point earlier in the week, and left it in place without considering it could affect the VPN client.  After removing this second IP address, the session traffic traversed the tunnel as normal.

vCheck syslog plugin update

I regularly use Alan Renouf’s excellent vCheck PowerShell utility to help me manage and maintain some sort of order with my ESXi hosts.

Unfortunately the good people at VMware are charging ahead advancing the features of vSphere, which means that some useful PowerCLI commands are deprecated from time to time. This can break some vCheck plugins, and hence the authors are often pestered for updates to support the newer versions of ESXi.

I am in the process of validating plugins which are broken, and adapting them to support new releases whilst still having backwards compatibility. Of course I am sharing this info with the original authors, who no doubt can code a little prettier than me, but at least I have an interim solution. Anyway, here is my first one, which addresses the new way in which the syslog server details are queried on ESXi 5.x, based upon the good work of Jonathan Medd’s original plugin:

# Start of Settings
# The Syslog server which should be set on your hosts
$SyslogServer = "syslog.domain.local"
# End of Settings

$ESXiSyslog = @()
# Hosts older than 5.0: query the syslog server with Get-VMHostSysLogServer
$ESXiSyslog += $VMH | Where { $_.Version -lt "5.0.0" } | Where {$_.ConnectionState -eq "Connected" -or $_.ConnectionState -eq "Maintenance"} | Select Name, @{Name='SyslogServer';Expression={($_ | Get-VMHostSysLogServer).Host}} | Where-Object {$_.SyslogServer -ne $SyslogServer}
# ESXi 5.x hosts: read the Syslog.global.logHost advanced setting instead
$ESXiSyslog += $VMH | Where { $_.Version -ge "5.0.0" } | Where {$_.ConnectionState -eq "Connected" -or $_.ConnectionState -eq "Maintenance"} | Where {$_.ExtensionData.Summary.Config.Product.Name -match "i"} | Select Name, @{Name="SyslogServer";Expression={(Get-VMHost $_.Name | Get-VMHostAdvancedConfiguration -Name Syslog.global.logHost).Values}}

# Report only the hosts whose syslog server is wrong or empty
$Result = @($ESXiSyslog | Where { $_.SyslogServer -ne $SyslogServer})
$Result

$Title = "Hosts with incorrect or empty Syslog Server defined"
$Header = "Hosts with incorrect or empty Syslog Server defined : $(@($Result).count)"
$Comments = "The following hosts do not have the correct Syslog settings which may cause issues if ESXi hosts experience issues and logs need to be investigated"
$Display = "Table"
$Author = "John Murray based on original scripts from Alan Renouf & Jonathan Medd"
$PluginVersion = 1.2
$PluginCategory = "vSphere"

 

 

vCloud Director 101

I decided to write this blog (read it in reference to my slideshow) to give you guidance on the complex terminology of VMware vCloud Director. I will refer to it as vCD from now on to save my poor fingers.

If you’re a vSphere admin, vCD terminology will look very different, as it uses new terms to label its layers. One way to imagine this is as an onion: as you peel away the layers you get to the core. vCD is an abstraction layer above your infrastructure. It hides all the bits and pieces your users don’t need to see, and you don’t want them to mess around with!

Massimo, whose blog is worth checking out for all things vCD, wraps it up in this quote:

“Think about how difficult it is to implement something that allows and end-user to create, in self-service mode, separate layer 2 network segments, define custom layer 3 IP policies, configure services such as DHCP, NAT and Firewall… all without having to ask the vSphere / cloud administrator to do all that for you, all without messing up with the cloud-wide setup, all without causing conflicts with the other tenants on the cloud. This is a titanic effort, believe me.”

This blog will not go through the install of vCD, as that is beyond the scope of this article, but have a look at Kendrick Coleman’s blog site, as he has a fantastic walkthrough of a vCD install. Now let’s tackle the terminology you need to understand. The wizard prompts you for these terms once installation has completed and you’re ready to create your first tenant.

So, as my vCD slide outlines, what is vCloud Director? It’s the wrapper around your vSphere infrastructure: it hides the complex bits and automates the creation of VMs and networks without admin intervention.

What is a vCD Cell?

An instance of vCloud Director

Can be scaled by adding multiple cells behind a load balancer

Scales up to 10,000 VMs and 25 vCenter Servers

Creates virtual datacentres by pooling resources into new units of consumption

Secures and Isolates users with vShield, LDAP and RBAC with policies

Components of a vCD Deployment

Min 2xESXi hosts vSphere ENT or ENT+

No Enterprise Plus licence means No vCDNI networking

Shared Storage for DRS of hosts

vCenter

vCloud Director (VM)

Embedded or remote DB

AD / LDAP Directory

vShield Manager VM

vShield Edge VMs (automatically deployed on ESXi hosts)

vApps, deployed on ESXi hosts

Optional Components

VMware Chargeback

Meter the consumption of VM’s, networks etc., and bill them.

vCloud Connector

Connect Private Clouds to public, makes the interchange of VMs across clouds seamless.

vCD Logical Terminology:

Provider virtual Data Centre (PvDC): A logical grouping of vSphere compute and storage resources where all resources are equal (some clouds may have tiers with platinum/gold/silver)

Organisation: A unit of administration with its own users, groups, policies, and catalogues. An Org has its own security boundary. These are ‘tenants’.

Organisation vDC: A logical grouping of resources from one or more provider vDCs, enabling different performance, SLA, and cost options to be available in the same organisation.

Recommendations

Allocate at least one vCloud Director (Cell) for each vCenter server

Configure the vCloud Director database. The VMware appliance uses an embedded Oracle DB and is for testing purposes only, not for production (16GB RAM, 100GB storage, 4 vCPUs)

Read the vCAT documentation to see how VMware recommends building vCloud Director.

Recommended Configuration

Create 2 Clusters, 1 for management and 1 for resource, you don’t want your new cloud to be consuming resources before you have even installed any tenants onto it yet would you?

Create all the VMs needed for management in the management cluster.

Layers of Networking

Customer/Tenant/Organisation Network Layer (Completely Dynamic – No configuration by the customer)
————————————————————————————–

vCloud Director Network Layer

Maps to components of vSphere layer and physical layer

vSphere Network Layer

vSwitches, Port Groups etc. (must be stable and static)

Physical Network Layer

Switches/routers and IP’s etc. (must be stable and static)

vCD Networking Terms:

External Network

The vCD inner networking component is called External Networks. If you want your Organization (and in turn your vApps) to have connectivity to the external world you need to have External Networks. As the word implies, these are networks that are managed by someone that is typically external to the vCD environment and are identified by a vSphere Port Group. That’s in fact what you do when you create a vCD External Network: you point to an existing vSphere Port Group. Essentially you are telling vCloud Director that there is a Port Group that is able to provide external connectivity to your cloud environment. The typical example is a Port Group with VLAN 233 (for instance) which can support native Internet traffic. For naming convention you will be calling this External Network something like Internet or Ext-Net-Internet. I usually suggest naming the vCD External Network after the vSphere Port Group for ease of tracking.

• Connects vCD to the outside world

• Based on a vSphere port group

NOTE

When you create the port group on the dvSwitch, it is recommended to edit its settings to make the ports Ephemeral (no limit on ports)

Organisational Network

External Networks are easy. With Organization Networks things start to become more “interesting”. In the previous section we have created cloud-wide external connectivity (i.e. External Networks). Now we are zooming inside an Organization. An Organization (or Org) is a logical construct within vCD that describes a tenant or a customer. Cloud end-users are defined inside each Organization.

• A virtual network for tenants / customers

• Communicate with each other and access the internet

• Require an External network, network pool or both

The 3 Types of Org network a tenant can have are:

• External Organisational network: Direct

• External Organisational network: NAT-routed

• Internal Organisational network (private)

3 types of network pools you can allocate to tenants:

VLAN Backed (flexible, no special MTU settings, requires a lot of VLAN management)

Network Isolation Backed (vCDNI – no VLAN ranges to track, must change MTU / mac-in-mac encapsulation)

vSphere Port Group Backed (Standard and Distributed, no auto network deployment – most work involved)

Ideally you should use vCDNI so that everything is automated, but you will need an Enterprise Plus licence for this feature. Also make sure that the MTU is set higher than 1500 at the physical switch level, ESXi host level and vCenter Server level. You can go as high as 9000 without causing problems.

Network Pools

At this point you may have an overall understanding of what a Network Pool is and why it is used. In summary it is a small CMDB that contains layer 2 segments available to vCD administrators and end-users. Note that Network Pools need to be created before we start deploying the actual networks we have described above (with the exception of the External Networks because they don’t use Network Pools).

So far we kept referring to a “layer 2 segment” as a Port Group with an associated VLAN id. This is correct but it doesn’t tell the whole story. There are really three different types of Network Pools one can create.

VLAN-backed Network Pools: this is the easiest to get. You can, for example, create a Network Pool and give it a range of VLAN ID 100 to 199. Whenever you grab one of these IDs because you need to deploy a new layer 2 segment, vCD will tell vCenter “please create on the fly a Port Group, and give it VLAN ID 100″. The next time there is a need for another layer 2 segment vCD will tell vCenter “please create on the fly a Port Group, and give it VLAN ID 101″. And so on. Of course if one of these networks is destroyed during the lifecycle of the cloud, the corresponding VLAN ID gets put back into the pool of available networks to be deployed.

Port Group-backed Network Pools: it is similar to the VLAN-backed. The difference is that the Port Groups need to be pre-provisioned on the vSphere infrastructure and they need to be imported into vCloud Director. So vCD won’t tell vCenter to create these on the fly, they are already there pre-provisioned. Why using this? Well there are some circumstances where vCenter cannot easily (programmatically) create Port Groups on the fly. This is the case when you use vSphere Standard Switches (as opposed to Distributed Switches) or when you use the Nexus 1000v (at the moment vCD cannot manipulate programmatically Port Profiles).

vCloud Director Network Isolation Network Pools: This is when things start to get interesting (again). We use a technique called Mac-in-Mac to create layer 2 separated networks without using VLANs. Yeah that’s right. This is extremely useful for big environments where VLAN management is problematic, either because there is a limited number of VLANs available or because keeping track of VLANs is a big management overhead (especially if you use an Excel spreadsheet to do that).

When you create such a Network Pool you only specify how many of these layer 2 networks you want this Network Pool to have and you are done. When vCD starts to deploy Port Groups from this Network Pool you won’t see any VLAN associated to them but they are indeed different layer 2 segments.
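The VLAN-backed behaviour described above (hand out an ID from the range, return it to the pool when the network is destroyed) can be sketched in a few lines of Python. This is purely an illustration of the bookkeeping, not how vCD is implemented: the class name VlanPool and its methods are hypothetical, and handing out the lowest free ID first is an assumption based on the 100-then-101 example.

```python
import bisect


class VlanPool:
    """Toy model of a VLAN-backed Network Pool (illustrative only)."""

    def __init__(self, first_id, last_id):
        # Every ID in the configured range starts out free, kept sorted.
        self._free = list(range(first_id, last_id + 1))
        self._in_use = set()

    def allocate(self):
        """Take the lowest free VLAN ID for a new layer 2 segment."""
        if not self._free:
            raise RuntimeError("network pool exhausted")
        vlan_id = self._free.pop(0)
        self._in_use.add(vlan_id)
        return vlan_id

    def release(self, vlan_id):
        """Return a VLAN ID to the pool when its network is destroyed."""
        if vlan_id not in self._in_use:
            raise ValueError("VLAN %d is not allocated from this pool" % vlan_id)
        self._in_use.remove(vlan_id)
        bisect.insort(self._free, vlan_id)


if __name__ == "__main__":
    pool = VlanPool(100, 199)
    first = pool.allocate()   # 100
    second = pool.allocate()  # 101
    pool.release(first)       # network destroyed, 100 goes back in the pool
    print(first, second, pool.allocate())
```

A Port Group-backed pool would follow the same pattern with pre-provisioned port group names in place of VLAN IDs, and a vCDNI pool only needs a count of networks rather than a range to track.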

Now the acronym VCD-NI and the labels Preprovisioned and Created-on-the-fly in the pictures above should make more sense to you. Try to go back and have a look at them again.

Virtual Machines IP management

First of all, note that you cannot connect a vNIC to an External Network directly. You can, however, connect the vNIC to either an Organization Network or a vApp Network.

Now the question is: what happens when you connect a vNIC to either an Organization Network or a vApp Network? How do you control the layer 3 behaviour? As we said, you have a choice of connecting each vNIC of the VM to an Organization Network, a vApp Network or leave the vNIC not connected.

Reference URLs

Massimo
http://it20.info/2010/09/vcloud-director-networking-for-dummies/

Duncan Epping – Creating a vCD Lab on your Mac/Laptop
http://www.yellow-bricks.com/2010/09/13/creating-a-vcd-lab-on-your-maclaptop/

Chris Colotti – VMware vCloud “In a Box” for your Home Lab
http://www.chriscolotti.us/vmware/vsphere/vmware-vcloud-in-a-box-for-your-home-lab/

vCloud networking explained in 1 slide and 52 animations
http://www.ntpro.nl/blog/archives/2024-vCloud-networking-explained-in-1-slide-and-52-animations.html

vSphere vDR 1.2 LVM limitation and workaround

One of our users at virtualDCS was recently experiencing problems recovering data using the vSphere ‘VMware Data Recovery’ (vDR) release 1.2 within their CentOS Linux VM using the ‘File Level Recovery’ (FLR) tool.

I won’t go into detail with regards to describing the tool, as it has been expertly described here, and documented here already.

Although the vDR appliance was reporting successful backups, the FLR utility was not mounting all the partitions when accessing a selected restore point. The virtual machine in question was running CentOS 5.4 32bit, and had just a single vmdk but had a specific partition layout which caused issues with vDR.

The disk was configured with a small /boot partition and two larger LVM partitions as follows:

[root@localhost ~]#fdisk -l

Disk /dev/sda: 85.8 GB, 85899345920 bytes
255 heads, 63 sectors/track, 10443 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

 Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          13      104391   83  Linux
/dev/sda2              14        1305    10377990   8e  Linux LVM
/dev/sda3            1306       10443    73400985   8e  Linux LVM

It was these two 8e LVM partitions that vDR had an issue with.

The standard vDR FLR mount looked ok, but only recovered the non-LVM partition on /dev/sda1 as follows:

[root@localhost /opt/vdr/VMwareRestoreClient]#./VdrFileRestore -a 172.16.1.10

(98) "Mon Feb  7 00:09:09 2011"
(99) "Tue Feb  8 02:53:03 2011"

Please input restore point to mount from list above
98
Created "/root/2011-02-07-00.09.09/Mount1"

Restore point has been mounted...
"/vcenter.domain.homelab/Datacentre One/host/Clus1/Resources/CustomerClone/CusClone1/CusClone1-WEB-1"
root mount point -> "/root/2011-02-07-00.09.09"

Please input "unmount" to terminate application and remove mount point

In order to see the details of the problem you need to run the FLR tool in verbose mode with the -v switch as follows:

[root@localhost /opt/vdr/VMwareRestoreClient]#./VdrFileRestore -a 172.16.1.10 -v

(98) "Mon Feb  7 00:09:09 2011"
(99) "Tue Feb  8 02:53:03 2011"

Please input restore point to mount from list above
98
findRestorePointNdx: searching for 98
Restore Point 98 has been found...
"/vcenter.domain.homelab/Datacentre One/host/Clus1/Resources/CustomerClone/CusClone1/CusClone1-WEB-1"
"/SCSI-0:2/"
"Mon Feb  7 00:09:09 2011"

Initializing vix...
VixDiskLib: config options: libdir '/opt/vdr/VMwareRestoreClient/disklibpluginvcdr', tmpDir '/tmp/vmware-root'.
VixDiskLib: Could not load default plugins from /opt/vdr/VMwareRestoreClient/disklibpluginvcdr/plugins32/libdiskLibPlugin.so: Cannot open library: /opt/vdr/VMwareRestoreClient/disklibpluginvcdr/plugins32/libdiskLibPlugin.so: cannot open shared object file: No such file or directory.
DISKLIB-PLUGIN : Not loading plugin /opt/vdr/VMwareRestoreClient/disklibpluginvcdr/plugins32/libvdrplugin.so.1.0: Not a shared library.
VMware VixDiskLib (1.2) Release build-254294
Using system libcrypto, version 9080CF
VixDiskLib: Failed to load libvixDiskLibVim.so : Error = libvixDiskLibVim.so: cannot open shared object file: No such file or directory.
Msg_Reset:
[msg.dictionary.load.openFailed] Cannot open file "/etc/vmware/config": No such file or directory.
----------------------------------------
PREF Optional preferences file not found at /etc/vmware/config. Using default values.
Msg_Reset:
[msg.dictionary.load.openFailed] Cannot open file "/usr/lib/vmware/settings": No such file or directory.
----------------------------------------
PREF Optional preferences file not found at /usr/lib/vmware/settings. Using default values.
Msg_Reset:
[msg.dictionary.load.openFailed] Cannot open file "/usr/lib/vmware/config": No such file or directory.
----------------------------------------
PREF Optional preferences file not found at /usr/lib/vmware/config. Using default values.
Msg_Reset:
[msg.dictionary.load.openFailed] Cannot open file "/root/.vmware/config": No such file or directory.
----------------------------------------
PREF Optional preferences file not found at /root/.vmware/config. Using default values.
Msg_Reset:
[msg.dictionary.load.openFailed] Cannot open file "/root/.vmware/preferences": No such file or directory.
----------------------------------------
PREF Failed to load user preferences.
DISKLIB-LINK  : Opened 'vdr://vdr://vdrip:1.1.1.40<>vcuser:<>vcpass:<>vcsrvr:<>vmuuid:<>destid:39<>sessdate:129415109490000000<>datastore:P4500-DS05<>vmdk_name:CusClone1-WEB-1.vmdk<>oppid:4499' (0x1e): plugin, 167772160 sectors / 80 GB.
DISKLIB-LIB   : Opened "vdr://vdr://vdrip:1.1.1.40<>vcuser:<>vcpass:<>vcsrvr:<>vmuuid:<>destid:39<>sessdate:129415109490000000<>datastore:P4500-DS05<>vmdk_name:CusClone1-WEB-1.vmdk<>oppid:4499" (flags 0x1e, type plugin).
DISKLIB-LIB   : CREATE CHILD: "/tmp/flr-4499-w4U6cS" -- twoGbMaxExtentSparse grainSize=128
DISKLIB-DSCPTR: "/tmp/flr-4499-w4U6cS" : creation successful.
PREF early PreferenceGet(filePosix.coalesce.enable), using default
PREF early PreferenceGet(filePosix.coalesce.aligned), using default
PREF early PreferenceGet(filePosix.coalesce.count), using default
PREF early PreferenceGet(filePosix.coalesce.size), using default
PREF early PreferenceGet(aioCusClone1r.numThreads), using default
--- Mounting Virtual Disk: /tmp/flr-4499-w4U6cS ---
SNAPSHOT: IsDiskModifySafe: Scanning directory of file /tmp/flr-4499-w4U6cS for vmx files.
Disk flat file mounted under /var/run/vmware/fuse/2848693010656666867
VixMntapi_OpenDisks: Mounted disk /tmp/flr-4499-w4U6cS at /var/run/vmware/fuse/2848693010656666867/flat.
Mounting Partition 1 from disk /tmp/flr-4499-w4U6cS
Created "/root/2011-02-07-00.09.09/Mount1"
MountsDone: LVM volume detected, start: 106928640, flat file: "/var/run/vmware/fuse/2848693010656666867/flat"
MountsDone: LVM volume detected, start: 10733990400, flat file: "/var/run/vmware/fuse/2848693010656666867/flat"
System: running "lvm version 2>&1"
System: start results...
File descriptor 3 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4701: sh
File descriptor 4 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4701: sh
LVM version:     2.02.46-RHEL5 (2009-09-15)
Library version: 1.02.32 (2009-05-21)
Driver version:  4.11.5
System: end results...
System: command "lvm version 2>&1" completed successfully
LoopMountSetup: Setup loop device for "/dev/loop1" (offset: 106928640) : "/var/run/vmware/fuse/2848693010656666867/flat"
LoopMountSetup: Setup loop device for "/dev/loop2" (offset: 2144055808) : "/var/run/vmware/fuse/2848693010656666867/flat"
System: running "lvm vgdisplay 2>&1"
System: start results...
File descriptor 3 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4706: sh
File descriptor 4 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4706: sh
Couldn't find device with uuid '1KPTt2-2Kya-Wk4H-MDz7-0tgJ-a82T-N6OsIX'.
--- Volume group ---
VG Name               VolGroup00
System ID
Format                lvm2
Metadata Areas        1
Metadata Sequence No  24
VG Access             read/write
VG Status             resizable
MAX LV                0
Cur LV                4
Open LV               4
Max PV                0
Cur PV                2
Act PV                1
VG Size               79.88 GB
PE Size               32.00 MB
Total PE              2556
Alloc PE / Size       2556 / 79.88 GB
Free  PE / Size       0 / 0
VG UUID               KTg9lK-J48t-P6sw-03lC-TjAX-d5n6-8qcAEx

System: end results...
System: command "lvm vgdisplay 2>&1" completed successfully
LVMFindInfo: found "VG Name" -> "VolGroup00"
System: running "env LVM_SYSTEM_DIR=/tmp/flr-4499-2kqxja/ lvm pvscan /dev/loop1 /dev/loop2 2>&1"
System: start results...
File descriptor 3 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4710: sh
File descriptor 4 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4710: sh
Couldn't find device with uuid '1KPTt2-2Kya-Wk4H-MDz7-0tgJ-a82T-N6OsIX'.
PV /dev/loop1       VG VolGroup00   lvm2 [9.88 GB / 0    free]
PV unknown device   VG VolGroup00   lvm2 [70.00 GB / 0    free]
Total: 2 [79.88 GB] / in use: 2 [79.88 GB] / in no VG: 0 [0   ]
System: end results...
System: command "env LVM_SYSTEM_DIR=/tmp/flr-4499-2kqxja/ lvm pvscan /dev/loop1 /dev/loop2 2>&1" completed successfully
System: running "env LVM_SYSTEM_DIR=/tmp/flr-4499-2kqxja/ lvm pvdisplay /dev/loop1 /dev/loop2 2>&1"
System: start results...
File descriptor 3 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4714: sh
File descriptor 4 (pipe:[1356862]) leaked on lvm invocation. Parent PID 4714: sh
Couldn't find device with uuid '1KPTt2-2Kya-Wk4H-MDz7-0tgJ-a82T-N6OsIX'.
Couldn't find device with uuid '1KPTt2-2Kya-Wk4H-MDz7-0tgJ-a82T-N6OsIX'.
No physical volume label read from /dev/loop2
Failed to read physical volume "/dev/loop2"
--- Physical volume ---
PV Name               /dev/loop1
VG Name               VolGroup00
PV Size               9.90 GB / not usable 22.76 MB
Allocatable           yes (but full)
PE Size (KByte)       32768
Total PE              316
Free PE               0
Allocated PE          316
PV UUID               Qqk2st-jiXP-k281-A1Ug-nCtM-rn0a-I8eXlX

System: end results...
System: command "env LVM_SYSTEM_DIR=/tmp/flr-4499-2kqxja/ lvm pvdisplay /dev/loop1 /dev/loop2 2>&1" failed with error 1280
LoopDestroy: Removed loop device "/dev/loop1" (offset: 106928640) : "/var/run/vmware/fuse/2848693010656666867/flat"
LoopDestroy: Removed loop device "/dev/loop2" (offset: 10733990400) : "/var/run/vmware/fuse/2848693010656666867/flat"
LoopMountSetup: LVM mounts terminating due to fatal error
VdrVixMountDone: Failed 1

Restore point has been mounted...
"/vcenter.domain.homelab/Datacentre One/host/Clus1/Resources/CustomerClone/CusClone1/CusClone1-WEB-1"
root mount point -> "/root/2011-02-07-00.09.09"

Please input "unmount" to terminate application and remove mount point

Once again, only the /boot non-LVM partition held on /dev/sda1 was mounted. You can see from the results above that the LVM mounts failed due to a fatal error.
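The byte offsets in the verbose log can be cross-checked against the fdisk output with a little arithmetic. The sketch below assumes each partition starts exactly on a cylinder boundary (which the numbers bear out), and the function name partition_offset is hypothetical.

```python
# "Units = cylinders of 16065 * 512 = 8225280 bytes" from the fdisk -l output
BYTES_PER_CYLINDER = 16065 * 512


def partition_offset(start_cylinder):
    """Byte offset of a partition starting on the given 1-based cylinder."""
    return (start_cylinder - 1) * BYTES_PER_CYLINDER


sda2_offset = partition_offset(14)    # /dev/sda2 starts at cylinder 14
sda3_offset = partition_offset(1306)  # /dev/sda3 starts at cylinder 1306

print(sda2_offset)           # 106928640, the first "LVM volume detected" start
print(sda3_offset)           # 10733990400, the second "LVM volume detected" start
print(sda3_offset % 2**32)   # 2144055808, the offset logged for /dev/loop2
```

Both detected LVM start values match the partition table, but the loop device for the second partition was set up at 2144055808, which is what 10733990400 becomes when reduced modulo 2^32. That is only an observation from the numbers in the log. It would be consistent with the offset being truncated to 32 bits somewhere, and with lvm then finding no physical volume label on /dev/loop2, but nothing in the log or in VMware’s response confirms that as the cause.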

I wasn’t sure whether there was an undocumented incompatibility with my LVM version or fuse version, so I took the easy route and logged a call with VMware Support (SR# 1589684961). After eliminating the obvious, the ticket was escalated to a Research Engineer who was excellent (aren’t they all?). He told me that VMware was aware of an issue with multiple LVM partitions and was expecting to include a fix in an upcoming release of vDR.

That was great, but my customer needed to ensure his backup process allowed FLR restores. I had to find a workaround that could be implemented without requiring a reboot as the virtual machine in question was aiming for 100% uptime.

My plan was to add a new vmdk to the VM, migrate the data off the two existing LVM partitions, remove them both, then create a single LVM partition on the original disk and migrate the data back, before removing the temporary disk.

This is the procedure I used:

***hot add new 80GB thin SCSI disk as SCSI0:1

echo "scsi add-single-device" 0 0 1 0 > /proc/scsi/scsi

***partition the new disk

fdisk /dev/sdb
n
p
1

Accept first and last cylinders to use all space

***format partition as LVM type 8e

t
1
8e
w

***prepare the new partition for LVM

pvcreate /dev/sdb1

***add the partition to the existing LVM Volume Group

vgextend VolGroup00 /dev/sdb1

***move the data off /dev/sda2 and /dev/sda3

pvmove /dev/sda2 /dev/sdb1
pvmove /dev/sda3 /dev/sdb1

***remove /dev/sda2 and /dev/sda3 from the VolGroup

vgreduce VolGroup00 /dev/sda2
vgreduce VolGroup00 /dev/sda3

***unprepare the original partitions

pvremove /dev/sda2
pvremove /dev/sda3

***delete the original partitions and create a single new bigger one

fdisk /dev/sda
d
2
d
3
n
p
2

Accept first and last cylinders to use all space

t
2
8e
w

***instead of rebooting to recognise the partition you can just run

partprobe

I didn’t have parted installed, so before I could probe the partitions, I had to run

yum install parted

***prepare the new partition for LVM:

pvcreate /dev/sda2

***add the partition to the existing Vol Group

vgextend VolGroup00 /dev/sda2

***next move the data back off /dev/sdb1

pvmove /dev/sdb1 /dev/sda2

***remove the temp disk from the LVM Volume Group

vgreduce VolGroup00 /dev/sdb1

***unprepare the partition

pvremove /dev/sdb1

***delete the partition

fdisk /dev/sdb
d
1

***remove the temporary disk from the virtual machine using vcenter then finally:

echo "scsi remove-single-device" 0 0 1 0 > /proc/scsi/scsi

I have no idea if the above will be of use to anyone, so please let me know if you find it helpful in any way. The new version of vDR will include a new version of the FLR tool anyway so let’s hope the issue is resolved in that.

SMTP, ESMTP, and the BDAT baddie

I recently had to troubleshoot a problem with an external SMTP service which was having difficulty delivering mail to our corporate mail server. The delivering server was running Windows 2003 Standard and using the built-in Simple Mail Transfer Protocol (SMTP) service from IIS 6.0. The receiving server was running Windows Server 2008 with MS Exchange Server 2007 SP2.

Basically messages were not being received reliably.  Some came through and some didn’t.  The Message Tracking logs on Exchange 2007 didn’t yield much useful information, but before I turned up the logging level for the transport role, I took a look at the sending mail system.

Within C:\Windows\System32\LogFiles\SMTPSVC1 I found the most recent log file which recorded the following basic data around the failed email transmission:

22:15:27 172.16.1.10 - - 0
22:15:27 172.16.1.10 EHLO - 0
22:15:27 172.16.1.10 - - 0
22:15:27 172.16.1.10 MAIL - 0
22:15:27 172.16.1.10 - - 0
22:15:27 172.16.1.10 RCPT - 0
22:15:27 172.16.1.10 - - 0
22:15:27 172.16.1.10 BDAT - 0

I already knew that many security appliances do not like the new ESMTP BDAT command, so I Googled around and found this JoeKiller article which shed a little light on the subject, and that it was possible to force the session to not use the BDAT command at all.

If you telnet to the service with ‘telnet localhost 25’ and type ‘ehlo’, the SMTP service lists the ESMTP verbs that it supports.

I knew I needed to remove BINARYMIME and CHUNKING; however, little was mentioned regarding the exact steps to take, which in turn prompted this post.
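
If you would rather check the EHLO reply in a script than by eye, something along these lines would do. It is a minimal sketch: the function name is hypothetical, and the sample reply is purely illustrative (it was not captured from the server in this post). It only relies on the standard ESMTP convention that each line after the greeting is "250-KEYWORD" or "250 KEYWORD".

```python
UNWANTED = {"BINARYMIME", "CHUNKING"}


def advertised_verbs(ehlo_reply):
    """Return the set of ESMTP keywords found in a multi-line 250 reply."""
    verbs = set()
    # The first line is the greeting, so keywords start on the second line.
    for line in ehlo_reply.splitlines()[1:]:
        line = line.strip()
        if len(line) < 5 or not line.startswith("250"):
            continue
        # Skip the "250-" or "250 " prefix; the keyword is the first word after it.
        parts = line[4:].split()
        if parts:
            verbs.add(parts[0].upper())
    return verbs


# Illustrative reply only; host name and verb list are invented for the example.
sample_reply = """250-mail.example.test Hello [127.0.0.1]
250-SIZE 2097152
250-PIPELINING
250-BINARYMIME
250-CHUNKING
250 8BITMIME"""

still_advertised = sorted(advertised_verbs(sample_reply) & UNWANTED)
print("Verbs still to remove:", still_advertised)
```

Run against the real reply before and after the metabase change, the list should go from both verbs to empty.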

Fortunately, I already had the IIS 6.0 Resource Kit installed, so I was quick to find the SmtpInboundCommandSupportOptions value by opening the IIS Metabase Explorer and navigating to LM\SmtpSvc\1.

Here the default value was 7697601.  I knew that I wanted to disable the BINARYMIME and CHUNKING verbs, so using the table here I subtracted 2097152 (BINARYMIME) and 1048576 (CHUNKING) from 7697601:

7697601-2097152-1048576 = 4551873
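
Because the value is a bitmask, the same result can be reached by clearing the two bits rather than subtracting, which has the advantage of being safe to repeat if a bit is already off. A small sketch follows, using only the numbers quoted in this post. The helper name is made up, and the meaning of bit value 2 in the outbound setting is my inference from the 7 to 5 change described further down, not something taken from the documentation.

```python
BINARYMIME = 2097152  # 0x200000, value quoted in the post
CHUNKING = 1048576    # 0x100000, value quoted in the post


def clear_flags(value, *flags):
    """Return value with each of the given flag bits switched off."""
    for flag in flags:
        value &= ~flag
    return value


default_inbound = 7697601
new_inbound = clear_flags(default_inbound, BINARYMIME, CHUNKING)
print(new_inbound)  # 4551873, matching the subtraction above

# Clearing is idempotent; subtracting twice would corrupt the value.
assert clear_flags(new_inbound, BINARYMIME, CHUNKING) == new_inbound

# Outbound setting: going from 7 to 5 clears bit value 2 (assumed to be the BDAT bit).
new_outbound = clear_flags(7, 2)
print(new_outbound)  # 5
```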

I then set the SmtpInboundCommandSupportOptions value to 4551873, closed the IIS Metabase Explorer and restarted the IIS Admin Service (which in turn restarts the Simple Mail Transfer Protocol (SMTP) service).  Now the server only advertises and uses the following verbs:

The next step was to stop outbound SMTP mail from using the BDAT command as well.  Back in the IIS Metabase Explorer, I changed the value of SmtpOutboundCommandSupportOptions from 7 to 5.

Job done. Now I have a more firewall-friendly mail host.

HowTo: Quiesced snapshots of Forefront TMG virtual machines

I was recently asked to look into a problem a client was having with his vSphere vDR backup routines. All guest machines were being successfully backed up apart from one.

The virtual machine that was failing was the proxy server running MS Forefront TMG on a Windows 2008 R2 guest.  The backup reported that a snapshot had failed with the error:

 "Failed to create snapshot for proxy, error -3960 ( cannot quiesce virtual machine)"

I started by taking a manual quiesced snapshot to test it outside of vDR:

Sure enough, this produced a similar error:

"Cannot create a quiesced snapshot because the create snapshot operation exceeded the time limit for holding off I/O in the frozen virtual machine."

Naturally, I began by scouring the forum posts at http://communities.vmware.com but found a whole range of posts regarding quiesced snapshots, all of which I found rather confusing.  All the virtual machines on this system were built from the same template.  All therefore had the same VMware Tools installed in the same way, yet only one was having this issue.

Some of the knowledge base articles looked interesting, but didn’t get me any closer to the solution either: http://kb.vmware.com/kb/1009073 and http://kb.vmware.com/kb/1007696

In the end, I decided to investigate the TMG services to see if they were the reason VSS was failing.  When all TMG services were stopped, the snapshots worked with no errors!  I then began selectively stopping them to see which one was causing the problem, and found my culprit: the ISASTGCTRL service.  This service, described here by Marc Grote, is used to store the TMG configuration in AD LDS (Active Directory Lightweight Directory Services).  When the service is running, snapshots fail, and when it is stopped, they succeed.

In order to allow quiesced snapshots to be taken, I had to create a freeze and thaw script procedure as follows:

Within the guest operating system, I created the folder C:\Program Files\VMware\VMware Tools\BackupScripts.d\. This folder is not created by default when VMware Tools is installed, but it is required if you want to add pre-snapshot and post-snapshot scripts as I did.  Within this folder I created a text file called vss.bat with the following contents:

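@echo off
rem VMware Tools calls this script with freeze, thaw or freezeFail as the first argument
if /i "%~1" == "freeze" goto freeze
if /i "%~1" == "thaw" goto thaw
if /i "%~1" == "freezeFail" goto freezeFail
rem Missing or unknown argument: exit rather than fall through and stop the service
exit

:freeze
rem Stop the TMG storage service before the snapshot is taken
net stop "ISASTGCTRL" /Y
exit
:thaw
rem Restart the service once the snapshot has completed
net start "ISASTGCTRL" /Y
exit
:freezeFail
rem The freeze step failed, so make sure the service is running again
net start "ISASTGCTRL" /Y
exit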

Hence, when VMware Tools is asked for a quiesced snapshot, it first checks this folder for any scripts to run. Before the snapshot is taken, it passes the argument ‘freeze’ to the script, and when the snapshot is finished, it passes the argument ‘thaw’. In doing so, the script stops and then restarts my problematic service.

Now when the scheduled VMware vDR process runs, the appliance is able to take a quiesced snapshot successfully.  Happy days.