Friday, 30 December 2011

Logical Disk free space alerts don’t show percent and MB free values in the alert description

I recently wrote about the new Base OS Monitoring Packs that shipped, adding many new features and fixes for monitoring the OS. You can read more about that new release HERE. While this MP update contained many fixes and new features which are VERY beneficial in making alerts more actionable by controlling “false positives”, some of these modifications left a bit of a negative side effect.
One of the areas this new MP focused on was changing many of the “average threshold” monitors to “consecutive sample” monitors. This helps control the noise when there are short-term fluctuations in a performance value, or when a counter spikes tremendously for a very short time and skews the average. So for the most part, changing these over to consecutive samples is a good thing. That said, one of the changes made was to the Logical Disk free space monitors, both for Windows Server 2003 and 2008 disks.
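To make the difference concrete, here is a minimal Python sketch of the two evaluation styles. It is only an illustration of the idea, not the logic used inside the MP; the function names, the sample values, and the threshold of 90 are all made up for the example.

```python
def average_breach(samples, threshold, window):
    """True if the mean of the last `window` samples exceeds the threshold."""
    recent = samples[-window:]
    return len(recent) == window and sum(recent) / window > threshold


def consecutive_breach(samples, threshold, count):
    """True only if every one of the last `count` samples exceeds the threshold."""
    recent = samples[-count:]
    return len(recent) == count and all(s > threshold for s in recent)


# Hypothetical counter values: one short spike versus sustained pressure.
spiky = [10, 12, 400, 11]
sustained = [95, 96, 97, 98]

if __name__ == "__main__":
    for name, data in (("spiky", spiky), ("sustained", sustained)):
        print(name,
              "average:", average_breach(data, 90, 4),
              "consecutive:", consecutive_breach(data, 90, 4))
```

With these sample values, the single spike is enough to push the average over the threshold, while the consecutive-sample check only trips for the sustained series.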
The script used to monitor logical disk free space in previous versions of the Monitoring Pack output two additional propertybag values: free space in MB and free space in percent. These values could easily be added to the alert description, alert context, and health explorer, so the consumer of an alert notification knew precisely how much space was left for each and every alert generated. Here are some examples of how it looked previously:

image
image
image

Now – when the new MP shipped – this script was changed to support the new consecutive samples monitortype, and was completely re-written. When it was rewritten, the script no longer returned these propertybags, so they were removed from the alert description, alert context, and health explorer. The current MP (6.0.6958.0) looks like this:
image
The monitor still works perfectly as designed, and you are alerted when thresholds that you set are breached. The only negative side effect is the loss of information in the alert description.
Several customers have indicated that they preferred to have these values back in the alert description. The only real way to handle this scenario, until the signed and sealed MP gets updated at some point in the future, is to disable the built-in monitor, and enable a new monitor with an alert description that you like.
I have written two addendum MPs, attached at the bottom of this article, which do exactly that. Each contains a new monitor that is essentially an exact copy of the monitor from the previous version of the Base OS MPs, plus an override which disables the existing monitor in the sealed MP. The new monitors run once per hour and have all the default settings of the previous monitors.
With the addendum MP imported – health explorer looks like the following:
image
Note the new name for the addendum monitor, and the fact that the existing “Logical Disk Free Space” monitor is unloaded as it is disabled via override.

These addendum MPs for Windows Server 2003 and Windows Server 2008 each simply include a script datasource, monitortype, and monitor to use instead of the items in the current sealed Base OS MPs. The addendum MPs are unsealed, so you have two options:
  1. Leave them unsealed, and use them as-is. This lets you tweak the monitor names, alert descriptions, and any other settings further.
  2. Seal the MPs with your own key (recommended) after making any adjustments that you desire. This is necessary if you want to create overrides for existing groups in other MPs.

One caveat to understand: any overrides you have created on the existing Base OS free space monitors will have to be re-created on these new ones. There is no easy workaround for that.
Let me know if you have any issues using these addendum MPs (which are provided as a sample only) and I will try to address them.

Credits – to Larry Mosley at Microsoft for doing most of the initial heavy lifting writing the workaround MP.

Kevin Holman

Microsoft.Windows.Server.LogicalDisk.Addendum.zip   

Danielle Grandini

I want to follow a different approach to achieve a comparable, though not identical, result. The goal is not to modify the original code, but rather to add a diagnostic and a task to the new monitors that get the MB and % free space. The major difference from Kevin's solution is that you won’t have the data in the alert description, only in the health explorer state change context. On the other hand, you should be fairly independent of any new OS MP release.
But before digging inside the diagnostic code I want to set some points (not necessarily ordered):
  • a diagnostic is a probe that gets executed when a monitor changes its health state from healthy to warning or error. A diagnostic should not change the system state.
  • the new monitors lost the ability to report on disk free space because the MP author decided to keep the old code and then chain a filter module to change the state only if the disk stays under threshold for n (4) samples. Since there’s no generic filter module to do this in OpsMgr, the author transformed the data into performance data and then used the performance-specific filter System.Performance.ConsecutiveSamplesCondition. This highlights two annoyances:
    • the lack of generic filter modules for non-performance data
    • the need, as a workaround for this limitation, to implement persistence in every single script that requires it. The MP author should have chosen this way to implement the new monitor.
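As a rough illustration of what "implementing persistence in the script" could mean, the Python sketch below keeps a consecutive-breach counter in a small state file between runs, so the script itself can decide when n samples in a row were under threshold and still return the free space values. It is a sketch under stated assumptions, not the MP's code: the JSON state file, the function names, and the default of 4 samples (taken from the text above) are all hypothetical choices.

```python
import json
import os
import tempfile


def update_breach_count(state_path, key, breached):
    """Persist the number of consecutive breaches for `key`; return the new count."""
    try:
        with open(state_path, "r", encoding="utf-8") as fh:
            state = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # first run, or unreadable state: start over
    # A healthy sample resets the counter; a breach increments it.
    state[key] = (state.get(key, 0) + 1) if breached else 0
    with open(state_path, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    return state[key]


def is_unhealthy(state_path, key, free_pct, threshold_pct, required=4):
    """True once `required` consecutive samples were below the threshold."""
    count = update_breach_count(state_path, key, free_pct < threshold_pct)
    return count >= required


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "disk_state.json")
    for run in range(1, 6):
        print("run", run, "unhealthy:", is_unhealthy(path, "C:", 5.0, 10.0))
```

Because the script owns the counter, it can emit the percent and MB values together with the state on every run, which is the information the performance-data conversion throws away.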
But let’s return to the diagnostic stuff, we need:
  • a probe to return disk data (% free space, MB free and anything else we think can be useful)
  • a couple of diagnostics, one each for the warning and error states
  • a task, since it comes for free once we get the probe done
The net effect is the following:
image
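The figures the probe needs to return are simple to derive. The Python sketch below shows the arithmetic only; the real probe in this MP is a separate script module, and the function names and the `FreeMB` / `PercentFree` keys here are hypothetical.

```python
import shutil


def free_space_summary(total_bytes, free_bytes):
    """Return free space in whole MB and as a percentage of the volume size."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return {
        "FreeMB": free_bytes // (1024 * 1024),
        "PercentFree": round(free_bytes / total_bytes * 100, 1),
    }


def probe(path):
    """Query the volume that holds `path` and summarise its free space."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return free_space_summary(usage.total, usage.free)


if __name__ == "__main__":
    print(probe("."))
```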
Once you have the probe, the syntax for the diagnostics is as follows:
    <Diagnostics>
      <Diagnostic ID="Progel.Windows.Server.2008.LogicalDisk.FreeSpace.Error.Diagnostic" Comment="List current disk allocation." Accessibility="Public" Enabled="true"
                  Target="Win2008!Microsoft.Windows.Server.2008.LogicalDisk" Monitor="Win2008Mon!Microsoft.Windows.Server.2008.LogicalDisk.FreeSpace" ExecuteOnState="Error" Remotable="true" Timeout="300">
        <Category>Maintenance</Category>
        <ProbeAction ID="PA" TypeID="QND.Library.DiskSpaceGet.PT">
          <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
          <DiskLabel>$Target/Property[Type="Windows!Microsoft.Windows.LogicalDevice"]/DeviceID$</DiskLabel>
          <ScriptTimeout>120</ScriptTimeout>
        </ProbeAction>
      </Diagnostic>
      <Diagnostic ID="Progel.Windows.Server.2008.LogicalDisk.FreeSpace.Warning.Diagnostic" Comment="List current disk allocation." Accessibility="Public" Enabled="true"
                  Target="Win2008!Microsoft.Windows.Server.2008.LogicalDisk" Monitor="Win2008Mon!Microsoft.Windows.Server.2008.LogicalDisk.FreeSpace" ExecuteOnState="Warning" Remotable="true" Timeout="300">
        <Category>Maintenance</Category>
        <ProbeAction ID="PA" TypeID="QND.Library.DiskSpaceGet.PT">
          <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
          <DiskLabel>$Target/Property[Type="Windows!Microsoft.Windows.LogicalDevice"]/DeviceID$</DiskLabel>
          <ScriptTimeout>120</ScriptTimeout>
        </ProbeAction>
      </Diagnostic>
    </Diagnostics>
I just want to highlight that a diagnostic is state-specific, so you need two different diagnostics: one for the Error state and one for the Warning state. All the other parameters are pretty straightforward.

Network utilization scripts in BaseOS MP version 6.0.6958.0 may cause high CPU utilization

One of the changes in this newer version of the MP is the addition of a new datasource module, which runs a script to output the Network Adapter Utilization. The name of the datasource is “Microsoft.Windows.Server.2008.NetworkAdapter.BandwidthUsed.ModuleType”. This datasource module uses the timed script property bag provider, along with a generic mapper condition detection. The script name is: “Microsoft.Windows.Server.NetwokAdapter.BandwidthUsed.ModuleType.vbs”

There are 3 rules, and 3 monitors for each OS (2003 and 2008), which utilize this datasource:
  • Rules:
    • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedReads.Collection (Percent Bandwidth Used Read)
    • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedWrites.Collection (Percent Bandwidth Used Write)
    • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedTotal.Collection (Percent Bandwidth Used Total)
    • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedReads.Collection (Percent Bandwidth Used Read)
    • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedWrites.Collection (Percent Bandwidth Used Write)
    • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedTotal.Collection (Percent Bandwidth Used Total)
  • Monitors:
    • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedReads (Percent Bandwidth Used Read)
    • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedWrites (Percent Bandwidth Used Write)
    • Microsoft.Windows.Server.2008.NetworkAdapter.PercentBandwidthUsedTotal (Percent Bandwidth Used Total)
    • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedReads (Percent Bandwidth Used Read)
    • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedWrites (Percent Bandwidth Used Write)
    • Microsoft.Windows.Server.2003.NetworkAdapter.PercentBandwidthUsedTotal (Percent Bandwidth Used Total)

Only the “Total” rules and monitors are enabled by default; the Read/Write workflows are disabled out of the box by design.


The good:

This new functionality is cool because it allows us to monitor the total utilization based on the network bandwidth as a percentage of the “total pipe”, report on this, and view the data in the console:

image
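The percentage itself is a simple ratio of throughput to link speed. The Python sketch below shows the arithmetic under the assumption that throughput is sampled in bytes per second and the adapter's link speed is reported in bits per second; it is not the MP's script, and the function name is made up for the example.

```python
def percent_bandwidth_used(bytes_per_sec, link_bits_per_sec):
    """Return throughput as a percentage of link speed, or None if the speed is unknown."""
    if link_bits_per_sec <= 0:
        return None  # some adapters report no usable link speed
    # Convert bytes to bits, then express as a percentage of the "total pipe".
    return bytes_per_sec * 8 * 100 / link_bits_per_sec


if __name__ == "__main__":
    # Hypothetical sample: 12.5 MB/s on a 1 Gbit/s adapter.
    print(percent_bandwidth_used(12_500_000, 1_000_000_000))
```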


The issue:

Since there is no direct perfmon data to collect this, the information must be collected via script. I wrote about how to write this yourself HERE.
There are 4 known issues with this script in the current Base OS MP, which can cause problems in some environments:

1. When the script executes – it consumes a high amount of CPU (WMIPrvse.exe process) for a few seconds.
2. The script does not support cookdown, so it runs a cscript.exe process and an instance of the script for EACH and every network adapter in your system (physical or virtual). This makes the CPU consumption even higher, especially for systems with a large number of network adapters (such as Hyper-V servers).
3. The script does not support teamed network adapters very well. Teaming is manufacturer/driver dependent, and teamed adapters are often missing the WMI classes expected by the script, so you will see “invalid class” errors on each script execution.
4. On some Windows 2003 servers, people have reported this script eventually causes a fault in netman.dll, and this can subsequently cause some additional services to fault/stop.
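Issue 2 is easier to see with a toy model. The Python sketch below is only an analogy for cookdown, not how the OpsMgr agent implements it: `per_adapter` stands for one script run per network adapter, while `shared` stands for a single run whose output is reused for every adapter. The class and method names are hypothetical.

```python
class AdapterCollector:
    """Counts how often the expensive query runs under each collection strategy."""

    def __init__(self, query):
        self._query = query   # callable returning {adapter_name: value}
        self.query_count = 0

    def per_adapter(self, adapters):
        """No cookdown: the expensive query is repeated for every adapter."""
        results = {}
        for name in adapters:
            self.query_count += 1
            results[name] = self._query()[name]
        return results

    def shared(self, adapters):
        """Cookdown-style: query once, then hand each adapter its own slice."""
        self.query_count += 1
        snapshot = self._query()
        return {name: snapshot[name] for name in adapters}


if __name__ == "__main__":
    names = [f"nic{i}" for i in range(8)]
    fake_query = lambda: {n: 0.0 for n in names}
    a, b = AdapterCollector(fake_query), AdapterCollector(fake_query)
    a.per_adapter(names)
    b.shared(names)
    print("per adapter:", a.query_count, "shared:", b.query_count)
```

Both strategies return the same data; only the number of expensive queries differs, which is why a server with many adapters pays so much more CPU per interval.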

From a CPU perspective – below is an example Hyper-V server with multiple NICs. For demonstration purposes, I set the rule and monitor which use this script to run every 30 seconds (they run every 5 minutes by default).
image

You can see WMI (and the total CPU) spiking every 30 seconds.
After disabling all the rules and monitors which utilize this data source, we see the following from the same server:
image


Based on these issues, I’d probably recommend disabling these rules AND monitors for Windows 2003 and Windows 2008. They seem to create a bit more impact than the usefulness of the data they provide.

Kevin Holman