Tuesday, October 4, 2011

OpsMgr: MP Update: New Base OS MP 6.0.6957.0 adds Cluster Shared Volume monitoring, BPA, new reports, and many other changes

Get it from the download center here:  http://www.microsoft.com/download/en/details.aspx?id=9296

This really looks like a nice addition to the Base OS MPs.  This update centers around a few key areas for Windows 2008 and 2008 R2:

  • Adds Cluster Shared Volume discovery and monitoring for free space and availability.  This is critical for those Hyper-V clusters on Server 2008 R2.
  • Adds a new monitor to execute the Windows Best Practices Analyzer for different discovered installed Roles, and then generate alerts until these are resolved.
  • Changes to many built in rules/monitors, to reduce noise, database space and I/O, and increase a positive “out of the box” experience.  Also added a few new monitors and rules.
  • Changes to the MP Views – removing some old stuff and adding some new
  • Addition of some new reports – way cool

Let’s take a look at these changes in detail:

Cluster Shared Volume discovery and monitoring:
We added a new discovery and class for cluster shared volumes:

We added some new monitors for this new class:

NTFS State Monitor and State monitor are disabled by default.  The guide states:
  • This monitor is disabled as normally the state of the NTFS partition is not needed (Dirty State notification).
  • This monitor is disabled because, when enabled, it may cause false negatives during backups of the Cluster Shared Volumes
I’d probably leave these turned off.

The free space monitoring for CSVs is different from how we monitor logical disks.  This is good – because CSVs are hosted by the cluster virtual resource name, not by the node, as logical disks are.  CSVs have two monitors, each of which runs a script every 15 minutes and compares the result against its own thresholds.  By default, Free space % is 5 (critical) and 10 (warning), while Free space MB is 100 (critical) and 500 (warning).  Obviously you will need to adjust these to what’s actionable in your Hyper-V cluster environment.
BOTH of these unit monitors act and alert independently, as seen in the above graphic for state, and below graphic for alerts:

Some notes on how free space monitoring of CSVs works:
  • Each unit monitor has state (critical or warning) and generates individual alerts (warning ONLY)
  • There is an aggregate rollup monitor (Cluster Share Volume – Free Space Rollup Monitor) that will roll up the WORST STATE of any member, and ALSO generate alerts when the WORST state rolls up CRITICAL.  This is how we can generate warning alerts to notify administrators, but then also generate a new, different CRITICAL alert when error thresholds are breached.  I really like this new design better than the Logical Disk monitoring: it gives the most flexibility to generate warning and critical alerts when necessary.  Perhaps you only email notify on the warning alerts, but need to auto-create incidents on the critical ones.  The only downside is that if a CSV fills up and breaches all thresholds in a short time frame, you will potentially get three alerts.
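
The interaction between the two unit monitors and the rollup can be sketched in a few lines of Python. This is only an illustration, not how the MP is implemented: the real monitors are script-based OpsMgr monitors, the names State, unit_state and evaluate_csv are made up for this example, and the "less than or equal" comparison is an assumption. The threshold values are the defaults quoted above.

```python
from enum import IntEnum


class State(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def unit_state(free, warning, critical):
    """State of one unit monitor; less free space is worse.

    The <= comparison is an assumption for this sketch.
    """
    if free <= critical:
        return State.CRITICAL
    if free <= warning:
        return State.WARNING
    return State.HEALTHY


def evaluate_csv(free_pct, free_mb):
    """Return (rollup state, alerts) for one Cluster Shared Volume."""
    pct = unit_state(free_pct, warning=10, critical=5)
    mb = unit_state(free_mb, warning=500, critical=100)

    # The rollup monitor takes the worst state of its members.
    rollup = max(pct, mb)

    alerts = []
    # Each unit monitor alerts on its own, at warning severity only.
    if pct != State.HEALTHY:
        alerts.append(("Free space %", "Warning"))
    if mb != State.HEALTHY:
        alerts.append(("Free space MB", "Warning"))
    # The rollup adds a separate critical alert when it rolls up critical.
    if rollup == State.CRITICAL:
        alerts.append(("Free Space Rollup", "Critical"))
    return rollup, alerts


if __name__ == "__main__":
    # A volume that blows through every threshold produces three alerts.
    print(evaluate_csv(free_pct=4, free_mb=50))
```

A volume at 8% free with plenty of megabytes left would only raise the single warning alert from the percentage monitor, which is the "notify by email" case; the critical rollup alert is the one you would route to incident creation.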

There are also collection rules for the CSV performance:

Best Practices Analyzer monitor:

A new monitor was added to run the Best Practices Analyzer.  You can read more about the BPA here:

We can open Health Explorer and get detailed information on what's not up to snuff:


Alternatively – we can run this task on demand to ensure we have resolved the issues:


Changes to built in Monitors and Rules:

Many rules and monitors were changed from a default setting, to provide a better out of the box experience.  You might want to look at any overrides you have against these and give them a fresh look:
  • “Logical Disk Availability Monitor” renamed to “File System error or corruption”
  • “Avg Disk Seconds per Write/Read/Transfer” monitors changed from Average Threshold monitortype to Consecutive Samples Threshold monitortype.
    • This is VERY good – this stops all the noise for the default enabled Sec/Transfer monitor, caused by momentary perf spikes.
    • The default threshold is set to “0.04” which is 40ms latency.  This is a good generic rule of thumb for the typical server.
    • The default sample rate is once per minute, for 15 consecutive samples.
    • Note – make sure you implement or at least evaluate hotfixes 2470949 or 2495300 for 2008R2 and 2008 Operating systems, which affect these disk counters.
    • Make sure you look at any overrides you had previously set on these – as they likely should be reviewed to see if they are still needed.
  • Disabled “Percentage Committed Memory in Use” monitor
    • This monitor used to change state when more than 80% of memory was utilized.  This created unnecessary noise due to the fact that more and more server roles utilize all available memory (SQL, Exchange) and this monitor was not always actionable.
  • Disabled “Total Percentage Interrupt Time” and “Total DPC Time Percentage”.
    • These monitors would often generate alert and state noise in heavily virtualized environments, especially when the CPUs are oversubscribed or heavily consumed temporarily.  These were turned off by default, because there are better performance counters at the Hypervisor host level to track this condition than these OS level counters.
  • Added “Free System Page Table Entries” and “Memory Pages per Second” monitors.  These are both enabled out of the box to track excessive paging conditions.  Also added MANY perf collection rules targeting memory counters, some disabled by default, some enabled.
  • “Total CPU Utilization Percentage” monitor was increased from 3 to 5 samples.  The timeout was shortened from 120 to 100 seconds (to be less than the interval of 120 seconds).
  • Disabled the following perf counter collection rules by default:
    • Avg Disk Sec/Write
    • Avg Disk Sec/Read
    • Disk Writes Per Second
    • Disk Reads Per Second
    • Disk Bytes Per Second
    • Disk Read Bytes Per Second
    • Disk Write Bytes Per Second
    • Average Disk Read Queue Length
    • Average Disk Write Queue Length
    • Average Disk Queue length
    • Logical Disk Split I/O per second
    • Memory Commit Limit
    • Memory Committed Bytes
    • Memory % Committed Bytes in use
    • Memory Page Reads per Second
    • Memory Page writes per second
    • Page File % use
    • Pages Input per second
    • Pages output per second
    • System Cache Resident Bytes
    • System Context Switches per second
  • Enabled the following perf counter collection rules by default:
    • Memory Pool Paged Bytes
    • Memory Pool Non-Paged bytes

A full list of all disabled rules, monitors and discoveries is available in the Appendix section of the guide.  The disabling of all these logical disk and memory perf collections is AWESOME.  This MP really collected more perf data than most customers were ready to consume and report on.  By including these collection rules, but disabling them, we are saving LOTS of space in the databases, valuable transactions per second in SQL, network bandwidth, and so on.  Good move.  If a customer desires them – they are already built, and a quick override to enable them is all that’s necessary.  Great work here.  I’d like to see us do more of this out of the box from a perf collection perspective.
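
To see why the move from an Average Threshold monitortype to a Consecutive Samples Threshold monitortype cuts the noise on the Avg Disk Seconds monitors, here is a small Python sketch. It is a simplification under assumptions: the function names are hypothetical, and the strict "greater than" comparison and the fixed-window handling are my guesses at the semantics, not taken from the MP. The 0.04 threshold and the 15 samples are the defaults listed above.

```python
THRESHOLD_SEC = 0.04  # 40 ms latency, the default threshold
NUM_SAMPLES = 15      # one sample per minute, 15 samples


def average_breached(samples, threshold=THRESHOLD_SEC, n=NUM_SAMPLES):
    """Old behavior (assumed): the AVERAGE of the last n samples is over threshold."""
    window = list(samples)[-n:]
    return len(window) == n and sum(window) / n > threshold


def consecutive_breached(samples, threshold=THRESHOLD_SEC, n=NUM_SAMPLES):
    """New behavior (assumed): EVERY one of the last n samples is over threshold."""
    window = list(samples)[-n:]
    return len(window) == n and all(s > threshold for s in window)


if __name__ == "__main__":
    # 14 healthy samples (5 ms) followed by one momentary 2-second spike.
    spike = [0.005] * 14 + [2.0]
    # 15 minutes of sustained 60 ms latency.
    sustained = [0.06] * 15

    print("spike:     average =", average_breached(spike),
          " consecutive =", consecutive_breached(spike))
    print("sustained: average =", average_breached(sustained),
          " consecutive =", consecutive_breached(sustained))
```

A single spike drags the average over 40 ms and would flip the old monitor, while the consecutive samples check ignores it and only reacts to the sustained latency.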

Added 10/3 - I just found some more changes to the MP’s:
  • The Windows Computer discovery added a “ProductType <> WinNT” to further filter out incorrect discoveries.
  • The Windows Disk partition discovery changed a propertyname from “Bootable” to “BootPartition” to fix an old issue.
  • Added a new Monitortype for NetworkAdapter.PercentBandwidthUsed
  • Memory Available megabytes script was updated.
  • Minor update to the Logical disk defrag monitor
  • Modified the tolerances and ToleranceTypes of several optimized performance collection rules.

Changes to MP views:

The old on the left – new on the right:

Top level logical disk and network adapter state views removed.
Added new views for Cluster Shared Volume Health, and Cluster Shared Volume Disk Capacity.

New Reports!  Performance by system, and Performance by utilization:

There are two new reports deployed with this new set of MP’s (provided you import the new reports MP that ships with this download – only available from the MSI and not the catalog)

To run the Performance by System report – open the report, select the time range you’d like to examine data for, and click “Add Object”.  This report has already been filtered to return only Windows Computer objects.  Search based on computer name, and add in the computer objects that you’d like to report on.  On the right – you can pick and choose the performance objects you care about for these systems.  We can even show you if the performance value is causing an unhealthy state – such as my Avg % memory used – which is yellow in the example:

Additionally – there is a report for showing you which computers are using the most, or the least resources in your environment.  Open “Performance by Utilization”, select a time range, choose a group that contains Windows Computers, and choose “Most”.  Run that, and you get a nice dashboard – with health indicators – of which computers are consuming the most resources, and potentially also impacted by this:
Using the report below – I can see I have some memory issues impacting my Exchange server, and my Domain Controller is experiencing disk latency issues.

By clicking the DC01 computer link in the above report – it takes me to the “Performance by System” report for that specific computer – very cool!

In summary – the Base OS MP is already a rock solid management pack.  This made some key changes to make the MP even less noisy out of the box, and added critical support for discovering and monitoring Cluster Shared Volumes.

Known Issues in this MP:

1.  A note on upgrading these MPs – I do not recommend using the OpsMgr console to show “Updates available for Installed Management Packs”.  The reason for this is that the new MPs shipping with this update (for CSVs and BPA) are shipped as new, independent MPs, and will not show up as needing an update.  If you use the console to install the updated MPs – you will miss these new ones.  This is why I NEVER recommend using the Console/Catalog to download or update MPs; it is a worst practice in my personal opinion.  You should always download the MSI from the web catalog at http://systemcenter.pinpoint.microsoft.com  and extract them – otherwise you will likely end up missing MPs you need.

2.  There might be an issue when you try and execute the reports:
An error has occurred during report processing
Query execution failed for dataset ‘PerfDS’ or Query execution failed for dataset ‘PerformanceData’
The EXECUTE permission was denied on the object ‘Microsoft_SystemCenter_Report_Performace_By_System’, database ‘OperationsManagerDW’, schema ‘dbo’.
I recommend enabling remote errors on your reporting server so the report output will show you the full details of the error:  http://technet.microsoft.com/en-us/library/aa337165.aspx   (without remote errors enabled – you might only see the top two lines in the error above)

This is due to a security permission on the new stored procedures which are deployed with this report. Thanks to PFE Tim McFadden for bringing this to my attention and to PFE Antoni Hanus for determining a resolution before we even had a chance to look into it:
  • Open SQL Mgmt Studio and connect to the SQL server hosting the Data Warehouse (OperationsManagerDW)
  • Navigate to OperationsManagerDW > Programmability > Stored Procedures > dbo.Microsoft_SystemCenter_Report_Performace_By_System

  • Right click dbo.Microsoft_SystemCenter_Report_Performace_By_System and choose Properties
  • Click the Permissions Page. Click the Search button. Hit Browse. Check [OpsMgrReader] and Click OK. Click OK again.

  • Click the check box in the Grant column for EXECUTE in the Permission row. It should look like this:

  • Click OK
  • Repeat the steps above for the stored procedure dbo.Microsoft_SystemCenter_Report_Performace_By_Utilization
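
If you would rather script this than click through the dialogs for both procedures, the sketch below builds the T-SQL text that should correspond to the steps above. Treat it as a starting point under assumptions: it assumes the GUI steps amount to a plain GRANT EXECUTE to the OpsMgrReader principal, the helper name grant_statements is made up, and the script only prints the statements; it does not connect to SQL Server. Review the output and run it yourself against OperationsManagerDW.

```python
# Stored procedure names as they appear in the error text
# (the "Performace" spelling is what the product uses).
PROCS = [
    "Microsoft_SystemCenter_Report_Performace_By_System",
    "Microsoft_SystemCenter_Report_Performace_By_Utilization",
]


def grant_statements(procs=PROCS, principal="OpsMgrReader", schema="dbo"):
    """Build one GRANT EXECUTE statement per stored procedure."""
    return [
        f"GRANT EXECUTE ON OBJECT::[{schema}].[{proc}] TO [{principal}];"
        for proc in procs
    ]


if __name__ == "__main__":
    print("USE [OperationsManagerDW];")
    for statement in grant_statements():
        print(statement)
```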

If you are getting a specific error about “System.Data.SqlClient.SqlException: Procedure or function Microsoft_SystemCenter_Report_Performace_By_Utilization has too many arguments specified” that is still under investigation.

3.  The logical disk free space monitortypes for both Windows 2003 and Windows 2008 were re-written.  These were changed to a consecutive samples monitortype.  However – in doing the modifications – several changes were made that might cause an impact:
The following three override-able properties were changed:
  • DebugFlag – removed
  • TimeoutSeconds – removed
  • SystemDriveWarningMBytesThreshold – renamed to “SystemDriveWarningMBytesTheshold”  (I am sure this wasn’t by design)
If you previously had overrides referencing any of these properties, you might get an error when importing or modifying your existing override MP:
Date: 10/3/2011 2:14:21 PM
Application: System Center Operations Manager 2007 R2
Application Version: 6.1.7221.81
Severity: Error

: Verification failed with [1] errors:
Error 1:
: Failed to verify Override [OverrideForMonitorMicrosoftWindowsServer2003LogicalDiskFreeSpaceForContextMicrosoftWindowsServer2003LogicalDisk02b92b47f8f74b2393f88f6a673823f5].
Cannot find OverridableParameter with name [SystemDriveWarningMBytesThreshold] defined on [Microsoft.Windows.Server.2003.FreeSpace.Monitortype]

Failed to verify Override [OverrideForMonitorMicrosoftWindowsServer2003LogicalDiskFreeSpaceForContextMicrosoftWindowsServer2003LogicalDisk02b92b47f8f74b2393f88f6a673823f5].Cannot find OverridableParameter with name [SystemDriveWarningMBytesThreshold] defined on [Microsoft.Windows.Server.2003.FreeSpace.Monitortype]
: Failed to verify Override [OverrideForMonitorMicrosoftWindowsServer2003LogicalDiskFreeSpaceForContextMicrosoftWindowsServer2003LogicalDisk02b92b47f8f74b2393f88f6a673823f5].
Cannot find OverridableParameter with name [SystemDriveWarningMBytesThreshold] defined on [Microsoft.Windows.Server.2003.FreeSpace.Monitortype]
: Cannot find OverridableParameter with name [SystemDriveWarningMBytesThreshold] defined on [Microsoft.Windows.Server.2003.FreeSpace.Monitortype]
You will be stuck and will not be able to save any more overrides to that MP until you resolve the issue.
You MUST export the XML of your broken override MP at this point.  In the XML – search for:  “SystemDriveWarningMBytesThreshold”
Modify the following:
change it to:
Save the modified XML, and reimport.  (always save a backup copy FIRST before making any changes!)  You will now be able to use your existing override MP again.
If your issue was caused by the fact that you have overridden timeout or debugflag – then simply delete those overrides in the XML.
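
For anyone with several override MPs to fix, the rename can be scripted. The Python sketch below is a minimal example under assumptions: the function names are hypothetical, it assumes the exported XML is UTF-8, and it does a blanket text replacement of the old parameter name with the new (misspelled) one. That will also touch any override that legitimately still uses the old spelling on some other monitor, so review the result before reimporting. It writes a .bak copy first.

```python
import shutil
from pathlib import Path

OLD = "SystemDriveWarningMBytesThreshold"
NEW = "SystemDriveWarningMBytesTheshold"  # misspelled on purpose, to match the new MP


def fix_override_text(xml_text):
    """Return (fixed text, number of occurrences replaced)."""
    return xml_text.replace(OLD, NEW), xml_text.count(OLD)


def fix_override_file(path):
    """Back up an exported override MP, then rewrite it in place."""
    path = Path(path)
    # Always keep a backup copy before changing anything.
    shutil.copy2(path, path.with_name(path.name + ".bak"))
    text = path.read_text(encoding="utf-8")  # assumes a UTF-8 export
    fixed, count = fix_override_text(text)
    path.write_text(fixed, encoding="utf-8")
    return count


if __name__ == "__main__":
    # Illustrative fragment only, not copied from a real override MP.
    sample = (
        '<MonitorConfigurationOverride Parameter="SystemDriveWarningMBytesThreshold">'
        "<Value>300</Value></MonitorConfigurationOverride>"
    )
    fixed, count = fix_override_text(sample)
    print(count, "occurrence(s) replaced")
    print(fixed)
```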

4.  The knowledge is out of date for the new default values in the free space monitors.  The changed values are referenced below:
Parameter                                       Default Value
System Drive Error Mbytes Threshold             100 (now 300)
System Drive Error Percent Threshold            5
System Drive Warning Mbytes Threshold           200 (now 500)
System Drive Warning Percent Threshold          10
Non-System Drive Error Mbytes Threshold         1000
Non-System Drive Error Percent Threshold        5
Non-System Drive Warning Mbytes Threshold       2000
Non-System Drive Warning Percent Threshold      10

5.  The BPA monitors can be noisy for Server 2008R2 systems.
The new BPA monitor runs a PowerShell script that calls the built-in BPA in the Server 2008 R2 operating system.  It runs this once per day.  It does not have any capability to filter out known configurations or BPA issues that you choose not to resolve.  While the UI provides the ability to create exclusions for specific issues in the BPA results, this monitor does not support that functionality.  The result is that this monitor could cause a large percentage of your servers to generate an alert and enter a warning state.  This is designed as a very simple monitor to bring attention to the BPA in Server 2008 R2, and to recommend adherence to best practices.  If you don’t want this monitor to generate alerts or affect health state – then disable this monitor via overrides.

6.  The “performance by utilization” report section dealing with Logical Disk % Idle time is flip-flopped…. in “Most Utilized” it reports “%100” as the highest, descending down to smaller numbers.  When in fact, 100% idle is NOT utilized at all.  The same issue shows up with the “least utilized” report model.  So for now – these specific values don’t work in a helpful manner.  However, as a workaround – you can still run a “performance top objects” report for this same counter, and choose “top N” and “bottom N” in the report to gain access to the same data.
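
If you export the % Idle Time values and want a correct "most utilized" ranking in the meantime, the fix is simply to invert the counter before sorting. A small Python sketch, assuming you already have average % Idle Time per disk in hand; the function name and the sample figures are invented for the example:

```python
def most_utilized(idle_pct_by_disk, top_n=3):
    """Rank disks by busy time, where busy % = 100 - idle %."""
    busy = {name: 100.0 - idle for name, idle in idle_pct_by_disk.items()}
    return sorted(busy.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Made-up averages of Logical Disk % Idle Time.
    samples = {"DC01 C:": 12.5, "EX01 D:": 97.0, "SQL01 E:": 55.0}
    for disk, busy_pct in most_utilized(samples):
        print(f"{disk}: {busy_pct:.1f}% busy")
```

Sorting on the raw counter would put the 97% idle disk at the top of "Most Utilized", which is exactly the flip-flop described above.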
