monitoring resource usage with resource governor

by mark wilkinson, 2016-09-30 estimated read time: 10-13 minutes

tag(s): database administration, performance, m82labs archive

When Resource Governor was first introduced with SQL Server 2008 you could use it to put limits on CPU usage, memory usage, and concurrent requests. In later editions Microsoft also added the ability to limit disk IO, place a hard cap on CPU usage, and control NUMA node affinity. Resource Governor is a great way to ensure no single application or user can completely starve another application or user of their SQL Server resources, but thats not what we are going to focus on today.

In this post we'll go over the basics of getting resource pools and workload groups set up, writing an effective classifier function to spearate your workload into these groups, and finally we'll talk about using the built-in DMVs to monitor resource usage per resource group and why this is such a cool feature.

In the examples below we will be creating a simple Resource Governor configuration for an environment where we have two teams executing requests, each team has two different applications that make these requests. As stated above, while we will discuss some of the surface-level basics of Resource Governor configuration, the goal of this post is to highlight the monitoring capabilities of Resource Governor.

Throughout the examples we will be using a database called DBA. This is a database name I use to store all of my administration scripts, procedures, functions, lookup tables, etc. I create it on every instance I manage and it keeps a nice set of familiar tools close at hand, regardless of which instance I am working on. I highly suggest building your own DBA database.

The Basics

Before we do anything with Resource Governor we have to make sure it's turned on:

SELECT is_enabled FROM sys.resource_governor_configuration;

If it's not enabled, it's simple to turn on:

ALTER RESOURCE GOVERNOR RECONFIGURE;

Resource Pools

When thinking about Resource Governor, it can be helpful to think of it as a hierarchy of resource filters. At the very top, wide open, you have the whole of your SQL Server, all of the cores, all of the memory, all of the IO. Below that we have resource pools. At the pool level you can set the minimum and maximum CPU, memory, and IO, as well as the NUMA node affinity. The resources in the pool are shared among all session requests classified to a workload group within that pool.

Creating a pool is simple:

CREATE RESOURCE POOL Team01;  
GO
CREATE RESOURCE POOL Team02;  
GO  
ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

This creates two unrestricted pools, this will not limit the resources session requests within these pool can use.

When creating pools you have a few options for limiting session requests:

Min/Max CPU limits (max CPU is applicable in times of CPU contention only)
Hard CPU cap (max CPU that is applicable at all times)
Min/Max query execution grant memory (be careful with the min setting on this one, as it reduces the max available memory for other pools)
Min/Max IOPS on a per volume basis
NUMA node affinity

I could go into great detail on these settings, but I would really just be copying Microsofts own documentation word-for-word, as it is quite concise: MSDN

One side-effect of classifying sessions into Resource Pools, is that each pool has it's own plan cache. While it may be a bit of overkill, you can use Resource Pools to eleviate bad parameter sniffing issues if you have two applications running similar queries with wildly different parameters.

Workload Groups

Following our hierarchy down from the pool level, we get to the workload group. A workload group is restricted to using the resources defined by the pool it belongs to, and adds the additional ability to limit the maximum memory grant per request ( as a percentage of total memory in the pool ), MAXDOP, maximum concurrent requests allowed in the group, and more.

Again, creating a workload group is very straight forward, we'll create 4 groups, one for each application in our two resource pools:

CREATE WORKLOAD GROUP App01  
USING Team01;  
GO
CREATE WORKLOAD GROUP App02  
USING Team01;  
GO
CREATE WORKLOAD GROUP App03  
USING Team02;  
GO
CREATE WORKLOAD GROUP App04  
USING Team02;  
GO  
ALTER RESOURCE GOVERNOR RECONFIGURE;  
GO

Like the pools, these workload groups have no limits applied to them, so the groups are only used to classify the workload for reporting purposes.

Classifier

Once the pools and workload groups have been created, we have to write the function to classify our requests into these groups. When writing a classifier it's very important to remember that this function will be called for every single session, so we want to make it as lean and efficient as possible. The classifier only has one job: based on some condition, return the name of the workload group to assign the current session to.

Here is a simple example, this classifier will assign sessions from our four applications into their cooresponding workload groups:

USE master;
GO

CREATE FUNCTION dbo.fnRGClassifier ()
RETURNS sysname
WITH SCHEMABINDING  
AS   
BEGIN
  DECLARE @app_name sysname = APP_NAME();

  RETURN (
    CASE
    WHEN @app_name = N'Application 01' THEN N'App01'
    WHEN @app_name = N'Application 02' THEN N'App02'
    WHEN @app_name = N'Application 03' THEN N'App03'
    WHEN @app_name = N'Application 04' THEN N'App04'
    ELSE NULL
    END
  )
END
GO

ALTER RESOURCE GOVERNOR with (CLASSIFIER_FUNCTION = dbo.fnRGClassifier)  
ALTER RESOURCE GOVERNOR RECONFIGURE  
GO

Before you get too fancy, just remember, the classifier function has to be schema bound, so you will not be able to access objects outside of master.dbo. This elimates the posiblity of doing things like accessing lookup tables outside of master, or using in-memory tables or natively compiled functions.

Resource Governor DMVs

There are a few DMVs available to help you view the configuration of Resource Governor as well as usage statistics on your pools and workload groups. If you want to verify the configuration of what we have done so far, and see whats statistics are available, run these simple statements:

-- Check the classifier is set properly and that Resource Governor is enabled
SELECT  OBJECT_NAME(classifier_function_id) AS classifier,
        is_enabled
FROM    sys.resource_governor_configuration;

-- View our pools
SELECT  *
FROM    sys.dm_resource_governor_resource_pools;

-- View our workload groups
SELECT  *
FROM    sys.dm_resource_governor_workload_groups;

If you simply want to test to make sure your sessions are being properly classified, sys.dm_exec_sessions has a column called group_id that you can use to join on sys.dm_resource_governor_workload_groups to see which group each session has been classified to.

From looking at these DMVs it's pretty obvious how much data is available on your resource pools and workload groups. For those of you not following along, these DMVs allow you to view the following (and more):

Current/total request count
Current/total requests queued (if these sessions would violate your limits)
Total lock wait time and count
Total CPU usage in milliseconds
Total resource limit violations
Total reads/writes
Total read/write IO stalls

Depending on the statistic you want, you can get them at the pool or group level, or in some cases both. You can read more about the available DMVs here: MSDN

In addition to all of this great data, if you impose IO limits at the resource pool level you will also have access to sys.dm_resource_governor_resource_pool_volumes. This DMV will allow you to dig deep into the IO usage of your pools on a per volume basis. If you are currently running a IO-bound workload, I would highly suggest taking a look at what Resource Governor can offer here.

Collecting Data

Now that we know which statistics we can gather, we need to actually start gathering them. While the DMVs for Resource Governor are great, they will only give you an aggregate of the usage information since the last time the statistics were reset, or the last time services were restarted.

In most cases it makes sense to store your data in a separate table so you can calculate differentials between two time periods. For our example we are only going to be interested in request counts and CPU usage. For this, we will create the following table:

USE DBA;
GO

CREATE TABLE dbo.ResourceGovernorUsageData (
  CollectionTimeUTC DATETIME,
  CollectionTimeOffset INT,
  ResourcePool SYSNAME,
  WorkloadGroup SYSNAME,
  RequestCountTotal BIGINT,
  RequestCountDelta BIGINT,
  CPUUsageMSTotal BIGINT,
  CPUUsageMSDelta BIGINT,
  CONSTRAINT PK_ResourceGovernorUsageData PRIMARY KEY CLUSTERED (
    CollectionTimeUTC,
    ResourcePool,
    WorkloadGroup
  )
)

With the table created, we will also create a stored procedure to write data to the table, eventually calling the procedure via a scheduled job. The general idea behind the procedure is that it will grab the current statistics, and then read the statistics currently being stored in the table. It will then write the new statistics it gathered, as well as the delta between this run of the procedure and the last. Technically you could just store the aggregated value as it appears in the DMVs, but the query needed to process that data would end up being costly.

When writing a procedure like this you have to account for two cases:

The DMV statistics get reset: This would result in a negative delta, as the latest total would be less than the previous
The first time you load the data: If you are starting with nothing, there is no delta to store, in which case you have to decide if you want to store a delta of 0 or NULL

Here is an example of a procedure we could use to store our data:

USE DBA;
GO

CREATE PROCEDURE dbo.Collect_ResourceGovernorStats
AS
;WITH RGCTE AS
(
  SELECT  GETUTCDATE() AS now_utc,
      DATEDIFF(HOUR,GETUTCDATE(),GETDATE()) AS now_offset,
      rgp.name AS pool_name,
      rgg.name AS group_name,
      rgg.total_request_count,
      rgg.total_cpu_usage_ms
  FROM  sys.resource_governor_resource_pools AS rgp
      INNER JOIN sys.dm_resource_governor_workload_groups As rgg
        ON rgp.pool_id = rgg.pool_id
)
INSERT INTO ResourceGovernorUsageData (
  CollectionTimeUTC,
  CollectionTimeOffset,
  ResourcePool,
  WorkloadGroup,
  RequestCountTotal,
  RequestCountDelta,
  CPUUsageMSTotal,
  CPUUsageMSDelta
) 
SELECT  RGCTE.now_utc,
    RGCTE.now_offset,
    RGCTE.group_name,
    RGCTE.pool_name,
    RGCTE.total_request_count,
    ( CASE
      WHEN RGCTE.total_request_count >= ISNULL(LastRun.RequestCountTotal,0)
        THEN RGCTE.total_request_count - LastRun.RequestCountTotal ELSE 0
      END
    ),
    RGCTE.total_cpu_usage_ms,
    ( CASE
      WHEN RGCTE.total_cpu_usage_ms >= ISNULL(LastRun.CPUUsageMSTotal,0)
        THEN RGCTE.total_cpu_usage_ms - LastRun.CPUUsageMSTotal ELSE 0
      END
    )
FROM  RGCTE
    OUTER APPLY (
      SELECT TOP (1)
            RequestCountTotal,
            CPUUsageMSTotal
      FROM   ResourceGovernorUsageData AS RGD
      WHERE  RGD.ResourcePool = RGCTE.pool_name
            AND RGD.WorkloadGroup = RGCTE.group_name
      ORDER BY
            CollectionTimeUTC DESC
    ) AS LastRun;

This procedure will take a snapshot of the current resource governor stats (via the CTE), and then subtract the most recent stats from our table and store the new total and the delta. We are also handling our error cases, making sure to throw out our delta if the current aggregated total is not higher than the previous, and also throwing out our data if there is no data in the table to calculate a delta from.

Now all you have to do is schedule this procedure to run on a regular basis (I chose to run this every 5 minutes via SQL Agent) and you'll have a nice high level view of resource consumption on your instance.

Where to Go from Here

For me, this is the really fun part. There is a lot you can do with this data. Even if you just leave the data on the instance and export it to excel for analysis after new code releases it can be a very valuable tool.

In my environment I am using a custom PowerShell module to simultaniously write my Resource Governor data to SQL Server as well as Elasticsearch. Once the data is in Elasticsearch I can run aggregations on it, analyse it with Python, and create resource usage dashboards for the various team managers.

Look for future posts where I talk about my custom module and provide visual examples of what can be done in Kibana once you get this data into Elasticsearch.