Monitoring Resource Usage with Resource Governor
When Resource Governor was first introduced in SQL Server 2008 you could use it to put limits on CPU usage, memory usage, and concurrent requests. In later versions Microsoft also added the ability to limit disk IO, place a hard cap on CPU usage, and control NUMA node affinity. Resource Governor is a great way to ensure no single application or user can completely starve another application or user of SQL Server resources, but that's not what we are going to focus on today.
In this post we’ll go over the basics of getting resource pools and workload groups set up, writing an effective classifier function to separate your workload into these groups, and finally we’ll talk about using the built-in DMVs to monitor resource usage per pool and workload group, and why this is such a cool feature.
In the examples below we will be creating a simple Resource Governor configuration for an environment where we have two teams executing requests, and each team has two different applications that make those requests. As stated above, while we will discuss some of the surface-level basics of Resource Governor configuration, the goal of this post is to highlight the monitoring capabilities of Resource Governor.
Throughout the examples we will be using a database called DBA. This is a database name I use to store all of my administration scripts, procedures, functions, lookup tables, etc. I create it on every instance I manage and it keeps a nice set of familiar tools close at hand, regardless of which instance I am working on. I highly suggest building your own DBA database.
Before we do anything with Resource Governor we have to make sure it’s turned on:
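A quick way to check is to query `sys.resource_governor_configuration`:

```sql
-- is_enabled = 1 means Resource Governor is active
SELECT classifier_function_id, is_enabled
FROM sys.resource_governor_configuration;
```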
If it’s not enabled, it’s simple to turn on:
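A single `RECONFIGURE` statement enables it:

```sql
-- Enables Resource Governor (and applies any pending configuration changes)
ALTER RESOURCE GOVERNOR RECONFIGURE;
```

If you ever need to turn it off again, `ALTER RESOURCE GOVERNOR DISABLE` does the reverse.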
When thinking about Resource Governor, it can be helpful to think of it as a hierarchy of resource filters. At the very top, wide open, you have the whole of your SQL Server, all of the cores, all of the memory, all of the IO. Below that we have resource pools. At the pool level you can set the minimum and maximum CPU, memory, and IO, as well as the NUMA node affinity. The resources in the pool are shared among all session requests classified to a workload group within that pool.
Creating a pool is simple:
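For our two-team scenario, something like this works (the pool names `TeamA` and `TeamB` are just placeholders for this post):

```sql
CREATE RESOURCE POOL TeamA;
CREATE RESOURCE POOL TeamB;

-- Configuration changes do not take effect until you reconfigure
ALTER RESOURCE GOVERNOR RECONFIGURE;
```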
This creates two unrestricted pools; sessions classified into these pools will have no limits placed on the resources they can use.
When creating pools you have a few options for limiting session requests:
- Min/Max CPU limits (max CPU is applicable in times of CPU contention only)
- Hard CPU cap (max CPU that is applicable at all times)
- Min/Max query execution grant memory (be careful with the min setting on this one, as it reduces the max available memory for other pools)
- Min/Max IOPS on a per volume basis
- NUMA node affinity
I could go into great detail on these settings, but I would really just be copying Microsoft's own documentation word-for-word, as it is quite concise: MSDN
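To illustrate what a restricted pool looks like, here is a sketch that exercises most of the options above. All of the numbers are purely illustrative, not recommendations:

```sql
CREATE RESOURCE POOL TeamA_Limited
WITH (
    MIN_CPU_PERCENT     = 10,
    MAX_CPU_PERCENT     = 50,   -- only enforced during CPU contention
    CAP_CPU_PERCENT     = 70,   -- hard cap, enforced at all times
    MIN_MEMORY_PERCENT  = 0,    -- a non-zero min reduces max memory for other pools
    MAX_MEMORY_PERCENT  = 60,
    MIN_IOPS_PER_VOLUME = 0,
    MAX_IOPS_PER_VOLUME = 500,  -- IOPS limits are per disk volume
    AFFINITY SCHEDULER  = AUTO  -- or bind to specific NUMA nodes/schedulers
);

ALTER RESOURCE GOVERNOR RECONFIGURE;
```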
One side effect of classifying sessions into resource pools is that each pool has its own plan cache. While it may be a bit of overkill, you can use resource pools to alleviate bad parameter sniffing issues if you have two applications running similar queries with wildly different parameters.
Following our hierarchy down from the pool level, we get to the workload group. A workload group is restricted to using the resources defined by the pool it belongs to, and adds the ability to limit the maximum memory grant per request (as a percentage of the pool's total memory), MAXDOP, the maximum concurrent requests allowed in the group, and more.
Again, creating a workload group is very straightforward. We’ll create four groups, one for each application, across our two resource pools:
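The group names below (one per hypothetical application) are placeholders, like the pool names:

```sql
CREATE WORKLOAD GROUP TeamA_App1 USING TeamA;
CREATE WORKLOAD GROUP TeamA_App2 USING TeamA;
CREATE WORKLOAD GROUP TeamB_App1 USING TeamB;
CREATE WORKLOAD GROUP TeamB_App2 USING TeamB;

ALTER RESOURCE GOVERNOR RECONFIGURE;
```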
Like the pools, these workload groups have no limits applied to them, so the groups are only used to classify the workload for reporting purposes.
Once the pools and workload groups have been created, we have to write the function to classify our requests into these groups. When writing a classifier it’s very important to remember that this function will be called for every single session, so we want to make it as lean and efficient as possible. The classifier only has one job: based on some condition, return the name of the workload group to assign the current session to.
Here is a simple example; this classifier will assign sessions from our four applications into their corresponding workload groups:
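This sketch assumes each application identifies itself via its connection string, so we can route on `APP_NAME()`; routing on `SUSER_SNAME()` or `HOST_NAME()` works the same way:

```sql
USE master;
GO
CREATE FUNCTION dbo.RG_Classifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    -- Sessions that match nothing fall through to the default group
    DECLARE @GroupName sysname = N'default';
    DECLARE @AppName sysname = APP_NAME();

    IF      @AppName = N'TeamA_App1' SET @GroupName = N'TeamA_App1';
    ELSE IF @AppName = N'TeamA_App2' SET @GroupName = N'TeamA_App2';
    ELSE IF @AppName = N'TeamB_App1' SET @GroupName = N'TeamB_App1';
    ELSE IF @AppName = N'TeamB_App2' SET @GroupName = N'TeamB_App2';

    RETURN @GroupName;
END;
GO

-- Register the classifier and activate it
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.RG_Classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
```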
Before you get too fancy, just remember: the classifier function has to be schema bound, so you will not be able to access objects outside of master.dbo. This eliminates the possibility of doing things like accessing lookup tables outside of master, or using in-memory tables or natively compiled functions.
Resource Governor DMVs
There are a few DMVs available to help you view the configuration of Resource Governor as well as usage statistics on your pools and workload groups. If you want to verify the configuration of what we have done so far, and see what statistics are available, run these simple statements:
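```sql
-- Overall configuration, including the active classifier function
SELECT * FROM sys.resource_governor_configuration;

-- Per-pool configuration plus usage statistics
SELECT * FROM sys.dm_resource_governor_resource_pools;

-- Per-group configuration plus usage statistics
SELECT * FROM sys.dm_resource_governor_workload_groups;
```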
If you simply want to test to make sure your sessions are being properly classified, sys.dm_exec_sessions has a column called group_id that you can use to join on sys.dm_resource_governor_workload_groups to see which group each session has been classified to.
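For example:

```sql
SELECT  s.session_id,
        s.login_name,
        s.program_name,
        g.name AS workload_group
FROM    sys.dm_exec_sessions AS s
JOIN    sys.dm_resource_governor_workload_groups AS g
        ON g.group_id = s.group_id
WHERE   s.is_user_process = 1;  -- ignore system sessions
```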
From looking at these DMVs it’s pretty obvious how much data is available on your resource pools and workload groups. For those of you not following along, these DMVs allow you to view the following (and more):
- Current/total request count
- Current/total requests queued (requests are queued when running them would violate your limits)
- Total lock wait time and count
- Total CPU usage in milliseconds
- Total resource limit violations
- Total reads/writes
- Total read/write IO stalls
Depending on the statistic you want, you can get them at the pool or group level, or in some cases both. You can read more about the available DMVs here: MSDN
In addition to all of this great data, if you impose IO limits at the resource pool level you will also have access to
sys.dm_resource_governor_resource_pool_volumes. This DMV will allow you to dig deep into the IO usage of your pools on a per volume basis. If you are currently running an IO-bound workload, I would highly suggest taking a look at what Resource Governor can offer here.
Now that we know which statistics we can gather, we need to actually start gathering them. While the DMVs for Resource Governor are great, they will only give you an aggregate of the usage information since the last time the statistics were reset, or the last time services were restarted.
In most cases it makes sense to store your data in a separate table so you can calculate differentials between two time periods. For our example we are only going to be interested in request counts and CPU usage. For this, we will create the following table:
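A table along these lines would do; the table and column names are just the ones I'll use for this example:

```sql
USE DBA;
GO
CREATE TABLE dbo.ResourceGovernorStats
(
    CollectionTime    datetime2(0) NOT NULL,
    GroupName         sysname      NOT NULL,
    TotalRequestCount bigint       NOT NULL,  -- running total from the DMV
    TotalCPUUsageMS   bigint       NOT NULL,
    DeltaRequestCount bigint       NULL,      -- NULL on first load or after a stats reset
    DeltaCPUUsageMS   bigint       NULL,
    CONSTRAINT PK_ResourceGovernorStats
        PRIMARY KEY CLUSTERED (GroupName, CollectionTime)
);
```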
With the table created, we will also create a stored procedure to write data to the table, eventually calling the procedure via a scheduled job. The general idea behind the procedure is that it will grab the current statistics, and then read the statistics currently being stored in the table. It will then write the new statistics it gathered, as well as the delta between this run of the procedure and the last. Technically you could just store the aggregated value as it appears in the DMVs, but the query needed to process that data would end up being costly.
When writing a procedure like this you have to account for two cases:
- The DMV statistics get reset: This would result in a negative delta, as the latest total would be less than the previous
- The first time you load the data: If you are starting with nothing, there is no delta to store, in which case you have to decide whether to store a delta of zero or no delta at all
Here is an example of a procedure we could use to store our data:
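This sketch assumes the table from above (`dbo.ResourceGovernorStats`); it stores a NULL delta in both of the edge cases just described:

```sql
USE DBA;
GO
CREATE PROCEDURE dbo.LoadResourceGovernorStats
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @Now datetime2(0) = SYSDATETIME();

    WITH CurrentStats AS
    (
        -- Snapshot of the running totals from the DMV
        SELECT  name AS GroupName,
                total_request_count,
                total_cpu_usage_ms
        FROM    sys.dm_resource_governor_workload_groups
    ),
    PreviousStats AS
    (
        -- Most recent row we stored for each group
        SELECT  GroupName,
                TotalRequestCount,
                TotalCPUUsageMS,
                ROW_NUMBER() OVER (PARTITION BY GroupName
                                   ORDER BY CollectionTime DESC) AS rn
        FROM    dbo.ResourceGovernorStats
    )
    INSERT dbo.ResourceGovernorStats
        (CollectionTime, GroupName, TotalRequestCount, TotalCPUUsageMS,
         DeltaRequestCount, DeltaCPUUsageMS)
    SELECT  @Now,
            c.GroupName,
            c.total_request_count,
            c.total_cpu_usage_ms,
            -- CASE returns NULL (no delta) after a stats reset;
            -- the LEFT JOIN returns NULL on the first load
            CASE WHEN c.total_request_count >= p.TotalRequestCount
                 THEN c.total_request_count - p.TotalRequestCount END,
            CASE WHEN c.total_cpu_usage_ms >= p.TotalCPUUsageMS
                 THEN c.total_cpu_usage_ms - p.TotalCPUUsageMS END
    FROM    CurrentStats AS c
    LEFT JOIN PreviousStats AS p
        ON  p.GroupName = c.GroupName
        AND p.rn = 1;
END;
GO
```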
This procedure will take a snapshot of the current Resource Governor stats (via the CTE), subtract the most recent stats stored in our table from that snapshot, and store both the new totals and the deltas. We are also handling our error cases, making sure to throw out our delta if the current aggregated total is not higher than the previous, and also throwing out our delta if there is no prior data in the table to calculate it from.
Now all you have to do is schedule this procedure to run on a regular basis (I chose to run this every 5 minutes via SQL Agent) and you’ll have a nice high level view of resource consumption on your instance.
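For completeness, here is one way to set that up with the msdb job procedures; the job name is arbitrary, and the step assumes your collection procedure is named `dbo.LoadResourceGovernorStats` in the DBA database:

```sql
USE msdb;
GO
EXEC dbo.sp_add_job
    @job_name = N'DBA - Collect Resource Governor Stats';

EXEC dbo.sp_add_jobstep
    @job_name      = N'DBA - Collect Resource Governor Stats',
    @step_name     = N'Run collection procedure',
    @subsystem     = N'TSQL',
    @database_name = N'DBA',
    @command       = N'EXEC dbo.LoadResourceGovernorStats;';

EXEC dbo.sp_add_jobschedule
    @job_name             = N'DBA - Collect Resource Governor Stats',
    @name                 = N'Every 5 minutes',
    @freq_type            = 4,  -- daily
    @freq_interval        = 1,
    @freq_subday_type     = 4,  -- repeat on a minute interval
    @freq_subday_interval = 5;

EXEC dbo.sp_add_jobserver
    @job_name = N'DBA - Collect Resource Governor Stats';
```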
Where to Go from Here
For me, this is the really fun part. There is a lot you can do with this data. Even if you just leave the data on the instance and export it to Excel for analysis after new code releases, it can be a very valuable tool.
In my environment I am using a custom PowerShell module to simultaneously write my Resource Governor data to SQL Server as well as Elasticsearch. Once the data is in Elasticsearch I can run aggregations on it, analyze it with Python, and create resource usage dashboards for the various team managers.
Look for future posts where I talk about my custom module and provide visual examples of what can be done in Kibana once you get this data into Elasticsearch.