Handling SQL Agent Job Failures
Hopefully you don’t experience a lot of job failures. But when multiple high-frequency jobs start failing, your inbox can get overloaded. Besides being annoying, an inbox full of job failure emails can make it hard to sift out the signal from the noise. In this post we’ll walk through a system that I have used to monitor job failures on 100’s of instances running 150+ jobs a piece, many of which run every 5 minutes.
At it’s core, the system has three components: a stored procedure, a SQL Agent job, and a table. The table stores job failure details. The agent job executes a procedure to copy job failure details to the table, and cleans up the job history table in MSDB.
The table is simple, and holds the information you would expect:
Throughout the rest of this post I will be using the
DBAdatabase. This is a database I create on any instance I manage. I use it to store any objects I need to perform monitoring and troubleshooting tasks. I also tend to break out my objects by schema, in these examples I will be using the
The procedure copies job failure data from the
dbo.sysjobhistory table in
MSDB to the
JobFailureArchive table. Before it does this, the procedure scans the
JobFailureArchive table to determine the date of the most recent job failure it had captured.
How you approach alerting is up to you, but you have a lot of different options. In my environment I query the
JobFailureArchive table via a Nagios service check. You could do the same thing with any monitoring solution that allows you to add custom queries. No matter which direction you take, writing a query to get the job failure details is simple:
Since we are storing the date the records are added to the table, this query will always return the latest set of failures. This is a simple example, but the possibilities are endless:
- Send the results of this query via database mail
- Join with
dbo.syscategories, alerting on different thresholds per job category
- Extend the
TOP (1)to include multiple capture periods and alert on average failures per capture
Configuring the SQL Agent Job
A SQL Agent job is the last component that ties everything together. At it’s simplest the job executes the procedure in the example above. This will get your job failures into the archive table, but there is more this job can do.
Since we are archiving job failures, we can clean up job history in MSDB more aggressively. To do this, add another step to the agent job that executes the following (which limits job history to a rolling 24 hours):
Reducing the size of the
sysjobhistory table can have a lot of benefits including: quicker restore times on MSDB, more responsive SSMS when viewing job history, and a reduction in contention when writing to a busy
You could also add a step to send an email alert when new job failures are inserted. Database mail is a rabbit hole we aren’t going to explore in this post. If you do plan on using this job to alert, start by reading through the resources at the bottom of this page. Specifically the one by Jes Borland on sending query results using database mail.
The article on using tokens in agent jobs is also a good read. Using tokens you can find job failures inserted since the job started executing:
The system function
msdb.dbo.agent_datetime converts the date and time format used by SQL Agent into a standard datetime. The tokens
$(ESCAPE_SQUOTE(STRTTM)), will return the date and time the job started executing. Combined with Jes’s post this could be all you need for job failure alerts.
If this job fails for any reason you will stop receiving notifications for any job failures. Make sure to configure this job to alert via email if it fails!
PowerShell (Bonus Feature!)
One nice thing about having job failure data in a dedicated table is that it is easy to query. You can also give normal users access to the data without having to put them in special agent roles or granting them access to system databases. Below is a PowerShell function you, or your users, could use to retrieve job failures based on several criteria:
If you are not familiar with PowerShell I recommend giving it a try. It has become an indispensable tool for my day to day tasks. The power of this function comes when you run it across multiple instances:
This will get the 15 most recent job failures for the three instances listed, and display them in a graphical grid view.
Writing job failures to a user table seems like a simple idea, but you’ll be surprised how much you use it. We only scratched the surface in this post, I would love to hear what other uses you find for this.