T-SQL Tuesday #130 : Automate Your Stress Away

T-SQL Tuesday #128 : Learn From OthersLet's automate

This month, Elizabeth Noble asks us to talk about things we have automated (or want to automate). Automation is a huge deal at Wayfair, but as a part of improving our post-mortem process, I want to talk about something we haven't automated yet.

Reviewing Error Logs

As luck would have it, this T-SQL Tuesday topic was very timely. A week ago, I painstakingly assembled a timeline surrounding a critical incident (it was an unplanned failover, but that's not important). In order to piece everything together, this consisted of:

  1. filtering the Application event log to a window around the time of the event
  2. further filtering to remove known "noise" events based on event ID (e.g. 35202 for reconnect is noise, because this process is usually for post-mortems, not active investigation)
  3. manually copying & pasting the details for the first and last instance, and count, of any interesting events remaining (based on known event IDs or inspection)
  4. repeating 1-3 for the System event log
  5. repeating 1-3 for the tediously-named Microsoft-Windows-FailoverClustering-Manager/Admin event log
  6. manually assembling all of these events, with their timestamps, chronologically
  7. applying formatting, adding context, and pasting the polished output into a JIRA ticket

There's a whole lot of grunt work in there that shouldn't have to be done by a person. I don't think you could automate the whole thing, because it is hard to predict exactly what events will be interesting and not, but I think 90% is achievable.

A colleague mentioned that they want to build something that would help, but even when that happens, that would up in proprietary code that only helps us. I saw Drew Furgiuele's post on Building a Better Get-SQLErrorLog, and that gave me some ideas for what I would build. After reaching out to Drew, I created a GitHub repository with a working name of GetAllTheErrorLogs. Its elevator pitch is a simple sentence:

Powershell to assemble a timeline combining salient events from Windows Event Log, Failover Cluster log, and SQL Server errorlog.

Except I don't really need anything other than Get-WinEvent, since everything I'd want from the other logs make it there too.

The code so far prompts for a server, a date/time, and a window in minutes, then reaches out to that server and grabs relevant event log entries for the time range provided.

Here is the code I have so far, in very rough form, and very much a work in progress. This just outputs relevant event info to both Format-Table and Format-List, ordered by first event within the window. The filters currently exclude noisy events that aren't useful for analysis; as I test this on more servers, that list will be expanded (and documented).

$ServerName   = Read-Host -Prompt "Enter the server name"
$DateTime     = Read-Host -Prompt "Enter the date/time of the incident"
$MinuteWindow = Read-Host -Prompt "Enter the window on each side, in minutes"
$Begin = (Get-Date($DateTime)).AddMinutes(-$MinuteWindow);
$End   = (Get-Date($DateTime)).AddMinutes( $MinuteWindow);
# later if we want to pull all servers within the same WSFC, we can 
# pull that list using Get-ClusterNode, but we likely need creds
# $cred = Get-Credential -Message "Enter your credentials:";
# $Servers = Invoke-Command -ComputerName $ServerName -ScriptBlock { Get-ClusterNode } -Credential $cred 
# | Select Name;
# | Where-Object -FilterScript {$_.Name -like 'AG_*'} } -Credential $cred;
$Servers = @($ServerName);
$Logs = @("System","Application","HardwareEvents","Microsoft-Windows-FailoverClustering-Manager/Admin",
$Providers = @("Microsoft-Windows-FailoverClustering","Microsoft-Windows-FailoverClustering-Manager",
               "MSSQLSERVER","EventLog","Service Control Manager");
# foreach ($server in $Servers)
# {
  $events = Get-WinEvent -ComputerName $serverName -FilterHashtable @{ 
      LogName      = $Logs;
      ProviderName = $Providers; 
      StartTime    = $Begin; 
      EndTime      = $End; 
      #Level        = 1,2,3 
    } -ErrorAction Stop | 
    Where-Object {
    ($_.ProviderName -in ("MSSQLSERVER","Microsoft-Windows-FailoverClustering") -and 
       $_.Id -notin (17103,3408,46920,17177,17115,17111,17550,17135,8128,49903,17551,18100,33090) -and 
       $_.Id -notin (15268,18496,6299,3450,17136,849,26018,49904,35266,17806,18452,17101))
    } | Select-Object Id,TimeCreated,Message,LogName,ProviderName,Level,MachineName | Sort-Object Id,TimeCreated;
# }
$LastEventID = -1;
$LastLogName = [string]::empty;
$thisEventCounter = 0;
$totalEventCounter = 0;
$rowsToIgnore = @();
$EntityList = @(); 
$result = foreach ($incident in $events)
    $ServerName = $incident.MachineName
    $IncidentID = $incident.Id;
    $DateTime   = $incident.TimeCreated;
    $Message    = $incident.Message;
    $ErrorLevel = $incident.Level;
    $totalEventCounter += 1;
    if ($IncidentID -ne $LastEventID)
        $LastEventID = $IncidentID;
        $EntityList = @(); 
        $Entity = [string]::Empty;
        $MinTime = $DateTime;
        $MaxTime = $DateTime;
        $thisEventCounter = 0;
        # ignore previous row
        $rowsToIgnore += $totalEventCounter - 1;
        if ($DateTime -lt $MinTime) { $MinTime = $DateTime; }
        if ($DateTime -gt $MaxTime) { $MaxTime = $DateTime; }
    $thisEventCounter += 1;
    $splitElements = 3;
    if ($IncidentID -in (19406,26070,1480,19420,41061,41075,35201,3303,35206,35202) -and 
      ($Message -like '*The availability group database*' -or
       $Message -like '*availability replica*' -or 
       $Message -like '*availability database*' -or 
       $Message -like '*in database*' -or 
       $Message -like '*in availability group*' -or 
       $Message -like '*for availability group*'))
        if ($IncidentID -eq 3303) { $splitElements = 5; }
        $MessageParts = $Message.Replace("'",'"')|%{$_.split('"',$splitElements)};
        $Entity = $MessageParts[$splitElements - 2];
        $EntityList += $Entity;
        if ($thisEventCounter -gt 1)
            $Message = "$($MessageParts[0])<x>$($MessageParts[2])";
            if ($IncidentID -eq 3303)
                $Message += "<x>$($MessageParts[4])$($MessageParts[5])"
    $EntityList | Sort-Object Value -Unique;
        RowNumber       = $totalEventCounter;
        EventCount      = $thisEventCounter;
        Message         = $Message;
        IncidentID      = $IncidentID;
        ServerName      = $ServerName.ToUpper();
        Source          = "$($incident.LogName) / $($incident.ProviderName)";
        FirstOccurrence = $MinTime;
        LastOccurrence  = $MaxTime;
        Level           = $ErrorLevel;
        Entities        = if ($EntityList.Count -gt 0) { $EntityList }; 
$result | Where-Object -FilterScript {$_.RowNumber -notin $rowsToIgnore} | 
          Select Level,IncidentID,FirstOccurrence,LastOccurrence,EventCount,ServerName,Source,Message |
          Sort-Object FirstOccurrence, ServerName, -EventCount -Unique | Format-Table -AutoSize;
$result | Where-Object -FilterScript {$_.RowNumber -notin $rowsToIgnore} | 
          Select Level,IncidentID,FirstOccurrence,LastOccurrence,EventCount,ServerName,Source,Entities,Message |
          Sort-Object FirstOccurrence, ServerName, EventCount -Unique | Format-List;

This is definitely not built for speed or scalability; not now, and maybe not ever. If it takes 10 minutes to chew through event logs, that's at least 10 minutes I'm not spending manually clicking and scrolling through the event viewer UI. And it should never be that bad unless the target server is underpowered or your search window is too large.

But hopefully you can see where I'm going with GetAllTheErrorLogs.

Aaron Bertrand

I am a passionate technologist with industry experience dating back to Classic ASP and SQL Server 6.5. I am a long-time Microsoft MVP, speak frequently at conferences, and write at SQLPerformance and MSSQLTips. In real life I am a father of two, an architect at Wayfair, and my pronouns are he/him.