Supporting a number of $DAYJOB projects are an ever-increasing number of small-to-midsize tasks performing various bits of processing. These tasks are deployed as Dockerized workloads running on AWS ECS. ECS scheduling is basic but pretty solid, and we’ve encountered few issues; still, odd flukes, network outages, and random errors can occur. We needed visibility into whether tasks were running as scheduled and, most importantly, to be alerted when tasks fail. I’ve finally implemented a pattern I’m mostly satisfied with that gives us that visibility without subjecting everyone to constant alerts.

Solution The Older - Pipe Events in Realtime to Slack

ECS emits a large number of events on all the things happening in and to a cluster and its various services and tasks. These all come through AWS EventBridge, which offers the ability to create filters and rules to take actions on events of interest. My first solution was to create an event rule that looked for all events with the following pattern:

{
  "source": [
    "aws.ecs"
  ],
  "detail-type": [
    "ECS Task State Change"
  ],
  "detail": {
    "clusterArn": [
      "REDACTED_ARN"
    ],
    "lastStatus": [
      "STOPPED"
    ]
  }
}

This captures all the task exit events from our ECS cluster. The rule then invoked an event target that extracted the container name, exit code, and final status message from each task and routed them to SNS, where a Lambda function consumed them and posted to a monitoring Slack channel. Mischief managed, right? Not quite.
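For the curious, the consumer end of that chain is only a handful of lines. Here is a rough sketch of the shape of it, assuming the SNS message carries the raw ECS event detail (rather than the already-extracted fields) and that the Slack webhook URL lives in a SLACK_WEBHOOK_URL environment variable; both are illustrative assumptions, not the actual production code:

import json
import os
import urllib.request


def handler(event, context):
    """Relay ECS task-stop details from an SNS notification to a Slack webhook."""
    for record in event["Records"]:
        detail = json.loads(record["Sns"]["Message"])
        container = detail["containers"][0]
        text = (
            f"Task `{container['name']}` stopped with exit code "
            f"{container.get('exitCode', 'unknown')}: "
            f"{detail.get('stoppedReason', 'no reason given')}"
        )
        req = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],  # assumed env var, not from the real setup
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)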

It turns out that having a variety of tasks running on schedules ranging from weekly to several times a day generates a heck of a lot of events. Even with Slack message formatting to indicate which posts contained alerts and which were all-systems-nominal, there was too much channel noise. It got to the point where few people were monitoring the channel regularly (when I’m inclined to mute a channel, I know it’s far too noisy). Not ideal. Clearly something else was needed.

Solution The Current - Batch Up Low Priority Events

This seemed like something that must have been solved by others, so I asked the super groovy folks over on the AWS channel of the Hangops Slack for ideas. One constraint: as a small team with very little infrastructure, we didn’t have traditional alerting tools or application performance monitoring solutions available. Eventually the clever idea of batching up messages surfaced (apologies for not capturing which Hangops member provided the final inspiration).

The principle is to still consume all the ECS exit events, but only message Slack immediately when there is an abnormal exit. All the successful runs are queued up, and the queued-up messages are periodically sent to Slack as a summary, providing a pilot-light signal that the monitoring process itself is still working while avoiding the dreaded alert fatigue.

Tracking state is a problem, though. One of the major objectives of using containers and Lambda functions is to avoid all that pesky stateful processing. It was time to reach for DynamoDB once again as a simple NoSQL database. After much sketching on graph paper, I created a basic table schema with two record types (sketched just after the list):

  • event_history: Tracks the name of the task and holds a numeric set of all the successful runs. That set contains the Unix timestamps extracted from the raw ECS events.

  • config: A simple structure of the name of a monitored task, the Slack destination for alerts, and the hours of the day when reports should be sent.
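In concrete terms, the items look something like the sketch below. The single-table layout (keyed on record type plus task name), the table name, and the attribute names are all illustrative assumptions on my part, not the exact schema:

# Illustrative item shapes for the two record types; key and attribute
# names are assumptions, not the real table definition.
config_item = {
    "record_type": "config",              # partition key
    "task_name": "nightly-data-sync",     # sort key: the monitored task
    "slack_channel": "#task-monitoring",  # where reports and alerts should go
    "report_hours": {6, 18},              # hours of the day (UTC) to send reports
}

event_history_item = {
    "record_type": "event_history",
    "task_name": "nightly-data-sync",
    "successful_runs": {1714712400, 1714798800},  # Unix timestamps of clean exits
}

A handy detail: the boto3 DynamoDB resource API serializes a Python set of numbers as a DynamoDB number set, which is what makes the ADD-the-timestamp update in the consumer sketch below work.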

Now when an event is received from ECS, a consumer Lambda function checks the exit code. If it indicates an error, the consumer immediately sends an alert to Slack. If the task exited normally, though, the timestamp is pushed to the set in DynamoDB and no alert is fired.
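A hedged sketch of that consumer, assuming the Lambda is wired straight to the EventBridge rule (so the ECS event arrives in event["detail"]) and reusing the illustrative table and attribute names from above:

import json
import os
import urllib.request
from datetime import datetime

import boto3

TABLE = boto3.resource("dynamodb").Table("task-monitoring")  # assumed table name


def handler(event, context):
    detail = event["detail"]
    container = detail["containers"][0]
    exit_code = container.get("exitCode", 1)  # treat a missing exit code as a failure

    if exit_code != 0:
        # Abnormal exit: alert immediately. Same webhook approach as the earlier
        # sketch; SLACK_WEBHOOK_URL is an assumed environment variable.
        text = (
            f":rotating_light: `{container['name']}` exited with code {exit_code}: "
            f"{detail.get('stoppedReason', 'no reason given')}"
        )
        req = urllib.request.Request(
            os.environ["SLACK_WEBHOOK_URL"],
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
        return

    # Normal exit: record the event time as a Unix timestamp. ADD creates the
    # event_history item on first write and appends to the set thereafter.
    stopped_at = int(
        datetime.fromisoformat(event["time"].replace("Z", "+00:00")).timestamp()
    )
    TABLE.update_item(
        Key={"record_type": "event_history", "task_name": container["name"]},
        UpdateExpression="ADD successful_runs :ts",
        ExpressionAttributeValues={":ts": {stopped_at}},
    )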

A report creator function runs periodically (as a Lambda function, not as an ECS task, avoiding ECS-watching-ECS inception issues) and walks all the config records. For each watched task whose configured report hours include the hour the report process is currently running in, the matching event_history record is queried, extracting the number of successful runs and the earliest and latest success dates. This is all packaged up and sent to Slack, and then the matching event_history is purged, providing a clean slate for the next reporting period.
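Sketched in the same illustrative terms (the table name, key layout, and webhook environment variable are all assumptions, and I’ve waved away routing to the per-task Slack destination in favor of a single webhook), the reporter looks roughly like this:

import json
import os
import urllib.request
from datetime import datetime, timezone

import boto3
from boto3.dynamodb.conditions import Key

TABLE = boto3.resource("dynamodb").Table("task-monitoring")  # assumed table name


def post_to_slack(text):
    # Minimal webhook post; SLACK_WEBHOOK_URL is an assumed environment variable.
    req = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


def handler(event, context):
    current_hour = datetime.now(timezone.utc).hour

    # Pull every config record and keep only the tasks due for a report this hour.
    configs = TABLE.query(KeyConditionExpression=Key("record_type").eq("config"))["Items"]

    for config in configs:
        report_hours = {int(h) for h in config.get("report_hours", set())}
        if current_hour not in report_hours:
            continue

        task = config["task_name"]
        history_key = {"record_type": "event_history", "task_name": task}
        history = TABLE.get_item(Key=history_key).get("Item", {})
        runs = sorted(int(ts) for ts in history.get("successful_runs", set()))

        if runs:
            earliest = datetime.fromtimestamp(runs[0], timezone.utc).isoformat()
            latest = datetime.fromtimestamp(runs[-1], timezone.utc).isoformat()
            text = f"{task}: {len(runs)} successful runs, earliest {earliest}, latest {latest}"
        else:
            text = f"{task}: no successful runs recorded this period"

        post_to_slack(text)

        # Purge the history so the next reporting period starts from a clean slate.
        TABLE.delete_item(Key=history_key)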

Status of the Status Pipeline

After creating and debugging a quick bit of DRY-ish Terraform code, I turned this live in production today. Things appear to be running smoothly, and I hope to turn off the old firehose solution later this week. One mostly-positive side effect of this solution is that even tasks that are not explicitly reported on are tracked in the database. Eventually size constraints will force that to be dealt with, but it’s a can I’m comfortable kicking down the road. The consumer/reporter pairing could also easily be extended to send metrics to CloudWatch for trending over time.
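If I do bolt on the CloudWatch idea, it should only take a few more lines in the reporter, something along these lines (the namespace, metric, and dimension names are invented for illustration):

import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical metric publish from the reporter; the names are illustrative,
# and the value would be the run count computed for the report.
cloudwatch.put_metric_data(
    Namespace="TaskMonitoring",
    MetricData=[
        {
            "MetricName": "SuccessfulRuns",
            "Dimensions": [{"Name": "TaskName", "Value": "nightly-data-sync"}],
            "Value": 7,
            "Unit": "Count",
        }
    ],
)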

I also got to spend a good bit of time doing Python development with the ever-so-nifty VS Code as my IDE, kicking the tires on getting a proper development environment set up. Leveraging GitHub Actions for building new deployment artifacts and pushing them into AWS continues to amaze me when it’s all set up and working. Both of those will have to be missives for another day, as this is turning into quite the busman’s holiday!