Using Amazon SNS to build fault tolerant and resilient solutions

Mar 21, 2023

As most of us know, Amazon SNS (Simple Notification Service) is widely used for sending notifications in different formats, either application-to-person (A2P) or application-to-application (A2A).

Popular use cases where SNS, as a fully managed service, is a good fit include:

Notifying teams about system failures
Notifying users of pending actions or status updates
Broadcasting campaigns (email/sms/mobile)
Triggering cross-application processes

One lesser-used capability of Amazon SNS is its ability to build fault tolerance and resiliency into your application. As our applications grow more complex and decoupled, using multiple AWS services (each doing its specialized job), it becomes difficult to build resilience into all the components and their interactions with each other.

SNS plays a very important role here. Let’s start with features you might already know, but whose other side you might not have explored:

1. Inbuilt retries

With a clearly defined “delivery policy” for each delivery protocol, SNS ensures that enough retries (maybe more than enough) are made when subscribers are down or under maintenance/upgrade.

The retry policies for each delivery protocol (at the time of writing) are tabulated in the AWS documentation:

Source: https://docs.aws.amazon.com/sns/latest/dg/sns-message-delivery-retries.html

2. Support for custom retry policies (HTTP/S only)

SNS doesn’t stop at providing built-in delivery policies; it lets you create your own delivery policy for HTTP/S endpoints. This means you can build your own retry mechanism with back-offs, delayed retries, throttling control, etc. through simple configuration.
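To make the configuration idea concrete, here is a minimal sketch of a custom delivery policy for an HTTP/S subscription. The function names, default values, and the subscription ARN you would pass in are illustrative assumptions; the policy keys and the `set_subscription_attributes` call follow the SNS delivery policy format as I understand it, so check the limits against the current AWS documentation before relying on them.

```python
import json


def build_delivery_policy(num_retries=5, min_delay=5, max_delay=60,
                          backoff="exponential", max_per_second=10):
    """Build an SNS HTTP/S delivery policy as a plain dict (hypothetical helper)."""
    # Assumed limits: delays are in seconds and capped at 3600.
    if not 1 <= min_delay <= max_delay <= 3600:
        raise ValueError("need 1 <= min_delay <= max_delay <= 3600")
    if backoff not in ("linear", "arithmetic", "geometric", "exponential"):
        raise ValueError("unknown backoff function: " + backoff)
    return {
        "healthyRetryPolicy": {
            "numRetries": num_retries,        # total retry attempts
            "numNoDelayRetries": 0,           # immediate retries
            "numMinDelayRetries": 0,          # retries at min_delay before backoff
            "numMaxDelayRetries": 0,          # retries at max_delay after backoff
            "minDelayTarget": min_delay,
            "maxDelayTarget": max_delay,
            "backoffFunction": backoff,
        },
        # Throttle deliveries so a recovering endpoint is not flooded.
        "throttlePolicy": {"maxReceivesPerSecond": max_per_second},
    }


def apply_delivery_policy(subscription_arn, policy):
    """Attach the policy to an existing HTTP/S subscription."""
    import boto3  # imported lazily so the builder works without AWS access

    sns = boto3.client("sns")
    sns.set_subscription_attributes(
        SubscriptionArn=subscription_arn,
        AttributeName="DeliveryPolicy",
        AttributeValue=json.dumps(policy),
    )
```

Calling `apply_delivery_policy(arn, build_delivery_policy())` against a real HTTP/S subscription would then replace the default retry behaviour for that subscription only.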

3. Delivery status tracking along with failed messages

SNS provides delivery status logging, which can help you identify whether the message reached SNS, whether SNS was able to deliver the message to the endpoint, the response from the endpoint, and the dwell time. You can also attach a dead-letter queue to the subscription so that it stores all messages that couldn’t be delivered (even after the multiple attempts mentioned in point 1). This tracking information and the dead-letter queue can be used to identify failures and delays, and to set up corresponding remediation easily.
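As a rough sketch, both pieces can be configured with two kinds of attributes: a `RedrivePolicy` on the subscription (pointing at an SQS dead-letter queue) and feedback attributes on the topic (for delivery status logging). The helper names and all ARNs are hypothetical; the example assumes an HTTP/S subscriber, an existing SQS queue whose access policy allows SNS to send to it, and an IAM role that lets SNS write to CloudWatch Logs.

```python
import json


def build_redrive_policy(dlq_arn):
    """JSON value for the subscription's RedrivePolicy attribute."""
    return json.dumps({"deadLetterTargetArn": dlq_arn})


def build_http_status_logging_attributes(role_arn, success_sample_rate=100):
    """Topic attributes that enable delivery status logging for HTTP/S."""
    if not 0 <= success_sample_rate <= 100:
        raise ValueError("sample rate must be a percentage between 0 and 100")
    return {
        "HTTPSuccessFeedbackRoleArn": role_arn,
        "HTTPFailureFeedbackRoleArn": role_arn,
        # Percentage of successful deliveries to log.
        "HTTPSuccessFeedbackSampleRate": str(success_sample_rate),
    }


def configure_tracking(topic_arn, subscription_arn, dlq_arn, role_arn):
    """Attach a dead-letter queue and turn on delivery status logging."""
    import boto3  # imported lazily so the builders work without AWS access

    sns = boto3.client("sns")
    sns.set_subscription_attributes(
        SubscriptionArn=subscription_arn,
        AttributeName="RedrivePolicy",
        AttributeValue=build_redrive_policy(dlq_arn),
    )
    for name, value in build_http_status_logging_attributes(role_arn).items():
        sns.set_topic_attributes(
            TopicArn=topic_arn, AttributeName=name, AttributeValue=value
        )
```

Messages that exhaust their retries would then land in the queue, where a separate consumer or alarm can pick them up for remediation.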

Now let’s explore how SNS can improve your application/system resilience using the above-mentioned features (and other capabilities).

1. Building highly available and fault tolerant cross-service integrations

A complex system/application has multiple cross-component integrations, and each component might be built using a different AWS service (or 3rd-party service). This brings a lot of overhead: building many heterogeneous interfaces and bringing high availability/fault tolerance to each of them. SNS comes to the rescue here by providing the ability to receive event-driven notifications and perform the required fan-out between:

Almost all major AWS services
3rd Party Services
Custom/On-Premise services
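A minimal sketch of such a fan-out, assuming one topic and a mixed set of subscribers (an SQS queue, a Lambda function, and a 3rd-party or on-premise HTTPS endpoint). The helper names and the way the protocol is inferred from the endpoint string are illustrative assumptions, not an SNS feature; only the `subscribe` call itself is part of the SNS API.

```python
def protocol_for(endpoint):
    """Guess the SNS protocol from the endpoint string (illustrative heuristic)."""
    if endpoint.startswith("arn:aws:sqs:"):
        return "sqs"
    if endpoint.startswith("arn:aws:lambda:"):
        return "lambda"
    if endpoint.startswith("https://"):
        return "https"
    if endpoint.startswith("http://"):
        return "http"
    raise ValueError("unsupported endpoint: " + endpoint)


def plan_fanout(topic_arn, endpoints):
    """Return the keyword arguments for one subscribe call per endpoint."""
    return [
        {"TopicArn": topic_arn, "Protocol": protocol_for(e), "Endpoint": e}
        for e in endpoints
    ]


def subscribe_all(topic_arn, endpoints):
    """Subscribe every endpoint to the topic and return the subscription ARNs."""
    import boto3  # imported lazily so the planning helpers work without AWS access

    sns = boto3.client("sns")
    return [
        sns.subscribe(ReturnSubscriptionArn=True, **kwargs)["SubscriptionArn"]
        for kwargs in plan_fanout(topic_arn, endpoints)
    ]
```

The publisher only ever talks to the topic; each subscriber then gets its own retries and dead-letter queue, so one failing consumer does not block the others.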

2. Ability to route business-logic and code-level failures to corresponding handlers for remediation

All systems (especially complex ones) have a lot of business logic and a lot of code to support it, so things sometimes don’t go as expected and failures are inevitable. Instead of building repeated error-handling scenarios, these errors can be tagged with custom attributes and sent to SNS, which will filter and route them to the relevant error-handling endpoints using subscription filter policies.

For more details on SNS subscription filter policies, refer to “Amazon SNS message filtering” in the AWS documentation (docs.aws.amazon.com).

This can be used to trigger specific endpoints based on the type of failure, for example:

1. Inform the business team to fix the data and possibly re-run the failed process after resolution

Email ID is missing or incorrect

2. Inform the 3rd-party system about the failed request and ask for remediation

Partner API is down / partner endpoint not responding as expected, etc.
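The two failure types above could be routed with one topic and one filter policy per handler, roughly as sketched here. The attribute name `failure_type`, its values, and the handler names are assumptions made for illustration; `matching_handlers` is only a simplified local model of exact string matching, not the SNS filtering engine itself.

```python
import json

# One filter policy per error handler (names and values are hypothetical).
FILTER_POLICIES = {
    "business_team": {"failure_type": ["invalid_data"]},
    "partner_system": {"failure_type": ["partner_api_down"]},
}


def failure_attributes(failure_type):
    """Message attributes that tag a published error with its type."""
    return {"failure_type": {"DataType": "String", "StringValue": failure_type}}


def matching_handlers(failure_type):
    """Local illustration: which handlers' policies accept this failure type."""
    return sorted(
        name
        for name, policy in FILTER_POLICIES.items()
        if failure_type in policy["failure_type"]
    )


def subscribe_handler(topic_arn, protocol, endpoint, handler):
    """Subscribe an endpoint so it only receives its handler's failure types."""
    import boto3  # imported lazily so the helpers above work without AWS access

    sns = boto3.client("sns")
    return sns.subscribe(
        TopicArn=topic_arn,
        Protocol=protocol,
        Endpoint=endpoint,
        Attributes={"FilterPolicy": json.dumps(FILTER_POLICIES[handler])},
    )


def publish_failure(topic_arn, failure_type, detail):
    """Publish an error; SNS delivers it only to matching subscriptions."""
    import boto3

    sns = boto3.client("sns")
    return sns.publish(
        TopicArn=topic_arn,
        Message=detail,
        MessageAttributes=failure_attributes(failure_type),
    )
```

With this layout, adding a new failure category means adding one filter policy and one subscription, while the code that detects and publishes errors stays the same.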

Conclusion: Amazon SNS is not restricted to sending notifications; it can be used in many different scenarios to improve the fault tolerance and resilience of any complex system.