My goal was finding an AWS service that will answer the following requirements:
- Store large amounts of data (split into small chunks) reliably.
- Accessible from multiple zones/regions.
- Low read/write latencies.
- Cheap enough.
SQS is a powerful service that is really useful for de-coupling between micro-services and allowing reliable transfer of messages between them. SQS promises to store your messages (each message up-to 256KB) for max. 14 days and charge you only for the operations you do (and not for the used space) – about 0.5$ per 1M operations.
The SQS API is very simple and easy to use. Currently there are two types of SQS queues – Regular queues and FIFO queues.
Regular queues should be used where we don’t care about the order so we can handle receiving messages in a different order than they were sent, and we can handle duplications. Regular queues are highly scalable and can give you very good performance.
On the other hand FIFO queues have strict ordering so when we are receiving messages we can rely on getting them in the same order they were sent. Moreover it has a strong guarantees that the same message won’t be received twice. Sounds good right? The down side is that currently its performance is not that great because it has a limit of 300 operations per queue per second (in case we are handling small messages we can use the SendMessageBatch API that can combine up-to 10 messages per single API call so with the limit of 300 operations we can encourage 3000 IOPS).
So far so good, we’ve found a nice service that should serve me well. The problem I have encounter is fulfilling one of the requirements above. When I’m consuming messages from the queue, I must know for sure whenever I’ve consumed all the messages from the queue or not.
SQS gives us an eventual-consistency guarantee so in theory the following scenario is possible:
- Send 100 messages using the SendMessage API
- Keep invoking ReceiveMessage API while there are messages to consume (its response is not empty)
- Get 80 messages and finish because ReceiveMessage returned with empty response
There is no strongly consistent way (using this service alone) to know for sure what is the exact number of messages that we need to consume from the queue. ReceiveMessage can return empty response because the last 20 messages are still in-flight within the service and haven’t propagated to the right place so we cannot read them for now, but there is no way of knowing that on the receiver side so we could wait for them to arrive. From our perspective, we’ve already consumed the whole queue.
This problem can be solved by using another service that can provide strong consistency (for example, DynamoDB) and leverage it for storing the actual number of messages that are in the queue. That way we don’t need to rely on SQS consistency and even if calling ReceiveMessage will result with an empty response, we can still know for sure whenever there should be more messages so we need to keep calling ReceiveMessage until we’ll get them or we’re done and can proceed with our lives.
Using DynamoDB (or other service) can be expensive and add more latency for each operation (we need to maintain a counter on each message we send and receive).
In this post, I would like to suggest another approach. If we can live with the performance of FIFO SQS and can guarantee that while we are receiving messages, no one will send anything to the queue, we can rely on FIFO SQS strict ordering and do the following thing on the receiver side:
- Push a special message to the queue (may contain a unique UUID and a timestamp)
- Start receiving messages from the queue until you’ll reach that special message
- Even if ReceiveMessage returns with empty response, you know that this is due to eventual-consistency delays and you should keep invoking ReceiveMessage
- Once you got that special message, you can be sure that all the messages in the queue were received
In case of a crash in the middle of the process, we can wait until visibility time expires (so the messages will be available in the queue again) and re-run the same process. We’ll push a new unique message and start running over the items in the queue so and in case we’ll find another unique message (probably a left-over from the previous run) we can just ignore it and continue receiving messages until we’ll reach the newest unique message.
This approach can be used in many variations and can really help with consistency issues.
If performance is an issue, you can always try and squeeze some more performance from FIFO SQS for example by using double buffering approach – Instead of using one queue, you can use two queues and distribute the messages to them by a specific policy (round-robin, time based and etc.)
In the worst case, you can just use regular SQS and DynamoDB as a complement 🙂