Queue & Worker Roundup
Almost every dating site we’ve built has a job queue of some kind. We enqueue email sends, statistic updates, logging, fraud detection and more. Anything that doesn’t immediately impact the response we’re preparing for the user can be run in the background.
Over the years we’ve tried several different queue/worker systems and we’d like to share our findings to help others decide on a technology.
The solutions distinguish themselves in a few key areas:
There are several separate scalability concerns to consider, such as the throughput of the queue, the number of workers it can support and the maximum queue size. Some solutions are poorly suited to large numbers of workers or large queues.
Sending a password reset email shouldn’t wait until a batch of hundreds of thousands of “Your Daily Picks” emails has been processed. The queue solution should therefore provide a way to give a job a numeric priority or to split jobs into multiple queues. Named “low_priority” and “high_priority” queues are often sufficient, with workers only proceeding with low-priority work when the high-priority queue is empty. For a time we had each worker dedicated to a particular queue, with the number of workers for each priority tweaked as needed, but this required too much attention to keep running smoothly.
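As a rough illustration of the “check high priority first” rule, here is a minimal Python sketch. It assumes two in-memory queues standing in for a real queue server; `enqueue`, `next_job` and the job names are hypothetical and not part of any library discussed here.

```python
from collections import deque

# Queues are listed from most to least urgent. These in-memory deques are
# stand-ins for queues that would normally live in a queue server.
QUEUE_ORDER = ["high_priority", "low_priority"]
queues = {name: deque() for name in QUEUE_ORDER}

def enqueue(queue_name, job):
    """Append a job to the end of the named queue."""
    queues[queue_name].append(job)

def next_job():
    """Return the next job from the first non-empty queue, or None if all are empty."""
    for name in QUEUE_ORDER:
        if queues[name]:
            return queues[name].popleft()
    return None

# A large low-priority batch is enqueued before an urgent job arrives.
enqueue("low_priority", "daily_picks:1")
enqueue("low_priority", "daily_picks:2")
enqueue("high_priority", "password_reset:42")

processed = []
while (job := next_job()) is not None:
    processed.append(job)
print(processed)
```

Because every worker runs the same loop, there is no per-queue worker count to tune; the trade-off is that low-priority jobs can be starved if the high-priority queue never empties.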
Some solutions provide web-based administration tools to check queue lengths, inspect the contents of the queue and resubmit failed jobs. Others have only a console interface, and strictly enforce that only the top element of the queue can be examined, the remainder being opaque.
Not all jobs should be executed immediately. We might schedule fraud analysis for a user several hours after signup when there is more data to analyze, or schedule a retry for a credit card transaction. Not all solutions support scheduling tasks.
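One common way to model delayed jobs is a heap ordered by the time each job becomes runnable. The sketch below is an assumption-laden illustration, not how any particular product implements scheduling; `DelayedQueue`, `run_at` and the clock values are made up for the example.

```python
import heapq
import itertools

class DelayedQueue:
    """Jobs become available only once their run_at time has passed."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so jobs with the same run_at come out in insertion order.
        self._counter = itertools.count()

    def enqueue(self, job, run_at):
        heapq.heappush(self._heap, (run_at, next(self._counter), job))

    def pop_ready(self, now):
        """Return the earliest job whose run_at <= now, or None if nothing is due."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[2]
        return None

q = DelayedQueue()
signup_time = 1_000  # arbitrary clock value in seconds, for illustration only
q.enqueue("welcome_email", run_at=signup_time)
q.enqueue("fraud_analysis", run_at=signup_time + 3 * 3600)  # three hours later

first = q.pop_ready(now=signup_time)             # the welcome email is due now
too_early = q.pop_ready(now=signup_time + 60)    # fraud analysis is not due yet
later = q.pop_ready(now=signup_time + 4 * 3600)  # now it is
```

Passing `now` in explicitly keeps the example deterministic; a real worker would read the clock instead.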
You might be tempted to rely on a series of cron jobs to handle this. It has been our experience that this scales poorly. You cannot easily have multiple workers and must protect against a long-running batch “running into” the next scheduled run of the cron job.
The queue server can be a single point of failure. We need to be careful that we don’t lose jobs and can continue processing in the event of failure.
Beanstalk is very fast and simple and has many great features not seen in other solutions. It supports clients in many languages and makes it easy to write your own workers. It supports priorities and “burying” failed jobs, which sets them aside until they are explicitly kicked back into the queue. Jobs in progress aren’t actually popped off the queue, but are rather marked as being in progress (reserved). If the job isn’t completed in a given amount of time it is released and can be processed by another worker. Beanstalk behaves much like a transactional database in this regard. Jobs can also be delayed so they run in the future.
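To make the “marked as in progress, with a time limit” idea concrete, here is a toy Python model. It is a sketch of the semantics only and does not speak the beanstalkd protocol; the class name `ReservingQueue`, its methods and the `ttr` parameter are assumptions chosen for the example.

```python
class ReservingQueue:
    """Toy model of reserve/delete/bury semantics with a time limit per job."""

    def __init__(self, ttr):
        self.ttr = ttr       # seconds a worker has to finish a reserved job
        self.ready = []      # job ids waiting to run, oldest first
        self.reserved = {}   # job id -> deadline for the worker holding it
        self.buried = []     # failed jobs set aside for later inspection

    def put(self, job_id):
        self.ready.append(job_id)

    def _release_expired(self, now):
        # Any reservation past its deadline goes back to the ready list.
        for job_id, deadline in list(self.reserved.items()):
            if deadline <= now:
                del self.reserved[job_id]
                self.ready.append(job_id)

    def reserve(self, now):
        """Hand out the oldest ready job without removing it from the system."""
        self._release_expired(now)
        if not self.ready:
            return None
        job_id = self.ready.pop(0)
        self.reserved[job_id] = now + self.ttr
        return job_id

    def delete(self, job_id):
        """Called by a worker on success; only now is the job really gone."""
        self.reserved.pop(job_id, None)

    def bury(self, job_id):
        """Called by a worker on failure; the job is set aside, not lost."""
        if self.reserved.pop(job_id, None) is not None:
            self.buried.append(job_id)

q = ReservingQueue(ttr=30)
q.put("email:1")
a = q.reserve(now=0)    # worker A takes the job
b = q.reserve(now=10)   # nothing available: the job is still in progress
c = q.reserve(now=31)   # worker A missed its deadline, so another worker gets it
q.delete(c)             # the second worker finishes and deletes the job
d = q.reserve(now=32)   # the queue is now empty
```

The key property is that a crashed worker never loses a job: the job only disappears when a worker explicitly deletes it.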
We use Beanstalk on Ashley Madison and we’re very happy with it. Unfortunately, there isn’t a good redundancy solution. Without shared storage and a failover system of some sort you are looking at a single point of failure. There is also no official admin console, although there is a third-party project.
In theory, RabbitMQ should scale better and with better redundancy than any other option because Erlang can easily scale across multiple machines. The distributed database should prevent any data loss and provide high availability.
One downside is that it’s a “formal” queue. You can look at and pop the topmost job, but you can’t get a sense of what’s in your queue. AMQP doesn’t let you check the queue length, but there are RabbitMQ-specific admin consoles that work around the issue.
There’s no support for numeric priorities or delayed jobs, nor any transactional support. It’s up to the developer to elegantly handle failures and avoid losing jobs.
The biggest “gotcha” we’ve run into is that performance degrades very quickly as queue size increases. Memory use is extremely high, so it’s quite easy to end up with a large queue. In fact, memory use is often more than 10x the size of the same data in Beanstalk or Resque because Erlang has no native string type and represents a string as a linked list of integers. On a 64-bit machine, each character then consumes 16 bytes. We found that if the queue size exceeded 1GB we were in for a world of pain. We no longer use RabbitMQ and don’t recommend it under any circumstances.
DelayedJob is a Ruby library, so it’s only an option if you are using Ruby. DelayedJob stores its queue in a single table in MySQL. If you have an existing Ruby on Rails application using MySQL it’s extremely easy to add to your app. You get all the features of MySQL, specifically transactions and master/slave replication for redundancy. You can delay a job or have it run immediately, and there’s a third-party admin console.
If your queue shares a database with your application, large volumes of inserts and deletes to manage the queue will dominate the replication queue, causing delays on slaves. Also, the workers poll the database server for jobs and each poll can involve a full table scan. With some schema changes this can be avoided; nevertheless, large volumes with many workers do not work well with DelayedJob.
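The polling pattern and the kind of schema change that helps can be sketched with a small table-backed queue. This is a hedged illustration using Python’s built-in SQLite rather than MySQL, and it is not DelayedJob’s actual schema; the `jobs` table, its columns, the `idx_jobs_poll` index and `claim_next` are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        run_at INTEGER NOT NULL,   -- earliest time the job may run
        locked_by TEXT             -- NULL until a worker claims the job
    )
""")
# An index matching the polling query, so each poll should not need to
# scan every row in the table.
conn.execute("CREATE INDEX idx_jobs_poll ON jobs (locked_by, run_at)")

def claim_next(conn, worker, now):
    """Claim the oldest runnable, unclaimed job for `worker`; return its row or None."""
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute(
            "SELECT id, payload FROM jobs "
            "WHERE locked_by IS NULL AND run_at <= ? "
            "ORDER BY run_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # The extra "locked_by IS NULL" guard means a job another worker
        # grabbed in the meantime is not claimed twice.
        cur = conn.execute(
            "UPDATE jobs SET locked_by = ? WHERE id = ? AND locked_by IS NULL",
            (worker, row[0]),
        )
        return row if cur.rowcount == 1 else None

conn.executemany(
    "INSERT INTO jobs (payload, run_at) VALUES (?, ?)",
    [("email:1", 100), ("email:2", 50), ("retry:3", 500)],
)
conn.commit()

first = claim_next(conn, "worker-a", now=200)   # oldest runnable job
second = claim_next(conn, "worker-b", now=200)  # next runnable job
third = claim_next(conn, "worker-c", now=200)   # retry:3 is not due yet
```

Even with an index, every idle worker still issues a query on each poll, which is why this approach tends to strain as the number of workers grows.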
While we no longer use DelayedJob due to our volume, it’s fine for small sites.
Resque was written by the folks at GitHub and is heavily inspired by DelayedJob. It’s a Ruby library, but there is a port for CoffeeScript. Resque stores its queue in Redis. While Beanstalk and RabbitMQ are not likely to see use in your app other than as a job store, Redis is a great key-value store that can be used for logging, session storage and more.
We found that Resque was the “thin edge of the wedge” of Redis for us. Once Redis was in our environment we began to use it for storing other non-relational data. Redis can be set up to do master-slave replication, so Resque has redundancy covered too.
Resque has an “official” admin console that is a pleasure to use. You can easily monitor queue length, view jobs and resubmit failed tasks.
Resque really shines with its plugins. They include support for priority queues, batching, forking and more. A few really stand out as “must haves.”
Resque-Scheduler not only allows jobs to be delayed until a specific time, it can also replace cron functionality. Recurring jobs are added to a queue of your choice and then executed as normal, making it easy to spread out cron work across multiple servers.
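The idea of turning cron entries into ordinary queued jobs can be sketched in a few lines of Python. This is not Resque-Scheduler’s API or configuration format; `SCHEDULE`, `tick` and the job names are hypothetical, and the queues are in-memory stand-ins.

```python
from collections import deque

# Hypothetical schedule: job name -> (interval in seconds, target queue).
SCHEDULE = {
    "rebuild_stats": (3600, "low_priority"),
    "expire_sessions": (600, "high_priority"),
}

queues = {"high_priority": deque(), "low_priority": deque()}
next_run = {name: 0 for name in SCHEDULE}  # every job is first due at time 0

def tick(now):
    """Enqueue every recurring job whose next run time has arrived."""
    for name, (interval, queue_name) in SCHEDULE.items():
        if now >= next_run[name]:
            queues[queue_name].append(name)
            next_run[name] = now + interval

# A single scheduler process calls tick(); any worker on any server can
# then pick the jobs up from the queues like any other work.
tick(0)
tick(600)
tick(3600)
```

Only the scheduler needs to know about timing. The work itself runs wherever a worker is free, and a slow run simply leaves the next occurrence waiting in the queue.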
For many small sites, the only use for a job queue is to send email. Resque-Mailer integrates with Rails’ ActionMailer and makes background email sends all but transparent.
In summary, if you need something lightweight with support for many languages and are willing to accept a single point of failure then Beanstalk is a great choice. If you are using Ruby, we highly recommend using Resque. The most important thing to remember is you do need a job queue. Any job queue is likely better than none at all.