When designing scalable systems that run on multiple nodes, a common problem is how to handle scheduled tasks that must run on exactly one instance rather than on every instance. While reviewing the architecture of a team working on AWS, I found cron jobs running on multiple nodes and causing duplicated work. The key to solving distributed cron jobs in an AWS scaling architecture is a locking method that guarantees that while one node is running a cron job, no other node can run it. Another approach is to use a centralized task-handling system. I note here some references that might be useful when dealing with this issue.
An implementation on the Scalr system: http://highscalability.com/blog/2010/3/22/7-secrets-to-successfully-scaling-with-scalr-on-amazon-by-se.html
Answers from AWS Staff:
I did a quick poll of some of my colleagues and came up empty on the cron, but after sleeping on it I realised the important step may be limited to locking. So I looked for “distributed cron job locking” and found a reference to Zookeeper, an Apache project.
Also I have seen reference to using memcached or a similar caching mechanism as a way to create locks with a TTL. In this way you set a flag, with a TTL of 300 seconds and no other cron worker will execute the job. The lock will automatically be released after the TTL has expired. This is conceptually very similar to the SQS option we discussed yesterday.
Also see Google’s Chubby: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/chubby-osdi06.pdf
Let me know if this helps, and feel free to ask questions, we are very aware that our services can be complex and daunting to both beginners and seasoned developers alike. We are always happy to offer architecture and best practice advice.
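The TTL-based lock described in the answer above can be sketched roughly as follows. This is only a minimal illustration, not the AWS staff's or Scalr's implementation: `TTLStore` is a hypothetical in-memory stand-in for a shared cache such as memcached, and `run_if_lock_acquired`, the `cron-lock:` key prefix, and the node names are assumptions made for the example. In a real deployment the store would be a cache shared by all nodes, and its "add only if absent" operation would have to be atomic across them.

```python
import time


class TTLStore:
    """Hypothetical in-memory stand-in for a shared cache with TTL support."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry time)

    def add(self, key, value, ttl):
        """Store value only if key is absent or expired. Return True on success."""
        now = self._clock()
        entry = self._data.get(key)
        if entry is not None and entry[1] > now:
            return False  # another worker still holds the flag
        self._data[key] = (value, now + ttl)
        return True


def run_if_lock_acquired(store, job_name, node_id, job, ttl=300):
    """Run job only if this node manages to set the lock flag first."""
    if store.add("cron-lock:" + job_name, node_id, ttl):
        job()
        return True
    # The flag is deliberately not deleted after the job finishes: it simply
    # expires after ttl seconds, as in the approach described above.
    return False
```

With this sketch, each node's cron entry would call `run_if_lock_acquired(...)` instead of calling the job directly. The first node to set the flag runs the job, and the others skip it until the TTL expires. The TTL should be shorter than the cron interval, so the next scheduled run is not blocked, and comfortably longer than the clock skew between nodes, so a node firing slightly late still sees the flag.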