It is definitely hard to get right. I am actually planning on taking what I have learned on this and releasing it soon. And a lot of cookie-cutter solutions out there don't scale.
But as with a lot of things, the difficulty is less in the outline than in the details, and that is true here. If you do get it right, you get a work queue where jobs disappear properly when completed and, if they fail, remain in the queue to be retried.
The fact is, work queues are hard to get right on any technology. You have a ridiculous number of corner cases and failure conditions you need to account for and the question ends up coming down to what you want the default failure state to be.
PgQ, part of Skype's SkyTools package, is a good example of queues in Postgres. It's used by their Londiste, a replication tool that was developed before built-in replication was mature.
But you have a fairly major problem with all these solutions, and that is deciding when a message leaves the queue. For a message queue, you want it to leave the queue when delivered. For a job queue, you want it to leave the queue when the work is completed, but not to be handed out again in the meantime. Now imagine you have processing jobs that take two weeks with 16 cores to complete.
That's one thing that makes this a hard problem. If you hold a transaction open for two weeks so the job row is deleted on commit, then autovacuum is going to be totally ineffective. So failure detection and recovery is an important (and probably the hardest) part of the problem.
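One common way to get "stays in the queue until completed" without a weeks-long transaction is to claim the job in a short transaction and delete it in another. This is a minimal sketch, not any particular library's schema; the table and column names are hypothetical, and `FOR UPDATE SKIP LOCKED` assumes Postgres 9.5 or later:

```sql
-- Hypothetical jobs table; names are illustrative.
CREATE TABLE jobs (
    id         bigserial PRIMARY KEY,
    payload    jsonb NOT NULL,
    status     text  NOT NULL DEFAULT 'pending',  -- pending | running
    claimed_at timestamptz
);

-- Claim one job in a short transaction and commit immediately.
-- The row stays in the table (status = 'running') but is skipped
-- by other workers, so it is not handed out twice.
UPDATE jobs
   SET status = 'running', claimed_at = now()
 WHERE id = (SELECT id FROM jobs
              WHERE status = 'pending'
              ORDER BY id
              FOR UPDATE SKIP LOCKED
              LIMIT 1)
RETURNING id, payload;

-- Only on successful completion -- possibly weeks later, in a
-- brand-new transaction -- does the job actually leave the queue:
DELETE FROM jobs WHERE id = $1;
```

The catch, of course, is that this alone cannot tell a job that is still running from one whose worker died, which is exactly the failure-detection problem described above.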
PgQ looks to me like it is a solution to a message queue problem not a work queue problem.
IIRC (it's been a while), PgQ doesn't keep transactions open while events are processed: events are fetched and marked as in process; they're then marked succeeded or failed to finish the event. The bulk copy of initial data which bootstraps replication is a long duration process: I don't believe transactions are held open this entire time.
If I understand you correctly, this handles your work queue case.
In my experience you have several critical issues:
1. What happens when a job silently fails?
2. What happens when a job takes a lot longer than expected to succeed?
If you solve the first with a timeout, the second causes the job to be rerun while the original is still in flight. The best (only?) solution I have found is to give the job queue some awareness of whether the job is actually being processed right now. In my previous work we used advisory locks for that.
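A rough sketch of the advisory-lock approach, assuming a hypothetical `jobs` table with `id` and `status` columns (the table is illustrative; the `pg_try_advisory_lock` / `pg_advisory_unlock_all` functions are real Postgres built-ins). Session-level advisory locks are released automatically when the connection dies, which is what makes them useful as a liveness signal:

```sql
-- Worker: after claiming job $1, take a session-level advisory
-- lock keyed on its id, and hold it for the life of the job.
-- If the worker crashes or its connection drops, the lock is
-- released automatically by the server.
SELECT pg_try_advisory_lock(id) FROM jobs WHERE id = $1;

-- Reaper (run periodically): a 'running' job whose advisory lock
-- can be taken means its worker is gone, so requeue it. A job
-- whose lock is held is genuinely in progress -- no timeout needed,
-- however long it runs.
UPDATE jobs
   SET status = 'pending', claimed_at = NULL
 WHERE status = 'running'
   AND pg_try_advisory_lock(id);

-- The reaper acquired a lock for each row it reset (and for any
-- row it merely probed successfully), so drop them all afterwards.
SELECT pg_advisory_unlock_all();
```

This distinguishes "slow" from "dead" without guessing at a timeout: the lock outlives any fixed deadline but not the worker's connection.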
It wasn't clear to me how closely you've looked at PgQ. Have you looked into the design (other than the README), or used it and found these failings? I'm certainly not going to be able to answer your questions off the top of my head given the time passed since I last used it.
Given your critiques of everything else out there (from what I gather from the rest of your comments in this thread), it seems like you've identified a possible business opportunity.
It's been a little while, but I actually read through the source code of both PgQ and Londiste. It's possible I missed something, but I didn't see anything that would automatically reset messages if a connection goes away between receiving the message and marking it as completed.
BTW, one example of a long-running task would be a genome assembly task. You take in raw reads in a tar of fastq files, turn it into a BAM file, and from there find overlapping read sections so you can detect contiguous read zones, and then determine the consensus nucleotide sequence in each contiguous read zone.
What happens if someone trips over a power cord after a week of this job running? How do you detect and recover? What if, due to the size of data passed in, the job takes a week longer than expected?
This is why you cannot always reduce a work queue to a message queue.
Not sure on the timeline yet. I have not had the time to actually write the first generation of the code. Here was my initial announcement (old, I know, but I haven't committed the code yet and need to do that): http://ledgersmbdev.blogspot.se/2016/08/forthcoming-new-scal...