It is definitely hard to get right. I am actually planning on taking what I have learned on this and releasing it soon. And a lot of cookie-cutter solutions out there don't scale.
But as with a lot of things, the difficulty is less in the outline than in the details, and that is true here. If you do get it right, you get a work queue where jobs disappear properly when completed and, if they fail, remain in the queue to be retried.
The fact is, work queues are hard to get right on any technology. You have a ridiculous number of corner cases and failure conditions you need to account for and the question ends up coming down to what you want the default failure state to be.
PgQ, part of Skype's SkyTools package, is a good example of queues in Postgres. It's used by their Londiste, a replication tool that was developed before built-in replication was mature.
But you have a fairly major problem with all these solutions, and that is deciding when a message leaves the queue. For a message queue, you want it to leave the queue when delivered. For a job queue, you want it to leave the queue when the work is completed, but not to be handed out again in the meantime. Now imagine you have processing jobs that take two weeks with 16 cores to complete.
That's one thing that makes this a hard problem. If you hold a transaction open for two weeks so the job row is deleted on commit, then autovacuum is going to be totally ineffective. So failure detection and recovery is an important (and probably the hardest) part of the problem.
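One common way to get "stays in the queue until completed" without a weeks-long transaction is to claim the job in a short transaction and delete it in another. This is a minimal sketch, not any particular library's schema; the table and column names are hypothetical, and `FOR UPDATE SKIP LOCKED` assumes Postgres 9.5 or later:

```sql
-- Hypothetical jobs table; names are illustrative.
CREATE TABLE jobs (
    id         bigserial PRIMARY KEY,
    payload    jsonb NOT NULL,
    status     text  NOT NULL DEFAULT 'pending',  -- pending | running
    claimed_at timestamptz
);

-- Claim one job in a short transaction and commit immediately.
-- The row stays in the table (status = 'running') but is skipped
-- by other workers, so it is not handed out twice.
UPDATE jobs
   SET status = 'running', claimed_at = now()
 WHERE id = (SELECT id FROM jobs
              WHERE status = 'pending'
              ORDER BY id
              FOR UPDATE SKIP LOCKED
              LIMIT 1)
RETURNING id, payload;

-- Only on successful completion -- possibly weeks later, in a
-- brand-new transaction -- does the job actually leave the queue:
DELETE FROM jobs WHERE id = $1;
```

The catch, of course, is that this alone cannot tell a job that is still running from one whose worker died, which is exactly the failure-detection problem described above.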
PgQ looks to me like it is a solution to a message queue problem not a work queue problem.
IIRC (it's been a while), PgQ doesn't keep transactions open while events are processed: events are fetched and marked as in process; they're then marked succeeded or failed to finish the event. The bulk copy of initial data which bootstraps replication is a long duration process: I don't believe transactions are held open this entire time.
If I understand you correctly, this handles your work queue case.
In my experience you have several critical issues:
1. What happens when a job silently fails?
2. What happens when a job takes a lot longer than expected to succeed?
If you solve the first with a timeout, the second causes the job to be rerun while the original is still in flight. The best (only?) solution I have found is to give the job queue some awareness of whether the job is actually being processed right now. In my previous work we used advisory locks for that.
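A rough sketch of the advisory-lock approach, assuming a hypothetical `jobs` table with `id` and `status` columns (the table is illustrative; the `pg_try_advisory_lock` / `pg_advisory_unlock_all` functions are real Postgres built-ins). Session-level advisory locks are released automatically when the connection dies, which is what makes them useful as a liveness signal:

```sql
-- Worker: after claiming job $1, take a session-level advisory
-- lock keyed on its id, and hold it for the life of the job.
-- If the worker crashes or its connection drops, the lock is
-- released automatically by the server.
SELECT pg_try_advisory_lock(id) FROM jobs WHERE id = $1;

-- Reaper (run periodically): a 'running' job whose advisory lock
-- can be taken means its worker is gone, so requeue it. A job
-- whose lock is held is genuinely in progress -- no timeout needed,
-- however long it runs.
UPDATE jobs
   SET status = 'pending', claimed_at = NULL
 WHERE status = 'running'
   AND pg_try_advisory_lock(id);

-- The reaper acquired a lock for each row it reset (and for any
-- row it merely probed successfully), so drop them all afterwards.
SELECT pg_advisory_unlock_all();
```

This distinguishes "slow" from "dead" without guessing at a timeout: the lock outlives any fixed deadline but not the worker's connection.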
It wasn't clear to me how closely you've looked at PgQ. Have you looked into the design (other than the README), or used it and found these failings? I'm certainly not going to be able to answer your questions off the top of my head given the time passed since I last used it.
Given your critiques of everything else out there (from what I gather from the rest of your comments in this thread), it seems like you've identified a possible business opportunity.
It's been a little while, but I actually read through the source code of both PgQ and Londiste. It's possible I missed something, but I didn't see anything that would automatically reset messages if a connection goes away between receiving the message and marking it as completed.
BTW, one example of a long-running task would be a genome assembly task. You take in raw reads in a tar of fastq files, turn it into a BAM file, and from there find overlapping read sections so you can detect contiguous read zones, and then determine the consensus nucleotide sequence in each contiguous read zone.
What happens if someone trips over a power cord after a week of this job running? How do you detect and recover? What if, due to the size of data passed in, the job takes a week longer than expected?
This is why you cannot always reduce a work queue to a message queue.
Not sure on the timeline yet. I have not had the time to actually write the first generation of the code. Here was my initial announcement (old, I know, but I haven't committed the code yet and need to do that): http://ledgersmbdev.blogspot.se/2016/08/forthcoming-new-scal...