with regard to the epoll-and-accept method - why is the LIFO queue bad? sure, you're seeing more load on one core, but it must be finishing each request before it starts working on the next one - i.e. not causing requests to stall as a result of the imbalance.
put that in the context of a multi-socket system, then core-caching and NUMA issues (even turbo clocking on an individual core) would seem to imply that LIFO is in fact exactly what you want, no?
Imagine an HTTP server supporting HTTP keep-alives. Say 10 new connections come in, and say on average they all go to a single worker. Then say all of the connections request a heavy asset.
You will end up with one worker handling these 10 connections/requests and spinning at 100% CPU while the other workers idle. I'm not saying this bad load balancing is a problem affecting everyone; it very much depends on your load pattern.
If the first worker isn't actually idle, then the scheduler will assign the next incoming request to an idle worker. Am I mistaken? If not, what's the problem here?
Load assignment, with this architecture, happens only when a connection is first opened. HTTP keep-alive means there's a disconnect between when it's opened, and when it becomes expensive.
I.e. it's possible for one worker to first serve ten tiny requests (e.g. index.html), then wait while the clients chew on it, then have all ten clients simultaneously request a large asset.
I am not sure it's possible to solve this problem generally. Doing so would require the kernel be able to predict the future.
Also, IIRC this is a proxy. If you run out of CPU copying data between file descriptors before you run out of bandwidth, I'd be very surprised. I think sendfile(2) makes it especially cheap.
Sure there is. Pass around the connection socket as needed, or have a layer in front of the processing layer to only hand out the work when it is good to go.
Can you point to a working example of this that actually solves the problem? i.e., that is demonstrably more efficient for any given request than the FIFO wakeup method?
That's why I said "it's not possible to solve this problem generally." That is, there's no general solution to the problem, one that is optimal for all workloads.
As already pointed out, I imagine there could be some subtle benefit to delivering a majority of connections to a single process while it isn't overloaded, so even perfectly even load balancing can have some disadvantages.
majke, the benchmark you did with SO_REUSEPORT - was that against a NUMA system / recent kernel? I've completely lost track of it, but I think I remember reading that there has been some work on mapping NIC queues, cores, and workers together, allowing a flow to be processed on the same core as its worker. I'm just curious if the benchmark shown was able to take advantage of that or not (and, as an aside, how much of a benefit mapping these together really has).
http isn't particularly my thing, so if i'm wrong, please correct me! however, i was of the impression that a keep-alive session wouldn't return to the accept stage - the socket is still open, surely?
Correct. The point is the worker may be relatively idle when it accept()s plenty of connections; the load only happens after that, when the requests on those connections start to flow.
You can imagine a situation where a single worker gets most of the traffic and runs out of CPU (while other workers idle).
One solution would be for the workers to pass each other the socket descriptor (using sendmsg()) once they notice that they have a lot of work in the queue and there are idle brethren.
So I think the magic piece of information missing from the article is that workers can have multiple open per-client sockets "on the go"; so if one worker gets all the client sockets, then you're not getting parallelism.