Could you explain why one thread per core translated to throughput and not latency targeting?
Each core can only execute a fixed number of instructions per second, regardless of how many threads are assigned to it. If a core receives 100 requests, it can handle them all sequentially with 1 thread, or spread them across 2-100 threads.
When using 1 thread, the first request completes in the minimum time the core needs to do the work. The second request completes the same amount of time after the first.
When using 2 threads, the core carries out the work in an interleaved fashion between the 2 threads. Both requests complete at the time the 2nd request completed in the 1-thread example above.
I'll try to make it clearer - assume a request takes 5 ms:
1 Thread
Request 1: start processing at 0ms, complete at 5ms
Request 2: start processing at 5ms, complete at 10ms
2 Threads
Request 1: start processing at 0ms, complete at 10ms
Request 2: start processing at 0ms, complete at 10ms
As can be seen, average latency is better with only 1 thread: avg 7.5 ms when 2 requests arrive at the same time, versus avg 10 ms with 2 threads.
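A minimal sketch of the arithmetic above (Python is my choice here; the quantum-based `interleaved` function is an idealized stand-in for OS time-slicing, not a real scheduler):

```python
def run_to_completion(jobs):
    """1 thread: finish each request fully before starting the next."""
    t, done = 0.0, []
    for service_ms in jobs:
        t += service_ms
        done.append(t)
    return done

def interleaved(jobs, quantum=0.1):
    """N threads on one core: the core alternates between requests
    in small time slices until each one finishes."""
    remaining = {i: s for i, s in enumerate(jobs)}
    done, t = {}, 0.0
    while remaining:
        for i in list(remaining):
            step = min(quantum, remaining[i])
            t += step
            remaining[i] -= step
            if remaining[i] == 0.0:
                done[i] = t
                del remaining[i]
    return [done[i] for i in range(len(jobs))]

requests = [5.0, 5.0]  # two 5 ms requests arriving together
print("1 thread :", run_to_completion(requests))  # [5.0, 10.0]
print("2 threads:", interleaved(requests))        # both finish near 10 ms
```

Same total work either way; interleaving only pushes the first completion later, which is why the average latency gets worse.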
Throughput should theoretically remain the same if you disregard context-switching time and cache invalidation. Once you account for both, the single-thread approach should outperform in terms of throughput as well.
Suggestion: Try with 1 job of X ms, and N jobs of 1ms, and figure out for different N where the break-even point of X lies. That might give some more insight.
Fast in what sense? Throughput, or latency?
> One thread per core
Oh, I guess the answer to the previous question is "throughput".