I’m really looking forward to seeing the original commenters’ replies to this. But I’ll share my experience too.
I’ve found UDP to be great for latency but pretty awful for throughput, especially over longer routes (i.e. inter-region transports). Also, if you fire UDP packets out of a machine in a tight loop, there is every chance you could overload various buffers and just lose them (depending on the networking hardware).
TCP is comparatively amazing for throughput, but you do take a latency hit (especially on the initial handshake, which doesn’t exist for UDP).
There are some very experienced people commenting here though, and I’d be happy to be corrected or expanded upon.
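A small sketch of the tight-loop point above (my own illustration, not anyone's production code): UDP's `sendto()` succeeds as soon as the datagram is queued locally, so the sender has no idea whether any buffer along the path dropped it.

```python
import socket

# Hypothetical sketch: blasting UDP datagrams in a tight loop.
# sendto() returns once the datagram is queued on the local socket;
# if that buffer or any downstream queue overflows, the packet is
# dropped silently -- no error, no retransmission.
def blast(dest: tuple, payload: bytes, count: int) -> int:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent = 0
    for _ in range(count):
        if sock.sendto(payload, dest) == len(payload):
            sent += 1
    sock.close()
    return sent
```

Every call “succeeds” from the sender’s perspective even when the receiver sees far fewer datagrams — which is exactly why naive UDP blasting measures nothing useful about throughput.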
Anecdotal, but I have some experience running both TCP- and UDP-based VPNs over long-latency links (I worked from halfway around the globe for some years).
With OpenVPN it's easy enough to test: configure for UDP, or configure for TCP. With long latency and a tiny amount of packet loss, TCP-over-TCP OpenVPN completely stalls, while TCP-over-UDP OpenVPN is excellent - around the same performance as direct TCP, and sometimes actually better. At work we've also used other types of VPN setups (for engineers on the road), and the TCP-based ones (we've used several) work fine most of the time, but from far away they become nearly unusable, while UDP OpenVPN continues to work basically just fine.
The TCP-over-TCP VPN performance problem (over long-latency links) presumably has to do with running windowing and ack/nak on top of another layer of windowing and ack/nak.
The TCP over TCP performance problem can be summarized as follows:
Because the underlay TCP is lossless (being TCP), every time the overlay TCP has to retransmit, it adds to the queue of things that the underlay TCP has to retransmit (and the need to retransmit happens more or less at the same time).
So instead of linear increase in the number of packets, you get ~quadratic.
This balloons the throughput required to “rectify” the issue at both protocol levels - usually precisely at the point when there’s not enough capacity in the first place (the packet loss is supposed to signal congestion).
If you are very lucky, the link recovers fast enough that this ballooning is small enough to be absorbed by the newly available capacity.
If the outage is long enough, the rate of build-up of retransmits exceeds the capacity of the network to send them out - so it never recovers.
Needless to say, the issue is worse with a large window in the overlay TCP session - e.g. a sudden connectivity blip in the middle of a file transfer.
> I’ve found UDP to be great for latency but pretty awful for throughput.
UDP/multicast can provide excellent throughput. It's the de facto standard for market data on all major financial exchanges. For example, the OPRA feed (which is a consolidated market data feed of all options trading) can easily burst to ~17Gbps. Typically there is an "A" feed and a "B" feed for redundancy. Now you're talking about ~34Gbps of data entering your network for this particular feed.
Also, when network engineers do stress testing with iperf, we typically use UDP to avoid TCP overhead issues.
That’s interesting. And I’m sure they have some very knowledgeable people working for them who may (or will) know things I don’t.
That being said, it wouldn’t surprise me if they were pushing 17G of UDP on 100G transports - probably with some pretty high-end/expensive network hardware with huge buffers. I.e. you can do it if you’ve got the money, but I bet TCP would still have better raw throughput.
Yep, 100G switches are common nowadays since the cost has come down so much, and you can easily carve a port to 4x10G, 4x25G, or 40G. In financial trading you tend to avoid switches with huge buffers, as that comes at a huge cost in latency. For example, 2 megabytes of buffer is 1.68ms of latency on a 10G switch, which is an eon in trading. Most opt for cut-through switches with shallow buffers measured in 100s of nanoseconds. If you want to get really crazy, there are L1 switches that can do 5ns.
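To check the figure above: the worst-case latency a full buffer adds is just the time to drain it at line rate, i.e. buffer size divided by link speed.

```python
# Worst-case queueing delay: time to drain a full buffer at line rate.
def buffer_drain_ms(buffer_bytes: int, link_bps: float) -> float:
    return buffer_bytes * 8 / link_bps * 1e3

ms = buffer_drain_ms(2 * 1024 * 1024, 10e9)  # 2 MiB buffer, 10G port
print(f"{ms:.2f} ms")  # ~1.68 ms, matching the figure quoted above
```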
That is a really good point that I hadn’t considered. Presumably this comes at the risk of dropped packets if the upstream link becomes saturated? Does one just size the links accordingly to avoid that?
Basically yes, but the links themselves are controlled by the exchanges (and tied in to your general contract for market access).
In general UDP is not a problem in the space because of overprovisioning. Think "algorithms are for people who don't know how to buy more RAM", but with a financial industry budget behind it.
It’s actually pretty easy to monitor the throughput with the right tools. The network capture appliance I use can measure microbursts at 1ms time intervals. With low-latency/cut-through switches there are limited buffers by design. You are certain to drop packets if you are trying to subscribe to a feed that can burst to 17Gbps on a 10Gbps port.
Market data typically comes from the same RP per exchange in most cases. Some exchanges split them by product type. Typically there’s one or two ingress points (two for redundancy) into your network at a cross connect in a data center.
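A quick sketch of why that 17Gbps-on-a-10Gbps-port case is guaranteed to drop: during a burst, traffic arrives faster than the port can drain it, and the excess has to fit in the switch buffer or be discarded.

```python
# Excess bytes that must be buffered (or dropped) during a microburst.
def excess_bytes(burst_bps: float, port_bps: float, burst_s: float) -> float:
    return max(0.0, burst_bps - port_bps) * burst_s / 8

# A 1 ms burst at 17 Gbps into a 10 Gbps port needs ~875 KB of buffer,
# far more than a shallow cut-through switch provides.
print(excess_bytes(17e9, 10e9, 1e-3))  # 875000.0
```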
Have you tried to get inline timestamping going on those fancy modern NICs that support PTP? Orders of magnitude cheaper than new ingress ports on that appliance whose name starts with a "C", and also _really_ cool to have more perspectives on the network than switch monitor sessions.
UDP is little more than IP, so there isn't a technical reason why UDP couldn't be just as fast as TCP _per se_. But from when I was toying with writing a stream abstraction on top of UDP in Linux userspace, I came to the same conclusion, it's hard to achieve high throughput.
My guess is that this is partly because achieving high throughput on top of raw IP is hard, and partly because it's never going to be super efficient at this level (in userspace, on top of kernel infrastructure that isn't optimized for throughput the way the in-kernel TCP path is).
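The very first step of such a stream abstraction can be sketched as follows (a hypothetical minimal design of mine, not the abstraction the comment describes): prefix each datagram with a sequence number so the receiver can detect loss and reordering. Retransmission, acks, and congestion control — the genuinely hard parts — are deliberately left out.

```python
import struct

# Minimal framing for a stream-over-UDP sketch: a 32-bit sequence
# number in network byte order, followed by the payload.
HEADER = struct.Struct("!I")

def frame(seq: int, payload: bytes) -> bytes:
    # Prepend the sequence number to one datagram's payload.
    return HEADER.pack(seq) + payload

def unframe(datagram: bytes) -> tuple:
    # Recover (sequence number, payload) from a received datagram.
    (seq,) = HEADER.unpack_from(datagram)
    return seq, datagram[HEADER.size:]
```

Even this trivial layer costs a syscall plus a copy per datagram in userspace, which hints at why matching the kernel's TCP throughput from user code is hard.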