Hacker News | maxhou's comments

Implementation of `continue` from one LuaJIT fork:

https://github.com/zewt/LuaJIT/commit/c0e38bacba15d0259c3b77...


Yes, they do.

A read() syscall takes longer than a getpid() syscall because read() has more work to do: it actually copies len bytes of data, which takes some time (and is faster or slower depending on whether the data is cache-hot).

What we call the "syscall overhead" is what happens before and after the actual data copy: switching between user and kernel mode.

You make that overhead negligible by calling read() with a large buffer size.


The `yes` and `pv` processes are not scheduled on the same CPU core, so they use different L2 caches.


I wonder if `taskset -c1 yes | taskset -c1 pv > /dev/null` would significantly change the throughput.


    $ yes |pv > /dev/null
    46.6GiB 0:00:05 [9.33GiB/s]

    $ taskset 1 yes |taskset 1 pv > /dev/null
    32.9GiB 0:00:05 [6.58GiB/s]

    $ taskset 1 yes |taskset 2 pv > /dev/null
    45.7GiB 0:00:05 [9.13GiB/s]

    $ taskset 1 yes |taskset 4 pv > /dev/null
    45.7GiB 0:00:05 [9.18GiB/s]
Very rough numbers: the 9.13/9.33 difference flip-flopped when I ran the commands again. Binding both processes to the same core is definitely a performance hit, though. There might be some gain from a shared cache, but more is lost through the lack of parallelism.

I tried masks 2 and 4 as I'm not sure how 'real' cores vs. 'hyperthread' siblings are numbered. These numbers are from an i7-7700K.


How do you know that the dataset fits in L2?

Assuming pv uses splice(), there is only one copy in the workload: copy_from_user() from a fixed source buffer into kernel-allocated pages, which are then spliced to /dev/null.

If the pages are not "recycled" (through an LRU allocation scheme), the destination changes every time and the L2 cache is constantly thrashed.


I only learned of pv from this article, so I can't speak much about its buffering. I would guess that the kernel tries to re-use recently freed pages to minimise cache thrashing. But anyway, on the `yes` side, the program isn't re-allocating its 8 KiB buffer after every write(), so a lot of data is being re-read from the same memory location.


Does the protocol implement any kind of negotiation (ciphers, ...)? If not, how would you handle future types of attacks against the then-hardwired constructions?

I fully agree that being in-kernel is the right choice for performance, but the chosen constructs exclude the possibility of using any of the existing crypto hardware accelerators that shine in the IPsec use-case (cache-cold data == no cache-flush overhead, fully async processing of packets with DMA chaining). Time to start lobbying SoC vendors :)


The cipher suite is part of the Noise preamble, so all operations are cryptographically bound to the cipher suite to protect against related-algo attacks. WireGuard itself has no plans for cipher agility, which is considered an anti-feature. If these ciphers are ever considered problematic, we'll change them and release a new version (with an incremented preamble), and the new set of ciphers will be similarly non-configurable.

Fortunately, AVX2-accelerated (and soon AVX512-accelerated) ChaPoly is super fast on pretty much all hardware.


> what ORM to use (or not to use an ORM)

Do you have something to recommend?

I used Rose::DB in the past, then I discovered SQLAlchemy and it's difficult to look back...


DBIx::Class is the only one I've really looked at. I also recall seeing a talk on Fey and Fey::ORM, given by the author, at YAPC or somewhere, and remember thinking it seemed really nice. But, I have never used any ORM heavily, so I'm still figuring it out.


1 point by ashimema 0 minutes ago

Love DBIx::Class, but it's not a good fit for Mojolicious by a long way: it's blocking by nature and thus doesn't play too nicely if you're aiming to write a non-blocking Mojo app.


For the few queries where you need async (most should be fast enough in the first place), there's no reason you can't use $rs->as_query to get hold of the SQL and bind values and then feed those into Mojo::Pg.


Is there such a thing as a non-blocking SQL query? Doesn't it always have to be wrapped up in something to make it non-blocking?

I think my question is: Is there an SQL ORM in any language that is non-blocking during queries, without the programmer having to wrap it in some sort of promise/callback/whatever? I really have no idea about ORMs, so I don't know anything about the state of the art. I'm trying to imagine what such a creature would look like...it seems like if your queries are going to potentially make you wait for any amount of time, you'd need to account for that at the caller side, even if things happen on the ORM side.


I'd need someone suggesting anything other than DBIx::Class to present some ironclad arguments for that choice.


I agree. I think the main Perl workhorse modules that provide the best benefit for me are, in order:

DBIx::Class

Kavorka / Function::Parameters / Method::Signatures (take your pick)

Moose / Moo (or use Moops and get Kavorka above built in!)

Mojolicious

Notable mention: Path::Tiny (and Try::Tiny, but everyone should know that one).

It's taken me a while to find the happy medium where I'm not sticking too much into DBIC ResultSet methods, but being able to define complex search methods that chain is awesome:

  my $rs = Schema->resultset("Foo")->unprocessed->rows(100)->order_by([-desc=>'time']);
  $rs = $rs->for_user($user) if $user;
  $rs = $rs->recent_entries; # Limit to the last week
Makes my life much better, and makes it so much easier to change my schema as needed.



> Suppose, for example, that you go to the Bank of America site to transfer some funds or pay a bill. As with Google, and as would happen with any other secure site, it turns out their certificate gets replaced with the Avast certificate. I doubt anyone needs me to lecture them on the potential security issues involved in having a third-party watching their banking transactions without permission!

Antivirus software runs with the highest level of privileges and diverts system calls...

They could theoretically log everything you type on the keyboard; there's no need to MITM SSL connections.

> Avast is replacing certificates with its own without bothering to check the validity of those certificates!

This is a far bigger issue.


It won't help. To serve its purpose, any AQM (active queue management) like fq_codel must run at the congestion point.

The end-user computer has a faster link than its internet connection's upload speed: typically gigabit Ethernet, whereas xDSL upload speed is a few megabits per second.

If you transfer data to a random website (upload) from the computer, packets will accumulate in the router/modem TX queue, because that is the slowest link between the two hops. This is where it's important to have an AQM running to reorder/drop packets in that queue.


To clarify, I meant that fq_codel solves bufferbloat. Its wide adoption will solve the problem globally, and being present in Linux will help, as Linux is the kernel in many home network appliances.

Other AQMs could solve it, if properly implemented, but fq_codel needs no tweaking.

CORRECTION:

fq_codel is not the default in Linux, but it is the default in some distributions, like Fedora.


If the end-user's router has firmware with AQM, would that solve the bufferbloat?


No, not really, see this comment: https://news.ycombinator.com/item?id=10546875


From the article:

> Each core needs to generate a few thousand data packets per second, because Ethernet packets typically contain up to 1500 bytes. This gives the CPU around 100 microseconds to process each packet.

No, it doesn't, not when using TCP Segmentation Offload (TSO).

This only works for a particular use-case, sending static data over TCP, but that is the most common use-case, since a typical "video streaming server" is actually a simple HTTP server that serves static MP4/MPEG-TS data.

For each connected client, this is what happens:

- nginx/apache calls sendfile(file, sock, off, <large_number>)

- the kernel issues a large (> 10 kB) DMA read from the file storage backend into a set of memory pages and waits for completion

- the kernel allocates/clones a small IP/TCP header (40 bytes)

- the kernel hands that small header plus the set of memory pages to the network card, which segments them, creates those 1500-byte packets, and sends them on the wire

If you have a lot of RAM, the read from storage can even be skipped, because previously read data pages are kept in the page cache with an LRU approach (this helps if clients are requesting the same file).

You can easily saturate a 10G link with spare CPU cycles on cheap hardware with that approach; no need to bypass anything.


You forgot the cost of memory access.

The L3 checksum is useless to offload because the IP header is small and the kernel has to read/write all of its fields anyway.

The L4 checksum covers the TCP/UDP payload, which the kernel can avoid touching entirely when the NIC computes the checksum.

When a TCP sender uses sendfile(), the kernel does a DMA read from storage into a page if the data is not already in memory (in the so-called page cache), and just asks the network card to send this page, prepended with an ETH/IP/TCP header. That only works if the NIC can checksum the TCP packet contents and update the header.

If the network card can do TCP segmentation offload, the kernel does not have to repeat this operation for each 1500-byte packet; it can fetch a large amount of data from disk, and the NIC will split the data into smaller packets by itself.

