> It can't optimize it down to simple loads and stores unless it can prove that ...

wahern · on Nov 29, 2016

  Notice how tight this loop is. In particular, we're dealing with a single simple load to read our u64.

Notice that you're reading the data into a statically allocated buffer, and doing it in such a way that it's trivial for the compiler to prove alignment. This is a classic case where the benchmark is irrelevant for a general purpose implementation.

Try running the code so that the buffer is dynamically allocated, and so that the first access is unaligned.

Now, I'm not saying that type-punning can't be faster, but to do it properly from a general-purpose library it should be done correctly so that every case is as fast as possible.

Assuming I'm correct and that the modified benchmark sees substantially different results, reimplement byteorder such that it produces the same tight loop even when the data isn't aligned.

I don't think it can be done without modifying the byteorder interface to expose something more iterator-like, because it needs to maintain state across invocations for doing the initial unaligned parse followed by the aligned parse.

If you can get it done in a reasonable amount of time[1], look at the difference between type-punning and byte-loading. I'll bet that relative difference will be much smaller than the difference between the unaligned performance before you refactored the interface, and the unaligned performance after refactoring the interface. In that case my point would stand--the most important part is refactoring code at a higher-level; gains quickly diminish thereafter.

If my argument is over-specified, that's because it's meant as a rule of thumb. Specifying a rule of thumb but then qualifying it with "unless" is counter-productive. For inexperienced engineers "unless" is an excuse to avoid the the rule; for experienced engineers "unless" is implied.

Note that I'm no stranger to optimizing regular expressions. I wrote a library to transform PCREs (specifically, a union of thousands of them, many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input) into Ragel+C code and got a >10x improvement over PCRE. After that improvement micro-optimizations were the last thing on our minds. (RE2 couldn't even come close to competing; and unlike re2c, the Ragel-based solution would compile on the order of minutes, not lifetimes.)

We eventually got to >50x improvement by doubling-down on the strategy and paying someone to modify Ragel internally to improve the quality of the transformations.

[1] Doubtful as I bet it's non-trivial and you have much better things to do with your time. But I would very much like to see just benchmarks numbers after making the initial changes--dynamic allocation and unaligned access. I don't have a Rust dev environment. I'll try to do this myself later this week if I can. However, given that I've never written any Rust code whatsoever it'd be helpful if somebody copy+pasted the code to dynamically allocate the buffer. I can probably figure the rest out from there.

glangdale · on Nov 29, 2016

Hi, author of Hyperscan (https://github.com/01org/hyperscan) here.

I strongly suspect we don't support enough of this:

> many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input

... to really support your use case. But we're interested in the workload, especially as we're looking at extensions to handle more of the zero-width assertion cases. We'll never be able to handle some of them in streaming mode (they break our semantics and the assumption that stream state is a fixed size for a given set of regular expressions).

Can you share anything about what you're doing with zero-width assertions?

burntsushi · on Nov 29, 2016

> Notice that you're reading the data into a statically allocated buffer

It is not statically allocated. The data is on the heap. The give-away is that the data is in a `Vec`, which is always on the heap.

> and so that the first access is unaligned

I modified both benchmarks in this fashion:

    let mut sum: u64 = 0;
    let mut i = 1;
    while i + 8 <= data.len() {
        sum += LE::read_u64(&data[i..]);
        i += size_of::<u64>();
    }
    sum

The results indicate that both benchmarks slow down. The gap is narrowed somewhat, but the absolute difference is still around 4x (as it was before):

    test bit_shifting ... bench:   2,293,921 ns/iter (+/- 65,243)                                                                                                                                                        
    test type_punning ... bench:     659,350 ns/iter (+/- 15,550)

The loop is not so tight any more:

    .LBB4_6:
    	leaq	-8(%rcx), %rdi
    	cmpq	%rdi, %rsi
    	jb	.LBB4_11
    	cmpq	$7, %rax
    	jbe	.LBB4_12
    	movq	(%rbx), %rdi
    	addq	-8(%rdi,%rcx), %rdx
    	addq	$8, %rcx
    	addq	$-8, %rax
    	cmpq	%rsi, %rcx
    	jbe	.LBB4_6

> Now, I'm not saying that type-punning can't be faster, but to do it properly from a general-purpose library it should be done correctly so that every case is as fast as possible.

You haven't actually told me what is improper with byteorder. I think that I've demonstrated that type punning is faster than bit-shifts on x86.

You have mentioned other workloads where the bit-shifts may parallelize better. I don't have any data to support or contradict that claim, but if it were true, then I'd expect to see a benchmark. In that case, perhaps there would be good justification for either modifying byteorder or jettisoning it for that particular use case. With that said, the data seems to indicate the the current implementation of byteorder is better than using bit-shifts, at least on x86. If I switched byteorder to bit-shifts and things got slower, I have no doubt that I'd hear from folks whose performance at a higher level was impacted negatively.

> Note that I'm not stranger to optimizing regular expressions. I wrote a library to transform PCREs (specifically, a union of thousands of them, many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input) into Ragel+C code and got a >10x improvement over PCRE. After that improvement micro-optimizations were the last thing on our minds. We eventually got to >50x improvement by doubling-down on that strategy and modifying Ragel internally. Much like micro-optimizations RE2 couldn't even come close to competing; and unlike re2c, the Ragel-based solution would compile on the order of minutes, not lifetimes.

My regex example doesn't have anything to do with regexes really. I'm simply pointing out that a micro-optimization can have a large impact, and is therefore probably worth doing. This is in stark contrast to some of your previous comments, which I found particularly strongly worded ("irrational" "premature" "bad" "incorrect"). For example:

> It's all sort of ironic, which I suppose was the point upthread--this is an example of the irrational urge for premature optimization and of bad programming idioms being hauled into Rust land completely unhindered by Rust's type safety features. And the better, correct, and likely more performant way of accomplishing this task could have been done just as safely from C as it could from Rust.

Note that I am not making the argument that one shouldn't do problem-driven optimizations. But if I'm going to maintain general purpose libraries for regexes or integer conversion, then I must work within a limited set of constraints.

(OT: Neither PCRE nor RE2 (nor Rust's regex engine) are built to handle thousands of patterns. You might consider investigating the Hyperscan project, which specializes in that particular use case (but uses finite automata, so you may miss some things from PCRE): https://github.com/01org/hyperscan)