Notice how tight this loop is. In particular, we're dealing with a single simple...

glangdale · on Nov 29, 2016

Hi, author of Hyperscan (https://github.com/01org/hyperscan) here.

I strongly suspect we don't support enough of this:

> many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input

... to really support your use case. But we're interested in the workload, especially as we're looking at extensions to handle more of the zero-width assertion cases. We'll never be able to handle some of them in streaming mode (they break our semantics and the assumption that stream state is a fixed size for a given set of regular expressions).

Can you share anything about what you're doing with zero-width assertions?

burntsushi · on Nov 29, 2016

> Notice that you're reading the data into a statically allocated buffer

It is not statically allocated. The data is on the heap. The give-away is that the data is in a `Vec`, which is always on the heap.

> and so that the first access is unaligned

I modified both benchmarks in this fashion:

    let mut sum: u64 = 0;
    let mut i = 1;
    while i + 8 <= data.len() {
        sum += LE::read_u64(&data[i..]);
        i += size_of::<u64>();
    }
    sum

The results indicate that both benchmarks slow down. The gap is narrowed somewhat, but the absolute difference is still around 4x (as it was before):

    test bit_shifting ... bench:   2,293,921 ns/iter (+/- 65,243)                                                                                                                                                        
    test type_punning ... bench:     659,350 ns/iter (+/- 15,550)

The loop is not so tight any more:

    .LBB4_6:
    	leaq	-8(%rcx), %rdi
    	cmpq	%rdi, %rsi
    	jb	.LBB4_11
    	cmpq	$7, %rax
    	jbe	.LBB4_12
    	movq	(%rbx), %rdi
    	addq	-8(%rdi,%rcx), %rdx
    	addq	$8, %rcx
    	addq	$-8, %rax
    	cmpq	%rsi, %rcx
    	jbe	.LBB4_6

> Now, I'm not saying that type-punning can't be faster, but to do it properly from a general-purpose library it should be done correctly so that every case is as fast as possible.

You haven't actually told me what is improper with byteorder. I think that I've demonstrated that type punning is faster than bit-shifts on x86.

You have mentioned other workloads where the bit-shifts may parallelize better. I don't have any data to support or contradict that claim, but if it were true, then I'd expect to see a benchmark. In that case, perhaps there would be good justification for either modifying byteorder or jettisoning it for that particular use case. With that said, the data seems to indicate the the current implementation of byteorder is better than using bit-shifts, at least on x86. If I switched byteorder to bit-shifts and things got slower, I have no doubt that I'd hear from folks whose performance at a higher level was impacted negatively.

> Note that I'm not stranger to optimizing regular expressions. I wrote a library to transform PCREs (specifically, a union of thousands of them, many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input) into Ragel+C code and got a >10x improvement over PCRE. After that improvement micro-optimizations were the last thing on our minds. We eventually got to >50x improvement by doubling-down on that strategy and modifying Ragel internally. Much like micro-optimizations RE2 couldn't even come close to competing; and unlike re2c, the Ragel-based solution would compile on the order of minutes, not lifetimes.

My regex example doesn't have anything to do with regexes really. I'm simply pointing out that a micro-optimization can have a large impact, and is therefore probably worth doing. This is in stark contrast to some of your previous comments, which I found particularly strongly worded ("irrational" "premature" "bad" "incorrect"). For example:

> It's all sort of ironic, which I suppose was the point upthread--this is an example of the irrational urge for premature optimization and of bad programming idioms being hauled into Rust land completely unhindered by Rust's type safety features. And the better, correct, and likely more performant way of accomplishing this task could have been done just as safely from C as it could from Rust.

Note that I am not making the argument that one shouldn't do problem-driven optimizations. But if I'm going to maintain general purpose libraries for regexes or integer conversion, then I must work within a limited set of constraints.

(OT: Neither PCRE nor RE2 (nor Rust's regex engine) are built to handle thousands of patterns. You might consider investigating the Hyperscan project, which specializes in that particular use case (but uses finite automata, so you may miss some things from PCRE): https://github.com/01org/hyperscan)