> It can't optimize it down to simple loads and stores unless it can prove that it's aligned. If it can't optimize it to a simple load, it has to check for alignment. If it has to check for alignment, it's unlikely to be faster than the byte-loading function.
I had edited my comment after-the-fact to include the "on x86" qualification.
> And that's what I meant by saying effort and code complexity is better spent refactoring the algorithm at a higher level than trying to micro-optimize such a small operation.
Your advice is overspecified. If you want to make something faster, then build a benchmark that measures the time you care about and iterate on it. If "micro optimizations" make it faster, then there's nothing wrong with that. I once doubled the throughput of a regex implementation by eliminating a single pointer indirection in the inner loop. It doesn't get any more micro than that, but consumers are no doubt happier with the increased throughput. In general, I find most of your hand-waving about performance curious. You seem keen on making a strong assertion about performance, but the standard currency for this sort of thing is benchmarks.
I did all of this with byteorder when I built it years ago. I'll do it again for you. The benchmark reads 1,000,000 64-bit integers from a buffer in memory and sums them.
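For anyone who wants to poke at this, the rough shape of the benchmark is below. To be clear, this is a from-memory sketch, not the exact code I ran: the `Instant` timing is a stand-in for a proper bench harness, and `u64::from_le_bytes` is a stand-in for the punned read. Compile with optimizations or the comparison is meaningless.

```rust
use std::hint::black_box;
use std::time::Instant;

// Statically allocated storage, declared as u64s so the 8-byte alignment of the
// backing memory is plainly visible (a detail that comes up in the replies).
static DATA: [u64; 1_000_000] = [0x0123_4567_89AB_CDEF; 1_000_000];

// "Type punning" style: each value is one 8-byte load.
fn sum_type_punning(buf: &[u8]) -> u64 {
    let mut sum = 0u64;
    for chunk in buf.chunks_exact(8) {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(chunk);
        sum = sum.wrapping_add(u64::from_le_bytes(bytes));
    }
    sum
}

// "Bit shifting" style: each value is assembled from eight byte loads.
fn sum_bit_shifting(buf: &[u8]) -> u64 {
    let mut sum = 0u64;
    for chunk in buf.chunks_exact(8) {
        let mut n = 0u64;
        for (i, &b) in chunk.iter().enumerate() {
            n |= (b as u64) << (8 * i);
        }
        sum = sum.wrapping_add(n);
    }
    sum
}

fn main() {
    // View the static u64 storage as the raw byte buffer a decoder would see.
    // Sound: u8 has no alignment or validity requirements.
    let buf: &[u8] = unsafe {
        std::slice::from_raw_parts(DATA.as_ptr() as *const u8, DATA.len() * 8)
    };
    let funcs: [(&str, fn(&[u8]) -> u64); 2] = [
        ("type_punning", sum_type_punning),
        ("bit_shifting", sum_bit_shifting),
    ];
    for (name, f) in funcs {
        let start = Instant::now();
        // black_box discourages the compiler from precomputing the result.
        let sum = f(black_box(buf));
        println!("{}: sum = {:#x}, elapsed = {:?}", name, sum, start.elapsed());
    }
}
```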
(Compiling with `RUSTFLAGS="--emit asm"` dumps the generated asm to target/release/deps.) Analyzing the hotspots of each benchmark using `perf` is instructive. For type_punning, notice how tight the hot loop is in the corresponding asm; in particular, we're dealing with a single simple load to read our u64. Now repeat the process for bit_shifting and look at its hotspot's asm. It's no surprise that the type punning approach is faster here. (N.B. Compiling with `RUSTFLAGS="-C target-cpu=native"` seems to permit some auto-vectorization to happen, but I don't observe any noticeable improvement in the benchmark times for bit_shifting. In fact, it seems to get a touch slower.)
I could be reasonably accused of micro-optimizing here, but I do feel like reading 1,000,000 integers from a buffer is a pretty generalizable use case, and the performance difference here in particular is especially dramatic. Finding a real world problem that this helps is left as an exercise to the reader. (I've exceeded my time budget for a single HN comment.)
> It's beyond dispute that the gains from SeaHash primarily come from how it refactored its inner loop to operate on a 64-bit word instead of 8 8-bit words.
Do you feel anyone has contested this point? I note your use of the word "primarily." If type punning gives a 10% boost to something that is already fast, do you care? If not, do you think other people might care? If they do, then what exactly is your point again?
Note that I am responding to your criticism of byteorder in particular. I don't really know whether the OP's optimization of reading little-endian integers is actually worthwhile or not. I would hazard a guess, but would suspend certainty until I saw a benchmark. (And even then, it is so incredibly easy to misunderstand a benchmark.)
Notice that you're reading the data into a statically allocated buffer, and doing it in such a way that it's trivial for the compiler to prove alignment. This is a classic case where the benchmark is irrelevant for a general purpose implementation.
Try running the code so that the buffer is dynamically allocated, and so that the first access is unaligned.
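Concretely, I mean something like the following for the buffer setup (a guess at the shape of it; I don't write Rust, so treat this as pseudocode to be corrected):

```rust
fn main() {
    // Heap-allocate the storage instead of using a static array, then skip one
    // byte so the first 8-byte read is (almost certainly) not 8-byte aligned.
    let backing: Vec<u8> = vec![0xAB; 1_000_000 * 8 + 1];
    let buf: &[u8] = &backing[1..];
    assert_eq!(buf.len() % 8, 0);
    println!("first read starts at {:p}", buf.as_ptr());
}
```

The compiler then can't see a conveniently aligned buffer, which is the case I care about.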
Now, I'm not saying that type-punning can't be faster, but doing it properly in a general-purpose library means making every case as fast as possible.
Assuming I'm correct and that the modified benchmark sees substantially different results, reimplement byteorder such that it produces the same tight loop even when the data isn't aligned.
I don't think it can be done without modifying the byteorder interface to expose something more iterator-like, because it needs to maintain state across invocations for doing the initial unaligned parse followed by the aligned parse.
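Something along these lines, in Rust-flavored pseudocode (again, I don't write Rust; none of these names are real byteorder API, and a serious version would do the unaligned-head-then-aligned-body work rather than branching per read):

```rust
// Illustrative only: not byteorder's API. The reader decides its strategy once,
// when it's constructed, and carries that decision across reads as state.
struct LeU64s<'a> {
    buf: &'a [u8],
    aligned: bool,
}

impl<'a> LeU64s<'a> {
    fn new(buf: &'a [u8]) -> LeU64s<'a> {
        let aligned = (buf.as_ptr() as usize) % std::mem::align_of::<u64>() == 0;
        LeU64s { buf, aligned }
    }
}

impl<'a> Iterator for LeU64s<'a> {
    type Item = u64;

    fn next(&mut self) -> Option<u64> {
        if self.buf.len() < 8 {
            return None;
        }
        let buf = self.buf;
        let (chunk, rest) = buf.split_at(8);
        self.buf = rest;
        Some(if self.aligned {
            // Aligned path: a single word load. Length and alignment were
            // established above, so the raw read is sound.
            u64::from_le(unsafe { (chunk.as_ptr() as *const u64).read() })
        } else {
            // Unaligned path: byte-wise assembly. A real implementation would
            // instead parse an unaligned head and then switch to aligned word
            // loads, which is exactly the cross-call state I'm talking about.
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(chunk);
            u64::from_le_bytes(bytes)
        })
    }
}
```

The point is that the alignment decision becomes state established once, up front, rather than something every call to a free function has to rediscover.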
If you can get it done in a reasonable amount of time[1], look at the difference between type-punning and byte-loading. I'll bet that relative difference will be much smaller than the difference between the unaligned performance before you refactored the interface and the unaligned performance after. In that case my point would stand--the most important part is refactoring code at a higher level; gains quickly diminish thereafter.
If my argument is overspecified, that's because it's meant as a rule of thumb. Specifying a rule of thumb but then qualifying it with "unless" is counter-productive. For inexperienced engineers, "unless" is an excuse to avoid the rule; for experienced engineers, "unless" is implied.
Note that I'm no stranger to optimizing regular expressions. I wrote a library to transform PCREs (specifically, a union of thousands of them, many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input) into Ragel+C code and got a >10x improvement over PCRE. After that improvement, micro-optimizations were the last thing on our minds. (RE2 couldn't even come close to competing; and unlike re2c, the Ragel-based solution would compile on the order of minutes, not lifetimes.)
We eventually got to a >50x improvement by doubling down on the strategy and paying someone to modify Ragel internally to improve the quality of the transformations.
[1] Doubtful, as I bet it's non-trivial and you have much better things to do with your time. But I would very much like to see just the benchmark numbers after making the initial changes--dynamic allocation and unaligned access. I don't have a Rust dev environment. I'll try to do this myself later this week if I can. However, given that I've never written any Rust code whatsoever, it'd be helpful if somebody copy+pasted the code to dynamically allocate the buffer. I can probably figure the rest out from there.
I strongly suspect we don't support enough of this:
> many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input
... to really support your use case. But we're interested in the workload, especially as we're looking at extensions to handle more of the zero-width assertion cases. We'll never be able to handle some of them in streaming mode (they break our semantics and the assumption that stream state is a fixed size for a given set of regular expressions).
Can you share anything about what you're doing with zero-width assertions?
> Now, I'm not saying that type-punning can't be faster, but doing it properly in a general-purpose library means making every case as fast as possible.
You haven't actually told me what is improper with byteorder. I think that I've demonstrated that type punning is faster than bit-shifts on x86.
You have mentioned other workloads where the bit-shifts may parallelize better. I don't have any data to support or contradict that claim, but if it were true, then I'd expect to see a benchmark. In that case, perhaps there would be good justification for either modifying byteorder or jettisoning it for that particular use case. With that said, the data seems to indicate that the current implementation of byteorder is better than using bit-shifts, at least on x86. If I switched byteorder to bit-shifts and things got slower, I have no doubt that I'd hear from folks whose performance at a higher level was impacted negatively.
> Note that I'm no stranger to optimizing regular expressions. I wrote a library to transform PCREs (specifically, a union of thousands of them, many of which used zero-width assertions that required non-trivial transformations and pre- and post-processing of input) into Ragel+C code and got a >10x improvement over PCRE. After that improvement micro-optimizations were the last thing on our minds. We eventually got to >50x improvement by doubling down on that strategy and modifying Ragel internally. Much like micro-optimizations, RE2 couldn't even come close to competing; and unlike re2c, the Ragel-based solution would compile on the order of minutes, not lifetimes.
My regex example doesn't have anything to do with regexes really. I'm simply pointing out that a micro-optimization can have a large impact, and is therefore probably worth doing. This is in stark contrast to some of your previous comments, which I found particularly strongly worded ("irrational," "premature," "bad," "incorrect"). For example:
> It's all sort of ironic, which I suppose was the point upthread--this is an example of the irrational urge for premature optimization and of bad programming idioms being hauled into Rust land completely unhindered by Rust's type safety features. And the better, correct, and likely more performant way of accomplishing this task could have been done just as safely from C as it could from Rust.
Note that I am not making the argument that one shouldn't do problem-driven optimizations. But if I'm going to maintain general purpose libraries for regexes or integer conversion, then I must work within a limited set of constraints.
(OT: Neither PCRE nor RE2 (nor Rust's regex engine) is built to handle thousands of patterns. You might consider investigating the Hyperscan project, which specializes in that particular use case (but uses finite automata, so you may miss some things from PCRE): https://github.com/01org/hyperscan)