No, but it could help people build those synthesis tools much faster.
P4Synth takes a (mathematical) group of functions/expressions and finds strong candidate implementations for every class in that group. Then, as long as we have a fast (expression → class) mapping for that group, we can use the generated solutions as a database embedded within the compiler for automated expression replacement (or mapping from expressions to circuits/technology).
Vivado/Quartus (FPGA) technology mapping and LLVM's InstCombine stage are essentially this. InstCombine's pattern library is partially human-authored, partially generated by search tools; it lists ~30k subexpression replacements like a+a+a → a*3. P4Synth competes with those search tools. For hard function classes, existing methods might take weeks on a supercomputer: P4Synth speeds that up exponentially.
It only solves a narrow toy problem right now (4-input boolean functions), but I believe the technique could scale with modifications (like A* style prioritisation over signal-set novelty and implementation score).
I just did a mini-ablation study for this (prefix sum). Removing the cross-block carry (16 values) raises throughput from 19.85 to 23.45 GB/s: the gain is modest because most of the performance is lost to the accumulator carry within the block.
An absolute value every 16 deltas would undermine compression: a greater interval would lose even the modest performance gain, while a smaller interval would completely lose the compressibility benefits of delta coding.
It's a different matter, although there is definitely plausible motivation for absolute values every X deltas: query/update locality (mini-partition-level). You wouldn't want to transcode a huge number of values to access/modify a small subset.
I think what GP is saying is you can drop an absolute value every e.g. 8192 elements (or even orders of magnitude more if you're actually processing GBs of elements) and this frees you to compute the blocks in parallel threads in a dependency-free manner. The latency for a block would still be bound by the single-core rate, but the throughput of a stream is likely memory-bound after 2-3 cores. It still hurts the point of doing delta compression, but not nearly as badly as every 16 values would.
Even if one is willing to adopt such an encoding scheme, you'd still want to optimize what you have here anyway. It also doesn't help, as mentioned, if the goal is actually latency of small streams rather than throughput of large ones.
Oh right. That's sensible enough. Makes total sense to parallelise across multiple cores.
I wouldn't expect a strictly linear speed-up due to contention on the memory bus, but it's not as bad as flat-lining after engaging 2-3 cores. On most AWS Graviton instances you should be able to pull ~5.5 GB/s per-core even with all cores active, and that becomes less of a bottleneck when considering you'll typically run a sequence of compute operations between memory round-trips (not just delta).
Oh neat. I have some related unpublished SOTA results I want to release soon: PEF/BIC-like compression ratios, with faster boolean algebra than Roaring Bitsets.
The weirdness probably comes from heavy use of "SIMD intrinsics" (Googleable term). These are functions with a 1:1 correspondence to assembly instructions, used for processing multiple values per instruction.
If the data is already in GPU memory, yes. Otherwise you'll be limited by the DRAM<->VRAM memory bottleneck.
When we consider that delta coding (and family) is typically applied as one step in a series of CPU-first transforms and benefits from L1-L3 caching, we find CPU throughput pulls far ahead of GPU-based approaches for typical workloads.
This note holds for all GPU-based approaches, not just PTX.
What is the typical workload you speak of, where CPUs are better?
We've been implementing GPU support in Presto/Velox for analytical workloads and I've yet to see a use case where we wouldn't pull ahead.
The DRAM-VRAM memory bottleneck isn't really a bottleneck on GH/GB platforms (you can pull 400+GB/s across the C2C NVLink), and on NVL8 systems like the typical A100/H100 deployments out there, doing real workloads, where the data is coming over the network links, you're toast without using GPUDirect RDMA.
Even without NVLink C2C, on a GPU with 16x PCIe 5.0 lanes to host, you have 128 GB/s in theory and 100+ GB/s in practice bidirectional bandwidth (half that in each direction), so you still come out ahead with pipelining.
Of course prefix sums are often used within a series of other operators, so if these are already computed on GPU, you come out further ahead still.
Haha... GPUs are great. But do you mean to suggest we should swap a single ARM core for a top-line GPU with 10k+ cores and compare numbers on that basis? Surely not.
Let's consider this in terms of throughput-per-$ so we have a fungible measurement unit. I think we're all agreed that this problem's bottleneck is the host memory<->compute bus so the question is: for $1 which server architecture lets you pump more data from memory to a compute core?
It looks like you can get an H100 GPU with 16x PCIe 5.0 (128 GB/s theoretical, 100 GB/s realistic) for $1.99/hr from RunPod.
With an m8g.8xlarge instance (32 ARM CPU cores) you should get much better RAM<->CPU throughput (175 GB/s realistic) for $1.44/hr from AWS.
By typical I imagined adoption within commonly-deployed TSDBs like Prometheus, InfluxDB, etc.
GB/GH are actually ideal targets for my code: both architectures integrate Neoverse V2 cores, the same core I developed for. They are superchips with 144/72 CPU cores respectively.
The perf numbers I shared are for one core, so multiply the numbers I gave by 144/72 to get expected throughput on GB/GH. As you (apparently?) have access to this hardware I'd sincerely appreciate if you could benchmark my code there and share the results.
I'm assuming you're referring to BFM/EXTR? NEON absolutely improves here.
The core I developed on (Neoverse V2) has 4 SIMD ports and 6 scalar integer ports; however, only 2 of those scalar ports support multicycle integer operations like the insert variant of BFM (essential for scalar packing).
More importantly, NEON progresses 16 elements per instruction instead of 1.
From the working set size and knowledge of hardware cache behaviour. Whenever you access data from memory not already in-cache it's copied four times: to L3, L2, L1 and to CPU registers. As you access data, the hardware evicts old cache entries to make space for it.
If you loop through an array once, and then iterate through it again you can figure out where it will be cached based on the array size.
If you have an array of numbers with a known upper bound, such as enums with 8 possible values (representable with 3 bits), and a memory-bound operation on those numbers, e.g. for (int i = 0; i < n; i++) if (user_category[i] == 0) filtered.push_back(i), which is common in data warehouses, using my code can more than double performance by allowing more efficient usage of the DRAM<->CPU bus.
Not having (1 << k) - 1 as a single instruction sucks when it HAS to be in a hot loop, but you can usually hoist this to the loop prologue: my stuff uses dummy inline assembly hints to force compilers to do this `asm volatile("" : "+w"(m));`.
I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements with x86 ISAs, so elements per instruction is tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP.
Also: AVX512 is just as non-existent on clients as SVE2; although not really relevant for the server-side targets I'm optimising for (mostly OLAP).
8-bit is not absent in x86 SIMD; it is slightly less covered than 32- & 16-bit, but you can fully implement all the common 8-bit ops and most are 1 instruction (with AVX2). There are even various horizontal ops on 8-bit values (avg, dot, etc.).
Also, AVX512 is way more common than SVE2: all Zen4 & Zen5 parts support it.
More specifically, basically the only absent 8-bit ops that have 32-bit equivalents in AVX2 are shifts and multiplies. Shifts are quite annoying (though, with a uniform shift they can be emulated on AVX-512 via GFNI abuse in 1 instr), multiplies are rather rare (though note that there is vpmaddubsw for an 8-bit→16-bit multiply-add). There's even a case of the opposite - saturating add/sub exist for 8-bit and 16-bit ints, but not wider.
Not sure in which context this is used, but you can do -1 << k in most ISAs but that still requires a bit-not.
But if you want to use the value in a bitwise instruction, then there are often variants that can invert the input operand.
E.g. in RVV instead of vand.vv(a, vadd.vi(vsll.vv(1,k), -1)) you could do vandn.vv(a, vsll.vv(-1,k))
AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog
I don't know either; I was talking about the lack of PMOVMSKB in NEON, and then the comment I replied to started talking about how (1 << k) - 1 not being a single instruction sucks. I don't think it's a single instruction in any non-NEON ISA either.