The compression algorithm you select is quite dependent on the dataset you have. The equations in this blog post don't help you choose which compression to use, but rather "how much" and when to compress. I would be curious to formalize the math for different compression algorithms though... might be a good follow-up post!
I was calculating timings and compression ratios for each array with each algorithm, then saving the "best" one to use for the next chunks of data.
But it is hard to decide how to judge the CPU vs. disk/network tradeoff, as you explain in the article.
I was a bit curious whether I could make an API where, at the top level, the user enters some parameters and the system adjusts this calculation accordingly.
But I had some issues with this, because the hardware budget is used by all parts of the system, not only by the compression code.
As an example, the network is mega fast inside a data center but can be slow and expensive when connecting to a user. The application can know which case it is in, but it is hard to connect that part of the code to the compression-selection logic cleanly.
Also, in the network case: it might make sense to keep the data large but CPU time low until I hit the limit, but once I hit the limit nothing else matters.
It would be cool to have a mathematical framework where I could plug in some numbers and reason about the whole picture.
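Something like this toy cost model is what I mean - every name and number below is made up, just to show how the CPU vs. transfer tradeoff could be scored per chunk and per environment:

```python
# Hypothetical sketch of a per-chunk cost model: all names, ratios, and
# throughputs here are invented for illustration.

def total_time_seconds(chunk_bytes, ratio, compress_mbps, link_mbps):
    """Estimated time to compress one chunk and push it over the link."""
    mb = chunk_bytes / 1e6
    compress_time = mb / compress_mbps          # CPU cost
    transfer_time = (mb / ratio) / link_mbps    # network/disk cost
    return compress_time + transfer_time

# Candidates as (measured compression ratio, measured throughput in MB/s).
candidates = {
    "none": (1.0, float("inf")),
    "lz4":  (2.1, 400.0),
    "zstd": (2.9, 150.0),
}

def pick(chunk_bytes, link_mbps):
    """Pick the candidate that minimizes estimated end-to-end time."""
    return min(candidates, key=lambda name: total_time_seconds(
        chunk_bytes, candidates[name][0], candidates[name][1], link_mbps))

print(pick(64_000_000, link_mbps=1000))  # fast data-center link: "none"
print(pick(64_000_000, link_mbps=5))     # slow user-facing link: "zstd"
```

The environment-specific part (data center vs. user connection) then collapses into a single `link_mbps` number the application can pass in.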
This is spot on. I understand very little about how terminal rendering works, and I was still able to build github.com/agavra/tuicr (Terminal UI for Code Review) in an evening. The initial TUI design was done via Claude.
RocksDB actually does something somewhat similar with its prefix compression. It prefix-compresses keys and then "resets" the prefix compression every N records, and it stores a mapping of reset point -> offset so you can skip across compressed records. It's pretty neat.
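Roughly like this (an illustrative sketch, not RocksDB's actual block format - `RESTART_INTERVAL` and the layout are simplified stand-ins):

```python
# Illustrative sketch only - not RocksDB's real code.
RESTART_INTERVAL = 16  # "N": how often the prefix compression resets

def shared_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def build_block(sorted_keys):
    """Prefix-compress sorted keys, remembering where each restart begins."""
    entries, restarts = [], []
    prev = b""
    for i, key in enumerate(sorted_keys):
        if i % RESTART_INTERVAL == 0:
            restarts.append(len(entries))   # reset point -> offset mapping
            shared = 0                      # store the full key at a restart
        else:
            shared = shared_prefix_len(prev, key)
        entries.append((shared, key[shared:]))  # (shared_len, unshared suffix)
        prev = key
    # A reader can binary-search `restarts` (full keys only) and then decode
    # forward within a single interval, skipping across compressed records.
    return entries, restarts
```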
The trick is, it's not all in memory - it's a memory-mapped file
If you look at the page cache (with `fincore` or similar) you'll see that the binary search only loads the pages it examines, roughly logarithmic in the file size.
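Roughly, the search looks like this (a sketch assuming a sorted, newline-delimited file; `find_line_ge` is a made-up name):

```python
# Rough sketch: binary search a sorted, newline-delimited file through mmap.
# Only the pages touched by the probes are faulted in (visible via fincore).
import mmap

def find_line_ge(path, needle: bytes):
    """Return the first line >= needle, or None if there isn't one."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lo, hi = 0, len(mm)
        while lo < hi:
            mid = (lo + hi) // 2
            # Back up to the start of the line containing `mid`.
            start = max(lo, mm.rfind(b"\n", lo, mid) + 1)
            end = mm.find(b"\n", start)
            if end == -1:
                end = len(mm)
            if mm[start:end] < needle:
                lo = end + 1     # answer must start after this line
            else:
                hi = start       # this line (or an earlier one) is >= needle
        if lo >= len(mm):
            return None
        end = mm.find(b"\n", lo)
        return mm[lo:end if end != -1 else len(mm)]
```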
And a text file is the most useful general format - easy to write, easy to process with standard tools.
I've used this in the past on data sets of hundreds of millions of lines, maybe billions.
It's also true that you could use a memory-mapped indexed file for faster searches - I've used SQLite for this.
So cool to see this make the front page of Hacker News! I'm the author, and I'll be online here throughout the weekend to answer any questions you might have :) I'm excited for the next post, which is in the works and is about LSM trees.
We haven't even started to discuss object storage, but it ends up looking very, very similar if you're building data systems that use that instead of raw filesystems (not so much for physics reasons, but because of the way object storage requires immutable objects and penalizes you for many API calls).
Thanks! I use https://monodraw.helftone.com/ which is my favorite one-time-purchase software of all time. I definitely agree the buttons on the top left are unnecessary but ... it's cute and it makes me happy so I can't help it. Maybe I'll come up with a different style for the next blog
Sounds like a perfect fit for https://slatedb.io/ -- it's just that (an embedded Rust KV store that supports transactions).
It's built specifically to run on object storage. It currently relies on the `object_store` crate, but we're considering OpenDAL instead, so if Garage works with those crates (I assume it does if it's S3-compatible) it should just work OOTB.
For Garage's particular use case I think SlateDB's "backed by object storage" would be an anti-feature. Their use of LMDB/SQLite is for the metadata of the object store itself - trying to host that metadata within the object store runs into a circular dependency problem.
There's a whole class of interesting problems related to query evolution - and it varies greatly depending on the "environment" you're interested in (see mjdrogalis' docs on updating a running query). Generally, the strategy ksqlDB takes at the moment is to validate which upgrades are possible to do in place and which are not - for the former, ksqlDB "just does it", and for the latter, we are designing a mechanism to deploy topologies side-by-side and then atomically cut over once the new topology has caught up to the old one.
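In rough, generic pseudocode (a sketch of the pattern only, not ksqlDB's actual code - every callable here is a hypothetical stand-in passed in by the caller), the two paths look like:

```python
# Generic sketch of "in-place when compatible, side-by-side otherwise".
import time

def upgrade(old, new, *, is_in_place_compatible, apply_in_place,
            deploy, lag, switch_traffic, tear_down, poll_secs=1.0):
    """All callables are injected stand-ins for the real machinery."""
    if is_in_place_compatible(old, new):
        apply_in_place(old, new)       # compatible change: just do it
        return
    deploy(new)                        # incompatible: run side-by-side
    while lag(new) > 0:                # wait until the new topology catches up
        time.sleep(poll_secs)
    switch_traffic(old, new)           # atomic cutover
    tear_down(old)
```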
There's an in-progress blog post that describes exactly this class of problems - keep an eye out for it!