Introducing pgzx: create PostgreSQL extensions using Zig (xata.io)
156 points by eatonphil on March 21, 2024 | 37 comments


This looks very cool and it's nice to see some finicky details like PG_GETARG abstracted away.

Are there any plans to make pgzx a Trusted Language so that Zig functions could be written with CREATE FUNCTION? Rust recently achieved this milestone and AWS RDS now supports PL/Rust.

Being a trusted language would mean it could be used to create Trusted Language Extensions (TLEs) and could be installed on cloud hosted databases without filesystem access.


No, TLE support is not a goal. Languages/extensions in Postgres should only be marked as trusted if you can guarantee that they do not access Postgres internals or even the filesystem. With Zig, and in pgzx in particular, we cannot guarantee this to be the case.

With pgzx we provide a number of utilities on top of the C API. But depending on the extension you are working on, you might heavily use the Postgres internals and their C APIs as-is.

Postgres has support for a number of languages, each with its own strengths and weaknesses. One strength of Zig is its closeness to C, while providing many quality-of-life improvements over C (optional types instead of null pointers, generic data structures, ...). This makes it a formidable choice for extensions that want to integrate deeply with the Postgres internals.
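
A toy illustration of the optional-types point (plain Zig, nothing Postgres-specific; the names are made up for the example):

```
fn firstByte(buf: ?[*:0]const u8) ?u8 {
    // The compiler forces us to handle the null case before dereferencing.
    const p = buf orelse return null;
    return p[0];
}
```

A bare `[*:0]const u8` can never hold null, so an accidental null dereference becomes a compile error rather than a crash.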


If I suddenly had unlimited resources and attention, I'd really like to provide infrastructure to write trusted functions by running them via a webassembly runtime, with a small injected safe surface for accessing postgres functionality. That'd make it a lot easier to write trusted PLs in a variety of languages. Still probably needs a bit of per-language sugar on top, but a lot less than doing it from scratch.

Edit: Minor grammar fix


I don't know much about it, but I think this has been tried?

https://dylibso.com/blog/pg-extism/

Looking at the repo though it looks like it hasn't been updated in a while.


we're actually working on a new version of this, which _does_ support TLE! if you're interested, please join https://extism.org/discord and we can get you a preview.


Oh wow, that is really cool.


Hehe :)

I also have WASM in mind when thinking about future TLE support :)


Oracle, IBM and Microsoft have been doing that for decades with the JVM, CLR and TIMI; yet another use case where WebAssembly is a follower.


It would be amazing if we could use Zig for TLEs, and we'll look into it, but I worry it might not be possible because Zig can't guarantee memory safety in the same way that Rust can.


If you build it, they'll come. The only limit is yourself!


I understood that reference, zombo :)


Y'all have built something cool. I'm excited to see where it goes from here.

If there's anything the pgrx team can do to help, just let us know.


PostgreSQL is such an amazing piece of software, and it keeps getting more awesome release after release. It has almost become a full-featured operating system at this point.

And the ability to write extensions in Zig is very nice.


Postgres creator Stonebraker is working on an actual operating system to run database workloads :)

https://en.wikipedia.org/wiki/DBOS


I wonder what things you'd like to see exposed by postgres to make something like this easier.

E.g. you have code to maintain the module magic and function info, for both of which there are some plans to change the contents in the next few major versions. If we were to provide zig consumable ways of creating the struct from C headers, would that help?


One of the reasons why we are using Zig is that we can consume the C API mostly as is.

When importing the C headers, Zig translates them into Zig declarations. Unfortunately this is not always possible for C macros, which is why we have to maintain those structs ourselves and step in with custom Zig code. But most of the time we actually consume the C APIs as-is.
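
For readers unfamiliar with Zig, a rough sketch of what that import looks like (the header names are the usual Postgres ones; include paths are assumed to be set up by the build, and this is not necessarily pgzx's exact setup):

```
const c = @cImport({
    @cInclude("postgres.h");
    @cInclude("fmgr.h");
    @cInclude("utils/builtins.h");
});

// Plain C functions come through as regular Zig functions, e.g.:
// const cstr = c.text_to_cstring(txt);
```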

> If we were to provide zig consumable ways of creating the struct from C headers, would that help?

Yes, that would be awesome. I'm curious what they will look like in the future.

We have had the most difficulties with the module magic, function info, and varatt/Datum macros. Fortunately you have to solve the module magic/function info "only once". The Datum conversion and VARATT macros are more troublesome. We have some conversion support for a number of common Zig types. But ideally we would like users to be able to use the C APIs as-is, while we provide some type-directed default conversions for convenience.

The main problems we've been facing with C macro translation are type conversions/casts in macros, especially if the underlying struct heavily uses unions (for example the VARATT macros). In some cases translating inline functions instead of macros might work better, due to the translator having more type information available. We fixed some of the translations manually; you can find them in varatt.zig and datum.zig (where we opted to implement the text-to-cstring conversion ourselves).

Data structures like lists, slist, dlist, and hash tables are quite consumable as-is. We have some typed wrappers for those and provide iterators. Macros with control flow cannot be reused, but I think this is fair, especially as the foreach macros are a very common C pattern. All in all we have had no trouble with them.
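
As a rough sketch of what the "as-is" consumption looks like without the foreach macro (assuming the usual pg_list.h API via a `c` import; pgzx's typed wrappers and iterators mentioned above are nicer than this):

```
fn walkPointerList(list: *c.List) void {
    var i: c_int = 0;
    while (i < list.length) : (i += 1) {
        const cell = c.list_nth_cell(list, i);
        const ptr = cell.*.ptr_value; // for a list of pointers
        _ = ptr; // ... use the element ...
    }
}
```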


Hi,

> > If we were to provide zig consumable ways of creating the struct from C headers, would that help?

> Yes, that would be awesome.

I don't know much about Zig and won't have a whole lot of time to learn - but if you can outline what is required to make C macros [un]usable, it might be possible to improve something. Either on its own, or as part of future work.

> I'm curious what they will look like in the future.

There's quite a few things.

For one, I'd like to introduce a faster function calling infrastructure for the simple cases (mainly small number of arguments, without SRF support). That'll need to be declared in the function info struct.

For another, eventually I want to support a different encoding for variable length types. Including making it reasonably efficient to have variable-length integers.

> We have had the most difficulties with the module magic, function info, and varatt/Datum macros. Fortunately you have to solve the module magic/function info "only once". The Datum conversion and VARATT macros are more troublesome. We have some conversion support for a number of common Zig types. But ideally we would like users to be able to use the C APIs as-is, while we provide some type-directed default conversions for convenience.

Ugh, the varatt stuff doesn't look easily maintainable long-term. It looks like you just need it for getDatumTextSliceZ()?

At the same time, I don't really know why you need it? Most of this should be doable via C functions, and the parts that are not, you could easily wrap yourself - you already seem to have some C code as part of pgzx.


Edit: I accidentally submitted the response before I was finished :)

> > > If we were to provide zig consumable ways of creating the struct from C headers, would that help?

> > Yes, that would be awesome.

> I don't know much about Zig and won't have a whole lot of time to learn - but if you can outline what is required to make C macros [un]usable, it might be possible to improve something. Either on its own, or as part of future work.

Hm. Sometimes it is difficult to tell until you try to use a macro. This is because the compiler ignores code (no type checking) that is not used. It's difficult to explain, but imagine that you write a program that emits typed code, which is eventually compiled. This is what enables comptime and "best effort" C header imports.

The toolchain tries to convert macros into inline functions. That means any macro that contains some form of control flow or opens/closes a code block can't be used. The most obvious ones are the foreach loops, PG_TRY and friends, or the PG_RETURN_X macros (luckily we can just use the XGetDatum functions).
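
For example, where C code would end with `PG_RETURN_INT32(n)`, the Zig side can call the underlying conversion function and return the Datum itself (a sketch, with `c` being the imported Postgres headers and `makeResult` just an example name):

```
fn makeResult(n: i32) c.Datum {
    // PG_RETURN_INT32 expands to a return statement, so it cannot be
    // translated; Int32GetDatum is a plain function and works fine.
    return c.Int32GetDatum(n);
}
```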

Union types are difficult as I said. But maybe this is rather a Zig problem.

Another issue is constants. For example, when working with the varattrib variants we have bitwise operations and shifts:

```
#define VARSIZE_4B(PTR) \
    ((((varattrib_4b *) (PTR))->va_4byte.va_header >> 2) & 0x3FFFFFFF)
```

Now the 2 and the bit mask might be translated into different types (e.g. int), which might not be compatible with va_header (which is a uint32). Sometimes the types for the constants look OK, sometimes not. Maybe this is something that could still be improved in Zig, not sure. I haven't tried this, but I wonder what would happen if I annotated the types for the constants in the macro (which might not make them more readable :) ).
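
For comparison, a hand translation with explicit types might look roughly like this (a sketch of the idea, not the actual varatt.zig code; it assumes the little-endian header layout of the macro above):

```
inline fn varsize4B(ptr: [*]const u8) u32 {
    // Assemble the 4-byte varlena header, then apply the same shift/mask
    // as VARSIZE_4B, with every constant explicitly typed as u32.
    const header: u32 = @as(u32, ptr[0]) | (@as(u32, ptr[1]) << 8) |
        (@as(u32, ptr[2]) << 16) | (@as(u32, ptr[3]) << 24);
    return (header >> 2) & 0x3FFFFFFF;
}
```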

We later decided to allow mixing C with Zig code in case we need some kind of "complex" wrapping in C. This might not be fully ideal, but fortunately Zig is also a C compiler, which allows us to fall back to C if we find something too complicated.

> For one, I'd like to introduce a faster function calling infrastructure for the simple cases (mainly small number of arguments, without SRF support). That'll need to be declared in the function info struct.

This sounds great. In pgzx we actually allow developers to capture the function call info as an argument in their function implementation (not shown in our examples), for example if someone wants to use the collation, do some checks on nargs, or implement a function with a variable number of arguments.

But out of the box we try to derive input and return types and conversions at compile time. I would have to see what the new API looks like, but I think we would still be able to automatically derive the conversions to extract the arguments into Zig values.
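
Conceptually, the type-directed conversion looks something like this (a toy sketch with hypothetical helper names, not the pgzx API; DatumGetInt32/DatumGetBool are the usual Postgres conversion functions):

```
fn fromDatum(comptime T: type, d: c.Datum) T {
    if (T == i32) {
        return c.DatumGetInt32(d);
    } else if (T == bool) {
        return c.DatumGetBool(d);
    } else {
        @compileError("no Datum conversion registered for " ++ @typeName(T));
    }
}
```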

> Ugh, the varatt stuff doesn't look easily maintainable long-term. It looks like you just need it for getDatumTextSliceZ()?

> At the same time, I don't really know why you need it? Most of this should be doable via C functions, and the parts that are not, you could easily wrap yourself - you already seem to have some C code as part of pgzx.

True. We introduced C into the code base later in our development. The project is still very new and we might revisit some choices on the Datum encoding.

`getDatumTextSliceZ` actually resembles the `text_to_cstring` function, which we might want to use in the future instead. In Zig a string is a slice, which is a fat pointer (pointer + length field). The type `[:0]const u8` represents a slice with a 0 terminator (fun fact: Zig gives you a stack trace if you forget to write the terminator into your buffer). Initially we implemented this function directly so we could initialize the fat pointer directly, without having to get the string length after doing the conversion.
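
A minimal illustration of the fat-pointer point (plain Zig; `makeSliceZ` is just an example name): given a zero-terminated pointer and a known length, the slice can be built directly, without a strlen.

```
fn makeSliceZ(ptr: [*:0]const u8, len: usize) [:0]const u8 {
    // The ":0" in the slice bounds asserts that ptr[len] is the terminator.
    return ptr[0..len :0];
}
```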

We added C to our code base later on to allow us to wrap simple cases more easily without having to reproduce Postgres code in Zig. I guess we should revisit `getDatumTextSliceZ` :). Either have a small C wrapper over `text_to_cstring` that also returns the length, or just bite the bullet and do a `strlen` afterwards.

Another motivator to try to fix the VARATT/VARDATA macros was to allow developers to use those in their own extensions. Looking at some extensions in contrib we find e.g. `VARDATA_ANY` or `VARSIZE_ANY_EXHDR` being used quite a bit.


Tangential, first time bumping into Xata. What is the difference between Xata and Neon?

Their site says

> Xata is the only serverless data platform for PostgreSQL.

I thought Neon was serverless postgres too?

Edit: This looks really dope actually, going to play around with it later. Really dig the free tier and the built in search from the get go.


We're referring more to the "data platform" part. What we mean by a "Postgres data platform" (and arguably it's not a widely accepted term, so take it with a grain of salt) is that you get not only Postgres, but secondary data stores around it, more tightly integrated than in a typical cloud provider:

- Elasticsearch, to which we replicate data automatically via CDC

- Blob storage for file attachments, we use S3 to store the files and Cloudflare to cache them on CDN

- (in the future) Caching

- (in the future) Time-series databases, again with automatic replication from Postgres

That said, you can use just the Postgres part, and we are very happy if you do.


Edited my comment above, but this looks really rad. I made an account and am going to try porting a side project DB to it later!

In particular, we need to improve our search, and I've been lazy about tweaking our ts_vector setup.


> Elasticsearch

Do you use zombodb?

https://github.com/zombodb/zombodb


You can get an overview of the platform in this post: https://xata.io/blog/serverless-postgres-platform#the-platfo...

Zombodb is a really cool project, but no, we don't use it. We use logical replication and triggers to also capture schema changes with your records into an event stream. The event stream is used to send your data to Elasticsearch (and create/update the index). See: https://xata.io/blog/serverless-postgres-platform#logical-re...

Stay tuned, we are planning to open source this component as well.


As far as Postgres goes, Xata is just Amazon Aurora under the hood if I read their last blog post right.

(I work at Neon)


This is fantastic news. I got into Zig in the last two weeks and it has been a very pleasant language. Comptime is an ingenious invention: it enables type-safe meta-programming and easy generic support in the same language syntax. The native support for sub-byte types like u2, u3, or u7 makes packing data very easy. The native vector support for SIMD makes SIMD-level parallel programming child's play. The language is looking very good.
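
Two tiny examples of those points (plain Zig, unrelated to pgzx; the names are made up):

```
// Sub-byte fields in a packed struct: u2 + u3 + u3 fits in one byte.
const Flags = packed struct {
    kind: u2,
    level: u3,
    extra: u3,
};

// Vector math: element-wise, lowered to SIMD instructions where available.
fn addFour(a: @Vector(4, f32), b: @Vector(4, f32)) @Vector(4, f32) {
    return a + b;
}
```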


It's completely beside the point, but part of the complexity of the char_count example is just due to the C version for some reason copying the input strings into newly allocated memory, which the Zig version doesn't :)


This is fair, and it's a bit embarrassing, but I copied the C version from an older tutorial without looking at it critically enough. Still, the Zig version gets the arguments as normal function parameters and doesn't need to call `PG_GETARG_*`.


I notice that the C version looks like it uses memcpy as a convoluted typecast from a pointer to text (presumably a Postgres typedef?) to a pointer to char (null-terminated?). Could this actually have a purpose?

Edit: I don't think PG uses C strings? Looks like memcpy might be needed for call-by-reference, return-by-value - but not really in this simple function that only walks the array?

https://www.postgresql.org/docs/current/xfunc-c.html#XFUNC-C...


I have a dumb, and likely annoying, question: why isn't Rust the clear choice? I think it has to do with Zig's "interop with C" but this is very abstract for me. Is there a code example that shows something that's super easy in Zig but a PITA in Rust?

I've heard that Rust's types and compiler can be frustrating to deal with as well. Is Zig a more "joyful" programming language?


It's for sure the question that a lot of people have. Initially we had more details about that in the blog post, but we didn't want it to become a "Zig vs Rust" blog post, so we kept it to a minimum.

I will expand just a little bit more here:

First, I think the fact that Rust can be a trusted language for Postgres is a huge advantage, and I am excited about it! I hope we will have the chance to use it and contribute to pgrx as well.

Postgres is not only written in C, it has developed its own particular style of C code. For example: arena memory allocators (memory contexts), exceptions via setjmp/longjmp (ereport), single-threaded processes communicating via shared memory, SPI for accessing the data via SQL, etc. Some of these mechanisms kind of conflict with the way Rust likes to do things.

Because of the above, pgrx has to do harder work to expose functionality. It's possible, just a lot of work. In Zig, we can use pretty much anything directly or with small wrappers.

If you need to call into C a lot, you need to use unsafe, and that negates some of the memory safety advantages of Rust. Also, Rust unsafe code is harder to write.


There is a separate project to build Postgres extensions with Rust: https://github.com/pgcentralfoundation/pgrx



I haven't worked with pgzx, but it's possible that memory management is easier. Zig uses memory arenas, and so does Postgres. If one can map directly to the other, it would be a huge win.

With Rust/pgrx (which I have used, extensively) memory integration is more difficult. There's pg memory, and there's Rust memory, and they're not the same, and you have to play games to pass data between them. Lifetime issues crop up. Rust might be able to solve the problem in the future with custom allocators, but it's just not there yet.


> I haven't worked with pgzx, but it's possible that memory management is easier. Zig uses memory arenas, and so does Postgres. If one can map directly to the other, it would be a huge win.

Yes. This was indeed a great motivator for using Zig. It was quite easy to integrate Zig with the Postgres memory management. This way we can use the Zig standard library or other Zig libraries without a second thought. Another advantage is that in Postgres memory context cleanup and error handling are somewhat well integrated with each other. This gives us some peace of mind as almost any Postgres function you might want to use in your extension is likely to raise an exception.
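
The core of that integration is conceptually small; a very rough sketch of the idea (not pgzx's actual allocator integration, which exposes a proper std.mem.Allocator) is to route allocations through palloc so they live in, and are freed with, the current Postgres memory context:

```
fn pallocSlice(comptime T: type, n: usize) []T {
    // palloc reports failures via ereport/longjmp, so it does not return null.
    const raw = c.palloc(n * @sizeOf(T)) orelse unreachable;
    const items: [*]T = @ptrCast(@alignCast(raw));
    return items[0..n];
}
```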


I've heard that when working with complex data structures in Rust you often need to write unsafe code, which is a lot more painful to work with. Once you are in unsafe land anyway, some people say Zig is more convenient.


Here is a talk and slides about how to create Postgres extensions in Rust.

https://rustlab.it/talks/teaching-an-old-dog-new-tricks-exte...


Very cool. More amazing work from Xata. Special shoutout to pgroll too!



