I looked at the Go implementation of this in "tagged pointers" [0]
The amount of data that can be used for the tag is architecture-dependent, and the routine discards any tag bits that don't fit into the tagged pointer without telling the caller.
To me, this seems ridiculous - why not just use a struct with a tag and a pointer, and not run the risk of your tag being destroyed without you knowing because the architecture can't fit that many bits?
But the Go folks are smart, and must be doing this for a reason. Can anyone explain the thinking here?
There are a lot of data structures where doubling the size of the struct would be a big deal. I've worked on big graphs, where I used similar tricks, because most of the storage of a graph is pointers, and limits of what can be quickly computed are bound by how much of the graph you can get in memory.
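To make the tradeoff concrete, here is a minimal sketch of the low-bits flavour (my own illustration, not Go's actual scheme): with 8-byte-aligned allocations the low 3 bits of a pointer are always zero, so a small tag rides along in one word instead of two.

    #include <cassert>
    #include <cstdint>

    // The low 3 bits of an 8-byte-aligned pointer are always zero, so a
    // 3-bit tag can live there and the whole thing stays one word.
    using TaggedPtr = std::uintptr_t;

    TaggedPtr make_tagged(void* p, unsigned tag) {
        assert((reinterpret_cast<std::uintptr_t>(p) & 0x7) == 0);  // alignment
        assert(tag < 8);                                           // tag must fit
        return reinterpret_cast<std::uintptr_t>(p) | tag;
    }

    void* pointer_of(TaggedPtr v) { return reinterpret_cast<void*>(v & ~TaggedPtr{0x7}); }
    unsigned tag_of(TaggedPtr v) { return static_cast<unsigned>(v & 0x7); }

sizeof(TaggedPtr) is 8, where a {pointer, tag} struct would pad out to 16: exactly the doubling that hurts when most of your storage is pointers.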
I didn't say it was. I just followed one of the links in the article because I grok Go and it seemed counter-intuitive to how Go normally does things. But the other answers seem to make sense, so I guess this is a thing I accept now.
There's a trick here I hadn't noticed. Good times.
If the plan is 8-byte-aligned data with the three low bits used for other stuff, you mask them off before loads/stores. An alternative is to use the high three bits, store the pointer/8, and multiply by 8 to retrieve.
That's appealing on x64 because memory operations can include a *8 in the encoding.
Specifically, I've long wondered whether the pointer masking tricks mess up prefetch/speculation. It seems plausible that making the dereference look more like a normal memory operation is helpful for that.
(It also means the low alignment bits and the high unused bits can be considered contiguous, modulo the *8 on decode, which is probably less annoying when using both ranges.)
I also believe that when masking the bits off before loads/stores, you need to set them to the same value as the highest used bit (bit 47 or bit 56). While this is often 0, it's not necessarily the case. Something to be aware of.
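Tying the two together, here is a sketch of that store-pointer/8 variant (my own illustration, assuming 3 tag bits). Note that this particular encoding sidesteps the sign-extension issue: the decode shift restores the high bits, whereas clearing tag bits out of the top of a raw pointer would require re-extending from bit 47/56 as described above.

    #include <cstdint>

    // Tag in the top 3 bits, pointer/8 in the low 61 bits. Because an
    // 8-byte-aligned pointer's low 3 bits are zero, (p >> 3) << 3 == p,
    // including the high/canonical bits.
    std::uint64_t encode(std::uintptr_t p, unsigned tag) {
        return (std::uint64_t{tag} << 61) | (static_cast<std::uint64_t>(p) >> 3);
    }

    std::uintptr_t decode(std::uint64_t v) {
        return static_cast<std::uintptr_t>(v << 3);  // tag bits fall off the top
    }

    unsigned tag_of(std::uint64_t v) { return static_cast<unsigned>(v >> 61); }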
It's really unfortunate that all of the mainstream OSes run userspace in the lower portions of the address space.
Setting the most-significant 13 bits (really, setting the second to 12th bits and at least one of the following bits) of an IEEE-754 float will result in a NaN bit pattern. That means that any pointer to the top 2 petabytes of a 64-bit address space, if cast to an IEEE-754 double, will be NaN. This means the NaN-boxing used in Safari and Firefox's JavaScript engines, LuaJIT, etc. would be no-ops. (Safari and Firefox use different mechanisms, but they'd become the same if moved to the top of the address space.)
It's not enough of a performance difference to re-jigger everything in mainstream OSes, but I imagine if someone were to come up with a unikernel/exokernel OS specifically for JITing some dynamic language, there's some performance to be had by having all of the dynamic language objects in the upper 2 petabytes of the address space.
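As a quick sanity check of the bit pattern (a sketch using C++20's std::bit_cast; the address itself is made up):

    #include <bit>
    #include <cassert>
    #include <cmath>
    #include <cstdint>

    int main() {
        // Any address with the top 13 bits set has sign = 1, an all-ones
        // exponent, and a nonzero mantissa (bit 51 is set): a NaN.
        std::uint64_t fake_ptr = 0xFFF8'0000'1234'5678ULL;  // hypothetical pointer
        double d = std::bit_cast<double>(fake_ptr);
        assert(std::isnan(d));
    }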
Very old Macs used this trick to squeeze their ROM routines down a bit, operating with 24 bit addressing and using the top bits for flags and whatnot. Of course they ran into trouble when machines with 16MB of memory started appearing. If you do this you might be making more work for yourself in the future when you buy a new machine with 256EB of main memory.
What’s old is new again: IBM 360 architecture also only used 24 bits for addresses back in the 1960s, so you could stuff data in the upper byte of pointers. Eventually, programs had to declare themselves to be in “Extended Mode” if they wanted to get access to 31-bit addressing instead.
I'm guessing the problem occurred specifically with CPUs newer than the 68000 irrespective of amount of RAM?
(Microsoft's) Amiga Basic also did this and stopped working on newer CPUs: the 68000 only has 24 address lines and ignores the top 8 bits of a 32-bit address, but the 68020 and up use all 32. (I don't remember about the 68010, but that was pin compatible with the 68000, so I'm guessing not.)
There was hardware (not sure if it was CPU state or state in the memory controller) to zero-out all of the address bits above the 24th, allowing backward compatibility. I forget the shortcut (option-I? option-M?), but if you selected an executable in Finder, you could go in and change the hardware compatibility mode used to run it.
That's interesting. I don't think the address hack was widely used on the Amiga, so we didn't get anything like that. Amiga Basic was the main one and that was ditched for Arexx for the newer OS versions (a shame, because as much as I liked to mock Microsoft at the time, Amiga Basic was quite decent). I don't think the incompatibility had anything to do with ditching it - people have patched it and it wasn't a big deal; I'm assuming it didn't get an official patch because Commodore had already decided to replace it.
The 68010 was indeed pin compatible with the 68000 and had only 24 address bits.
But the 68012 (which was otherwise software compatible with the 68010) extended the address bus (and had more pins) and was the first m68k CPU to exhibit this compat issue.
The 68020 was more popular but came a couple of years later
x86_64 (among others) specifically avoided that incompatibility, by the CPU forcing programs to mask those bits out instead of ignoring them. So programs are very compatible into the future, they just need to limit the range their memory allocator uses.
Then it later added explicit automatic masking, which also avoids the problem. As long as your program can make do with smaller amounts of memory, there are no downsides.
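(For the curious, a sketch of what that masking looks like in software, assuming 48-bit virtual addresses: sign-extend from bit 47 rather than just clearing the borrowed bits, so kernel-half addresses survive too.)

    #include <cstdint>

    // Restore a canonical x86-64 address after borrowing the top 16 bits:
    // shift the tag out, then arithmetic-shift back to sign-extend bit 47.
    std::uintptr_t untag48(std::uintptr_t p) {
        return static_cast<std::uintptr_t>(static_cast<std::intptr_t>(p << 16) >> 16);
    }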
It does not actually make the software future proof: if the kernel returns a large address, the software will either fault (if it checks) or mask out bits that were actually part of the address and stomp over the wrong thing.
Virtual memory is what mitigates the issue, as it lets the OS provide large virtual addresses on a per-process basis, and independent of physical addressing.
This was already used on 32b x86: PAE decorrelated the physical address space from the virtual (giving the OS 36 bits physical address — with a theoretical upper bound of 64 — even as individual processes were still restricted to 32); and on windows the /LARGEADDRESSAWARE link switch allowed manipulating the virtual address space on a per-process granularity.
That's why I said you need to control your allocator.
It's so much easier than rewriting everything that deals with pointers.
And of course everything has virtual memory. But even without it, you're not really worse off when you upgrade the CPU if masking was already enforced.
> This was already used on 32b x86: PAE decorrelated the physical address space from the virtual (giving the OS 36 bits physical address — with a theoretical upper bound of 64 — even as individual processes were still restricted to 32); and on windows the /LARGEADDRESSAWARE link switch allowed manipulating the virtual address space on a per-process granularity.
I wouldn't characterize this as very similar. Physical and virtual address space already had to be decorrelated to go above about 1GB, and PAE doesn't meaningfully extend how much memory a 32 bit processes can have. And LAA is based on software assumptions that were never particularly tied to hardware, and it can be relevant with or without PAE.
> That's why I said you need to control your allocator.
The allocator does not matter unless by "control your allocator" you mean "the OS provides an API which can restrict the virtual memory system", which has nothing to do with the allocator.
If you're using 19 bits for tagging and the OS returns a 52 bit pointer, you don't have the space for your tagging scheme, no matter what your allocator is.
> PAE doesn't meaningfully extend how much memory a 32 bit processes can have.
Of course not. What it does is show that there is no correlation between the physical and virtual address spaces, so you can have a 64 bits physical address space and only provide a 48 bits virtual address space to applications.
> And LAA is based on software assumptions that were never particularly tied to hardware, and it can be relevant with or without PAE.
Missing the point, the example of LAA shows that the OS can provide different virtual address spaces to different processes. So the OS can provide 48 bit virtual address spaces by default, and a full 64 bits virtual address space to applications which request it. The former can then keep using tagged pointers just fine.
The allocator gets address space. The OS doesn't just shove things into your process in general. It can't return a 52 bit pointer out of nowhere. My first draft mentioned OS explicitly but it's not really the OS that's making the decisions.
> What it does is show that there is no correlation between the physical and virtual address spaces
It does show that, but that decorrelation is neither necessary nor sufficient to solve this problem.
> Missing the point, the example of LAA shows that the OS can provide different virtual address spaces to different processes.
I agree with that. In the context of your overall post it wasn't clear to me that that was why you mentioned it.
> The OS doesn't just shove things into your process in general. It can't return a 52 bit pointer out of nowhere. My first draft mentioned OS explicitly but it's not really the OS that's making the decisions.
Of course it’s the OS making the decision. You ask the OS for memory and it returns whatever it wants.
The allocator acts as a bridge between the application and the OS and can do cool stuff, but if the OS returns a 52-bit pointer there isn't a thing the allocator can do about it that would result in a working 48-bit pointer.
> It does show that, but that decorrelation is neither necessary nor sufficient to solve this problem.
It absolutely is necessary, since the entire subject is to not break applications requiring a smaller address space for their tagging scheme to work even as the system and other applications migrate to larger ones.
The allocator can choose which part of the address space to fill. It does not have to deal with "whatever the OS wants".
> It absolutely is necessary, since the entire subject is to not break applications requiring a smaller address space for their tagging scheme to work even as the system and other applications migrate to larger ones.
Even if you have a 1:1 memory mapping, you can leave the applications that want small addresses at the start of memory. You don't need to decorrelate virtual and physical addresses to do that.
> I'm assuming an OS that isn't refusing to do things for no reason.
> Aren't you?
You are apparently assuming the OS would have no reason to refuse. I've seen DeRaadt cook, I'd rather not assume anything that's not guaranteed. And I especially wouldn't make up new modes of operations which don't exist for no justifiable reason.
And in later versions of Classic Mac OS, for each executable, metadata was stored on whether to set the hardware for 32-bit or 24-bit addressing. As a kid, I definitely monkeyed with the checkbox there not understanding what it did. Having seen no change in behavior, I guess all the applications I used were 32-bit clean.
Yeah, this was definitely a thing on old macs. Us olds also remember the pain and agony of cleaning this up for 32 bit addressing. Cool as these hacks are, do yourself a favor and avoid the (eventual) agony.
This is a nice summary of the practical aspects and considerations. It isn’t something anyone should be doing explicitly on a regular basis but there are occasions, particularly in libraries, where it is the perfect tool for the job.
There is also the inverse use case: smuggling pointers inside status codes, enums, and similar. For example, optionally encoding a pointer to additional error metadata for non-zero result codes. In C++ it isn’t that uncommon to also see Result types implemented similarly when the type allows it.
There's also the Rust library smartstring, which allows creating short strings without heap allocations: a Rust string consists of three values (pointer, capacity, length), so 24 bytes on a 64-bit architecture. With one byte used as a tag, this leaves 23 bytes for storing the string in the same space.
Because Rust strings are utf-8, not all byte sequences are valid strings, so you can actually store a 24-byte string inline (which is what compact_str does).
High-performance analytic databases also do this. A common representation is to store the length in 4 bytes from a total of 16. Then if the length is <= 12 the rest of the storage is used to store the string inline, otherwise only the first 4 bytes of the string are stored inline (this can drastically speed up sorts/comparisons by not having to do a non-local access), leaving the remaining 8 bytes for a pointer.
This representation is also coming to Arrow. But instead of 8 bytes for a pointer it will contain 4 bytes for a buffer index and 4 bytes for an offset into that buffer. This allows reduced effort when concatenating/merging arrays (no need to copy the data section where the long strings are stored), while remaining a relocatable/serializable format (only offsets, no pointers).
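A sketch of that 16-byte layout (field names are mine, not taken from any particular engine):

    #include <cstdint>
    #include <cstring>

    struct StringView16 {
        std::uint32_t len;  // total string length
        char prefix[4];     // first 4 bytes, always inline: cheap early-out
                            // for comparisons without chasing the pointer
        union {
            char suffix[8];    // len <= 12: bytes 4..11 stored inline too
            const char* tail;  // len >  12: pointer to the full string
        };
    };
    static_assert(sizeof(StringView16) == 16, "packed as described");

    StringView16 make(const char* s, std::uint32_t n) {
        StringView16 v{};
        v.len = n;
        std::memcpy(v.prefix, s, n < 4 ? n : 4);
        if (n <= 12) {
            if (n > 4) std::memcpy(v.suffix, s + 4, n - 4);
        } else {
            v.tail = s;  // non-owning: the caller keeps the bytes alive
        }
        return v;
    }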
An even better representation is the one from `folly::FBString`.
There, the tag that indicates small string mode is not a particular bound on capacity, but the last byte of the struct that is set to zero. That way, the tag also acts as a null terminator for the stored string.
If null-terminated strings are already a thing in your language and you can just pass an inlined string to whatever function without any bit manipulation, I fail to see the downside.
> I fail to see how that is better. Null-terminated strings are a terrible idea in the first place.
TLDR: you get extra bytes in your small strings
Long version: Copying from FBString code:
    struct MediumLarge {
        Char* data_;
        size_t size_;
        size_t capacity_;
    };
    // sizeof(MediumLarge) == 24

    struct FBString {
        union {
            // For accessing the last byte.
            uint8_t bytes_[sizeof(MediumLarge)];
            // Inline small-string buffer.
            Char small_[sizeof(MediumLarge) / sizeof(Char)];
            // Out-of-line (medium/large) representation.
            MediumLarge ml_;
        };
    };
Then `bytes_[23]` is set to `23 - size of small string` for a small string (so a maximal 23-character small string stores 0 there). You also "steal" a couple of bits from `capacity_` as a tag to see if a string is small or large.
This has two advantages:
- You get to reuse seven of the eight bytes of `capacity_` in your small string (i.e. your max small string size is 23, not 16 as it would be with a simpler scheme).
- You get a null terminator "for free", on top of still having size and capacity.
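In other words (my reading of the scheme, not folly's exact code):

    // The spare last byte stores 23 - size, so a maximal 23-char small
    // string stores 0 there, and that 0 is exactly the null terminator.
    std::size_t small_size(const FBString& s) { return 23 - s.bytes_[23]; }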
This could be reduced to 16 bytes by making `size_` and `capacity_` uint32_t instead of size_t.
You should be very careful when doing that. It may be rare in practice, but it's surprisingly easy to trigger ABA issues with only a 16-bit sequence number.
Don't you need 65k threads all contending on the same state for that to happen? Even if your process does have 65k threads, you'd need a pretty large critical section for all of them to be preempted in an unlucky point.
That said, it's better to rely on RCU/hazptr to solve ABA issues, but the extra bits are still useful to store state that can be CAS'd together with the pointer.
I don't really believe that it is easy to trigger. The thread would have to be preempted at a very specific point for a very specific duration, the other thread(s) would need to perform exactly 2^16 operations within that time window and the final operation would need to trigger the ABA problem. Possible, but extremely unlikely. (In some applications even impossible.) Do you happen to have some real world examples?
But yes, it is definitely something to be aware of!
BTW, if the pointers are always aligned, the sequence number may have a few bits more.
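For illustration, a sketch of the packing (hypothetical helpers, assuming user-space pointers whose high 16 bits are zero):

    #include <atomic>
    #include <cstdint>

    // 48-bit pointer in the low bits, 16-bit sequence counter on top,
    // CAS'd together: a recycled pointer no longer compares equal unless
    // the counter has wrapped all the way around (the ABA caveat above).
    constexpr std::uint64_t kPtrMask = (std::uint64_t{1} << 48) - 1;

    std::uint64_t pack(void* p, std::uint16_t seq) {
        return (std::uint64_t{seq} << 48) |
               (reinterpret_cast<std::uintptr_t>(p) & kPtrMask);
    }
    void* ptr_of(std::uint64_t v) { return reinterpret_cast<void*>(v & kPtrMask); }
    std::uint16_t seq_of(std::uint64_t v) { return static_cast<std::uint16_t>(v >> 48); }

    bool replace_head(std::atomic<std::uint64_t>& head,
                      std::uint64_t expected, void* desired) {
        // Bump the counter on every successful update.
        std::uint64_t next = pack(desired, static_cast<std::uint16_t>(seq_of(expected) + 1));
        return head.compare_exchange_strong(expected, next);
    }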
> But if you aren't writing a compiler or an embedded system, you should never do this.
I have this hex trie library that uses this to differentiate between a leaf and a node, and it will soon be optimizing single-value nodes as tagged pointers. __edit__ umm... single-value nodes are leaves!?! Yeah, not enough coffee apparently.
Not to mention the tinyscheme interpreter I occasionally poke at that TFA gave me a bunch of ideas to try out.
This will be a problem when trying this on Capability Hardware Enhanced RISC Instructions (CHERI) systems (https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/). Currently the only CPU with this implemented is Arm's Morello prototype.
Pointers are replaced by 128-bit capabilities containing what operations are valid, a 64-bit base address and a size. These capabilities are unforgeable, so trying to play with "unused" parts of 64-bit pointers simply won't work.
FreeBSD & GNOME are running on the hardware after minor source-code changes, along with a significant proportion of the FreeBSD ports collection.
My favorite hack back in the MFC days was a combo box in which I stored the pointer address in the item text (to the right, after a lot of spaces, so it was hidden). When a user chose an item, I'd parse the pointer and dereference it back to an object.
"I think it's quite well known that on a 64-bit system, the maximum bit-width
of a virtual address is somewhat lower (commonly 48-bits)." might actually be a perfect example of the Average Framiliarity xkcd[0]. It's perfectly fine to write it that way and the article obviously knows its intended audience. But I'm wondering what percentage of readers here actually knew this beforehand. (I learned it only pretty recently myself. And yes, I should have known this sooner, but ... didn't.)
Article author here. Thanks for calling that out - the intent wasn't to make readers who might not have been aware feel bad. It really came from my worry about being seen to write about something that is trivial and everyone knows already!
Perhaps I'd be better just dropping "I think it's quite well known that"?
Hey! First of all: don't worry, it didn't make me feel bad–don't know about others though, of course. And I really meant what I said: I think you knew your (originally) intended audience, and there, the percentage of people is almost certainly quite a lot higher than among general HN readership. I just really like that comic and immediately had to think of it when I read that sentence. (And it's only made funnier by your intention, it really shows how true to life the comic actually is!)
But I think there is a good, general argument to be made against these kinds of sentences. I don't think anyone would ever blame you for writing about a well known thing, while at the same time there is a chance of needlessly making people feel bad.
BUT I also think that with the right "mindset", these kinds of sentences can be a good indicator to readers about what is considered essential, common knowledge among a certain group of people.
So ... I don't think you need to remove it, but wouldn't say it's bad if you did. Sorry for the long winded way of saying basically nothing.
My first thought was to use the address-calculation logic as an additional ALU, then... my second thought was trying to justify this during a code review... and lastly, why Microsoft used the LDT upper byte to make Xenix 286 incompatible... and the headache that changing architectures made for poor programmers.
The Alpha only accessed 8-byte-aligned memory addresses (originally, anyway). The bottom 3 bits were ignored (masked out) on lookups, per the spec, to allow users to stuff juicy extra info into them.
Do you have a reference on that? This summary of the Alpha AXP <https://danluu.com/dick-sites-alpha-axp-architecture.pdf> states "Normal load or store instructions that specify an unaligned address take a precise data alignment trap to PALcode (which may do the access using two aligned accesses or report a fatal error, depending on the operating system design)"
Thanks for the PDF, nice when somebody provides more than the minimum information. Okay, from the same doc:
"The integer load and store quadword unaligned (LDQ_U, STQ_U) instructions ignore the low three bits of the byte address and always transfer an aligned quadword."
So sort of right, if you squint and don't look too closely :-)
(edit: oh, you're the author? Didn't realise. Good stuff)
[0] https://github.com/golang/go/blob/master/src/runtime/tagptr_...