
> The variable length instructions (currently 16 bit or 32 bit but 48 bit on the horizon) complicate instruction fetch and decode and in particular this is a problem for high performance RISC-V implementations.

I want to see variable length instructions, but with a requirement for instruction alignment.

I.e. every aligned 64 bit word of RAM contains one of these:

[64 bit instruction]

[32 bit instruction][32 bit instruction]

[16 bit instruction][16 bit instruction][32 bit instruction]

[32 bit instruction][16 bit instruction][16 bit instruction]

[16 bit instruction][16 bit instruction][16 bit instruction][16 bit instruction]

That should make decode far simpler, but put a little more pressure on compilers (instructions will frequently need to be reordered to align - but a review of compiler-generated code suggests that this frequently isn't an issue).



As the other reply states, that is effectively the Qualcomm proposal, though note the 16-bit instructions likely gobble up a large amount of your 32-bit instruction space. You have to have something to identify an instruction as 16-bit, which takes up 32-bit encoding space. The larger you make that identification (in terms of bits), the less encoding space it takes up, but then the fewer spare bits you have to actually encode your 16-bit instruction. RISC-V uses the bottom two bits for this purpose: one value (11) indicates a 32-bit instruction, the others are used for 16-bit instructions. So you're dedicating 75% of your 32-bit encoding space to 16-bit instructions.
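
A minimal sketch of that length rule (the standard RV 16/32-bit split; the longer 48-bit+ formats, which reserve more of the low bits, are left out):

    #include <stdint.h>

    /* Low two bits of the first 16-bit parcel: 11 means a 32-bit
       instruction, anything else means a 16-bit compressed instruction.
       (48-bit and longer encodings reserve further low-bit patterns and
       are ignored in this sketch.) */
    static int rv_insn_length_bytes(uint16_t first_parcel) {
        return ((first_parcel & 0x3) == 0x3) ? 4 : 2;
    }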


By requiring alignment, you can halve or more the size of the identifier.

Since if you have a 16 bit instruction, you know that it must be followed by another 16 bit instruction. Therefore, that 2nd instruction doesn't need the identifying bits. Or, more precisely, within a 32 bit slot, the 2^32 possible encodings need to be divided - and one way to do that is 2^31+2^30 possible 32 bit instructions, and 2^15 * 2^15 pairs of 16 bit instructions. Now, the 16 bit instructions are only taking 25%, not 75%, of the instruction space.
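
As a concrete strawman of that split (my own illustrative bit layout, not anything specified anywhere): one or two prefix bits per aligned 32-bit slot, so neither 16-bit instruction in a pair needs its own tag.

    #include <stdint.h>

    /* Illustrative decode of an aligned 32-bit slot:
         prefix 0  -> one instruction with 31 payload bits  (2^31 encodings)
         prefix 10 -> one instruction with 30 payload bits  (2^30 encodings)
         prefix 11 -> two 15-bit instructions               (2^15 * 2^15 pairs)
       Field positions are assumptions for the sake of the example. */
    static void decode_slot(uint32_t slot) {
        if ((slot & 0x1) == 0) {
            uint32_t payload31 = slot >> 1;                      /* 31-bit form */
            (void)payload31;
        } else if ((slot & 0x2) == 0) {
            uint32_t payload30 = slot >> 2;                      /* 30-bit form */
            (void)payload30;
        } else {
            uint16_t first  = (uint16_t)((slot >> 2) & 0x7FFF);  /* 15 bits */
            uint16_t second = (uint16_t)(slot >> 17);            /* 15 bits */
            (void)first; (void)second;
        }
    }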


But now you have two kinds of 16-bit instructions, the ones for the leading position and the ones for the trailing position, and the latter ones have slightly more available functionality, right? Personally, at this point I'd think the decoder must already be complicated enough (it either has to maintain "leading/trailing/full" state between cycles, or to decode 8/16-byte long batches at once) that you could simply give up and go for an encoding with completely irregular lengths à la x86 without much additional cost.


Not necessarily.

    000x -- 64-bit instruction that uses 60 bits
    001x -- reserved
    010x -- reserved
    011x -- reserved
    100x -- two 32-bit instructions (each 30-bits)
    101x -- two 16-bit instructions then one 32-bit instruction
    110x -- one 32-bit instruction then two 16-bit instructions
    111x -- four 16-bit instructions (each 15 bits)
    xxx1 -- explicitly parallel
    xxx0 -- not explicitly parallel
Alternatively, you can view them as VLIW instruction sets. This has the additional potential advantage of some explicitly parallel instructions when convenient.
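
A rough dispatch sketch for that table (assuming the prefix nibble sits in the low four bits of the 64-bit group, with the parallel flag as its lowest bit; everything beyond what's listed above is guesswork):

    #include <stdint.h>

    /* Dispatch on the 4-bit prefix of an aligned 64-bit group, per the
       table above. Which end of the group the nibble lives at, and the
       exact field extraction, are assumptions for illustration only. */
    static void decode_group(uint64_t group) {
        int      parallel = (int)(group & 0x1);        /* xxx1 / xxx0 */
        unsigned packing  = (unsigned)((group >> 1) & 0x7);
        (void)parallel;

        switch (packing) {
        case 0x0: /* 000: one 64-bit instruction, 60 payload bits        */ break;
        case 0x4: /* 100: two 32-bit instructions, 30 payload bits each  */ break;
        case 0x5: /* 101: 16-bit, 16-bit, then 32-bit                    */ break;
        case 0x6: /* 110: 32-bit, then 16-bit, 16-bit                    */ break;
        case 0x7: /* 111: four 16-bit instructions, 15 payload bits each */ break;
        default:  /* 001, 010, 011: reserved                             */ break;
        }
    }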


One advantage of just sticking with only 32bit instructions is that nobody needs to write packet-aware instruction scheduling.

Even with decent instruction scheduling, you are still going to end up with a bunch of instruction slots filled with nops.

And it will be even worse if you take the next step to make it VLIW and require static scheduling within a packet.


In this case, it's probably not as bad as with actual VLIW: if you see that e.g. your second 16-bit instruction has to be a NOP, you just use a single 32-bit instruction instead; similarly for 32- and 64-bit mixes.


The packet would be external and always fit in a cache line. You'd specify the exact instruction using 16-bit positioning. The fetcher would fetch the enclosing 64-bit group, decode, then jump to the proper location in that group.

In the absolute worst-case scenario where you are blindly jumping to the 16-bit instruction in the 4th position, you only fetch 2-3 unnecessary instructions. Decoders do get a lot more interesting on the performance end as each one will decode between 1 and 4 instructions, but this gets offset by the realization that 64-bit instructions will be used by things like SIMD/vector where you already execute fewer instructions overall.

The move to 64-bit groups also means you can increase cache size without blowing out your latency.

VLIW doesn't mean strictly static scheduling. Even Itanic was just decoding into a traditional backend by the time it retired. You would view it more as optional parallelism hints when marked.

I'd also note that it matches up with VLIW rather well. 64-bit instructions will tend to be SIMD instructions or very long jumps. Both of these are fine without VLIW.

Two 32-bit instructions make it a lot easier to find parallelism and they have lots of room to mark exactly when they are VLIW and when they are not. One 32-bit with two 16-bit still gives the 32-bit room to mark if it's VLIW, so you can turn it off on the worst cases.

The only point where it potentially becomes hard is four 16-bit instructions, but you can either lose a bit of density switching to the 32+16+16 format to not be parallel or you can use all 4 together and make sure they're parallel (or add another marker bit, but that seems like its own problem).


I think if you have 64bit packets, you might as well align jump targets to the 64bit boundary.

I'd rather have an extra nop or two before jump targets than blindly throw away 1-3 instructions' worth of decoding bandwidth on jumps (which are often hot).


If you're fetching 128-bit cache lines, you're already "wasting" cache. Further, decoding 1-3 NOP instructions isn't much different from decoding 1-3 extra instructions except that it adversely affects total code density.

If you don't want to decode the extra instructions, you don't have to. If the last 2 bits of the jump are zero, you need the whole instruction block. If the last bit is zero, jump to the 35th bit and begin decoding while looking at the first nibble to see if it's a single 32-bit instruction or two 16-bit instructions. And finally, if it ends with a 1, it's the last instruction and must be the last 15 bits.

All that said, if you're using a uop cache and aligning it with I-cache, you're already going to just decode all the things and move on knowing that there's a decent chance you jump back to them later anyway.


But if you don't have a uop cache (which is quite feasible with a RISC-V or AArch64 style ISA), then decode bandwidth is much more important than a few NOPs in icache.

Presumably your high performance core has at least three of these 64bit wide decoders, for a frontend that takes a 64bit aligned 192bit block every cycle and decodes three 64bit instructions, six 32bit instructions, twelve 16bit instructions, or some combination of all sizes every cycle.

If you implement unaligned jump targets, then the decoders still need to fetch 64bit aligned blocks to get the length bits. For every unaligned jump, that's up to a third of your instruction decode slots sitting idle for the first cycle. This might mean the difference between executing a tight loop in one cycle or two.

A similar thing applies to a low gate count version of the core, a design where your instruction decoder targets one 32bit or 16bit instruction per cycle (and a 64bit instruction every second cycle). On unaligned jumps, such a decoder still needs to load the first 32 bits of the instruction to check the length bits, wasting an entire cycle on every single branch.

Allowing unaligned jump targets might keep a few NOPs out of icache (depending on how good the instruction scheduler is), but it costs you cycles in tight branchy code.

Knowing compiler authors, if you have this style of ISA and even if it does support unaligned jump targets, they are still going to default to inserting NOPs to align every single jump target, just because the performance is notably better on aligned jump targets and they have no idea if this branch target is hot or cold.

So my argument is that you might as well enforce jump target alignment of 64 bits anyway. Let all implementations gain the small wins from assuming that all targets are 64bit aligned, and use the 2 extra bits to make your relative jump instructions have four times as much range.


Which is easier to decode?

[jmp nop nop], [addi xxx]

OR

[xxx jmp], [nop nop addi]

OR

[xxx jmp], [unused, addi]

All of these tie up your entire decoder, but some tie it up with potentially useful information. That seems superior to me.


It's only unconditional jumps that might have NOPs following them.

For conditional jumps (which are pretty common), the extra instructions in the packet will be executed whenever the branch isn't taken.

And instruction scheduling can actually do some optimisation here. If you have a loop with an unconditional jump at the end and an unaligned target, you can do partial loop unrolling, for example:

With [xxx, inst_1, inst_2], [inst3]...(loop body) ...[jmp to inst_1, nop, nop], you can repack the final jump packet as [inst_1, inst_2, jump to inst_3]

This partial loop unrolling is actually much better for performance than not wasting I-cache, as it reduces the number of instruction decoder packets per iteration by one. Compilers will implement this anyway, even if you do support mid-packet jump targets.

Finally, compilers already tend to put nops after jumps and returns on current ISAs, because they want certain jump targets (function entry points, jump table entries) to be aligned to cache lines.


Don't forget the possibility of 5x 12-bit instructions. In particular, if you have only one or two possibilities for destination registers for each of the 5 positions (so an accumulator-like model), you could still have a quite useful set of 12-bit instructions.


No, the idea was to say "prefix 0"=>31bit, "prefix 10"=>30bit, "prefix 11"=>2*15bit. If you need to, you can split the two bits to have the two 15 bit chunks aligned identically.


Once you're moving to alignment within a larger fixed width block you don't even need to stick to byte boundaries. I've got a toy ISA I've played around with that breaks 64 bit chunks into a 62 bit instruction, a 40 and a 21 bit instruction, or 3 21 bit instructions.


I think Itanium did something like that


Yup, 128 bit bundles with 3 instructions each, and ways to indicate that different bundles could execute in parallel.


The CDC 6600 (1963) had 60-bit words and both 15-bit and 30-bit instructions, with the restriction that 30-bit instructions couldn't straddle a word boundary. The COMPASS assembler would sometimes have to insert a 15-bit no-op instruction to "force upper". Careful optimizing programmers and compilers would try to minimize "forcing upper" to avoid wasting precious space in the 7-word instruction "stack".

So it's been done, and is not a big deal.


Was it called an assembler though? I was looking and https://web.archive.org/web/20120910064824/http://www.bitsav... does not say so, but http://www.bitsavers.org/pdf/cdc/cyber/cyber_70/60225100_Ext... does refer to "COMPASS assembly language". Interesting.


This is basically what Qualcomm proposes: 32 bit instructions and 64 bit aligned 64 bit instructions.

I don't think we have real data on it, but I suspect that the negative impact of this would affect 16/32/48/64 way more than just 32/64.


I would like to also have aligned 16 bit instructions. And maybe even aligned 8 bit instructions for very common things like "decrement register 0" or "test if register 0 is greater than or equal to zero". "Jump back by 10 instructions", etc. Those instructions get widely used in tight loops, so might as well be smaller.


8 bit instructions are a really bad idea. You only get a tiny number of them and they significantly increase decode complexity (and massively reduce the number of larger instructions available).


Others have pointed out that adding bits to identify instruction types eats into your instruction length, so let's go stupid big time: what if you had the instructions as described here, without any instruction length being part of the instruction, but had that stored separately? (3 bits would be plenty per word.) You might put it 1. as a contiguous bit string somewhere, or you might 2. put it at the bottom of the cache line that holds the instructions (the cache line being 512 bits, I assume).

Okay, for 1. you'd have to do two fetches to get stuff into the I-cache (but not if it's part of the same cache line, option 2.). Of course you're going to reduce instruction density because you're using up cache, and there's nothing you can do about that, but at least it would allow n-bit instructions to be genuinely n bits long, which is a big advantage.
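
For concreteness, option 2 might look something like this (the line size and field widths are my own guesses, and carving the tags out of the line costs you one instruction word per line):

    #include <stdint.h>

    /* One possible layout for option 2: a 512-bit I-cache line holding
       seven 64-bit instruction words plus, at the bottom of the same line,
       3 bits of length/packing tag per word. A single fetch brings in both
       the instructions and their tags. */
    struct icache_line {
        uint64_t words[7];  /* 7 x 64 = 448 bits of instructions            */
        uint32_t tags;      /* 7 x 3  = 21 bits of packing tags, rest spare */
        uint32_t spare;     /* pads the line out to 512 bits                */
    };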

That this hasn't been done before to my knowledge is proof that it's a rotten idea, but can the experts here please explain why – thanks


> you'd have to do two fetches

I think this is the big downside. You're effectively taking information which will always be needed at the same time, and storing it in two different places.

There is never a need for one piece of information without the other, so why not store it together.


> There is never a need for one piece of information without the other, so why not store it together.

Why not? As I said, so you can have full-length instructions!

And you can store it together in the same fetchable unit - the cache line (my option 2)


Interesting idea. Effectively moving the extra decode stage in front of the Icache, making the Icache a bit like a CISC trace/microOp cache. On a 512b line you would add 32 bits to mark the instruction boundaries. At which point you start to wonder if there is anything else worth adding that simplifies the later decode chain. And if the roughly 5% adder to Icache size (figuring less than 1/16th since a lot of shared overhead) is worth it.


Why not treat it like Unicode does and just have two marker instructions before and after the compressed ones?

Start compressed instructions <size>

Compressed instructions

End compressed instructions <size>


Which Unicode encoding are you talking about? It sounds a bit like you're talking about UTF-16 surrogate pairs, but that's not how those work. It's not how UTF-8 or UTF-32 work. So, which encoding is this?


If I understand you correctly, the guy I'm responding to is proposing allowing the mixing of different sized instructions. Your suggestion effectively says "I'm starting a run of compressed instructions/I'm finishing a run of compressed instructions" which is a different proposition. Just my take though.


> let's go stupid big time

Wouldn't that be to Huffman encode the instructions? Fixed table, but still, would save a lot of bits on the common instructions surely...


> instructions will frequently need to be reordered to align

Can't you pad it with nops up to the alignment boundary?

Even if there's not an explicit nop instruction in 16-bit and 32-bit variants (I don't know) there's surely something you can find that will have no side effects.


Yes, of course - but in general you want to avoid padding with nops because it makes the code larger (which, as well as costing more RAM, also uses more power and time to read the code from RAM, and fits less code into the instruction cache, which makes the power and time cost of reading code from RAM even bigger).

If you can make a compiler fill those NOP slots with useful instructions, then all the better.

It adds complexity for humans writing assembly code by hand, but that is a tiny minority of code now.


For one, I don't see why you would ever pad a 16-bit instruction with a 16-bit noop instead of just using the 32-bit equivalent instruction. That way, you can skip decoding the no-op.


Because you have a long, odd-numbered chain of 16-bit ops. E.g. 15 16-bit ops leaves you with a 16-bit NOP in order to realign for the following 32-bit op.


To be more clear: if a 16-bit instruction is 32-bit aligned, then you might want a 16-bit noop so that the instruction following the noop is also 32-bit aligned. But, in that case, you could just use a 32-bit aligned 32-bit instruction at the same location. No padding, one instruction saved, and the following instruction is still 32-bit aligned.

If the 16-bit instruction isn't 32-bit aligned, then the following instruction will be 32-bit aligned with no padding.

So, equivalently: "I don't know why you'd ever want to add padding after a 16-bit instruction in order to force the next instruction to not be 32-bit aligned." Is there such a use case (other than the obvious use case of checking behavior/performance of the sub-optimal case, or writing a noop-slide for an exploit payload)?


Use 14 16-bit ops instead, and use a regular 32-bit op as the 15th instruction (which is already correctly aligned since 14 is an even number).


Motorola 68k has this kind of aligned variable-length instruction encoding (in their case it's always aligned on 16 bit boundaries). It's not super difficult to support this in compilers though.



