I do totally agree with the comment, and even upvoted it, but in some sense an HDL does 'want to be procedural'. For example:
Procedure to make brick wall:
    for each brick in wheel-barrow:
        apply mortar to brick, place it, pound it with trowel
(Apologies to all masons. I think you can tell my training is electrical engineering.)
I've thought for some time that the main problem with Verilog is that it looks really close, visually, to C and that gives people the wrong impression.
Things are happening simultaneously in time and space. The ripples are settling while the skips are settling. Either the skip goes stable because the ripple has gone stable, or the carry skips past the block, which gives the ripple "extra time" to go stable.
This is the simplest adder that is faster than naive ripple carry. Note that it has no clocks; this is a purely combinational problem. And most "digital designers" will get it wrong.
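Since the comment is about a carry-skip adder, here is a small Python model of the idea, just to make it concrete. The block width of 4 and all names are my own choices, and of course software steps through this sequentially, whereas in hardware the ripple and skip paths are settling at the same time:

```python
def ripple_add(a, b, width=16):
    """Naive ripple-carry adder: the carry crawls one bit at a time."""
    carry, result = 0, 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return result & ((1 << width) - 1)

def carry_skip_add(a, b, width=16, block=4):
    """Carry-skip adder: each block ripples internally, but if every bit
    in a block propagates (a_i ^ b_i == 1), the incoming carry skips
    straight past the block on a bypass mux."""
    carry, result = 0, 0
    for lo in range(0, width, block):
        cin = carry              # carry entering this block
        propagate_all = True
        c = cin
        for i in range(lo, min(lo + block, width)):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            p = ai ^ bi
            propagate_all &= (p == 1)
            result |= (p ^ c) << i
            c = (ai & bi) | (c & p)
        # bypass mux: if the whole block propagates, cout is just cin,
        # so the next block's carry doesn't have to wait for the ripple
        carry = cin if propagate_all else c
    return result & ((1 << width) - 1)
```

Functionally the two are identical; the skip only shortens the worst-case carry path, which is exactly the time/space interleaving the comment is describing.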
Hardware design is NOT software design. Good digital designers are always thinking about how to interleave things in time and space like this.
> The main point of verilog is to simulate digital logic
I disagree. An HDL, of which verilog is an instance, provides a precise description of a digital machine.
Once you have that description, many things can be done with it. For example you can simulate the machine which is to be produced.
The problem is that HDL code looks like an algorithm, because it is. But it’s not a precise description of a computation; it’s a description of an object.
Most people who I’ve observed program verilog badly do so because they confuse the two. Kinda like confusing giving directions with giving instructions to make a map. Directions and maps are highly related, but they’re not the same thing.
That’s not exactly the right way to think about it. In HDL, you’re writing an algorithm, but the algorithm produces a physical device, not a computation.
It’s like programming a machine to make a watch. At the end you have a watch. Would you say the gears execute in parallel while it measures time?
In some sense it’s right, but in another it’s missing the point.
The idea was to have as many simple ALUs as registers. Results are kept in the ALU/register cell, so all reads are essentially result forwarding. For example, a simple RV32I core requires two read ports and one write port on the register file. If we instead use two R/W ports and put an ALU on each register, we reduce from three buses to two, and can also do operations two at a time when an instruction clobbers one of its inputs, or an ALU op alongside a load/store.
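As a sketch of the bus-counting argument in that comment (all the names here are mine, and this models only the functional idea, not real ports or timing): when an instruction clobbers one of its inputs, only the other operand ever needs a bus, because the result is born in the destination register's own ALU.

```python
class RegAlu:
    """One register cell with its own ALU; the result stays in the cell,
    so a read of it later is essentially result forwarding."""
    def __init__(self, value=0):
        self.value = value

    def execute(self, op, operand):
        # operand arrives over a shared bus; the result never leaves the cell
        if op == "add":
            self.value = (self.value + operand) & 0xFFFFFFFF
        elif op == "sub":
            self.value = (self.value - operand) & 0xFFFFFFFF

def run(regs, dest, op, src):
    """x[dest] = x[dest] op x[src]: one bus transfer (src), no writeback bus."""
    bus = regs[src].value          # the only bus transfer
    regs[dest].execute(op, bus)    # local ALU updates in place
```

Usage: with `regs = [RegAlu(i) for i in range(4)]`, `run(regs, 1, "add", 2)` performs `x1 = x1 + x2` while touching only one bus, which is the freed capacity the comment proposes spending on a second simultaneous operation.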
-target <triple>
The triple has the general format <arch><sub>-<vendor>-<sys>-<abi>, where:
arch = x86_64, i386, arm, thumb, mips, etc.
sub = for ex. on ARM: v5, v6m, v7a, v7m, etc.
vendor = pc, apple, nvidia, ibm, etc.
sys = none, linux, win32, darwin, cuda, etc.
abi = eabi, gnu, android, macho, elf, etc.
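As a rough illustration only (this is not how clang actually parses triples; the real driver also accepts two- and three-component forms, many aliases, and glues <sub> onto <arch>), a naive split of the full four-part form:

```python
def parse_triple(triple):
    """Naively split a full 4-component target triple.
    Only handles the <arch>-<vendor>-<sys>-<abi> shape; note that <sub>
    (e.g. the v7a in armv7a) stays embedded in the arch component."""
    arch, vendor, sys, abi = triple.split("-", 3)
    return {"arch": arch, "vendor": vendor, "sys": sys, "abi": abi}

print(parse_triple("x86_64-pc-linux-gnu"))
# {'arch': 'x86_64', 'vendor': 'pc', 'sys': 'linux', 'abi': 'gnu'}
```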
"arch", "sys" and "eabi" are irrelevant to the core performance. You can not run "arm" on "i386" at all, and "eabi" and "sys" don't affect command scheduling, u-ops fusing and other hardware "magic".
So, only "sub" is somewhat relevant, and that is exactly the kind of dependence RISC-V should avoid, IMHO; yet it doesn't, given its reliance on things like u-op fusion (and not the ISA itself) to achieve high performance.
For example, performance on modern x86_64 doesn't gain much if code is compiled with "-march=skylake" instead of "-march=generic" (I remember times when recompiling for "i686" instead of "i386" provided +10% performance!).
If RISC-V performance depends on u-op fusion (and that is what RISC-V proponents say every time the RISC-V ISA is criticized for performance bottlenecks, like the absence of conditional move or integer overflow detection), we will have a situation where "sub" becomes very important again.
That is OK for embedded CPUs, since an embedded CPU and its firmware are tightly coupled anyway, but it is very bad for a general-purpose CPU. Which "sub" should the Debian build cluster use? And why?
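To make "relying on u-op fusion" concrete, here is a toy peephole pass in the spirit of the branch-over-move idiom that fusing RISC-V cores are expected to recognize as a conditional move. The instruction representation and all names are invented for this example, not any real core's scheme:

```python
# Each instruction is a tuple: ("bnez_skip1", cond) means "branch over
# exactly the next instruction if cond != 0"; ("mv", dst, src) is a move.

def fuse(ops):
    """Fuse a branch-over-move pair into one internal cmov u-op."""
    out, i = [], 0
    while i < len(ops):
        cur = ops[i]
        if (cur[0] == "bnez_skip1" and i + 1 < len(ops)
                and ops[i + 1][0] == "mv"):
            cond = cur[1]
            _, dst, src = ops[i + 1]
            # one fused u-op instead of a branch the predictor must track
            out.append(("cmov_if_zero", dst, src, cond))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

Whether a real core's fusion fires depends on the compiler emitting exactly the pair the hardware looks for, which is precisely the "sub"-level coupling between code generation and microarchitecture being objected to above.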
It's likely too big for those programs. I am (just now) starting a build with the OpenLane/Sky tools, not with the intent of actually taping out, but more to squeeze the architectural timing (beyond the slow FPGA I've been using for bringing up Linux), so I can find the places where I'm being stupidly unreasonable about timing (working on my own, I can't afford a Synopsys license).
I'm just starting this week. I've recently switched to some use of SV interfaces, and yosys does not like arrays of them (sv2v seems the way to go). But even without that, yosys goes bang: something's too big, while Vivado compiles the same stuff. I re-architected the bit that seemed the obvious culprit, but no luck so far.
If you're at the point in your career where you're not sure which is the right textbook then "A Quantitative Approach" is likely to be really tough to get through.
Computer Organization and Design, by the same authors, is considered a better choice for a first book. I personally loved it and couldn't put it down the first time I read it.
This book definitely skews pragmatic and hands-on, and doesn't assume much. It covers both VHDL and Verilog, and has sections on branch prediction, register renaming, etc.
I personally am not into the verilog-specific books. For me, HDLs are hardware description languages: first you learn to design digital hardware, then you learn to describe it.
I’m sure that’s what the team that invented segment registers said too.
The question is does it make sense to add these to the ISA long term? In the short term, given die density and how memory works today, it has advantages. But die density increases, making OoO cores cheaper, and memory technology changes. It’s not obvious that these are long term improvements.
I agree mostly with Keller's take, but I think he left out one key factor: the quality of the software toolchain.
The x86 toolchains are amazing. They're practically black magic in the kinds of optimizations they can do. Honestly, I think they're a lot of what is keeping Intel competitive in performance. ARM toolchains are also very good; I think they're a big part of why ARM can beat RISC-V in code size and performance on equivalent-class hardware, because honestly, as Keller says, the ISAs aren't all that different for common-case software. But frankly, the x86 and ARM toolchains should dominate RISC-V when we just consider the number of person-hours that have been devoted to them.
So for me the real question is: where are the resources that would make RISC-V toolchains competitive going to come from (and keep in mind x86 and ARM have open-source toolchains too)? And will those optimizations be made available to the public?
If we see significant investment in the toolchains from the likes of Google, Apple and Nvidia, or even Intel, ARM needs to be really worried.
It's not so much "tricks" that one needs to look out for.
The compiler has just tons of internal heuristics about when and when not to apply various code transformations. Those heuristics, first off, may not even be applicable to your platform of choice, and even if they are, their magic numbers aren't necessarily well tuned to the platform and application at hand.
Here is a well-written and concise case study, albeit somewhat old (2010), that illustrates what I am talking about. The specific measurements will have changed since then, but the overall high-level situation hasn't. If you read the paper, just mentally replace every instance of x86 with ARM and every instance of ARM with RISC-V, and you'll get the idea.