I do totally agree with the comment, and even upvoted it, but in some sense an HDL does 'want to be procedural'. For example:
Procedure to make brick wall:
    for each brick in wheel-barrow:
        apply mortar to brick, place it, pound it with trowel
(Apologies to all masons. I think you can tell my training is electrical engineering.)
I've thought for some time that the main problem with Verilog is that it looks really close, visually, to C and that gives people the wrong impression.
Things are happening simultaneously in time and space. The ripples are settling while the skips are settling. Either the skip goes stable because the ripple has gone stable, or the carry skips past the block, which gives the ripple "extra time" to go stable.
This is the simplest adder that is faster than naive ripple carry. Note that it has no clocks; this is a purely combinational problem. And most "digital designers" will get it wrong.
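Since the comment is about a carry-skip adder, here is a small Python model of the idea, just to make it concrete. The block width of 4 and all names are my own choices, and of course software steps through this sequentially, whereas in hardware the ripple and skip paths are settling at the same time:

```python
def ripple_add(a, b, width=16):
    """Naive ripple-carry adder: the carry crawls one bit at a time."""
    carry, result = 0, 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return result & ((1 << width) - 1)

def carry_skip_add(a, b, width=16, block=4):
    """Carry-skip adder: each block ripples internally, but if every bit
    in a block propagates (a_i ^ b_i == 1), the incoming carry skips
    straight past the block on a bypass mux."""
    carry, result = 0, 0
    for lo in range(0, width, block):
        cin = carry              # carry entering this block
        propagate_all = True
        c = cin
        for i in range(lo, min(lo + block, width)):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            p = ai ^ bi
            propagate_all &= (p == 1)
            result |= (p ^ c) << i
            c = (ai & bi) | (c & p)
        # bypass mux: if the whole block propagates, cout is just cin,
        # so the next block's carry doesn't have to wait for the ripple
        carry = cin if propagate_all else c
    return result & ((1 << width) - 1)
```

Functionally the two are identical; the skip only shortens the worst-case carry path, which is exactly the time/space interleaving the comment is describing.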
Hardware design is NOT software design. Good digital designers are always thinking about how to interleave things in time and space like this.
> The main point of verilog is to simulate digital logic
I disagree. An HDL, of which verilog is an instance, provides a precise description of a digital machine.
Once you have that description, many things can be done with it. For example you can simulate the machine which is to be produced.
The problem is that HDL code looks like an algorithm, because it is. But it’s not a precise description of a computation; it’s a description of an object.
Most people who I’ve observed program verilog badly do so because they confuse the two. Kinda like confusing giving directions with giving instructions to make a map. Directions and maps are highly related, but they’re not the same thing.
That’s not exactly the right way to think about it. In HDL, you’re writing an algorithm, but the algorithm produces a physical device, not a computation.
It’s like programming a machine to make a watch. At the end you have a watch. Would you say the gears execute in parallel while it measures time?
In some sense it’s right, but in another it’s missing the point.
The idea was to have as many simple ALUs as registers. Results are kept in the ALU/register cell, so all reads are essentially result forwarding. For example, a simple RV32I core requires two read ports and one write port on the register file. If we instead use two R/W ports and put an ALU on each register, we reduce from three buses to two, and can also do operations two at a time when an instruction clobbers one of its inputs, or an ALU op alongside a load/store.
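As a sketch of the bus-counting argument in that comment (all the names here are mine, and this models only the functional idea, not real ports or timing): when an instruction clobbers one of its inputs, only the other operand ever needs a bus, because the result is born in the destination register's own ALU.

```python
class RegAlu:
    """One register cell with its own ALU; the result stays in the cell,
    so a read of it later is essentially result forwarding."""
    def __init__(self, value=0):
        self.value = value

    def execute(self, op, operand):
        # operand arrives over a shared bus; the result never leaves the cell
        if op == "add":
            self.value = (self.value + operand) & 0xFFFFFFFF
        elif op == "sub":
            self.value = (self.value - operand) & 0xFFFFFFFF

def run(regs, dest, op, src):
    """x[dest] = x[dest] op x[src]: one bus transfer (src), no writeback bus."""
    bus = regs[src].value          # the only bus transfer
    regs[dest].execute(op, bus)    # local ALU updates in place
```

Usage: with `regs = [RegAlu(i) for i in range(4)]`, `run(regs, 1, "add", 2)` performs `x1 = x1 + x2` while touching only one bus, which is the freed capacity the comment proposes spending on a second simultaneous operation.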
-target <triple>
The triple has the general format <arch><sub>-<vendor>-<sys>-<abi>, where:
arch = x86_64, i386, arm, thumb, mips, etc.
sub = for ex. on ARM: v5, v6m, v7a, v7m, etc.
vendor = pc, apple, nvidia, ibm, etc.
sys = none, linux, win32, darwin, cuda, etc.
abi = eabi, gnu, android, macho, elf, etc.
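As a rough illustration only (this is not how clang actually parses triples; the real driver also accepts two- and three-component forms, many aliases, and glues <sub> onto <arch>), a naive split of the full four-part form:

```python
def parse_triple(triple):
    """Naively split a full 4-component target triple.
    Only handles the <arch>-<vendor>-<sys>-<abi> shape; note that <sub>
    (e.g. the v7a in armv7a) stays embedded in the arch component."""
    arch, vendor, sys, abi = triple.split("-", 3)
    return {"arch": arch, "vendor": vendor, "sys": sys, "abi": abi}

print(parse_triple("x86_64-pc-linux-gnu"))
# {'arch': 'x86_64', 'vendor': 'pc', 'sys': 'linux', 'abi': 'gnu'}
```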
"arch", "sys" and "eabi" are irrelevant to the core performance. You can not run "arm" on "i386" at all, and "eabi" and "sys" don't affect command scheduling, u-ops fusing and other hardware "magic".
So, only "sub" is somewhat relevant, and that is exactly the kind of dependence RISC-V should avoid, IMHO; yet it doesn't, given its reliance on things like u-op fusion (and not the ISA itself) to achieve high performance.
For example, performance on modern x86_64 doesn't gain much if code is compiled with "-march=skylake" instead of "-march=generic" (I remember times when recompiling for "i686" instead of "i386" provided +10% performance!).
If RISC-V performance depends on u-op fusion (and that is what RISC-V proponents say every time the RISC-V ISA is criticized for performance bottlenecks, like the absence of conditional move or integer overflow detection), we will have a situation where "sub" becomes very important again.
That is OK for embedded CPUs, since an embedded CPU and its firmware are tightly coupled anyway, but it is very bad for a general-purpose CPU. Which "sub" should the Debian build cluster use? And why?
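To make "relying on u-op fusion" concrete, here is a toy peephole pass in the spirit of the branch-over-move idiom that fusing RISC-V cores are expected to recognize as a conditional move. The instruction representation and all names are invented for this example, not any real core's scheme:

```python
# Each instruction is a tuple: ("bnez_skip1", cond) means "branch over
# exactly the next instruction if cond != 0"; ("mv", dst, src) is a move.

def fuse(ops):
    """Fuse a branch-over-move pair into one internal cmov u-op."""
    out, i = [], 0
    while i < len(ops):
        cur = ops[i]
        if (cur[0] == "bnez_skip1" and i + 1 < len(ops)
                and ops[i + 1][0] == "mv"):
            cond = cur[1]
            _, dst, src = ops[i + 1]
            # one fused u-op instead of a branch the predictor must track
            out.append(("cmov_if_zero", dst, src, cond))
            i += 2
        else:
            out.append(cur)
            i += 1
    return out
```

Whether a real core's fusion fires depends on the compiler emitting exactly the pair the hardware looks for, which is precisely the "sub"-level coupling between code generation and microarchitecture being objected to above.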
It's likely too big for those programs. I am (just now) starting a build with the OpenLane/Sky tools, not with the intent of actually taping out, but more to squeeze the architectural timing (beyond the slow FPGA I've been using for bringing up Linux), so I can find the places where I'm being stupidly unreasonable about timing (working on my own, I can't afford a Synopsys license).
I'm just starting this week. I've recently switched to some use of SV interfaces, and yosys does not like arrays of them (sv2v seems the way to go). But even without that, yosys goes bang: something's too big, while Vivado compiles the same stuff. I re-architected the bit that seemed the obvious culprit, but no luck so far.
If you're at the point in your career where you're not sure which is the right textbook then "A Quantitative Approach" is likely to be really tough to get through.
Computer Organization and Design, by the same authors, is considered a better choice for a first book. I personally loved it and couldn't put it down the first time I read it.
This book definitely skews pragmatic and hands-on, and doesn't assume much. It covers both VHDL and Verilog, and has sections on branch prediction, register renaming, etc.
I personally am not into the verilog-specific books. For me, HDLs are hardware description languages: first you learn to design digital hardware, then you learn to describe it.
I’m sure that’s what the team that invented segment registers said too.
The question is does it make sense to add these to the ISA long term? In the short term, given die density and how memory works today, it has advantages. But die density increases, making OoO cores cheaper, and memory technology changes. It’s not obvious that these are long term improvements.
I agree mostly with Keller's take, but I think he left out one key factor: the quality of the software toolchain.
The x86 toolchains are amazing. They're practically black magic in the kinds of optimizations they can do. Honestly, I think they're a lot of what is keeping Intel competitive in performance. ARM toolchains are also very good; I think they're a big part of why ARM can beat RISC-V in code size and performance on equivalent-class hardware, because honestly, as Keller says, the ISAs aren't all that different for common-case software. But frankly, the x86 and ARM toolchains should dominate RISC-V when we just consider the number of person-hours that have been devoted to them.
So for me the real question is: where are the resources that would make RISC-V toolchains competitive going to come from (and keep in mind x86 and ARM have open-source toolchains too)? And will those optimizations be made available to the public?
If we see significant investment in the toolchains from the likes of Google, Apple and Nvidia, or even Intel, ARM needs to be really worried.
It's not so much "tricks" that one needs to look out for.
The compiler has just tons of internal heuristics about when and when not to apply various code transformations. Those heuristics, first off, may not even be applicable to your platform of choice, and even if they are, their magic numbers aren't necessarily well tuned to the platform and application at hand.
Here is a well-written and concise case study, albeit somewhat old (2010), that illustrates what I am talking about. The specific measurements will have changed since then, but the overall high-level situation hasn't. If you read the paper, just mentally replace every instance of x86 with ARM and every instance of ARM with RISC-V, and you'll get the idea.