Hacker News

Would be interesting to see comparisons to the algorithms in that benchmark: https://github.com/miloyip/itoa-benchmark

What I generally wonder is why ISAs don't have special instructions for a task that is so common.



> What I generally wonder is why ISAs don't have special instructions for a task that is so common.

ISAs used to have instructions to support binary coded decimals (BCD). BCD encoding stores values as 0x[0-9][0-9], and the instructions provide math operations on that representation. On x86, these still exist, for example AAD: https://c9x.me/x86/html/file_module_x86_id_2.html

Other ISAs (eg 6502) had a processor flag to switch between regular and BCD mode.


> why ISAs don't have special instructions

IBM System/360 and its descendants actually do have instructions to convert between decimal and binary representations (CVD, CVB), decimal arithmetic instructions, and even an instruction for formatting numbers (ED). I should also mention PACK and UNPK to convert between character and packed decimal number representation.

Today, of course, it is less important to have those instructions because computers are used less to crunch numbers and more to do other stuff.


I think we agree, but I would say computers are used _more_ to crunch numbers. Inside the system, everything is binary; you only need decimal to present data to humans.

With old-time COBOL workloads, much of the work was presenting data to humans. The typical banking application, for instance, reads numbers, does a few additions, and shows them on a terminal or prints a report. Using fixed-point arithmetic was normal.

Nowadays, we do way more floating-point arithmetic (for graphics, for machine learning).

Also, I think the reason those complex instructions aren’t created anymore is pipelining. I can’t find timing information for the ED instruction, but I think it is a safe bet it isn’t constant cycle count, let alone single-cycle. It not only writes a character buffer, it also has to read the bytes containing the pattern defining how to format the number.


What you are saying is only partly true. Yes, the packed decimal instructions are horrible today from a performance perspective, because they require direct access to memory.

However, recently, IBM has added decimal floating-point arithmetic to z/Architecture CPUs (and also made it part of the updated IEEE 754 standard). These modern instructions do not suffer from the problems of the old ones, and in fact should be preferred to packed decimal for applications that need decimal arithmetic.

And more broadly, IBM still adds a lot of specialized instructions to mainframe CPUs. For example, calls to cryptographic accelerators.


ISA design is a highly quantitative business. You take a bunch of application benchmarks, compile them with your proposed new instruction, and measure the speedup.

So the answer is one of:

1) People have considered it, and decided against it.

2) People dismissed it at an earlier stage, by observing that it doesn't make a big enough blip in CPU profiles to warrant further investigation.

Also, it's hard to speed up very much in hardware, because it's a bunch of interdependent divides, which current CPU divide instructions are already good at. If the hardware acceleration interface is vector style, it's doubtful many applications could easily take advantage of it. This article talks about printf, which certainly can't.

Do you have an area of applications in mind that spends a big slice of its computation time on integer -> ascii conversions?


> Also, it's hard to speed up very much in hardware because it's a bunch of interdependent divides

Most things which seem sequential are actually easily parallelized. This is one of them; all of the divides can be done in parallel, and since divisions are by constants it all falls into some fairly simple hardware.

The real argument is more that it's a bit pointless to hardware accelerate something that takes 10-20ns in optimised software if it's not used 100 billion times per day per core.


> Do you have an area of applications in mind that spends a big slice of its computation time on integer -> ascii conversions?

I don't know if the situation has changed now, but a lot of the "big data" systems I worked with several years ago liked to use text formats (JSON, CSV, TSV, XML) for their input/output --- and I always thought it was a pretty stupid idea, since the data volumes were in the gigabytes and terabytes range.


They did have special opcodes in x86 but dropped them in x64:

https://en.m.wikipedia.org/wiki/Intel_BCD_opcode


...and more puzzlingly, those opcodes just became invalid, and weren't reused for something more useful in 64-bit mode. It reminds me of LAHF/SAHF and segmentation --- both features that AMD decided to remove initially for AMD64, but were forced to reintroduce after it was discovered that things like virtualisation needed them. Unfortunately, the same is unlikely to be true of the BCD instructions...


This benchmark, lut and divide&conquer lut, seems to be really slow compared to Paul's new itoa. His biggest trick is to preallocate the target buffer size, and to avoid all this heavy branching.



