I am befuddled by the idea of saving the return address explicitly (mov rdi, $ + 15) while also using call/ret. Maybe there is something clever going on here, but I don't understand the purpose.
calculateStrLength is not only calculating the string length; it's also pushing the contents of the string onto the stack (and expanding each byte to 8 bytes, because that's the only valid size of push on x64). The stack pointer is no longer pointing to the return address, so the code puts the return address (which was calculated by hand by looking at the disassembly and counting bytes rather than using a label, for some reason, but I digress) back onto the stack and then uses RET to pop it right back off and jump to it (PUSH RDI + RET is equivalent to JMP RDI).
Then reverseStr loops over the string length, popping a character off the stack one at a time and writing all eight bytes into the output buffer (running off the end of the reserved buffer for the last 7 characters, incidentally).
I suppose this is an interesting exercise to understand how the stack and call/ret mechanisms work, but it's not the way to write understandable code. Using the stack in this way will almost certainly also break the return address prediction that modern CPUs perform, making the code slower than the straightforward version that doesn't use the stack at all.
Yes, the text is definitely not what interested readers should actually try to emulate, even if it motivates them to learn more.
I don't know about the Linux 64-bit ABI's constraints, but if you want interoperable code on Win64, you should actually learn not to use push except in prologues, epilogues, and leaf functions, and even then the trickier part is keeping the exception info consistent. So some other approaches are better to learn first.
drv, I also agree. It's also highly inefficient: it creates n times 8 bytes of memory pressure on the stack instead of simply accessing the bytes as they are and letting the caches do their job.
> Its a symptom of doing really ugly things with the stack
Yes, that's spaghetti code. In general I strongly recommend writing assembly functions that follow the platform's ABI (I'm a TA for a course using assembly programming). At the very least, function calls and return values should be consistent, and function calls shouldn't have these kinds of side effects. In this case:
* calculateStrLength effectively leaks stack space because it pushes the input string and doesn't balance the stack before returning - this is why the return value is passed using rdi. At least the return value pushed by call could have been preserved in calculateStrLength itself. It also increments the size counter without resetting it (it's reset to 0 in _start) which is a bit of a weird choice.
* then reverseStr (which is not a function but just a label in _start) pops the string off the stack and copies it to the OUTPUT buffer (but doesn't copy an end of string NULL character - which works because the buffer is in the bss section, but it's not a good practice). The return address pushed by the call to calculateStrLength is left on the stack.
The whole thing is designed pretty weirdly because reversing a string is trivial to do in place (or with just an input and output buffer), there's no need to use the stack.
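To make the "trivial to do in place" point concrete, here is a minimal sketch in C (an illustration, not the article's code): two indices walk toward each other and swap, with no stack pushes and no extra buffer.

```c
#include <string.h>

/* Reverse a NUL-terminated string in place with two indices --
   no per-character stack pushes, no separate output buffer. */
static void reverse_str(char *s)
{
    size_t i = 0;
    size_t j = strlen(s);       /* one past the last character */
    while (j-- > i) {           /* j now indexes the last unswapped char */
        char tmp = s[i];
        s[i++] = s[j];
        s[j] = tmp;
    }
}
```

The same two-pointer loop translates directly to a handful of x86-64 instructions, which is why the stack-based version in the article buys nothing but complexity.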
Does anyone (the author?) know the best reference for x86-64 assembly? I have been loath to teach it to students because I haven't found great material for it. I use "Programming from the Ground Up"* as my main text for 32-bit x86 but haven't found anything comparable for 64-bit. I also use:
EDIT: I agree with everyone; I also learned by doing: decompiling small C programs, gdb, and the manuals. However, I am looking for something a bit gentler. That process will work for the top 10%, but the rest will struggle, and at the end of the day they need to write an x86 code generator, so without knowing how to write assembly they are in trouble.
I learned x86_64 by (a) reading assembly other people wrote, (b) disassembling compiled code, (c) the Intel architecture manuals, and (d) writing a ton of assembly and fixing my mistakes. This is something of the sink-or-swim approach, but I find it works really well; at the end of the day, assembly is really, really simple--the key thing is to just start writing it.
Of course, it's hard to do this sort of immersion in a course where some of the students aren't really interested and just want to know what they need to complete the assignments, so the efficacy depends a lot on the nature of the class.
If you're using the compiler as a learning tool (and it's great for that!) may I suggest giving GCC Explorer a go: http://gcc.godbolt.org/
The name is deceptive; many compilers other than GCC are available. The turnaround time of seeing how small snippets of code translate to assembly makes it pretty well suited to exploring assembly. Though it's my site, so the usual author disclaimer applies here too!
Also I find compiling small tests and then single stepping in assembly mode another way to really 'get' what instructions do.
May I ask why one would learn x86-64 assembly other than for extreme use cases like compiler design and reverse engineering? Modern languages and their compilers can usually produce the best possible machine code most of the time. Further, embedded systems run on ARM nowadays which is a bit simpler to deal with due to the smaller instruction set.
Modern languages and their compilers can usually produce the best possible machine code most of the time.
This is frequently asserted, but rarely by people who spend much time looking at assembly generated by modern compilers. The easiest disproof would be to compile the same C or C++ function with different compilers and benchmark the generated code. Between Clang, GCC, Intel, and MSVC, I'd usually expect about a 50% difference (1:1.5) between the fastest and the slowest, with any one of them about equally likely to win or lose on a given function.
By the same token, I'd assume it's usually possible to get a 100% speedup by dropping to hand-coded assembly (1:2), and wouldn't be surprised by 5x-10x especially if one is allowed to target the features of specific processor family. Sometimes this can be 'back ported' to the higher level language once the solution is known, and sometimes it can't. But the potential for a significant speedup over the compiler generated code is still there. The more valid question is whether any particular case justifies the extra development and maintenance work involved.
> By the same token, I'd assume it's usually possible to get a 100% speedup by dropping to hand-coded assembly (1:2), and wouldn't be surprised by 5x-10x especially if one is allowed to target the features of specific processor family.
Yup. I've seen a 40x speedup over C/C++ code just by using SIMD (SSE/AVX) and eliminating all branches other than the loop condition. It was possible to process up to 16 results in parallel (using just one CPU core), in about the same number of cycles as the C version took to process one, while getting rid of the penalties from mispredicted branches.

Data-dependent branches are very expensive; they often have nearly a 50% branch mispredict rate. If a branch is data dependent and thus unpredictable, it's usually much faster to do the computation unconditionally every time and simply mask away unwanted results than to skip the computation by branching over it.

Depending on the CPU model, with SIMD you can compute roughly up to 100-400 floating point ops (in SIMD vectors) and 200-800 integer ops in the same time as one mispredicted branch takes: those 14-20 clock cycles! Haswell executes up to 16 floating point ops per core per clock cycle using SIMD. This assumes you don't get data starved, of course, or have pipeline stalls due to unbreakably long dependency chains, etc.
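The compute-then-mask idea the parent describes can be sketched in scalar C (the SIMD version does the same thing per lane with compare-and-AND instructions); this is an illustrative sketch, not the commenter's actual code. A comparison produces 0 or 1, which is negated into an all-ones or all-zero mask used to select the result without any branch.

```c
#include <stdint.h>

/* Branchless select: keep x if x < threshold, otherwise yield 0.
   The comparison result is turned into an all-ones/all-zero mask,
   mirroring what a SIMD compare + AND does in every vector lane. */
static uint32_t zero_if_ge(uint32_t x, uint32_t threshold)
{
    uint32_t mask = (uint32_t)-(uint32_t)(x < threshold);  /* 0xFFFFFFFF or 0 */
    return x & mask;
}
```

Both outcomes are "computed" regardless of the data, so the unpredictable data-dependent branch (and its mispredict penalty) disappears entirely.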
One thing compilers just can't get right is when there's a large switch statement in performance-critical code. Compilers can't seem to make intelligent decisions about register allocation in that case, and behave as if you had true function pointers (= disabled optimizations). In that particular situation you can gain 2-3x performance over a compiler just by dropping to assembler.
Compilers are pretty good in most cases, though. I wish they were smarter at eliminating branches.
> Modern languages and their compilers can usually produce the best possible machine code most of the time.
You have been lied to. Modern compilers produce better code than non-expert assembly programmers. Expert assembly programmers are still much, much better than compilers (note, however, that we are also much more expensive). As a professional library writer, I typically beat compilers by about 2x for non-trivial non-vectorizable functions; when vectorization is possible, that figure usually increases; now and then I beat a compiler by a few orders of magnitude.
ARM isn't really simpler than x86. ARMv7 has about as many instructions as x86, and probably has more if you differentiate between thumb and arm (which you should if you are micro-optimizing). It also features some constructs like general conditional execution that are notoriously difficult for compilers to use effectively. The 64-bit ARM ISA is every bit as complex as modern x86_64 (though more orthogonal, which makes it lovely to program for).
N.B. "the best possible machine code" is actually non-computable in the general case -- it's equivalent to the halting problem. This is sometimes referred to as the "full employment theorem". So from a certain point of view, no matter how good compilers get, there will always be room to do better.
Yeah, although pretty often getting that 2x on non-vectorizable code is about doing a different thing that maps better to the hardware.
Mostly you just have to do what is outlined in the next few paragraphs. Optimization is actually pretty simple - don't work against the hardware, work with it. Measure.
Memory is slow. Just changing how things are ordered in memory can sometimes give a 2x speedup. So:

* Try to keep your data sequential in memory.

* Avoid locks if you can; locking is shared memory between cores, which requires a lot of slow inter-core communication. Avoid atomic operations if you can, for the same reason. Better yet, avoid sharing any mutable data with other cores, if you can.

* If NUMA (multiple physical CPU sockets, each with its own physically connected memory), use the local CPU's memory, otherwise you'll very quickly saturate the CPU socket interconnects!

* Avoid function pointers, especially to short functions; they cause compilers to drop optimizations and often cause CPU branch mispredicts.

* Avoid pointer chasing, but if you do it, measure whether it makes sense to prefetch the next entry/entries early.

* Don't waste cache lines! You have fewer than 512 or 1024 of those very valuable L1 cache lines; they're easy to exhaust. Don't write unnecessarily to cache lines shared by other cores; this makes a lot of very slow things happen behind the scenes.

* Try to align structs to the next larger power-of-two size, but remember that often just having more data in cache is better, so measure. Or align structs to a cache line boundary (usually 32, 64, or 128 bytes) so that the CPU doesn't need to fetch two cache lines. Otherwise align to the data type size, but measure that too. Try to keep your stack aligned as well.
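One concrete piece of the advice above, aligning a struct to a cache line boundary to avoid false sharing, looks like this in C11. The 64-byte line size and the struct name are assumptions for the sake of the example; check your target CPU.

```c
#include <stdalign.h>
#include <stdint.h>

/* Per-core counters aligned to a presumed 64-byte cache line, so two
   cores updating adjacent array elements never write to the same line
   (which would trigger slow cross-core cache coherency traffic). */
struct per_core_stats {
    alignas(64) uint64_t packets;
    uint64_t bytes;
    /* padded by the compiler out to the 64-byte alignment */
};

_Static_assert(alignof(struct per_core_stats) == 64,
               "struct should be cache-line aligned");
```

An array of `struct per_core_stats`, one element per core, then places each core's counters on its own line, at the cost of some padding, which is the "more data in cache vs. no sharing" trade-off the comment says to measure.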
Don't register spill, be frugal with how many registers you need in inner loops (especially on 32-bit x86!). Don't do long dependency chains in computations, if you can avoid it.
Minimize the number of branches, even nearly 100% predictable ones - they still take up valuable CPU branch resources.
If you have a lot of memory random access, arrange your data in a way that maximizes locality of reference. Use large pages to avoid TLB misses (and worse, page faults).
Noticed a pattern? Computation is fast, memory is very slow. Sequential streaming from memory is faster by an order of magnitude compared to random access. You might be able to do thousands of floating point operations during a single cache miss! Communicating with other cores is slower. A lot of effective optimization comes down to managing cache better.
Always assume that what you know is outdated, and even when it isn't, that you're simply wrong! Read CPU model specific documentation, but don't trust it blindly. Assume nothing; measure. Small changes in code can affect performance in seemingly unrelated locations. Don't do silly things: it doesn't matter whether your loop counters count up or down, and the "register" keyword does nothing (and even if it did, you'd probably make your code slower with it).
Be sure to measure on multiple CPU architectures (for x86: a few AMD generations, Intel Sandy Bridge, Atom, etc.). All of those behave differently, often very differently! Take advantage of the CPU performance counters; they're really useful for seeing what happens inside the chip.
Don't optimize until you're really sure you need to, and until you've done all you can in choosing the best algorithm. Is all the code you execute really necessary? Is there anything you could precompute? Remember that you can do a huge amount of computation in the time of a single memory access! And don't naively assume you should always choose the algorithm with the best big O: your constants are often large and your data is usually small!
> ARM isn't really simpler than x86. ARMv7 has about as many instructions as x86, and probably has more if you differentiate between thumb and arm (which you should if you are micro-optimizing). It also features some constructs like general conditional execution that are notoriously difficult for compilers to use effectively. The 64-bit ARM ISA is every bit as complex as modern x86_64 (though more orthogonal, which makes it lovely to program for).
True. ARMv7 conditional execution (predicates) is also reading from flags register. On a simple pipelined ARM implementation that equals a stall, if flags were touched by the few previous instructions, until last flag touching instruction is retired. And on more complicated designs that implement out-of-order execution it can still often mean a stall because of long dependency chains.
On ARMv7 practically all instructions can have dependency on flags register!
You do avoid branch mispredicts, but at what cost? Having so many read/write ports in the flags register consumes silicon and power too. The cost increases the deeper the pipeline. To get higher clock frequencies, the CPU pipeline needs to become longer.
ARMv8 (64-bit) did the only sensible thing and removed predicate insanity.
If you care about optimizing your C/C++ code, you need to be reading the assembly language output from the compiler on a regular basis. You will constantly stumble over cases where the compiler isn't doing what you expected. For example, it might be reloading memory in an inner loop instead of caching its value in a register outside the loop because its alias analysis was inconclusive. Or not enough inlining is happening where you expected it. Or there's too much inlining like when a side exit to a cold function is inlined within an inner loop. This last case exemplifies a general issue that ahead-of-time compilers have with identifying code paths as hot or cold. You can sometimes give hints to the compiler, but you must verify that the hints were actually taken in the way you expected.
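A typical hot/cold hint of the kind described is GCC/Clang's `__builtin_expect`; the macro names below are the conventional ones, not part of any standard. As the parent says, it only hints, so you still have to read the generated assembly to confirm the cold path really was moved out of the hot loop.

```c
/* GCC/Clang-specific branch-probability hints.  These influence block
   layout (keeping the expected path fall-through) but guarantee
   nothing -- verify the effect in the disassembly. */
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

static int parse_byte(int c)
{
    if (UNLIKELY(c < 0))     /* rare error path, hopefully laid out cold */
        return -1;
    return c & 0xFF;         /* hot path: plain byte extraction */
}
```

C++20's `[[likely]]`/`[[unlikely]]` attributes express the same intent portably, with the same caveat that the compiler may ignore them.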
Obviously you only dive into this level of detail with a piece of code once you have reason to believe it's warranted (e.g. profiler information).
Yes. Sampling profiler + reading generated assembly are the two prongs of my optimization approach.
Sometimes the compiler does something really stupid and inefficient, but other times it does something really clever that I (who never wrote any serious assembly by hand) would never think of.
Better be careful though. Sometimes compiler-generated output that looks stupid is actually correct, and what you think is faster is not faster, or not even correct. An example: strength-reducing signed division by two to a shift. The results are wrong for negative numbers without a fixup.
Compilers also consider whole possible value range. They do reduce the set of possible values based on earlier code. Like if you've previously multiplied a value by two, the compilers know afterwards "value & (~1)" can be safely removed entirely. But the compiler is forced to drop an optimization if there's even one possible value for which the optimization is not valid. Common values for this to occur are for example 0, INT_MIN, INT_MAX, etc.
Compilers also know a lot about costs. Hypothetically speaking (I haven't actually seen the following happen!), there can be situations where it's faster to multiply by two than to shift left or add a value to itself, for example because of CPU execution unit availability that the compiler knows about.
Many people concerned with the assembly language output don't use modern languages - they use C and C++ ;) And even if the best possible machine code is produced most of the time, which I'm not sure is actually the case, you've still got to deal with the cases where it isn't. So you need to look at the generated code to see which case you're dealing with. Also comes in handy for stepping through optimized code.
(Contrary to popular belief x64 is not hard to learn, just a bit ugly. The reference manual is pretty easy to read, and the instruction list may look daunting but you'll never need to use most of it. I actually found ARM harder to deal with; the instruction descriptions in the reference manual aren't as well written.)
You'd learn x86-64 assembly to better understand how your high level code maps to the CPU. To see through the level of abstraction you're using. So that you don't do senseless things in your high level language of choice. Not to actually write assembly most of the time.
Although sometimes you have to write asm, if you need high performance. You can beat compilers, simply because they're very bad at autovectorizing and global optimization when control flow is too complicated.
Compilers are good most of the time, just not always.
Code that relies on SIMD or other processor-specific performance features is often written in hand-coded assembly, because even the best compilers do not reliably produce optimally vectorized code. This includes numerical simulations, graphics libraries, media codecs, high-speed text processing, and more.
Maybe for understanding how the computer works? What happens when you launch a program, what happens between turning the computer on and the moment you see the login screen, what printf("Hello world") really is, and so on... Many interesting questions. Of course assembly is not the answer to all of them, but it's the way toward the answers.
Note that "reverse engineering" includes answering mundane questions like "WTF is this closed-source library doing when I get this bizarre inexplicable result?" and "Am I really encountering a compiler bug here?"
Sometimes you want to do things high level languages don't support. For instance C doesn't let you retrieve the 64-bit result from a 32x32-bit multiply. You have to cast the operands to 64-bit first to get the result even though the hardware provides the answer you want without any extra steps. If this is performance critical code it may require dropping down to assembly.
> For instance C doesn't let you retrieve the 64-bit result from a 32x32-bit multiply. You have to cast the operands to 64-bit first to get the result even though the hardware provides the answer you want without any extra steps.
Any optimizing compiler should generate a single 32-bit MUL for uint64_t z = (uint64_t) x * (uint64_t) y. It's easy for instruction selection to match against such patterns.
A better example would be generating a single instruction for the 128-bit product of 64-bit integers on a 64-bit processor. There are no standard types for 128-bit integers, but you can still avoid assembly code by using compiler-specific types or intrinsics. With GCC, Clang and ICC, you can write unsigned __int128 z = (unsigned __int128) x * (unsigned __int128) y. The odd one out is MSVC, where you have to use the _umul128 intrinsic.
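To make the 32x32→64 case concrete: despite the cast in the source, compilers pattern-match this and emit a single widening multiply (one MUL on x86-64, producing the full product in EDX:EAX or as a 64-bit register result).

```c
#include <stdint.h>

/* Full 64-bit product of two 32-bit values.  Casting one operand is
   enough; the other is converted by the usual arithmetic conversions.
   GCC and Clang recognize this and emit one widening multiply. */
static uint64_t mul_wide(uint32_t x, uint32_t y)
{
    return (uint64_t)x * y;
}
```

The 64x64→128 case is the same pattern one size up, which is exactly why it needs `unsigned __int128` (or `_umul128` on MSVC) rather than a standard type.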
"Average" programmers had much more practical reason to learn assembly in the days of 16 and 32 bit x86, of DOS, Win32, and software rendered games, than today when the average programmer writes Javascript or some other dynamic language and is several levels removed from the machine code being executed, so I don't think material written about x86-64 will ever be on the level of the stuff written back then. IMO it makes more sense to stick to the high quality 32 bit texts and let students figure out the 64 bit additions on their own if they want to. x86 is conceptually simpler anyway.
There is "Introduction to 64 Bit Intel Assembly Language Programming for Linux", 2nd Ed, by Ray Seyfarth. It's pretty good for beginners with some C experience, I think.
Not really answering you. But I feel like sharing the method our lecturer had.
We learned MIPS first, and then once people had a grasp of the basics we had lectures (along with the Intel manuals, the ABI reference, and some minor internal documents) explaining x64, partly from a "compared to MIPS" perspective. I believe this method served us pretty well, and most had no problem picking up x64.
Probably not what you're looking for but Randall Hyde has put up a second edition of The Art of Assembly which is available for free download or at no starch. http://webster.cs.ucr.edu
I don't know if he addresses x86_64 much, if at all though.
[author :)] I'm not sure I can recommend something specific for you, but the things I use for learning low-level programming are the following.
1. Nasm manual
2. Intel manuals
3. gdb
4. Source code written by other people, like bootloaders, simple 64-bit operating systems, etc.
Myself as well. However, my experience is that undergrads getting their first exposure to x86 need their hands held quite a bit. It doesn't help that there are two competing syntaxes for x86, so I have to be careful to make sure all of the resources use the same syntax (I use AT&T so that it's compatible with gcc, as I said before).
The gas manual is slightly frustrating because it simply says "consult Intel" for the opcodes and tells you to mostly reverse the operand order, except for most of the floating point instructions. See: https://sourceware.org/binutils/docs/as/i386_002dDependent.h... So I mostly don't point students there, because it isn't that helpful when you are just starting out.
The important thing you should learn in your computer architecture class is how assembly and other low-level details work on a pipelined register machine. The exact architecture doesn't really matter, as the knowledge is easily transferable.
I think it will. In any case, it is good to know at least two. I don't mean master, necessarily, but will give you broader perspective on how stuff works.