
If anyone is curious, on gcc 12.3.0 and clang 16.0.6 (x86_64), the answers are what most people (who have written lots of C) would expect:

1) 8

2) 0

3) 160

4) 1 (both clang and gcc output a warning)

5) 2 (only clang outputs a warning)

While I like the idea of this quiz, I think it would be more powerful if it provided examples of compilers / architectures where these are not the correct answers. (I also think thorough unit tests would catch most of these errors)



The author is making a deliberate point about undefined behaviour in the article. Hence them not executing worked examples.

In fact, by not doing so they are making a subtle implicit statement that it is uninteresting to consider actually attempting to execute these snippets.

The third paragraph of the "P.S" of the article (you have to press submit to see it) is the one that really gives the game away.


Most of these things are implementation defined rather than undefined. Only the 5th is undefined.


More than implementation defined, for some you need context that simply isn't given. On the ones with mixed-type structs, even if you know what system it's compiled for, you don't know whether someone has used #pragma pack(1) to byte-pack the data instead of relying on the default layout. Just seeing the struct, you still don't know.


Good point, although that is not part of standard C.


'#pragma pack' isn't part of the C standard, but #pragma is and "causes the implementation to behave in an implementation-defined manner."


I agree that in theory it would be cool to have C code that uses only defined behavior and works on all platforms for all eternity. However, I think most programs have a fairly clear understanding of what platforms (OS+arch) they are targeting and what compilers they are using to target those platforms.

If the compiler has defined behavior on all of these platforms (and you have unit tests for that behavior), I don't think it is a huge deal. (Ideally you wouldn't rely on it... but sometimes it's an accident or unavoidable.)

As an example, while struct padding (problem 1) might not technically be in the spec, it is a cornerstone of FFI and every new compiler (that supports C FFI) has a way to compile structs with the same padding.

To my original point, if the article had instead given examples of compilers + architectures that produce different answers, I might feel differently. However, just mentioning that these weird edge cases are undefined (in the spec) doesn't mean much to me.


My answers for 2, 3 and 5 were different:

2) I thought the type would be promoted to short. It turns out the operands of the arithmetic operation are promoted to int, so the result is an int.

3) The signedness of char is platform dependent. It is signed on x86 and amd64, but unsigned on some other architectures (ARM, for instance). After seeing my mistake, I would expect the answer to be -96 on amd64 from sign extension when it is converted to an integer, yet it is 160, which is what I would have expected from a platform where char is unsigned. If anyone knows why it is 160 here, please let me know.

5) This is a classic. I knew to answer "I do not know" because, despite C having operator precedence rules, it famously leaves this undefined. I have no idea why the standard does this when there is a clearly right answer. Java, for example, defines the evaluation order so this has only one right answer.


I decided to try #3 for myself. The results are interestingly inconsistent. If you cast a to int and print it, it comes out -96. But the shell reports it as 160. Godbolt clearly shows it returning -96 (movsx should sign-extend it), so I don't know what's happening. https://godbolt.org/z/9rxcnM3G3


Replying to myself. I did some digging and figured it out: the shell itself truncates the return value to an 8-bit unsigned number. If you have a simple program that consists of only "return -96;" the shell will still report a return value of 160.


Can you explain 3? A space is 20 IIRC, and 20 * 13 is 260. An unsigned char tops out at 255, but I guess this one is signed so... that's 127. And then I have no idea what happens, some kind of overflow, but I don't know the wrapping rules.


' ' in original C is the encoding for space.

This might be ASCII or EBCDIC or something else local to a specific hardware implementation.

https://en.wikipedia.org/wiki/EBCDIC

So, maybe 0x20, maybe 0x40, maybe something else.

At least you know that '0', '1', ..., '9' are contiguous.


I don't know if wrapping rules are defined by the standard or implementation defined. But the easiest thing for a compiler to implement is simple truncation. A space is 0x20 (32 decimal) in most C compilers, so multiplying it by 13 is 416. Truncating that to 8 bits, the size of char on most compilers, is 160 (0xa0). If char is signed, the upper bit being set will cause it to be a negative number -96. Promotion of the char to int won't change its value.

There are a huge number of assumptions in that simple chain of events, and if any of them are wrong you get a different answer.


I made the same mistake having worked with URL encoding for so long. " " is 20...in hex.


" " is a C string constant ... so a space encoding followed by a NUL encoding.


It's 0x20 like others pointed out.

The rule here isn't really wrapping: the multiplication happens in int, so there's no overflow; it's the conversion of the out-of-range result back to a signed char that is implementation-defined. (Signed integer overflow in the arithmetic itself would be UB, but there is none here.)


Assuming ASCII char encoding ... which isn't a given in C, just extremely commonplace.


The ASCII values being assigned to characters is a famous example of one C compiler passing knowledge on to a compiler it builds, without it ever being specified in the source code. Given that, I am surprised to hear it ever is anything different.


EBCDIC persisted later than many might expect - to 1990 in legacy hardened IBM System/360's used in air traffic and defence (branded as IBM 9020's IIRC).

Early C compiler projects (eg: The Hendrix Small-C of ~1982) would get patched by some to support the full C language and extended to cross compile to and from whatever machines were about at the time, System/360's, VAX, PDP's, early PC's, BBC micros, etc.

It wasn't always the case that char encoding was passed on by default; there was always the option to insert a translation table, whether compiling or dealing with data stored in a non-native form (similar to big-endian vs little-endian data).


' ' is 0x20 or 32.


or 0x40 .. or something else.

https://en.wikipedia.org/wiki/EBCDIC


The case with multiple increments in an expression might produce different results depending on optimization level, perhaps not in this case but in other cases. That is because the compiler is allowed to use any order, so the order it picks might depend on what is in the registers.


5) How does 2 make sense? Shouldn't it be 0 + 1? Or does the pre-increment take precedence over the addition, so the left i is 1, but not because of the post-increment?


To get 2, there are (at least) a couple of ways it can happen, we can do i=0,i++ and get LHS=0, now i=1,++i and get RHS=2. Or we can do i=0,++i and get RHS=1, then i=1,i++ and get LHS=1.

However we’re also allowed to do something like this: i=0, a=i, b=i, b=b+1, RHS=b (RHS=1), LHS=a (LHS=0), a=a+1, i=a, i=b.

Probably quite a lot of other things are allowed to happen. Usual disclaimer that a standard-compliant compiler is allowed to vaporise your cat etc as part of UB.

The thing to Google is “sequence points”.



