So you think you know C? (2016) (wordsandbuttons.online)



Since the questions neither specified a compiler/platform nor included Undefined or Unspecified Behavior as an option, I assumed this was an informal quiz and the author had just tested the programs on their specific compiler.

Turns out the author does mean according to the standard, but thinks that “I don’t know” is a synonym for both undefined and unspecified behavior. It seems weird to use imprecise terminology in a post that’s all about lecturing about standards compliance.


I'm really surprised at how many people are annoyed by this quiz. It doesn't specify a compiler/platform, therefore all the definite answers are clearly wrong.


I think most C programmers know that the size of ints is implementation defined, and I've programmed on platforms where it was 2 bytes instead of 4. But when someone promises a brain teaser, and then asks an unclear question, you read that and go "What the author actually means can't be that dumb, can it?" It's annoying when a charitable reading of an unclear statement leads to an aggressive "gotcha".


>I think most C programmers know that the size of ints is implementation defined, and I've programmed on platforms where it was 2 bytes instead of 4.

So it's easy to say: I don't know.


But I do know. Though not specified by the standard, ints are 4 bytes on any standard platform anyone has used in this millennium.


But C works on non-standard platforms.

On a TI DSP I used this millennium, sizeof(char) == sizeof(short) == sizeof(int) == sizeof(float) == sizeof(double) == sizeof(void *) == 1. Each and every type was 32-bit, and each memory address pointed to a unique 32 bits, i.e. (int *)0 and (int *)1 did not overlap on that system. Since sizeof measures addressable units, not 8-bit bytes, they're all size 1.

Even on more standard systems, int is 4 bytes on systems using an ILP32 or LP64 convention, but ILP64 is a thing too, making int 8 bytes.


Yes, C supports that. But in practice, you're going to re-engineer the code.


Nope. On 8-bit platforms int is 16 bits wide, and on some 16-bit platforms too. And this is not just history; embedded toolchains are prepared to handle code as if it were platform-independent. The C99 uint32_t type is defined as 'unsigned int' on most sane platforms (not sure about ILP64, maybe it's unsigned short there, but what's uint16_t then?). But in the arm-none-eabi toolchain it's defined as 'unsigned long', because it is assumed that the same code is being built for 8-bit and 32-bit platforms, and only 'long' guarantees the 32-bit range. So to avoid format string warnings, printf format strings should use the PRI*32 macros like PRIu32 instead of raw %u / %lu.
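For example, a minimal sketch of that (the variable is made up):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t ticks = 123456789u;
        /* PRIu32 expands to the right conversion specifier whether uint32_t is
           'unsigned int' or 'unsigned long' on the target toolchain. */
        printf("ticks = %" PRIu32 "\n", ticks);
        return 0;
    }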


That's right.

The crazy one is `long`, which is 32 bits on some platforms and 64 bits on other platforms.

I solved that problem by never using `long`, opting instead for `int` for 32 bits, and `long long` for 64. `long` should be deprecated for 32 and 64 bit platforms, it's not fixable.


In D, we use `int` for 32 bits, `long` for 64 bits, and `size_t` for a pointer index. All the craziness just melts away. You can port the code back and forth between 32 and 64 bit patterns, and it just works. All those `int32_t` types are out back in the bin along with the whiteout.


I like Rust’s choice for basic types: initial and the number of bits: u8, i32, f64, etc.

(I’m not comparing the languages in any way here, just praising a clean notation.)


I suspect the impetus for suffixing the number of bits is a bit of a backlash from C, where you never know how many bits are in a type. That has caused C programmers a lot of trouble and extra work.


int32_t and int64_t are plenty unambiguous, it's been 25 years.


Exactly. The answer isn't "I don't know". The answer is "Undefined". This isn't pedantry. It's being unambiguous.

This was a semantic quiz, not a technical one.


People do tend to get annoyed when you don't communicate something and then use it as a gotcha. The author was aware of their audience's assumptions, but instead of communicating with those assumptions in mind, they communicated with different assumptions and then criticized their audience for not understanding the questions ("You didn't know C after all!").



> People do tend to get annoyed when you don't communicate something, and then use it as a gotcha.

Obligatory xkcd: https://xkcd.com/169/


> but thinks that “I don’t know” is a synonym for both undefined and unspecified behavior.

In either case, is "I don't know" wrong? Given the information in this quiz, I don't think it is.


The quiz is exactly right: yes, the answers are all "I don't know" because the behavior is either implementation defined or undefined for each of them, and a couple of these reflect mistakes that I have had to point out in code reviews in recent years: expressions with multiple side effects without a sequence point, and shifting an N-bit integral type by N bits (yes, that is undefined because some processor instruction sets mess up that case). C and C++ programmers need to be taught that they must not write that.
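Minimal sketches of the two mistakes (illustrative only):

    #include <stdint.h>

    void examples(void)
    {
        int i = 0;
        int sum = i++ + ++i;    /* unsequenced modifications of i: undefined behavior */

        uint32_t x = 1;
        uint32_t y = x << 32;   /* shift count equal to the width of uint32_t: undefined behavior */

        (void)sum; (void)y;
    }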


> C and C++ programmers need to be taught that they must not write that.

Things stabilized quite a while ago. 99% of people who write C or C++ in 2024 will never use any computers where sizeof(int) != 4, or big endian processors, or systems where floats don't conform to the IEEE-754 standard.

Why shouldn’t they write code which requires these particular details from their compiler and the target processor?


I'd say there's no issue with writing such code, so long as it's protected by a compile-time assertion.

That said, if you use a variable-width integer type (`char`, `short`, `int`, `long`, `long long`) instead of a fixed-width type (`int32_t` & such) for anything other than passing parameters to existing libraries (including the standard library) I'd say you're Doing It Wrong. If you actually intend to have a variable-width, use one of the `_least` or `_fast` types to make it clear that you didn't just screw up.
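Roughly (the variable names are made up):

    #include <stdint.h>

    uint32_t       register_value;   /* exactly 32 bits, or the typedef doesn't exist */
    uint_least16_t sample_count;     /* at least 16 bits, smallest available */
    uint_fast8_t   loop_index;       /* at least 8 bits, whatever is fastest */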


One thing I think C++ (or at least, most C++ users) gets right is suggesting that you should use 'auto' if you don't really care that much about bit width, and the specified sizes otherwise.

Thankfully C23 takes this approach, although it'll probably take forever until it's as widely adopted as C99 is now, and even that's still not widespread enough.


Any experienced C programmer knows this is undefined behavior.

But taking this test, would you really check "I don't know" if you do know it is UB, when there is no option "platform specific" or "UB"?


Yes, because the platform is not specified, you really do not know. Or rather, you don't know until you know what the platform is.


Such an explicit option might have given the game away. I haven’t programmed in C much but my thought would be, “wait is the gimmick here that everything is UB or implementation defined?”


zero mention of MISRA to boot


Seems like some people are getting their ego hurt because they don't like being told they got the wrong answer. Very human.

This quiz isn't an IQ test, people, and the questions are intentionally trick questions. It's essentially a form of cynicism intended to demonstrate major design oversights of the C programming language. So calm down and have a laugh.

Also, "I don't know" is the right answer. If you can't accept that, you may need to meditate more.


I don’t see anyone walking away feeling ego bruised from this, I mostly see people annoyed that the _more_ correct option of “I can’t know because you didn’t provide enough information” isn’t represented and the author is attempting a “clever” gotcha moment instead.


If it were "Implementation defined" and "Undefined" or "I know that I can't answer" instead of "I don't know", I'd have done better. Choosing to answer that you don't know is hard, especially if you know why you can't give definite answer.


Yeah if the first one weren't so obviously implementation defined, I'd have second-guessed myself on a lot of the others.


If anyone is curious, on gcc 12.3.0 and clang 16.0.6 (x86_64), the answers are what most people (who have written lots of C) would expect:

1) 8

2) 0

3) 160

4) 1 (both clang and gcc output a warning)

5) 2 (only clang outputs a warning)

While I like the idea of this quiz, I think it would be more powerful if it provided examples of compilers / architectures where these are not the correct answers. (I also think thorough unit tests would catch most of these errors)


The author is making a deliberate point about undefined behaviour in the article. Hence them not executing worked examples.

In fact, by not doing so they are making a subtle implicit statement that it is uninteresting to consider actually attempting to execute these snippets.

The third paragraph of the "P.S" of the article (you have to press submit to see it) is the one that really gives the game away.


Most of these things are implementation defined rather than undefined. Only the 5th is undefined.


More than implementation defined, for some you need context that simply isn't given. On the ones with mixed-type structs, even if you know what system it's compiled for you don't know if someone has used pragma pack 1 to byte pack the data instead of standard packing. Just seeing the struct, you still don't know.
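A sketch of what that difference looks like (#pragma pack is a common extension, not standard C; the sizes assume a typical 4-byte int):

    #include <stdio.h>

    struct padded { char c; int i; };        /* typically sizeof == 8 with default padding */

    #pragma pack(push, 1)
    struct packed { char c; int i; };        /* typically sizeof == 5 when byte-packed */
    #pragma pack(pop)

    int main(void)
    {
        printf("%zu %zu\n", sizeof(struct padded), sizeof(struct packed));
        return 0;
    }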


Good point, although that is not part of standard C.


'#pragma pack' isn't part of the C standard, but #pragma is and "causes the implementation to behave in an implementation-defined manner."


I agree that in theory it would be cool to have C code that uses only defined behavior and works on all platforms for all eternity. However, I think for most programs there is a fairly clear understanding of what platforms (OS+arch) they are targeting and what compilers they are using to target those platforms.

If the compiler has defined behavior (and you have unit tests for that behavior) on all of these platforms, I don't think it is a huge deal. (Ideally you wouldn't... but sometimes it's an accident or unavoidable)

As an example, while struct padding (problem 1) might not technically be in the spec, it is a cornerstone of FFI and every new compiler (that supports C FFI) has a way to compile structs with the same padding.

To my original point, if the article had instead given examples of compilers + architectures that produced different answers, I might feel differently. However, just mentioning that these weird edge cases are undefined (in the spec) doesn't mean much to me.


My answers for 2, 3 and 5 were different:

2) I thought the type would be promoted to short. It turns out that the result of the arithmetic operation is promoted to int.

3) The signedness of char is platform dependent. It is signed on x86 and amd64, but unsigned everywhere else. After seeing my mistake, I would expect this to cause the answer to be -96 on amd64 from sign extension when it is converted to an int, yet it is 160, which is what I would have expected from a platform where char is signed. If anyone knows why it is 160 here, please let me know.

5) This is a classic. I knew to answer "I do not know" because, despite C having operator precedence, this expression is famously undefined. I have no idea why the standard does this when there is a clearly right answer. Java, for example, gives this exactly one right answer.


I decided to try #3 for myself. The results are interestingly inconsistent. If you cast a to int and print it, it comes out -96. But the shell reports it as 160. Godbolt compiler clearly shows it returning -96 (movsx should sign extend it) so I don't know what's happening. https://godbolt.org/z/9rxcnM3G3


Replying to myself. I did some digging and figured it out - the shell itself truncates the return value to an 8 bit unsigned number. If you have a simple program that consists of only "return -96;" the shell will still report a return value of 160.


Can you explain 3? A space is 20 IIRC, and 20 * 13 is 260. An unsigned char tops out at 255, but I guess this one is signed, so... that's 127. And then I have no idea what happens: some kind of overflow, but I don't know the wrapping rules.


' ' in original C is the encoding for space.

This might be ASCII or EBCDIC or something else local to a specific hardware implementation.

https://en.wikipedia.org/wiki/EBCDIC

So, maybe 0x20, maybe 0x40, maybe something else.

At least you know that '0', '1', ..., '9' are contiguous.


I don't know if wrapping rules are defined by the standard or implementation defined. But the easiest thing for a compiler to implement is simple truncation. A space is 0x20 (32 decimal) in most C compilers, so multiplying it by 13 is 416. Truncating that to 8 bits, the size of char on most compilers, is 160 (0xa0). If char is signed, the upper bit being set will cause it to be a negative number -96. Promotion of the char to int won't change its value.

There are a huge number of assumptions in that simple chain of events, and if any of them are wrong you get a different answer.
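That chain, as a sketch (assuming ASCII, an 8-bit char, and quiet truncation on the conversion):

    #include <stdio.h>

    int main(void)
    {
        char a = ' ' * 13;   /* arithmetic in int: 32 * 13 = 416 = 0x1A0; the char keeps 0xA0 */
        printf("%d\n", a);   /* -96 where char is signed, 160 where it is unsigned */
        return 0;
    }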


I made the same mistake having worked with URL encoding for so long. " " is 20...in hex.


" " is a C string constant ... so a space encoding followed by a NUL encoding.


It's 0x20 like others pointed out.

The wrapping rule here is that signed integer overflow is UB.


Assuming ASCII char encoding ... which isn't a given in C, just extremely commonplace.


The ASCII numbers being assigned to characters is a famous example of one C compiler passing on knowledge to a compiler it builds without it ever being specified in the source code. Given that, I am surprised to hear it is ever anything different.


EBCDIC persisted later than many might expect - to 1990 in legacy hardened IBM System/360's used in air traffic and defence (branded as IBM 9020's IIRC).

Early C compiler projects (eg: The Hendrix Small-C of ~1982) would get patched by some to support the full C language and extended to cross compile to and from whatever machines were about at the time, System/360's, VAX, PDP's, early PC's, BBC micros, etc.

It wasn't always the case that char encoding was passed on by default; there was always the option to insert a translation table, whether compiling or dealing with data stored in a non-native form (similar to big-endian vs little-endian data).


' ' is 0x20 or 32.


or 0x40 .. or something else.

https://en.wikipedia.org/wiki/EBCDIC


The case with multiple increments in an expression might produce different results depending on optimization level, perhaps not in this case but in other cases. That is because the compiler is allowed to use any order, so the order it picks might depend on what is in the registers.


5) How does 2 make sense? Shouldn't it be 0 + 1? Or does the pre-increment take precedence over the addition, so the left i is 1, but not because of the post-increment?


To get 2, there are (at least) a couple of ways it can happen, we can do i=0,i++ and get LHS=0, now i=1,++i and get RHS=2. Or we can do i=0,++i and get RHS=1, then i=1,i++ and get LHS=1.

However we’re also allowed to do something like this: i=0, a=i, b=i, b=b+1, RHS=b (RHS=1), LHS=a (LHS=0), a=a+1, i=a, i=b.

Probably quite a lot of other things are allowed to happen. Usual disclaimer that a standard-compliant compiler is allowed to vaporise your cat etc as part of UB.

The thing to Google is “sequence points”.


Related:

So You Think You Know C? (2020) [pdf] - https://news.ycombinator.com/item?id=37541685 - Sept 2023 (86 comments)

Free e-book (~2 MB): So You Think You Know C? [pdf] - https://news.ycombinator.com/item?id=22958870 - April 2020 (8 comments)

So you think you know C? (2016) - https://news.ycombinator.com/item?id=20366940 - July 2019 (322 comments)

So you think you know C? - https://news.ycombinator.com/item?id=12902304 - Nov 2016 (198 comments)

So you think you know C? - https://news.ycombinator.com/item?id=12900980 - Nov 2016 (1 comment)

So you think you know C? - https://news.ycombinator.com/item?id=12900279 - Nov 2016 (9 comments)

Same title different articles:

So you think you know C? - https://news.ycombinator.com/item?id=4657317 - Oct 2012 (13 comments)

So you think you know C: the Ksplice Pointer Challenge - https://news.ycombinator.com/item?id=3125891 - Oct 2011 (98 comments)


Despite it being the first language I ever learned, I don't touch C a lot anymore since I've removed myself almost entirely from the systems space, and undefined behavior isn't even the main reason.

Primarily, I have determined that I really don't like dealing with manual memory management. For all the tasks I'm interested in, C's performance gains over a GC'd language are marginal at best. Number crunching? Julia is competitive with C. Web servers? The JVM will handle it just fine. Microcontrollers? For what I do Lua+NodeMCU or MicroPython does everything I need it to. Reasonably fast command line application? Go's got you covered.

When I do use C in 2024, I pretty much always cheat and use the Boehm GC, which is fast enough for whatever I need it for. I'm not smart enough to know if I handled pointers correctly, and I don't know that I want to spend the time to get smart enough.

Obviously systems C and manual memory management has a place in the driver and kernel world, and if you're genuinely good with it then my hat goes off to you, but I don't feel like it buys me enough today to use it much.


If you didn't like these because they're "trick" questions you likely also would not enjoy CppQuiz (https://cppquiz.org/)

However you might well enjoy https://dtolnay.github.io/rust-quiz/

As in the C++ quiz, "Undefined Behaviour" is a valid answer; however, the quiz questions are about safe Rust, so that answer is always wrong.

I still get more than half of them wrong unless given far too long to think about it.


Also: https://neal.fun/password-game/

(Point is, some "quizzes" are made cynically to demonstrate a flaw in some design. Don't get your ego all bruised up because you got some questions "wrong" on the internet.)


While that's true, neither the Rust nor C++ Quiz are for this purpose. They're intended as an opportunity to learn, what's actually going on and how does that differ from your intuition.

The password game is a game first and a commentary on how stupid "password rules" are second. The game makes sense in our cultural context (where stupid password rules are a thing) but it would be fun (but less popular) anyway.


> Rust Quiz

>

> What is the output of this Rust program?

>

>

> // JavaScript is required (sorry)

Tears is the output of that Rust program.

EDIT: I will never understand HN comment newlines.


I learned C on a VAX, then used it on 68K, ARM6, very briefly Cray (that's a weird one), x86 in segmented mode, x86 in 32-bit flat mode, and now 64-bit in various flavors. As soon as I saw this code was using "short" and "int" I immediately knew all bets were off!

The comments remind me of the VAX programmers in 1990 who wrote code assuming pointers and ints were the same size, and if yours weren't, they told you to get a "real computer".


My recent WTF with C and C++ was finding out that a C++ compiler can just throw away an infinite loop (and clang does that), while a C compiler must not throw it away and must compile it as expected. It's all according to the standard. For example, it's a typical construct for embedded software.

That solidified my opinion that C and C++ are very different languages despite naive view being that C is just a subset of C++. They're fundamentally different.


I got curious about this. It seems that C11 at least is much closer to the C++ behaviour: under certain conditions, compilers can assume that loops terminate even if they can't prove it. See https://krinkinmu.github.io/2021/12/12/pain-of-infinite-loop... for example and the relevant part of the standard, N1509 (Clang's compatibility chart referencing that: https://clang.llvm.org/c_status.html).

Importantly (see for example https://blog.regehr.org/archives/140) it's not that the standard previously explicitly forbade this sort of optimisation, it's just that it wasn't very clear and compiler writers interpreted it differently.

So in summary, yes, they're different languages, but "modern" C is less different for this particular thing.


How they treat infinite loops that obviously never terminate (e.g. while (1)) is probably of greater interest. The first link gives this code:

    namespace {
    void Panic() { while (1); }
    } // namespace

    extern "C" void kernel() { Panic(); }

The clang++ really does optimize this away:

https://godbolt.org/z/1Wf135918

The same thing happens when I let clang++ compile a C version as C++:

https://godbolt.org/z/MxcGfWhej

However, if I compile it as C with clang, I get an infinite loop:

https://godbolt.org/z/x63fo33Wa


Here is another. In C, union types support type punning. In C++, they broke that guarantee, so if you compile C code that uses a union type as C++ code, it could be broken. It would be nice to know why they broke this. My guess is that it violated the strong type philosophy of C++ and thus had to be broken on ideological grounds.
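The kind of punning in question, sketched (the IEEE-754 assumption only matters for the printed value):

    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        union { float f; uint32_t u; } pun;
        pun.f = 1.0f;
        /* C lets you read a member other than the one last written; the bytes are
           reinterpreted. C++ gives no such guarantee for the non-active member. */
        printf("0x%08" PRIX32 "\n", pun.u);   /* 0x3F800000 on IEEE-754 platforms */
        return 0;
    }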

C++ also broke implicit void pointer conversions, but at least that one had the reason that it was incompatible with function overloading. Not that anyone involved with C++ would tell you this. Instead, they provide bad C code that breaks the strict aliasing rule and claim that breaking implicit void pointer conversions somehow follows from it when in reality that is a non-sequitur:

https://www.stroustrup.com/bs_faq2.html#void-ptr


When would you want a loop that does not terminate and does not perform I/O, read/write a volatile, synchronize with another thread, or perform an atomic operation?


One time the heat at my place went out and I wanted to repurpose my machine as a space heater. In such a scenario, such a loop is desirable. Termination can be done by the operating system.


I have a System76 Serval and I highly recommend it, especially for your use case.


relevant xkcd https://xkcd.com/1172/


It is a very important feature. :P


For example, when my algorithm is driven by interrupts and the "user code" needs to do nothing. One approach is to use something like the WFI instruction in the loop, but that instruction has its own issues; for example, I experienced problems with an attached GDB on some Chinese STM32 clones. Using a busy loop worked just fine.

Using "unused" volatile also works, but it incurs SRAM traffic, which might slow down DMA and cause other effects. So ideally it should be simple empty loop compiled to `b .` instruction (jump to itself).

Another infinite-loop use is in an abort() or exit() handler, as a place to stop after a crash. For this use case, using an "unused" volatile does not cause issues, because the device has already crashed, but it still feels wrong.


This happens regularly in the embedded space. Exception and fault handlers, for example.


a busy loop waiting for an interrupt to fire?


This is reasonable, but for the sake of your power budget I hope you use a wait-for-interrupt instruction instead rather than just spinning :)


Throw away? Hopefully that means replace it with a crash? If it's just carrying on, that sounds bad.


In the linked example clang just compiled the function to nothing. In practice it means that the CPU will run over random memory after calling this function. It probably will cause a crash, but maybe it'll format a disk, who knows.

https://godbolt.org/z/zTz1488P3


> If it's just carrying on, that sounds bad.

Why? If you are an experienced C or C++ programmer you know about these quirks. Replacing an infinite loop without side effects with a crash sounds about as "bad" as optimizing it away.


What about baremetal embedded system code where there is no concept of a 'crash'? Sure you can emit a trap instruction/bus fault/etc, but what to emit for the trap handler code?


Yes, throw away entirely.


Programmers annoyed with the 'quiz' might really enjoy this book on that site, which is an excellent read.

https://wordsandbuttons.online/SYTYKC.pdf


My today’s WTF on the topic. The following C++ returns an incorrect value when the argument is 8 or greater:

    inline uint64_t makeRemainderMask( ptrdiff_t missingLanes )
    {
        // This is not a branch, compiles to conditional move
        missingLanes = std::max( missingLanes, (ptrdiff_t)0 );
        // Make a mask of 8 bytes
        // No need to clip for missingLanes <= 8 because the shift is already good, results in zero
        uint64_t mask = ~(uint64_t)0;
        mask >>= missingLanes * 8;
        return mask;
    }
TIL the language standard defines the right shift operator for unsigned types in a weird way, making it so that for a uint64_t argument, a >> b is equal to a >> (b % 64). I expected zero on the output for b >= 64, not that.


C++ doesn't even define it as that; it's actually just undefined behavior to shift by any amount greater than or equal to the width!


CPUs handle excessively large shifts in implementation defined ways. Some return 0. Some trap. x86 masks the right hand side with & 63 (for a 64-bit shift) so shifting by 65 is equivalent to shifting by 1.

In C and C++ it's undefined behavior (really, it should be implementation defined).
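If you want a shift that stays defined for any count, you have to write the check yourself; a minimal sketch (the name and the zero-on-overshift behavior are just one choice):

    #include <stdint.h>

    /* Right shift that yields 0 when every bit would be shifted out,
       instead of hitting the undefined count >= width case. */
    static inline uint64_t shr_or_zero(uint64_t value, unsigned count)
    {
        return count >= 64 ? 0 : value >> count;
    }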


RISC-V is even more fun where they shift based on shamt[4:0] (or [5:0] in XLEN=64 mode). So, if you do something like:

    li    t0, 128
    li    t1, 1
    sll   t2, t1, t0
You'll get t2 := 1 since t0[5:0] == 6'b0 (i.e. no shift). It's a very sensible solution IMO if you don't have trapping arithmetic since you don't have to do anything special to handle illegal shifts, it just works.


Interesting, thanks. BTW, I fixed it like that:

    uint64_t mask = -( missingLanes < 8 );
    mask >>= missingLanes * 8;
I only care about AMD64 because the surrounding code uses AVX2 intrinsics, i.e. my new version is hopefully good enough. I tried the conditional operator so as not to rely on these details; however, VC++ failed to generate fast code from it. It emitted branches, and that code is kinda performance critical.


Another different shift implementation is in ARM NEON, where, for a non-constant shift amount, there's only a shift-left, and a negative shift amount gets you a right shift; thus, were C forced to mandate only one behavior, either scalar or vector shifts would get worse.


No, it doesn't. Shifting by a wider amount than the width of the type, or an equal amount, is undefined in C and C++. That was done because different processors handled the case differently, meaning that any specification would give one processor a substantial time penalty compared to the competition. So the committee gave up on obtaining agreement and called it undefined.


But why did they make it Undefined Behavior instead of Implementation Defined?

I love the C language, but there is now so much UB in the language that it is painful to use.


If they did, we would have two subtly different dialects of C for cyclic shift and zeroing shift, and possibly even worse, combinations of the two.

That said, I am curious what the official answer is myself.


I really like Rust's answer in this space. Check out all the useful methods on the u64 type that can make this kind of code explicit: https://doc.rust-lang.org/std/primitive.u64.html


This happens in C too. Static analyzers tend to be good at catching this.


That's not on the topic though. Do you think because two languages share a common letter in their name that they are equivocal?


They said C++, but aside from the line containing "std::max," they wrote C code, and what they said applies to C as well.


If the C++ compiler takes it, it is C++ code. The fact that it is almost valid C code does not change that.

The reactions a number of people have to C++ code that is also valid C code, or close to being it, are ridiculous. Sometimes they even deny C++ code is C++ code. :/


I was just saying that it's not off-topic. I don't really understand what you're trying to say otherwise.


then why not post a standard C example?


If I would rework that snippet to be in C instead of C++, that code would no longer be my today’s WTF. Also, IMO `std::max` is more readable than the equivalent conditional operator.


It's irrelevant and a different language; might as well post OCaml or Julia.


Very relevant. Apart from a few minor exceptions, C++ is a superset of C, i.e. almost any C code can be compiled as C++ and will result in a correct program.


Try compiling C code using union types for their intended type punning purpose as C++. It will compile, but bad things could happen. :/


do you think that you know what "equivocal" means ?


Knowledge is knowing that sizeof(int) is implementation-defined.

Wisdom is understanding how likely your code is to ever run on any platform where sizeof(int) is not 4, and if the answer is "not really," then stop worrying about it.

To succeed as a C programmer, you need both knowledge and wisdom.


And expertise is baking that assumption into the code explicitly with a static_assert.
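Something like this (the C11 spelling via <assert.h>; in C23 static_assert is a keyword):

    #include <assert.h>
    #include <limits.h>

    static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
    static_assert(sizeof(int) == 4, "this code assumes 32-bit int");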


I knew all these. I came from a time when portable code had to work on an incredibly diverse range of platforms with many different compilers, some proprietary, and all of these corresponded to real issues, not language lawyer nitpicking.


Having implemented two C compilers, knowing every last detail is sometimes a bit annoying, because one thinks "why the heck is that there!" For example, the syntax for a cast is indistinguishable from that for a function call in many cases. The only way to parse it correctly is to keep a symbol table of the typedefs. Believe me, I tried, and finally threw in the towel and did a special symbol table just for typedefs.

This is just pointless complexity. It's why D's cast syntax looks like this:

    cast(T)x;
i.e. cast is a keyword.


The best bit is that the type name resolution depends on the current scope, but the scope itself is very unintuitive on edge cases. Jourdan and Pottier in their paper A Simple, Possibly Correct LR Parser for C11 [1] describe several edge cases, including the following case where scope is not necessarily consecutive:

    typedef long T, U;
    //                   T is an argument      T is a typedef
    //             vvvvvvvvvvvvvvvvvvvvvvvvvvv ~~~
    enum {V} (*f(T T, enum {U} y, int x[T+U]))(T t) {
    //     T is an argument    (until the end of function)
    //  vvvvvvvvvvvvvvvvvvvvvv
        long l = T+U+V+x[0]+y;
        return 0;
    }
[1] https://hal.science/hal-01633123/file/jourdan2017simple.pdf


Would you give C examples of this? I am having trouble imagining how a cast would look like a function call to the compiler.


Although I'm not who you asked, I can think of a slightly contrived example. If a is a variable name, then this is using b as the first argument to call the function a (which is surrounded by unnecessary parentheses). If it's a type name, it's casting b to the type a.

  result = (a)(b);


It's not contrived enough - it happens in real C code.


My example is slightly contrived because the parentheses around b are unnecessary to cast it, but it becomes more realistic if it's a more complex expression like this:

  (foo)(a + b)
Or this:

  // Comma expression or function call arguments?
  (bar)(first = side_effect++, second)


Yeah,

    (a)(b,c)
is that a comma-expression or an argument list?


Seems pointlessly complex too. Why didn't you just make it cast(T, x)?


Python and Haxe have a function-like cast like that, but the argument order differs between the languages. I prefer D's approach because it doesn't have this opportunity for confusion. The prefix operator only has one place for the type to go.

https://docs.python.org/3/library/typing.html#typing.cast

    typing.cast(typ, val)
https://haxe.org/manual/expression-cast.html

    cast expr; // unsafe cast
    cast (expr, Type); // safe cast


> it doesn't have this opportunity for confusion

1) it'll be a compiler error anyways

2) of course there's room for confusion. It could be

cast(T) var

Or it could be

cast(var) T

And you must remember which is correct.


Neither is more complex. I just like the former better.


Well, certainly my eyes need to add an extra parsing case so maybe don't assume it's equally complex. I don't think I'm wrong on that count.


Here's a mini-quiz: on a regular gcc/clang for x86/x86-64/arm32/arm64, which of the following functions can ever produce undefined behavior? Answers can be obtained by compiling with "-O3 -fsanitize=undefined" and seeing which functions include __ubsan_handle*.

     int16_t a( int16_t a,  int16_t b) { return a*b; }
    uint16_t b(uint16_t a, uint16_t b) { return a*b; }
     int32_t c( int32_t a,  int32_t b) { return a*b; }
    uint32_t d(uint32_t a, uint32_t b) { return a*b; }


For 32-bit types, the undefined behavior is where I expected it to be, but it is the opposite for the 16-bit types. Why are the 16-bit types different?

To make things weirder, I tried 8-bit and 64-bit versions. The 64-bit version behaved like the 32-bit versions, but the 8-bit versions had no undefined behavior checks. Why is 8-bit so different?


Smaller-than-'int' types get implicitly promoted to 'int' before arithmetic. Results are indeed very funky.


That makes sense then. Multiplying two uint16_t values can overflow an int, while two int16_t or smaller cannot.
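A sketch of the trap (assuming the usual 32-bit int; the function name is made up):

    #include <stdint.h>

    uint32_t mul(uint16_t a, uint16_t b)
    {
        /* a and b are promoted to (signed) int before the multiply;
           65535 * 65535 overflows a 32-bit int, which is undefined behavior. */
        return a * b;
        /* Casting one operand first keeps the arithmetic unsigned:
           return (uint32_t)a * b; */
    }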


A more useful "so you think you know C?" is this test:

Write a variadic macro `CLEANSE_MACRO_ARGS` that can be used within one of C's unhygienic macros, to turn it into a hygienic macro without reducing the prettiness of the macro body.

In standard C, this requires C23 and only works for macros that are not used as expressions. Or you can use GNU extensions and make it work for expressions and work even on old compilers.


My favorite UB in C is this:

  int x, y;
  x ^= y ^= x ^= y;
I was using this for years until I realized that it is UD.


Multiple unsequenced assignments to the same variable in one expression are undefined, for the same reason i++ + ++i is undefined.

That said, even if it were not, you did not define x and y. Reading them is also undefined behavior, which is what I initially thought you meant until I read what followed them.
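For reference, the boring sequenced swap avoids the problem entirely (a sketch, assuming the values are initialized):

    /* A sequenced alternative to the XOR trick. */
    void swap_ints(int *x, int *y)
    {
        int tmp = *x;
        *x = *y;
        *y = tmp;
    }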


Screwed up on the 4th question by mistakenly assuming the minimum standard-allowed magnitude of INT_{MIN,MAX} implied int was at least 32 bits. While that's true for LONG_{MIN,MAX} and long, int can be a mere 16 bits.

Good to have had that mistaken bit of errata corrected.


A very long time ago, the Microsoft C/C++ compiler used 16 bit ints. I had a boss that insisted we use long instead of int because he had been burned by this. Hadn't been a problem for at least 20 years, but that didn't matter to him.


Well, my first thought was of course, you can't know the answer to the first question without a lot more information.... I guess it's good that was the point.


i'm an old C dog and i loved this post.


well, just from reading the title, i assumed it would be trying to make a point, so I took the test without even reading the questions and got 5/5.

PS: 'sizeof with parentheses' looks so weird to me.


Sizeof without parentheses looks weird to me. Sometimes you need them, sometimes you don't, so I just have a habit of using them every time. Redundant parentheses don't hurt anything.


He could add a 6th question, which is what does this return:

int main(void) { return -1 == (~1 + 1); }

I am fairly confident that the answer will be the same on every system on which that is run, but technically, the C standard does not guarantee that it is the same unless you use C23.


> But even more, the size of the char type itself is not specified in bits either.

I still hope that you will have a look at what the standard type sizes are before trying to develop or compile something for a weird platform or compiler...


I never worked so hard to get a 0%


I know C well enough to know that writing C to avoid every edge case left undefined by the standard is a fool's errand. There are a bunch of problems with this approach:

a. Your target ecosystem(s) probably follows a few norms in cases not defined by the standard. If it works on the target ecosystem(s), nobody cares if it's non-standard. Testing on the target ecosystem, preferably automated testing, is the standard that matters.

b. Some of these norms are extraordinarily powerful. For example, check out NaN boxing[1]. Standard? Hell naw. Useful? So useful that it's practically mandatory for dynamic PL interpreters to compete on performance.

c. So let's say you're targeting an ecosystem that does some atypical stuff, i.e. doesn't follow the norms mentioned. Well, joke's on you, off-the-beaten-path ecosystems typically also have bleeding-edge tooling that doesn't implement the standards. So again, testing on the target ecosystem, preferably automated testing, is the standard that matters.

d. So let's say you're targeting a lot of ecosystems, so many ecosystems that you can't reasonably test on all of them, like if you're writing NetBSD or the JVM. Okay, well, the edge cases of the C standard still don't matter because you just shouldn't be using any of the off-the-beaten-path parts of the C standard anyway, since those parts are the parts which aren't implemented in lots of obscure architectures.

e. And sure, if you really just enjoy being annoyingly pedantic you can go and learn the whole C standard, write your own blog post with an integrated quiz, and then condescendingly tell people they don't know C, with the implication that you do know C. But the reality is that knowing those weird edge cases of C isn't useful, because if you actually use those edge cases of C then nobody else on your team can read your code.

f. And lastly, there's plenty of defined behavior which you shouldn't use, too. Just because you know what it does, doesn't mean the next person reading your code knows.

My answers to the questions in the test:

1. 8 on most ecosystems. If it matters, test it, but if it matters you probably need to use one of the non-standard extensions mentioned to do anything about it, so the standard didn't matter so much did it?

2. 0 on most ecosystems. But frankly, don't use `short int`, ever. Use int with the limits defined in `limits.h` if you just want the most efficient integer width, and if you need a specific integer width use the types in `stdint.h`. Yes, I'm aware that `int` isn't defined as the most efficient in the standard, but `int` is one of the first things people implement in a C compiler, so the likelihood of coming across an int that isn't the fastest integer size is absurdly small.

3. Play stupid games win stupid prizes. Who cares if this is defined? Even if it were defined it would be bad code. Don't multiply chars, FFS.

4. Play stupid games win stupid prizes. Obviously bit twiddling when you don't know the structure of the bits is bad. The fact that endianness is relevant in bit shifts and yet is not brought up by the author does in fact show the dangers of smugly pretending you know C better than other people.

5. Play stupid games win stupid prizes. Who cares if this is defined? Even if it were defined it would be bad code because depending on the pre/postfixedness of ++ is hard for most people to reason about.

[1] https://craftinginterpreters.com/optimization.html#nan-boxin...


I can proudly say I got zero and I pretend to be a C programmer!


>But the reasonable doesn’t mean right for C

that's both funny and sad


Let's C


It really is amazing that over 50 years later, we are only finally getting languages that are working on replacing C. Most went in the direction of being Big Idea languages that tried to supplant domains like C++ and Java. The language that is probably the closest is Zig, but I hear Odin is pretty good too (the website specifically mentions it as a C alternative)

The things I like about C aren't really about the language itself. There just aren't many programming languages that do these things well

1. Cross platform across Linux, MacOS (Intel and Arm), Windows (Without the use of MinGW And Cygwin), *BSD

2. Ability to create an executable

3. Small programs that start quick

4. Simple syntax that does not obfuscate when allocations are occurring

For GCed languages, Go is probably the one that best fits these criteria for most people.

Sometimes I wish someone just took C and improved it. Instead of looking at C++ and thinking "what features should C have that C++ has" It should look back at C and go "What features does C not have, that it should"? Some that come to mind are

1. A better syntax for creating pointers and dereferencing them

2. Fixed const so that it works with array initialization

3. Modules instead of headers

4. Cross platform string types that encode length

5. Hash tables

6. Remove preprocessor macros and replace common use cases with things built into the language

7. Build system that didn't rely on Make, Cmake, Ninja, Autotools, configure etc

8. Easy interop with C

Maybe this is a lot to ask, but most languages I see are solving waay more problems then this.

2, 4, and 5 can be done without leaving C, which is why most people homebrew their own solutions. I think Pascal actually comes close, but it's held back by the fact that the editor support is pretty poor. pascal-mode in Emacs is not good, and Lazarus, with its multiple floating windows, is infuriating: you have to alt+tab multiple times to switch to and from it. Plus I could never actually get it to build my programs properly without it throwing some DWARF errors.

Recently I've been playing around with Gambit-C and it seems to suit most of my needs. It compiles to C, and then the C code is compiled with Clang, MSVC, or GCC depending on what your platform is. I get the benefit of a language that has more features than C with it being R7RS (but still being small because it's Scheme) and implementing the most common SRFIs (including hashes), but with all the portability of C. It also has dead simple interop with C using the C-Lambda and C-Define special forms (you can write C directly inside the Gambit code)^1 which means you can leverage C code with hardly any effort. There are tradeoffs with this approach, but IMO almost all of them are ecosystem related which can be fixed, as opposed to language related, which might be impossible to fix without breaking compatibility with existing code

1. https://www.deusinmachina.net/p/gambit-c-scheme-and-c-a-matc...


Actually Modula-2 came in 1978, Ada in 1983, Mac OS shipped with Object Pascal in 1984.

But UNIX being free beer gave another push to C, that those languages did not have.


That felt like a waste of time for a "gotcha" post. The author must be very smart™. C has undefined behavior, got it, I probably have a pretty good intuition for how common compilers implement it though.


It's even worse: for example, the first one isn't undefined, it's merely unspecified - i.e., "depends on the system in a well-known and predictable way," not "you're doing something very wrong and the result is chaos."


> it's merely unspecified - i.e., "depends on the system in a well-known and predictable way,"

I think that should be implementation-defined behavior, not unspecified behavior? IIRC unspecified behavior in C is not required to be known or consistent.


My bigger problem with the first one is that the explanation is incorrect and it isn't actually about structure padding. The explanation is making the incorrect assumption that `sizeof(int)` is always 4, but it isn't. `sizeof(*(&s))` can be as small as 2.


That's kind of his point though - don't rely on your intuition for how most common compilers work! Learn enough to write correct code!


The author wasn't asking you to write code, but to analyze some existing snippets. And it wasn't about "intuition about how compilers work", because not all of these deal with undefined behavior. Some are merely unspecified - i.e., platform-dependent in a non-crazy way (example: big endian vs little endian). So the gotcha is that you made the reader assume you might be talking about a modern Intel or ARM CPU, but what you really meant is that the return value will be different on PDP-11. Frankly, seems pedantic.


It’s all pedantry until you are writing software for a 1970s nuclear power plant.

/s


Writing C in the blind for a quiz and deploying tested C to known platforms are two entirely different things.


Not as different as we'd like; "we deployed our C to the platforms we need" works until the minute you change compiler/version/flags, and find out you were relying on something not guaranteed at all.

My personal gotcha was finding out I was relying on shifting an unsigned integer by its full size in bits to be 0, as if "you shifted every bit out of the integer, leaving 0". Shift by anything less than the integer's size in bits? Yep, those bits are shifted out, leaving 0. Shift by the full width to push out that last bit? Nope, that's Undefined Behaviour, and suddenly it's not 0 just because you changed a flag.


Take the manipulation of the space, for example.

Most of us would say that it has a value of 32 as defined by ASCII. This would be wrong on a platform that was using EBCDIC.

We all have to make assumptions in our programs, which are sometimes wrong.


Serious question, is there a C compiler that uses EBCDIC?


Definitely exists for IBM z/OS.

Red Hat Linux has also been ported to this platform; it's probably ASCII, but not sure.

Oracle has a native client on z/OS that does transparent conversion between ASCII and EBCDIC. It was probably built with such a compiler.

There was a paper that the original port of Research UNIX ran as a client on an IBM TSS/370 kernel, which likely was not ASCII.

https://gunkies.org/wiki/UNIX/370


Huh, I didn't consider that angle. I said "I don't know" based on the fact that it could exhibit UB if the char container is signed and small enough that it cannot hold 32 * 13. Good to know though that you can't rely on the platform using ASCII!


No one has written a C program of any complexity with zero undefined behavior. Noting its existence as if it's some kind of revelation is trite and annoying.


Yeah there's a whole genre of talks, blog posts, online rants of the form:

"So you think you know C, huh? What does this horrible piece of code do that no one would ever write and if you see it should be nuked from orbit? <ridiculous contrived example>"

I just clicked I don't know on all of em, cause I realised what the metagame was from a mile away.

Some of these things are at least a little insightful, but overall, kind of a waste of time past a certain point, unless you work on a compiler or something like that. Much more constructive would be making resources on best practices for things like memory management, string handling, knowing where the footguns are in the stdlib and other common libs, managing complexity etc. Most bad C code isn't because of some standards gotcha, it's because of those things.


Except there is some very crazy code out there, and if you come across it and have taken at least one such quiz, you know you'll have to be careful.


Oh yeah, I've seen some shit, having done a fair bit of professional C work, mostly in the embedded space.

But if I'm in an existing codebase, and I see a certain volume of this kind of "potential UB everywhere" code, my first instinct isn't to spend however long trying to understand what exactly this code may or may not be doing or relying on according to dusty corners of the standard and the datasheet (sometimes you may have to, but I find it's rare). I prefer to approach it as "what is this code supposed to do, and how can it be done more sanely?" Usually I find I can replace the crazy code with sane code in a fraction of the time it would take to fully understand the crazy code.


And so the cycle of programmers that create serious errors and vulnerabilities because they think they understand UB and don't respect it continued...


Competing languages don't have a spec, so their entire operation is unspecified. If they wrote it down then they would also have UB.


By corollary does that mean that if there isn't a formal proof of how C compiles to assembly, and how those opcodes are interpreted by a particular CPU model, that C formally isn't specified to have any meaning to a CPU?


There are formally verified CPUs, they're just not the lovely high performance ones you probably want.

For what it's worth, this is a common sentiment people have when they first come to blows with formal. "If you can't verify everything, why verify anything?" is a reasonable question to ask but it's not a practical position to hold since formal does increase the quality of software/hardware. Don't argue for less formal, fight for more!


As a Rust fan, it sounds remarkably similar to "if you still need unsafe somewhere, why bother trying to limit unsafe at all"


There is a formal proof if you use the CompCert C compiler:

https://compcert.org/


That would get you to the assembly generated, but not the behavior of the CPU in the question, which goes to show it's a pretty silly position to hold.


Yes, the CPU's behavior is vendor defined. However, vendors usually write a spec.

The point is “undefined behavior” is more about gaps in the standard than bugs in programs.


Undefined Behaviour in C is the opposite of a gap in the standard. It's behaviour that's addressed in the standard but that, for historic, political, and performance reasons, is allowed to be inconsistent. That's the reason it sucks to have been bitten by it (e.g. to find out that shifting N-1 bits out of an N-bit integer gives you 0, but shift that Nth bit and all bets are off).


So your claim is that a new language could define a spec to a degree that it would not have similar problems?


Yes? The combination of "historic, political, and performance" is a particular choice. Our attitudes to all of these are different to when C was maturing, particularly where reliability and security are involved.

And not least of all, we've learned more about programming language development than when C was created.


I simply test on my target systems. Additionally, I don't write contrived shifts and obvious overflows and math with integer size discrepancies and pre and post increments.


^ This. It's like saying "I gave someone a gift" and arguing that we don't know what that means because 'gift' also means 'poison' in German. Besides, chances are you are very well aware of all these C corner cases if you are in a system that has 6-bit chars for example.


[flagged]


Hopefully not. However, that kind of software has historically been written in Ada or a very restricted subset of C.


Presumably, someone writing code for a nuclear power plant would be using astree and get warnings whenever they do anything wrong:

https://www.absint.com/astree/index.htm

It is a shame that it is not accessible to the rest of us. :/


Not accessible? You mean very expensive?


[flagged]


Just look at the comments here.


I know what it is used for and who is using it. That was not the question. There is a lot of undefined, platform-specific behaviour in C, and yet it is mainly promoted for its cross-platform properties. It makes me a little uneasy if I happen to think about it too much.


Missing "(2016)". (one of) previous discussions: https://news.ycombinator.com/item?id=20366940

I really wish we could let this die. All it shows is that if you provide insufficient context, you get insufficient answers. There are better ways to point out C's shortcomings.


C is really the only language. Every other feature in every other language can be implemented in C, but no one wants it bloated with high level features that can just be compiled as a library.

We can know what is happening in memory in a human readable way. And that is the beauty of it. If you can’t tell what’s happening in memory to some degree, then it’s not that you don’t know C, it’s that you don’t comprehend computer architecture.


The first two answers are 8 and 0.

They are technically `undefined` according to the C standard, but they are the behavior of every mainstream compiler. So much of the world's open-source code depends upon these that it's unlikely to change.

Using clang version 15.0, the first 2 produce no warning messages, even with -Wall -Wextra -pedantic. Conversely, the last 3 produce warning messages without any extra compiler flags.

The behavior of the first two examples is practically defined even if undefined according to the standard.

Now, when programming for embedded environments, like for 8-bit microcontrollers, all bets are off. But then you are using a quirky environment-specific compiler that needs a lot more hand-holding than just this. It's not going to compile open-source libraries anyway.

I do know C. I knowingly write my code knowing that even though some things are technically undefined in the standard, that they are practically defined (and overwhelmingly so) for the platforms I target.


> Now, when programming for embedded environments

Something that a lot of C programmers do...

Most people who write for typical desktop and mobile computers don't do C. They tend to do C++ or other, higher level languages. Those who write C tend to do either quirky embedded code, or code that is highly portable, in both cases, knowing about such undefined or implementation defined behavior is important.

If you intend on relying on such assumptions, make them explicit, for example using padding, stdint, etc... On typical targets like clang and gcc on Linux, it won't change the generated code, but it will make it less likely to break on quirky compilers. Plus, it is more readable.


You start off confident that the first answer is 8.

Then you admit that the microcontroller world presents exceptions.

You've now arrived at "I don't know" the answer.

The article never said "using mainstream C compilers".


The first 4 are implementation defined rather than undefined.

That said, warnings do not necessarily mean that the code is invoking undefined behavior. For example, with if (a = b) GCC will generate a warning, unless you do if ((a = b)). The reason for the warning is that often people mean to do equality and instead write assignment by mistake, so compilers warn unless a second set of parentheses is used to signal that you really meant to do that.


In the cases involving overflow, it's implementation-defined whether there's undefined behavior.


> The first 4 are implementation defined rather than undefined.

Third and fourth are only defined in some implementations.


That is fair for 4, although would explain why it is the case for 3?


If char is signed and ' ' * 13 is bigger than CHAR_MAX, you get UB by signed overflow.


Every mainstream compiler targeting a 32 or 64 bit platform.

Have we crossed the point where the majority of new microprocessors and microcontrollers sold each year are 32+ bit yet? Most devices I'm familiar with still have more 8 and 16 bit processors than 32 and 64 bit processors (although the 8 bit processors are rarely programmed in C).


>no warning messages, even with -Wall -Wextra -pedantic

A better test is -Weverything.



