The Bourne Shell Source Code (tuhs.org)
122 points by kick on Jan 31, 2020 | 39 comments


This is the source to the original Bourne Shell, shipped in Research UNIX v7. You've probably used GNU Bash, which stands for "Bourne-Again SHell."

The Bourne sh is significant for a few reasons. Primarily, its GNU descendant is now installed on billions of devices.

Perhaps of more interest: Bourne's sh source code heavily abuses C macros to look and feel like ALGOL-68. This is made more significant because it came before C was standardized: it took real knowledge to abuse that much.

While this is C that compiled just fine for its day, and might compile mostly without errors under a compiler with a K&R compatibility mode, it's absolutely wild, and it's written to compensate for some of K&R C's shortcomings (see how it handles true vs. false).

I recommend mac.h as a file of particular interest:

https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/sh...

Bonus points to anyone who understands what these three lines are doing within it:

     #define LOBYTE 0377
     #define STRIP 0177
     #define QUOTE 0200

Also of interest:

This was specifically the reason the International Obfuscated C Code Contest was created; the contest was conceived just minutes after one of its founders had been wrestling with Bourne's sh (I'm sorry for formatting this as a code block, but the formatting breaks otherwise):

     Q: How did the IOCCC get started?
     A: One day (23 March 1984 to be exact), back when Larry 
    Bassel and I (Landon Curt Noll) were working for National 
    Semiconductor's Genix porting group, we were both in our 
    offices trying to fix some very broken code. Larry had been
    trying to fix a bug in the classic Bourne shell (C code 
    #defined to death to sort of look like Algol) and I had been
    working on the finger program from early BSD (a bug ridden 
    finger implementation to be sure). We happened to both 
    wander (at the same time) out to the hallway in Building 7C 
    to clear our heads.

     We began to compare notes: ''You won't believe the code
    I am trying to fix''. And: ''Well you cannot imagine the 
    brain damage level of the code I'm trying to fix''. As well 
    as: ''It more than bad code, the author really had to try to
    make it this bad!''.

    After a few minutes we wandered back into my office 
    where I posted a flame to net.lang.c inviting people to try 
    and out obfuscate the UN*X source code we had just been 
    working on.


> Bonus points to anyone who understands what these three lines are doing within it:

     #define LOBYTE 0377
     #define STRIP 0177
     #define QUOTE 0200
That's just 0xff, 0x7f and 0x80 in octal, and the high bit used to be a flag for all kinds of "magic" behaviour back when 7-bit ASCII was the norm...


Yup, the dash shell (from Debian) still uses this, so it won't support Unicode any time soon.


I find myself still validating a lot of input to be ASCII. I think it's time to write a lib to make use of all those wasted bits.


I signed the above post with an emoticon that does not render? Could it be that HN is not 8-bit safe?


HN filters out emoji (well, most of them, the blacklist appears to be incomplete)


> the blacklist appears to be incomplete

I was under the impression that the emojis that are not blacklisted are intentionally whitelisted.

For example, all of the flag emojis are possible to use. That seems to be no coincidence.

There are a few others that are possible to use as well, most of which appear to be logically grouped together.

For example, multiple emojis relating to time are possible to use.

The flags make sense I think. And I could see the ones relating to time being sort of relevant in post titles about time management for example.

For some reason, some of the symbols resembling media playback controls are possible to use too.

I don't see any reason for the media playback control symbols to be available (while so many others are blacklisted, I mean), but it does make the following possible:

"This is so sad. Alexa, play Despacito 2."

Now playing: Despacito 2 (feat. Eminem)

⏯ ⏮ ⏭ —————————○—————— (2:01 / 3:14)

Not that there is any immediate practical use of that of course.

I think it would be interesting if the mods would comment on which emojis and other symbols outside of human writing systems are available, and whether that is intended. If it is intended, what are they expecting people to use them for, and why did they decide to whitelist exactly those while still blacklisting a few of the other ones that might be useful?

It might also just be like you said, though: maybe they explicitly blacklisted some symbols, more symbols were introduced later, and the blacklist was never updated to include those.


That can't be the (only) explanation, since according to Emojipedia the play/pause button was introduced in Emoji 1.0.


Flags work also: 🇯🇵 🇰🇷 🇩🇪 🇨🇳 🇺🇸


Techies at theregister.co.uk told me MySQL can't persist some types of UTF-8 chars in their forums. It's not deliberate, but common emojis don't work there either. Anyone know if HN uses MySQL?


News, the software behind HN, doesn't, no. The filter for emojis is deliberate.


MySQL has multiple Unicode character sets. The one named "utf8" is an alias for utf8mb3 and can't store code points that need four bytes, but utf8mb4 supports all of Unicode.


MySQL's `utf8` is only 3 bytes per character -- you have to specify something different than `utf8` to get full UTF-8: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-sets...


I can attest that MySQL 8.0 handles emoji characters with no fuss.


HN doesn't even use a database.


Bingo!


Everyone talks about the macro abuse, nobody talks about the memory management abuse:

https://www.in-ulm.de/~mascheck/bourne/segv.html

> In comp.arch, 05/97, <5m2mu4$guf$1@murrow.corp.sgi.com>, John Mashey writes:

> For speed, Steve B had used a clever trick of using a memory arena without checking for the end, but placing it so that running off the end would cause a memory fault, which the shell then trapped, allocated more memory, then returned to the instruction that caused the trap and continued. The MC68000 (in order to go fast) had an exception model that broke this (among other things) and caused some grief to a whole generation of people porting UNIX to 68Ks in the early 1980s.


To be fair, that's not exactly a horrible perversion, relying on the existence of memory protection. Hell, that's precisely how stack grows dynamically on Windows: there is a guard page at the bottom, when it's touched, the kernel allocates it and marks the page below it as the new guard page. The downside is that stack allocations larger than 4 K have to manually probe memory to trigger this behaviour, that's what _stkchk from CRT does.

Most such low-level hacks are nowadays reserved exclusively for runtime implementations, probably for the better.


The early 68k machines had memory protection; they just couldn't resume the instruction that triggered the fault, and it wasn't until the 68010 that this was fixed. You can also see evidence of this in early 68k C compilers: the function entry code would have a dummy instruction to probe the end of the stack area needed to hold the local variables.


Nit, but it's _chkstk.


> memory fault

That reminds me of how virtual memory works.


Stephen Bourne's love for ALGOL 68 is also why the control flow in his shell makes use of backwards words such as 'fi' and 'esac' to end blocks: they originated in ALGOL 68, and he carried them over into his shell.


And the only asymmetry there is that 'do' is matched with 'done' rather than 'od', because od was already taken by the octal dump program.


You don't really have to use code blocks for that. Here's a copy that will be readable on mobile and preserves the formatting from the original (which the code block didn't):

Q: How did the IOCCC get started?

A: One day (23 March 1984 to be exact), back when Larry Bassel and I (Landon Curt Noll) were working for National Semiconductor's Genix porting group, we were both in our offices trying to fix some very broken code. Larry had been trying to fix a bug in the classic Bourne shell (C code #defined to death to sort of look like Algol) and I had been working on the finger program from early BSD (a bug ridden finger implementation to be sure). We happened to both wander (at the same time) out to the hallway in Building 7C to clear our heads.

We began to compare notes: "You won't believe the code I am trying to fix". And: "Well you cannot imagine the brain damage level of the code I'm trying to fix". As well as: "It more than bad code, the author really had to try to make it this bad!".

After a few minutes we wandered back into my office where I posted a flame to net.lang.c inviting people to try and out obfuscate the UN*X source code we had just been working on.

From: https://www.ioccc.org/faq.html


I tried it without, initially.

How'd you get the censored UNIX to work without italicizing half of the comment incorrectly? I tried escaping it in a few different ways, and no dice.


That must have been sheer luck! Or maybe because it was the last asterisk in the comment, and the ones before it were paired?

I took the original and replaced the doubled single quotes with regular double quotes, and surrounded the italicized quotes with asterisks. That seemed to do the trick.

Of course if the censored UNIX appeared earlier, or more than once, that would likely be a problem. Here are three of them in a row:

UNX UNX UN*X

Here's another idea. I suspect that the other Unicode asterisk-like characters might render without triggering italics. I'll use the asterisk operator (U+2217), heavy asterisk (U+2731), bold five spoked asterisk (U+1F7B1) in that order. Let's see how they look:

UN∗X UNX UN🞱X

OK, it looks like the heavy asterisk just disappears, the bold five spoke is too big in Chrome on Windows and renders as a box with an X in it in Chrome on Android, but the asterisk operator is a pretty good substitute for the regular one (albeit a bit small and light on desktop, a bit large on mobile). Now let's see if repeating it triggers any formatting:

UN∗X UN∗X UN∗X

I think we have a winner! Ah, the joys of HN comment formatting...


> Algol 68

By the way, it is an interesting (and simple) exercise to use macros to make C look almost like Oberon-07.


Back when I was writing C compilers, the Bourne Shell was my nemesis. The Bourne shell exercised nearly every "feature" of the C language. Compiling and then running the shell was a great test case for an optimizing compiler and turned up many bugs. But when the compiled code failed, winding back to the underlying C through all of the macros and optimizations was exceptionally difficult. I still remember many a late night trying to figure out what happened. (Thanks Steve, for many fascinating hours of struggle.)


The whole program trusts definitions in mac.h [1] like:

    #define IF if(
    #define THEN ){
    #define ELSE } else {
    #define ELIF } else if (
    #define FI ;}

    #define BEGIN {
    #define END }
    #define SWITCH switch(
    #define IN ){
    #define ENDSW }
    #define FOR for(
    #define WHILE while(
    ...
Isn't this nowadays considered bad practice? After taking a glance at the code, I see there might be some advantages, like making it harder to forget a closing brace. Is there any other explanation for why they created a dialect on top of C using the preprocessor?

[1] https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/sh...

EDIT: fix English


Bourne liked ALGOL. A lot. So much that he was one of the few people who wrote their own ALGOL-68 compiler. Using the preprocessor to feel more at home is a pretty good idea in this case.

This wasn't particularly popular with anyone who wasn't, well, Bourne, even at the time. I posted an example here: https://news.ycombinator.com/item?id=22199664


I'm under the impression that building up high-level languages using macros was very common among assembly programmers; since C was new, whoever wrote this may have come from assembly and taken the habit with them.


The author worked on ALGOL 68C and probably intended to reuse the more familiar syntax.

Nowadays, I would indeed consider it a bad practice to use such macros. Especially if you intend to share the project with anyone else.


As someone who used the C preprocessor to generate C++, Java, and C# from a common source in order to have a common library for native apps, I always appreciate a good bit of preprocessor abuse - it's one of the things that makes C so much fun!


My favorite idea for extreme preprocessing is using C itself as the preprocessor language, with, optionally, the only syntactic sugar being provided by ASP-like brackets.


I'm another who recalls this mess being used (1990 or so) as the acid test for C compilers, source code analyzers and debuggers. I can still picture the look of pride on a certain salesman's face when he demoed a valgrind-like tool for us that didn't just crumble to pieces when asked to chew on this tangle.


Interestingly, https://news.ycombinator.com/item?id=22188704 , with similar techniques reinvented 4 decades later, was on Hacker News only yesterday.


Yes, I commented there about the Bourne Shell and the IOCCC: https://news.ycombinator.com/item?id=22192910


The technique of macroing C to death to look like another language never died! It's still a very popular thing to do, but it's gotten a bit less wild now that C has been standardized to death.


I remember in college some friends of mine decided to macro C into really bad fake German. Think 'inten mainen" and 'printenoutenf'.



