
Text is a human language.

Computers do not really run on text. You can't talk to Unix like it's the bridge computer in Star Trek. No, Unix runs on bytes that sort of look like text because we have encoded and rendered them that way.
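To make the "bytes that look like text" point concrete, here is a minimal Python sketch. The sample string is just an illustration, not anything from a real system: the same bytes read as sensible text under one encoding and as noise under another.

```python
# The "text" a program sees is a sequence of bytes plus an agreed encoding.
text = "naïve café"          # illustrative sample string (an assumption)
raw = text.encode("utf-8")   # what actually sits in a file or pipe

print(len(text))             # number of characters
print(len(raw))              # number of bytes; the two accented letters take 2 bytes each
print(raw)                   # the raw byte values

# Decoding the same bytes under a different encoding gives different "text".
misread = raw.decode("latin-1")
print(misread)
```

Nothing in the byte stream itself says which reading is correct; that knowledge lives outside the data.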

What computers need most for structured processing are data types. At the instruction set level we had a period of experimentation with byte and word lengths, which was still in effect when Unix was born. At that time we also had experimentation with text encodings such as EBCDIC and ASCII with its various code pages. Since then we have pretty much settled on standards based on multiples of an 8-bit byte, and on a common text encoding: UTF-8. At the time of Unix's inception the byte stream was a relatively bold move because it made variable-length sequences of bytes, rather than some kind of static record, the building block of interoperability. Now it is mundane, and the world has moved on to always parsing the bytes for some other purpose.

We desire standards that encode data containers, strings and numeric primitives, so that we do less parsing. XML and S-expressions offer two methods of encoding hierarchies of strings. More recently, JSON has become a very common encoding. It encodes strings, numeric values, and two types of hierarchical containers (key-value objects and arrays). We also have SQL databases as an older example, which encode various primitives and a variety of tabular data relationships. With each data language we get one or more associated query languages to express how we select data.
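As a small illustration of "less parsing", here is a hedged sketch using Python's standard `json` module. The document and its field names are made up for the example; the point is that one decode step hands back typed values and containers instead of bytes to be picked apart by hand.

```python
import json

# A made-up JSON document (field names are assumptions for illustration).
doc = '{"name": "report.txt", "size": 1024, "tags": ["draft", "2024"]}'

data = json.loads(doc)

# One decode step yields typed values, with no ad hoc splitting on delimiters.
print(type(data).__name__)          # key-value container
print(type(data["name"]).__name__)  # string
print(type(data["size"]).__name__)  # numeric value
print(type(data["tags"]).__name__)  # array container

# Numbers are already numbers, so arithmetic works without conversion.
size_in_kib = data["size"] / 1024
```

Compare this with a plain byte stream, where the consumer has to guess whether `1024` is a number, a string, or part of a filename.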

Unix's handling of files is incomplete because of the ambiguity of queries on files. We rely heavily on relationships between directories and on glob syntax to perform selection in a Unix filesystem. Unix paths and glob syntax are not really a byte stream or a file, but they are most certainly part of Unix, and you would be suffering in short order without those concepts. Yet Unix does not respect the existence of its own query language, and never declares a basic system representation for queries and their results akin to the results of a SQL query. Instead we parse the query as bytes and output the results of queries to bytes and parse those bytes, which means there is no guarantee of support at any of the boundaries where your program touches paths. As long as you only deal with one file at a time you don't have an issue, but at scale, the way we work with our applications needs more complex selections on files; the workaround is to "bottle up" the data into one file that has some other structure, which then necessitates special tools to move around the data.
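The boundary problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names and the two helper functions (`select_txt`, `roundtrip_through_text`) are hypothetical, and the space-separated flattening stands in for what happens when a query result is passed along as unstructured text and split again.

```python
import tempfile
from pathlib import Path

def select_txt(root: Path) -> list[Path]:
    """Structured result: a sorted list of Path objects matching *.txt."""
    return sorted(root.glob("*.txt"))

def roundtrip_through_text(paths: list[Path]) -> list[str]:
    """Lossy result: flatten names into one space-separated string, then re-split."""
    flat = " ".join(p.name for p in paths)
    return flat.split()

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical files; one name contains a space.
    for name in ("notes.txt", "my report.txt", "data.csv"):
        (root / name).write_text("x")

    structured = select_txt(root)
    reparsed = roundtrip_through_text(structured)

print([p.name for p in structured])  # two files, as selected
print(reparsed)                      # three tokens after the text round trip
```

The glob itself selects the right two files. The damage happens only when the result is serialized to bytes and parsed again, which is exactly the step a system-level representation of query results would avoid.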

And if you look at what people complain about when working with the shell, the behavior of file selection is a major one, and it adds a lot of complexity.


