michael orlitzky

Greybeard's tomb: the lost treasure of language design

posted 2019-05-14

Practical programming prudence prescribes potent passé principles, period.

In which I conflate implementation details with design decisions

None of this is truly language design. A language can be designed on the wall of a bathroom stall (any PHP programmers in the house?). But as a wide man once said,

Ideas are cheap. I have more ideas now than I could ever write up. To my mind, it's the execution that is all-important.
George R. R. Martin

I want to focus on the execution of a new programming language. The elders knew a few tricks that we forgot over the years—implementation details that influence the success, popularity, longevity, and overall unfuckwithability of a language. For contrast, here's a long list of things that I don't want to talk about, because everybody else does:

Memory-safety (I don't want to get it right and I don't want to get caught)
Direct memory access (thought the list would be consistent? you must be new here)
A strong static type system (no one knows what this means, but we need it)
Speed (vrrrrooooooooooooooooooom)
First-class functions (yo dawg meme but for functions)
A good standard library (you don't need nine ways to make an HTTP request)
Encapsulation (keeps people with no underscore key from calling private methods)
Localized side-effects / immutability (don't launch missiles in toString())
Stateless control flow (if I don't need a loop index, I shouldn't need a loop index)
Totality (it's an error to not handle errors)
A code of conduct (written by a white guy in a wig targeting white guys not in wigs)
Good documentation (yup, that about explains UTF32 encoding and decoding)
Metaprogramming (do things you probably shouldn't do more easily)

To measure the relative success of my ideas, I'll use the absolute worst metric: the May 2019 TIOBE ranking of the top fifty programming languages. Top fifty based on what, you ask? Based on their number, of course. I do this partly because I find humor in the existential meaninglessness of all human endeavor, and partly because everyone else does it. Also, their data agree with me.

It should have an independent, standard, formal specification

The year is something something whenever. The design for a new programming language begins to take shape in the mind of its creator. Features are imagined, syntax is drafted, history is eschewed. Announcements are made on Hacker News, Reddit, and Slashdot. Early adopters adopt early. Praise is lavished. The new language cures segfaults and athlete's foot. Obama is going to rewrite the constitution in it. Version 1.0 is finalized. Industry catches on. Real people use it to write real code for real projects. Everything is great. Until, the cascade:

Three months later, four breaking changes are made to the core language.
One month later, five breaking changes.
One month later, four more breaking changes.
Two months later, thirteen breaking changes.
One month later, five breaking changes are introduced. But they've stopped calling them “breaking changes,” opting instead for “compatibility notes.” Note: none of your shit will work any more! Problem solved.
Two months later, five breaking changes.
One month later, eight breaking changes.
One month later, four breaking changes.
Two months later, eight breaking changes.
One month later, three breaking changes.
Two months later, three breaking changes.
Two months later, seven breaking changes.
I assure you, the pattern continues.

As a user of this language, your only recourse is to recommend that these folks ingest an abundance of dicks, and go back to using C.

Compare the timeline above with two grownup languages:

C
1. Originally standardized by ANSI in 1989.
2. One year later, it becomes an ISO standard with no changes.
3. Five years later, everything still works basically the same in C95.
4. Four years later, everything still works basically the same in C99.
5. Twelve years later, everything still works basically the same in C11.
6. Seven years later, everything still works basically the same in C18.
Ada
1. Originally standardized by ANSI in 1983.
2. Four years later, it becomes an ISO standard with no changes.
3. Eight years later, Ada 95 adds object-orientation to the language while still managing to remain almost perfectly backwards-compatible.
4. Six years later, parts of the Ada 95 standard are clarified in Technical Corrigendum 1 without breaking anything.
5. Six years later, a major amendment called Ada 2005 in theory tolerates incompatibilities, but in practice leaves stuff alone.
6. Seven years later, Ada 2012 is standardized with minimal incompatibilities.

Follow the grownup example:

Don't be the change you don't wish to see in the world.
Gandhi, more or less

A programming language should have a formal specification, governed by an independent standards body such as the American National Standards Institute (ANSI), the International Organization for Standardization (ISO), or the European Computer Manufacturers Association (ECMA). Because why? Because:

A formal specification prevents vendor lock-in

Google kills more babies than Jenny McCarthy. Are you willing to bet that they won't kill your favorite programming language? A formal specification encourages multiple implementations by different organizations, and ferrets out ambiguities in the specification itself. More implementations, more support, more users, more bug reports, more documentation—it's a virtuous cycle. This is a boon for portability, because if someone wants your language to run on his IoT dildo, he can do it himself without shipping you the hardware.

But for a specification to be useful, it can't change every week at the whim of a single company. Implementors can't hit a moving target, and few are dumb enough to try; this is why Chrome is the only browser left after Google usurped the standards process.

An independent standard is hard to change

This is a good thing. ANSI, ISO, ECMA, and other standards bodies are inhabited by conservative old curmudgeons. Every change to a specification is evaluated by a committee, and you want that committee stacked with Luddites because those ancient dinosaur sticks-in-the-mud (these are terms of the utmost respect) understand the robustness principle,

be conservative in what you do
Jon Postel, author of TCP, more or less

The only people more resistant to change are working programmers. Each and every one of them cherishes not having to go back and re-fix shit that already works. The buzzword cowboy trend jumpers are going to hadoop their blockchains into neural networks regardless. But if you write working software, and if its lifetime isn't measured in mmmbops, you crave stability.

Stability keeps the dumbasses out

If you've recommended dick ingestion to no avail, you do still have one option for coping with an unstable language: to bundle a copy of the language itself with your program. That's not a good solution, but programming isn't about good solutions; it's about crafting simulacra that self-immolate immediately upon becoming someone else's responsibility. That's where this option shines. Bundling is a plague on (both)² your houses: ease of administration, security, disk usage, and performance all suffer. But in general, the programmer won't be responsible for any of those things.

And when the language itself is bundled, people go full retard. With the tooling to do so already in hand, they begin to bundle every dependency of every dependency of every dependency until they run out of dependencies and they've got… 114MiB of code and 15,092 files in 1,875 directories to display a bullet list. This only appeals to dumbasses, but boy, does it ever appeal to them. And so over time these languages attract armies of dumbasses, and those dumbasses earn the approval of the other dumbasses (he thinks just like I do! eats booger), and eventually wind up in decision-making capacities perpetuating the cycle of dumbass design flaws. I assume this is what happened to the republican party.

A language that is boring and stable won't attract these dumbasses, and a dumbass deficit is crucial for long-term success. How many nineteen-year-olds are trying to write an Ada dependency manager on Github right now? Answer: maybe I should start using Ada.

TIOBE Tally

Yup

C (2): ISO/IEC 9899:2018.
C++ (3): ISO/IEC 14882:2017(E).
C# (6): ECMA-334, 5th edition.
JavaScript (7): ECMA-262, 9th edition.
SQL (8): Defined in ISO/IEC 9075, but totally not a real programming language.

Sorta

Java (1): The Java Platform, Standard Edition. Highly backwards-compatible, but controlled entirely by Oracle who tried to fuck Google with it but instead fucked you because now your legal department won't let you use it.
Python (4): The Python Language Reference. Looks like a specification, but changes all the time. There are two incompatible versions of it. Wat.
Visual Basic .NET (5): Visual Basic Language Specification. Changes whenever; controlled entirely by Microsoft.
Assembly (10): Go home TIOBE, you're drunk. What they call “assembly” is nothing more than syntactic sugar on top of machine code. But I guess if you pick an architecture, there is something like a specification, because hardware don't change.

Nope

PHP (9): An unofficial Github project?

It should compile…

That is, there should be some explicit process to turn source code into runnable stuff.

Interpreted languages are an old idea. Lisp and APL, for example, are approaching retirement age. But—facing competition from FORTRAN, COBOL, BASIC, and later C—they remained an academic curiosity until the dot-com boom. As the web became popular, we began a trend that continues to this day. In order to efficiently demoney investors, it was proclaimed that people who don't work should be able to quickly write programs that don't work and then push them into production before anyone notices they're garbage programs written by garbage people. Today, those garbage programs are known as Wordpress plugins, and the garbage people are called front-end developers, hinting at their answer to that all-important question: which end of Brendan Eich would you rather fuck?

I offer no explanation for Python's popularity, but the other two interpreted languages in the top ten are unadulterated World Wide Detritus:

Python (4)
Javascript (7)
PHP (9)

It turns out that garbage programs and interpreted languages are a natural fit, because interpreted languages cure the following ailments that beleaguer compiled languages:

You only need one program to run one program.
Bugs are found on the programmer's computer, before users encounter them.
The operating system's execute permissions can prevent malware from being run.
Some programs run too fast and use too little memory.

Interpreted languages don't have any of those problems. Regardless, how do you ensure that programs are compiled? This is a tricky one. You can ship the language with a compiler, but that's no guarantee. Vaccines don't cause autism, but autism causes C++ interpreters. Just, uh, try your best:

Ship your language with an unassailable compiler.
Publicly shame anyone that asks about an interpreter.
Keep your language boring and stable so that the type of person who would ruin everything gets a job shoveling node_modules instead.

Because here's what you'll get out of it.

Compiling forces you to admit that you have a build system

With interpreted languages, it's easy to pretend that you don't have a build system. The source files are like, ready to like, run. Aren't they? They are not. Witness:

Composer genius annihilates documentation for security! Because Composer's installation routine is to copy/paste the entire package onto your public website!
Javascript wizard obliterates test suite to save space! Because NPM installs seven thousand redundant copies of it!
Shell master transforms entire man page into an echo statement! So that the version number doesn't have to be updated in two places—you have to see it to believe it!

These senseless tragedies were all avoidable. A build system can choose which documentation to install, or not, and where. A build system can run the tests and delete them afterwards. A build system can replace a $version variable in multiple places. No matter how sure you are that you don't have a build system, you're wrong. For example, if you're writing a daemon that uses a PID file, where do you put it? On FreeBSD it goes in /var/run, but on Linux, PID files go in /run. There's a long list of these long lists of incompatible paths that depend on where your program will be deployed.

So which paths do you use in your program? Typically, you hard-code the paths that work on your own machine, because fuck everyone else. But you still have a build system: when you release your code, the BSD/Linux distribution maintainers will take it and patch your hard-coded paths out in favor of the paths that work on their distributions. The maintainers then package everything up and ship it off to the users. You still have a build system, but your build system is to send your source code halfway across the world to a stranger who fixes it before ultimately giving it to the people who want to use it. That is not simpler than autotools.

To compile, you need a build system anyway, and it avoids these stupid problems.
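For the skeptics, here is a minimal sketch of the substitution step described above: one template, one version string, one per-platform PID directory. The `@VERSION@` and `@PIDDIR@` placeholder names, the `exampled` daemon, and the `configure` helper are all made up for illustration; only the /var/run and /run paths come from the text. A real build system does far more than this.

```python
import sys

# Per-platform PID directories. The two paths come from the text above;
# the dictionary itself is a hypothetical stand-in for a real platform check.
PID_DIRS = {"freebsd": "/var/run", "linux": "/run"}


def configure(template: str, version: str, platform: str) -> str:
    """Fill the @VERSION@ and @PIDDIR@ placeholders in a source template."""
    piddir = PID_DIRS[platform]
    return template.replace("@VERSION@", version).replace("@PIDDIR@", piddir)


# One template that mentions the version in two places, so that the
# number only has to be updated in the build configuration.
TEMPLATE = (
    'VERSION = "@VERSION@"\n'
    'PID_FILE = "@PIDDIR@/exampled.pid"\n'
    'USAGE = "exampled @VERSION@"\n'
)

if __name__ == "__main__":
    platform = "freebsd" if sys.platform.startswith("freebsd") else "linux"
    print(configure(TEMPLATE, "1.2.3", platform))
```

The point of the design is that the distribution maintainer changes one input (the platform) instead of patching hard-coded paths out of the source.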

Compilation discourages dependency dipshits

Language-specific “package managers” are a cancer. Fortunately, none of them are real package managers: they're largely a wrapper around wget and cp -r, the easy part of package management. The hard parts are left undone, because the hard parts are hard.

In an interpreted language, you can almost get away with that. If you're bad enough at your job, you could be convinced that wget … && cp -r … is a satisfactory installation routine for, say, a Python library. And since this is what a language-specific package manager does, someone else who's bad at his job is going to create one. When that happens, your ecosystem begins its long kiss goodnight. The ability to specify exact version requirements on the programmer's machine frees him from the responsibility to design a sensible, stable API. Eventually two different programs require two different versions of that API, and they can no longer be installed together. This chain of events concludes with everyone bundling their dependencies and having sex with children, which are commensurate sins.

All glory to compilation. Ten thousand years ago, the people who write language-specific package managers would have been food. They don't actually know how to build software, so if you place them in front of a compiler, they'll just stand there, drooling, waiting to be eaten. No amount of cp -r can turn source code into executables, so eventually they'll give up and return to hunting rocks. Programmers will have to adapt to not knowing the exact versions of their dependencies that will be installed. Library designers will be forced to think about their API and ABI. Your ecosystem will be better for it.
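The "two programs, two versions" failure can be sketched in a few lines. Everything here is hypothetical: the version numbers, the half-open range representation, and the `pin` and `overlap` helpers are assumptions for illustration, not how any particular package manager works.

```python
from typing import Tuple

Version = Tuple[int, int, int]
# A requirement is a half-open range: lower bound included, upper bound excluded.
Range = Tuple[Version, Version]


def pin(v: Version) -> Range:
    """An exact pin is a range that admits exactly one release."""
    return (v, (v[0], v[1], v[2] + 1))


def overlap(a: Range, b: Range) -> bool:
    """True if a single installed version can satisfy both requirements."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return low < high


# Two hypothetical programs that depend on the same hypothetical library.
exact_a, exact_b = pin((1, 4, 2)), pin((1, 5, 0))
loose_a = ((1, 4, 0), (2, 0, 0))  # "any 1.x from 1.4 on"
loose_b = ((1, 5, 0), (2, 0, 0))  # "any 1.x from 1.5 on"

if __name__ == "__main__":
    print("exact pins co-installable:", overlap(exact_a, exact_b))
    print("ranges co-installable:", overlap(loose_a, loose_b))
```

The ranges only work if the library keeps its API stable across 1.x, which is exactly the responsibility that exact pins let the library author dodge.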

The compiler does free program analysis

Having a strong, static type system makes your programs better. All of the warnings and errors that would normally be shown to your users (often accompanied by a crash) can instead be caught during development, while you build the executable.

The compiler brings two advantages here. First, you can add all the extra type annotations and safety mumbo jumbo you want to the language at no performance cost. The compiler analyzes the program to ensure that, for example, all strings are of the appropriate length. But then, that check can be deleted: once the compiler has proved that a check will succeed, it doesn't need to do it again at runtime. In an interpreted language, the checks need to be performed the first, and every subsequent time that they are encountered. That means that adding safety to an interpreted language is slow, but adding it to a compiled language is free.

Second, if you want to ask questions about a program, then the compiler is the dude you want to ask. The compiler already has to know everything about your language, because he's gonna compile it. Example: if you want to do syntax highlighting in an IDE, the compiler already knows how to do that. You hand it some code, it marks up the various important bits, and then hands it all back to you. All you have to do is associate some colors to the marked-up parts. The hard work is already done. Example: if you want to lint your code, you first need a usable programmatic representation of that code. Guess what, the compiler has one already, because it's what he transforms into runnable stuff. The clang-tidy analyzer leverages clang for this low-level machinery, allowing its authors to concentrate on the static analysis features. Contrast with the Pylint project, which needs a huge library called astroid to interpret the Python source code even though the Python interpreter already does that.
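A rough sketch of the "ask the thing that already parses the language" idea, using the `ast` module from Python's standard library as the stand-in front end. The `set_name` function, the eight-character limit, and the `long_literals` helper are invented for this example; the check itself (flag string literals that are too long before the program ever runs) mirrors the string-length example above.

```python
import ast
from typing import List, Tuple

MAX_LEN = 8  # hypothetical field width, chosen only for illustration


def long_literals(source: str, func_name: str = "set_name") -> List[Tuple[int, str]]:
    """Return (line, literal) for string literals passed to func_name
    that are longer than MAX_LEN, without executing the source."""
    problems = []
    # ast.parse hands back the tree built by the language's own parser.
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == func_name
        ):
            for arg in node.args:
                if isinstance(arg, ast.Constant) and isinstance(arg.value, str):
                    if len(arg.value) > MAX_LEN:
                        problems.append((arg.lineno, arg.value))
    return problems


# set_name is never defined or called; the sample is only parsed.
SAMPLE = 'set_name("ok")\nset_name("far too long for the field")\n'

if __name__ == "__main__":
    for line, literal in long_literals(SAMPLE):
        print(f"line {line}: literal of length {len(literal)} exceeds {MAX_LEN}")
```

All of the parsing is borrowed; the only new code is the rule being checked.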

When you build a compiler, you also build this ancillary cool shit.

Compiled code is hard to read

When you ship someone a PHP script, he can just read it! What if the code contains trade secrets, or security vulnerabilities? Or if your master database password is in there? Pleas for a legal solution to this problem have gone unanswered, so a technical measure is needed: when you compile your code, it becomes unreadable. This is a highly-effective form of DRM that everyone should be using.

TIOBE Tally

Yup

Java (1): Yeah.
C (2): What?
C++ (3): Okay.
Visual Basic .NET (5): Shots.
C# (6): Shots.
Assembly (10): Shots.

Sorta

Python (4): Python has the setuptools system, which is kind of half-assed, but is at least official and allows distributions to fix things in a single place.

Nope

JavaScript (7): Mmmmmmmmnah. There's Grunt, but it's not official, and you can't count on it being used.
SQL (8): Who left this here?
PHP (9): Not even a little.

…to machine code

Up until recently (say, Google exists but you don't yet need to be transgender to work there), computers ran what were known as programs. These so-called programs were made of machine code, consisting of microscopic numbers that tell your CPU how to arrange particles of electricity into pornography. Ask your parents. The last program ever written was the V8 engine in 2008, after which programming was over and we all set about writing Javascript engines in Javascript for the next Javascript years.

Machine code was outlawed: in order to display pop-up advertisements, everyone agreed that it was best if we blindly ran whatever code was sent to us by strangers on the internet. We all got hacked for a while, but as a result, we now have a long list of extremely specific things that code from strangers shouldn't be allowed to do. We've only had to amend the list a few thousand times in the past; and—thanks to our collective willful ignorance of statistics, history, computer science, economics, crime, psychology, and of how lists even work—we're pretty sure that the list is complete this time. The problem with machine code, then, is that it lets you do all of those things. And so it has fallen out of favor with the people (pop-up ad creators, new programmers, and sentient trashcans) who promote list-of-bad-things-based security.

Let's bring it back.

Machine code is fast

Literally as fast as possible, because anything else that you think might be faster is made of machine code. If you have something interpreted, turning it into machine code makes it faster. Just-in-time compilation? Just-shut-the-fuck up.

VRRRRROOOOOOOOOOOOOOOOOOOOOOMMMMMMM
machine code

The other shit you were thinking of using will eventually become machine code anyway, so you might as well get it over with. And doing it yourself produces better code, because semantics can be lost in translation: I can easily turn “double every element in this list” into efficient machine code, but it's a lot harder to turn it into bytecode and then ensure that a bytecode interpreter will turn every such loop into efficient machine code.
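To see what "lost in translation" looks like, here is a small sketch that writes "double every element in this list" once and then lists the bytecode instruction names that CPython generated for it. It assumes CPython and its standard `dis` module; the `double_all` and `opnames` names are made up, and the exact opcodes vary between interpreter versions, so none are hard-coded.

```python
import dis
from typing import Callable, List


def double_all(xs: List[int]) -> List[int]:
    """'Double every element in this list', as the programmer thinks of it."""
    return [2 * x for x in xs]


def opnames(func: Callable) -> List[str]:
    """Names of the bytecode instructions in func's top-level code object."""
    return [ins.opname for ins in dis.get_instructions(func)]


if __name__ == "__main__":
    # The listing differs between CPython versions, so it is printed, not asserted.
    print(opnames(double_all))
```

Whatever the listing looks like on your interpreter, it is made of generic loads, calls, and iteration steps. Nothing in it says "double every element", so anything that later wants efficient machine code has to rediscover that intent.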

Machine code is portable

Ok, it's not. But if your end game is machine code, then you can use the C language as an intermediate representation between your own high-level language and machine code. The C language is the most portable programming language on Earth. “But Python runs anywhere” you say, looking up from your coloring book. No, Python runs anywhere that has a Python interpreter. And the Python interpreter is written in C.

Doesn't using C as an intermediate representation contradict the previous item (semantics can be lost in translation)? Honestly: yes. But dishonestly: no, it's fine. The C language is low-level enough to be able to express anything efficiently, if you do it right. And decades of work have gone into making C compilers produce efficient code. So it's possible to use C as an intermediate representation without slowing things down, although this item should probably come with an asterisk if I'm being honest (I'm not).

Machine code is reusable

People want to call libraries written in one language from executables written in another. If they can't, then they need to write every library in every language. The Rust people seem to find that entertaining, but it's literally reinventing the wheel and a huge waste of time. In the best of worlds, you'd still wind up with a mountain of code that needs to be maintained indefinitely. But in the actual of worlds, the reinvention has problems: it's missing half of the features and all of the bugfixes that have accumulated in the original over the years. Long before you've brought the two to parity, a new language du jour coalesces and drains the manpower from your half-finished attempt; now someone needs to rewrite your library in the new language! Your library is abandoned, and the same fate eventually befalls its successor in the new language. And little fleas have lesser fleas, and so, ad infinitum.

So, we want to be able to reuse existing code. How do I call a Python library from a PHP program? I'll tell you how: I print out the source code, roll it up, and go fuck myself with it. The Unix philosophy answers this question at a coarse, whole-program granularity. But if you want to call a single function (and not the whole program), you're out of luck.

Machine code to the rescue. Fortran machine code, C++ machine code, and Ada machine code are all the same shit. And calling a machine code function is easy: you plop its arguments into memory and then jump your program's execution to the beginning of the function. Done.

I can call C++ functions from C, and vice-versa.
I can call Fortran functions from C/C++, and vice-versa.
The Ada standard guarantees that I can call Ada functions from C, Cobol, and Fortran, and GNAT has an additional interface to C++.

This is trivial if you were paying attention when I suggested compiling to C. Everything is compatible with C, so if your language can be turned into C, then you get all that compatibility for free. If you're not compiling to C, things are only a tiny bit more difficult. You need to agree on how to call functions, and you need to know how to convert your types back and forth from whatever types you're interfacing with. If I call a function that returns a Pascal string, then the result has to be abused a bit to make it a Haskell string. None of that is hard, so long as you don't constantly fuck with your calling convention or how your types are represented.
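Both halves of that (agree on how to call, convert the types) fit in a short sketch. It uses Python's standard `ctypes` module to call `strlen` from the already-compiled C library, and converts a length-prefixed Pascal-style string into the NUL-terminated form that C expects. Assumptions: a POSIX system, a one-byte length prefix, and the made-up helper names `pascal_to_c` and `c_strlen`.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into this process.
_libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Agree on how to call the function: one char pointer in, one size_t out.
_libc.strlen.argtypes = [ctypes.c_char_p]
_libc.strlen.restype = ctypes.c_size_t


def pascal_to_c(pstr: bytes) -> bytes:
    """Convert a length-prefixed (Pascal-style) string to a NUL-terminated one."""
    length = pstr[0]  # assumed one-byte length prefix
    return pstr[1 : 1 + length] + b"\x00"


def c_strlen(cstr: bytes) -> int:
    """Call the C library's compiled strlen on a NUL-terminated byte string."""
    return _libc.strlen(cstr)


if __name__ == "__main__":
    print(c_strlen(pascal_to_c(b"\x05hello")))
```

No source code for `strlen` is involved anywhere; the caller only needs the function's address, its calling convention, and the layout of its argument.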

TIOBE Tally

Yup

C (2): Duh.
C++ (3): Duh++.
Visual Basic .NET (5): Using .NET Native.
C# (6): Using .NET Native or Mono's ahead-of-time compilation.
Assembly (10): uhhhhhhhhhhhhhhhhhhhhhhhh

Sorta

Java (1): The GNU gcj compiler was capable of compiling Java to machine code, but GraalVM is a better choice these days.
Python (4): Cython does hit a lot of my bullet points.

Nope

JavaScript (7): The NectarJS project can do so, but everyone in this room is now dumber for having learned that. I award you no points, and may God have mercy on your soul.
SQL (8): Guys, SQL isn't a real programming language.
PHP (9): Quick, think of something good. PHP doesn't do that.

The end

Since we're focusing on only a few select aspects, we expect them to be necessary but insufficient for the success of a language. Most popular languages should score well on our three criteria, but some unpopular languages will also score well because there are other factors involved. I've haphazardly calculated scores for the top fifty TIOBE languages in May 2019 using a two-point system: every “yup” is two points, every “sorta” is one, and every “nope” is zero. Not all of these are clear-cut—most should probably come with error bars of ±1. There's also the glaring problem of “to machine code” being highly correlated with “it should compile.” What I'm trying to say is, have fun!

Scores for the TIOBE top 50 languages of May 2019
Rank	Language name	Spec	Compiles	Machine code	Total
1	Java	1	2	1	4
2	C	2	2	2	6
3	C++	2	2	2	6
4	Python	1	1	1	3
5	Visual Basic .NET	1	2	2	5
6	C#	2	2	2	6
7	JavaScript	2	0	0	2
8	SQL	not a real programming language
9	PHP	0	0	0	0
10	Assembly	1	2	2	5
11	Objective-C	0	2	2	4
12	Delphi	0	2	2	4
13	Perl	0	1	0	1
14	MATLAB	0	0	0	0
15	Ruby	2	0	0	2
16	Visual Basic	0	2	2	4
17	Groovy	0	1	1	2
18	Swift	1	2	2	5
19	Go	1	2	2	5
20	PL/SQL	not a real programming language
21	R	1	0	0	1
22	SAS	0	0	0	0
23	D	1	2	2	5
24	COBOL	2	2	2	6
25	Transact-SQL	not a real programming language
26	ABAP	not a real programming language
27	Fortran	2	2	2	6
28	Scratch	0	0	0	0
29	Dart	2	2	0	4
30	Scala	1	2	0	3
31	Prolog	2	1	2	5
32	Lisp	2	1	2	5
33	Lua	1	1	1	3
34	Rust	0	2	2	4
35	Logo	0	0	0	0
36	Ada	2	2	2	6
37	F#	1	2	2	5
38	Apex	not a real programming language
39	Kotlin	0	2	2	4
40	Scheme	2	1	2	5
41	LabVIEW	0	2	2	4
42	TypeScript	1	2	0	3
43	Julia	0	1	2	3
44	Awk	not a real programming language
45	Haskell	0	2	2	4
46	Clojure	0	1	0	1
47	Erlang	0	2	2	4
48	Standard ML	2	2	2	6
49	Bash	0	0	0	0
50	RPG	0	2	2	4
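The arithmetic behind the Total column is nothing fancy. As a sketch, here it is for five of the top ten, with the verdicts copied from the three tallies earlier in the post; the `POINTS`, `VERDICTS`, and `total` names are just labels chosen for this example.

```python
# Two points for a "yup", one for a "sorta", zero for a "nope".
POINTS = {"yup": 2, "sorta": 1, "nope": 0}

# Verdicts in the order (spec, compiles, machine code), taken from the tallies.
VERDICTS = {
    "Java": ("sorta", "yup", "sorta"),
    "C": ("yup", "yup", "yup"),
    "Python": ("sorta", "sorta", "sorta"),
    "JavaScript": ("yup", "nope", "nope"),
    "PHP": ("nope", "nope", "nope"),
}


def total(language: str) -> int:
    """Sum the points for one language across the three criteria."""
    return sum(POINTS[verdict] for verdict in VERDICTS[language])


if __name__ == "__main__":
    for name in VERDICTS:
        print(name, total(name))
```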

As for necessity: indeed: of the top ten, only Python, Javascript and PHP are stinkers. Javascript and PHP can be excused, since both are popular only because they're required for client/server web development. Javascript could fuck your girlfriend so hard that your parents die and you'd still use it because it's the only client-side web language. Likewise PHP is the only server-side language that you know will be available on a cheap web host. So both remain popular despite being themselves. Ignoring those two, the top ten more or less makes sense—all of the scores are good. Except Python. Whatever.

For insufficiency: also indeed: we see tons of high scores for less-popular languages. That's because there are other key ingredients in a successful language:

To put it plainly, the language can't suck. This is why COBOL is so low on the list.
It can't be new. No one is using a brand-new language.
It can't be too similar to existing languages. This is what kills languages like D that would otherwise be winners. It might be better than C++, but not so much better that I'm willing to append it permanently to the list of shit I need to know. The question isn't “would I rather use D than C++?” Instead it's “would I rather use C++ and D than C++?” My C++ code doesn't just go away if I switch to D.

In any case, that's how I hand-wave away the fact that the bottom half of the list performs about as well as the top half. I warned you that the metric was meaningless but you persisted and here we are. One final thing is interesting. Most languages that score a perfect six are long-lived, well-liked, and in heavy use to this day:

C (2)
C++ (3)
C# (6)
COBOL (24)
Fortran (27)
Ada (36)
Standard ML (48)

Meanwhile, every language that scores an antiperfect zero is a turd sandwich with extra mayo:

PHP (9)
MATLAB (14)
SAS (22)
Scratch (28)
Logo (35)
Bash (49)

The playthings of our elders are called business.
Saint Augustine

TIOBE damned, truth isn't a democracy. A programming language should have an independent, standard, formal specification and it should compile to machine code.