The Semantics Assignment Problem
The Semantics Assignment Problem
This post is about the following fact:
The meaning of a self-hosted compiler is not fully specified by the (source code) alone. It is specified by the pair (source code, compiled binary).
The most striking and well known example of this was Thompson's "trusting trust" attack (1984). He explained that it was possible to put a malicious backdoor into the binary of a compiler that would persist even if you recompile the compiler from completely clean unmodified sources. See also Karger & Schell (1974).
Hacker rntz demonstrated this by construction in his **rotten** scheme compiler (2015):
Rotten includes a (mildly) malicious compiler, evil.rot, which notices when it's compiling a compiler, such as compile.rot, and injects a self-propagating virus into it. The most interesting problem here is quining the virus: to self-propagate, the virus needs access to its own source code! You can see some example quines and quine-generators in quines.rkt.
The only other symptom of this virus is that an infected compiler will compile the symbol rotten to the string "YOUR COMPILER HAS A VIRUS!!1!eleventyone". This is a poor stand-in for the nefarious behavior a real implementation of RoTT could inject into the compiler, but it will have to do for now.
There was also a snowden leak discussing a possible toolchain attack on apples Xcode, but researching this I believe it only stamped output binaries of iphone apps with a virus, and didn't attempt to propagate itself across xcode recompiles (since the attack was targeted at app developers more than apple internal).
The CIA declined to comment for this story.
The security researchers also claimed they had created a modified version of Apple’s proprietary software development tool, Xcode, which could sneak surveillance backdoors into any apps or programs created using the tool. Xcode, which is distributed by Apple to hundreds of thousands of developers, is used to create apps that are sold through Apple’s App Store.
- https://theintercept.com/2015/03/10/ispy-cia-campaign-steal-apples-secrets/ https://researchcenter.paloaltonetworks.com/2015/09/novel-malware-xcodeghost-modifies-xcode-infects-apple-ios-apps-and-hits-app-store/
This problem isn't unique to compilers, it applies to any system of assigning semantics to syntax. This includes interpreters and even hardware CPUs.
Reynolds & the birth of continuations
The prototypical example of this is metacircular "self-interpreters" like McCarthy's lisp interpreter in lisp or lambda calculus interpreters. It was realized that these definitions were not fully sufficient to pin down the semantics of a programming language. A lambda calculus interpreter inherits the evaluation order from the host language. When you run it in a strict evaluator you get a strict lambda calculus interpreter and when you run it in a lazy evaluator you get a lazy lambda calculus. This realization and study led to the idea of continuations in programming language semantics.
- Definitional Interpreters for Higher-Order Programming Languages - JOHN C. REYNOLDS (1998) https://surface.syr.edu/cgi/viewcontent.cgi?article=1012&context=lcsmith_other
- The Discoveries of Continuations - JOHN C. REYNOLDS (1993) http://homepages.inf.ed.ac.uk/wadler/papers/papers-we-love/reynolds-discoveries.pdf
Kernel mode syscall hooking (rootkits)
This is an issue at both high and low level. One of the low level situations to consider is that all programs run ontop of a platform: the operating system kernel. A program does a mix of computation and syscalls, the syscalls are user space requests to the kernel. A modified kernel could target and manipulate user space programs by profiling which syscalls they do to fingerprint, then manipulating the behavior of the program by implementing the syscalls differently. For example invoking the gcc program will use the stat syscall on certain directories to look for things like stdlib.h, once profiled like this a malicious kernel could send off modified results for the read syscall, tricking gcc into compiling a modified version of the program the user expected. Similar to the trusting trust attack, just applied at a different level.
I would like to give some references here but I don't know this area well. Couldn't find a canonical thing to link in phrack.
This is an issue with hardware as well as software. The meaning and expected behavior of a program binary is defined in terms of the abstract and ideal x86 instruction set semantics, but in practice this may not be faithfully implemented or may be tweaked or glitched to act differently. Two(!) beautiful concrete demonstrations of this were created by Christopher Domas who has discovered secret openings into the deeper parts of x86 CPUs which gave him the control needed to make existing programs act in completely different ways than normal.
- https://www.youtube.com/watch?v=lR0nh-TdpVg (The Memory Sinkhole)
- https://www.youtube.com/watch?v=_eSAF_qT_FY (GOD MODE UNLOCKED)
The common theme of all this is that the behavior of program can be different depending on what it rests on top of. Every single piece of our toolchain from the hardware up to the high level compilers like rust may be taken advantage of to influence the next to act incorrectly. When a program rests on top of (the compiled version of) itself, you end up in a situation where it's very difficult to audit one half and if it becomes tainted then you are completely stuck.
One of the key properties of free software as defined by the GNU FSF is that the source code is available, the GNU FSDL guidelines also recommend that binaries provided can be built with a free software compiler. It could be argued either way whether or not a self hosted compiler fits in or outside of this!
If we want to avoid painting ourselves into a corner we should ensure that self hosted systems can also be built from a lower system (also known as The Descent Principle). Like how the go compiler is self hosted, but can be built by gccgo. And the rustc compiler is self hosted but can be built by mrustc. For GNU/linux the situation is much more complex. tinycc can build gcc but there's also binutils and glibc, and the linux kernel, and all the surrounding autotools involved. The mescc project is currently working on resolving all of these loops right up from a trivial hex assembler.