Stage0


Purpose

Bootstrapping is hard, like insanely hard. So hard, in fact, that everyone who has ever done it never wants to do it again. The problem, of course, is that all of the technical infrastructure we have today depends upon binaries that we can't actually trust: there is no way to reproduce them from trusted sources, because we have no absolutely trusted sources.

Stage0 is aimed at making those absolutely trusted sources easier to create: less than 400 hours of work in total.

Design

Stage0 starts with only 1 thing:

1) A sub 500 byte hex monitor [How you create it is up to you; I like toggling it in manually myself]

From that starting point, I have provided, in easy to audit form (a direct mapping between hex, its effective assembly and a C implementation), a series of hex utilities that are required for basic development work. These files are in the stage1 folder.

What is with the weird file extensions?

File extensions are very important in stage0: they directly indicate the level of infrastructure required to build a file.

* HEX0 - indicates that the file can be built using the stage0 hex monitor or any other tool that supports the minimal commented hex syntax
* HEX1 - indicates that the file also requires support for 1 character labels and relative displacements of a single size (commonly 16bit).
* HEX2 - indicates that the file also requires support for long labels, 16bit absolute displacements and 32bit pointers for manual object creation.
* M0/M1/S - indicates that the file can be built by either the platform specific M0 macro assembler or the platform neutral M1 macro assembler
* c/h - indicates that the file contains C code

hex0

Hex0 is trivial to implement [1]. It just needs to read 2 hex nybbles and output a byte. You can ignore all non-hex characters, but you need to support 2 types of line comments (# and ;). If you want a more formal specification, please see: Hex0

; This is a line comment
# So it is
;; And this
## And this
;; but to be polite please don't mix in non-hex characters in the hex stream,
## it doesn't make you clever, it just makes your code harder to read

# Done
48 c7 c7 00 00 00 00 # mov $0x0,%rdi
48 c7 c0 3c 00 00 00 # mov $0x3c,%rax
0f 05                # syscall

Example of .hex0 code from hex0. This maps out an ELF file for Linux which implements a compiler for hex (!).
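The rules above (pair up hex nybbles, skip everything else, honor # and ; line comments) can be sketched in C. This is only an illustration of the described behavior, not the stage0 implementation; the function names hex_value and hex0_decode and the buffer based interface are assumptions made for this example.

```c
#include <stddef.h>

/* Return the value of a hex digit, or -1 for any other character. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: decode NUL terminated hex0 text in `src` into `out`.
 * Returns the number of bytes written; stops once `cap` bytes are written. */
size_t hex0_decode(const char *src, unsigned char *out, size_t cap)
{
    size_t n = 0;
    int high = -1;                 /* pending high nybble, -1 if none */

    while (*src != '\0' && n < cap) {
        int c = (unsigned char)*src++;
        if (c == '#' || c == ';') {
            /* Line comment: skip up to (not including) the end of line. */
            while (*src != '\0' && *src != '\n') src++;
            continue;
        }
        int v = hex_value(c);
        if (v < 0) continue;       /* ignore all non-hex characters */
        if (high < 0) {
            high = v;              /* first nybble of the pair */
        } else {
            out[n++] = (unsigned char)((high << 4) | v);
            high = -1;
        }
    }
    return n;
}
```

Feeding it the last two lines of the sample above should yield the nine bytes of the mov and syscall instructions, with the hex-looking text inside the comments left untouched.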

A reduced subset of hex0 is called boot0. It lacks support for line comments and restricts input to only numbers, upper case A-F, space and the enter key, as it expects humans to be entering the input and to know not to type the comments as well. (All boot0 files are valid hex0, and any hex0 file being manually typed in can be converted to boot0 at entry time by the human performing that input.)
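As a rough illustration of that restriction, the check below accepts only the characters listed for boot0. The names is_boot0_char and is_boot0_text are made up for this sketch, and treating the enter key as '\n' is an assumption about the input device.

```c
/* Return 1 if `c` is allowed in boot0 as described above: digits,
 * upper case A-F, space and enter (assumed here to be '\n'). */
int is_boot0_char(int c)
{
    return (c >= '0' && c <= '9')
        || (c >= 'A' && c <= 'F')
        || c == ' '
        || c == '\n';
}

/* Return 1 if every character of the NUL terminated text is valid boot0. */
int is_boot0_text(const char *s)
{
    for (; *s != '\0'; s++)
        if (!is_boot0_char((unsigned char)*s)) return 0;
    return 1;
}
```

A hex0 file with lower case digits or comments fails this check until the person typing it in upper-cases the digits and leaves the comments out.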

hex2

hex2 extends the hex0 language with labels and pointers. (hex1 is a simpler version of this, where labels are limited to 1 character and only 1 size (commonly 16bit) of relative addressing is supported. It is used to build hex2.)

  • ! - 8 bit relative address
  • @ - 16 bit relative address
  • ~ - architecture specific relative
  • $ - 16 bit absolute address
  • & - 32 bit absolute address (for pointers)

Some exotic architectures with alignment requirements and other messy details also need:

  • < pad to alignment
  • . insert to word
  • ^ aligned calculation

# ;; Set p->Next = p->Next->Next->Next
18020000	# LOAD32 R0 R2 0 ; Get Next->Next->Next
23010000	# STORE32 R0 R1 0 ; Set Next = Next->Next->Next
:Identify_Macros_1
18010000	# LOAD32 R0 R1 0 ; Get node->next
A0300000	# CMPSKIPI.NE R0 0 ; If node->next is NULL
3C00 @Identify_Macros_Done	# JUMP @Identify_Macros_Done ; Be done
# ;; Otherwise keep looping
3C00 @Identify_Macros_0	# JUMP @Identify_Macros_0
:Identify_Macros_Done
# ;; Restore registers
0902803F	# POPR R3 R15
0902802F	# POPR R2 R15
0902801F	# POPR R1 R15
0902800F	# POPR R0 R15
0D01001F	# RET R15
:Identify_Macros_string
444546494E450000	# "DEFINE"

Example of .hex2 code from M0
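To make the @label form more concrete, here is a minimal sketch of how a 16 bit relative displacement might be computed once label addresses are known. The exact base address and byte order are architecture specific; this sketch assumes the displacement is measured from the byte just past the 2 byte field and is written big-endian. The function name rel16 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: compute the 16 bit relative displacement from the
 * 2 byte field at `field_addr` to `target`, and store it in `out`.
 * ASSUMPTIONS: base is the address just past the field; big-endian order.
 * Returns 0 on success, -1 if the displacement does not fit in 16 bits. */
int rel16(uint32_t field_addr, uint32_t target, unsigned char out[2])
{
    int64_t d = (int64_t)target - ((int64_t)field_addr + 2);
    if (d < -32768 || d > 32767) return -1;

    uint16_t u = (uint16_t)d;      /* two's complement encoding of d */
    out[0] = (unsigned char)(u >> 8);
    out[1] = (unsigned char)(u & 0xFF);
    return 0;
}
```

An assembler in this style typically makes one pass to record the address of every :label and a second pass to replace each @label with the bytes computed this way.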

line macros

The M0 macro assembler is implemented in .hex2 [2], such that with a defs file like this:

DEFINE LOADR 2E0
DEFINE LOADR8 2E1
DEFINE LOADRU8 2E2

you can now program with the mnemonics instead of raw hexadecimal codes. This creates a new ".s" assembly language which looks like this:

# We still support these comments
;; We also added support for hex inserts like so
:My_Global
'00440044'
;; And we also support strings, that we null pad to 4byte boundaries to make disassembly easier.
:My_String
"Hello world!"

:Prompt_Loop
	LOADXU8 R0 R3 R4            ; Get a char
	CMPSKIPI.NE R0 0            ; If NULL
	JUMP @Prompt_Done           ; We reached the end
	FPUTC                       ; Write it to TTY
	ADDUI R3 R3 1               ; Move to next char
	JUMP @Prompt_Loop           ; And loop again

It also supports all of the syntax of hex2. (Sample taken from CAT.s)
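The two ideas in this section, replacing a mnemonic by its DEFINE text and null padding strings to 4 byte boundaries, can be sketched as follows. The table holds the three definitions from the defs file above; the names expand_token and padded_string_size are invented for this example. The padding rule assumes at least one NUL terminator is always written, which is consistent with the 6 character "DEFINE" string occupying 8 bytes in the hex2 sample earlier, but it is otherwise an assumption.

```c
#include <stddef.h>
#include <string.h>

struct define { const char *name; const char *hex; };

/* The three definitions from the defs file shown above. */
static const struct define defs[] = {
    { "LOADR",   "2E0" },
    { "LOADR8",  "2E1" },
    { "LOADRU8", "2E2" },
};

/* Return the replacement text for `token`, or `token` itself when
 * no DEFINE matches (for example a register name or a label). */
const char *expand_token(const char *token)
{
    size_t i;
    for (i = 0; i < sizeof defs / sizeof defs[0]; i++)
        if (strcmp(defs[i].name, token) == 0) return defs[i].hex;
    return token;
}

/* Bytes emitted for a string of `len` characters.
 * ASSUMPTION: one NUL terminator is always written, then NULs are
 * added until the total is a multiple of 4. */
size_t padded_string_size(size_t len)
{
    return (len / 4 + 1) * 4;
}
```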

Variations

The most common variation is to extend hex2 with additional functionality, such as extending the standard set to include:

  • ! - 8 bit relative address (short jumps for 8086 or small immediate values)
  • @ - 16 bit relative address (ironically not really used in x86)
  • $ - 16 bit absolute address (rare use in x86)
  • % - 32 bit relative address (long jumps for x86)
  • & - 32 bit absolute address (for pointers)

More exotic mixes may replace hex with octal (for x86 but not AMD64) because it is a better match for the underlying opcode space.
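One way to see why octal fits x86: the ModRM byte is laid out as 2 + 3 + 3 bits (mod, reg, rm), so each field is exactly one octal digit. The small sketch below splits a ModRM byte into those fields; the struct and function names are made up for this illustration.

```c
/* The three fields of an x86 ModRM byte. */
struct modrm { unsigned mod, reg, rm; };

/* Split a ModRM byte into its 2 bit mod, 3 bit reg and 3 bit rm fields. */
struct modrm split_modrm(unsigned char b)
{
    struct modrm m;
    m.mod = (b >> 6) & 3;   /* first octal digit  */
    m.reg = (b >> 3) & 7;   /* second octal digit */
    m.rm  = b & 7;          /* third octal digit  */
    return m;
}
```

For instance, the c7 ModRM byte in the hex0 sample is 0307 in octal, which reads directly as mod 3, reg 0, rm 7.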

cc_* + family

These are C compilers, each written in assembly and targeting a single architecture, that support the following primitive types:

   void (and void*)
   int (and int*)
   char (char* and char**)
   FILE (and FILE*)
   FUNCTION (and FUNCTION*)
   unsigned (and unsigned*)
   long (and long*)

These can be combined with struct, and within structs one may union members.

To support #define declarations, the keyword CONSTANT can be used: in cc_*, # starts a line comment and // is ignored. Thus one can write:

   //CONSTANT foo 4
   #define foo 4

and the code will behave the same way in both GCC and cc_*

For conditional execution, if and else are available.

   if(foo) do_something();
   else if(bar) do_it_differently();
   else something_different();

For looping primitives, there are for, do and while loops, with continue (treated like a NOP) and break (behaves like C).

For flow control, there is goto and the ability to have labels, just like in regular C. The only warning is to make sure not to define variables inside of a goto loop.

For those needing something special like system calls, asm("add r0 r1 r2" "sub r3 r2 r1"); is supported. It is not fully gcc compatible, so you will likely want to keep uses of it in separate files, so that an alternate which works with gcc can be used for development.

For those needing to do allocation of memory, sizeof(type) behaves exactly like in C.

For those doing common C code, +, -, /, %, <, >, <<, >>, <=, >=, ==, !=, &, &&, |, || and ^ behave the same as in C (signed version of the instructions; should you require the unsigned behavior, leverage the later M2-Planet stage).

For those dealing with assignment, = works exactly like in C.

For those dealing with structs or arrays, both array[index] and structure->member work exactly like in C.
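To give a feel for code written in this style, here is a small sketch that tries to stay within the features listed above (int, char*, struct, while, !=, +, array[index] and structure->member). It has not been checked against any cc_* implementation, so treat it as an illustration of the subset rather than a guaranteed-to-build example; the names node, sum_list and string_length are made up.

```c
struct node
{
    struct node* next;
    int value;
};

/* Add up the values of a linked list, following ->next until it is 0. */
int sum_list(struct node* head)
{
    int total;
    total = 0;
    while(head != 0)
    {
        total = total + head->value;
        head = head->next;
    }
    return total;
}

/* Count the characters before the terminating NUL using array indexing. */
int string_length(char* s)
{
    int i;
    i = 0;
    while(s[i] != 0)
    {
        i = i + 1;
    }
    return i;
}
```

Declarations are kept separate from assignments and at the top of each function, which also sidesteps the warning about defining variables inside loops.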

M2-Planet + mescc-tools

M2-Planet is written in the cc_* subset and extends the primitives of cc_* into full C compliance with proper type behavior and cross-platform support by default, while additionally providing useful C primitives such as C multi-string support:

   char* s =  /* comment1 */
   "hello"    /* comment2 */
   "world"    /* comment3 */
   "how"      /* comment4 */
   "are you?" /* commentN */
   ;

It also provides mechanisms for adding most useful C primitives easily, in C, if the need exists.
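For readers unfamiliar with multi-string support: in standard C, adjacent string literals are joined into a single string, and comments between them count as whitespace. The short sketch below shows the effect with a compiler such as gcc; the names greeting and greeting_length are invented for this example.

```c
#include <string.h>

/* Adjacent string literals, even with comments between them,
 * are concatenated into one string: "helloworld". */
const char* greeting =  /* comment1 */
    "hello"             /* comment2 */
    "world"             /* comment3 */
    ;

/* Length of the concatenated string. */
int greeting_length(void)
{
    return (int)strlen(greeting);
}
```

Note that no separator is inserted between the pieces, so any spaces have to be written inside the literals themselves.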

At this time M2-Planet can generate fully standards compliant ELF binaries that work on Linux, NetBSD and FreeBSD; along with supporting armv7l, RISC-V (64bit), AArch64, x86, AMD64, knight-native and knight-posix targets.

It is possible to leverage existing work, https://github.com/oriansj/stage0-posix, to skip directly to this level on POSIX compatible systems. (On non-POSIX systems, only the system calls need to be updated.)