Journal by cafebabe

This is part four of a four-part proposal for a trustworthy computer consisting of ALU, registers, memory interface, network cards and compiler. Part four covers the compiler.

So far, I've given an outline for a minimal trustworthy micro-coded mini-computer where every component can be sourced, tested and is otherwise open to inspection. I've also given an outline for a card bus system which allows cards to be made from stripboard and chips which can be manually soldered. Again, this is open to inspection. The card system also provides a bridge to contemporary networking and storage. This requires some cheating with micro-controllers to keep the part count down to a reasonable level. Use of a micro-controller is obviously not trustworthy and therefore encryption and striping across redundant channels is required to ensure that no untrusted component gains a sufficient stranglehold on any data.

However, all of this is wasted if a trustworthy computer cannot self-host an operating system and compiler. We have the luxury of starting from an untrusted computer environment and therefore we can use any number of facilities to obtain a beach-head into a trustworthy environment. Conceptually, this requires one or more paper tapes which are inspected before transfer into the trustworthy environment. In practice, it will require a uni-directional serial link and a grudgingly trusted EPROM programmer. I argue that it is difficult (but not impossible) to compromise the EPROM programmer on the basis that all EPROM programmers may be sourced prior to micro-code or machine code patterns being finalized. In the absence of a network connection to a (very determined) attacker, malicious corruption is probably the best attack.
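As an aside (my illustration, not part of the earlier parts), one cheap way to gain confidence in the transfer is to compute the same checksum independently on both sides of the serial link and compare it against a printed listing before the EPROM is burned. A minimal sketch in C, where the file name and the choice of Fletcher-16 are assumptions of mine:

#include <stdio.h>

/* Fletcher-16 over a raw bootstrap image: two running sums modulo 255.
   The same value can be computed on the untrusted sender and compared
   against a printed listing before burning the EPROM. */
int main(int argc, char **argv)
{
    FILE *f;
    unsigned sum1 = 0, sum2 = 0;
    int c;

    if (argc != 2) {
        fprintf(stderr, "usage: %s image.bin\n", argv[0]);
        return 1;
    }
    if ((f = fopen(argv[1], "rb")) == NULL) {
        perror(argv[1]);
        return 1;
    }
    while ((c = fgetc(f)) != EOF) {
        sum1 = (sum1 + (unsigned)c) % 255;   /* running byte sum */
        sum2 = (sum2 + sum1) % 255;          /* position-weighted sum */
    }
    fclose(f);
    printf("fletcher-16: 0x%02X%02X\n", sum2, sum1);
    return 0;
}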

A quick recap on the current state of computing. We got away from boot-strapping every machine from its toggle-switches and sighed with relief. However, in the years that followed, computer security has become a quagmire. To get out of this problem, I propose a fairly drastic, unconventional approach. I hate to be a green-field developer but when computer security becomes an insurance category, that's because the details of systems - systems that people created - have become unknowable. Specifically, I propose writing a C compiler in Lisp and then using the C compiler to write an operating system kernel. At this point, the typical approach is to expand until it is possible to self-host gcc or, more recently, clang. However, before we reach this point, we rapidly encounter a Turing tar-pit. This is where we lose the provenance of each file and this is where the security quagmire begins. Specifically, in the untrusted domain, it has become commonplace for binaries to depend upon more than 100 files from any of 19000 packages. These packages are typically downloaded and deployed without inspection. Furthermore, coupling between packages has become so tight that it is only possible to compile any piece if the remainder of a system is invariant. There are two problems with this arrangement and neither is solved with repeatable builds. The first problem is that we do not have Christmas light divisibility. We cannot sub-divide a system because the coupling is too tight. In BSD systems, we have:-

# make kernel
# make world

This provides separation between kernel-space and user-space. We can build a kernel with the existing user-space software. Using the new kernel, we can rebuild all of the user-space software. But don't ask how that process works because it doesn't follow the layered approach recommended by theory. The second problem is that this world-readable, compile-one-piece-at-a-time approach is highly vulnerable to privilege escalation. How many routes are there for malicious code to obtain global influence after one, two, three or more global re-compilations? Unknown, but there are probably very many. Do all of these paths obtain more scrutiny than OpenSSL? Definitely not. This cannot make a trustworthy system. The system is open but the dependencies are numerous. Therefore, it is not possible to inspect a system in a timely manner.

I wish to change this sloppy practice. I propose writing a C compiler in Lisp and then using the C compiler to write an operating system kernel. Specifically, the C compiler is written in Lisp and the Lisp interpreter is written in C. The current practice of writing the C compiler in C (or, more recently, writing the C++ compiler in C++) allows quines to be trivially propagated in the compiler. This can be overcome with the use of three compilers. However, for this, you will have to exclude all commercial compilers for which you do not have the source code. Likewise for trusting any third party who has access to the source of all three compilers. And that's the situation. I wish you good luck finding three C or C++ compilers which can compile each other.
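For readers unfamiliar with the term, a quine is a program which reproduces its own source code; the classic compiler attack hides exactly this kind of self-replication inside the compiler binary. A minimal, harmless C quine (three lines, no comments, because any comment would also have to reproduce itself):

#include <stdio.h>
char *s = "#include <stdio.h>%cchar *s = %c%s%c;%cint main(void) { printf(s, 10, 34, s, 34, 10, 10); return 0; }%c";
int main(void) { printf(s, 10, 34, s, 34, 10, 10); return 0; }

Compiling and running this prints its own three lines byte-for-byte (10 and 34 are the ASCII codes for newline and double quote). The trusting-trust attack is the same trick buried inside cc, which is why breaking the C-compiles-C cycle matters.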

I wish to raise the difficulty of writing such a quine to that of recognizing when the (compiled) Lisp interpreter is running the (interpreted) C compiler and then modifying the C compiler's parse tree. The task of writing a quine remains possible but it is substantially complicated because each compilation phase is separated by interpreted code.

Returning to the kernel, it is possible to compile a kernel and supporting programs with very few dependencies. For example, a hypothetical POSIX login.c (a known target of attack) would depend upon the C compiler written in Lisp, system headers, source input and the drivers and utilities required to make a kernel functional. The output of each compilation will be binaries of historical size. It is hoped that each binary can be inspected manually, especially if correctness is placed ahead of speed.
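To make the dependency claim concrete, here is a sketch of what such a minimal login.c might look like. It assumes classic crypt() hashes directly in the passwd file (no shadow file, no PAM, no terminal echo handling); the point is that its closure is the compiler, a handful of libc headers and this one source file:

#define _XOPEN_SOURCE 700
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pwd.h>
#include <crypt.h>      /* on some systems crypt() is declared in <unistd.h>; link with -lcrypt */

int main(void)
{
    char user[64], pass[128];
    struct passwd *pw;
    char *hash;

    fputs("login: ", stdout);
    fflush(stdout);
    if (fgets(user, sizeof user, stdin) == NULL)
        return 1;
    user[strcspn(user, "\n")] = '\0';

    fputs("password: ", stdout);              /* a real login disables echo via tcsetattr() */
    fflush(stdout);
    if (fgets(pass, sizeof pass, stdin) == NULL)
        return 1;
    pass[strcspn(pass, "\n")] = '\0';

    pw = getpwnam(user);                      /* look up the account */
    if (pw == NULL)
        return 1;
    hash = crypt(pass, pw->pw_passwd);        /* the stored hash doubles as the salt */
    if (hash == NULL || strcmp(hash, pw->pw_passwd) != 0)
        return 1;

    if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0)
        return 1;
    execl(pw->pw_shell, pw->pw_shell, (char *)NULL);
    return 1;                                 /* only reached if exec failed */
}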

PerlPowerTools and similar efforts comprehensively show that a subset of utilities may skip the compilation process and be implemented with an interpreter. In practice, more than 2/3 of utilities may be interpreted, although this proportion may be significantly reduced if launch delay or historical compatibility is an issue. (Much of the historical compatibility arises from pointless tests inside GNU build scripts and the assumed functionality thereof.)

The obvious question is why not use gcc or clang? I'll mention gcc first. Ignoring the extended mutual loop of dependencies across multiple software licences, gcc is a really good example of Greenspun's Tenth Rule ("Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp", to which a wag added "including Common Lisp"). On a single core Raspberry Pi, each of the four stages of gcc compilation requires more than 10 hours and ideally more than 700MB of RAM. On a homebrew mini-computer, this may require more than 4000 hours. For repeatable builds, each compilation stage would require zero bit errors over a period of six weeks. The worrying part is that gcc depends heavily upon GIMPLE, which is a DSL [Domain Specific Language] with Lisp-like syntax for parse tree manipulation: architecture independent optimizations followed by architecture dependent optimizations. The verbosity of GIMPLE explains why LTO [Link-Time Optimisation] offers GZip compression.

Clang, by contrast, dumps the Lisp syntax in favor of C++ templates. It also trades memory for speed; with a suitable infrastructure of processor caches, compilation takes about half the time. However, with the default compiler flags, gcc compiling clang exceeds the 31 bit writable address-space of a Raspberry Pi. On a homebrew mini-computer, it would take longer to self-host clang than gcc.

Obviously, a simpler compiler is required. Access to the source of such a compiler is also required. Where are they? Most compilers are proprietary or extensions (branded or unbranded) of gcc and clang. Even if we go back to an ancient version of gcc, we still have the notorious mutual dependency with gmake. That takes us back to the security quagmire.

It is for these reasons that I suggest a mutual dependency of compiler and interpreter. The (interpreted) compiler has similar functionality to gcc but may be written in a much more compact and expressive form. At this stage, we would be writing for correctness rather than speed. This is on the basis that slow runs on amateur hardware will be lucky to complete. It would be counter-productive to get tricksy when lower layers are in question. On this basis, the size of source code should be minimized without compromising legibility. It should be as short as possible but no shorter. If we do not have the processing power to implement an optimizing compiler, correctness and compactness become the only choices.

The next consideration is implementation conformance of compiler and interpreter. The laziest implementation of a C compiler may have very poor conformance with other implementations. However, in the long-term, low conformance is a false economy. It is undesirable to have a language dialect which is incompatible with standard tools. For example, standard lint utilities catch trivial errors. However, if the language dialect has unusual constructs then it is more difficult to avoid predictable blunders. Increasingly, compilers have integrated lint functionality. However, we don't have that luxury. Regardless, it may be desirable to perform linting on untrusted computers but only perform compilation on trustworthy hardware.
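As a toy illustration (mine, not from the proposal), this is the kind of trivial blunder that a standard lint pass, or a compiler's integrated warnings, catches on an untrusted machine before the source ever reaches the trustworthy one:

#include <stdio.h>

int main(void)
{
    int ready = 0;

    if (ready = 1)          /* lint/compiler warning: assignment used as a condition */
        puts("always taken, whatever 'ready' was");
    return 0;
}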

It does not help that C is mostly defined by implementation rather than a formal definition. It was not always like this. Unfortunately, most of the drift occurred when gcc became almost a strict superset of the proprietary Unix compilers of the 1990s. That includes the horrible compilers, often sold as an optional extra, for HP-UX, SunOS and Irix. A formal definition of C goes back to Kernighan & Ritchie's book, The C Programming Language, from the 1970s. More recent definitions include ISO C 1999. A further complication is that embedded programmers write an eclectic mix of C where features prior to the 1990 standard are mixed with features after the 1999 standard. This created a feedback loop which encouraged dependence on gcc.

More recently, there are efforts to nudge C toward Algol. This is achieved by restricting the grammar. My preference is towards Algol derivatives, such as Pascal or Jovial. Both have array bound checks at run-time. This alone eliminates a common cause of critical bugs: buffer overflow. Bruce Schneier agrees. It is better to have 10% of the processing power of a trustworthy computer than 100% of an untrusted computer. One method to implement this is to keep assertions (such as bound checks) in production code. Another is to make safety checks part of the language specification. Hence, my inclination towards Pascal and Jovial. These choices are not arbitrary. The initial target hardware (a trustworthy micro-coded mini-computer) has a passing resemblance to the Apollo Guidance Computer. One of its successors for aerospace navigation, the obsolete MIL-STD-1750A, was typically programmed in Jovial. It remains easy to write Jovial because Algol, Pascal and Jovial are typically converted to C and then compiled as C. However, in the general case, it is difficult to convert C to an Algol derivative due to array bound checks and other differences. It would be possible to implement a byte-code interpreter in Algol which circumvents the bound checks of the native language. However, this would incur a significant speed penalty. Although C derivatives and Algol derivatives are broadly similar, C is the lowest common denominator. Regardless, it is possible to write good code in Jovial and translate it to C with safeguards intact. This may be compiled using the same process as legacy code (which does not have the same protections).
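A sketch (in C, since that is the common back-end under discussion) of what keeping bound checks in production code looks like; in Pascal or Jovial the equivalent check is supplied by the language rather than written by hand, and it is deliberately not an assert() so that NDEBUG cannot compile it away:

#include <stdio.h>
#include <stdlib.h>

/* A bounds-checked accessor which stays active in production builds. */
static int checked_get(const int *a, size_t len, size_t i)
{
    if (i >= len) {                                   /* run-time bound check */
        fprintf(stderr, "bound check failed: index %zu, length %zu\n", i, len);
        abort();                                      /* fail loudly instead of overflowing */
    }
    return a[i];
}

int main(void)
{
    int buf[4] = { 1, 2, 3, 4 };

    printf("%d\n", checked_get(buf, 4, 2));           /* prints 3 */
    return checked_get(buf, 4, 9);                    /* aborts rather than reading past the end */
}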

In the general case, a C compiler is sufficiently flexible to provide the back-end for C++, Objective C, Fortran (which has extensive libraries), Pascal, Jovial and other compiled languages. Indeed, the use of a common compiler allows functions written in these languages to be statically linked with relatively little difficulty. We also have the option of compiling raw C (with no safeguards), MISRA C or similar (with retrospective safeguards) or languages which always had safeguards. Unfortunately, people fail to understand these options. This may accelerate the move away from "difficult" "low-level" languages, such as C. However, rather than moving to Algol languages or languages which can be statically linked with C, programmers now skip over interpreted languages (or JIT compiled languages) which are written in C derivatives (Perl, PHP, Python, Ruby, Java derivatives, Haskell, JavaScript, Lua) and settle upon languages and frameworks with multiple modes and back-ends (Rust, Swift, Dart, Meteor, Flutter) which offer mutually exclusive benefits and make solved problems into lucrative busywork.

My ultimate objection to a language which supposedly spans everything from below C to everything above Ruby is that this cannot be achieved with one language. A good interpreter has eval and a good compiler doesn't. These are mutually exclusive goals. This can only be fudged by giving the same name to multiple things. There are benefits for a compiler and interpreter to have the same grammar in the majority of cases. But that doesn't make a compiler and an interpreter the "same" language. If you can see past the grammar, it would be more accurate to describe C and Pascal as the "same" language.

Actually, I've comprehensively convinced myself that it would be useful to have a Lisp interpreter with a default dialect similar to C. The syntax and grammar of the compiler is fairly fixed but the syntax of the interpreter is a completely free variable. There may be cases where the overlap of grammar may be a hindrance. For example, when attempting to debug eval in the interpreter. However, this is greatly outweighed by the benefits.

The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • (Score: 1) by Ethanol-fueled on Sunday April 22 2018, @08:59PM

    by Ethanol-fueled (2792) on Sunday April 22 2018, @08:59PM (#670482) Homepage

    Man, you computer science people are so fucking weird. You invent things like Quines and Ackermann's function.

  • (Score: 3, Insightful) by hendrikboom on Monday April 23 2018, @02:25PM

    by hendrikboom (1125) on Monday April 23 2018, @02:25PM (#670745) Homepage Journal

    If you're into Pascal and Lisp, you might want to consider:

    Modula 3 -- a statically typed, garbage collected systems language. It is not the same as Modula 2. It has been used to write an operating system, and has mechanisms to escape from the rigorous run-time security that is otherwise standard. Those mechanisms are rarely needed and are clearly marked when used.

    Typed Racket -- a dialect of Scheme that is statically type-checked, and compatible with Racket, an implementation of not-statically-typed Scheme.

    Gambit -- an implementation of Scheme that compiles to C. (I believe it now also has a compiler that compiles to some machine codes)

    -- hendrik

  • (Score: 2) by Snow on Monday April 23 2018, @05:03PM (1 child)

    by Snow (1601) on Monday April 23 2018, @05:03PM (#670797) Journal

    You spend so much time creating these journal entries. It's like some sort of weird obsession.

    You are obviously very smart -- and you have interesting ideas -- but there is more to life than intellectual pursuit.

    I hope you are enjoying (real) life.

    • (Score: 1, Informative) by Anonymous Coward on Wednesday April 25 2018, @03:29AM

      by Anonymous Coward on Wednesday April 25 2018, @03:29AM (#671499)

      I get that you probably have good intentions, but there's no need to be like that. Many people seriously enjoy thinking deeply and designing things; there's nothing wrong with spending more time in their heads than pursuing socialization and sex.
