As someone who writes code, you undoubtedly do so using one or more programming languages. You probably enjoy writing code in some of them because of their elegance, expressive power or some other reason, and you have probably kept your distance from others because of features that you think are badly implemented.
I think every curious developer has wondered at least once how programming languages work. It is normal to be fascinated by them. Unfortunately, most answers we read are very academic or theoretical, and others contain too many implementation details. After reading them we still wonder how things work in practice.
Have you, however, thought about how those programming languages we love and hate came to be? How those particular features we like (or do not like) are designed and implemented and why? How those magic black boxes that are compilers and interpreters work? How code written in JavaScript, Ruby, Python, etc, turns into an executable program? Or have you ever thought about building your own programming language?
Many people have difficulties or frustrations with the programming languages they use every day. Some want things to be handled more abstractly, while others dislike implementing features they wish were 'standard'. Whether you are an IT professional or just a hobbyist, many times you may find yourself wanting to create a new programming language.
Designing a programming language
If you just want to write your own compiler to learn how these things work, you can skip this phase. You can take a subset of an existing language, or come up with a simple variation of it, and get started. However, if you plan to create your very own programming language, you will have to give it some thought.
I think of designing a programming language as divided into two phases:
- The big-picture phase
- The refinement phase
In the first phase we answer the fundamental questions about our language.
- What execution paradigm do we want to use? Will it be imperative or functional? Or maybe based on state machines or business rules?
- Do we want static typing or dynamic typing?
- What sort of programs will this language be best at? Will it be used for small scripts or large systems?
- What matters most to us: performance? Readability?
- Do we want it to be similar to an existing programming language? Will it be aimed at C developers, or easy to learn for those coming from Python?
- Do we want it to work on a specific platform (JVM, CLR)?
- What sort of metaprogramming capabilities do we want to support, if any? Macros? Templates? Reflection?
In the second phase we will keep evolving the language as we use it. We will run into issues, into things that are very difficult or impossible to express in our language and we will end up evolving it. The second phase might not be as glamorous as the first one, but it is the phase in which we keep tuning our language to make it usable in practice, so we should not underestimate it.
Building a compiler
Building a compiler is the most exciting step in creating a programming language. Once we have a compiler we can actually bring our language to life. A compiler lets us start playing with the language, use it and identify what we missed in the initial design. It lets us see the first results. It is hard to beat the joy of executing the first program written in our brand new programming language, no matter how simple that program may be.
But how do we build a compiler?
As with everything complex, we do it in steps:
- We build a parser: the parser is the part of our compiler that takes the text of our programs and understands which commands they express. It recognizes the expressions, the statements and the classes, and it creates internal data structures to represent them. The rest of the compiler will work with those data structures, not with the original text
- (optional) We translate the parse tree into an Abstract Syntax Tree. Typically the data structures produced by the parser are a bit low level, as they contain a lot of details which are not crucial for our compiler. Because of this we frequently want to rearrange the data structures into something slightly higher level
- We resolve symbols. In the code we write things like "a + 1". Our compiler needs to figure out what "a" refers to. Is it a field? Is it a variable? Is it a method parameter? We examine the code to answer that
- We validate the tree. We need to check that the programmer did not make errors. Are they trying to sum a boolean and an int? Or accessing a non-existent field? We need to produce appropriate error messages
- We generate the machine code. At this point we translate the code into something the machine can execute. It could be proper machine code or bytecode for some virtual machine
- (optional) We perform the linking. In some cases we need to combine the machine code produced for our programs with the code of static libraries we want to include, in order to generate a single executable
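The steps above can be sketched in a few dozen lines of Python. This is only an illustration under stated assumptions: the toy language (integer literals, true/false, names and "+"), the tuple-shaped tree, the symbol table format and the stack-machine instructions PUSH, LOAD and ADD are all invented for this example. The parser here builds the Abstract Syntax Tree directly, so the optional translation step is folded into parsing, and there is no linking step.

```python
import re

# Hypothetical toy language: integer literals, true/false, names, and '+'.
TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|\+)")

def tokenize(text):
    """Split the source text into tokens, rejecting anything unrecognized."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse(tokens):
    """Parsing: build a tree for the grammar  expr := atom ('+' atom)*"""
    def atom(i):
        if i >= len(tokens) or tokens[i] == "+":
            raise SyntaxError("expected a value")
        tok = tokens[i]
        if tok.isdigit():
            return ("int", int(tok)), i + 1
        if tok in ("true", "false"):
            return ("bool", tok == "true"), i + 1
        return ("name", tok), i + 1

    node, i = atom(0)
    while i < len(tokens):
        if tokens[i] != "+":
            raise SyntaxError(f"expected '+', got {tokens[i]!r}")
        right, i = atom(i + 1)
        node = ("add", node, right)
    return node

def check(node, symbols):
    """Symbol resolution and validation: return the type of the node."""
    kind = node[0]
    if kind in ("int", "bool"):
        return kind
    if kind == "name":
        if node[1] not in symbols:
            raise NameError(f"undefined symbol {node[1]!r}")
        return symbols[node[1]]
    left, right = check(node[1], symbols), check(node[2], symbols)
    if left != "int" or right != "int":
        raise TypeError(f"cannot add {left} and {right}")
    return "int"

def generate(node):
    """Code generation: emit instructions for an imaginary stack machine."""
    kind = node[0]
    if kind in ("int", "bool"):
        return [("PUSH", node[1])]
    if kind == "name":
        return [("LOAD", node[1])]
    return generate(node[1]) + generate(node[2]) + [("ADD",)]

def compile_source(text, symbols):
    """Run the whole pipeline; symbols maps each known name to its type."""
    tree = parse(tokenize(text))
    check(tree, symbols)
    return generate(tree)

print(compile_source("a + 1", {"a": "int"}))
```

With the symbol table {"a": "int"}, the program "a + 1" compiles to a LOAD, a PUSH and an ADD instruction, while "true + 1" is rejected in the validation step with a type error. A real compiler would track source positions for its error messages and handle far more than one operator.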
Do we always need a compiler? No. We can replace it with other means to execute the code:
- We can write an interpreter: an interpreter is essentially a program that does steps 1-4 of a compiler and then directly executes what is specified by the Abstract Syntax Tree
- We can write a transpiler: a transpiler will do what is specified in steps 1-4 and then output code in some language for which we already have a compiler (for example C++ or Java)
These two alternatives are perfectly valid and frequently it makes sense to choose one of these two because the effort required is typically smaller.
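To make the difference concrete, here is a minimal sketch of both alternatives working on the same tree. The tuple-based tree shape, the function names and the generated function name "compute" are assumptions made for this example; the tree is built by hand, standing in for the output of steps 1-4.

```python
# Hand-built tree for the expression  a + 1 + 2  (the tuple shapes are assumed).
tree = ("add", ("add", ("name", "a"), ("int", 1)), ("int", 2))

def interpret(node, env):
    """Interpreter: walk the tree and compute the result directly."""
    kind = node[0]
    if kind == "int":
        return node[1]
    if kind == "name":
        return env[node[1]]
    if kind == "add":
        return interpret(node[1], env) + interpret(node[2], env)
    raise ValueError(f"unknown node kind {kind!r}")

def transpile(node):
    """Transpiler: walk the same tree but emit C++ source text instead."""
    kind = node[0]
    if kind == "int":
        return str(node[1])
    if kind == "name":
        return node[1]
    if kind == "add":
        return f"({transpile(node[1])} + {transpile(node[2])})"
    raise ValueError(f"unknown node kind {kind!r}")

def transpile_program(node, params):
    """Wrap the expression in a C++ function taking int parameters."""
    args = ", ".join(f"int {p}" for p in params)
    return f"int compute({args}) {{ return {transpile(node)}; }}"

print(interpret(tree, {"a": 39}))
print(transpile_program(tree, ["a"]))
```

The two functions have the same shape: both visit each node once. The interpreter returns values, while the transpiler returns text that an existing compiler can take over from there.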
A standard library for your programming language
Any programming language needs to do a few things:
- Printing on the screen
- Accessing the filesystem
- Using network connections
- Creating GUIs
These are the basic functionalities to interact with the rest of the system. Without them a language is basically useless. How do we provide these functionalities? By creating a standard library. This will be a set of functions or classes that can be called in the programs written in our programming language but that will be written in some other language. For example, many languages have standard libraries written at least partially in C.
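As a rough sketch of that idea, suppose our language is implemented by an interpreter written in Python. The standard library can then start out as a table of functions written in the host language. The built-in names (print, read_file, file_exists) and the call_builtin helper are hypothetical, chosen only for illustration.

```python
import os
import sys

def builtin_print(*values):
    """Host-language implementation of the toy language's print."""
    sys.stdout.write(" ".join(str(v) for v in values) + "\n")

def builtin_read_file(path):
    """Filesystem access, delegated to the host language."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()

def builtin_file_exists(path):
    return os.path.exists(path)

# The "standard library": names visible in our language, implemented in Python.
STDLIB = {
    "print": builtin_print,
    "read_file": builtin_read_file,
    "file_exists": builtin_file_exists,
}

def call_builtin(name, args):
    """What an interpreter might do when it meets a call to a built-in."""
    if name not in STDLIB:
        raise NameError(f"unknown built-in {name!r}")
    return STDLIB[name](*args)

call_builtin("print", ["hello from the standard library"])
```

Programs written in our language only ever see the names in the table. How each function is implemented stays hidden in the host language.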
A standard library can contain much more, for example classes to represent the main collections like lists and maps, or to process common formats like JSON or XML. Often it will also contain advanced functionality to process strings and regular expressions.
In other words, writing a standard library is a lot of work. It is not glamorous, it is not conceptually as interesting as writing a compiler but it is still a fundamental component to make a programming language viable.
There are ways to avoid this requirement. One is to make the language run on some platform and make it possible to reuse the standard library of another language. For example, all languages running on the JVM can simply reuse the Java standard library.
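The same trick can be sketched for a language hosted on Python instead of the JVM: rather than writing our own library, we let programs reach functions of the host's standard library by qualified name. The helper names below are assumptions made for this example. A real implementation would probably restrict which modules can be reached.

```python
import importlib

def resolve_host_function(qualified_name):
    """Map a name like 'math.sqrt' to a function from the host's library."""
    module_name, _, attr = qualified_name.rpartition(".")
    if not module_name:
        raise NameError(f"expected 'module.function', got {qualified_name!r}")
    module = importlib.import_module(module_name)
    func = getattr(module, attr)
    if not callable(func):
        raise TypeError(f"{qualified_name!r} is not callable")
    return func

def call_host(qualified_name, *args):
    """Call a host library function on behalf of a program in our language."""
    return resolve_host_function(qualified_name)(*args)

print(call_host("math.sqrt", 16))
print(call_host("json.dumps", [1, 2]))
```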
Recommended resources:
Programming Languages: An Interpreter-Based Approach - Samuel N. Kamin - this isn't a well-known book, but it totally re-ignited my interest. It goes through various languages and their features and builds interpreters for them (in Pascal, it's an old book).
How to Create Your Own Programming Language: Want to create a programming language, but don't feel like going through one of those expensive and boring 1000-page books? Well, you're not alone ...
LLVM: Writing a Simple Programming Language - a step-by-step C++ tutorial on how to build a compiled language (using LLVM). You should basically use LLVM for the back-end, since it is open source and will save you hundreds of man-years of work. It's a well-solved problem that is totally applicable to anything anyone can build. Clang is the C/C++/Objective-C front-end.
Types and Programming Languages - Benjamin C. Pierce - this is perhaps heavy mathematically, but you can skip those parts and still understand the concepts. If you want to properly understand type systems, this book has it all. New languages like Scala, Kotlin, Swift (and others such as TypeScript) are beginning to include things like set-theoretic type systems (optionals, unions, intersections). This book has been incredibly influential for me, and will lead you down many good rabbit holes.