Wow, lots of good answers here already; I'll start by saying "read these other answers first". If your goal is to learn (as opposed to making the Best Possible Compiler), then my two cents:
- Parsing is boring. Skip it as much as possible. Tools like lex/yacc/bison are heavyweight with a large learning curve, used to compile heavy-weight-syntax languages (C, C++, Java). Instead skip to a trivial syntax; here are three choices. Forth: whitespace-delimited only; a great starter language (and my first, both interpreter and compiler). Pascal: a simple recursive-descent parser; skip the heavy lex/yacc/bison and just write the trivial parser by hand. "Mouse": an ancient bogus language used just to teach people how to write interpreters and compilers; think "Forth, but all tokens are exactly 1 character" - so parsing is really REALLY trivial (the sketch after this list shows just how trivial).
- Relocation, linking loaders, link steps, etc... all mechanical and boring as well. Skip 'em. Just generate code into a RAM buffer at first. Later you can figure out all the fiddly bits with linkers & loaders (and why they exist, such as allowing code to be relocated and still work).
- Generate the machine code yourself - don't generate something in-between (e.g. C code or Java bytecodes). You'll learn a crap-ton more, and for a simple compiler it's not all that hard. Crucial to the learning.
- Write the interpreter first: parse into something internal in little bits, then execute the bits - i.e., be able to execute the input language. Crucial to the learning. Once that works, instead of "execute the bits", you "generate the code that executes the bits". Ta-da, a compiler (sketched just after this list). This can be really, really trivial if you just puke the "execute the bits" code into memory. If later you write that memory buffer out to disk, it's a classic compiler. If later you execute that memory buffer, it's a "JIT".
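To make those last two bullets concrete, here's a minimal sketch in Python (my language choice purely for illustration; the names tokenize, interpret, and compile_to_python are invented). It tokenizes a Forth-ish calculator program by splitting on whitespace, executes the bits against a stack, then "generates the code that executes the bits" as Python text. Write that text to disk and it's a classic compiler; exec it in memory and it's a JIT in spirit:

```python
# Minimal Forth-ish calculator: whitespace-delimited tokens, one data stack.
# Supports integer literals, + * and . (pop-and-print) -- enough to show the idea.

def tokenize(src):
    return src.split()   # "parsing" a Forth-like language really is this trivial

def interpret(tokens):
    stack = []
    for tok in tokens:
        if tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif tok == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif tok == ".":
            print(stack.pop())
        else:
            stack.append(int(tok))   # anything else is a number literal

def compile_to_python(tokens):
    # Same dispatch as interpret(), but instead of *executing* each bit
    # we *emit the code* that executes it.
    lines = ["stack = []"]
    for tok in tokens:
        if tok == "+":
            lines.append("b, a = stack.pop(), stack.pop(); stack.append(a + b)")
        elif tok == "*":
            lines.append("b, a = stack.pop(), stack.pop(); stack.append(a * b)")
        elif tok == ".":
            lines.append("print(stack.pop())")
        else:
            lines.append(f"stack.append({int(tok)})")
    return "\n".join(lines)

program = "2 3 + 4 * ."                       # (2 + 3) * 4
interpret(tokenize(program))                  # prints 20
exec(compile_to_python(tokenize(program)))    # also prints 20
```

A real version would emit machine code into that RAM buffer instead of Python text, but the shape of the program is exactly the same.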
Hope this helps,
Cliff
Keeping it as simple as possible, and without getting into the details of how to make each part work, you need at least a few pieces:
- Lexical Analyzer: What do all these funny symbols and words represent? For anything significant to happen, you'll want to figure out what each of them is in advance and keep track of them for future use in some sort of table (a toy lexer sketch follows this list).
- Parser: You now have an abstract list of symbols from your lexical analysis, so you can look for patterns, like "variable := Expression;", where "Expression" is a pattern of its own.
- Code Generator: Once you've parsed (recognized) some code, given your symbol table (from the lexical analyzer), you should be able to figure out the intent (semantics) of that code. What does it do? Write out the equivalent instructions in your target language.
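To give a feel for the lexical-analysis piece, here's a toy lexer in Python (the token names, regexes, and the lex function are my own invention, not from any particular tool). It turns "count := count + 1;" into a token list and records each identifier in a symbol table for future use:

```python
import re

# Toy token definitions; in a real lexer, longer operators like := must be
# tried before any shorter prefix of them.
TOKEN_SPEC = [
    ("NUMBER",   r"\d+"),
    ("IDENT",    r"[A-Za-z_]\w*"),
    ("ASSIGN",   r":="),
    ("OP",       r"[+\-*/]"),
    ("SEMI",     r";"),
    ("SKIP",     r"\s+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    tokens, symbols = [], {}
    for m in MASTER.finditer(src):
        kind, text = m.lastgroup, m.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r}")
        if kind == "IDENT":
            symbols.setdefault(text, {"kind": "variable"})  # record for future use
        tokens.append((kind, text))
    return tokens, symbols

tokens, symbols = lex("count := count + 1;")
print(tokens)   # [('IDENT','count'), ('ASSIGN',':='), ('IDENT','count'), ...]
print(symbols)  # {'count': {'kind': 'variable'}}
```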
In other words, it works a lot like people really want machine translation to work: Figure out the words, figure out the grammar, and transform the input text into the output language.
Some quick things to keep in mind in terms of implementation:
- Lexical analysis is almost certainly what you expect: comparing strings and recording what you found.
- Parsing has two major schools of thought, state machines and recursive descent. They're equivalent in power; both ask, "if we've seen this so far, what possibilities do we expect next?" The state-machine approach works better if you're generating the parser (with a tool like yacc, "yet another compiler compiler"), whereas recursive descent is far easier to write by hand (there's a small recursive-descent sketch after this list).
- You probably want to generate code for a microprocessor. Resist that impulse, especially at first. It's much easier to generate output in another high-level language, if only because you have a tool (another compiler) to validate your output (see the code-generation sketch after this list).
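Here's a hand-written recursive-descent sketch in Python for statements of the form "variable := Expression;" (the grammar and all names are invented for illustration). Notice how each grammar rule becomes one function, and each function embodies "given what we've seen, what do we expect next?":

```python
# Grammar (invented for illustration):
#   Statement  -> IDENT ':=' Expression ';'
#   Expression -> Term (('+' | '-') Term)*
#   Term       -> NUMBER | IDENT
import re

def lex(src):
    # One-line tokenizer, just enough to feed the parser.
    return re.findall(r":=|[+\-;]|\d+|[A-Za-z_]\w*", src)

class Parser:
    def __init__(self, tokens):
        self.toks, self.pos = tokens, 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if tok is None or (expected and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    # One function per grammar rule -- that's all recursive descent is.
    def statement(self):
        name = self.eat()                 # IDENT
        self.eat(":=")
        expr = self.expression()
        self.eat(";")
        return ("assign", name, expr)

    def expression(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            node = (op, node, self.term())
        return node

    def term(self):
        tok = self.eat()
        return ("num", int(tok)) if tok.isdigit() else ("var", tok)

ast = Parser(lex("count := count + 1;")).statement()
print(ast)   # ('assign', 'count', ('+', ('var', 'count'), ('num', 1)))
```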
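And a matching code-generation sketch, emitting another high-level language (C here) rather than machine code; the nested-tuple AST shape is the one invented in the parser sketch above, not any standard format. Because the output is C, you can hand it to a C compiler to validate it:

```python
# Emit C source from the toy AST built by the parser sketch above.
# AST shape (invented): ('assign', name, expr), ('+', l, r), ('-', l, r),
#                       ('num', n), ('var', name)

def gen_expr(node):
    kind = node[0]
    if kind == "num":
        return str(node[1])
    if kind == "var":
        return node[1]
    if kind in ("+", "-"):
        return f"({gen_expr(node[1])} {kind} {gen_expr(node[2])})"
    raise ValueError(f"unknown node {kind!r}")

def gen_statement(node):
    assert node[0] == "assign"
    _, name, expr = node
    return f"{name} = {gen_expr(expr)};"

ast = ("assign", "count", ("+", ("var", "count"), ("num", 1)))
print(gen_statement(ast))   # count = (count + 1);
```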
Once you have those pieces together, you can look at all sorts of other issues, like handling errors, optimizing code, and so forth, but those are the big pieces that you probably can't do without.