It seems that GCC and LLVM-Clang use handwritten recursive descent parsers, and not machine-generated, Bison/Flex-based, bottom-up parsers.
Could someone here please confirm that this is the case? And if so, why do mainstream compiler frameworks use handwritten parsers?
Update: interesting blog on this topic here
- Almost all the mainstream compilers are using handwritten parsers. What's the problem with that? – SK-logic Jun 11 '11 at 23:22
- You have to do it (semi-)manually if you need performance. – Gene Bushuyev Jun 14 '11 at 1:56
- And not only performance: better error messages, the ability to recover, etc. – SK-logic Jun 17 '11 at 7:39
- What about MS Visual Studio? Though not open source, could someone from MS verify that they too are using a handwritten recursive descent parser? – OrenIshShalom Nov 1 '17 at 5:00
- @GeneBushuyev, from the GCC wiki: "...Although timings showed a 1.5% speedup, the main benefits are facilitating of future enhancements..." This speedup seems rather marginal... – OrenIshShalom Nov 1 '17 at 5:49
Yes:
GCC used a yacc (bison) parser once upon a time, but it was replaced with a hand-written recursive descent parser at some point in the 3.x series: see http://gcc.gnu.org/wiki/New_C_Parser for links to relevant patch submissions.
Clang also uses a hand-written recursive descent parser: see the section "A single unified parser for C, Objective C, C++ and Objective C++" near the end of http://clang.llvm.org/features.html .
- No: even C, the simplest of the three, has an ambiguous grammar. For example, `foo * bar;` could parse as either a multiplication expression (with the result unused) or a declaration of a variable `bar` with type pointer-to-`foo`. Which one is correct depends on whether a `typedef` for `foo` is in scope at the time, which is not something that can be determined with any amount of lookahead. But that just means that the recursive descent parser needs some ugly extra machinery added to handle that. – Matthew Slattery Nov 5 '12 at 17:30
- I can confirm from empirical evidence that C++11, C, and Objective-C have context-free grammars that a GLR parser can handle. – Ira Baxter Dec 17 '12 at 20:49
- Regarding context sensitivity, this answer claims it is neither: parsing these languages is likely Turing-complete. – Ioannis Filippidis Nov 21 '14 at 3:26
There's a folk-theorem that says C is hard to parse, and C++ essentially impossible.
It isn't true.
What is true is that C and C++ are pretty hard to parse using LALR(1) parsers without hacking the parsing machinery and tangling in symbol table data. GCC in fact used to parse them, using YACC and additional hackery like this, and yes it was ugly. Now GCC uses handwritten parsers, but still with the symbol table hackery. The Clang folks never tried to use automated parser generators; AFAIK the Clang parser has always been hand-coded recursive descent.
What is true, is that C and C++ are relatively easy to parse with stronger automatically generated parsers, e.g., GLR parsers, and you don't need any hacks. The Elsa C++ parser is one example of this. Our C++ Front End is another (as are all our "compiler" front ends, GLR is pretty wonderful parsing technology).
Our C++ front end isn't as fast as GCC's, and certainly slower than Elsa; we've put little energy into tuning it carefully because we have other more pressing issues (nonetheless it has been used on millions of lines of C++ code). Elsa is likely slower than GCC simply because it is more general. Given processor speeds these days, these differences might not matter a lot in practice.
But the "real compilers" that are widely distributed today have their roots in compilers of 10 or 20 years ago or more. Inefficiencies then mattered much more, and nobody had heard of GLR parsers, so people did what they knew how to do. Clang is certainly more recent, but then folk theorems retain their "persuasiveness" for a long time.
You don't have to do it that way anymore. You can very reasonably use GLR and other such parsers as front ends, with an improvement in compiler maintainability.
What is true is that getting a grammar that matches your friendly neighborhood compiler's behavior is hard. While virtually all C++ compilers implement (most of) the original standard, they also tend to have lots of dark corner extensions, e.g., DLL specifications in MS compilers, etc. If you have a strong parsing engine, you can spend your time trying to get the final grammar to match reality, rather than trying to bend your grammar to match the limitations of your parser generator.
EDIT November 2012: Since writing this answer, we've improved our C++ front end to handle full C++11, including ANSI, GNU, and MS variant dialects. While there was lots of extra stuff, we don't have to change our parsing engine; we just revised the grammar rules. We did have to change the semantic analysis; C++11 is semantically very complicated, and this work swamps the effort to get the parser to run.
EDIT February 2015: ... now handles full C++14. (See get human readable AST from c++ code for GLR parses of a simple bit of code, and C++'s infamous "most vexing parse").
EDIT April 2017: Now handles (draft) C++17.
- PostScript: Just as getting the grammar to match what the vendors really do is hard, getting name and type resolution to match the different vendors' interpretations of the C++11 manual is even harder, because the only evidence you have are programs that compile slightly differently, if you can find them. We're largely past that as of Aug 2013 for C++11 proper, but I despair a little at the C++ committee, which seems hell-bent on producing an even larger (and from experience, more confusing) standard in the form of C++1y. – Ira Baxter Aug 16 '13 at 7:53
- @Martin: our parser parses it both ways, producing a tree containing special "ambiguity nodes" whose children are the alternative parses; the children do maximal sharing of their children, so we end up with a DAG instead of a tree. After parsing completes, we run an attribute grammar evaluator (AGE) over the DAG (fancy name for "walk the tree and do stuff" if you don't know it) which computes the types of all declared identifiers. ... – Ira Baxter Jan 22 '14 at 11:17
- ... The ambiguous children can't both be type-consistent; the AGE, on discovering an ambiguous child that cannot be sensibly typed, simply deletes it. What is left are the well-typed children; thus, we have determined which parse of "foo*bar;" is correct. This trick works for all kinds of crazy ambiguities found in the real grammars we build for the real dialects of C++11, and completely separates parsing from semantic analysis for names. This clean separation means lots less engineering work to do (no tangles to debug). See stackoverflow.com/a/1004737/120163 for more discussion. – Ira Baxter Jan 22 '14 at 11:19
- @TimCas: Actually, I'm with you on railing at the apparent stupidity of designing language syntax (and semantics) so complicated that it is so hard to get it right (yes, the C++ language suffers badly here). I wish language design committees would design syntax so that simpler parsing technologies would work, and explicitly define the language semantics and check it with some semantic analysis tools. Alas, the world doesn't seem to be like that. So, I take the view that you build what you have to build as well as you can, and get on with life, in spite of the awkwardness. – Ira Baxter Feb 11 '15 at 3:55
Clang's parser is a hand-written recursive-descent parser, as are several other open-source and commercial C and C++ front ends.
Clang uses a recursive-descent parser for several reasons:
- Performance: a hand-written parser allows us to write a fast parser, optimizing the hot paths as needed, and we're always in control of that performance. Having a fast parser has allowed Clang to be used in other development tools where "real" parsers are typically not used, e.g., syntax highlighting and code completion in an IDE.
- Diagnostics and error recovery: because you're in full control with a hand-written recursive-descent parser, it's easy to add special cases that detect common problems and provide great diagnostics and error recovery (e.g., see http://clang.llvm.org/features.html#expressivediags). With automatically generated parsers, you're limited to the capabilities of the generator.
- Simplicity: recursive-descent parsers are easy to write, understand, and debug. You don't need to be a parsing expert or learn a new tool to extend/improve the parser (which is especially important for an open-source project), yet you can still get great results.
Overall, for a C++ compiler, it just doesn't matter much: the parsing part of C++ is non-trivial, but it's still one of the easier parts, so it pays to keep it simple. Semantic analysis (particularly name lookup, initialization, overload resolution, and template instantiation) is orders of magnitude more complicated than parsing. If you want proof, go check out the distribution of code and commits in Clang's "Sema" component (for semantic analysis) vs. its "Parse" component (for parsing).
- Yes, semantic analysis is harder by a lot. We have some 4,000 lines of grammar rules that comprise our C++11 grammar, and some 180,000 lines of attribute grammar code for the "semantic analyses" Doug lists above, with another 100,000 lines of supporting code. Parsing really isn't the problem, although it is hard enough if you start on the wrong foot. – Ira Baxter Aug 16 '13 at 7:58
- I'm not so sure that hand-written parsers are necessarily better for error reporting/recovery. It does appear that people have put more energy into such parsers than into enhancing parsers produced by automatic parser generators in practice. There seems to be pretty good research on the topic; this particular paper has really caught my eye: M.G. Burke, 1983, A practical method for LR and LL syntactic error diagnosis and recovery, PhD thesis, Department of Computer Science, New York University. See archive.org/details/practicalmethodf00burk – Ira Baxter Jan 28 '16 at 4:56
- ... continuing this train of thought: if you are willing to modify/extend/customize your hand-built parser to check for special cases for better diagnosis, then you should be willing to make an equal investment in better diagnoses for a mechanically generated parser. For any special parse that you can encode for the manual one, you can code a check for the mechanical one, too (and for (G)LR parsers, you can pretty much do this as semantic checks on reductions). To the extent that seems unappetizing, one is just being lazy, but that's not an indictment of mechanically generated parsers IMHO. – Ira Baxter Jan 28 '16 at 8:43
GCC's parser is handwritten. I suspect the same is true of Clang. This is probably for a few reasons:
- Performance: something that you've hand-optimized for your particular task will almost always perform better than a general solution; abstraction usually has a performance cost.
- Timing: at least in the case of GCC, GCC predates a lot of free developer tools (it came out in 1987). There was no free version of yacc, etc. at the time, which I'd imagine would've been a priority to the people at the FSF.
This is probably not a case of "not invented here" syndrome, but more along the lines of "there was nothing optimized specifically for what we needed, so we wrote our own".
- No free version of yacc in 1987? I think there were free versions when yacc was first delivered with Unix back in the '70s. And IIRC (the other poster says the same), GCC used to have a yacc-based parser. I heard the reason for changing it was to get better error reporting. – Ira Baxter Jun 12 '11 at 5:18
- I'd like to add that it's often easier to generate good error messages from a handwritten parser. – Dietrich Epp Jun 12 '11 at 6:27
- Your point on timing is inaccurate: GCC used to have a yacc-based parser, but it was replaced with a handwritten recursive descent parser later on. – Tommy Andersen Jun 12 '16 at 13:27
Weird answers there!
C/C++ grammars aren't context-free. They are context-sensitive because of the `Foo * bar;` ambiguity: we have to build a list of typedefs to know whether `Foo` is a type or not.
Ira Baxter: I don't see the point of your GLR approach. Why build a parse tree that contains ambiguities? Parsing means resolving ambiguities and building the syntax tree. You resolve these ambiguities in a second pass, so this isn't any less ugly; for me it is far more ugly...
Yacc is an LR(1) (or LALR(1)) parser generator, but it can easily be modified to be context-sensitive, and there is nothing ugly about that. Yacc/Bison was created to help parse the C language, so it is probably not the ugliest tool for generating a C parser...
Until GCC 3.x, the C parser was generated by yacc/bison, with the typedef table built during parsing. With "in-parse" typedef-table building, the C grammar becomes locally context-free and, furthermore, "locally LR(1)".
Now, in GCC 4.x, it is a recursive descent parser. It is essentially the same parser as in GCC 3.x: it is still LR(1) and has the same grammar rules. The difference is that the yacc parser has been rewritten by hand, the shift/reduce actions are now hidden in the call stack, and there is no "state454: if (nextsym == '(') goto state398" as in GCC 3.x's yacc-generated parser. This makes it easier to patch, to handle errors and print nicer messages, and to perform some of the subsequent compilation steps during parsing, at the price of code that is much harder to read for a GCC newcomer.
Why did they switch from yacc to recursive descent? Because avoiding yacc is practically necessary for parsing C++, and because GCC aims to be a multi-language compiler, i.e., to share as much code as possible between the different languages it can compile. This is why the C++ and C parsers are written in the same way.
C++ is harder to parse than C because it isn't "locally" LR(1) like C; it is not even LR(k). Consider `func<4 > 2>`, a template function instantiated with `4 > 2`; i.e., `func<4 > 2>` has to be read as `func<1>`. This is definitely not LR(1). Now consider `func<4 > 2 > 1 > 3 > 3 > 8 > 9 > 8 > 7 > 8>`. This is where recursive descent can easily resolve the ambiguity, at the price of a few more function calls (parse_template_parameter being the ambiguous parser function: if parse_template_parameter(17 tokens) fails, try again with parse_template_parameter(15 tokens), then parse_template_parameter(13 tokens)... until one works).
I don't know why it wouldn't be possible to add recursive sub-grammars to yacc/bison; maybe that will be the next step in GCC/GNU parser development?
- "For me, it is far more ugly": what I can tell you is that engineering a production-quality parser using GLR and delayed ambiguity resolution is practical with a really small team. All the other solutions I have seen have involved years of gnashing of teeth in public over the backflips and hacks required to make it work with LR, recursive descent, you name it. You can postulate lots of other cool new parsing technologies, but as far as I can tell, that's just more gnashing of teeth at this point. Ideas are cheap; execution is dear. – Ira Baxter Aug 16 '13 at 7:51
- @Fizz: Interesting paper on parsing Fortress, a complex scientific programming language. They said several things of note: a) classic parser generators (LL(k), LALR(1)) can't handle tough grammars; b) they tried GLR and had trouble with scale, but the developers were inexperienced, so they didn't complete it (that isn't GLR's fault); and c) they used a backtracking (transactional) packrat parser and put a lot of effort into it, including work to produce better error messages. Regarding their example of parsing "{|x||x←mySet,3|x}", I believe GLR would handle it just fine, and it doesn't need spaces. – Ira Baxter Jun 13 '20 at 16:27
> It seems that GCC and LLVM-Clang are using handwritten recursive descent parsers, and not machine generated, Bison-Flex based, bottom up parsing.
I don't think Bison in particular can handle the grammar without parsing some things ambiguously and doing a second pass later.
I know Haskell's Happy allows for monadic (i.e. state-dependent) parsers that can resolve the particular issue with C syntax, but I know of no C parser generators that allow a user-supplied state monad.
In theory, error recovery would be a point in favor of a handwritten parser, but my experience with GCC/Clang has been that the error messages are not particularly good.
As for performance, some of the claims seem unsubstantiated: generating a big state machine with a parser generator should result in something that runs in O(n), and I doubt parsing is the bottleneck in much tooling.