This article is about my Perl compiler kit, which comes in the form of ordinary Perl extension modules and lets you compile your Perl code in various ways. An alpha distribution is freely available, licensed under the GPL and the Artistic License just as Perl itself is.
There have been many requests over the years for a Perl compiler. In most cases, the question "Is there a Perl compiler?" turns out to mean "Is there a magic filter which will make my Perl program go faster without me having to do any work?" Explanations that Perl already compiles code into an internal format and implements many standard compiler optimizations fall upon deaf ears.
In December 1995, Tom Christiansen sent a message to the perl5-porters mailing list saying that he was so frustrated with queries about a Perl compiler that he was considering offering a laptop computer to the implementor of the first/best Perl compiler. I knew there were frequent requests for a compiler, but I didn't know they were that frequent, so I started thinking about whether I could help.
I saw that it was possible to get the effects that most people would expect from a Perl compiler in a few different ways. Once a Perl program is compiled by /usr/bin/perl (or equivalent) into its own internal format, a frozen version of that data can be saved to a file, and reconstructing that frozen version is faster than parsing Perl source code from scratch. Running the frozen version reduces the startup time, since it replaces the parse and compile steps with the simpler step of loading the frozen format. Moreover, when walking through those internal structures and writing out a frozen version (or even as a post-processing step) there's extra time available for optimizations. Whereas Perl's internal parser and compiler have to compile as fast as they can (since it's all perceived as run-time by the user), a separate compiler can afford to do more.
It has always been awkward to say exactly whether code written in Perl (the language) is compiled or interpreted by perl (the program). It becomes even more awkward when the compiler kit is considered. It's an odd beast and calling it a compiler is rather misleading in some ways. The internal parsing and compiling done by perl itself turns your Perl source code into a collection of structures representing your program's operations and data. The structures which represent operations are called OPs; the structures which represent data (variables and so on) are called SVs (for "Scalar Value"). Each OP structure contains a pointer to the C function which implements it, pointers to the data needed at runtime and pointers linking the structure into the syntax tree. One of the primary goals of the "toolkit" side of the compiler extensions is to give convenient access to all of those OPs and SVs.
The compiler currently has three main backends from which you can choose. Two of the backends are primarily decompilers rather than compilers. Each one lets Perl parse its way through your program and compile it all into those OPs and SVs, then takes control before letting it actually run. The first backend writes out C source which can be compiled into an executable consisting mostly of pre-initialized OPs and SVs. When the resulting executable is run, it only has to fix up those structures in memory before diving straight into the ordinary Perl runtime loop. You can still use eval() to execute strings at run-time, of course, but Perl's own parse/compile/run routines handle them. The C source code output of the backend can be compiled by any standard C compiler (and maybe a few non-standard ones).
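In the distributions I've seen, the kit includes a helper script, cc_harness, which invokes your C compiler with the include and library flags that perl itself was built with; the exact script name and options may vary with your copy, but turning a script into a standalone binary looks something like this:

    perl -MO=C myprog.pl > myprog.c
    perl cc_harness -o myprog myprog.c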
The second backend has a similar goal but writes out a platform-independent bytecode rather than C. The bytecode interpreter is a replacement for the parsing and compiling that Perl usually has to do to your program every time it runs. The bytecode itself consists of instructions which create OPs and SVs of various types and fill in their fields with the necessary data. And there's an intermediate step: generating source code in a new little "assembly language" whose instruction set covers the creation and field-by-field initialization of all the structures that Perl needs. The assembler and corresponding disassembler are ordinary Perl modules. The main parts of the assembler, the disassembler and the bytecode interpreter (a C function) are all auto-generated from a single Perl program which contains a descriptive table of the bytecode instruction set. The resulting bytecode is platform independent with respect to both endianness and whether the architecture is 32-bit or 64-bit.
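Just to give the flavor of that assembly language, here are a few lines in its style. These mnemonics and operands are illustrative guesses patterned on the instruction table, not verbatim output; check the table in your copy of the kit for the real instruction set:

    newopx 261       # create a new OP, with class and type packed into the operand
    op_next 3        # point its op_next field at op number 3
    op_seq 27        # the parser's sequence number (more on this field at the end)
    ldsv 4           # make SV number 4 the current SV
    sv_flags 1028    # fill in the current SV's flags field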
Of the three backends currently written, only the third, the "CC" backend, does what most people expect a compiler to do. It turns your Perl code, again via a C compiler, into machine code which corresponds to the original Perl code you wrote. Paradoxically, it may well turn out to be less useful than the first two for the majority of Perl code. This backend starts with the internal compiled format of the Perl program, walks the syntax tree in execution order and writes out C source code with the appropriate control flow. The instructions are written out either as calls to the usual internal Perl runtime functions or else as inline C source for "hot" instructions. This takes advantage of compile-time knowledge to reduce run-time tests to a minimum or to do away with them completely.
In both types of compilation (freezing the syntax tree and writing out linear code in execution order), there are opportunities for optimization. The time Perl spends on parsing, compiling and optimizing a Perl program is constrained because, to the user, it's still time required to run the program. Perl's peephole optimizer already helps by folding constants, changing operations from floating point to integer-only, and eliminating dead code and unneeded block entry/exit operations by replacing them with do-nothing NULLOPs. (It's called a "peephole" optimizer because it walks through the syntax tree in execution order, peeping at each OP, and maybe a few surrounding OPs, and makes local optimizations based on what it sees.)
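Constant folding is easy to see in action. In the first statement below, the multiplication happens once, at compile time, and the syntax tree ends up holding a single OP_CONST; in the second, nothing can be folded and a real multiply op survives to run-time:

    $x = 2 * 3;     # compiled as if you had written  $x = 6;
    $x = 2 * $y;    # $y is unknown at compile time; the multiply op remains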
Some optimizations are more expensive, and therefore best left to the separate compiler kit: rewriting the syntax tree to get rid of NULLOPs altogether, inlining constant subroutines and variables, and maybe eliminating constant subexpressions. As time goes by, developers are sure to think of many different possible optimizations for the compiler, and adding them into the compiler kit modules shouldn't be too hard.
Before going into more detail about the compiler kit, I'd better talk about the relevant Perl internals, which are based upon the two aforementioned data structures: SVs and OPs. Any Perl variable ($foo, @bar, %baz, or *quux) ends up being referred to internally by an opaque SV* pointer. It's opaque in the same sense that the FILE* pointer of stdio is opaque: you may happen to know what fields it contains but woe betide anyone who depends on them. Perl programs are compiled into a syntax tree ready for runtime and each structure in the syntax tree is referred to by an OP*.
The SV encodes the type of variable, a reference count, flags and a pointer to an indirect structure encoding the variable's value and associated information. Those indirect structures have typedef names such as XPV, XPVNV, XPVAV and XPVGV (for strings, floats, arrays, and type globs, respectively). They're laid out so that their fields can be inherited: The first few fields of an XPV are used to encode the string value of a Perl scalar. Those same fields appear in the same place at the beginning of an XPVNV. The XPVNV, however, has further fields at the end which encode the string's numeric value as a double floating point. Standard macros applied to the SV can dereference it as a string, for example, whether the indirect structure is an XPV or an XPVNV. The two common cases of variables whose value is only an integer or a double are further optimized, but I won't go into that here.
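The B module from the compiler kit (which I'll describe properly below) makes this upgrading visible from Perl, assuming your copy provides the svref_2object() function; it returns an object blessed into a class named after the scalar's current indirect structure:

    use B ();
    my $s = "3.14";
    print ref(B::svref_2object(\$s)), "\n";   # B::PV: string fields only
    my $n = $s + 0;                           # use the string as a number...
    print ref(B::svref_2object(\$s)), "\n";   # ...and it's a B::PVNV now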
As Perl compiles a program, it parses it into operations and builds up a syntax tree in memory. For example, the Perl statement
$foo = 42;
(assuming $foo is a lexical my() variable) becomes three operations: OP_CONST, OP_PADSV, and OP_SASSIGN. (Those names are preprocessor constants stored in the op_type field of each OP.) Other fields encode compile-time information (e.g. its sibling OP in the syntax tree and its sequence number) and run-time information (e.g. the address of the C function which implements the operator and a pointer to the next OP to be executed).
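With the B module you can watch that execution order directly. This sketch assumes a reasonably current version of the module in which ops have a name() method; older versions expose the same information through ppaddr():

    use B qw(main_start);
    my $foo;
    $foo = 42;
    # follow the op_next chain of the main program from its first op;
    # const, padsv and sassign show up in execution order (along with
    # the housekeeping ops for this loop and the print itself)
    for (my $op = main_start(); $$op; $op = $op->next) {
        print $op->name, "\n";
    }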
Many operators need extra fields for both compile-time and run-time information. For example, list operators need fields to find all their children OPs in the syntax tree at compile-time. Operators that modify control flow (such as if or &&) need fields that point to the alternative OPs that may follow them at run-time. These are encoded in structures which agree with the OP for all their initial fields but then have extra fields on the end. They have names such as SVOP, LOGOP, BINOP and LISTOP. These structures can be thought of as classes derived from the base OP but with extra instance variables.
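The B module's class() function reports which of these concrete structures lies behind a given op. A small sketch, again assuming the name() method and the walkoptree() helper from current versions of the module:

    use B qw(svref_2object class walkoptree);
    # walk a subroutine's tree in syntax-tree order, printing each op's
    # concrete structure class (OP, UNOP, BINOP, LISTOP, ...) and name
    sub B::OP::show {
        my $op = shift;
        printf "%-8s %s\n", class($op), $op->name;
    }
    my $cv = svref_2object(sub { my $x = shift; $x && print $x });
    walkoptree($cv->ROOT, "show");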
The central run-time loop within Perl is then simply the line
while ( op = (*op->op_ppaddr)() ) ;
Earlier I mentioned that one important field of an OP structure was the pointer to the C function implementing it. That's the op_ppaddr field. The functions which implement Perl's run-time are usually referred to as PP code; each PP function uses the global variable op to find the run-time data it needs and returns a pointer to the next OP to perform. Most PP functions get this simply by dereferencing the op_next field of op, but operators which alter control flow do not. (pp_goto(), the PP function which implements the Perl goto operator, is the most extraordinary.) PP functions don't take any arguments; Perl arguments are handled by compiling them into PP functions that manipulate a global stack.
The compiler kit includes a number of separate modules. One module, named B, uses an XSUB to provide a Perl interface for accessing SVs and OPs in the Perl internals. For any given SV, it uses the SvTYPE macro to determine the type of the SV and creates an object which refers to it. There are classes and methods for OPs too, but there's an additional problem: None of the various OP structures contains an explicit field indicating exactly what structure it is. For some OPs, the op_type field is all that's needed, but for others close examination of other fields is necessary to intuit the structure type. For example, an OP_CONST is always an SVOP but OP_GOTO can occur as the type field of an OP, an UNOP or a PVOP depending on whether the original goto keyword was followed by a constant label, a non-constant label or a subroutine name. Methods for each class allow the fields of OPs to be manipulated from a Perl program. Fields which contain SVs or OPs are automatically turned into the appropriate sort of object.
The B module also contains utility functions for walking the current syntax tree in either syntax tree order or execution order, for accessing various special internal Perl variables, and for tasks like converting strings into legal C source form.
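For instance, the kit's cstring() utility (exported by the B module) turns an arbitrary Perl string into a legal C string literal, something the C backend needs constantly when writing out pre-initialized data:

    use B qw(cstring);
    # tabs and quotes come back escaped, with surrounding double quotes
    print cstring("tab\there, \"quotes\" there"), "\n";
    # prints: "tab\there, \"quotes\" there"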
Modules which implement a compiler backend make use of the B module for manipulating the Perl internals. The interface with the user (i.e. the person attempting to compile his Perl program) is through the O module. The usual way of invoking the compiler is by specifying something like
perl -MO=C myprog.pl
The command-line option -MO=C causes Perl to execute an implicit use O 'C' before it parses and runs myprog.pl. The O module does a require for the appropriate backend (B/C.pm in the case above) and invokes its &compile() subroutine. That handles extra command-line options, does some sanity checking, and passes back a subroutine reference to be executed later for the actual compilation. The O module caches that reference, forces the syntax-checking-only effect of the -c Perl command-line option, and creates an END { } block to call the cached backend subroutine reference. Then the O module finishes and leaves Perl to continue on its way.
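The protocol is small enough that a toy backend fits in a few lines. Here's a minimal sketch (B::Noop is my own hypothetical name, not part of the kit) showing the interface that O expects:

    package B::Noop;
    use B qw(main_root class);
    sub compile {
        my @options = @_;       # extra command-line options arrive here
        # sanity checking would go here; then hand back the real work
        return sub {
            print "main_root is a ", class(main_root()), " op\n";
        };
    }
    1;

Drop that into B/Noop.pm somewhere in @INC, and perl -MO=Noop myprog.pl will parse myprog.pl, skip running it, and call the returned closure from the END { } block that O sets up.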
Perl then loads myprog.pl, parsing it and creating variables and the internal syntax tree in the way we've already seen. It doesn't run the program because of the -c; however, END { } blocks are always run, and so the one created by the O module invokes the appropriate compiler backend. That backend then gets on with its job and, using subroutines and methods from the B module, saves the internal syntax tree and state of Perl. The CC backend does most of the same job as the C backend but also writes out C source to implement the program's runtime. The inline code that it produces, however, potentially takes up a lot of space which could result in an overall slowdown. At the time of writing, it is too early to tell which, if any, Perl programs will benefit from such aggressive inlining. I suspect that the C and Bytecode backends will be of the most use at first.
Optimizations such as those mentioned in a previous section can be applied in two obvious ways. One is by modifying the appropriate backend (which isn't hard, since it's written in Perl rather than C). The other is by using the Bytecode backend as an intermediate file format. A Perl program frozen in bytecode (or Perl Assembler format) can be fed through a filter module which performs the desired optimization and then reuses the Bytecode backend to write its output again. The resulting output file can be processed, run, or used as input to further optimization steps.
I'll end with an amusing example: you can implement a simple optimization filter just by running UNIX grep over the assembler form of a program. One of the fields in every OP structure is op_seq, which contains a unique sequence number assigned and used by Perl's own parser but unused at runtime. Using grep -v op_seq to remove all the assembler instructions that set those fields results in a smaller, and hence faster-loading, bytecode file.
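Assuming the assembler and disassembler modules are wrapped up as command-line filters (I'll call them assemble and disassemble here; check your distribution for the real script names), the whole optimization is a one-line pipeline:

    disassemble myprog.plc | grep -v op_seq | assemble > myprog.small.plc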
__END__