what is "block compilation" anyway?
Those of you using Lisp, or any dynamic language, know one thing: function calls to global, top-level functions are expensive. Much more expensive than in a statically compiled language. They're slow because of the late-bound nature of top-level defined functions, which allows arbitrary redefinition while the program is running and forces runtime checks on whether the function is being called with the right number or types of arguments. This type of call is known as a "full call" in Python (the compiler used in CMUCL and SBCL, not to be confused with the programming language), and its calling convention permits the most runtime flexibility.
But there is another type of call available to us: the local call. A local call is the type of call you would see between local functions inside a top-level function, say, a call to a function introduced via anonymous LAMBDAs, LABELs or FLETs in Lisp, or internal defines in Scheme and Python. These calls are more 'static' in the sense that they are treated more like function calls in static languages: they are compiled together, at the same time as the local functions they reference, allowing them to be optimized at compile time. For example, argument checking can be done at compile time, because the number of arguments of the callee is known then, unlike in the full call case, where the function, and hence the number of arguments it takes, can change dynamically at runtime at any point. Additionally, the local call convention can pass unboxed values like floats around, as they can be put into unboxed registers that are never used in the full call convention, which must use boxed argument and return value registers.
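To make the distinction concrete, here is a minimal sketch (SQUARE, SUM-SQUARES and SUM-SQUARES-LOCAL are hypothetical names for illustration, not part of the example below): the calls to the global SQUARE are full calls through its fdefinition, while the LABELS-introduced SQ is compiled together with its caller and becomes a local call.
(defun square (x)
  (* x x))

(defun sum-squares (a b)
  ;; full calls: SQUARE is late-bound and can be redefined at any time,
  ;; so argument count and boxing are checked at runtime
  (+ (square a) (square b)))

(defun sum-squares-local (a b)
  (labels ((sq (x) (* x x)))
    ;; local calls: SQ is compiled together with SUM-SQUARES-LOCAL,
    ;; so the calls are early-bound and checked at compile time
    (+ (sq a) (sq b))))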
Block compilation is simply the compilation mode that turns what would normally be full calls to top-level defined functions into local calls to said functions, by compiling all functions in a unit of code (e.g. a file) together in "block" or "batch" fashion, just as local functions are compiled together in a single top-level form. You can think of the effect of block compilation as transforming all the DEFUNs in a file into one large LABELS form. You can also think of it as a tunable knob that controls how dynamically or statically the compiler treats function definitions, by controlling whether function name resolution is early-bound or late-bound in a given block compiled unit of code.
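As a rough mental model only (this is not literally what the compiler emits, and ADD1 and ADD2 are hypothetical names), block compiling a file containing (defun add1 (x) (1+ x)) and (defun add2 (x) (add1 (add1 x))) behaves, for the calls within the file, roughly as if the whole file had been rewritten into a single LABELS form:
(labels ((add1 (x) (1+ x))
         (add2 (x) (add1 (add1 x))))  ; ADD2's calls to ADD1 are local calls
  ;; the names are still installed globally (the external entry points),
  ;; so callers outside the file keep working
  (setf (fdefinition 'add1) #'add1
        (fdefinition 'add2) #'add2))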
We can achieve block compilation with a file-level granularity in CMUCL and SBCL specifically by specifying the :block-compile keyword to compile-file. Here's an example:
In foo.lisp:
(defun foo (x y)
  (print (bar x y))
  (bar x y))

(defun bar (x y)
  (+ x y))

(defun fact (n)
  (if (zerop n)
      1
      (* n (fact (1- n)))))
> (compile-file "foo.lisp" :block-compile t :entry-points nil)
> (load "foo.fasl")
> (sb-disassem:disassemble-code-component #'foo)
; Size: 210 bytes. Origin: #x52E63F90 (segment 1 of 4)      ; (XEP BAR)
; 3F90: .ENTRY BAR(X Y)
; 3FA0: 8F4508 POP QWORD PTR [RBP+8]
; 3FA3: 4883F904 CMP RCX, 4
; 3FA7: 0F85B1000000 JNE L2
; 3FAD: 488D65D0 LEA RSP, [RBP-48]
; 3FB1: 4C8BC2 MOV R8, RDX
; 3FB4: 488BF7 MOV RSI, RDI
; 3FB7: EB03 JMP L1
; 3FB9: L0: 8F4508 POP QWORD PTR [RBP+8]
; Origin #x52E63FBC (segment 2 of 4) ; BAR
; 3FBC: L1: 498B4510 MOV RAX, [R13+16] ; thread.binding-stack-pointer
; 3FC0: 488945F8 MOV [RBP-8], RAX
; 3FC4: 4C8945D8 MOV [RBP-40], R8
; 3FC8: 488975D0 MOV [RBP-48], RSI
; 3FCC: 498BD0 MOV RDX, R8
; 3FCF: 488BFE MOV RDI, RSI
; 3FD2: E8B9CB29FF CALL #x52100B90 ; GENERIC-+
; 3FD7: 488B75D0 MOV RSI, [RBP-48]
; 3FDB: 4C8B45D8 MOV R8, [RBP-40]
; 3FDF: 488BE5 MOV RSP, RBP
; 3FE2: F8 CLC
; 3FE3: 5D POP RBP
; 3FE4: C3 RET
; Origin #x52E63FE5 (segment 3 of 4) ; (XEP FOO)
; 3FE5: .SKIP 11
; 3FF0: .ENTRY FOO(X Y)
; (SB-INT:SFUNCTION
; (T T) NUMBER)
; 4000: 8F4508 POP QWORD PTR [RBP+8]
; 4003: 4883F904 CMP RCX, 4
; 4007: 7557 JNE L3
; 4009: 488D65D0 LEA RSP, [RBP-48]
; 400D: 488955E8 MOV [RBP-24], RDX
; 4011: 48897DE0 MOV [RBP-32], RDI
; Origin #x52E64015 (segment 4 of 4) ; FOO
; 4015: 498B4510 MOV RAX, [R13+16] ; thread.binding-stack-pointer
; 4019: 488945F0 MOV [RBP-16], RAX
; 401D: 4C8BCD MOV R9, RBP
; 4020: 488D4424F0 LEA RAX, [RSP-16]
; 4025: 4883EC40 SUB RSP, 64
; 4029: 4C8B45E8 MOV R8, [RBP-24]
; 402D: 488B75E0 MOV RSI, [RBP-32]
; 4031: 4C8908 MOV [RAX], R9
; 4034: 488BE8 MOV RBP, RAX
; 4037: E87DFFFFFF CALL L0
; 403C: 4883EC10 SUB RSP, 16
; 4040: B902000000 MOV ECX, 2
; 4045: 48892C24 MOV [RSP], RBP
; 4049: 488BEC MOV RBP, RSP
; 404C: E8F1E163FD CALL #x504A2242 ; #<FDEFN PRINT>
; 4051: 4C8B45E8 MOV R8, [RBP-24]
; 4055: 488B75E0 MOV RSI, [RBP-32]
; 4059: E95EFFFFFF JMP L1
; 405E: L2: CC10 INT3 16 ; Invalid argument count trap
; 4060: L3: CC10 INT3 16 ; Invalid argument count trap
You can see that FOO and BAR are now compiled into the same component (with local calls), and both still have valid external entry points. This improves code locality quite a bit and still allows calling both FOO and BAR from outside the file (e.g. in the REPL). The only thing that has changed is that within the file foo.lisp, calls to functions defined in that file bypass the global fdefinitions' external entry points, which do all the slow argument checking and boxing. Even FACT is faster, because the compiler can recognize the tail-recursive local self call and turn it directly into a loop. Without block compilation, the user is licensed to, say, redefine FACT while it is running, which forces the compiler to make the self call a normal full call so that redefinition and full argument and return value processing remain possible.
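To see the semantic trade-off concretely, here is a sketch of a REPL session, assuming the block-compiled foo.fasl from above has been loaded into a fresh image: redefining BAR changes what external callers see, but FOO's calls to BAR were resolved at compile time.
> (foo 2 3)                   ; prints 5 and returns 5
> (defun bar (x y) (* x y))   ; redefine BAR globally at the REPL
> (bar 2 3)                   ; => 6: external full calls see the new definition
> (foo 2 3)                   ; still prints 5 and returns 5, because the calls
                              ; inside FOO are local calls bound to the old BAR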
But there is one more goody block compilation adds...
the :entry-points keyword
Notice we specified :entry-points nil above. That's telling the compiler to still create external entry points for every function in the file, since we'd like to be able to call them normally from outside the code component (i.e. the compiled compilation unit, here the entire file). Now, those of you who know C know there is a useful way to get the compiler to optimize file-local functions, for example automatically inlining them if they are once-use, while also enforcing that the function is not visible outside the file: the static keyword. The straightforward analogue when block compiling is the :entry-points keyword. Essentially, it makes the DEFUNs which are not entry points lack external entry points, i.e. they are not visible to any functions outside the block compiled unit and so become subject to an assortment of optimizations. For example, if a function with no external entry point is never called in the block compiled unit, it will simply be deleted as dead code. Better yet, if such a function is once-use, it will be removed and turned directly into a LET at the call site, essentially acting as inlining with no code size tradeoff; this is always an optimization.
So, for example, if we instead compile with
> (compile-file "foo.lisp" :block-compile t :entry-points '(bar fact))
then FOO is removed entirely, since it is unused within the block compiled unit (the file). This is all documented very well in the CMUCL manual section Advanced Compiler Use and Efficiency Hints under "Block compilation". Unfortunately this section (among others) never made it over to SBCL's manual, though it is still 99% accurate for SBCL.
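As a quick sanity check, here is a sketch of what you would expect in a fresh image with the foo.lisp above (the exact compiler notes printed may differ):
> (compile-file "foo.lisp" :block-compile t :entry-points '(bar fact))
> (load "foo.fasl")
> (fboundp 'bar)   ; => T, BAR was named as an entry point
> (fboundp 'fact)  ; => T, so was FACT
> (fboundp 'foo)   ; expected => NIL: FOO had no entry point and no callers
                   ; within the file, so it was deleted as dead code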
a brief history of block compilation
Now to explain why the word "Fresh" in the title of this post is in quotes. You may be surprised to hear that CMUCL, the progenitor of SBCL, has had the interface to block compilation described above since 1991. Indeed, the Python compiler, first started in 1985 by Rob MacLachlan, was designed with the explicit goal of being able to block compile arbitrary amounts of code at once, in bulk, in the manner described above, as a way to close the gap between dynamic and static language implementations. In fact, the top-level intermediate representation data structure in Python is the COMPONENT, which represents a connected component of the flow graph created by compiling multiple functions together. So, what happened? Why did SBCL not have this feature, despite its compiler being designed around it?
comments
Cool stuff. Nice job.
Is file the maximum granularity? Could we compile a whole system or a whole application in this fashion? Would we want to?
Thanks! CMUCL allows compile-file to take a list of files instead of just one file, which allowed block compilation to operate at a multi-file level of granularity. SBCL removed this extension in an effort to make compile-file conform to the spec. However, there are certainly other mechanisms that could be implemented to achieve the same thing in a more ANSI compliant way: for example, Christophe suggested extending with-compilation-unit with a keyword, which *is* allowed. Block compiling entire systems would definitely be desirable for some applications, and in conjunction with the :entry-points keyword (if the application has a single entry point named "main", for example) it could even function as a flexible tree shaker, depending on how much code is being compiled at once.
How do you use this feature from SLIME? Special variants of the slime-compile-and-load-file function?