Optimizing Ruby Lazy Initialization in TruffleRuby with Deoptimization
Shopify’s involvement with TruffleRuby began half a year ago, with the goal of furthering the success of the project and Ruby community. TruffleRuby is an alternative implementation of the Ruby language (where the reference implementation is CRuby, or MRI) developed by Oracle Labs. TruffleRuby has high potential in speed, as it is nine times faster than CRuby on optcarrot, a NES emulator benchmark developed by the Ruby Core Team.
I’ll walk you through a simple feature I investigated and implemented. It showcases many important aspects of TruffleRuby and serves as a great introduction to the project!
Ruby developers tend to use the double pipe equals operator
||= for lazy initialization, likely somewhat like this:
Syntactically, the meaning of the double pipe equals operator is that the value is assigned if the value of the variable is currently not set.
The common use case of the operator isn’t so much “assign if” and more “assign once”.
This idiomatic usage is a subset of the operator’s syntactical meaning so prioritizing that logic in the compiler can improve performance. For TruffleRuby, this would lead to less machine code being emitted as the logic flow is shortened.
To confirm that this usage is common enough to be worth optimizing for, I ran static profiling on how many times this operator is used as lazy initialization.
For a statement to count as a lazy initialization for these profiling purposes, we had it match one of the following requirements:
- The value being assigned is a constant (uses only literals of int, string, symbol, hash, array or is a constant variable). An example would be
a ||= [2 * PI].
- The statement with the
||=operator is in a function, an instance or class variable is being assigned, and the name of the instance variable contains the name of the function or vice versa. The function must accept no params. An example would be
def get_a; @a ||= func_call.
These criteria are very conservative. Here are some examples of cases that won’t be considered a lazy initialization but probably still follow the pattern of “assign once”.
After profiling 20 popular open-source projects, I found 2082 usages of the
||= operator, 64% of them being lazy initialization by this definition.
Before we get into optimizing TruffleRuby for this behaviour, here’s some background on how TruffleRuby compiles your code.
TruffleRuby is an implementation of Ruby that aims for higher performance through optimizing Just In Time (JIT) compilation (programs that are compiled as they’re being executed). It’s built on top of GraalVM, a modified JVM built by Oracle that provides Truffle, a framework used by TruffleRuby for implementing languages through building Abstract Syntax Tree (AST) interpreters. With Truffle, there’s no explicit step where JVM bytecode is created as with a conventional JVM language, rather Truffle will just use the interpreter and communicate with the JVM to create machine code directly with profiling and a technique called partial evaluation. This means that GraalVM can be advertised as magic that converts interpreters into compilers!
TruffleRuby also leverages deoptimization (more than other implementations of Ruby) which is a term for quickly moving between the fast JIT-compiled machine code to the slow interpreter. One application for deoptimization is how the compiler handles monkey patching (e.g. replacing a class method at runtime). It’s unlikely that a method will be monkey patched, so we can deoptimize if it has been monkey patched to find and execute the new method. The path for handling the monkey patching won’t need to be compiled or appear in the machine code. In practice, this use case is even better—instead of constantly checking if a function has been redefined, we can just place the deoptimization where the redefinition is and never need a check in compiled code.
In this case with lazy initialization, we make the deoptimization case the uncommon one where the variable needs to be assigned a value more than once.
Before when TruffleRuby encountered the
||= operator, a Graal profiler would see that since both sides have been used, the entire statement should be compiled into machine code. Our knowledge of how Ruby is used in practice tells us that the right hand side is unlikely to be run again, and so doesn’t need to be compiled into machine code if it’s never been executed or has been executed just once.
TruffleRuby uses little objects called nodes to represent each part of a Ruby program. We use an
OrNode to handle the
||= operator, with the left side being the condition and the right side being the action to execute if the left side is true (in this case the action is an assignment). The creation of these nodes are implemented in Java.
To make this optimization, we swapped out the standard
OrNode for an
OrLazyValueDefinedNode in the
BodyTranslator which translates the Ruby AST into nodes that Truffle can understand.
OrNode executes like this:
The ConditionProfile is what counts how many times each branch is executed. With lazy initialization it counts both sides as used by default, so compiles them both into the machine code.
OrLazyValueDefinedNode only changes the else block. What I’m doing here is counting the number of times the else part is executed, and turning it into a deoptimization if it’s less than twice.
Benchmarking isn’t a perfect measure of how effective this change is (benchmarking is arguably never perfect, but that’s a different conversation), as the results would be too noisy to observe in a large project. However, I can still benchmark on some pieces of code to see the improvements. By doing the “transfer to interpreter and invalidate”, time and space is saved in creating machine code for everything related to the right side.
With our new optimization this piece of code compiles about 6% faster and produces about 63% fewer machine code by memory (about half the number of assembly instructions). Faster compilation means more time for your app to run, and smaller machine code means less usage of memory and cache. Producing less machine code more quickly improves responsiveness and should in turn make the program run faster, though it’s difficult to prove.
Function foo without optimization
Above is a graph of the
foo method in sample code above without the optimization that vaguely represents the logic present in the machine code. I can look at the actual compiler graphs produced by Graal at various stages to understand how exactly our code is being compiled, but this is the overview.
Each of the nodes in this graph expands to more control flow and memory access, which is why this optimization can impact the amount of machine code so much. This graph represents the uncommon case where the checks and call to the
calculate_foo method are needed, so for lazy initialization it’ll only need this flow once or zero times.
Function foo with optimization
The graph that includes the optimization is a bit less complex. The control flow doesn’t need to know anything about variable assignment or anything related to calling and executing a method.
What I’ve added is just an optimization, so if you:
- aren’t using
||=to mean lazy initialization
- need to run the right-hand-side of the expression multiple times
- need it to be fast
then the optimization goes away and the code is compiled as it would have done before (you can revisit the
OrLazyValueDefinedNode source above to see the logic for this).
This optimization shows the benefit of looking at codebases used in industry for patterns that aren’t visible in the language specifications. It’s also worth noting that none of the code changes here were very complicated and modified code in a very modular way—other than the creation of the new node, only one other line was touched!
Truffle is actually named after the chocolates, partially in reference to the modularity of a box of chocolates. Apart from modularity, TruffleRuby is also easy to develop on as it’s primarily written in Ruby and Java (there’s some C in there for extensions).
Shopify is leading the way in experimenting with TruffleRuby for production applications. TruffleRuby is currently mirroring storefront traffic. This helped us work through some bugs, build better tooling for TruffleRuby and can lead to a faster browsing for customers.
We also contribute to CRuby/MRI and Sorbet as a part of our work on Ruby. We like desserts, so along with contributions to TruffleRuby and Sorbet, we maintain Tapioca! If you’d like to become a part of our dessert medley (or work on other amazing Shopify projects), send us an application!
- TruffleRuby Source – (separate link to pull request)
- Code used for static profiling
- Post on Optcarrot (NES emulator benchmark) performance in TruffleRuby
- Understanding How Graal Works, Chris Seaton (50 minute video + corresponding blog post)
- Truffle Compilation Pipeline (45 minute video, slides)
- Deoptimizing Ruby, Chris Seaton (30 minute video + corresponding blog post)
- What Ruby’s ||= (Double Pipe / Or Equals) Really Does (explains why x ||= y isn’t equivalent to x || x = y )
- V8 blog post about deoptimization
- Partial Evaluation (academic papers)
- Practical Partial Evaluation for High-Performance Dynamic Language Runtimes (PDF)
- Tutorial on Online Partial Evaluation, William Cook (PDF)
Tangentially related things about Ruby and TruffleRuby
- So You Want To Optimize Ruby, Charles Nutter
- Top 10 Things To Do With GraalVM
- Towards The Ruby 3×3 Performance Goal
If this sounds like the kind of problems you want to solve, we’re always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.