
Writing an interpreter (virtual machine) for a simple byte-code + JIT compilation

Original author: vda19999

There are two articles in Russian whose author writes a virtual machine (interpreter) for executing a simple bytecode and then applies various optimizations to make this virtual machine faster. Besides that, there is a compiler from a simple C-like language into this bytecode. After reading those articles and getting familiar with the compiler, I thought it would be interesting to try writing a virtual machine for this language that applies JIT compilation to the bytecode using the libjit library. This article describes the experience of doing that.

I found several articles online that describe the usage of this library, but those I saw compile specific programs with libjit, while I was interested in compiling arbitrary bytecode. For further reading, there is an official tutorial, a series of articles, and a series of comparisons (in Russian).

The implementation was done in C++ because we aren't playing games here. All my code is in my repository. The "main" branch has just the interpreter of the PigletVM bytecode; "labels-with-fallbacks" has a partial JIT compilation implementation (without support for JUMP instructions); "full-jit" has fully working JIT compilation; "making-jit-code-faster" makes the JIT-generated code faster; and the "universal-base-vm*" branches merge the interpreter and JIT compilation implementations by introducing a generalised base executor, which can back different implementations of PigletVM (both the interpreter and libjit compilation).

Compiling to bytecode

As I mentioned earlier, PigletC can compile a simple C-like language into PigletVM bytecode in text format. First, I will need a program that translates bytecode from the text format to a binary one.

Some details:

  • For simplicity, I will encode each instruction with 4 bytes (the int type). This isn't really optimal, as we only have 25 bytecode instructions, but it keeps things simpler because I want instruction arguments to be 4 bytes as well. Theoretically, making the executable smaller should make execution faster, but in our case of small programs the difference will be marginal.

  • Given that, the binary file for a program will be a sequence of int-s, each of which is an instruction opcode, an instruction argument, or a special value (see later). Instruction arguments only follow those instructions that have an argument.

  • JUMP instructions (JUMP, JUMP_IF_TRUE, JUMP_IF_FALSE) are followed by the index of the instruction they jump to (i.e., the index of the number that encodes the target instruction).

  • Because of this, the assembler first walks over the bytecode, generating the instructions but leaving the addresses after jump instructions unset; these are filled in at the end, when the address of every instruction is known.

  • For convenience, I decided to put two special numbers in place of each label in the binary file: 0xcafe and 0xbabe. This way, when analyzing the bytecode, it will be clear where jump instructions may lead. This isn't really necessary though (and it affects performance), as I could instead have saved the addresses after all the JUMP instructions.

Otherwise, there is nothing really interesting in the code of the assembler. It can be seen here.
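The two-pass scheme above can be sketched in a few lines. This is a minimal illustration, not the actual assembler: the opcode values, the Assembler type, and its methods are made up for the example; only the 0xcafe/0xbabe marker pair and the "patch jump slots at the end" idea come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative opcode values (hypothetical, not the real PigletVM encoding).
enum : int32_t { OP_PUSHI = 1, OP_JUMP = 2, OP_DONE = 3,
                 LABEL_MARK_HI = 0xcafe, LABEL_MARK_LO = 0xbabe };

struct Assembler {
    std::vector<int32_t> out;                            // binary image: a sequence of ints
    std::unordered_map<std::string, int32_t> label_addr; // label name -> instruction index
    std::vector<std::pair<size_t, std::string>> fixups;  // (slot index, label name)

    void label(const std::string& name) {
        out.push_back(LABEL_MARK_HI);   // special pair marking a jump target
        out.push_back(LABEL_MARK_LO);
        label_addr[name] = static_cast<int32_t>(out.size());
    }
    void jump(const std::string& target) {
        out.push_back(OP_JUMP);
        fixups.emplace_back(out.size(), target); // address not known yet
        out.push_back(-1);                       // placeholder to patch later
    }
    void op(int32_t opcode) { out.push_back(opcode); }
    void op(int32_t opcode, int32_t arg) { out.push_back(opcode); out.push_back(arg); }

    // Second pass: patch every jump slot once all label addresses are known.
    void resolve() {
        for (auto& [slot, name] : fixups)
            out[slot] = label_addr.at(name);
    }
};
```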

Interpreting bytecode

Now it is time to write a bytecode interpreter so that we have something to compare the execution of JIT compiled code with.

The state of the virtual machine is defined by the following elements:

  • the stack - it stores 4-byte signed numbers

  • memory - an array of 4-byte signed numbers that can be accessed by an index

  • the number of the next instruction to execute (the instruction pointer)

Initially, we could use the STL stack and vector containers for the stack and memory, respectively (or we could use vector or deque for both); then the sizes of the stack and memory wouldn't need to be bounded by constants. However, I ended up using raw pointers instead, because STL containers wouldn't work well with JIT compilation (technically, we can call arbitrary functions from JIT-compiled code, but inlining STL container methods would be impossible, which would hurt performance dramatically).
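A minimal sketch of what the VM state looks like with raw buffers. The struct name, the capacities, and the helpers are assumptions for illustration; the real code lives in the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// VM state as plain buffers instead of STL containers (capacities are
// arbitrary assumptions). Keeping the stack and memory as raw int arrays
// means JIT-compiled code can address them through ordinary pointers,
// with no container methods that would need inlining.
struct VmState {
    static constexpr size_t kStackCap  = 1 << 16;
    static constexpr size_t kMemoryCap = 1 << 16;

    int32_t* stack  = new int32_t[kStackCap];
    int32_t* memory = new int32_t[kMemoryCap]();  // zero-initialized
    size_t stack_size = 0;                        // top-of-stack index
    size_t ip = 0;                                // next instruction index

    ~VmState() { delete[] stack; delete[] memory; }

    void push(int32_t v) { stack[stack_size++] = v; }
    int32_t pop() { return stack[--stack_size]; }
};
```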

All in all, the implementation of the interpreter is straightforward: it is a big switch that executes each instruction:

void store_to_memory(int addr, int value) {
    memory[addr] = value;
}

int stack_pop() {
    return stack[--stack_size];
}

void run() {
        size_t ip = 0;
        while (ip < instructions_number) {
            ip++;
            switch (instructions[ip - 1]) {
                case OP_JUMP: {
                    auto arg = instructions[ip];
                    ip++;
                    ip = arg;
                    break;
                }
                case OP_JUMP_IF_TRUE: {
                    auto arg = instructions[ip];
                    ip++;
                    if (stack_pop()) {
                        ip = arg;
                    }
                    break;
                }
                case OP_JUMP_IF_FALSE: {
                    auto arg = instructions[ip];
                    ip++;
                    if (!stack_pop()) {
                        ip = arg;
                    }
                    break;
                }
                case OP_LOADADDI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size - 1] += memory[arg];
                    break;
                }
                case OP_LOADI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size++] = memory[arg];
                    break;
                }
                case OP_PUSHI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size++] = arg;
                    break;
                }
                case OP_DISCARD: {
                    stack_size--;
                    break;
                }
                case OP_STOREI: {
                    auto addr = instructions[ip];
                    ip++;
                    store_to_memory(addr, stack_pop());
                    break;
                }
                case OP_LOAD: {
                    auto addr = stack_pop();
                    stack[stack_size++] = memory[addr];
                    break;
                }
                case OP_STORE: {
                    auto val = stack_pop();
                    auto addr = stack_pop();
                    store_to_memory(addr, val);
                    break;
                }
                case OP_ADDI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size - 1] += arg;
                    break;
                }
                case OP_DUP: {
                    stack[stack_size] = stack[stack_size - 1];
                    stack_size++;
                    break;
                }
                case OP_SUB: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] -= arg;
                    break;
                }
                case OP_ADD: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] += arg;
                    break;
                }
                case OP_DIV: {
                    auto arg = stack_pop();
                    if (arg == 0) {
                        cerr << "ZERO DIVISION\n";
                        return;
                    }
                    stack[stack_size - 1] /= arg;
                    break;
                }
                case OP_MUL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] *= arg;
                    break;
                }
                case OP_ABORT: {
                    cerr << "OP_ABORT called\n";
                    return;
                }
                case OP_DONE: {
                    cout << "program DONE\n";
                    return;
                }
                case OP_PRINT: {
                    cout << stack_pop() << "\n";
                    break;
                }
                case OP_POP_RES: {
                    stack_pop();
                    break;
                }
                case OP_GREATER_OR_EQUALI: {
                    auto arg = instructions[ip];
                    ip++;
                    stack[stack_size - 1] = stack[stack_size - 1] >= arg;
                    break;
                }
                case OP_GREATER_OR_EQUAL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] >= arg;
                    break;
                }
                case OP_GREATER: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] > arg;
                    break;
                }
                case OP_LESS: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] < arg;
                    break;
                }
                case OP_LESS_OR_EQUAL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] <= arg;
                    break;
                }
                case OP_EQUAL: {
                    auto arg = stack_pop();
                    stack[stack_size - 1] = stack[stack_size - 1] == arg;
                    break;
                }
                case 0xcafe: {
                    // skip the 0xbabe half of the label marker
                    ip++;
                    break;
                }
            }
        }
  }

Moving to JIT compilation

Since I am writing in C++, I will use the C++ version of libjit, with classes and inheritance.

The way libjit works is that you first generate some intermediate representation (IR) of the code of your function and then libjit translates this representation into machine code. The IR is generated by calling libjit functions.

In order to create a JIT-compilable function, you need to define a class inherited from jit_function, define a constructor, and override the build and create_signature methods. In reality, a PigletC program only has one function, and the bytecode doesn't have instructions for calling functions and returning from them (and since JUMP instructions can only jump to addresses known at compile time, we can't have functions without additional support from the bytecode). Regardless, we can define a universal create_signature function:

jit_type_t create_signature() override {
        auto ret_type = (is_void ? jit_type_void : jit_type_int);
        jit_type_t params[num_args];
        for (int i = 0; i < num_args; i++) {
            params[i] = jit_type_int;
        }
        return jit_type_create_signature(jit_abi_cdecl, ret_type, params, num_args, 1);
}

Above, we consider a function to be able to accept a fixed number of int arguments and return either int or void.

Now all that remains is to create a function based on the bytecode. Here I spent some time thinking about what to do with the stack. When the interpreter executes code, the stack is the array/vector/std::stack defined in the virtual machine itself. For JIT compilation, I thought it would be efficient to use the real stack (the one referenced by the %rsp register on x86). However, I didn't find libjit methods that would allow this: firstly, the library is meant to be cross-platform and the stack may differ between architectures, and secondly, libjit uses the stack itself and wouldn't give direct access to it.

Here, it is important to clearly understand which code executes while we are creating the IR and which executes when the JIT-compiled function runs. For example, one might want to create an array of jit_type_int values for the stack in the build function. But the stack indexes aren't known at compile time, so that is impossible.

Given that, we will pass the pointers to memory, the stack, and the stack size into the constructor of the JIT-compiled function and use them the same way as in the interpreter. A side effect of this is that it will be possible to switch between executing a function in JIT mode and in interpreting mode: if JIT compilation doesn't support some instruction, execution falls back to the interpreter on encountering it, and the interpreter executes the unsupported instructions. For now, JIT compilation won't support JUMP instructions.

Another option would be calling malloc from the JIT-function and using the allocated memory for the stack.

Getting ready for function compilation will look this way:

// creating a pointer to int* and writing the outer pointer to memory into it
auto memory = new_value(jit_type_create_pointer(jit_type_int, 1));
memory = this->new_constant(memory_outer);
// doing the same for the stack
auto stack = new_value(jit_type_create_pointer(jit_type_int, 1));
stack = this->new_constant(stack_outer);
// creating a pointer for stack size
jit_value stack_size_ptr = new_value(jit_type_create_pointer(jit_type_int, 1));
stack_size_ptr = this->new_constant(stack_size_ptr_outer);
// convenience wrapper over the function for pushing 
// to the stack
auto push = [this, &stack, &stack_size_ptr] (jit_value&& value) {
    push_on_stack(stack, stack_size_ptr, std::move(value));
};
// same for stack pop
auto pop = [this, &stack, &stack_size_ptr] () {
    return pop_from_stack(stack, stack_size_ptr);
};
int& ip = *ip_ptr;

And somewhere below:

void push_on_stack(jit_value& stack, jit_value& stack_size_ptr, jit_value&& arg) {
    insn_store_elem(stack, insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int), arg);
    insn_store_elem(stack_size_ptr, new_constant(0),
                        insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int) + new_constant(1));
}

jit_value pop_from_stack(jit_value& stack, jit_value& stack_size_ptr) {
    insn_store_elem(stack_size_ptr, new_constant(0),
                    insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int) - new_constant(1));
    return insn_load_elem(stack, insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int), jit_type_int);
}

jit_value peek_stack(jit_value& stack, jit_value& stack_size_ptr) {
    return insn_load_elem(stack, insn_load_elem(stack_size_ptr, new_constant(0), jit_type_int) - new_constant(1),
                          jit_type_int);
}

All variables and values that a JIT-compiled function can use should be declared as variables of the jit_value type.

For convenience, we write functions that "push values on the stack" and "pop values from the stack". The quotes are there because these functions won't actually do that when called; they append the corresponding actions to the function's IR.

Now the processing of bytecode instructions is relatively easy. Some examples:

case OP_STORE: {
    auto val = pop();
    auto index = pop();
    insn_store_elem(memory, index, val);
    break;
}
case OP_LOAD: {
    auto index = pop();
    push(insn_load_elem(memory, index, jit_type_int));
    break;
}
case OP_PRINT: {
    auto tmp = pop().raw(); // .raw returns a C struct from the C++ wrapper
    this->insn_call_native("print_int", (void*)(&print_int), signature_helper(jit_type_void, jit_type_int, end_params),
        &tmp, 1, 0);
    break;
}

The implementation of the third instruction shows how to call a native function, i.e., a function compiled to machine code without libjit.

Note that the code creating the libjit IR is quite similar to the interpreter code. It seems we could make the interpreter and the JIT compiler share the code for most instructions, and we will do this later. That would spare us from doing double work when adding new instructions.

The virtual machine will now check whether the next instruction is a JUMP. If it is, it is interpreted directly. Otherwise, JIT compilation is applied to the following instructions up to the next JUMP. The JIT-compiled code is saved in a cache and reused.

Naturally, this works very slowly: the parts we compile are relatively small, and the interpreter is called quite often. For now, this is just a demonstration of what we can do with libjit.
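The control flow just described can be modeled in plain C++ without libjit. In this sketch, a "compiled" segment (the straight-line code between jumps) is simulated by a std::function that runs the segment and returns the next ip; compiled segments are cached by their starting ip and reused. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>

// A toy model of the mixed execution loop: straight-line segments run
// "compiled" (and are cached by starting ip), while jumps are handled by
// the interpreter, which decides where execution continues.
struct MixedVm {
    using Segment = std::function<size_t()>;  // runs a segment, returns next ip
    std::unordered_map<size_t, Segment> cache;
    size_t compile_calls = 0;                 // how many times we "compiled"

    size_t run(size_t ip, size_t end,
               const std::function<Segment(size_t)>& compile_segment,
               const std::function<size_t(size_t)>& interpret_jump) {
        while (ip < end) {
            auto it = cache.find(ip);
            if (it == cache.end()) {          // first visit: compile and cache
                ++compile_calls;
                it = cache.emplace(ip, compile_segment(ip)).first;
            }
            ip = it->second();                // run the compiled segment
            if (ip < end)
                ip = interpret_jump(ip);      // jumps stay in the interpreter
        }
        return ip;
    }
};
```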

Now we need to add support for JUMP instructions. In libjit, besides variables, we can create labels and do two operations with them: set where a label is placed in the code, and define jumps to it.

We will create a dictionary mapping instruction addresses to labels. When we encounter 0xcafe, we place a label right there. When we meet a JUMP instruction, we emit a branch to the label obtained from the dictionary by the target instruction address. Code:

case 0xcafe: {
    // skipping 0xbabe
    ip++;
    // ip is now pointing to the instruction after the label
    if (labels.count(ip) == 0) {
        labels[ip] = jit_label_undefined;
    }
    insn_label(labels[ip]);
    break;
}
case OP_JUMP: {
    auto arg = instructions[ip];
    ip++;
    if (labels.count(arg) == 0) {
        labels[arg] = jit_label_undefined;
    }
    insn_branch_if(new_constant(true), labels[arg]);
    break;
}
case OP_JUMP_IF_TRUE: {
    auto arg = instructions[ip];
    ip++;
    if (labels.count(arg) == 0) {
        labels[arg] = jit_label_undefined;
    }
    insn_branch_if(pop(), labels[arg]);
    break;
}
case OP_JUMP_IF_FALSE: {
    auto arg = instructions[ip];
    ip++;
    if (labels.count(arg) == 0) {
        labels[arg] = jit_label_undefined;
    }
    insn_branch_if_not(pop(), labels[arg]);
    break;
}

At this point we can run the virtual machine on some simple code, expecting it to be way faster with JIT compilation. However, our JIT compilation only made the code slower: 1.2 seconds with the interpreter against 1.4 with JIT compilation!

What is wrong? The problem is that we generate rather convoluted code (because I expected the optimizer to cope with everything), while the libjit optimizer is relatively weak and can't simplify it.

Let`s try optimizing the code we generate.

The implementations of the bytecode instructions are relatively simple, but the stack operations look verbose: we don't just increment or decrement the stack-size variable, we read it from some address and then write it back every time, so libjit is probably unable to keep the stack size in a register. Let's simplify this by making the stack size a local variable. We can still preserve the ability of the interpreter and the JIT compiler to work together if we write the stack size back before the JIT-compiled part returns.

jit_value stack_size = new_value(jit_type_int);
stack_size = new_constant(0);
auto push = [this, &stack, &stack_size](jit_value&& value) {
    insn_store_elem(stack, stack_size, value);
    stack_size = stack_size + new_constant(1);
};
auto pop = [this, &stack, &stack_size]() {
    stack_size = stack_size - new_constant(1);
    return insn_load_elem(stack, stack_size, jit_type_int);
};
auto peek = [this, &stack, &stack_size]() {
    return insn_load_elem(stack, stack_size - new_constant(1), jit_type_int);
};

This time the performance improvement is there, and it is significant: 1.2 seconds vs 0.4 seconds.

On the other hand, we could have expected a bigger improvement. I suspect that the weak libjit optimizations, combined with the fact that PigletVM is a stack machine rather than a register machine, prevent the code from being really fast: bytecode instructions use the stack all the time, which is much slower than using registers. This problem could be partly mitigated by replacing some sequences of instructions with more complex ones (the way the author of the original PigletVM articles introduced PUSHI, STOREI, LOADADDI, etc.).

When measuring performance, I was testing bytecode resulting from the compilation of the following PigletC program:

int res;
int i;

void main() {
    i = 1000000000;
    while (i > 0) {
        i = i - 1;
    }
    print(i);
}
