From Brainfuck to Python bytecode

Posted on 2019-08-27 with tags python, com­piler, brain­fuck, byte­code

When I wrote this post (and the script) the last Python ver­sion was 3.7, due of its low-​level na­ture, the pro­gram no longer works with newer ver­sions. How­ever it should be sim­ple to up­date it (start­ing from, for ex­am­ple, chang­ing the magic num­ber) or even bet­ter, using this li­brary.

About one month ago I wrote this sim­ple tran­spiler from Brain­fuck to Python byte­code. I’m going to as­sume you al­ready can “pro­gram” in Brain­fuck, oth­er­wise I ad­vise you to read the ded­i­cated page on Es­olang, in my opin­ion the best place to learn about the lan­guage. I’m also as­sum­ing that you have a min­i­mal knowl­edge about stack ma­chines, per­son­ally I only used them as a the­o­ret­i­cal ob­jects dur­ing some proofs. I learned to use them as a real soft­ware only dur­ing the cre­ation of this tran­spiler.

Un­der­stand­ing how to do all of this was not easy be­cause the Python vir­tual ma­chine is a mov­ing tar­get, I worked with Python 3.7 and I think my pro­gram should work with Python 3.6+. The best places to un­der­stand how the things work under the hood is the of­fi­cial dis mod­ule doc­u­men­ta­tion and this source di­rectly from the Python source code. I rec­om­mend to take a look at the first link and to use the sec­ond one only for con­sul­ta­tion.

I sug­gest to keep an eye on the source of the pro­gram while read­ing this post.

In the first part of the pro­gram there are some im­ports and the ini­tial­iza­tion of the cli parser, the in­ter­est­ing part starts with the parse func­tion.

def parse(src):  # parse the brainfuck source
    stack = []  # to remember if inside a [...]
    endAt = {}   # correspondence between brackes [...]

    for i, char in enumerate(src):
        if char == '[':
            stack.append(i)
        elif char == ']':
            endAt[stack.pop()] = i

    def recParse(start=0, end=len(src)-1):  # recursive parser
        ast = []
        i = start

        while i < end:
            char = src[i]
            if char == '+':
                if ast != [] and isinstance(ast[-1], int):
                    ast[-1] = (ast[-1] + 1) % 256
                else:
                    ast.append(1)
            elif char == '-':
                if ast != [] and isinstance(ast[-1], int):
                    ast[-1] = (ast[-1] - 1) % 256
                else:
                    ast.append(255)
            elif char in ('>', '<', '.', ','):
                ast.append(char)
            elif char == '[':
                ast.append('[')
                ast.append(recParse(i+1, endAt[i]))
                ast.append(']')
                i = endAt[i]

            i += 1

        return ast  # return the abstract syntax tree

    return recParse()

That func­tion cre­ates the ab­stract syn­tax tree, which is a bit overkill as word be­cause it sim­ply re­turns a list of lists where the el­e­ments can be [, ], ,, ., <, > or an in­te­ger [0,256)\in [0, 256). This pro­ce­dure ig­nores any char­ac­ter that isn’t one of the stan­dard 8 brain­fuck com­mands and au­to­mat­i­cally sim­plify the suc­ces­sions of + and - in a num­ber mod­ulo 256. It would be pos­si­ble to sim­plify also < and > but I was too lazy.

The visit func­tion sim­ply depth vis­its the ab­stract syn­tax tree ap­ply­ing the vis­i­tor func­tion for every el­e­ment of the tree.

def visit(visitor, ast):  # depth visit the ast with the visitor function
    for child in ast:
        if isinstance(child, list):
            visit(visitor, child)
        else:
            visitor(child)

Then it’s cre­ated a bytearray ob­ject called instructions con­tain­ing some in­struc­tions for the stack ma­chine (every in­struc­tion is 2 bytes long), that part is manda­tory for every pro­grams that the com­piler gen­er­ates, ba­si­cally it con­tains some im­ports.

In­stead the global vari­able addresses is a stack where the top el­e­ment is the ad­dress (as index of instructions) of the last [ met dur­ing the com­pi­la­tion.

The real com­pi­la­tion oc­curs in­side visitor, the func­tion it­self is ba­si­cally a big switch state­ment where dif­fer­ent things hap­pen de­pend­ing on the el­e­ment of the ab­stract syn­tax tree. The only note­wor­thy branches are [ and ], in­side the first one is an­no­tated the ad­dress at the top of addresses and then 6 NOP in­struc­tions are added to the pro­gram so that when the in the sec­ond branch ] is met, it’s pos­si­ble to change the NOP in­struc­tions to man­age the JUMP in­struc­tion.

ast = parse(source)
visit(visitor, ast)

instructions.extend([  # the last instructions for every program
    opmap['LOAD_CONST'], 0,
    opmap['RETURN_VALUE']
])

code = CodeType(
        0,  # argcount
        0,  # kwonlyargcount
        3,  # nlocals
        1000,  # stacksize
        0,  # flags
        bytes(instructions),  # codestring
        (None, *range(257), '', ('end',), arraySize, ('stdin',)),  # consts
        ('print', 'input', 'ord', 'chr', 'sys', 'stdin', 'read'),  # names
        ('array', 'pointer', 'stdin'),  # varnames
        args.outputfile,  # filename
        args.outputfile,  # name
        0,  # firstlineno
        bytes(),  # lnotab
        (),  # freevars
        ()  # cellvars
)


if args.show:
    print(dis(code))  # show the bytecode in a readable format
    exit(0)

with open(args.outputfile, 'wb+') as out:
    # printing the first 16 bytes in the file
    out.write(MAGIC_NUMBER)  # this depends on the Python version
    out.write(bytes([0] * 12))  # because of the pyc file format
    marshal.dump(code, out)

exit(0)

Here the pars­ing and the vis­it­ing (i.e. the instructions cre­ation) are done, then two last in­struc­tions are added, they ba­si­cally let the pro­gram re­turn None. Then a CodeType “ob­ject” is cre­ated, it was not easy to find some doc­u­men­ta­tions about this. At the end the bytearray is se­ri­al­ized and wrote to a file using the mar­shal mod­ule. This is how the .pyc files are struc­tured in­side.

Let’s note that be­fore writ­ing the in­struc­tions in the file a magic num­ber of 16 bytes is writ­ten, it de­pends on the Python ver­sion, so this tran­spiler should gen­er­ate byte­code work­ing only with the same Python ver­sion used for the in­ter­piler ex­e­cu­tion.

The com­plete source code of the pro­gram is here on Gist.

Here you can see an usage ex­am­ple where I com­pile a Brain­fuck pro­gram which prints an ascii ver­sion of a fa­mous frac­tal.

--:----:--

For this I have to thanks Daniel Cristo­fani, an in­sanely good Brain­fuck de­vel­oper who wrote pro­grams such as the (maybe) short­est pos­si­ble quine, a Brain­fuck in­ter­preter (a.k.a. Meta-​circular eval­u­a­tor) and a Brain­fuck to C tran­spiler.

This project has a lot of pos­si­ble im­prove­ments…