ChunkBake Reference


Introduction

ChunkBake is a line-oriented assembler for Lua 5 virtual machine instructions. It produces a Lua binary chunk from assembly code, which can then be loaded and run on a Lua 5 virtual machine. The project is hosted on LuaForge, at: http://luaforge.net/projects/chunkbake/. Currently, Lua 5.0.2 (instructions and binary chunk format) is supported.

This reference guide assumes you know what Lua 5 virtual machine instructions do and how a binary chunk is put together.  If you want to learn about Lua 5 virtual machine instructions, one document you can read is "A No-Frills Introduction to Lua 5 VM Instructions."


Usage

ChunkBake currently sports a very simple interface:

usage: %s [options] [filenames]

options:
  -h, --help        prints usage information
  --quiet           do not display warnings
  -o <file>         specify file name to write output listing
  --                stop handling arguments

example:
  >%s myscript.asm -o myscript.out

If an output file is not explicitly specified using the -o option, the basename of the source file is appended with .out to form the output filename. The format of the binary chunk file that is generated should conform to that of the platform the script is running on. ChunkBake uses Lua to generate a small binary chunk and takes those parameters as defaults. If this somehow fails (and it should not), the defaults are set for the x86 platform. For other platforms, this is untested but should work. The binary chunk format can also be customized using assembler directives, overriding the platform defaults.


Overview

ChunkBake is a classic line-oriented assembler, meaning that a statement (which can either be an instruction or an assembler directive) must be in a single logical line. Usually each logical line occupies one physical line, but the line continuation character '\' can be used to split a long line into multiple lines. A statement can contain a directive or a mnemonic, plus associated operands or parameters.

ChunkBake's lexer is derived from Lua's lexer. Mnemonics (e.g. MOVE), directives (e.g. .function) and key names for header properties are not case-sensitive, but other declared symbols are case-sensitive.


Lexical Analyzer

The lexer recognizes and handles the usual line-ending styles automatically: LF, CR, CRLF.

Only Lua 5.0.2 opcodes are supported at this time. The mnemonics follow the symbol names defined in the Lua 5 sources (their syntax will be described in detail later):

MOVE LOADK LOADBOOL LOADNIL
GETGLOBAL SETGLOBAL GETUPVAL SETUPVAL
GETTABLE SETTABLE NEWTABLE SELF
ADD SUB MUL DIV POW UNM NOT CONCAT
EQ LT LE TEST
JMP CALL TAILCALL RETURN
FORLOOP TFORLOOP TFORPREP
SETLIST SETLISTO CLOSE CLOSURE

Most numbers allowed by the assembler are integers. For constants, a number must be prefixed immediately by a '#' (pound or sharp) symbol, e.g. #12, #0.123, #-123, #1.2e34. When specifying constants, the number may have a fractional part and an exponent part.

Expressions are not supported at the moment, though they are a planned feature. The '#' will still be used to denote a constant, in order to unambiguously differentiate constants from local stack locations.

The lexer recognizes hexadecimal numbers with the '0x' prefix as well. Examples are 0x1234, 0XBEEF.
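
To make the immediate syntax concrete, here is a minimal Python sketch of how a '#'-prefixed token could be turned into a numeric value. It is not taken from ChunkBake's lexer: the function name parse_immediate is hypothetical, the result is returned as a float on the assumption that constants are stored as doubles, and accepting a minus sign in front of a hexadecimal immediate is also an assumption.

```python
def parse_immediate(token):
    """Convert an immediate such as '#12' or '#0xBEEF' to a float."""
    if not token.startswith("#"):
        raise ValueError("immediates need a leading '#': %r" % token)
    body = token[1:]
    sign = 1.0
    if body.startswith("-"):
        sign, body = -1.0, body[1:]
    if body.lower().startswith("0x"):
        # Hexadecimal immediates, e.g. #0xBEEF (assumed to be integers).
        return sign * float(int(body, 16))
    # Decimal immediates may carry a fraction and an exponent.
    return sign * float(body)


for text in ("#12", "#0.123", "#-123", "#1.2e34", "#0xBEEF"):
    print(text, parse_immediate(text))
```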

Strings are specified exactly like Lua 5 strings. Strings may be single-quoted or double-quoted, and the [[...]] form (where the delimiter pair can be nested) can be used as well. Single- or double-quoted strings can contain the following escape sequences for control characters: \a, \b, \f, \n, \r, \t, \v. The \ddd style works for 8-bit numbers in general. Single-quote and double-quote characters can be escaped as well.

A '\' at the end of a line is a generic line continuation character, so the following are equivalent:

move 0 1
move \
    0 \
    1
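
As a rough illustration of the continuation rule, the Python sketch below joins physical lines into logical lines. It is not ChunkBake's own lexer code: the function name join_continuations is hypothetical, and the sketch ignores comments and multi-line strings for brevity.

```python
def join_continuations(physical_lines):
    """Merge physical lines ending in a backslash into logical lines."""
    logical, pending = [], ""
    for line in physical_lines:
        stripped = line.rstrip()
        if stripped.endswith("\\"):
            # Drop the backslash and keep accumulating.
            pending += stripped[:-1] + " "
        else:
            logical.append(pending + stripped)
            pending = ""
    if pending:
        # A trailing continuation with nothing after it.
        logical.append(pending.rstrip())
    return logical


lines = ["move 0 1", "move \\", "    0 \\", "    1"]
tokens = [logical.split() for logical in join_continuations(lines)]
print(tokens)
```

Both logical lines split into the same three tokens, which is the sense in which the two forms above are equivalent.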

A comment to the end of the line can use either '--' or ';':

-- a comment
; a comment

A block comment uses the Lua 5 '--[[...]]' style:

--[[ this is a block comment
     more comments here
--]]

Locals in the current stack frame, which work like registers in a register window, can be written in several styles. The following are equivalent:

1     ; local 1 on current stack frame
r1    ; use this if you like thinking in terms of registers
R1    ; a capital 'R' probably looks better
$1    ; the '$' prefix is used in some assemblers
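
The four spellings can be normalized with a single pattern. The following Python sketch shows one way to do it; the name parse_register is hypothetical, and only decimal register numbers are handled, which is an assumption rather than documented behavior.

```python
import re

# Optional 'r', 'R' or '$' prefix followed by a decimal register number.
_REGISTER = re.compile(r"^[rR$]?(\d+)$")


def parse_register(token):
    """Return the register number for a token such as '1', 'r1', 'R1' or '$1'."""
    match = _REGISTER.match(token)
    if match is None:
        raise ValueError("not a register operand: %r" % token)
    return int(match.group(1))


print([parse_register(t) for t in ("1", "r1", "R1", "$1")])
```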

Mnemonics cannot be used as user-declared symbols or labels. Any symbol with a '.' as the first character (e.g. .local) is considered to be an assembler directive. The directive names currently used by the assembler are:

.header
.function .func
.end
.param .local .upvalue .const

The assembler has three predefined keywords (case-sensitive) that are used as operands in certain instructions or directives:

nil true false

Labels are explicitly identified when a symbol has a ':' suffix. Labels can also be placed at the beginning of most statements without the ':' suffix. All labels or symbols must be defined before they can be used, except for labels used in jumps, which can be forward-referenced. Examples:

FOO: move R0 R1
BAR move $2 $3
BAZ .function

As of version 0.5.0, symbolic names occupy several tables associated with different kinds of objects. Thus it is possible to use identical names for different things. A single symbol table would avoid such confusion. I consider the use of multiple symbol tables broken, and it will be fixed in a future release.


Assembler Directives

There are two kinds of directives: header directives, which customize the headers written to the binary chunk, and resource directives, which declare named resources for a function.

As ChunkBake provides sane defaults, there is often little need to use header directives unless you are generating a non-standard binary chunk. Likewise, you can choose not to specify constants explicitly using resource directives and let the assembler handle it for you.

In the following sections, each directive will be described in more detail.


.header

.header [key=<string>|<number>[,]]...

The .header directive is an optional directive; you only need to use it to enforce a value for a particular global header parameter. If you are generating a generic binary chunk targeted towards the platform the assembler is running on, you will not need to use it at all.

Apart from being able to customize the global header elements of a Lua 5 binary chunk, the .header directive is also able to set a number of other parameters that govern the generation of the binary chunk, e.g. the maximum number of allowed upvalues can be changed.

Parameters are specified as key-value pairs, optionally separated by a comma. If a statement is very long, it can be split into multiple .header directives, or the line can be split into several physical lines using the '\' line continuation character. If a key is specified a number of times, the last value is taken as the final valid value. The .header directive can only be used before the definition of the first function prototype.

Keys are case-insensitive names, while values can be numbers (usually integers) or strings. For example:

.header signature="\27Foo"

The directive changes the binary chunk signature from the usual "\27Lua" to "\27Foo". In most cases, illegal values are flagged as errors. The assembler allows binary chunk signatures that are not 4 bytes in length; a warning message is generated instead of an error.
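
The rules for repeated keys and case-insensitive key names can be summarized in a few lines of Python. This is only a sketch of the behavior described above, not ChunkBake's implementation; merge_header and the list-of-pairs input format are hypothetical.

```python
def merge_header(directives):
    """Fold (key, value) pairs from one or more .header lines into a dict.

    Keys are compared case-insensitively and a later value replaces an
    earlier one.
    """
    merged = {}
    for key, value in directives:
        merged[key.lower()] = value
    return merged


pairs = [("signature", "\x1bFoo"), ("SIZE_OP", 8), ("size_op", 6)]
print(merge_header(pairs))
```

Here SIZE_OP and size_op name the same parameter, so the final value of size_op is 6.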

The parameters that can be changed are as follows:

Key              Type/Unit            Example                    Default (x86)
signature        string               signature = "ABCD"         "\27Lua"
version          number               version = 0x55             0x50
endianness       number, 0 or 1       endianness = 0             1
int              number, bytes        int = 8                    4
size_t           number, bytes        size_t = 6                 4
instruction      number, bytes        instruction = 6            4
size_op          number, bits         size_op = 8                6
size_a           number, bits         size_a = 9                 8
size_b           number, bits         size_b = 10                9
size_c           number, bits         size_c = 10                9
number_type (1)  string, data type    number_type = "single"     "double"
maxstack         number               maxstack = 100             250
maxvars          number               maxvars = 50               200
maxupvalues      number               maxupvalues = 10           32
maxparams        number               maxparams = 50             100
fpf              number               fpf = 16                   32

(1) number_type accepts "double", "single", "int", "long long"

The first 11 keys in the list are almost equivalent to their counterpart fields in a standard Lua 5 binary chunk header. The last 5 keys are constants you probably shouldn't touch unless you are running a Lua binary that has non-standard constraints and you know what to do and what the keys mean.
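
For readers who want to see where these values sit in a chunk, the Python sketch below splits the leading bytes of a Lua 5.0 binary chunk into named fields. The field order follows the Lua 5.0 header layout as I understand it (4-byte signature, then one byte per field); it is not ChunkBake code, and the names parse_header, HEADER_FIELDS and number_size are hypothetical. The header stores the size of the number type in bytes, so number_size = 8 corresponds to the "double" default. The test number that follows these bytes is not parsed.

```python
HEADER_FIELDS = (
    "version", "endianness", "int", "size_t", "instruction",
    "size_op", "size_a", "size_b", "size_c", "number_size",
)


def parse_header(chunk):
    """Split the leading bytes of a Lua 5.0 binary chunk into named fields."""
    if len(chunk) < 4 + len(HEADER_FIELDS):
        raise ValueError("chunk too short for a Lua 5.0 header")
    header = {"signature": chunk[:4]}
    # Each remaining field is a single byte, in the order listed above.
    for offset, name in enumerate(HEADER_FIELDS, start=4):
        header[name] = chunk[offset]
    return header


# The x86 defaults from the table, written out as header bytes.
x86 = b"\x1bLua" + bytes([0x50, 1, 4, 4, 4, 6, 8, 9, 9, 8])
print(parse_header(x86))
```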

Note that the Lua binary chunk format may change with different versions of Lua. The upcoming Lua 5.1 has a slightly different binary chunk from that of Lua 5.0.2, so beware when changing version.

Errors may be generated if you run afoul of internal constraints, e.g. the instruction opcode field SIZE_OP must be at least 6 bits wide. Here is an example of a custom .header directive:

.header signature = "Dude", \
        version = 0x57, \
        endianness = 0, \
        int = 6, size_t = 7, \
        instruction = 5, \
        SIZE_OP=8, SIZE_A=10, \
        SIZE_B=11, SIZE_C = 11, \
        number_type = "single"

If you don't want to change the detected default parameters, you don't need to use .header at all.


.function and .end

.func|.function [key=<string>|<number>[,]]...
.end

Each function must be started by a single .function directive and ended by a .end directive. .func, an alias of .function, can also be used. There can be only one top-level function and functions must have at least one valid instruction. Thus the simplest assembly listing that can be written is:

.func
return 0 0
.end

The key-value pairs for customizing a function are:

Key              Type/Unit            Example
source_name      string               source_name="foo.lua"
line_defined     number               line_defined=10
numparams        number               numparams=2
is_vararg        number, 0 or 1       is_vararg=1
maxstacksize     number               maxstacksize=10

Unlike .header parameters, you usually don't need to customize a .function directive. Normally, the assembler will fill in the parameters for you. The source_name parameter is set by the assembler for the top-level function, as is line_defined. numparams is automatically calculated when you use the .param directive, and the assembler keeps track of stack locations used in order to set maxstacksize, so usually it is wiser to leave them be.

For functions that accept a variable number of arguments, is_vararg should be set to 1, and the implicit parameter arg will be in stack position 0. An example of a function that accepts a variable number of arguments is printf.asm in the /test directory.

Here is an example of function header customization:

.function source_name = "Morpheus", \
          line_defined = 1234, \
          numparams=6, is_vararg=1, \
          maxstacksize=47

So, for normal use, you won't need to specify anything for the .function directive, except for is_vararg when you need functions that accept a variable number of arguments.

Functions can be named using labels, so that names can be used instead of numbers when writing a closure instruction. The following are two functions named ONE and TWO:

ONE .function
return 0 0
.end

TWO: .function
return 0 0
.end

Next come the directives that are used to declare named resources for a function.


.param

.param <symbol>

.param declares a register location as a function parameter, or argument, and assigns a name to the parameter. Since parameters occupy the lowest register numbers, the assembler assigns the locations for you automatically, starting from location 0. You are not allowed to declare any local variables before parameter declarations. The numparams field in the function header is automatically updated by the assembler.

.function
.param Var
.param Pos
add 2, Var, Pos
; more instructions here
.end

In the example above, Var is assigned register location 0 while Pos is assigned register location 1. The add instruction adds parameters Var and Pos and saves the result in register 2. You can still use 0 and 1 as the parameter locations in an instruction; the usage of symbolic names and numbers is up to the user.

A parameter declaration need not be placed immediately after .function. You only need to ensure that all local variables are declared after parameter declarations, and that symbolic names are declared before they are used.


.local

.local <symbol> [[,] <number>]

.local declares local variable names. You can either let the assembler assign the next lowest available location that does not have a symbolic name associated with it (from either a .param or a .local declaration), or specify the register number explicitly.

.function
.param Foo
.local Bar
.local Baz, 4
; rest of function goes here
.end

Since location 0 is occupied by Foo, the first .local directive associates the name Bar to location 1. The next .local directive explicitly assigns the name Baz to location 4. The comma separator is optional. Now, if there is another .local directive in the function without a specified location number, location 2 will be used.
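
The assignment rule can be modelled as "pick the lowest location that has no name yet". The Python sketch below reproduces the example under that reading. The class RegisterNames and the extra name Qux are hypothetical, and the sketch does not check for duplicate names or for parameters declared after locals.

```python
class RegisterNames:
    """Track which register locations already carry a symbolic name."""

    def __init__(self):
        self.by_name = {}

    def declare(self, name, location=None):
        used = set(self.by_name.values())
        if location is None:
            # Pick the lowest location that has no name yet.
            location = 0
            while location in used:
                location += 1
        self.by_name[name] = location
        return location


names = RegisterNames()
print(names.declare("Foo"))      # .param Foo
print(names.declare("Bar"))      # .local Bar
print(names.declare("Baz", 4))   # .local Baz, 4
print(names.declare("Qux"))      # a later .local without a number
```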

Locals must be declared after all parameters have been declared. Other than that, locals can be declared anywhere inside a function, as long as symbolic names are declared before they are used.


.upvalue

.upvalue <symbol> [[,] <number>]

The syntax of .upvalue is exactly like the syntax of .local. Assignment of upvalue numbers can be automatic or explicit. The assembler sets the number of upvalues (which is a byte-sized field) in the function header for you.

.function
.upvalue Larry
.upvalue Moe 2
; rest of function goes here
.end

Upvalue number 0 is assigned the name Larry, while upvalue number 2 is assigned the name Moe. The operands of the directive can be optionally separated by a comma. When the upvalue list is written out, non-declared values in the upvalue list are assigned the name "(none)" so that loaders don't complain of zero-length names.
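
A small Python sketch of the gap-filling step follows. It assumes that the written list runs from upvalue 0 up to the highest declared upvalue number; the function name upvalue_name_list and the dictionary input are hypothetical.

```python
def upvalue_name_list(declared):
    """Build the ordered upvalue name list from a {number: name} mapping."""
    if not declared:
        return []
    size = max(declared) + 1
    # Undeclared slots get the placeholder name "(none)".
    return [declared.get(index, "(none)") for index in range(size)]


print(upvalue_name_list({0: "Larry", 2: "Moe"}))
```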


.const

.const [<symbol> [,]] <number>|<immediate>|<string>|nil [[,] <number>]

Constants are declared using the .const directive. For constants, both the constant number and the symbolic name of the constant are optional. Leaving out the constant number forces automatic constant number assignment. If no name is associated with a constant, it must be accessed using the constant number.

The constant may be a string, an immediate number (with a leading '#'), a number or nil. A number is interpreted as an immediate, but if you are not comfortable with the ambiguity, you can always declare numbers using the immediate syntax.

Here are some unnamed constant declarations with automatic constant number assignment:

.function
.const 7.8e-9         ; constant number 0, a number, 7.8e-9
.const #12345         ; constant number 1, a number, 12345
.const "Eastwood"     ; constant number 2, a string, "Eastwood"
.const nil            ; constant number 3, a nil
; rest of the function goes here
.end

Constants can be declared anywhere within a function as long as symbolic names are declared before use. Here are more constant declarations:

.const FOO #12.34        ; FOO is a constant with value 12.34
.const BAR, "Eastwood"   ; BAR is a constant with value "Eastwood"
.const Beef, #0xBEEF, 3  ; constant number 3 is named Beef, value 0xBEEF
.const #0xDEAD 2         ; constant number 2 has value 0xDEAD

Commas between operands are optional. Constant declarations with identical values are not merged; as far as the assembler is concerned, it is dealing with separate constants.

In order to simplify the writing of Lua assembly code, it is possible to write code without a single .const declaration. The assembler allows constants to be used directly as operands wherever there is an RK(x) operand or a Kst(x) operand. This is also why numbers need to be disambiguated from numerical constants (immediates). When constants are used directly as instruction operands, the assembler will automatically assign constant numbers to the constants, and duplicate constants will use the same constant number.
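
The difference between declared and directly used constants can be sketched in Python as follows. This is an illustration of the two rules only, not ChunkBake's data structure: the class ConstantPool is hypothetical, explicit constant numbers are left out, and the sketch assumes that a directly used constant is pooled only with other directly used constants, never with a .const declaration of the same value.

```python
class ConstantPool:
    """Assign constant numbers to declared and directly used constants."""

    def __init__(self):
        self.values = []     # constant number -> value
        self.implicit = {}   # (type name, value) -> constant number

    def declare(self, value):
        # A .const declaration always gets its own slot, even for
        # a value that has been declared before.
        self.values.append(value)
        return len(self.values) - 1

    def use(self, value):
        # A constant written directly as an operand is pooled, so a
        # repeated value reuses the constant number assigned earlier.
        key = (type(value).__name__, value)
        if key not in self.implicit:
            self.implicit[key] = self.declare(value)
        return self.implicit[key]


pool = ConstantPool()
print(pool.declare("Eastwood"), pool.declare("Eastwood"))  # two separate slots
print(pool.use(12.0), pool.use(12.0))                      # one shared slot
```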

The user is given the ultimate control as to how constants are declared or used in functions in the assembly code. The intent of the handling system for constants is that important variables can be declared as named symbols, while trivial constants can be used directly as operands in instructions.


Labels and Symbols

Labels can be used for .param, .local, .upvalue and .const. Since you can name a function's resources using the declarations only, labels are a kind of aliasing mechanism. For example:

Neo: .const Anderson, "Thomas Anderson"

In the example above, both Anderson and Neo point to the same constant number, which holds the string "Thomas Anderson". Although this mechanism is available to the user, it is probably a good idea to use it only when there are good reasons to do so.


Instruction Operands

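The following are descriptions of common Lua 5 virtual machine instruction operands:

R(x)          register (local stack location) number x
Kst(x)        constant number x
RK(x)         register x if x is below maxstack (250 by default), otherwise constant number x - maxstack
Upvalue[x]    upvalue number x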

The assembler tracks register usage by noting the R(x) and RK(x) operands and sets the maxstacksize field in a function header accordingly. So normally, the user does not need to do anything about the maxstacksize field. You can explicitly set maxstacksize too, but if your custom value is too small, then the assembler will flag an error.
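
The split between registers and constants in an RK(x) operand can be sketched in Python. This follows the way the Lua 5.0 virtual machine decodes such operands as I understand it, using the default maxstack of 250; the function name resolve_rk is hypothetical, and whether ChunkBake applies a customized maxstack in exactly this way is an assumption.

```python
MAXSTACK = 250  # default "maxstack" value from the .header table


def resolve_rk(x, maxstack=MAXSTACK):
    """Classify an RK(x) operand as a register or a constant."""
    if x < 0:
        raise ValueError("operand must be non-negative")
    if x < maxstack:
        return ("register", x)
    return ("constant", x - maxstack)


print(resolve_rk(2))     # a register operand
print(resolve_rk(250))   # the first constant
print(resolve_rk(251))   # the second constant
```

This is why operands such as 250 and 251 in the instruction examples refer to constants rather than registers.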

Other operands are described within the descriptions of Lua 5 virtual machine instructions in the next section.


Lua 5 Instructions

As in most classic assemblers, instruction statements are written on a per-line basis. In the case of ChunkBake, 'per-line' refers to a logical line, since a '\' can be used as a line continuation character. Also, strings can occupy more than one line.

Operands are optionally separated by commas. Thus, in the syntax descriptions below, any comma can be omitted. This relaxed rule is to allow for different writing styles. It is recommended that the user stick to a single style in a program. In some cases, '..' can be optionally used to replace a comma to denote a range of registers or values. This helps to make the intent of the instruction clear, e.g. loadnil 0..2

Many illegal conditions are trapped by the assembler; however, writing illegal or incorrect instructions is by no means impossible. The assembler lacks a comprehensive warning system at the moment, so it is largely up to the user to enforce some discipline in coding.

The following describes only how instructions are written in the ChunkBake assembler; it does not describe what each instruction does. For help on the latter, please see the beginning of this document. Most of the examples in the following are copied from the automatic test script, TestRig.lua.


MOVE R(A) , R(B)

Moves values between register locations. Examples:

MOVE 0 1
move R3 R4
move $5, $6
move $7, R8
move FOO, BAR


LOADK R(A) , Kst(Bx)

Loads a constant into a register. Examples:

loadk $1, 0
LOADK R1, FOO
loadk 2, "foobar"
loadk $0 #1234


LOADBOOL R(A) , B(0|1|true|false) , C(0|1)

Loads a boolean into a register. The boolean value can be specified as a number (0 for false, 1 for true) or the true and false keywords can be used. The flag in field C for skipping the next instruction may be specified as a 0 (don't skip) or 1 (skip the next instruction.) Examples:

loadbool R0, 0, 0
loadbool $1, 1, 1
loadbool 2 true 0
loadbool 3 false, 1


LOADNIL R(A) (,|..) R(B)

Sets a range of registers to nil. In order to make the intent of the instruction clear, '..' can be used instead of a comma to separate the two operands, thus showing that the instruction is operating on the given register range. The second register must be greater than or equal to the first register. Examples:

loadnil 0,0           ; a single register
loadnil R1..R3        ; makes it clear R1,R2,R3 are set
loadnil 2 $4
loadnil FOO .. BAR


GETUPVAL R(A) , Upvalue[Bx]
SETUPVAL R(A) , Upvalue[Bx]

Loads or saves an upvalue. Examples:

getupval $1, 0
setupval R2 FOO


GETGLOBAL R(A) , Kst(Bx)
SETGLOBAL R(A) , Kst(Bx)

Loads or saves a global variable. Examples:

getglobal R0, FOO
setglobal $1, FOO
getglobal 1, 0
setglobal 2, "bar"
getglobal 3 #0xBEEF


GETTABLE R(A) , R(B) , RK(C)
GETTABLE R(A) , R(B) "[" RK(C) "]"

Gets a value from a table. The table key can be enclosed in square brackets for increased readability. Examples:

gettable 1 0 2
gettable 1, R2, 250
gettable $2 3 #0xBEEF
gettable R4 R5 "foobar"
gettable R0, $1[251]
gettable 1 2[3]
gettable $3, 2["trinity"]


SETTABLE R(A) , RK(B) , RK(C)
SETTABLE R(A) "[" RK(B) "]" , RK(C)

Sets the value of a given table key. The table key can be enclosed in square brackets for increased readability. Examples:

settable 1 2 3
settable R2, 250, $1
settable $1 #0xDEAD, 3
settable R4 "foobar" R5
settable R4 R2 "foobar"
settable $1[251], R0
settable 2["trinity"] $3


NEWTABLE R(A) , (<number>|<immediate>) , (<number>|<immediate>)

Creates a new table object and assigns it to the given register. The assembler accepts either encoded size values (as integer numbers) or raw size values (as immediate numbers). Examples:

newtable 2 0 0               ; table of: array=0, hash=0
newtable R1, 10, 4           ; table of: array=4, hash=16
newtable 1 #200 #200         ; table of: array=224, hash=256
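
The array sizes in the comments above follow from the "floating point byte" encoding that Lua 5.0 uses for the B operand. The Python sketch below mirrors that encoding as I understand it from the Lua 5.0 sources; the function names are borrowed from Lua's C helpers and this is not ChunkBake code. The hash operand is left out of the code: going by the examples, an encoded value of 4 stands for 2^4 = 16 entries, but how raw hash sizes are rounded is not spelled out here.

```python
def int2fb(x):
    """Encode a size as a 'floating point byte' (exponent << 3 | mantissa)."""
    exponent = 0
    while x >= 8:
        x = (x + 1) >> 1   # halve, rounding up
        exponent += 1
    return (exponent << 3) | x


def fb2int(fb):
    """Decode a 'floating point byte' back into a size."""
    return (fb & 7) << (fb >> 3)


print(fb2int(10))            # encoded operand 10
print(int2fb(200))           # raw operand #200, encoded
print(fb2int(int2fb(200)))   # the array size actually requested
```

Because the encoding keeps only three mantissa bits, a raw size such as 200 is rounded up to the next representable size, 224.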


SELF R(A) , R(B) , RK(C)
SELF R(A) , R(B) "[" RK(C) "]"

Sets up an object call. Part of the operation includes a table lookup, hence the syntax of self is similar to the syntax of gettable. Examples:

self R1 R2 R3
self 1,2,3
self $2 3[R4]
self R1 R0[FOO]
self R2 R1,250
self $1 BAR[BAR]
self BAR $1[#0xBEEF]
self BAR $1["morpheus"]


ADD R(A) , RK(B) , RK(C)
SUB R(A) , RK(B) , RK(C)
MUL R(A) , RK(B) , RK(C)
DIV R(A) , RK(B) , RK(C)
POW R(A) , RK(B) , RK(C)

Binary arithmetic operations. Examples:

add 0,0,0
add $0,R1,2
add FOO, 0, BAR
sub R1, #123, #456
mul $2, "foo", "bar"
div 3, FOO, 250
pow 0, R1, BAR


UNM R(A) , R(B)
NOT R(A) , R(B)

Unary minus and logical not operations. Examples:

unm 1, $2
not R3 R4
unm FOO, BAR
not BAR FOO


CONCAT R(A) , R(B) (,|..) R(C)

Concatenates a range of registers and assigns the result to a register. A '..' can be used instead of a comma to separate the operands that make up the range of registers. Examples:

concat $0 $1 $3
concat 0,1,1
concat R1 R2..R4
concat 0, FOO..BAR


EQ (0|1) , RK(B) , RK(C)
LT (0|1) , RK(B) , RK(C)
LE (0|1) , RK(B) , RK(C)

Relational operations. The A field can be 0 or 1, which signifies the result to be tested for. The next instruction is skipped if the comparison result is not the same as the expected result. Examples:

eq 0, 1, 2
eq 1 $2 $1
eq 0 R1 R2
eq 0 250 251
lt 0, FOO, BAR
lt 1 #123 #456
le 0 "foo" "bar"
le 1 #47 "47"


TEST R(A) , R(B) , (0|1)

Test operation. The C field can be 0 or 1, which signifies the result to be tested for. The next instruction is skipped if the comparison result is not the same as the expected result. Examples:

test 1, 2, 0
test $2 $1, 1
test R1 R2 0
test FOO BAR 0


JMP label|disp

Performs an unconditional jump. Jumps can be to absolute positions, to relative positions or to a label. An absolute jump target must be a positive integer, but bounds checking is not currently performed, so illegal instructions can be coded. The same applies to relative jumps. Also, the assembler does not currently check the case where the jmp instruction jumps to itself, creating an infinite loop. In future versions of ChunkBake, the user will be warned of such cases. Examples:

jmp #0         ; relative jumps
jmp #-10
jmp #10
jmp 3          ; absolute jump
jmp BAR        ; labeled jump


CALL R(A) , B , C

Calls a closure. Examples:

call $0, 0, 0
call R1, 2, 3
call 2 3 1


TAILCALL R(A) , B
RETURN R(A) , B

Returns to calling function or performs a tail call. Examples:

return $1 2
return R2 20
return 3, 4
return FOO, 8
tailcall $1 2
tailcall R2 20
tailcall 3, 4
tailcall BAR, 66


FORLOOP R(A) , label|disp

Performs an iteration of a numeric for loop. See the jmp instruction on the use and specification of jumps. Examples:

forloop 1 FOO
forloop R1, BAR


TFORLOOP R(A) , C

Performs an iteration of a generic for loop. Examples:

tforloop 0 0
tforloop $1, 4
tforloop R2, 10


TFORPREP R(A) , label|disp

Performs the initialization for a generic for loop. See the jmp instruction on the use and specification of jumps. Examples:

tforprep 1 FOO
tforprep R1, BAR


SETLIST R(A) , Bx
SETLIST R(A) , start [..] end
SETLISTO R(A) , Bx
SETLISTO R(A) , start [..] end

Sets a table with a list of values in a range of registers. You can either encode the range and specify a single number, or specify a range and let the assembler calculate the Bx value. setlisto is used when the final element of the list is a function call, which has an indeterminate number of return values. Examples:

setlist $1, 10             ; index 1 to 11
setlist R2, 40             ; index 33 to 41
setlist 0 1..10            ; index 1 to 10
setlist 1, 33..42          ; index 33 to 42
setlisto 1, 65..74         ; index 65 to top

The encoded number and the range specification must be valid, or else the assembler will report an error. Note that FIELDS_PER_FLUSH (or fpf in the .header directive) can be customized, so the index range calculations will be adjusted accordingly.
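
The relationship between a range and its encoded Bx value can be sketched in Python. The arithmetic follows the Lua 5.0 definition of SETLIST as I understand it (the low part of Bx counts the elements, the rest selects a block of FIELDS_PER_FLUSH indices) and agrees with the comments in the examples above. The function names encode_range and decode_bx are hypothetical, and the validity checks are an assumption about what "valid" means here.

```python
FPF = 32  # default FIELDS_PER_FLUSH ("fpf" in the .header table)


def encode_range(start, end, fpf=FPF):
    """Turn a 1-based index range into a Bx operand."""
    count = end - start + 1
    # A range must begin at the start of a block and fit inside it.
    if (start - 1) % fpf != 0 or not 1 <= count <= fpf:
        raise ValueError("invalid range %d..%d" % (start, end))
    return end - 1


def decode_bx(bx, fpf=FPF):
    """Return the (start, end) index range selected by a Bx operand."""
    base = bx - bx % fpf
    return base + 1, bx + 1


print(decode_bx(10), decode_bx(40))               # encoded operands 10 and 40
print(encode_range(1, 10), encode_range(33, 42))  # ranges 1..10 and 33..42
```

With a customized fpf value, the same functions can be called with the fpf argument set accordingly.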


CLOSE R(A)

Closes local variables on the stack from the specified register onwards. Examples:

close 0
close FOO



CLOSURE R(A) , Bx

Creates a closure or instantiation of a function, assigning the result to a register. If the function prototype has a label, the label can be used to refer to it. Examples:

closure 1, 0
closure R2 1
closure FOO 2
closure $0 PRINTF           ; see the printf.asm test example


Examples

A number of examples can be found in the test directory of the ChunkBake distribution. The automatic test system, TestRig.lua in the misc directory, has a pretty complete coverage of both legal and illegal forms of all instructions.


This page Copyright © 2005 KHMan under the same terms as ChunkBake. See ChunkBake for license information (MIT-style). Last Revised: 2005-03-04. Personal: http://www.geocities.com/keinhong/ | Project URL: http://luaforge.net/projects/chunkbake/