, 6 min read

Performance Comparison Pallene vs. Lua 5.1, 5.2, 5.3, 5.4 vs. C

Installing Pallene is described in the previous post: Installing Pallene Compiler. In this post we test the performance of Pallene versus C, Lua 5.4, and LuaJIT. Furthermore we benchmark different Lua versions starting with Lua 5.1 up to 5.4.

1. Array Access. I checked a similar program as in Performance Comparison C vs. Lua vs. LuaJIT vs. Java.

function lua_perf(N:integer, S:integer)
        local t:{ {a:float, b:float, f:float} } = {}

        for i = 1, N do
                t[i] = {
                        a = 0.0,
                        b = 1.0,
                        f = i * 0.25
                }
        end

        for j = 1, S-1 do
                for i = 1, N-1 do
                        t[i].a = t[i].a + t[i].b * t[i].f
                        t[i].b = t[i].b - t[i].a * t[i].f
                end
                --io_write( t[1].a )
        end
end

This program, which does no I/O at all, runs in 0.14s, and therefore runs two times slower than the LuaJIT, which finishes in 0.07s. This clearly is somewhat disappointing. Lua 5.4, as part of Pallene, needs 0.75s. So Pallene is roughly five times faster than Lua.

$ time ./lua/src/lua lua_perf/main.lua   
        real 0.14s
        user 0.13s
        sys 0
        swapped 0
        total space 0

Here, the calling main.lua routine is just

package = require "lua_perf.lua_perf"
N = 4000
S = 1000
package.lua_perf(N,S)

Runtimes are now:

  1. C (no optimization flags): 0.09s
  2. C: 0.02s
  3. Lua 5.1: 1.34s
  4. Lua 5.2: 1.49s
  5. Lua 5.3: 0.95s
  6. Lua 5.4: 0.72s
  7. LuaJIT: 0.07s
  8. Pallene: 0.14s

As explained in Pallene: A companion language for Lua, array handling in LuaJIT is still more efficient than in Pallene.

The only benchmark where LuaJIT is substantially faster than Pallene is Matmul. We have found that this difference is due to memory access. LuaJIT uses the NaN-boxing technique to pack arbitrary Lua values and their type tags inside a single IEEE-754 floating-point number [28]. In particular, this means that in LuaJIT an array of floatingpoint numbers consumes only 8 bytes per number, against the 16 bytes used by Lua and Pallene. This results in a higher cache miss rate and worse performance for Pallene.

2. Integer Computation. Below program computes the number of configurations for multiple chess queens to be positioned without attacking each other. It is the well known n-queen problem. Often times this is solved in a recursive fashion, but the most obvious program is non-recursive and uses goto-statements. Below code is in xdamcnt2.c on GitHub.

#define abs(x)   ((x >= 0) ? x : -x)

/* Check if k-th queen is attacked by any other prior queen.
   Return nonzero if configuration is OK, zero otherwise.
*/
//inline        --> not faster, klm, 24-Apr-2020
int configOkay (int k, int a[]) {
        int z = a[k];

        for (int j=1; j<k; ++j) {
                int l = z - a[j];
                if (l == 0  ||  abs(l) == k - j) return 0;
        }
        return 1;
}

long solve (int N, int a[]) {  // return number of positions
        long cnt = 0;
        int k = a[1] = 1;
        int N2 = N;  //(N + 1) / 2;

        loop:
                if (configOkay(k,a)) {
                        if (k < N)  { a[++k] = 1;  goto loop; }
                        else ++cnt;
                }
                do
                        if (a[k] < N)  { a[k] += 1;  goto loop; }
                while (--k > 1);
                a[1] += 1;
                if (a[1] > N2) return cnt;
                k = 2 ,  a[2] = 1;
        goto loop;
}

Compiled the program with all optimization (-march=native -O3), then ran this program.

$ time xdamcnt2 1 12
 n-queens problem.
   2   4    6     8      10         12           14
 1 0 0 2 10 4 40 92 352 724 2680 14200 73712 365596

 D( 1) = 1
 D( 2) = 0
 D( 3) = 0
 D( 4) = 2
 D( 5) = 10
 D( 6) = 4
 D( 7) = 40
 D( 8) = 92
 D( 9) = 352
 D(10) = 724
 D(11) = 2680
 D(12) = 14200
        real 0.21s
        user 0.21s
        sys 0
        swapped 0
        total space 0

Above C code has been translated to Lua 5.1, which does not provide goto-statements. We deliberately chose Lua 5.1, because LuaJIT can only handle Lua 5.1 syntax. Only starting with Lua 5.2 the goto-statement is allowed in Lua. Not having a goto-statement required to use auxiliary variables, here called flag, then checking with if-statements. This file is called xdamcnt2.lua on GitHub.

-- Check if k-th queen is attacked by any other prior queen.
function configOkay (k, a)
        local z = a[k]
        local kmj
        local l

        for j=1, k-1 do
                l = z - a[j]
                kmj = k - j
                if (l == 0  or  l == kmj  or  -l == kmj) then
                        return false
                end
        end
        return true
end

function solve (N, a)   -- return number of positions
        local cnt = 0
        local k = 1
        local N2 = N  --(N + 1) / 2;
        local flag
        a[1] = 1

        while true do
                flag = 0
                if (configOkay(k,a)) then
                        if (k < N) then
                                k = k + 1;  a[k] = 1; flag = 1
                        else
                                cnt = cnt + 1; flag = 0
                        end
                end
                if (flag == 0) then
                        flag = 0
                        repeat
                                if (a[k] < N) then
                                        a[k] = a[k] + 1;  flag = 1; break;
                                end
                                k = k - 1
                        until (k <= 1)
                        if (flag == 0) then
                                a[1] = a[1] + 1
                                if (a[1] > N2) then return cnt; end
                                k = 2;  a[2] = 1;
                        end
                end
        end
end

The Lua code was then translated to Pallene by adding the required types to each variable. This translation is straightforward. File is called xdamcnt2.pln on GitHub.

-- Check if k-th queen is attacked by any other prior queen.
function configOkay (k:integer, a:{integer}):boolean
        local z:integer = a[k]
        local l:integer
        local kmj:integer 

        for j=1, k-1 do
                l = z - a[j]
                kmj = k - j
                if (l == 0  or  l == kmj  or  -l == kmj) then
                        return false
                end
        end
        return true
end

function solve (N:integer, a:{integer}):integer -- return number of positions
        local cnt:integer = 0
        local k:integer = 1
        local N2:integer = N  --(N + 1) / 2;
        local flag:integer
        a[1] = 1

        while true do
                flag = 0
                if (configOkay(k,a)) then
                        if (k < N) then
                                k = k + 1;  a[k] = 1; flag = 1
                        else
                                cnt = cnt + 1; flag = 0
                        end
                end
                if (flag == 0) then
                        flag = 0
                        repeat
                                if (a[k] < N) then
                                        a[k] = a[k] + 1;  flag = 1; break;
                                end
                                k = k - 1
                        until (k <= 1)
                        if (flag == 0) then
                                a[1] = a[1] + 1
                                if (a[1] > N2) then return cnt; end
                                k = 2;  a[2] = 1;
                        end
                end
        end
end

Runtimes are now:

  1. C (no optimization flags): 0.53s
  2. C: 0.2s
  3. Lua 5.1: 6.15s
  4. Lua 5.2: 7.2s
  5. Lua 5.3: 5.83s
  6. Lua 5.4: 3.92s
  7. LuaJIT: 0.58s
  8. Pallene: 0.32s
  9. Pallene with compiler optimization flags: 0.31s

This time Pallene is two times faster than LuaJIT, and it is 13-times faster than Lua 5.4. Additionally we observe that Lua 5.4 is faster than all previous versions by a large margin. Interesting enough, the C program only needs 60% of the Pallene running time.

3. Environment. All tests were done on an AMD FX-8120 Octacore processor with max. 3.1 GHz. This processor has 64 KiB L1 data cache.

$ lscpu
Model:                1
Model name:           AMD FX(tm)-8120 Eight-Core Processor
CPU max MHz:          3100.0000
CPU min MHz:          1400.0000
L1d cache:            64 KiB
L1i cache:            256 KiB
L2 cache:             8 MiB
L3 cache:             8 MiB

This CPU runs Arch Linux 5.6.6 and gcc 9.3.0.

Above runtime comparisons are in line with the benchmarks in Pallene: A companion language for Lua.

Added 17-Jul-2020: Cited in LWN: What's new in Lua 5.4.

Added 27-Nov-2022: A strong contender to replace LuaJIT might be Haoran Xu's LuaJIT Remake (LJR).

Added 30-Oct-2023: Referenced in Dragonfly + BullMQ = massive performance. Dragonfly is a faster Redis variant, BullMQ is a Node.js library with queuing.