## Playing with LISP and LISP vs C

Lisp is fun! Well, since I first knew about Lisp I was fascinated, but I have found it hard to learn Lisp and to play with it in a meaningful way. A few years ago I wrote about it here and here. As usual, the first steps of learning something new can be the hardest.

Occationally I use Hackerrank to find programming challanges to solve for fun. I particularly like the Project Euler competition. I find it particularly good for trying out new languages: you get a “meaningful” challenge, a simple environment prepared in your web browser, and automated test cases. So, this time I didn’t waste my time trying to find the right Lisp implementation for me, I just started hacking on Project Euler 38 and 39 on Hackerrank.

Problem 38 was quite simple, but 39 was more interesting. When I had solved it, I found my implementation was not at all fast enough, so I started experimenting locally (the Hackerrank environment is not optimal for tweaking, optimization and debugging).

Choosing a (Common) Lisp implementation
There are quite many Common Lisp implementations out there. The one Hackerrank uses is SBCL. That is clearly the Common Lisp implementation I would recommend (based on my little experience) if it is available for your platform.

I installed SBCL with apt-get in Ubuntu. I also downloaded binaries directly for my Mac OS X computer and my Raspberry Pi (v1) running Arch linux. Installation is a bit non-standard, but you can actually run it without installing (just execute run-sbcl.sh in downloaded folder).

I also tried clisp and ecl, none of these could deal with the memory usage (stack size) of my program. For clisp I found no way to manipulate stack sizes at all. For ecl I made some progress but I could not make it run my program.

SBCL is a Lisp compiler, and it produces fast and efficient code. I later compared it to C.

Project Euler 39
Project Euler 39 is basically about finding integer solutions to Pythagoras theorem. For a given, large, perimeter, how many right triangles are there? For example:

300000^2 + 400000^2 = 500000^2

This triangle has a perimeter of 300000+400000+500000=1200000. What other values for a and b so that

a + b = 700000
a^2 + b^2 = 500000^2

are there? The Hackerrank challenge requires you to work with perimeters up to 5000000. If you implement a solution, a few things to immediately note:

• The squares wont fit in a 32bit integer. They will fit with no loss of precision in the 53 bits of a 64 bit double and they will also fit in a 64 bit integer. (This matters not for Common Lisp)
• If you want to do recursion (and of course you want when you code Lisp) it will be millions of recursion steps, which will be a challenge to the stack size. (This also turned out not to matter for SBCL)

The Lisp implementation
It turned out that the SBCL compiler optimized the recursion is such a way that the memory usage was quite low. SBCL successfully runs my program on RPi/Arch, Intel/Ubuntu and Intel/OSX with quite reasonable memory usage.

Since this is about learing Lisp I wanted a 100% functional programming implementation. Only pure functions. A lot of my code is about generating, modifying and testing triangles. A triangle (a b c) can obviously be represented as a Lisp list (a b c) and this was my first implementation. Then if you want to read a, b or c from a list abc, or create the list from a, b and c, you can do:

a: (car abc)
b: (car (cdr abc))
c: (car (cdr (cdr abc)))

abc: (list a b c)

I found this cumbersome. It became a lot of list-to-variables and variables-to-list overhead (I didnt care so much about performance, more about my code readability). I learnt that Lisp functions can return multiple values using value and that you can bind them with multiple-value-bind and use them as arguments to a function using multiple-value-call. This felt functional and pure enough, and it made my code 25% faster than the car/cdr pattern above.:

; a (stupid) function returning a triangle as three values
(defun get-345-triangle ()
(values 3 4 5))

; a function calculating the perimeter of triangle (from a function)
(defun triangle-perimeter-1 (tri-func)
(multiple-value-bind (a b c) (funcall tri-func)
(+ a b c)))

; and in this case you dont need to bind, you can use + directly
(defun triangle-perimeter-2 (tri-func)
(multiple-value-call #'+ (funcall tri-func)))

; now this works
(triangle-perimeter-1 #'get-345-triangle)
(triangle-perimeter-2 #'get-345-triangle)

Since I am a very inexperienced Lisp programmer I appreciate suggestions for improvement.

Performance of Lisp
My final Hackerrank submission of Lisp code executes in about 4.5 seconds on my Intel i5/Ubuntu. It takes about the same time on the Hackerrank web page, which is fast enough to pass all tests. On the Raspberry Pi v1 (ARMv6 @700 MHz) it takes more than 700 seconds. My intuition told me that 4.5 seconds was very good. This made me ask two questions. How would Lisp compare to C? And why is the ARM more than 100 times slower, how would that compare in C?

The C implementation
My ambition was to rewrite Lisp to C line by line. So my C-program has exactly the same functions which take almost exactly the same arguments. All calculations are identical and performed in exactly the same order. The C-program relies entirely on recursion instead of loops (just like the Lisp program). However…

Functions in C can not return multiple variables. While Lisp had values I decided to use a reference to a struct in C:

(defun get-a-triangle()
(values x y z))

void get_a_triangle(struct triangle *t) {
t->a = x;
t->b = y;
t->c = z;
}

If the C-triangle struct is a local variable on the callers stack the difference is quite small (from a practical point of view, from a theoretic strict functional programming perspective its a different story).

Numbers in Lisp have arbitrary precision integers and floats make no difference. So, when porting to C, I had to pick numeric types. For most purposes, int32_t was good enough. But, for the purpose of calculating Pythagoras theorem higher precision was needed (as I wrote above, the 53 bits of double, or 64 bits of int64_t are good). So I ended up with 5 versions of the C-program (to compare performance):

1. All 64-bit integers
2. 32-bit integers, 64-bit for “triangles”
3. 32-bit integers, double for “triangles”
4. 32-bit integers, 64-bit only for pythagoras calc
5. 32-bit integers, double only for pythagoras calc

(In cases 2,3 the struct triangle has int64_t/doubles properties, and all manipulations and calculations on triangles use these datatypes. In cases 4,5 everything is int32_t, except the internals of a single function, which casts to higher precision before doing its calculations.)

The C-program requires a significant stack size. The stack size can be obtain and changed like (numbers in kb, all values given with ulimit -a):

\$ ulimit -s
8192

\$ ulimit -s 100000

For my program, a stack size much higher than 8192 is needed (see below). It seems impossible to get large stack than 64Mb in Mac OS X, so my C program could never run there.

Benchmark findings
All C-programs are compiled with gcc -O2.

CPU            MHZ      SBCL        64     32/64  32/double   32(64)  32(double)
==================================================================================
Time (s)
i5-4250U 1300-2600       4.5      1.52      1.52      1.60      1,54      1.58
ARMv6          700      ~715        85        83        45        42        39
ARMv7          900       357        23        21        13        12        10

Max Res (MB)
i5-4250U                  41       103       103       103       103       103
ARMv6                     50       220       210        79       110        76
ARMv7                     57       180       160        87        97        62

This is not too easy to interpret! The ironic thing is that the fastest thing on the x64-cpu (64-bit integers everywhere) is the slowest on the ARMv6. However, the fastest option on the ARMv6 (32-bit everywhere, and when absolutely needed, use double) is almost the worst on the i5 CPU.

When it comes to the 64-bit i5, it basically does not matter what datatypes you use.

When it comes to the ARMv6, the most important thing is to not store the triangles as int64_t. The strange thing here is the stack sizes. Why does it double (compared to x64) when triangles are stored as int64_t? And the doubles, why do they reduce stack size so much (where are all these doubles actually stored)?

The time command gives max resident memory usage. If I set ulimit -s 128 the first two programs fail (with Segmentation fault 11), and the last three ones succeed, on the ARMv6.

I have found before that the performance of the ARMv6 suffers because of its slow memory and small cache. It is quite possible that the poor performance of the ARMv6 compared to the i5 is related to its slow memory, and the recursion (and stack memory) heavy algorithm.

Finally, SBCL in x64 has very good performance even compared to C (however, an iterative C-implementation, fitting completely in cache, would probably be faster). Note that I am a novice Lisp programmer and this is a math heavy program where the generic number type of Lisp will come at a cost. On the ARMv6, Lisp performance suffers much more.

Windows stack size limit
For Windows, stack size limit is set in the binary, not in the shell. With Cygwin/GCC use the flag -Wl,–stack,1000000 for one million bytes. Note that these are options passed on to the linker.

Future investigations
And I am curious about how much faster a minimal-memory-footprint loop-based C-program would perform.

The source code
Since this code solves a problem in Hackerrank I hesitate to publish it. If you want it for any other reason than just running it on Hackerrank let me know.

1. Sam Griffith

I’m wondering did you use the type hinting on the Lisp program and also set the space/speed settings for the compiler or on each function. Oftentimes you can get huge speed ups just with those. Also depending on the Lisp, you can call the disassemble function and give it a function you want to see the generated assembly for and then optimize more by playing around with the type hinting etc.

2. Not much (type hinting and space/speed settings) actually. I have tried it before but found little difference.

For large array I have found it (type hinting) important!

3. I’m a Lisp beginner myself and likely know even less about performance when it comes to programming languages. Nevertheless, I did notice a few things. Not only can you skip the variables overhead by just nesting. You can also combine functions, like the car/cdr/cdr to caddr. Also, Lisp is big on recursion, which offers the option to just loop through the list to get the elements, instead of addressing each element one by one. So, it seems there is quite some room to find better options.

Aside from that, I do play with Raspberries (and other) a lot. I ran the same Python scripts on 3 different devices. And Raspberries don’t stand a chance in comparison with every devices like PC and notebook. Not even close. They take far longer for the same tasks.

However, is it really a suprise? For example, an SD card, even in a Raspberry Pi3, usually reads at like 15-18MB/s (around 22 in a Banana Pi M3) vs. even an HDD that internally does at least 100MB/s! So, that factor alone could easily 10x the time an application or script takes! Going from SD card to USB thumb drive would almost double the speed immediately. Unfortunately, according to experience, the same would likely get your Banana Pi M3 to overheat and shutdown. CPU, RAM… So, there is a lot of difference. The banana Pi M3 for example cuts the performance down from 2GHz to 1200MHz per core as soon as you start anything remotely looking like load, if you don’t use heavy duty cooling! So, the Banana Pi M3 for example is advertise with up to 2GHz per core but essentially does only 1200MHz per core, even with light cooling!

This site uses Akismet to reduce spam. Learn how your comment data is processed.