Monthly Archives: June 2015

Building Node.js for OpenWrt (mipsel)

Update 2015-10-11:See separate post for Node version 4 for different OpenWrt targets. Information about v4 added below.

I managed to build (and run) Node.js OpenWrt and my Archer C20i with a MIPS 24K Little Endian CPU, without FPU (target=ramips/mt7620).

Node.js v0.10.40
First edit (set to false):


    54      # Similar to vfp but on MIPS.
    55      'v8_can_use_fpu_instructions%': 'false',
    63      # Similar to the ARM hard float ABI but on MIPS.
    64      'v8_use_mips_abi_hardfloat%': 'false',

For 15.05 I use this script to run configure:

#!/bin/sh -e

export CSTOOLS_INC=${CSTOOLS}/include

export CC="mipsel-openwrt-linux-uclibc-gcc"
export CXX="mipsel-openwrt-linux-uclibc-g++"
export LD="mipsel-openwrt-linux-uclibc-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=mipsel --dest-os=linux --without-npm

bash --norc

Then just “make”. I have uploaded the compiled binary node to DropBox.

Compilation for (DD) trunk (with musl rather than uclibc) fails for v0.10.40.

Node.js v4.1.2
Node.js v4 does not run without a FPU. Normally Linux emulates an FPU if it is not present, but this feature is disabled in OpenWRT. I built and published r47168 with FPU emulation and Node v4.1.2.

Node.js 4.1.2 is configured like:

#!/bin/sh -e

export CSTOOLS_INC=${CSTOOLS}/include

export CC="mipsel-openwrt-linux-musl-gcc"
export CXX="mipsel-openwrt-linux-musl-g++"
export LD="mipsel-openwrt-linux-musl-ld"

export CFLAGS="-isystem${CSTOOLS_INC}"
export CPPFLAGS="-isystem${CSTOOLS_INC}"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=mipsel --dest-os=linux --without-npm --with-mips-float-abi=soft

bash --norc

In order to run the node binary on OpenWrt you need to install:

# opkg update
# opkg install librt
# opkg install libstdcpp

The 64MB or RAM of my Archer C20i is not sufficient to run octane-benchmark (even if the node binary and the benchmark are stored on a USB drive). However, I have a Mandelbrot benchmark that I can run. For Archer C20i, timings are:

C/Soft Float                     48s
Lua                              82s
Node.js v0.10.40 (soft float)    65s
Node.js v4.1.2 (FPU emulation)  444s (63s user, 381s kernel)

Clearly, the OpenWrt developers have a good reason to leave FPU emulation out. However, for Node.js in the future, FPU emulation seems to be the only way. My Mandelbrot benchmark is of course ridiculously dependent on FPU performance. For more normal usage, perhaps the penalty is less significant.

Other MIPS?
The only other MIPS I have had the opportunity to try was my WDR3600, a Big Endian 74K. It does not work:

  • v0.10.38 does not build at all (big endian MIPS seems unsupported)
  • v0.12.* builds, but it does not run (floating point exceptions), despite I managed to build for Soft Float.

I need to try rebuilding OpenWRT with FPU emulation for ar71xx, then perhaps Node.js v4 will work.

Node.js performance of Raspberry Pi 1 sucks

In several previous posts I have studied the performance of the Raspberry Pi (version 1) and Node.js to find out why the Raspberry Pi underperforms so badly when running Node.js.

The first two posts indicate that the Raspberry Pi underperforms about 10x compared to an x86/x64 machine, after compensation for clock frequency is made. The small cache size of the Raspberry Pi is often mentioned as a cause for its poor performance. In the third post I examine that, but it is not that horribly bad: about 3x worse performance for big memory needs compared to in-cache-situations. It appears the slow SDRAM of the RPi is more of a problem than the small cache itself.

The Benchmark Program
I wanted to relate the Node.js slowdown to some other scripted language. I decided Lua is nice. And I was lucky to find Mandelbrot implementations in several languages!

I modified the program(s) slightly, increasing the resolution from 80 to 160. I also made a version that did almost nothing (MAX_ITERATIONS=1) so I could measure and substract the startup cost (which is signifacant for Node.js) from the actual benchmark values.

The Numbers
Below are the average of three runs (minus the average of three 1-iteration rounds), in ms. The timing values were very stable over several runs.

 (ms)                           C/Hard   C/Soft  Node.js     Lua
 QNAP TS-109 500MHz ARMv5                 17513    49376   39520
 TP-Link Archer C20i 560MHz MIPS          45087    65510   82450
 RPi 700MHz ARMv6 (Raspbian)       493             14660   12130
 RPi 700MHz ARMv6 (OpenWrt)        490    11040    15010   31720
 RPi2 900MHz ARMv7 (OpenWrt)       400     9130      770   29390
 Eee701 900MHz Celeron x86         295               500    7992
 3000MHz Athlon II X2 x64           56                59    1267

Notes on Hard/Soft floats:

  • Raspbian is armhf, only allowing hard floats (-mfloat-abi=hard)
  • OpenWrt is armel, allowing both hard floats (-mfloat-abi=softfp) and soft floats (-mfloat-abi=soft).
  • The QNAP has no FPU and generates runtime error with hard floats
  • The other targets produce linkage errors with soft floats

The Node.js versions are slightly different, and so are the Lua versions. This makes no significant difference.

Calculating the Mandelbrot with the FPU is basically “free” (<0.5s). Everything else is waste and overhead.

The cost of soft float is about 10s on the RPI. The difference between Node.js on Raspbian and OpenWrt is quite small – either both use the FPU, or none of them does.

Now, the interesting thing is to compare the RPi with the QNAP. For the C-program with the soft floats, the QNAP is about 1.5x slower than the RPi. This matches well with earlier benchmarks I have made (see 1st and 3rd link at top of post). If the RPi would have been using soft floats in Node.js, it would have completed in about 30 seconds (based on the QNAP 50 seconds). The only thing (I can come up with) that explains the (unusually) large difference between QNAP and RPi in this test, is that the RPi actually utilizes the FPU (both Raspbian and OpenWrt).

OpenWrt and FPU
The poor Lua performance in OpenWrt is probably due to two things:

  1. OpenWrt is compiled with -Os rather than -O2
  2. OpenWrt by default uses -mfloat-abi=soft rather than -mfloat-abi=softfp (which is essentially like hard).

It is important to notice that -mfloat-abi=softfp not only makes programs much faster, but also quite much smaller (10%), which would be valuable in OpenWrt.

Different Node.js versions and builds
I have been building Node.js many times for Raspberry Pi and OpenWrt. The above soft/softfp setting for building node does not affect performance much, but it does affect binary size. Node.js v0.10 is faster on Raspberry Pi than v0.12 (which needs some patching to build).

Apart from the un-optimized OpenWrt Lua build, Lua is consistently 20-25x slower than native for RPi/x86/x64. It is not like the small cache of the RPi, or some other limitation of the CPU, makes it worse for interpreted languages than x86/x64.

While perhaps not the best FPU in the world, the VFPv2 floating point unit of the RPi ARMv6 delivers quite decent performance (slightly worse per clock cycle) compared to x86 and x64. It does not seem like the VFPv2 is to be blamed for the poor performance of Node.js on ARM.

Conclusion and Key finding
While Node.js (V8) for x86/x64 is near-native-speed, on the ARM it is rather near-Lua-speed: just another interpreted language, mostly. This does not seem to be caused by any limitation or flaw in the (RPi) ARM cpu, but rather the V8 implementation for x86/x64 being superior to that for ARM (ARMv6 at least).

Effects of cache on performance

It is not clear to me, why is Node.js so amazyingly slow on a Raspberry Pi (article 1, article 2)?

Is it because of the small cache (16kb+128kb)? Is Node.js emitting poor code on ARM? Well, I decided to investigate the cache issue. The 128kb cache of the Raspberry Pi is supposed to be primarily used by the GPU; is it actually effective at all?

A suitable test algorithm
To understand what I test, and because of the fun of it, I wanted to implement a suitable test program. I can imagine a good test program for cache testing would:

  • be reasonably slow/fast, so measuring execution time is practical and meaningful
  • have working data sets in sizes 10kb-10Mb
  • the same problem should be solvable with different work set sizes, in a way that the theoretical execution time should be the same, but the difference is because of cache only
  • be reasonably simple to implement and understand, while not so trivial that the optimizer just gets rid of the problem entirely

Finally, I think it is fun if the program does something slightly meaningful.

I found that Bubblesort (and later Selectionsort) were good problems, if combined with a quasi twist. Original bubble sort:

Array to sort: G A F C B D H E   ( N=8 )
Sorted array:  A B C D E F G H
Theoretical cost: O(N2) = 64/2 = 32
Actual cost: 7+6+5+4+3+2+1     = 28 (compares and conditional swaps)

I invented the following cache-optimized Bubble-Twist-Sort:

Array to sort:                G A F C B D H E
Sort halves using Bubblesort: A C F G B D E H
Now, the twist:                                 ( G>B : swap )
                              A C F B G D E H   ( D>F : swap )
                              A C D B G F E H   ( C<E : done )
Sort halves using Bubblesort: A B C D E F G H
Theoretical cost = 16/2 + 16/2 (first two bubbelsort)
                 + 4/2         (expected number of twist-swaps)
                 + 16/2 + 16/2 (second two bubbelsort)
                 = 34
Actual cost: 4*(3+2+1) + 2 = 26

Anyway, for larger arrays the actual costs get very close. The idea here is that I can run a bubbelsort on 1000 elements (effectively using 1000 memory units of memory intensively for ~500000 operations). But instead of doing that, I can replace it with 4 runs on 500 elements (4* ~12500 operations + ~250 operations). So I am solving the same problem, using the same algorithm, but optimizing for smaller cache sizes.

Enough of Bubblesort… you are probably either lost in details or disgusted with this horribly stupid idea of optimizing and not optimizing Bubblesort at the same time.

I made a Selectionsort option. And for a given data size I allowed it either to sort bytes or 32-bit words (which is 16 times faster, for same data size).

The test machines
I gathered 10 different test machines, with different cache sizes and instructions sets:

	QNAP	wdr3600	ac20i	Rpi	Rpi 2	wdr4900	G4	Celeron	Xeon	Athlon	i5
								~2007   ~2010   ~2013
L1	32	32	32	16	?	32	64	32	32	128	32
L2				128	?	256	256	512	6M	1024	256
L3							1024				6M
Mhz	500	560	580	700	900	800	866	900	2800	3000	3100
CPU	ARMv5	Mips74K	Mips24K	ARMv6	ARMv7	PPC	PPC	x86	x64	x64	x64
OS	Debian	OpenWrt	OpenWrt	OpenWrt	OpenWrt	OpenWrt	Debian	Ubuntu	MacOSX	Ubuntu	Windows

Note that for the multi-core machines (Xeon, Athlon, i5) the L2/L3 caches may be shared or not between cores and the numbers above are a little ambigous. The sizes should be for Data cache when separate from Instruction cache.

The benchmarks
I ran Bubblesort for sizes 1000000 bytes down to 1000000/512. For Selectionsort I just ran three rounds. For Bubblesort I also ran for 2000000 and 4000000 but those times are divided by 4 and 16 to be comparable. All times are in seconds.


	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
4000000	1248	1332	997	1120	396	833		507	120	104	93
2000000	1248	1332	994	1118	386	791	553	506	114	102	93
1000000	1274	1330	1009	1110	367	757	492	504	113	96	93
500000	1258	1194	959	1049	352	628	389	353	72	74	63
250000	1219	1116	931	911	351	445	309	276	53	61	48
125000	1174	1043	902	701	349	397	287	237	44	56	41
62500	941	853	791	573	349	373	278	218	38	52	37
31250	700	462	520	474	342	317	260	208	36	48	36
15625	697	456	507	368	340	315	258	204	35	49	35
7812	696	454	495	364	340	315	256	202	34	49	35
3906	696	455	496	364	340	315	257	203	34	47	35
1953	698	456	496	365	342	320	257	204	35	45	35


	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
1000000	1317	996	877	1056	446	468	296	255	30	45	19
31250	875	354	539	559	420	206	147	245	28	40	21
1953	874	362	520	457	422	209	149	250	30	41	23

Theoretically, all timings for a single machine should be equal. The differences can be explained much by cache sizes, but obviously there are more things happening here.

Mostly the data makes sense. The caches creates plateaus and the L1 size can almost be prediced by the data. I would have expected even bigger differences between best/worse-cases; now it is in the range 180%-340%. The most surprising thing (?) is the Selectionsort results. They are sometimes a lot faster (G4, i5) and sometimes significantly slower! This is strange: I have no idea.

I believe the i5 superior performance of Selectionsort 1000000 is due to cache and branch prediction.

I note that the QNAP and Archer C20i both have DDRII memory, while the RPi has SDRAM. This seems to make a difference when work sizes get bigger.

I have also made other Benchmarks where the WDR4900 were faster than the G4 – not this time.

The Raspberry Pi
What did I learn about the Raspberry Pi? Well, memory is slow and branch prediction seems bad. It is typically 10-15 times slower than the modern (Xeon, Athlon, i5) CPUs. But for large selectionsort problems the difference is up to 40x. This starts getting close to the Node.js crap speed. It is not hard to imagine that Node.js benefits heavily from great branch prediction and large cache sizes – both things that the RPi lacks.

What about the 128k cache? Does it work? Well, compared to the L1-only machines, performance of RPi degrades sligthly slower, perhaps. Not impressed.

Bubblesort vs Selectionsort
It really puzzles me that Bubblesort ever beats Selectionsort:

void bubbelsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, jm1;
  uint32_t tmp;
  for ( i=len ; i>1 ; i-- ) {
    for ( j=1 ; j<i ; j++ ) {
      jm1 = j-1;
      if ( array[jm1] > array[j] ) {
        tmp = array[jm1];
        array[jm1] = array[j];
        array[j] = tmp;

void selectionsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, best;
  uint32_t tmp;
  for ( i=1 ; i<len ; i++ ) {
    best = i-1;
    for ( j=i ; j<len ; j++ ) {
      if ( array[best] > array[j] ) {
        best = j;
    tmp = array[i-1];
    array[i-1] = array[best];
    array[best] = tmp;

Essentially, the difference is how the swap takes place outside the inner loop (once) instead of all the time. The Selectionsort should also be able of benefit from easier branch prediction and much fewer writes to memory. Perhaps compiling to assembly code would reveal something odd going on.

Power of 2 aligned data sets
I avoided using a datasize with the size an exact power of two: 1024×1024 vs 1000×1000. I did this becuase caches are supposed to work better this way. Perhaps I will make some 1024×1024 runs some day.

JavaScript: switch options

Is the nicest solution also the fastest?

Here is a little thing I ran into that I found interesting enough to test it. In JavaScript, you get a parameter (from a user, perhaps a web service), and depending on the parameter value you will call a particular function.

The first solution that comes to my mind is a switch:

function test_switch(code) {
  switch ( code ) {
  case 'Alfa':
  case 'Mike':

That is good if you know all the labels when you write the code. A more compact solution that allows you to dynamically add functions is to let the functions just be properties of an object:

x1 = {

function test_prop(code) {
  var f = x1[code];
  if ( f ) f();
  else call_default();

And as a variant – not really making sense in this simple example but anyway – you could loop over the properties (functions) until you find the right one:

function test_prop_loop(code) {
  var p;
  for ( p in x1 ) {
    if ( p === code ) {

And, since we are into loops, this construction does not make so much sense in this simple example, but anyway:

x2 = [
  { code:'Alfa'     ,func:call_alfa    },
  { code:'Bravo'    ,func:call_bravo   },
  { code:'Charlie'  ,func:call_charlie },
  { code:'Mike'     ,func:call_mike    }

function test_array_loop(code) {
  var i, o;
  for ( i=0 ; i<x2.length ; i++ ) {
    o = x2[i];
    if ( o.code === code ) {

Alfa, Bravo…, Mike and default
I created exactly 13 options, and labeled them Alfa, Bravo, … Mike. And all the test functions accept invalid code and falls back to a default function.

The loops should clearly be worse for more options. However it is not obvious what the cost is for more options in the switch case.

I will make three test runs: 5 options (Alfa to Echo), 13 options (Alfa to Mike) and 14 options (Alfa to November) where the last one ends up in default. For each run, each of the 5/13/14 options will be equally frequent.

Benchmark Results
I am benchmarking using Node.js 0.12.2 on a Raspberry Pi 1. The startup time for Nodejs is 2.35 seconds, and I have reduced that from all benchmark times. I also ran the benchmarks on a MacBook Air with nodejs 0.10.35. All benchmarks were repeated three times and the median has been used. Iteration count: 1000000.

(ms)       ======== RPi ========     ==== MacBook Air ====
              5      13      14         5      13      14
switch     1650    1890    1930        21      28      30
prop       2240    2330    2890        22      23      37
proploop   2740    3300    3490        31      37      38
loop       2740    4740    4750        23      34      36

Well, most notable (and again), the RPi ARMv6 is not fast running Node.js!

Using the simple property construction seems to make sense from a performance perspective, although the good old switch also fast. The loops have no advantages. Also, the penalty for the default case is quite heavy for the simple property case; if you know the “code” is valid the property scales very nicely.

It is however a little interesting that on the ARM the loop over properties is better than the loop over integers. On the x64 it is the other way around.

Variants of Simple Property Case
The following are essentially equally fast:

function test_prop(code) {
  var f = x1[code];   
  if ( f ) f();       
  else call_x();                        

function test_prop(code) {
  var f = x1[code];   
  if ( 'function' === typeof f ) f();
  else call_x();                        

function test_prop(code) {

So, it does not cost much to have a safety test and a default case (just in case), but it is expensive to use it. This one, however:

function test_prop(code) {
  try {
  } catch(e) {

comes at a cost of 5ms on the MacBook, when the catch is never used. If the catch is used (1 out of 14) the run takes a full second instead of 37ms!