Category Archives: Programming

Sort strings without case sensitivity

In JavaScript, I wanted to sort arrays of strings without caring about case. It was more complicated than I first thought.

The background is that I present lists like this in a GUI:

  • AMD
  • Apple
  • Gigabyte
  • IBM
  • Intel
  • Microsoft
  • MSI
  • Nokia
  • Samsung
  • Sony

I want AMD and MSI (spelled in all caps) to be sorted without respect to case. Standard sort() would put MSI before Microsoft.

Obviously I am not the first one wanting to do this and I found an article on stackoverflow. It suggests the following solution:

Use toLowerCase()
You can make your own string compare function that uses toLowerCase and send it as an argument to sort():

function cmpCaseless(a,b) {
    a = a.toLowerCase();
    b = b.toLowerCase();
    if ( a < b ) return -1;
    if ( a > b ) return  1;
    return 0;
}

myStringArray.sort(cmpCaseless);

This has a number of problems. The article above mentions that it is not stable. That is probably true in some cases but I was of course worried about performance: making two String objects for each compare should make the garbage collector quite busy, not to mention the waste of copying and lowercasing potentially quite long stings when usually the first character is enought. When I started experimenting I found another more critical flaw though: in Swedish we have three extra characters in the alphabet; Å,Ä,Ö, in that order. The above cmpCaseless orders Ä,Å,Ö, which sounds like a little problem, but it is simply unacceptable.

Use localeCompare
There is a more competent (or so I thought, read on) way to compare strings in JavaScript: the localeCompare function. This one simply treats A,Å,Ä and O,Ö as the same character, which is far more unacceptable than the toLowerCase problem.

However, it also has a “locales” option (a second optional argument). If I set it to ‘sv’ I get the sort order that I want, but performance is horrible. And I still have to use toLowerCase as well as localeCompare:

function localeCompare(a,b) {
    return a.toLowerCase().localeCompare(b.toLowerCase());
}

function localeCompare_sv(a,b) {
    return a.toLowerCase().localeCompare(b.toLowerCase(), 'sv');
}

localeCompare() has an extra options argument with a “sensitivity” parameter, but it is no good for our purpuses.

Rolling my own
Of course, I ended up building my own function to do caseless string compare. The strategy is to compare one character at a time, not making any new String objects, and fallback to localeCompare if both characters are above the 127 ASCII characters:

function custom(a,b) {
    var i, al, bl, l;
    var ac, bc;
    al = a.length;
    bl = b.length;
    l = al < bl ? al : bl;
        
    for ( i=0 ; i<l ; i++ ) {
        ac = a.codePointAt(i);  // or charCodeAt() for better compability
        bc = b.codePointAt(i);
        if ( 64 < ac && ac < 91 ) ac += 32;
        if ( 64 < bc && bc < 91 ) bc += 32;
        if ( ac !== bc ) { 
            if ( 127 < ac && 127 < bc ) {
                ac = a.substr(i,1).toLowerCase();
                bc = b.substr(i,1).toLowerCase();
                if ( ac !== bc ) return ac.localeCompare(bc);
            } else {
                return ac-bc;
            }
        }
    }
    return al-bl;
}

One fascinating thing is that here I can use localeCompare() without 'sv'.

Test for yourself
I built a simple webpage where you can test everything yourself.

Conclusion
Defining a string sort order is not trivial, when you dont just have ASCII characters. If you look at the ascii table you see that non alphabetic characters are spread out:

  • SPACE, #, 1-9, and many more come before both A-Z and a-z
  • Underscore: _, and a few other characters come after A-Z but before a-z
  • Pipe: | and a few other characters come after A-Z and a-z

When it comes to characters behind ASCII 127, it just gets more complicated: how do you sort european language latin letters, greek letters and arrows and other symbols?

For this reason, I think it makes sense to define your own sorting function and clearly define the behaviour for the characters you know that you care about. If it really matters in your application.

My function above is significantly faster than the options.

Disclaimer
These results can probably be inconsistent over different web browsers.

Angular.js Hello World program

Usually when writing a small utility I use the command line. But sometimes a simple web page with JavaScript gives a more functional UI.

I like AngularJS, but I don’t start from scratch very often, and when I do, I need to start searching for code to copy. So, here is my Angular Hello World template.

<!DOCTYPE html>
<html ng-app="theApplication" ng-controller="theController">
  <head>
    <meta charset="utf-8">
    <title>Hello World</title>
  </head>

  <script src="angular.js"></script>
  <script>
    angular.module('theApplication', []).
    controller('theController', ['$scope',function($scope) {
      $scope.name = 'Zo0ok';
    }]);
  </script>

  <body>
    Hello {{ name }}
  </body>
</html>

Obviously, if you don’t host angular.js yourself you need to use something like

<script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.4.5/angular.min.js">
</script>

instead.

Playing with smart.js and V7

I have been playing quite much with Node.js lately, and I have put some effort into trying to build it for typical OpenWRT hardware. It turns out that Node.js/V8 is not, and will never be, suitable for hardware without an FPU and at least 128MB RAM.

My curiousity led me to smart.js which is based on the V7 javascript engine. Among the positive details were posix compability, ECMA script 5.1 support, HTTP support (not very different from Node.js, just much less of it), and supposed to be the fastest non-compiled JavaScript engine.

smart.js
smart.js seems to be the IoT-platform, while v7 is just the JavaScript engine. smart.js had some peculiar properties:

  • The source zip I downloaded from github failed to compiled becuase of a missing .git-directory. Using git clone solved this.
  • The binary always reads and executes smart.js from the same folder. If you give more command line arguments, it executes those after smart.js.
  • I found no way to pass command line arguments.

It comes with a little set of IoT-example files, and I found this a little confusing.

v7
I found out it was possible to use v7 standalone. It is a very nice v7.c file that is simply compiled:

$ gcc -O3 -DV7_EXE -o v7 v7.c -lm

This is very promising – a lot easier than building node.

I found that neither console.log or process.argv exist and found ways to work around this. My plan was to build a little benchmark suite and test different things. I did not get so far: the below program just creates 100 random objects (TestBit) 10 times and stores in arrays.

if ( 'undefined' === typeof process ) {
  process = {
    exit : exit,
    argv : [ 'v7', 'v7bench.c' ]  // faking command line arguments
  };
}

if ( 'undefined' === typeof console ) {
  console = {
    log : print
  }
}

var TestBit = function() {
  this.n  = Math.random();
  this.s  = '' + this.n;
  this.s1 = this.s.substr(2,1);
  this.s2 = this.s.substr(3,2);
  this.s3 = this.s.substr(5,3);
  this.b  = this.s > 0.5;
};

var generateTestBits = function(n) {
  var i;
  var r = new Array(n);
  for ( i=0 ; i<n ; i++ ) {
    r[i] = new TestBit();
  }
  return r;
}

var Timer = function() {
  this.start = Date.now();
  this.last  = this.start;

  this.split = function() {
    var r;
    var n = Date.now();
    r = n - this.last;
    this.last = n;
    return r;
  }
}

var timer = new Timer();

var i = 0;
var tests = [];
var tmp;

console.log('Start');

while(i<10) {
  tests.push(generateTestBits(100));
  i++;
  console.log('' + ( i*100) + ':' + timer.split());
}

This simple program unfortunately proved that performance is much worse than I could expect. Below are timings for generating the 1000 objects, in steps of 100 at a time. Benchmarks on a RPi2 900MHz ARMv7 CPU.

There is a flag (-vo) described as “object arena size” that I decided to play with (values 1, 10, 100 gave very similar results as 1000)

       default    -vo 1000    -vo 5000    -vo 10000
 100      0.2s       0.12s        6.8s          28s                         
 200      0.4s       0.16s        8.6s
 300      1.3s       0.23s       12  s
 400      7.4s       0.24s        2.6s
 500     13  s       0.37s       15  s
 600     35  s       0.93s       32  s
 700     47  s       2.9 s       48  s
 800     79  s       4.6 s       45  s
 900    160  s       5.9 s       36  s
1000    160  s       8.0 s       37  s

Well, generating 100 random objects on a RPi2 using a simple interpreter could take 0.2 seconds (the top left value): quite reasonable I guess. As the number of generated objects grows the performance is completey ruined. With default parameter, a full 1.6 seconds is spent to generate one single “TestBit”. Memory usage of the v7 process is insignificant.

I dont know what the “object arena size” setting does, but it obviously changes something. The 400-value for the vo=5000 series is actually reproducable.

Even for an IoT-platform I find this performance (and the lack of predictability) unacceptable. V7 officially aims to be the fastest interpreted JavaScript engine out there: it must be cabable of handling what is actually small arrays (10×100 elements) of small objects.

Effects of cache on performance

It is not clear to me, why is Node.js so amazyingly slow on a Raspberry Pi (article 1, article 2)?

Is it because of the small cache (16kb+128kb)? Is Node.js emitting poor code on ARM? Well, I decided to investigate the cache issue. The 128kb cache of the Raspberry Pi is supposed to be primarily used by the GPU; is it actually effective at all?

A suitable test algorithm
To understand what I test, and because of the fun of it, I wanted to implement a suitable test program. I can imagine a good test program for cache testing would:

  • be reasonably slow/fast, so measuring execution time is practical and meaningful
  • have working data sets in sizes 10kb-10Mb
  • the same problem should be solvable with different work set sizes, in a way that the theoretical execution time should be the same, but the difference is because of cache only
  • be reasonably simple to implement and understand, while not so trivial that the optimizer just gets rid of the problem entirely

Finally, I think it is fun if the program does something slightly meaningful.

I found that Bubblesort (and later Selectionsort) were good problems, if combined with a quasi twist. Original bubble sort:

Array to sort: G A F C B D H E   ( N=8 )
Sorted array:  A B C D E F G H
Theoretical cost: O(N2) = 64/2 = 32
Actual cost: 7+6+5+4+3+2+1     = 28 (compares and conditional swaps)

I invented the following cache-optimized Bubble-Twist-Sort:

Array to sort:                G A F C B D H E
Sort halves using Bubblesort: A C F G B D E H
Now, the twist:                                 ( G>B : swap )
                              A C F B G D E H   ( D>F : swap )
                              A C D B G F E H   ( C<E : done )
Sort halves using Bubblesort: A B C D E F G H
Theoretical cost = 16/2 + 16/2 (first two bubbelsort)
                 + 4/2         (expected number of twist-swaps)
                 + 16/2 + 16/2 (second two bubbelsort)
                 = 34
Actual cost: 4*(3+2+1) + 2 = 26

Anyway, for larger arrays the actual costs get very close. The idea here is that I can run a bubbelsort on 1000 elements (effectively using 1000 memory units of memory intensively for ~500000 operations). But instead of doing that, I can replace it with 4 runs on 500 elements (4* ~12500 operations + ~250 operations). So I am solving the same problem, using the same algorithm, but optimizing for smaller cache sizes.

Enough of Bubblesort… you are probably either lost in details or disgusted with this horribly stupid idea of optimizing and not optimizing Bubblesort at the same time.

I made a Selectionsort option. And for a given data size I allowed it either to sort bytes or 32-bit words (which is 16 times faster, for same data size).

The test machines
I gathered 10 different test machines, with different cache sizes and instructions sets:

	QNAP	wdr3600	ac20i	Rpi	Rpi 2	wdr4900	G4	Celeron	Xeon	Athlon	i5
								~2007   ~2010   ~2013
============================================================================================
L1	32	32	32	16	?	32	64	32	32	128	32
L2				128	?	256	256	512	6M	1024	256
L3							1024				6M
Mhz	500	560	580	700	900	800	866	900	2800	3000	3100
CPU	ARMv5	Mips74K	Mips24K	ARMv6	ARMv7	PPC	PPC	x86	x64	x64	x64
OS	Debian	OpenWrt	OpenWrt	OpenWrt	OpenWrt	OpenWrt	Debian	Ubuntu	MacOSX	Ubuntu	Windows

Note that for the multi-core machines (Xeon, Athlon, i5) the L2/L3 caches may be shared or not between cores and the numbers above are a little ambigous. The sizes should be for Data cache when separate from Instruction cache.

The benchmarks
I ran Bubblesort for sizes 1000000 bytes down to 1000000/512. For Selectionsort I just ran three rounds. For Bubblesort I also ran for 2000000 and 4000000 but those times are divided by 4 and 16 to be comparable. All times are in seconds.

Bubblesort

	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
============================================================================================
4000000	1248	1332	997	1120	396	833		507	120	104	93
2000000	1248	1332	994	1118	386	791	553	506	114	102	93
1000000	1274	1330	1009	1110	367	757	492	504	113	96	93
500000	1258	1194	959	1049	352	628	389	353	72	74	63
250000	1219	1116	931	911	351	445	309	276	53	61	48
125000	1174	1043	902	701	349	397	287	237	44	56	41
62500	941	853	791	573	349	373	278	218	38	52	37
31250	700	462	520	474	342	317	260	208	36	48	36
15625	697	456	507	368	340	315	258	204	35	49	35
7812	696	454	495	364	340	315	256	202	34	49	35
3906	696	455	496	364	340	315	257	203	34	47	35
1953	698	456	496	365	342	320	257	204	35	45	35

Selectionsort

	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
============================================================================================
1000000	1317	996	877	1056	446	468	296	255	30	45	19
31250	875	354	539	559	420	206	147	245	28	40	21
1953	874	362	520	457	422	209	149	250	30	41	23

Theoretically, all timings for a single machine should be equal. The differences can be explained much by cache sizes, but obviously there are more things happening here.

Findings
Mostly the data makes sense. The caches creates plateaus and the L1 size can almost be prediced by the data. I would have expected even bigger differences between best/worse-cases; now it is in the range 180%-340%. The most surprising thing (?) is the Selectionsort results. They are sometimes a lot faster (G4, i5) and sometimes significantly slower! This is strange: I have no idea.

I believe the i5 superior performance of Selectionsort 1000000 is due to cache and branch prediction.

I note that the QNAP and Archer C20i both have DDRII memory, while the RPi has SDRAM. This seems to make a difference when work sizes get bigger.

I have also made other Benchmarks where the WDR4900 were faster than the G4 – not this time.

The Raspberry Pi
What did I learn about the Raspberry Pi? Well, memory is slow and branch prediction seems bad. It is typically 10-15 times slower than the modern (Xeon, Athlon, i5) CPUs. But for large selectionsort problems the difference is up to 40x. This starts getting close to the Node.js crap speed. It is not hard to imagine that Node.js benefits heavily from great branch prediction and large cache sizes – both things that the RPi lacks.

What about the 128k cache? Does it work? Well, compared to the L1-only machines, performance of RPi degrades sligthly slower, perhaps. Not impressed.

Bubblesort vs Selectionsort
It really puzzles me that Bubblesort ever beats Selectionsort:

void bubbelsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, jm1;
  uint32_t tmp;
  for ( i=len ; i>1 ; i-- ) {
    for ( j=1 ; j<i ; j++ ) {
      jm1 = j-1;
      if ( array[jm1] > array[j] ) {
        tmp = array[jm1];
        array[jm1] = array[j];
        array[j] = tmp;
      }
    }
  }
}

void selectionsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, best;
  uint32_t tmp;
  for ( i=1 ; i<len ; i++ ) {
    best = i-1;
    for ( j=i ; j<len ; j++ ) {
      if ( array[best] > array[j] ) {
        best = j;
      }
    }
    tmp = array[i-1];
    array[i-1] = array[best];
    array[best] = tmp;
  } 
}

Essentially, the difference is how the swap takes place outside the inner loop (once) instead of all the time. The Selectionsort should also be able of benefit from easier branch prediction and much fewer writes to memory. Perhaps compiling to assembly code would reveal something odd going on.

Power of 2 aligned data sets
I avoided using a datasize with the size an exact power of two: 1024×1024 vs 1000×1000. I did this becuase caches are supposed to work better this way. Perhaps I will make some 1024×1024 runs some day.

JavaScript: switch options

Is the nicest solution also the fastest?

Here is a little thing I ran into that I found interesting enough to test it. In JavaScript, you get a parameter (from a user, perhaps a web service), and depending on the parameter value you will call a particular function.

The first solution that comes to my mind is a switch:

function test_switch(code) {
  switch ( code ) {
  case 'Alfa':
    call_alfa();
    break;
  ...
  case 'Mike':
    call_mike();
    break;
  }
  call_default();
}

That is good if you know all the labels when you write the code. A more compact solution that allows you to dynamically add functions is to let the functions just be properties of an object:

x1 = {
  Alfa:call_alfa,
  Bravo:call_bravo,
  Charlie:call_charlie,
...
  Mike:call_mike
};

function test_prop(code) {
  var f = x1[code];
  if ( f ) f();
  else call_default();
}

And as a variant – not really making sense in this simple example but anyway – you could loop over the properties (functions) until you find the right one:

function test_prop_loop(code) {
  var p;
  for ( p in x1 ) {
    if ( p === code ) {
      x1[p]();
      return;
    }
  }
  call_default();
}

And, since we are into loops, this construction does not make so much sense in this simple example, but anyway:

x2 = [
  { code:'Alfa'     ,func:call_alfa    },
  { code:'Bravo'    ,func:call_bravo   },
  { code:'Charlie'  ,func:call_charlie },
...
  { code:'Mike'     ,func:call_mike    }
];

function test_array_loop(code) {
  var i, o;
  for ( i=0 ; i<x2.length ; i++ ) {
    o = x2[i];
    if ( o.code === code ) {
      o.func();
      return;
    }
  }
  call_default();
}

Alfa, Bravo…, Mike and default
I created exactly 13 options, and labeled them Alfa, Bravo, … Mike. And all the test functions accept invalid code and falls back to a default function.

The loops should clearly be worse for more options. However it is not obvious what the cost is for more options in the switch case.

I will make three test runs: 5 options (Alfa to Echo), 13 options (Alfa to Mike) and 14 options (Alfa to November) where the last one ends up in default. For each run, each of the 5/13/14 options will be equally frequent.

Benchmark Results
I am benchmarking using Node.js 0.12.2 on a Raspberry Pi 1. The startup time for Nodejs is 2.35 seconds, and I have reduced that from all benchmark times. I also ran the benchmarks on a MacBook Air with nodejs 0.10.35. All benchmarks were repeated three times and the median has been used. Iteration count: 1000000.

(ms)       ======== RPi ========     ==== MacBook Air ====
              5      13      14         5      13      14
============================================================
switch     1650    1890    1930        21      28      30
prop       2240    2330    2890        22      23      37
proploop   2740    3300    3490        31      37      38
loop       2740    4740    4750        23      34      36

Conclusions
Well, most notable (and again), the RPi ARMv6 is not fast running Node.js!

Using the simple property construction seems to make sense from a performance perspective, although the good old switch also fast. The loops have no advantages. Also, the penalty for the default case is quite heavy for the simple property case; if you know the “code” is valid the property scales very nicely.

It is however a little interesting that on the ARM the loop over properties is better than the loop over integers. On the x64 it is the other way around.

Variants of Simple Property Case
The following are essentially equally fast:

function test_prop(code) {
  var f = x1[code];   
  if ( f ) f();       
  else call_x();                        
}   

function test_prop(code) {
  var f = x1[code];   
  if ( 'function' === typeof f ) f();
  else call_x();                        
}   

function test_prop(code) {
  x1[code]();                          
}   

So, it does not cost much to have a safety test and a default case (just in case), but it is expensive to use it. This one, however:

function test_prop(code) {
  try {
    x1[code]();
  } catch(e) {
    call_x();
  }
}

comes at a cost of 5ms on the MacBook, when the catch is never used. If the catch is used (1 out of 14) the run takes a full second instead of 37ms!

Raspberry Pi (v1), OpenWrt (14.07) and Node.js (v0.10.35 & v0.12.2)

Since I gave up running NetBSD on my Raspberry pi I decided it was time to try OpenWrt. And, to my surprise I also managed to cross compile Node.js!

Install OpenWrt on Raspberry Pi (v1@700MHz)
I installed OpenWrt Barrier Breaker (the currently stable release) using the standard instructions.

After you have put the image on an SD-card with dd, it is quite easy to resize the root partition:

  1. copy the second partition to an image file using dd
  2. use fdisk to delete the second partition, and create a new, bigger
  3. format the new partition with mkfs.ext4
  4. mount the image file using mount -o loop
  5. mount the new second partition
  6. copy all data from image file to second partition using cp -a

If you want to, you can edit /etc/config/network while you are anyway working with the OpenWrt root partition:

#config interface 'lan'
#	option ifname 'eth0'
#	option type 'bridge'
#	option proto 'static'
#	option ipaddr '192.168.1.1'
#	option netmask '255.255.255.0'
#	option ip6assign '60'
#	option gateway '?.?.?.?'
#	option dns '?.?.?.?'
config interface 'lan'
	option ifname 'eth0'
	option proto 'dhcp'
	option macaddr 'XX:XX:XX:XX:XX:XX'
	option hostname 'rpiopenwrt'

Probably you want to disable dnsmasq, odhcpd and firewall too:

.../etc/init.d/$ chmod -x dnsmasq firewall odhcpd

OR (depending on your idea of what is the right way)

.../etc/rc.d$ sudo rm S60dnsmasq S35odhcpd K85odhcpd S19firewall

Also, it is a good idea to edit config.txt (on the DOS partition):

gpu_mem=1

I don’t know if 1 is really a legal value, but it worked for me, and I had much more memory available than when gpu_mem was not set.

Node.js4 added 2015-10-03
For Node.js, check Node.js 4 builds.

Building Node.js v0.12.2
I downloaded and built Node.js v0.12.2 on a Xubuntu machine with an x64 cpu. On such a machine you can download the standard OpenWrt toolchain for Raspberry Pi.

I replaced configure and cpu.cc in the standard sources with the files from This Page (they are meant for v0.12.1 but they work equally good for v0.12.2).

I then found an a gist that gave me a good start. I modified it, and ended up with:

#!/bin/sh -e

export STAGING_DIR=...path to your toolchain...

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib
export ARM_TARGET_LIB=$CSTOOLS_LIB

export TARGET_ARCH="-march=armv6j"

#Define the cross compilators on your system
export AR="arm-openwrt-linux-uclibcgnueabi-ar"
export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LINK="arm-openwrt-linux-uclibcgnueabi-g++"
export CPP="arm-openwrt-linux-uclibcgnueabi-gcc -E"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"
export AS="arm-openwrt-linux-uclibcgnueabi-as"
export CCLD="arm-openwrt-linux-uclibcgnueabi-gcc ${TARGET_ARCH} ${TARGET_TUNE}"
export NM="arm-openwrt-linux-uclibcgnueabi-nm"
export STRIP="arm-openwrt-linux-uclibcgnueabi-strip"
export OBJCOPY="arm-openwrt-linux-uclibcgnueabi-objcopy"
export RANLIB="arm-openwrt-linux-uclibcgnueabi-ranlib"
export F77="arm-openwrt-linux-uclibcgnueabi-g77 ${TARGET_ARCH} ${TARGET_TUNE}"
unset LIBC

#Define flags
export CXXFLAGS="-march=armv6j"
export LDFLAGS="-L${CSTOOLS_LIB} -Wl,-rpath-link,${CSTOOLS_LIB} -Wl,-O1 -Wl,--hash-style=gnu"
export CFLAGS="-isystem${CSTOOLS_INC} -fexpensive-optimizations -frename-registers -fomit-frame-pointer -O2"
export CPPFLAGS="-isystem${CSTOOLS_INC}"
export CCFLAGS="-march=armv6j"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux --without-npm

bash --norc

Run this script in the Node.js source directory. If everything goes fine it configures the Node.js build, and leaves you with a shell where you can simply run:

$ make

If compilation is fine, you find the node binary in the out/Release folder. Copy it to your OpenWrt Raspberry Pi.

Building Node.js v0.10.35
I first successfully built Node.js v0.10.35.

The (less refined) script for configuring that I used was:

#!/bin/sh -e

export STAGING_DIR=...path to your toolchain...

#Tools
export CSTOOLS="$STAGING_DIR"
export CSTOOLS_INC=${CSTOOLS}/include
export CSTOOLS_LIB=${CSTOOLS}/lib
export ARM_TARGET_LIB=$CSTOOLS_LIB
export GYP_DEFINES="armv7=0"

#Define our target device
export TARGET_ARCH="-march=armv6"
export TARGET_TUNE="-mfloat-abi=hard"

#Define the cross compilators on your system
export AR="arm-openwrt-linux-uclibcgnueabi-ar"
export CC="arm-openwrt-linux-uclibcgnueabi-gcc"
export CXX="arm-openwrt-linux-uclibcgnueabi-g++"
export LINK="arm-openwrt-linux-uclibcgnueabi-g++"
export CPP="arm-openwrt-linux-uclibcgnueabi-gcc -E"
export LD="arm-openwrt-linux-uclibcgnueabi-ld"
export AS="arm-openwrt-linux-uclibcgnueabi-as"
export CCLD="arm-openwrt-linux-uclibcgnueabi-gcc ${TARGET_ARCH} ${TARGET_TUNE}"
export NM="arm-openwrt-linux-uclibcgnueabi-nm"
export STRIP="arm-openwrt-linux-uclibcgnueabi-strip"
export OBJCOPY="arm-openwrt-linux-uclibcgnueabi-objcopy"
export RANLIB="arm-openwrt-linux-uclibcgnueabi-ranlib"
export F77="arm-openwrt-linux-uclibcgnueabi-g77 ${TARGET_ARCH} ${TARGET_TUNE}"
unset LIBC

#Define flags
export CXXFLAGS="-march=armv6"
export LDFLAGS="-L${CSTOOLS_LIB} -Wl,-rpath-link,${CSTOOLS_LIB} -Wl,-O1 -Wl,--hash-style=gnu"
export CFLAGS="-isystem${CSTOOLS_INC} -fexpensive-optimizations -frename-registers -fomit-frame-pointer -O2 -ggdb3"
export CPPFLAGS="-isystem${CSTOOLS_INC}"
export CCFLAGS="-march=armv6"

export PATH="${CSTOOLS}/bin:$PATH"

./configure --without-snapshot --dest-cpu=arm --dest-os=linux
bash --norc

Running node on the Raspberry Pi
Back on the Raspberry Pi you need to install a few packages:

# ldd ./node 
	libdl.so.0 => /lib/libdl.so.0 (0xb6f60000)
	librt.so.0 => not found
	libstdc++.so.6 => not found
	libm.so.0 => /lib/libm.so.0 (0xb6f48000)
	libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xb6f34000)
	libpthread.so.0 => not found
	libc.so.0 => /lib/libc.so.0 (0xb6edf000)
	ld-uClibc.so.0 => /lib/ld-uClibc.so.0 (0xb6f6c000)
# opkg update
# opkg install librt
# opkg install libstdcpp

That is all! Now you should be ready to run node. The node binary is about 13Mb (the v0.10.35 was 19Mb perhaps becuase of -ggdb3), so it is not optimal to deploy it to other typical OpenWrt hardware.

Final comments
I ran a few small programs to test, and they were fine. I guess some more testing would be appropriate. The performance is very comparable to Node.js built and executed on Raspbian.

I think RaspberryPi+OpenWrt+Node.js is a very interesting and competitive combination for microservices!

Faking a good goto in JavaScript

There are cases where gotos are good (most possible uses of gotos are not good). I needed to write JavaScript functions (for running in NodeJS) where I wanted to call the callback function just once in the end (to make things as clear as possible). In C that would be (this is a simplified example):

void withgoto(int x, void(*callback)(int) ) {
  int r;

  if ( (r = test1(x)) )
    goto done;

  if ( (r = test2(x)) )
    goto done;
 
  if ( (r = test3(x)) )
    goto done;
 
  r = 0;
done:
  (*callback)(r);
}

I think that looks nice! I mean the way goto controls the flow, not the syntax for function pointers.

JavaScript: multiple callbacks
The most obvious way to me to write this in JavaScript was:

var with4callbacks = function(x, callback) {
  var r

  if ( r = test1(x) ) {
    callback(r)
    return
  }

  if ( r = test2(x) ) {
    callback(r)
    return
  }

  if ( r = test3(x) ) {
    callback(r)
    return
  }
  r = 0
  callback(r)
}

This works perfectly, of course. But it is not nice with callback in several places. It is annoying (bloated) to always write return after callback. And in other cases it can be a little unclear if callback is called zero times, or more than one time… which is basically catastrophic. What options are there?

JavaScript: abusing exceptions
My first idea was to abuse the throw/catch-construction:

var withexceptions = function(x, callback) {
  var r

  try {
    if ( r = test1(x) )
      throw null

    if ( r = test2(x) )
      throw null

    if ( r = test3(x) )
      throw null

    r = 0
  } catch(e) {
  }
  callback(r)
}

This works just perfectly. In a more real world case you would probably put some code in the catch block. Is it good style? Maybe not.

JavaScript: an internal function
With an internal (is it called so?) function, a return does the job:

var withinternalfunc = function(x, callback) {
  var r
  var f

  f = function() {
    if ( r = test1(x) )
      return

    if ( r = test2(x) )
      return

    if ( r = test3(x) )
      return

    r = 0
  }
  f()
  callback(r)
}

Well, this looks like JavaScript, but it is not super clear.

JavaScript: an external function
You can also do with an external function (risking that you need to pass plenty of parameters to it, but in my simple example that is not an issue):

var externalfunc = function(x) {
  var r
  if ( r = test1(x) )
    return r

  if ( r = test2(x) )
    return r

  if ( r = test3(x) )
    return r

  return 0
}

var withexternalfunc = function(x, callback) {
  callback(externalfunc(x))
}

Do you think the readability is improved compared to the goto code? I don’t think so.

JavaScript: Break out of Block
Finally (and I got help coming up with this one), it is possible to do:

var withbreakblock = function(x, callback) {
  var r
  var f

myblock:
  {
    if ( r = test1(x) )
      break myblock

    if ( r = test2(x) )
      break myblock

    if ( r = test3(x) )
      break myblock

    r = 0
  }
  callback(r)
}

Well, that is at close to the goto construction I come with JavaScript. Pretty nice!

JavaScript: Multiple if(done)
Using a done-variable and multiple if statements is also possible:

var with3ifs = function(x, callback) {
  var r
  var done = false

  if ( r = test1(x) )
    done = true

  if ( !done ) {
    if ( r = test2(x) )
      done = true
  }

  if ( !done ) {
    if ( r = test3(x) )
      done = true
  }

  if ( !done ) {
    r = 0
  } 
  callback(r)
}

Hardly pretty, I think. The longer the code gets (the more sequential ifs there is), the higher the penalty for the ifs will be.

Performance
Which one I choose may depend on performance, if the difference is big. They should all be fast, but:

  • It is quite unclear what the cost of throwing (an exception) is
  • The internal function, is it recompiled and what is the cost?

I measured performance as (millions of) calls to the function per second. The test functions are rather cheap, and x is an integer in this case.

I did three test runs:

  1. The fall through case (r=0) is relatively rare (~8%)
  2. The fall through case is very common (~92%)
  3. The fall through case is extremely common (>99.99%)

In real applications fallthrough rate may be the most common case, with no error input data found. The benchmark environment is:

  • Mac Book Air Core i5@1.4GHz
  • C Compiler: Apple LLVM version 6.1.0 (clang-602.0.49) (based on LLVM 3.6.0svn)
  • C Flags: -O2
  • Node version: v0.10.35 (installed from pkgsrc.org, x86_64 version)

Performance was rather consistent over several runs (for 1000 000 calls each):

Fallthrough Rate           ~8%      ~92      >99.99%      
---------------------------------------------------------
     C: withgoto           66.7     76.9     83.3  Mops
NodeJS: with4callbacks     14.9     14.7     16.4  Mops
NodeJS: with exceptions     3.67     8.77    10.3  Mops
NodeJS: withinternalfunc    8.33     8.54     9.09 Mops
NodeJS: withexternalfunc   14.5     14.9     15.6  Mops
NodeJS: withbreakblock     14.9     15.4     17.5  Mops
NodeJS: with3ifs           15.2     15.6     16.9  Mops

The C code was row-by-row translated into the JavaScript code. The performance difference is between C/Clang and NodeJS, not thanks to the goto construction itself of course.

On Recursion
In JavaScript it is quite natural to do recursion when you deal with callbacks. So I decided to run the same benchmarks using recursion instead of a loop. Each recursion step involves three called ( function()->callback()->next()-> ). With this setup the maximum recursion depth was about 3×5300 (perhaps close to 16535?). That may sound much, but not enough to produce any benchmarks. Do I need to mention that C delivered 1000 000 recursive calls at exactly the same performance as the loop?

Conclusion
For real code 3.7 millions exceptions per second sounds pretty far fetched. Unless you are in a tight loop (which you probably are not, when you deal with callbacks), all solutions will perform well. However, the break out of a block is clearly the most elegant way and also the most efficient, second only to the real goto of course. I suspect the generally higher performance in the (very) high fallthrough case is because branch prediction gets more successful.

Any better ideas?

Build OpenWRT Toolchain on Mac OS X

A very quick guide to building the OpenWRT buildroot or toolchain on Mac OS X (10.10).

1. Install Xcode
Install Xcode from App Store (it is free).

2. Install pkgsrc
I have used fink, macports and homebrew, but now that I have tried pkgsrc I don’t think I will consider any of the others in a while. Install pkgsrc the standard way. Note: there is an x86_64 version for Mac OS X – it is probably what you want – just replace i386 with x86_64 in the download link.

Using pkgsrc, install these packages required by OpenWRT:

$ sudo pkgin install getopt coreutils gawk gtar findutils

3. Case sensitive filesystem
Your root filesystem on your Mac is probably case insensitive, and that is supposed to cause problems to building OpenWRT. Get yourself a USB disk or make a disk image an format it as case sensitive HFS+. If you do it from the command line you can avoid making it journaling:

$ diskutil eraseVolume hfsx OpenWRTdisk /dev/disk3s2

4. Building
This assumes you want the toolchain from that current stable build (14.07):

$ git clone git://git.openwrt.org/14.07/openwrt.git
$ cd openwrt
$ scripts/feeds update -a
$ scripts/feeds install -a
$ make menuconfig

In menuconfig I made just two changes: 1) setting my target platform, 2) asking toolchain to be built. Make all settings you want. I then ran:

$ make toolchain/install

You now find your toolchain is now in staging_dir 🙂
If you instead would have run just “make” the entire OpenWRT firmware would have been built.

Nodejs v0.12.0 on (unsupported) PowerPC G4

Nodejs can not be built for a G4 processor (PowerPC 7455, as found in pre-Intel Apple hardware) because of a few missing CPU instructions. IBM has made a Power/PowerPC-port of V8 (the JavaScript engine of Nodejs), but it does not work with the G4.

However, there is a quite simple workaround that can probably work for other unsupported platforms (PowerPC G3) as well, but ARMv5 failed.

This solution is to emulate a supported (i386) CPU using Qemu. Qemu is capable of emulating an entire computer (qemu-system-i386) or just emulate for a single program/process (qemu-i386). That is what I do.

I am running Debian 7 on my G4 computer, which comes with an old version of Qemu. It is old enough to not support the system call ‘futex’ (system call 240). My suggestion is to simply use debian backports to install a much more recent version of qemu.

# Add to /etc/apt/sources.list
deb http://http.debian.net/debian wheezy-backports main

# Then run
$ sudo apt-get update
$ sudo apt-get -t wheezy-backports install qemu-user

Now you can use the command qemu-i386 to run i386 binaries. Download the i386 binary linux version of nodejs and extract it somewhere. I extracted mine in /opt and made a symlink to /opt/node for convenience. Now:

zo0ok@sleipnir:~$ qemu-i386 /opt/node/bin/node 
/lib/ld-linux.so.2: No such file or directory

Unless you want to build your own statically linked nodejs binary, you need to get a few libraries from an i386 linux machine. I put these in /opt/node/bin/lib:

zo0ok@sleipnir:/opt/node/bin/lib$ ls -l
total 3320
-rw-r--r-- 1 zo0ok zo0ok  134380 mar  3 21:02 ld-linux.so.2
-rw-r--r-- 1 zo0ok zo0ok 1754876 mar  3 21:13 libc.so.6
-rw-r--r-- 1 zo0ok zo0ok   13856 mar  3 21:06 libdl.so.2
-rw-r--r-- 1 zo0ok zo0ok  113588 mar  3 21:12 libgcc_s.so.1
-rw-r--r-- 1 zo0ok zo0ok  280108 mar  3 21:11 libm.so.6
-rw-r--r-- 1 zo0ok zo0ok  134614 mar  3 21:12 libpthread.so.0
-rw-r--r-- 1 zo0ok zo0ok   30696 mar  3 21:05 librt.so.1
-rw-r--r-- 1 zo0ok zo0ok  922096 mar  3 21:08 libstdc++.so.6

For your convenience, I packed them for you:
https://dl.dropboxusercontent.com/u/9061436/code/linux-i386-lib.tgz
These are from Xubuntu 14.04.1 i386. The original symlinks are eliminated and the files come from different lib-folders. I packed exactly what you need to run the precompiled node-v0.12.0 binary.

Now you should be able to actually run nodejs:

$ zo0ok@sleipnir:~$ qemu-i386 -L /opt/node/bin/ /opt/node/bin/node --version
v0.12.0

To make it 100% convenient I created /usr/local/bin/nodejs:

zo0ok@sleipnir:~$ cat /usr/local/bin/nodejs 
#!/bin/sh
qemu-i386 -L /opt/node/bin /opt/node/bin/node "$@"

Dont forget to make it executable (chmod +x).

Performance is not amazing, but good enough for my purposes. It takes a few seconds to start nodejs, but when running it seems quite fast. I may post benchmarks in the future.

Nodejs v0.12.0 on Debian ARMv5/QNAP

I have written before about building NodeJS for ARMv5 (a QNAP TS-109 running Debian). Since nodejs 0.12.0 just came out, of course I wanted to build this version – but that did not go so well.

Just standard ./configure and make gave me this error after a while.

In file included from ../deps/v8/src/base/atomicops.h:146:0,
                 from ../deps/v8/src/base/once.h:55,
                 from ../deps/v8/src/base/lazy-instance.h:72,
                 from ../deps/v8/src/base/platform/mutex.h:8,
                 from ../deps/v8/src/base/platform/platform.h:29,
                 from ../deps/v8/src/assert-scope.h:9,
                 from ../deps/v8/src/v8.h:33,
                 from ../deps/v8/src/accessors.cc:5:
../deps/v8/src/base/atomicops_internals_arm_gcc.h:258:4: error: #error "Your CPU's ARM architecture is not supported yet"

This was quite expected though, since earlier versions (v0.10.25 was the last I built) did not build that easily. So I forced armv5t-architecture and tried again:

export CFLAGS='-march=armv5t'
export CXXFLAGS='-march=armv5t'
make
...
Segmentation fault
make[1]: *** [/home/kvaser/nodejs/node-v0.12.0/out/Release/obj.target/v8_snapshot/geni/snapshot.cc] Error 139
make[1]: Leaving directory `/home/kvaser/ndejs/node-v0.12.0/out'
make: *** [node] Error 2

It took almost 7 hours to get here. I stopped compiling and started reading instead.
It seems:

  • V8 is not supported on ARMv5 anymore (last supported version was 3.17.6 I think)
  • Building V8 as a shared library is not very easy
  • Even if I manage to build 3.17.6 as a shared library, there is no guarantee
    it would work with nodejs v0.12.0
  • Just replacing the v8 directory of v0.12.0 with an older version of v8 and hope everything just builds and runs perfectly seems… unlikely (but I have not tried and failed, yet)
  • The Raspberry Pi, with its ARMv6 CPU, is supposed to work with v0.12.0, but a little hack is required at this time (RPi 2 with ARMv7 seems safe)

The good thing is that nodejs (v0.10.29) can be installed in Debian 7 (wheezy) using backports. This is a rather nice and consistent way to install software not already in Debian Stable.

It is, after all, not strange that V8 is not maintained for an architecture that has not FPU. JavaScript uses 64-bit floats for all numbers, including integers.

Qemu Failed too
I tried running nodejs in Qemu (which works for a PowerPC G4), but this failed:

kvaser@kvaser:/opt/node-v0.12.0-linux-x86/bin$ qemu-i386 -L . ./node 
./node: error while loading shared libraries: rt.so.1: ncanoot  penrshaoed cbjeit f: No such file or directory

This is the actual result – not a copy-paste-mistake… so I believe something (byte order?) is seriously wrong.