Tag Archives: Performance

JavaScript: Sets, Objects and Arrays

JavaScript has a new (well, newish) fancy Set data structure (which does not come with functions for union, intersection and the like, but whatever). A little while ago I tested Binary Search (also not in the standard library) and I was quite impressed with the performance.
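As a minimal sketch of what is missing, union and intersection can be built on top of the standard Set API. The helper names union and intersection are made up for this example.

```javascript
// Hypothetical helpers: union and intersection built on the standard Set API.
function union(a, b) {
  // Spreading both sets into a new Set removes duplicates.
  return new Set([...a, ...b]);
}

function intersection(a, b) {
  // Keep only the members of a that b also contains.
  return new Set([...a].filter(function (v) { return b.has(v); }));
}

var s1 = new Set([1, 2, 3]);
var s2 = new Set([2, 3, 4]);
console.log([...union(s1, s2)]);        // [ 1, 2, 3, 4 ]
console.log([...intersection(s1, s2)]); // [ 2, 3 ]
```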

When I code JavaScript I often hesitate between using an Array and an Object. And I have not started using Set much.

I decided to make some tests. Let's say we have pseudo-random natural numbers (like 10000 of them). We then want to check if a number is among the 10000 numbers or not (if it is a member of the set). A JavaScript Set does exactly that. A JavaScript Object just requires you to do: set[314] = true and you are basically done (the key gets converted to a string, though). For an Array you just push(314), sort the array, and then use binary search to see if the value is there.
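The three approaches can be sketched side by side like this. The sample values and the names asSet, asObject, asArray and binaryHas are made up for illustration, and the hand-written binary search is only a stand-in for whatever implementation you prefer (the test code further down uses lodash.sortedIndex instead).

```javascript
// Hypothetical sample data; the names below are made up for this sketch.
var values = [314, 15, 92, 65, 35];

// Set: add() and has().
var asSet = new Set(values);

// Object: keys are converted to strings, so 314 is stored as '314'.
var asObject = {};
values.forEach(function (v) { asObject[v] = true; });

// Array: sort numerically once, then binary search.
var asArray = values.slice().sort(function (a, b) { return a - b; });

function binaryHas(sorted, value) {
  var lo = 0;
  var hi = sorted.length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1;
    if (sorted[mid] === value) return true;
    if (sorted[mid] < value) lo = mid + 1;
    else hi = mid - 1;
  }
  return false;
}

console.log(asSet.has(314), asObject[314] === true, binaryHas(asArray, 314)); // true true true
console.log(asSet.has(7), asObject[7] === true, binaryHas(asArray, 7));       // false false false
```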

Obviously, if you often add or remove values, (re)sorting the Array will be annoying and costly. But quite often this is not the case.

The test
My test consists of generating N=10000 random unique numbers (with distance 1 or 2 between them). I then insert them (in a kind of pseudo-random order) into an Array (and sort it), into an Object, and into a Set. I measure this time as an initiation time (for each data structure).

I repeat. So now I have 2xArrays, 2xObjects, 2xSets.

This way I can test both iterating and searching with all combinations of data structures (and check that the results are the same and thus correct).

Output of a single run: 100 iterations, N=10000, on a Linux Intel i5 and Node.js 8.9.1 looks like this:

                         ====== Search Structure ======
(ms)                        Array     Object      Set
     Initiate                1338        192      282
===== Iterate =====    
        Array                 800         39       93
       Object                 853        122      170
          Set                1147         82      131

By comparing columns you can compare the cost of searching (and initiating the structure before searching it). By comparing rows you can compare the cost of iterating over the different data structures (for example, iterating over Set while searching Array took 1147ms).

These results are quite consistent on this machine.

Findings
Some findings are very clear (I guess they are quite consistent across systems):

  • Putting values in an Array, sorting it, and then searching it, is much slower and makes little sense compared to using an Object (or a Set)
  • Iterating an Array is a bit faster than iterating an Object or Set, so if you are never going to search, an Array is faster
  • The newer and more specialized Set offers little advantage over good old Objects

What is less clear is why iterating over Objects is faster when searching Arrays, but iterating over Sets is faster when searching Objects or Sets. What I find on other platforms is:

  • Sets seem to perform comparably to Objects on Raspberry Pi, ARMv7.
  • Sets seem to underperform more on Mac OS X

Obviously, all this is very unclear and can vary depending on CPU-cache, Node-version, OS and other factors.

Smaller and Larger sets
These findings hold quite well for smaller N=100 and larger N=1000000. The Array, despite being O(n log n), does not get much worse for N=1000000 than it already was for N=10000.
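A rough back-of-the-envelope check suggests why: the per-element cost of an O(n log n) algorithm grows like log2(n), and that factor changes little between the two sizes. This is only arithmetic on the complexity formula, not a measurement.

```javascript
// Per-element cost of an O(n log n) algorithm grows like log2(n).
var small = Math.log2(10000);    // about 13.3
var large = Math.log2(1000000);  // about 19.9
console.log((large / small).toFixed(2)); // 1.50
```

So going from N=10000 to N=1000000 multiplies the work per element by only about 1.5.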

Conclusions and Recommendation
I think the conservative choice is to use Arrays when order is important or you know you will not look for a member based on its unique id. If members have unique IDs and are not ordered, use Object. I see no reason to use Set, especially if you target browsers (support in IE is still limited in early 2018).

The Code
Here follows the source code. Output is not quite as pretty as the table above.

var lodash = require('lodash');

function randomarray(size) {
  var a = new Array(size);
  var x = 0;
  var i, r;
  var j = 0;
  var prime = 3;

  if ( 50   < size ) prime = 31;
  if ( 500  < size ) prime = 313;
  if ( 5000 < size ) prime = 3109;

  for ( i=0 ; i<size ; i++ ) {
    r = 1 + Math.floor(2 * Math.random());
    x += r;
    a[j] = '' + x;
    j += prime;
    if ( size <= j ) j-=size;
  }
  return a;
}

var times = {
  arr : {
    make : 0,
    arr  : 0,
    obj  : 0,
    set  : 0
  },
  obj : {
    make : 0,
    arr  : 0,
    obj  : 0,
    set  : 0
  },
  set : {
    make : 0,
    arr  : 0,
    obj  : 0,
    set  : 0
  }
};

function make_array(a) {
  times.arr.make -= Date.now();
  var i;
  var r = new Array(a.length);
  for ( i=a.length-1 ; 0<=i ; i-- ) {
    r[i] = a[i];
  }
  r.sort();
  times.arr.make += Date.now();
  return r;
}

function make_object(a) {
  times.obj.make -= Date.now();
  var i;
  var r = {};
  for ( i=a.length-1 ; 0<=i ; i-- ) {
    r[a[i]] = true;
  }
  times.obj.make += Date.now();
  return r;
}

function make_set(a) {
  times.set.make -= Date.now();
  var i;
  var r = new Set();
  for ( i=a.length-1 ; 0<=i ; i-- ) {
    r.add(a[i]);
  }
  times.set.make += Date.now();
  return r;
}

function make_triplet(n) {
  var r = randomarray(n);
  return {
    arr : make_array(r),
    obj : make_object(r),
    set : make_set(r)
  };
}

function match_triplets(t1,t2) {
  var i;
  var m = [];
  m.push(match_array_array(t1.arr , t2.arr));
  m.push(match_array_object(t1.arr , t2.obj));
  m.push(match_array_set(t1.arr , t2.set));
  m.push(match_object_array(t1.obj , t2.arr));
  m.push(match_object_object(t1.obj , t2.obj));
  m.push(match_object_set(t1.obj , t2.set));
  m.push(match_set_array(t1.set , t2.arr));
  m.push(match_set_object(t1.set , t2.obj));
  m.push(match_set_set(t1.set , t2.set));
  for ( i=1 ; i<m.length ; i++ ) {
    if ( m[0] !== m[i] ) {
      console.log('m[0]=' + m[0] + ' != m[' + i + ']=' + m[i]);
    }
  }
}

function match_array_array(a1,a2) {
  times.arr.arr -= Date.now();
  var r = 0;
  var i, v;
  for ( i=a1.length-1 ; 0<=i ; i-- ) {
    v = a1[i];
    if ( v === a2[lodash.sortedIndex(a2,v)] ) r++;
  }
  times.arr.arr += Date.now();
  return r;
}

function match_array_object(a1,o2) {
  times.arr.obj -= Date.now();
  var r = 0;
  var i;
  for ( i=a1.length-1 ; 0<=i ; i-- ) {
    if ( o2[a1[i]] ) r++;
  }
  times.arr.obj += Date.now();
  return r;
}

function match_array_set(a1,s2) {
  times.arr.set -= Date.now();
  var r = 0;
  var i;
  for ( i=a1.length-1 ; 0<=i ; i-- ) {
    if ( s2.has(a1[i]) ) r++;
  }
  times.arr.set += Date.now();
  return r;
}

function match_object_array(o1,a2) {
  times.obj.arr -= Date.now();
  var r = 0;
  var v;
  for ( v in o1 ) {
    if ( v === a2[lodash.sortedIndex(a2,v)] ) r++;
  }
  times.obj.arr += Date.now();
  return r;
}

function match_object_object(o1,o2) {
  times.obj.obj -= Date.now();
  var r = 0;
  var v;
  for ( v in o1 ) {
    if ( o2[v] ) r++;
  }
  times.obj.obj += Date.now();
  return r;
}

function match_object_set(o1,s2) {
  times.obj.set -= Date.now();
  var r = 0;
  var v;
  for ( v in o1 ) {
    if ( s2.has(v) ) r++;
  }
  times.obj.set += Date.now();
  return r;
}

function match_set_array(s1,a2) {
  times.set.arr -= Date.now();
  var r = 0;
  var v;
  var iter = s1[Symbol.iterator]();
  while ( ( v = iter.next().value ) ) {
    if ( v === a2[lodash.sortedIndex(a2,v)] ) r++;
  }
  times.set.arr += Date.now();
  return r;
}

function match_set_object(s1,o2) {
  times.set.obj -= Date.now();
  var r = 0;
  var v;
  var iter = s1[Symbol.iterator]();
  while ( ( v = iter.next().value ) ) {
    if ( o2[v] ) r++;
  }
  times.set.obj += Date.now();
  return r;
}

function match_set_set(s1,s2) {
  times.set.set -= Date.now();
  var r = 0;
  var v;
  var iter = s1[Symbol.iterator]();
  while ( ( v = iter.next().value ) ) {
    if ( s2.has(v) ) r++;
  }
  times.set.set += Date.now();
  return r;
}

function main() {
  var i;
  var t1;
  var t2;

  for ( i=0 ; i<100 ; i++ ) {
    t1 = make_triplet(10000);
    t2 = make_triplet(10000);
    match_triplets(t1,t2);
    match_triplets(t2,t1);
  }

  console.log('TIME=' + JSON.stringify(times,null,4));
}

main();

When to (not) use Web Workers?

Web Workers is a mature, simple, standardised, compatible technology for allowing multithreaded JavaScript-applications in the web browser.

I am not going to write about how to use Web Workers (check the excellent MDN article). I am going to write a little about when and why to (not) use Web Workers.

First, Web Workers are about performance. And performance is typically not the best thing to think about first when you code something.

Second, when you have performance problems and you throw more cores at the problem, your best speedup is x2, x4 or xN. In 2018, 4 cores are quite common, which means that in the optimal case you can make your program 4 times faster by using Web Workers. Unfortunately, if it was not fast enough from the beginning, chances are a 4x speedup is not going to help much. And the cost of a 4x speedup is that 4 times more heat is produced, the battery drains faster, and perhaps other applications suffer. A more efficient algorithm can often produce a 10-100 times speedup without making the maintainability of the program suffer too much (and there are very many ways to make a non-optimised program faster).

Let us say we have a web application. The user clicks “Show report”, the GUI locks/blocks for 10s and then the report displays. The user might accept that the GUI locks, if just for 1-2 seconds. Or the user might accept that the report takes 10s to compute, if it shows up little by little and the program does not appear hung. The way we could deal with this in JavaScript (which is single-threaded and asynchronous) is to break the 10s report calculation into small pieces (say 100 pieces each taking 100ms) and, after calculating each piece, call window.setTimeout, which allows the UI to update (among other things) before calculating another piece of the report. Perhaps a more common and practical approach is to divide the 10s job into logical parts: fetch data, make calculations, make report, but this would not much improve the locked GUI situation since some (or all) parts still take significant (blocking) time.
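A minimal sketch of the piece-by-piece approach could look like this. The helper processInChunks and its parameters are hypothetical, and it uses the global setTimeout so the same code runs in a browser and in Node.js.

```javascript
// Hypothetical helper: run fn on every item, at most chunkSize items per
// timer tick, so the event loop (and in a browser, the UI) gets a turn
// between chunks.
function processInChunks(items, chunkSize, fn, done) {
  var results = [];
  var index = 0;

  function nextChunk() {
    var end = Math.min(index + chunkSize, items.length);
    for (; index < end; index++) {
      results.push(fn(items[index]));
    }
    if (index < items.length) {
      setTimeout(nextChunk, 0); // yield before the next piece
    } else {
      done(results);
    }
  }

  nextChunk();
}

// Usage: square 10 numbers, 3 at a time.
var chunkedOutput = null;
processInChunks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3, function (x) {
  return x * x;
}, function (results) {
  chunkedOutput = results;
  console.log(results.length + ' items processed');
});
```

The total work is the same (slightly more, in fact), but no single piece blocks for long.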

If we could send the entire 10s job to a Web Worker our program GUI would be completely responsive while the report is generated. Now the key limitation of a web worker (which is also what allows it to be simple and safe):

Data is copied to the Worker before it starts, and copied from the Worker when it has completed (rather than being passed by reference).

This means that if you already have a lot of data, it might be quite expensive to copy that data to the web worker, and it might actually be cheaper to just do the job where the data already is. In the same way, since there is some overhead in calling the Web Worker, you can’t send too many too small pieces of work to it, because you will occupy yourself with sending and receiving messages rather than just doing the job right away.

This leaves us with obvious candidates for web workers (you can use Google):

  • Expensive searches (like chess moves or travelling salesman solutions)
  • Encryption (but chances are you should not do it in JavaScript in the first place, for security reasons)
  • Spell and grammar checker (I don’t know much about this).
  • Background network jobs

This is not too useful in most cases. What would be useful would be to send packages of work (arrays), like streams in a functional programming way: map(), reduce(), sort(), filter().

I decided to write some Web Worker tests based on sort(). Since I can not (easily, and there are probably good reasons) write JavaScript in WordPress I wrote a separate page with the application. Check it out now:

So, for 5 seconds I try to do the following job as many times as I can, while I keep track of how much the GUI is suffering:

  1. create an array of 10001 random numbers: O(n)
  2. sort it: O(n log n)
  3. get the median (array[5000]): O(1)
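The three steps above can be sketched as one function, which is roughly what a worker would run when it is given no data at all. This is not the code of the test page: the function name, the e.data.size message field and the guard around self.onmessage are assumptions made for this sketch.

```javascript
// Steps 1-3 as one function, so a worker can run them without receiving data.
function medianOfRandomArray(size) {
  var a = new Array(size);
  for (var i = 0; i < size; i++) {
    a[i] = Math.random();                     // step 1: O(n)
  }
  a.sort(function (x, y) { return x - y; });  // step 2: O(n log n), numeric order
  return a[(size - 1) / 2];                   // step 3: O(1), size assumed odd
}

// Inside a worker script, 'self' is the worker's global scope. This guard is
// an assumption that lets the same file also load in Node or a main page.
if (typeof self !== 'undefined' && typeof importScripts === 'function') {
  self.onmessage = function (e) {
    // e.data.size is a hypothetical message field.
    self.postMessage(medianOfRandomArray(e.data.size));
  };
}

console.log(medianOfRandomArray(10001));
```

On the main page this would be started with new Worker(...) and postMessage({ size: 10001 }), so only a tiny message goes in and a single number comes back.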

The expensive part is step 2, the sort (well, I actually have not measured 1 vs 2). If the ratio of amount of work done per byte being sent is high enough then it can be worth it to send the job to a Web Worker.

If you run the tests yourself I think you will see that the first Web Worker tests that outsource all of 1-2-3 are quite ok. But this basically means giving the web worker no data at all and, when it has done a significant amount of work, receiving just a few numbers. This is more Web Worker friendly than Chess, where at least the board would need to be sent.

If you then run the tests that outsource just sort() you see significantly lower throughput. How suitable is sort()? Well, sorting 10k ~ 2^13 elements should require each element to be compared (accessed) about 13 times. And there is no data sent that is not needed by the Web Worker. Just as a counter example: if you send an order to get back the sum of the lines, most of the order data is ignored by the Web Worker, and it just needs to access each line value once; much, much less suitable than sort().
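To get a feel for the amount of work per element, one can count comparator calls while sorting. This is a sketch with made-up names (countComparisons, sortSize); the exact count depends on the sort algorithm of the JavaScript engine, so treat n * log2(n) only as an order-of-magnitude reference.

```javascript
// Count how often the comparator is called while sorting n random numbers.
function countComparisons(n) {
  var a = [];
  for (var i = 0; i < n; i++) a.push(Math.random());
  var count = 0;
  a.sort(function (x, y) { count++; return x - y; });
  return count;
}

var sortSize = 10001;
var comparisons = countComparisons(sortSize);
console.log('comparisons: ' + comparisons);
console.log('n * log2(n): ' + Math.round(sortSize * Math.log2(sortSize)));
```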

Findings from tests
I find that sort(), being O(n log n), on an array of numbers is far too fast to be outsourced to a Web Worker. You need to find a much more “dense” problem to benefit from a Web Worker.

Islands of data
If you can design your application in such a way that one Web Worker maintains its own full state and just shares small selected parts occasionally, that could work. The good thing is that this would also be clean encapsulation of data and separation of responsibilities. The bad thing is that you probably need to design with the Web Worker in mind quite early, and this kind of premature optimization is often a bad idea.
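A sketch of such an island, under the assumption of a made-up message protocol with 'add' and 'summary' messages: the full state stays inside the handler and only a small summary ever leaves it. In a real worker, self.onmessage would call the handler and postMessage any non-null result.

```javascript
// Hypothetical message protocol: { type: 'add', value } and { type: 'summary' }.
function createIsland() {
  var values = [];  // full state, private to the island

  return function handleMessage(message) {
    if (message.type === 'add') {
      values.push(message.value);
      return null;  // nothing to send back
    }
    if (message.type === 'summary') {
      var sum = 0;
      for (var i = 0; i < values.length; i++) sum += values[i];
      return { count: values.length, sum: sum };  // small reply only
    }
    throw new Error('unknown message type: ' + message.type);
  };
}

var island = createIsland();
island({ type: 'add', value: 10 });
island({ type: 'add', value: 32 });
console.log(island({ type: 'summary' })); // { count: 2, sum: 42 }
```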

This could be letting a Web Worker do all your I/O. But if most data that you receive is needed in your application, and most data you send comes straight from your application, the benefit is very questionable. And if most data you receive is not needed in your application, perhaps you should not receive so much data in the first place. Even if you process your incoming data quite a lot (validating, integrating with current state, precalculating), I would not expect it to come very close to the computational intensity of my sort().

Conclusions
Unfortunately, the simplicity and safety of Web Workers is also their biggest limitation. The primary reason for using a Web Worker should be performance, and even for artificial problems it is hard to get any benefit.

Minification of a real web application

I have built and I maintain a reasonably large (AngularJS) web application and here follow a few notes on the effect of minification.

I start with the findings:

                            Uncompressed         GZIP     Minified    Min+GZIP

App 1:  Size        (kb)            1130         1130          843         841
        Transferred (kb)            1150          375          861         308
        Load time    (s)             2.8          1.6          2.7         1.7

App 2:  Size        (kb)             708          708          659         659
        Transferred (kb)             721          359          672         347
        Load time    (s)             4.0          3.5          3.1         3.5

Conclusions
You should always enable gzip on the server. It is faster to compress and send less data than to send the uncompressed data. The benefits of gzip are huge and there are no negative side effects.
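To see the effect of gzip in isolation, here is a small sketch using Node's built-in zlib module (my figures above come from nginx doing the compression, not from this code). The payload is made up and far more repetitive than real source code, so it compresses unrealistically well.

```javascript
var zlib = require('zlib');

// Hypothetical payload: repetitive text, loosely resembling source code.
var line = 'function add(a, b) { return a + b; } // adds two numbers\n';
var source = '';
for (var i = 0; i < 200; i++) source += line;

var raw = Buffer.from(source, 'utf8');
var gzipped = zlib.gzipSync(raw);

console.log('raw bytes:     ' + raw.length);
console.log('gzipped bytes: ' + gzipped.length);
```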

Minification saves some bandwidth (and if unlike me you do it ahead of time, some loading time). But unless your code contains mostly comments the effects are marginal (although that might be a big saving if you use very much bandwidth or you are looking for fastest possible load times).

Also, gzip tends to be good at what minification can easily do, and while the effect of minification alone is quite significant, the effect of minification together with gzip is smaller.

Behind the figures
The figures above come from Firefox Load time over the internet.

  • App1: About 100 files are served, mostly .js (a few .html and .css)
  • App2: About 80 files are served, mostly .js (a few .html and .css)
  • App1: Angular is always pre-minified 165kb, gzipped to 67kb.
  • App2: Angular+modules is always pre-minified 298kb, gzipped to 127kb
  • App2 contains a few fonts which are neither minified nor gzipped (142kb)
  • Files served by Node.js
  • Files minified by custom Node.js code in real time
  • Files gzipped by nginx in real time
  • Not everything is initiated when Load is complete (more html-files are loaded dynamically as user navigates, and data is loaded from APIs on demand)

Implications of minification
Minification (and possibly packaging of code) has more implications than gzip. Possible negative side effects are:

  • A build process is not strictly needed for web development, but minification is often done as part of a build process, increasing complexity of development, testing and deployment.
  • Testing and development is made harder when debugging minified code (although there are tools to mitigate this).
  • More aggressive minification can have unexpected results

The minification code I run in Node.js, when I serve a file, basically just:

  • Removes all white space in the beginning and end of lines
  • Removes all comments
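A naive sketch of this kind of line-preserving minification follows. It is not my production code: the function name simpleMinify is made up, and it only blanks lines that consist entirely of a // comment. Block comments, trailing comments and multi-line template literals (where trimming would change the string) are deliberately not handled.

```javascript
// Naive sketch: trims each line and blanks lines that are entirely a
// '//' comment. Block comments and trailing comments are left alone here.
function simpleMinify(source) {
  return source.split('\n').map(function (line) {
    var trimmed = line.trim();
    if (trimmed.indexOf('//') === 0) return '';  // whole-line comment
    return trimmed;
  }).join('\n');
}

var input = [
  '// add two numbers',
  'function add(a, b) {',
  '    return a + b;   ',
  '}'
].join('\n');

console.log(simpleMinify(input));
```

Because removed lines are replaced by empty lines rather than deleted, the output has the same number of lines as the input.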

The nice thing about this simple minification strategy is that everything that is obviously just waste is removed at a low cost, but the code is for all practical purposes completely unchanged (even line numbers are preserved to not complicate debugging). Also, developers should feel free to write as many comments as they like in the code, yet comments should never be served in a public facing application. More powerful minification comes at higher costs, and the effects are probably mostly lost after gzip.

I guess every project and system has a sweet spot when it comes to minification and I think my simple minification strategy makes sense for my needs.

Syncthing v0.14.40, Raspberry Pi, 100% CPU

I think Syncthing is an amazing piece of software, but I ran into a problem last week.

I have a library of 10 different folders, 120000 files, 42000 directories and 428GB of data.

I thought that was a little bit too much for my RPi V1 (Syncthing 0.14.40, Arch Linux), because it constantly ran at 100%. I raised Rescan Interval to several hours (so it would finish before starting over).

After startup it took about 10-15 min to get the web GUI up, and about an hour to scan all folders for the first time. Well, that is ok, but after that it still constantly used 100% CPU even though all folders were “up to date”.

It turned out it crashed and started over. I found panic logs in .config/syncthing and error messages in .config/syncthing/index-v0.14.0.db/LOG.

Some errors indicated Bad Magic Number and Checksum Corruption. The usual reason for this seems to be hardware problems (!?!).

I upgraded my RPi V1 to an RPi V2, with little success. Then I found that I had similar problems on another RPi V2. So after shutting down Syncthing I tried the quite scary:

  $ syncthing -reset-database      ( does not start syncthing )      
  $ syncthing                      ( start syncthing )

After several hours of scanning everything seems to work perfectly!
Let us see how long that lasts.

Peculiar Compiler Optimizations

My teacher in a High Performance Computing class once told me not to confuse the compiler. This was in the late 90s, and SGI C and Fortran compilers were supposed to replace entire blocks of code with highly optimised implementations. As long as the compiler understood your intentions.

I had never seen this happen myself, but yesterday perhaps I did! Read on.

I have been playing around with LISP, solving Project Euler challenges on Hackerrank. For problems 44 and 45 I decided to do Binary Search (which afterwards turned out not to be so smart, but that is another story) and took the implementation from Rosetta Code (the iterative one).

Binary search is about finding an element in a sorted array by starting in the middle, and jumping left or right, cutting the remaining array in half each time.

In my case I decided just a part of the array was worth searching so instead of searching the entire array [0..length] I wanted to search up to the Nth element [0..N]. Searching fewer elements should be faster, so I improved the binary search function to take an additional argument: hi. For SBCL (Steel Bank Common Lisp), this surprisingly had horrible effect on performance.
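For readers who do not read Lisp, the idea can be sketched in JavaScript as follows. This is a hypothetical port written for illustration, not the code that was benchmarked (the actual Lisp functions are listed further down); it returns -1 instead of nil when the value is absent.

```javascript
// Returns the index of value in sorted[0..hi], or -1 when it is absent.
// hi defaults to the last index, which gives the standard search.
function binarySearch(sorted, value, hi) {
  var low = 0;
  var high = (hi === undefined) ? sorted.length - 1 : hi;
  while (low <= high) {
    var middle = Math.floor((low + high) / 2);
    if (sorted[middle] > value) high = middle - 1;
    else if (sorted[middle] < value) low = middle + 1;
    else return middle;
  }
  return -1;
}

var triangular = [0, 1, 3, 6, 10, 15, 21, 28];
console.log(binarySearch(triangular, 10));    // 4
console.log(binarySearch(triangular, 21, 3)); // -1, index 6 is outside [0..3]
console.log(binarySearch(triangular, 7));     // -1
```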

Benchmark Results
The results for different algorithms, different machines and different LISP implementations follow. The RPi V1 runs Arch Linux, Clisp comes with Arch, SBCL is downloaded from the SBCL webpage. The RPi V2 runs Raspbian and the SBCL that comes with the distribution. The Celeron runs Ubuntu and the SBCL that comes with it. The MacBook Air runs OS X and SBCL is downloaded separately.

 (times in seconds)             Array Size Standard Optimized Recursive    C -O2
================================================================================
RPi V1  ARMv6 700MHz  Clisp 2.49      5000    640       633       720
                      SBCL  1.3.12    5000     15.6      27        34       0.95
RPi V2  ARMv7 900MHz  SBCL  1.2.4     5000      6.2      16        17       0.31
                                     20000    110       293       300       5.8
NUC Celeron J3455     Clisp 2.49     20000    765       762       720   
                      SBCL  1.3.3    20000      8.3      16.7      18.0     1.0
MacBook Air i5        SBCL  1.2.11   20000      4.0      11.5      12.3     0.75

A very slight “optimization” turns out to have a very negative impact on performance for the quite fast (compiled) SBCL. I can’t imagine any other explanation than that SBCL replaces the standard binary search with optimized code rather than executing my program. For Clisp the optimization actually works quite as would be expected and the recursive code is actually the fastest. On the Celeron, Clisp and SBCL behave in completely opposite ways.

Comparing to C
The other week I had the feeling (SBCL) LISP was fast and decided to compare LISP to C. This time I had the feeling that LISP was rather slow so I ported my test program to C (basically line by line). Well, I found that SBCL is actually pretty fast (especially on x86/x64), and C was faster only thanks to -O2 on some systems. -O2 actually made the C-program more than 5 times faster: perhaps the C-compiler also replaces the entire binary search?

The Test Program
The code follows. The only difference between Standard and Optimized is the single line that is commented out (with ; in LISP) selecting which binary search to run (the length of the function name does not explain the performance difference).

The program creates an array of length N and populates it with values by the formula n(n+1)/2. This takes little time. It then checks, for the values 10,20,30…, whether each value is found in the array (using binary search). In this program the entire array is always searched, not taking advantage of the extra parameter (although the optimized version does not need to find the length of the array every time it is called).

(defun binary-search (value array)                       ; Standard 2 lines
    (let ((low 0) (high (1- (length array))))            ;
        (do () ((< high low) nil)
            (let ((middle (floor (+ low high) 2)))
                (cond ((> (aref array middle) value)
                       (setf high (1- middle)))
                      ((< (aref array middle) value)
                       (setf low (1+ middle)))
                      (t (return middle)))))))

(defun binary-search-optimized (value array hi)          ; Optimized 2 lines
    (let ((low 0) (high hi))                             ;
        (do () ((< high low) nil)
            (let ((middle (floor (+ low high) 2)))
                (cond ((> (aref array middle) value)
                       (setf high (1- middle)))
                      ((< (aref array middle) value)
                       (setf low (1+ middle)))
                      (t (return middle)))))))

(defun binary-search-r (value
                        array
                        &optional (low 0)
                        (high (1- (length array))))
  (if (< high low)
      nil
      (let ((middle (floor (+ low high) 2)))
        (cond ((> (aref array middle) value)
               (binary-search-r value array low (1- middle)))
              ((< (aref array middle) value)
               (binary-search-r value array (1+ middle) high))
              (t middle)))))

(defun formula (n)
    (/ (* n (+ n 1)) 2))

(defun init-array (n)
    (let ((arr (make-array n)))
        (loop for i from 0 to (1- n) do
            (setf (aref arr i) (formula (1- i))))
        arr))

(defun solve (arr n max)
    (let ((ret 0))
        (loop for i from 10 to max by 10 do
            (if (binary-search i arr)                     ; Toggle code used
;           (if (binary-search-r i arr)                   ;
;           (if (binary-search-optimized i arr n)         ;
                (incf ret)
                Nil))
        ret))
            
(defun main ()
    (let ((n (read)))
        (let ((arr (init-array n)))
            (format T "~D~%" (solve arr (1- n) (aref arr (1- n)))))))

(main)

Since I am a very novice LISP programmer I appreciate any feedback. The code above does not solve Project Euler 44 or 45, it is much simplified to test binary search. Initially I wrote code that relied on recursion rather than loops but I exceeded the stack size and ended up with loops (according to what I read, loops rather than recursion is the preferred style of Common Lisp).

Conclusion
Well... optimization is hard, and don't make any assumptions. As I have found many times before, what makes code faster on some platforms can make it slower on others. When it comes to optimizing SBCL and compiled LISP much experience is required, and don't forget to measure!

Playing with LISP and LISP vs C

Lisp is fun! Well, ever since I first heard about Lisp I have been fascinated, but I have found it hard to learn Lisp and to play with it in a meaningful way. A few years ago I wrote about it here and here. As usual, the first steps of learning something new can be the hardest.

Occasionally I use Hackerrank to find programming challenges to solve for fun. I particularly like the Project Euler competition. I find it particularly good for trying out new languages: you get a “meaningful” challenge, a simple environment prepared in your web browser, and automated test cases. So, this time I didn’t waste my time trying to find the right Lisp implementation for me, I just started hacking on Project Euler 38 and 39 on Hackerrank.

Problem 38 was quite simple, but 39 was more interesting. When I had solved it, I found my implementation was not at all fast enough, so I started experimenting locally (the Hackerrank environment is not optimal for tweaking, optimization and debugging).

Choosing a (Common) Lisp implementation
There are quite a few Common Lisp implementations out there. The one Hackerrank uses is SBCL. That is clearly the Common Lisp implementation I would recommend (based on my little experience) if it is available for your platform.

I installed SBCL with apt-get in Ubuntu. I also downloaded binaries directly for my Mac OS X computer and my Raspberry Pi (v1) running Arch Linux. Installation is a bit non-standard, but you can actually run it without installing (just execute run-sbcl.sh in the downloaded folder).

I also tried clisp and ecl, neither of which could deal with the memory usage (stack size) of my program. For clisp I found no way to manipulate stack sizes at all. For ecl I made some progress but I could not make it run my program.

SBCL is a Lisp compiler, and it produces fast and efficient code. I later compared it to C.

Project Euler 39
Project Euler 39 is basically about finding integer solutions to the Pythagorean theorem. For a given, large, perimeter, how many right triangles are there? For example:

300000^2 + 400000^2 = 500000^2

This triangle has a perimeter of 300000+400000+500000=1200000. What other values for a and b so that

a + b = 700000
a^2 + b^2 = 500000^2

are there? The Hackerrank challenge requires you to work with perimeters up to 5000000. If you implement a solution, a few things to immediately note:

  • The squares won't fit in a 32-bit integer. They will fit with no loss of precision in the 53 bits of a 64-bit double and they will also fit in a 64-bit integer. (This does not matter for Common Lisp)
  • If you want to do recursion (and of course you want to when you code Lisp) it will be millions of recursion steps, which will be a challenge to the stack size. (This also turned out not to matter for SBCL)
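To make the problem and the number ranges concrete, here is a brute-force sketch for a single perimeter p. It is not my Lisp solution: the function name is made up, and the formula for b is obtained by substituting c = p - a - b into a^2 + b^2 = c^2.

```javascript
// Counts right triangles a < b < c with integer sides and a + b + c = p.
// Eliminating c from a^2 + b^2 = c^2 gives b = p(p - 2a) / (2(p - a)).
function countRightTriangles(p) {
  var count = 0;
  for (var a = 1; 3 * a < p; a++) {
    var numerator = p * (p - 2 * a);
    var denominator = 2 * (p - a);
    if (numerator % denominator === 0 && a < numerator / denominator) {
      count++;
    }
  }
  return count;
}

console.log(countRightTriangles(12));   // 1: (3, 4, 5)
console.log(countRightTriangles(120));  // 3: (20, 48, 52), (24, 45, 51), (30, 40, 50)

// The largest intermediate value for p = 5000000 is below p * p.
var largest = 5000000 * 5000000;
console.log(largest > Math.pow(2, 31) - 1);       // true: too big for int32
console.log(largest <= Number.MAX_SAFE_INTEGER);  // true: exact in a double
```

JavaScript numbers are 64-bit doubles, so the sketch relies on exactly the 53 bits of precision mentioned above.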

The Lisp implementation
It turned out that the SBCL compiler optimized the recursion in such a way that the memory usage was quite low. SBCL successfully runs my program on RPi/Arch, Intel/Ubuntu and Intel/OSX with quite reasonable memory usage.

Since this is about learning Lisp I wanted a 100% functional programming implementation. Only pure functions. A lot of my code is about generating, modifying and testing triangles. A triangle (a b c) can obviously be represented as a Lisp list (a b c) and this was my first implementation. Then if you want to read a, b or c from a list abc, or create the list from a, b and c, you can do:

  a: (car abc)
  b: (car (cdr abc))
  c: (car (cdr (cdr abc)))

abc: (list a b c)

I found this cumbersome. It became a lot of list-to-variables and variables-to-list overhead (I didn't care so much about performance, more about my code readability). I learnt that Lisp functions can return multiple values using values, that you can bind them with multiple-value-bind, and that you can use them as arguments to a function using multiple-value-call. This felt functional and pure enough, and it made my code 25% faster than the car/cdr pattern above:

; a (stupid) function returning a triangle as three values
(defun get-345-triangle ()
  (values 3 4 5))

; a function calculating the perimeter of triangle (from a function)
(defun triangle-perimeter-1 (tri-func)
  (multiple-value-bind (a b c) (funcall tri-func)
    (+ a b c)))

; and in this case you don't need to bind, you can use + directly
(defun triangle-perimeter-2 (tri-func)
  (multiple-value-call #'+ (funcall tri-func)))

; now this works
(triangle-perimeter-1 #'get-345-triangle)
(triangle-perimeter-2 #'get-345-triangle)

Since I am a very inexperienced Lisp programmer I appreciate suggestions for improvement.

Performance of Lisp
My final Hackerrank submission of Lisp code executes in about 4.5 seconds on my Intel i5/Ubuntu. It takes about the same time on the Hackerrank web page, which is fast enough to pass all tests. On the Raspberry Pi v1 (ARMv6 @700 MHz) it takes more than 700 seconds. My intuition told me that 4.5 seconds was very good. This made me ask two questions. How would Lisp compare to C? And why is the ARM more than 100 times slower, how would that compare in C?

The C implementation
My ambition was to rewrite Lisp to C line by line. So my C-program has exactly the same functions which take almost exactly the same arguments. All calculations are identical and performed in exactly the same order. The C-program relies entirely on recursion instead of loops (just like the Lisp program). However…

Functions in C cannot return multiple values. While Lisp had values I decided to use a reference to a struct in C:

(defun get-a-triangle()
  (values x y z))

void get_a_triangle(struct triangle *t) {
  t->a = x;
  t->b = y;
  t->c = z;
}

If the C-triangle struct is a local variable on the caller's stack the difference is quite small (from a practical point of view; from a theoretic, strict functional programming perspective it's a different story).

Numbers in Lisp have arbitrary precision, and whether they are integers or floats makes no difference. So, when porting to C, I had to pick numeric types. For most purposes, int32_t was good enough. But for the Pythagorean theorem calculation higher precision was needed (as I wrote above, the 53 bits of a double, or the 64 bits of an int64_t, are enough). So I ended up with 5 versions of the C-program (to compare performance):

  1. All 64-bit integers
  2. 32-bit integers, 64-bit for “triangles”
  3. 32-bit integers, double for “triangles”
  4. 32-bit integers, 64-bit only for pythagoras calc
  5. 32-bit integers, double only for pythagoras calc

(In cases 2,3 the struct triangle has int64_t/doubles properties, and all manipulations and calculations on triangles use these datatypes. In cases 4,5 everything is int32_t, except the internals of a single function, which casts to higher precision before doing its calculations.)

The C-program requires a significant stack size. The stack size can be obtained and changed like this (numbers in kB; ulimit -a lists all limits):

$ ulimit -s
8192

$ ulimit -s 100000

For my program, a stack size much higher than 8192 is needed (see below). It seems impossible to get a larger stack than 64 MB on Mac OS X, so my C program could never run there.

Benchmark findings
All C-programs are compiled with gcc -O2.

 CPU            MHZ      SBCL        64     32/64  32/double   32(64)  32(double)
==================================================================================
Time (s)
 i5-4250U 1300-2600       4.5      1.52      1.52      1.60      1.54      1.58
 ARMv6          700      ~715        85        83        45        42        39
 ARMv7          900       357        23        21        13        12        10

Max Res (MB)
 i5-4250U                  41       103       103       103       103       103
 ARMv6                     50       220       210        79       110        76
 ARMv7                     57       180       160        87        97        62

This is not too easy to interpret! The ironic thing is that the fastest thing on the x64-cpu (64-bit integers everywhere) is the slowest on the ARMv6. However, the fastest option on the ARMv6 (32-bit everywhere, and when absolutely needed, use double) is almost the worst on the i5 CPU.

When it comes to the 64-bit i5, it basically does not matter what datatypes you use.

When it comes to the ARMv6, the most important thing is to not store the triangles as int64_t. The strange thing here is the stack sizes. Why does it double (compared to x64) when triangles are stored as int64_t? And the doubles, why do they reduce stack size so much (where are all these doubles actually stored)?

The time command gives max resident memory usage. If I set ulimit -s 128 the first two programs fail (with Segmentation fault 11), and the last three ones succeed, on the ARMv6.

I have found before that the performance of the ARMv6 suffers because of its slow memory and small cache. It is quite possible that the poor performance of the ARMv6 compared to the i5 is related to its slow memory, and the recursion (and stack memory) heavy algorithm.

Finally, SBCL in x64 has very good performance even compared to C (however, an iterative C-implementation, fitting completely in cache, would probably be faster). Note that I am a novice Lisp programmer and this is a math heavy program where the generic number type of Lisp will come at a cost. On the ARMv6, Lisp performance suffers much more.

Windows stack size limit
For Windows, the stack size limit is set in the binary, not in the shell. With Cygwin/GCC use the flag -Wl,--stack,1000000 for one million bytes. Note that these are options passed on to the linker.

Future investigations
I am curious how much faster a loop-based C-program with a minimal memory footprint would be.

The source code
Since this code solves a problem in Hackerrank I hesitate to publish it. If you want it for any other reason than just running it on Hackerrank let me know.

All JavaScript objects are not equally fast

One thing I like with JavaScript and NodeJS is to have JSON in the entire stack. I store JSON on disk, process JSON data server side, send JSON over HTTP, process JSON data client side, and the web GUI can easily present JSON (I work with Angular).

As a result of this, not all objects are created the same way. Let's say I keep track of Entries, and I have an Entry-constructor that initiates new objects with all fields (no more, no less). At the same time I receive Entry-objects as JSON-data over the network.

A strategy is needed:

  1. Have a mix of raw JSON-Entries and Objects that are instanceof Entry
  2. Create real Entry-objects from all JSON-data
  3. Only work with raw JSON-Entries

Note that if you don’t go with (2) you can’t use prototype, expect objects to have functions or use instanceof to identify objects.
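A minimal sketch of what strategy (2) can look like. The Entry fields (id, text), the describe() method and the entryFromJson() helper are made up for this illustration; they are assumptions, not taken from any real code of mine.

```javascript
// Hypothetical Entry type; the fields id and text are assumptions for this sketch.
function Entry(id, text) {
  this.id = id;
  this.text = text;
}

Entry.prototype.describe = function () {
  return this.id + ': ' + this.text;
};

// Strategy (2): turn raw JSON data into a real Entry object.
function entryFromJson(raw) {
  return new Entry(raw.id, raw.text);
}

var raw = JSON.parse('{"id":7,"text":"hello"}');
var entry = entryFromJson(raw);

console.log(raw instanceof Entry);    // false
console.log(entry instanceof Entry);  // true
console.log(typeof raw.describe);     // undefined
console.log(entry.describe());        // 7: hello
```

The raw object carries the same data, but only the converted one has the prototype functions and passes instanceof.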

Another, perhaps not obvious, aspect is that performance is not the same. When you create a JavaScript object using new, the runtime actually creates a class with fast-to-access properties. Properties of such objects are faster to access than those of

  • an empty object {} with properties set afterwards
  • an object created with JSON.parse()

I wrote a program to test this. The simplified explanation is that I obtained an array of objects that I then sorted/calculated a few (6) times. For a particular computer and problem size I got these results:

TIME   PARAMETER   DESCRIPTION
3.3s       R       Produce random objects using "new"
4.4s       L       Load objects from json-file using JSON.parse()
3.0s       L2      json-file, JSON.parse(), send raw objects to constructor
3.2s       L3      load objects using require() from a js-file

I will be honest and say that the implementation of the compare function sent to sort() matters. Some compare functions suffered more, some less, from different object origins. Some compare functions are more JIT-optimised and faster on the second run. However, the consistent finding is that raw JSON-objects are noticeably slower than objects created with new and a constructor function: about 30-50% in the run above (4.4s versus 3.0-3.3s).

What is not presented above is the cost of parsing and creating objects.

My conclusion from this is that unless you have very strict performance requirements you can use the raw JSON-objects you get over the network.

Below is the source code (for Node.js). Apart from the parameters R, L, L2 and L3 there is also an S(tore) parameter. It creates the json- and js-files used by the Load options. So typically you run the program with the S option first, and then with the other options. A typical run looks like this:

$ node ./obj-perf.js S
Random: 492ms
Store: 1122ms

$ node ./obj-perf.js R
Random: 486ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3350ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3361ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3346ms

$ node ./obj-perf.js L
Load: 376ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 4382ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 4408ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 4453ms

$ node ./obj-perf.js L2
Load: 654ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3018ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 2974ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 2890ms

$ node ./obj-perf.js L3
Load: 1957ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3436ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3264ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3199ms

The columns with numbers (110511) are checksums calculated between the sorts. They should be equal across runs; otherwise they don't matter.

const nodeFs = require('fs');

function Random(seed) {
  this._seed = seed % 2147483647;
  if (this._seed <= 0) this._seed += 2147483646;
}

Random.prototype.next = function () {
  return this._seed = this._seed * 16807 % 2147483647;
};

function Timer() {
  this.time = Date.now();
}

Timer.prototype.split = function() {
  var now = Date.now();
  var ret = now - this.time;
  this.time = now;
  return ret;
};

function Point() {
  this.a = -1;
  this.b = -1;
  this.c = -1;
  this.d = -1;
  this.e = -1;
  this.f = -1;
  this.x =  0;
}

function pointInit(point, rand) {
  var p;
  for ( p in point ) {
    point[p] = rand.next() % 100000;
  }
}

function pointLoad(json) {
  var p;
  var point = new Point();
  for ( p in point ) {
    point[p] = json[p];
  }
  return point;
}

function pointCmp(a,b) {
  return pointCmpX[a.x](a,b,a.x);
}

function pointCmpA(a,b) {
  if ( a.a !== b.a ) return a.a - b.a;
  return pointCmpB(a,b);
}

function pointCmpB(a,b) {
  if ( a.b !== b.b ) return a.b - b.b;
  return pointCmpC(a,b);
}

function pointCmpC(a,b) {
  if ( a.c !== b.c ) return a.c - b.c;
  return pointCmpD(a,b);
}

function pointCmpD(a,b) {
  if ( a.d !== b.d ) return a.d - b.d;
  return pointCmpE(a,b);
}

function pointCmpE(a,b) {
  if ( a.e !== b.e ) return a.e - b.e;
  return pointCmpF(a,b);
}

function pointCmpF(a,b) {
  if ( a.f !== b.f ) return a.f - b.f;
  return pointCmpA(a,b);
}

var pointCmpX = [pointCmpA,pointCmpB,pointCmpC,pointCmpD,pointCmpE,pointCmpF];

function pointDist(a,b) {
  return Math.min(
    (a.a-b.a)*(a.a-b.a),
    (a.b-b.b)*(a.b-b.b),
    (a.c-b.c)*(a.c-b.c),
    (a.d-b.d)*(a.d-b.d),
    (a.e-b.e)*(a.e-b.e),
    (a.f-b.f)*(a.f-b.f)
  );
}

function getRandom(N) {
  var i;
  var points = new Array(N);
  var rand   = new Random(14);

  for ( i=0 ; i<N ; i++ ) {
    points[i] = new Point();
    pointInit(points[i], rand);  // pointInit returns nothing; no assignment needed
  }
  return points;
}

function test(points) {
  var i,j;
  var dist;
  var dists = [];

  for ( i=0 ; i<6 ; i++ ) {
    dist = 0;
    for ( j=0 ; j<points.length ; j++ ) {
      points[j].x = i;
    }
    points.sort(pointCmp);
    for ( j=1 ; j<points.length ; j++ ) {
      dist += pointDist(points[j-1],points[j]);
    }
    dists.push(dist);
  }
  return 'DISTS=' + dists.join(', ');
}

function main_store(N) {
  var timer = new Timer();
  var points = getRandom(N);
  console.log('Random: ' + timer.split() + 'ms');
  nodeFs.writeFileSync('./points.json', JSON.stringify(points));
  nodeFs.writeFileSync('./points.js', 'exports.points=' +
                                      JSON.stringify(points) + ';');
  console.log('Store: ' + timer.split() + 'ms');
}

function main_test(points, timer) {
  var i, r;
  for ( i=0 ; i<3 ; i++ ) {
    r = test(points);
    console.log(r + ' : ' + timer.split() + 'ms');
  }
}

function main_random(N) {
  var timer = new Timer();
  var points = getRandom(N);
  console.log('Random: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main_load() {
  var timer = new Timer();
  var points = JSON.parse(nodeFs.readFileSync('./points.json'));
  console.log('Load: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main_load2() {
  var timer = new Timer();
  var points = JSON.parse(nodeFs.readFileSync('./points.json')).map(pointLoad);
  console.log('Load: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main_load3() {
  var timer = new Timer();
  var points = require('./points.js').points;
  console.log('Load: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main() {
  var N = 300000;
  switch ( process.argv[2] ) {
  case 'R':
    main_random(N);
    break;
  case 'S':
    main_store(N);
    break;
  case 'L':
    main_load();
    break;
  case 'L2':
    main_load2();
    break;
  case 'L3':
    main_load3();
    break;
  default:
    console.log('Unknown mode=' + process.argv[2]);
    break;
  }
}

main();

Review: NUC vs Raspberry Pi

I like small, cheap, quiet computers… perhaps a little too much. For a long time I have used a Raspberry Pi V2 (QuadCore@900MHz and 1GB RAM) as a workstation. To be honest, I have not used it for web browsing, that is just too painful. But I have used it for programming and running multiple Node.js services, and a few other things.

Although there are so many single board computers, it is hard to find really good alternatives to the Raspberry Pi. And when I look into it, I find that Intel NUCs are very good options. So, I just decided to replace my RPi2 workstation with the cheapest NUC that money can currently buy: the NUC6CAY with a Celeron J3455 CPU. It sounds cheap, particularly for something server-like. The interesting thing with the J3455 CPU is that it is actually quad core, with no hyper-threading. To me it sounds amazing!

I also have an older NUC, a 54250WYKH with an i5 CPU.

Raspberry Pi V2:   ARMv7    4 Cores      900MHz                  1GB RAM
NUC                Celeron  4 Cores      1500MHz (2300 burst)    8GB RAM
NUC                i5       2 Cores (HT) 1300MHz (2600 burst)   16GB RAM

I/O is obviously superior for the NUCs (both using SSD) versus the RPi V2, which has a rotating disk connected over USB. But for my purposes I think I/O and (amount of) RAM make little difference. I think it is more about raw CPU power.

Node.js / JavaScript
When it comes to different Node.js applications, it seems the older i5 is about twice as fast as the newer Celeron (for one core and one thread). I would say this is slightly disappointing (for the Celeron). On the other hand, the Celeron is about 10x faster than the RPi V2 when it comes to Node.js code, and that is a very good reason to use a NUC rather than a Raspberry Pi.

Update 2018-02-11: after a few months
I came back to my RPi2 from my cheap NUC. The difference is… everything. I really like Raspberry PIs. I have built cases for them, bought cases for them, worked on them, made servers of them. But I really must say that a NUC makes more sense: it contains everything nicely and it is so much more powerful.

You can get a Celeron NUC with 2GB RAM and a 2.5″ disk for quite little money. And from there you can go to Core i7, 32GB RAM and two hard drives: M.2 + 2.5″. And check out the Hades Canyon NUC.

I feel sorry there is basically nothing on the market like a NUC with ARM, AMD, PowerPC or MIPS. The only competition is the 4 year old Mac Mini, which is completely an Intel machine. If you find something cool, NUC-like, not Intel, feel free to post below.

Update 2018-02-28
I ran into a new problem on my RPi. It could be anything. My guess, that I will never be able to prove, is that it is a glitch made possible by using an SD-card as root device (and possibly questionable drivers/hardware for SD on the RPi).

Update 2018-04-09
Premier Farnell has introduced a Desktop Pi. Especially promising is that together with a recent RPi you can get rid of the SD-card entirely, and only use SSD/HDD or even mSATA (over USB, I presume).

Lodash Performance Sucks!

To continue my Functional Programming Sucks series of posts I will have a closer look at reduce().

I complained about Lodash (and Underscore) for different reasons. One complaint was performance, but I just read the code and presumed it was going to be slow without measuring. Then I complained about the performance of Functional Programming in general.

I thought it would be interesting to “improve” the Functional code with Lodash functions, and to my surprise (I admit I was both wrong and surprised) I found Lodash made it faster! After reading a little more about it I discovered this is a well known fact.

So, here are four different implementations of a function that checks if the elements (numbers) in an array are ordered (cnt is incremented if the array is sorted, such was the original problem).

// Standard reduce()
    this.test = function(p) {
        if ( false !== p.reduce(function(acc,val) {
            if ( false === acc || val < acc ) return false;
            return val;
        }, -1)) cnt++;
    };

// Lodash reduce(), and some other Lodash waste
    this.test = function(p) {
        if ( false !== LO.reduce(p,function(acc,val) {
            if ( false === acc || val < acc ) return false;
    //      if ( !LO.isNumber(acc) || val < acc ) return false;
            return val;
        }, -1)) cnt++;
    };

// My own 4 minute to implement simpleReduce(), see below
    this.test = function(p) {
        if ( false !== simpleReduce(p,function(acc,val) {
            if ( false === acc || val < acc ) return false;
            return val;
        }, -1)) cnt++;
    };

// A simple imperative version
    this.test = function(p) {
        var i;
        for ( i=1 ; i < p.length ; i++ ) {
            if ( p[i] < p[i-1] ) return;
        }
        cnt++;
    };

// my own implementation reduce()
    function simpleReduce(array, func, initval) {
         var i;
         var v = initval;
         for ( i=0 ; i<array.length ; i++ ) {
             v = func(v, array[i]);
         }
         return v;
    }

The interesting thing here is that the standard library reduce() is the slowest.
However, my simpleReduce is faster than Lodash reduce().

(seconds)                        ========= reduce() =========
                                 Std Lib    Lodash    Simple    Imperative
Raspberry Pi v1 (ARMv6 @ 700)         21        13       9.3           4.8
MacBook Air (Core i5 @ 1400)        0.46      0.23      0.19          0.16
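For anyone who wants to reproduce this kind of comparison on a small scale, here is a rough sketch of a timing harness. It is not the program that produced the numbers above: isSortedReduce, isSortedLoop and timeIt are names invented for this example, and Date.now() only gives millisecond resolution, so the round count has to be large.

```javascript
// Two of the variants from above, rewritten as plain functions returning a boolean.
function isSortedReduce(p) {
  return false !== p.reduce(function (acc, val) {
    if (false === acc || val < acc) return false;
    return val;
  }, -1);
}

function isSortedLoop(p) {
  var i;
  for (i = 1; i < p.length; i++) {
    if (p[i] < p[i - 1]) return false;
  }
  return true;
}

// Hypothetical helper: calls fn(p) `rounds` times, reports elapsed wall-clock ms
// and how many times fn returned true.
function timeIt(fn, p, rounds) {
  var i;
  var hits = 0;
  var start = Date.now();
  for (i = 0; i < rounds; i++) {
    if (fn(p)) hits++;
  }
  return { ms: Date.now() - start, hits: hits };
}

var sorted = [1, 2, 3, 4, 5, 6, 7, 8, 9];
console.log('reduce:', timeIt(isSortedReduce, sorted, 100000));
console.log('loop:  ', timeIt(isSortedLoop, sorted, 100000));
```

The absolute numbers will depend heavily on the Node.js version and the CPU, so only the ratio between the two lines is interesting.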

Conclusion
The conclusion is that from a performance perspective Functional Programming sucks. Lodash sucks too, but a little bit less so than the standard library (however, if you decorate all your code with isEmpty, isString, isNumber and that crap it will get worse).

That said, the generic nature of Lodash comes at a cost. The simplest reduce() imaginable (my simpleReduce()) outperforms Lodash. As I see it, this leaves Lodash in a pretty bad (or small) place:

  • Compared to the standard library it is an extra dependency with limited performance benefits
  • The generic nature of Lodash comes at both a performance cost and it allows for sloppy coding
  • A hand written reduce() outperforms Lodash and is a good exercise for anyone to write. I expect this is quite true also for other functions like take or takeRight.
  • For best performance, avoid Functional Programming (and in this case the imperative version is arguably more readable than the FP reduce() versions)

What's up with the Standard Library???
JavaScript is a scripted language (interpreted with a JIT compiler) that has a standard library written in C++. How can anything written in JavaScript execute faster than anything in the standard library that does the same thing?

First, kudos to the JIT designers! Amazing job! Perhaps the standard library people can learn from you?

I can imagine the standard library functions are doing some tests or validations that are somehow required by the standard, and that a faster and less strict version of reduce() would possibly break existing code (although this sounds far fetched).

I can (almost not) imagine that there is a cost of going from JS to native and back to JS: that function calls to native code come with overhead. Like going from user space to kernel space. It sounds strange.

I have read that there are optimization techniques applied to Lodash (like lazy evaluation), but I certainly didn’t do anything like that in my simpleReduce().

For Node.js optimizing the standard library truly would make sense. In the standard library native code of a single-threaded server application every cycle counts.

UPDATE: I tried replacing parts of the above code with native code: 1) the lambda function that is passed to reduce(), and 2) the imperative version. That is, I wrote C++ code for V8 and used it instead of JavaScript code. In both cases this was slower! Obviously there is some overhead in going between native and JavaScript JIT, and for rather small functions this overhead makes C++ “slower” than JavaScript. My idea was to write a C++ reduce() function, but I think the two functions I wrote are enough to show what is happening here. Conclusion: don’t write small native C++ functions for performance, and for maximum performance it can be worth rewriting the standard library in JavaScript (although this is insane to do)!

All FP-sucks related articles
Functional Programming Sucks
Underscore.js sucks! Lodash sucks!
Functional Programming Sucks! (it is slow)
Lodash Performance Sucks! (this one)

Functional Programming Sucks! (it is slow)

Update 2017-12-05: I added a new test in the end that came from real code.
It is both true that functional code is slower and that Node.js v8 is narrowing the gap.

Update 2017-07-17: Below I present numbers showing that functional code is slower than imperative code. It seems this has changed with newer versions of Node.js: functional code has not become faster, but imperative code has become slower. You can read a little more about it in the comments. I will look more into this. Keep in mind that the findings below may be more accurate for Node.js v4-6 than for v8.

Functional programming is very popular with contemporary JavaScript programmers. As I have written before, Functional programming sucks and functional libraries for JavaScript also suck.

In this post I will explain more why Functional Programming sucks. I will start with the conclusion. Read on as long as you want more details.

Functional Programming practices are bad for performance
It is very popular to feed lambda functions to map(), reduce(), filter() and others. If you do this carelessly the performance loss is significant.

It is also popular to work with immutable data. That is, you avoid functions that change (mutate) current state (side effects) and instead you produce a new state (a pure function). This puts a lot of pressure on the garbage collector and it can destroy performance.
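To make the difference concrete, here is a small sketch of the same operation written both ways. The function names appendPure and appendMutating are invented for this example.

```javascript
// Pure: returns a new array on every call and leaves the input untouched.
// Every call allocates an array that the garbage collector later has to reclaim.
function appendPure(list, value) {
  return list.concat([value]);
}

// Mutating: changes the array it was given and allocates no new array.
function appendMutating(list, value) {
  list.push(value);
  return list;
}

var a = [1, 2, 3];
var b = appendPure(a, 4);      // a is still [1,2,3]; b is a new array [1,2,3,4]
var c = appendMutating(a, 4);  // a is now [1,2,3,4]; c is the very same array as a

console.log(b !== a);  // true
console.log(c === a);  // true
```

Called once, the difference does not matter. Called inside a recursion that runs hundreds of thousands of times, the pure version produces that many short-lived arrays.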

The Benchmark Problem
Sometimes I entertain myself solving problems on Hackerrank.com. I particularly like the mathematical challenges in the Project Euler section (the Project Euler is also an independent organisation – HackerRank uses the challenges in Project Euler to create programming challenges).

This article refers to Project Euler 32. I will not go into details, but the solution is basically:

  1. Generate all permutations of the numbers 1,2,3,4,5,6,7,8,9 (there are 9! of them)
  2. For each permutation, check if it is “good” (very few are)
  3. Print the sum of the good instances

The first two steps give good benchmark problems. I have made different implementations of (1) and (2) and then compared the results.

Benchmark Results
I have three different permutation generators (all recursive functions):

  1. Pure function, immutable data (it may not be strictly pure)
  2. Function that mutates its own internal state, but not its input
  3. Function that mutates shared data (no allocation/garbage collection)

I also have three different test functions:

  1. Tests the original Project Euler problem
  2. Simplified test using reduce() and a lambda function
  3. Simplified test implemented with a standard loop

I benchmarked on two different systems using Node.js version 6. I have written elsewhere that Node.js performance on Raspberry Pi sucks.

(seconds)                        === Project Euler Test ===   ====== Simplified Test ======
Test Function:                                                Functional  ==== Imperative ====
Permutation Gen:                   Pure      Semi    Shared     Shared      Shared      Pure
Raspberry Pi v1 (ARMv6 @ 700)        69        23       7.4         21         3.7        62
MacBook Air (Core i5 @ 1400)       0.77      0.29      0.13       0.40        0.11      0.74

Comparing columns 1-2-3 shows the performance of different generators (for Project Euler test)
Comparing columns 4-5 shows the performance of two different test functions (using fast generator)
Comparing columns 5-6 shows the performance of two different generators (for fast simple test)

This shows that the benefit of using shared/mutable data (not running the garbage collector) instead of immutable data is more than 5x performance on the Intel CPU and even more on the ARM. Also, using reduce() with a lambda function costs more than 3x in overall performance on the Intel CPU, and even more on the ARM.

For both the test function and permutation generation, making any of them functional-slow significantly slows down the entire program.

The conclusion of this is that unless you are quite sure your code will never be performance critical you should avoid functional programming practices. It is a lot easier to write imperative code than to later scale out your architecture when your code does not perform.

However, the pure immutable implementation of the permutation generator is arguably much simpler than the iterative (faster) counterpart. When you look at the code you may decide that the performance penalty is acceptable to you. When it comes to the reduce() with a lambda function, I think the imperative version is easier to read (and much faster).

Please notice that if your code consists of nice testable, replaceable parts without side effects you can optimize later on. The functional principles are more valuable at a higher level. If you define your functions in a way that they behave like nice FP functions it does not matter if they are implemented using imperative principles (for performance).
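A tiny illustration of that last point, with made-up function names: from the outside both functions below are pure (same input gives same output, and the input is not modified), but the first one is a plain loop inside.

```javascript
// Pure from the caller's point of view, imperative on the inside:
// a local accumulator and a standard loop, no temporary arrays.
function sumOfSquares(list) {
  var i;
  var sum = 0;
  for (i = 0; i < list.length; i++) {
    sum += list[i] * list[i];
  }
  return sum;
}

// The functional equivalent, for comparison. map() allocates a temporary array.
function sumOfSquaresFP(list) {
  return list.map(function (x) { return x * x; })
             .reduce(function (acc, x) { return acc + x; }, 0);
}

console.log(sumOfSquares([1, 2, 3]));    // 14
console.log(sumOfSquaresFP([1, 2, 3]));  // 14
```

A caller cannot tell the two apart, so the implementation can be swapped later if it turns out to be a bottleneck.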

Generating Permutations
I used the following simple method for generating permutations. I start with two arrays and I send them to my permute-function:

  head = [];
  tail = [1,2,3,4];

  permute(head,tail);

My permute-function checks if tail is empty, and if so: test/evaluate head.
Otherwise it generates 4 (one for each element in tail) new sets of head and tail:

  permute( [1] , [2,3,4] )
  permute( [2] , [1,3,4] )
  permute( [3] , [1,2,4] )
  permute( [4] , [1,2,3] )
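One such expansion step can be sketched as a function of its own. The helper expand() is hypothetical and only exists for this illustration; the full program further down does the same thing directly inside the recursion.

```javascript
// One step of the pure permute idea: for each element in tail, build a new
// head and a new tail. Neither input array is mutated.
function expand(head, tail) {
  return tail.map(function (t, i) {
    return {
      head: [t].concat(head),
      tail: tail.slice(0, i).concat(tail.slice(i + 1))
    };
  });
}

expand([], [1, 2, 3, 4]).forEach(function (next) {
  console.log(JSON.stringify(next.head), JSON.stringify(next.tail));
});
// [1] [2,3,4]
// [2] [1,3,4]
// [3] [1,2,4]
// [4] [1,2,3]
```

Each step allocates fresh arrays for every new head and tail, which is exactly what makes the pure version simple and slow.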

The difference in implementation is:

  • Pure version generates all the above 8 arrays as new arrays using standard array functions
  • Semi pure version generates its own 2 arrays (head and tail) and then uses a standard loop to change the values of the arrays between the (recursive) calls to permute.
  • Shared version simply creates a single head-array and 9 tail-arrays (one for each recursion step) up front. It then reuses these arrays throughout the 9! iterations. (It is not global variables, they are hidden and private to the permutation generator)

The simplified test
The simplified test checks if the array is sorted: [1,2,3,4]. Of all permutations, there is always exactly one that is sorted. It is a simple test to implement (especially with a loop).

// These functions are part of a "test-class" starting like:
function testorder1() {
    var cnt = 0;

// Functional test
    this.test = function(p) {
        if ( false !== p.reduce(function(acc,val) {
            if ( false === acc || val < acc ) return false;
            return val;
        }, -1)) cnt++;
    };

// Iterative test (much faster)
    this.test = function(p) {
        var i;
        for ( i=1 ; i<p.length ; i++ ) {
            if ( p[i] < p[i-1] ) return;
        }
        cnt++;
    };

I tried to optimise the functional reduce() version by breaking out a named function. That did not help. I also tried to let the function always return the same type (now it returns false OR a number) but that also made no difference at all.
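For illustration, this is one way a type-stable variant could look. Using Infinity as the "not sorted" marker is an assumption made for this sketch and not necessarily what was benchmarked; isSortedNumberOnly is likewise an invented name.

```javascript
// The callback always returns a number. Infinity marks "not sorted":
// once acc is Infinity every finite val is smaller than it, so the marker sticks.
function isSortedNumberOnly(p) {
  return Infinity !== p.reduce(function (acc, val) {
    return val < acc ? Infinity : val;
  }, -1);
}

console.log(isSortedNumberOnly([1, 2, 3, 4]));  // true
console.log(isSortedNumberOnly([1, 3, 2, 4]));  // false
```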

All the code
For those who want to run this themselves or compare the permutation functions here is the entire program.

As mentioned above, the slowest (immutable data) permutation function is a lot smaller and easier to understand than the fastest (shared data) implementation.


'use strict';

// UTILITIES

function arrayToNum(p, s, e) {
    var r = 0;
    var m = 1;
    var i;
    for ( i=e-1 ; s<=i ; i-- ) {
        r += m * p[i];
        m *= 10;
    }
    return r;
}

function arrayWithZeros(n) {
    var i;
    var a = new Array(n);
    for ( i=0 ; i<a.length ; i++ ) a[i] = 0;
    return a;
}


// PERMUTATION ENGINES

function permutations0(n, callback) {
}

// IMMUTABLE (SLOWEST)

function permutations1(n, callback) {
    var i;
    var numbers = [];
    for ( i=1 ; i<=n ; i++ ) numbers.push(i);
    permute1([],numbers,callback);
}

function permute1(head, tail, callback) {
    if ( 0 === tail.length ) {
        callback(head);
        return;
    }

    tail.forEach(function(t, i, a) {
        permute1( [t].concat(head),
                  a.slice(0,i).concat(a.slice(i+1)),
                  callback);

    });
}

// MUTATES ITS OWN DATA, BUT NOT ITS ARGUMENTS

function permutations2(n, callback) {
    var i;
    var numbers = [];
    for ( i=1 ; i<=n ; i++ ) numbers.push(i);
    permute2([],numbers,callback);
}

function permute2(head, tail, callback) {
    if ( 0 === tail.length ) {
        callback(head);
        return;
    }
    var h2 = [tail[0]].concat(head);
    var t2 = tail.slice(1);
    var i  = 0;
    var tmp;
    
    while (true) {
        permute2(h2, t2, callback);
        if ( i === t2.length ) return;
        tmp   = h2[0];
        h2[0] = t2[i];
        t2[i] = tmp;
        i++;
    }
}

// MUTATES ALL DATA (INTERNALLY) (FASTEST)

function permutations3(n, callback) {
    var i;
    var head  = arrayWithZeros(n);
    var tails = new Array(n+1);

    for ( i=1 ; i<=n ; i++ ) {
        tails[i] = arrayWithZeros(i);
    }

    for ( i=1 ; i<=n ; i++ ) {
        tails[n][i-1] = i;
    }

    function permute3(x) {
        var j;
        var tail_this;
        var tail_next;
        var tmp;
        if ( 0 === x ) {
            callback(head);
            return;
        }
        tail_this = tails[x];
        tail_next = tails[x-1];

        for ( j=1 ; j<x ; j++ ) {
            tail_next[j-1] = tail_this[j];
        }

        j=0;
        while ( true ) {
            head[x-1] = tail_this[j];
            permute3(x-1);
             
            j++;
            if ( j === x ) return;

            tmp            = head[x-1];
            head[x-1]      = tail_next[j-1];
            tail_next[j-1] = tmp;
        }
    }

    permute3(n);
}

// TEST FUNCTIONS

function testprint() {
    this.test = function(p) {
        console.log(JSON.stringify(p));
    };

    this.done = function() {
        return 'Done';
    };
}

// CHECKS IF PERMUTATION IS ORDERED - FUNCTIONAL (SLOWEST)

function testorder1() {
    var cnt = 0;

    this.test = function(p) {
        if ( false !== p.reduce(function(acc,val) {
            if ( false === acc || val < acc ) return false;
            return val;
        }, -1)) cnt++;
    };

    this.done = function() {
        return cnt;
    };
}

// CHECKS IF PERMUTATION IS ORDERED - IMPERATIVE (FASTEST)

function testorder2() {
    var cnt = 0;

    this.test = function(p) {
        var i;
        for ( i=1 ; i<p.length ; i++ ) {
            if ( p[i] < p[i-1] ) return;
        }
        cnt++;
    };

    this.done = function() {
        return cnt;
    };
}

// TEST FUNCTION FOR PROJECT EULER 32

function testeuler() {
    var sums = {};

    this.test = function(p) {
        var w1, w2, w;
        var m1, m2, mx;
        w =  Math.floor(p.length/2);
        w1 = 1;
        w2 = p.length - w - w1;
    
        while ( w1 <= w2 ) {
            m1 = arrayToNum(p,     0, w1      );
            m2 = arrayToNum(p,    w1, w1+w2   );
            mx = arrayToNum(p, w1+w2, p.length);
        
            if ( m1 < m2 && m1 * m2 === mx ) {
                sums['' + mx] = true;
            }
        
            w1++;
            w2--;
        }
    };

    this.done = function() {
        var i;
        var r = 0;
        for ( i in sums ) {
            r += +i;
        }
        return r;
    };
}

// MAIN PROGRAM BELOW

function processData(input, parg, targ) {
    var r;

    var test = null;
    var perm = null;

    switch ( parg ) {
    case '0':
        perm = permutations0;
        break;
    case '1':
        perm = permutations1;
        break;
    case '2':
        perm = permutations2;
        break;
    case '3':
        perm = permutations3;
        break;
    }

    switch ( targ ) {
    case 'E':
        test = new testeuler;
        break;
    case 'O1':
        test = new testorder1;
        break;
    case 'O2':
        test = new testorder2;
        break;
    case 'P':
        test = new testprint();
        break;
    }


    r = perm(+input, test.test);
    console.log(test.done());
} 

function main() {
    var input = '';
    var parg = '1';
    var targ = 'E';
    var i;

    for ( i=2 ; i<process.argv.length ; i++ ) {
        switch ( process.argv[i] ) {
        case '0':
        case '1':
        case '2':
        case '3':
            parg = process.argv[i];
            break;
        case 'E':
        case 'O1':
        case 'O2':
        case 'P':
            targ = process.argv[i];
            break;
        }
    }
    

    process.stdin.resume();
    process.stdin.setEncoding('ascii');
    process.stdin.on('data', function (s) {
        input += s;
    });

    process.stdin.on('end', function () {
       processData(input, parg, targ);
    });
}

main();

This is how I run the code (use a value lower than 9 to get fewer than 9! permutations):

### Project Euler Test: 3 different permutation generators ###
$ echo 9 | time node projecteuler32.js 3 E
45228
8.95user ...
$ echo 9 | time node projecteuler32.js 2 E
45228
25.03user ...
$ echo 9 | time node projecteuler32.js 1 E
45228
70.34user ...

### Simple check-order test, two different versions. Fastest permutations.
$ echo 9 | time node projecteuler32.js 3 O1
1
23.71user ...
$ echo 9 | time node projecteuler32.js 3 O2
1
4.72user ...

(the timings here may not exactly match the above figures)

Update 2017-12-05
Admittedly, I sometimes find map() and filter() handy, and I try to use them when they make the code clearer. I came to a situation where I wanted to split a list into two lists (one with valid objects and one with invalid). This is a simple if/else with a push() in each branch, or it would be two calls to filter(). Then it turned out that I wanted to split the valid objects into two further lists: good and ugly. The slightly simplified code is:

function goodBadUgly_1(list) {
  var i, c;
  var ret = {
    good : [],
    bad  : [],
    ugly : []
  };
  for ( i=0 ; i<list.length ; i++ ) {
    c = list[i];
    if ( !validateItem(c) )
      ret.bad.push(c);
    else if ( uglyItem(c) )
      ret.ugly.push(c);
    else
      ret.good.push(c);
  }
  return ret;
}

function goodBadUgly_2(list) {
  return {
    good : list.filter(function(c) {
                         return validateItem(c) && !uglyItem(c);
                      }),
    bad  : list.filter(function(c) {
                         return !validateItem(c);
                      }),
    ugly : list.filter(function(c) {
                         return  validateItem(c) && uglyItem(c);
                      })
  };
}

On my not too powerful x64 CPU, with a list of about 1000 items, the non-FP version took 6ms and the FP version took 16ms (second run, to allow the JIT to do its job). This was with Node 8.9.1. With Node 6.11.3 the FP version was slower, but the non-FP version ran at almost the same speed (quite consistent with my comment at the beginning, from 2017-07-17).
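A measurement like this (time a later run, so the JIT has had a chance to warm up) could be taken roughly as in the sketch below. The helper name timeLastRun and the stand-in workload sumTo are assumptions made for illustration, not the harness actually used for the figures above.

```javascript
// Hypothetical helper: calls fn(arg) `runs` times and returns the
// wall-clock time of the LAST run in milliseconds. Earlier runs only
// serve as warm-up for the JIT.
function timeLastRun(fn, arg, runs) {
  var i, t0, dt;
  var ms = 0;
  for ( i=0 ; i<runs ; i++ ) {
    t0 = process.hrtime();
    fn(arg);
    dt = process.hrtime(t0);            // [seconds, nanoseconds]
    ms = dt[0] * 1e3 + dt[1] / 1e6;
  }
  return ms;
}

// Stand-in workload (not one of the functions from this post).
function sumTo(n) {
  var i;
  var s = 0;
  for ( i=0 ; i<n ; i++ ) {
    s += i;
  }
  return s;
}

console.log(timeLastRun(sumTo, 1000000, 2) + ' ms');
```

To compare the two versions above, one would pass goodBadUgly_1 and goodBadUgly_2 with the same list instead of sumTo. A single timed run is noisy, so repeating the measurement a few times is advisable.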

You may think that of course the FP code is slower: as written above, it calls validateItem three times for every item (once per filter) and uglyItem twice for every valid item. Yes, that is true, and that is also my point! When you do FP you avoid (storing intermediate results in) variables. This results in extra work being done a lot of the time. How would YOU implement this in FP style?
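One possible answer to that question, as an untimed sketch: classify each item once and collect the results with reduce(). The names classify and goodBadUgly_3 are made up here, and validateItem and uglyItem are replaced by trivial stand-ins because the real predicates are not shown.

```javascript
// Hypothetical stand-ins for validateItem() and uglyItem();
// the real predicates are not part of this post.
function validateItem(c) { return c.valid; }
function uglyItem(c)     { return c.ugly;  }

// Decide the bucket for one item. validateItem runs once per item,
// uglyItem at most once (only for valid items).
function classify(c) {
  if ( !validateItem(c) ) return 'bad';
  return uglyItem(c) ? 'ugly' : 'good';
}

function goodBadUgly_3(list) {
  return list.reduce(function(ret, c) {
    ret[classify(c)].push(c);   // mutates the accumulator
    return ret;
  }, { good : [], bad : [], ugly : [] });
}
```

This avoids the repeated predicate calls, but it does so by pushing into the accumulator, which is arguably the imperative loop in disguise. Whether it is faster than goodBadUgly_2 would have to be measured.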

This is 10 ms: does it matter? Well, first it is only 1000 objects.

If you do this in a Web GUI when a user clicks a button, the user will wait 10ms longer for everything to be updated. 10ms is not a lot. But if this multiplies (because you have a longer list) or adds up (because you are doing other things in a slower-than-necessary way) the UX will suffer.

If you do this server side, 10ms is a lot. In Node.js you have just 1 thread. So this overhead, per request, is 1% of the CPU time available each second. If you get 10 requests per second, 10% of the CPU is wasted only because you prefer FP style.
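The arithmetic behind those percentages can be written out as below. It assumes that the overhead is pure CPU time and that all requests are handled on the single Node.js thread; the function name wastedCpuFraction is made up for this illustration.

```javascript
// Fraction of one thread's CPU time per second that goes to overhead,
// given the overhead per request (ms) and the request rate (1/s).
// Assumes the overhead is pure CPU work on a single thread.
function wastedCpuFraction(overheadMs, requestsPerSecond) {
  return overheadMs * requestsPerSecond / 1000;
}

console.log(wastedCpuFraction(10, 1));    // 0.01, i.e. 1%
console.log(wastedCpuFraction(10, 10));   // 0.1,  i.e. 10%
```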

This is one of those cases where FP has the same computational complexity but is slower by a constant factor. Sometimes it can be even worse.

All FP-sucks related articles
Functional Programming Sucks
Underscore.js sucks! Lodash sucks!
Functional Programming Sucks! (it is slow) (this one)
Lodash Performance Sucks!