Monthly Archives: September 2017

Upgrade Windows hard drive

So, you have a Windows 10 (this probably applies to Windows 7 and forward) computer and you want to replace your system drive because you have a bad feeling about the health of the current drive, or you just want a larger SSD.

If you start searching on Google (or Bing) for this, you will get plenty of crappy answers: lots of companies want to sell you tools, and Microsoft provides so much information that it is hard to make any sense of it. Below is a simple and (quite) safe procedure that worked for me.

What you need

  1. A new hard drive, larger than the previous one
  2. A USB key, at least 8GB
  3. A USB hard drive, with enough space for backing up all your files

All drives should contain nothing of value as they will all be reformatted in the process.

Steps

  1. Create a Recovery Drive to the USB key (2)
  2. Create a System Image to the USB hard drive (3) (see also the command-line sketch below)
  3. Turn your computer off
  4. Replace your current system drive with the new one (1)
  5. For safety, unplug drives that you don’t want restored
  6. Start computer from your Recovery Drive (2)
  7. Choose Advanced
  8. Plug in System Image drive (3)
  9. Restore System Image
  10. Restore to your new hard drive (1)
  11. Profit!
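If you prefer the command line, the System Image (step 2) can also be created with the built-in wbadmin tool from an elevated command prompt. A minimal sketch, assuming the system drive is C: and the USB hard drive (3) has the drive letter E:

wbadmin start backup -backupTarget:E: -include:C: -allCritical -quiet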

This is a safe way to upgrade your hard drive because:

  • You always have your original system intact and you can put your old hard drive back if anything goes wrong.
  • You use only standard Windows tools. From a Windows/Microsoft perspective, your original hard drive broke, you were smart enough to have a proper backup, and you simply restored it.
  • It is in any case a good idea to always have (2) and (3) ready in case of disaster.

Problems, Annoyances, Caveats
The process is quite annoying.

  1. Just finding out where in Windows to create the Recovery Drive and the System Image is a pain.
  2. Creating the Recovery Drive is very slow, even if you have a fast USB key.
  3. I did not click Advanced after starting from the Recovery Drive. That ended up formatting a non-system drive that was still plugged in (I realised my mistake too late).

References
Howtogeek

Playing with LISP and LISP vs C

Lisp is fun! Ever since I first learned about Lisp I have been fascinated, but I have found it hard to learn it and to play with it in a meaningful way. A few years ago I wrote about it here and here. As usual, the first steps of learning something new can be the hardest.

Occasionally I use Hackerrank to find programming challenges to solve for fun. I particularly like the Project Euler competition, and I find it very good for trying out new languages: you get a “meaningful” challenge, a simple environment prepared in your web browser, and automated test cases. So, this time I didn’t waste my time trying to find the right Lisp implementation for me, I just started hacking on Project Euler 38 and 39 on Hackerrank.

Problem 38 was quite simple, but 39 was more interesting. When I had solved it, I found my implementation was not at all fast enough, so I started experimenting locally (the Hackerrank environment is not optimal for tweaking, optimization and debugging).

Choosing a (Common) Lisp implementation
There are quite a few Common Lisp implementations out there. The one Hackerrank uses is SBCL. That is clearly the Common Lisp implementation I would recommend (based on my limited experience), if it is available for your platform.

I installed SBCL with apt-get on Ubuntu. I also downloaded binaries directly for my Mac OS X computer and my Raspberry Pi (v1) running Arch Linux. Installation is a bit non-standard, but you can actually run it without installing (just execute run-sbcl.sh in the downloaded folder).
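For reference, installing on Ubuntu and running a source file non-interactively looks roughly like this (the file name is just an example):

$ sudo apt-get install sbcl
$ sbcl --script euler39.lisp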

I also tried clisp and ecl, but neither could deal with the memory usage (stack size) of my program. For clisp I found no way to manipulate stack sizes at all. For ecl I made some progress, but I could not make it run my program.

SBCL is a Lisp compiler, and it produces fast and efficient code. I later compared it to C.

Project Euler 39
Project Euler 39 is basically about finding integer solutions to the Pythagorean theorem. For a given, large perimeter, how many right triangles are there? For example:

300000^2 + 400000^2 = 500000^2

This triangle has a perimeter of 300000+400000+500000=1200000. What other values of a and b are there such that

a + b = 700000
a^2 + b^2 = 500000^2

hold? The Hackerrank challenge requires you to work with perimeters up to 5000000. If you implement a solution, a few things to note immediately:

  • The squares won’t fit in a 32-bit integer (for example, 500000^2 = 2.5*10^11). They will fit with no loss of precision in the 53 bits of a 64-bit double, and they will also fit in a 64-bit integer. (This does not matter for Common Lisp.)
  • If you want to do recursion (and of course you do when you code Lisp) it will be millions of recursion steps, which will be a challenge for the stack size. (This also turned out not to matter for SBCL.)

The Lisp implementation
It turned out that the SBCL compiler optimized the recursion in such a way that the memory usage was quite low. SBCL successfully runs my program on RPi/Arch, Intel/Ubuntu and Intel/OSX with quite reasonable memory usage.

Since this is about learning Lisp I wanted a 100% functional programming implementation. Only pure functions. A lot of my code is about generating, modifying and testing triangles. A triangle (a b c) can obviously be represented as a Lisp list (a b c), and this was my first implementation. Then if you want to read a, b or c from a list abc, or create the list from a, b and c, you can do:

  a:   (car abc)
  b:   (car (cdr abc))
  c:   (car (cdr (cdr abc)))

  abc: (list a b c)

I found this cumbersome. It became a lot of list-to-variables and variables-to-list overhead (I didn’t care so much about performance, more about my code’s readability). I learnt that Lisp functions can return multiple values using values, that you can bind them with multiple-value-bind, and that you can use them as arguments to a function using multiple-value-call. This felt functional and pure enough, and it made my code 25% faster than the car/cdr pattern above:

; a (stupid) function returning a triangle as three values
(defun get-345-triangle ()
  (values 3 4 5))

; a function calculating the perimeter of triangle (from a function)
(defun triangle-perimeter-1 (tri-func)
  (multiple-value-bind (a b c) (funcall tri-func)
    (+ a b c)))

; and in this case you don't need to bind, you can use + directly
(defun triangle-perimeter-2 (tri-func)
  (multiple-value-call #'+ (funcall tri-func)))

; now this works
(triangle-perimeter-1 #'get-345-triangle)
(triangle-perimeter-2 #'get-345-triangle)

Since I am a very inexperienced Lisp programmer I appreciate suggestions for improvement.

Performance of Lisp
My final Hackerrank submission of Lisp code executes in about 4.5 seconds on my Intel i5/Ubuntu. It takes about the same time on the Hackerrank web page, which is fast enough to pass all tests. On the Raspberry Pi v1 (ARMv6 @700 MHz) it takes more than 700 seconds. My intuition told me that 4.5 seconds was very good. This made me ask two questions. How would Lisp compare to C? And why is the ARM more than 100 times slower, and how would that compare in C?

The C implementation
My ambition was to rewrite the Lisp program in C line by line. So my C program has exactly the same functions, which take almost exactly the same arguments. All calculations are identical and performed in exactly the same order. The C program relies entirely on recursion instead of loops (just like the Lisp program). However…

Functions in C cannot return multiple values. Where Lisp had values, I decided to pass a pointer to a struct in C:

(defun get-a-triangle ()
  (values x y z))

void get_a_triangle(struct triangle *t) {
  t->a = x;
  t->b = y;
  t->c = z;
}

If the C triangle struct is a local variable on the caller’s stack, the difference is quite small (from a practical point of view; from a strict functional programming perspective it’s a different story).

Numbers in Lisp have arbitrary precision, so overflow is a non-issue and the distinction between integers and floats makes no practical difference. So, when porting to C, I had to pick numeric types. For most purposes, int32_t was good enough. But for calculating the Pythagorean theorem, higher precision was needed (as I wrote above, the 53 bits of a double, or the 64 bits of an int64_t, are enough). So I ended up with 5 versions of the C program (to compare performance):

  1. All 64-bit integers
  2. 32-bit integers, 64-bit for “triangles”
  3. 32-bit integers, double for “triangles”
  4. 32-bit integers, 64-bit only for pythagoras calc
  5. 32-bit integers, double only for pythagoras calc

(In cases 2 and 3 the struct triangle has int64_t/double properties, and all manipulations and calculations on triangles use these datatypes. In cases 4 and 5 everything is int32_t, except the internals of a single function, which casts to higher precision before doing its calculations.)
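To illustrate the last two variants, here is a minimal sketch (the function name is mine, not from the actual program) of a Pythagoras check that keeps int32_t arguments and only widens inside the function:

#include <stdint.h>

/* Variant 4 style: arguments stay int32_t, the squares are computed
   as int64_t to avoid overflow. Variant 5 would cast to double instead. */
static int is_pythagorean(int32_t a, int32_t b, int32_t c) {
  int64_t aa = (int64_t)a * a;
  int64_t bb = (int64_t)b * b;
  int64_t cc = (int64_t)c * c;
  return aa + bb == cc;
}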

The C program requires a significant stack size. The stack size can be obtained and changed like this (numbers in kB; all limits can be listed with ulimit -a):

$ ulimit -s
8192

$ ulimit -s 100000

For my program, a stack size much larger than 8192 kB is needed (see below). It seems impossible to get a stack larger than 64MB on Mac OS X, so my C program could never run there.

Benchmark findings
All C-programs are compiled with gcc -O2.

 CPU            MHZ      SBCL        64     32/64  32/double   32(64)  32(double)
==================================================================================
Time (s)
 i5-4250U 1300-2600       4.5      1.52      1.52      1.60      1.54      1.58
 ARMv6          700      ~715        85        83        45        42        39
 ARMv7          900       357        23        21        13        12        10

Max Res (MB)
 i5-4250U                  41       103       103       103       103       103
 ARMv6                     50       220       210        79       110        76
 ARMv7                     57       180       160        87        97        62

This is not too easy to interpret! The ironic thing is that the fastest variant on the x64 CPU (64-bit integers everywhere) is the slowest on the ARMv6. However, the fastest option on the ARMv6 (32-bit everywhere, and double only when absolutely needed) is almost the worst on the i5 CPU.

When it comes to the 64-bit i5, it basically does not matter what datatypes you use.

When it comes to the ARMv6, the most important thing is not to store the triangles as int64_t. The strange thing here is the stack sizes. Why does memory usage double (compared to x64) when triangles are stored as int64_t? And the doubles, why do they reduce stack size so much (where are all these doubles actually stored)?

The time command gives max resident memory usage. If I set ulimit -s 128, the first two programs fail (with Segmentation fault 11) and the last three succeed, on the ARMv6.

I have found before that the performance of the ARMv6 suffers because of its slow memory and small cache. It is quite possible that the poor performance of the ARMv6 compared to the i5 is related to its slow memory and the recursion-heavy (and therefore stack-heavy) algorithm.

Finally, SBCL on x64 has very good performance even compared to C (however, an iterative C implementation, fitting completely in cache, would probably be faster). Note that I am a novice Lisp programmer and this is a math-heavy program where the generic number type of Lisp comes at a cost. On the ARMv6, Lisp performance suffers much more.

Windows stack size limit
For Windows, the stack size limit is set in the binary, not in the shell. With Cygwin/GCC, use the flag -Wl,--stack,1000000 for one million bytes. Note that these are options passed on to the linker.
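For example, compiling with a roughly 100MB stack could look like this (the file name is just an example):

$ gcc -O2 -Wl,--stack,100000000 euler39.c -o euler39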

Future investigations
I am also curious about how much faster a minimal-memory-footprint, loop-based C program would perform.

The source code
Since this code solves a problem in Hackerrank I hesitate to publish it. If you want it for any other reason than just running it on Hackerrank let me know.

All JavaScript objects are not equally fast

One thing I like about JavaScript and Node.js is having JSON in the entire stack. I store JSON on disk, process JSON data server side, send JSON over HTTP, process JSON data client side, and the web GUI can easily present JSON (I work with Angular).

As a result of this, not all objects are created the same. Let’s say I keep track of Entries: I have an Entry constructor that initialises new objects with all fields (no more, no less). At the same time I receive Entry objects as JSON data over the network.

A strategy is needed:

  1. Have a mix of raw JSON Entries and objects that are instanceof Entry
  2. Create real Entry objects from all JSON data
  3. Only work with raw JSON Entries

Note that if you don’t go with (2) you can’t use the prototype, expect objects to have functions, or use instanceof to identify objects.
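A minimal sketch of that limitation (the Entry fields and the describe method are just made up for illustration):

function Entry(id, name) {
  this.id   = id;
  this.name = name;
}

Entry.prototype.describe = function() {
  return this.id + ': ' + this.name;
};

var fromCtor = new Entry(1, 'first');
var fromJson = JSON.parse('{"id":1,"name":"first"}');

fromCtor instanceof Entry;    // true
fromCtor.describe();          // '1: first'
fromJson instanceof Entry;    // false
typeof fromJson.describe;     // 'undefined' - the raw JSON object has no methods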

Another, perhaps less obvious, aspect is that performance is not the same. When you create a JavaScript object using new, the runtime actually creates a hidden class with fast-to-access properties. Property access on such objects is faster than on

  • an empty object {} with properties set afterwards
  • an object created with JSON.parse()

I wrote a program to test this. The simplified explanation is that I created an array of objects that I then sorted and ran calculations on a few (6) times. For a particular computer and problem size I got these results:

TIME   PARAMETER   DESCRIPTION
3.3s       R       Produce random objects using "new"
4.4s       L       Load objects from json-file using JSON.parse()
3.0s       L2      json-file, JSON.parse(), send raw objects to constructor
3.2s       L3      load objects using require() from a js-file

I will be honest and say that the implementation of the compare function sent to sort() matters. Some compare functions suffered more than others from the different object origins. Some compare functions are more JIT-optimised and faster on the second run. However, the consistent finding is that raw JSON objects are about 50% slower than objects created with new and a constructor function.

What is not presented above is the cost of parsing and creating objects.

My conclusion from this is that unless you have very strict performance requirements you can use the raw JSON-objects you get over the network.

Below is the source code (for Node.js). Apart from the parameters R, L, L2 and L3 there is also an S(tore) parameter. It creates the json and js files used by the Load options. So typically run the program with the S option first, and then the other options. A typical run looks like this:

$ node ./obj-perf.js S
Random: 492ms
Store: 1122ms

$ node ./obj-perf.js R
Random: 486ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3350ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3361ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3346ms

$ node ./obj-perf.js L
Load: 376ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 4382ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 4408ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 4453ms

$ node ./obj-perf.js L2
Load: 654ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3018ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 2974ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 2890ms

$ node ./obj-perf.js L3
Load: 1957ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3436ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3264ms
DISTS=110463, 110621, 110511, 110523, 110591, 110515 : 3199ms

The columns with numbers (110511, …) are checksums calculated between the sorts. They should be equal across runs; other than that, they don’t matter.

const nodeFs = require('fs');

// Simple deterministic PRNG (Park-Miller) so every run uses identical data
function Random(seed) {
  this._seed = seed % 2147483647;
  if (this._seed <= 0) this._seed += 2147483646;
}

Random.prototype.next = function () {
  return this._seed = this._seed * 16807 % 2147483647;
};

function Timer() {
  this.time = Date.now();
}

Timer.prototype.split = function() {
  var now = Date.now();
  var ret = now - this.time;
  this.time = now;
  return ret;
};

function Point() {
  this.a = -1;
  this.b = -1;
  this.c = -1;
  this.d = -1;
  this.e = -1;
  this.f = -1;
  this.x =  0;
}

function pointInit(point, rand) {
  var p;
  for ( p in point ) {
    point[p] = rand.next() % 100000;
  }
}

function pointLoad(json) {
  var p;
  var point = new Point();
  for ( p in point ) {
    point[p] = json[p];
  }
  return point;
}

// Sort key starts at the field selected by a.x, falling through to the following fields
function pointCmp(a,b) {
  return pointCmpX[a.x](a,b,a.x);
}

function pointCmpA(a,b) {
  if ( a.a !== b.a ) return a.a - b.a;
  return pointCmpB(a,b);
}

function pointCmpB(a,b) {
  if ( a.b !== b.b ) return a.b - b.b;
  return pointCmpC(a,b);
}

function pointCmpC(a,b) {
  if ( a.c !== b.c ) return a.c - b.c;
  return pointCmpD(a,b);
}

function pointCmpD(a,b) {
  if ( a.d !== b.d ) return a.d - b.d;
  return pointCmpE(a,b);
}

function pointCmpE(a,b) {
  if ( a.e !== b.e ) return a.e - b.e;
  return pointCmpF(a,b);
}

function pointCmpF(a,b) {
  if ( a.f !== b.f ) return a.f - b.f;
  return pointCmpA(a,b);
}

var pointCmpX = [pointCmpA,pointCmpB,pointCmpC,pointCmpD,pointCmpE,pointCmpF];

function pointDist(a,b) {
  return Math.min(
    (a.a-b.a)*(a.a-b.a),
    (a.b-b.b)*(a.b-b.b),
    (a.c-b.c)*(a.c-b.c),
    (a.d-b.d)*(a.d-b.d),
    (a.e-b.e)*(a.e-b.e),
    (a.f-b.f)*(a.f-b.f)
  );
}

function getRandom(N) {
  var i;
  var points = new Array(N);
  var rand   = new Random(14);

  for ( i=0 ; i<N ; i++ ) {
    points[i] = new Point();
    pointInit(points[i], rand);
  }
  return points;
}

function test(points) {
  var i,j;
  var dist;
  var dists = [];

  for ( i=0 ; i<6 ; i++ ) {
    dist = 0;
    for ( j=0 ; j<points.length ; j++ ) {
      points[j].x = i;
    }
    points.sort(pointCmp);
    for ( j=1 ; j<points.length ; j++ ) {
      dist += pointDist(points[j-1],points[j]);
    }
    dists.push(dist);
  }
  return 'DISTS=' + dists.join(', ');
}

function main_store(N) {
  var timer = new Timer();
  var points = getRandom(N);
  console.log('Random: ' + timer.split() + 'ms');
  nodeFs.writeFileSync('./points.json', JSON.stringify(points));
  nodeFs.writeFileSync('./points.js', 'exports.points=' +
                                      JSON.stringify(points) + ';');
  console.log('Store: ' + timer.split() + 'ms');
}

function main_test(points, timer) {
  var i, r;
  for ( i=0 ; i<3 ; i++ ) {
    r = test(points);
    console.log(r + ' : ' + timer.split() + 'ms');
  }
}

function main_random(N) {
  var timer = new Timer();
  var points = getRandom(N);
  console.log('Random: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main_load() {
  var timer = new Timer();
  var points = JSON.parse(nodeFs.readFileSync('./points.json'));
  console.log('Load: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main_load2() {
  var timer = new Timer();
  var points = JSON.parse(nodeFs.readFileSync('./points.json')).map(pointLoad);
  console.log('Load: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main_load3() {
  var timer = new Timer();
  var points = require('./points.js').points;
  console.log('Load: ' + timer.split() + 'ms');
  main_test(points, timer);
}

function main() {
  var N = 300000;
  switch ( process.argv[2] ) {
  case 'R':
    main_random(N);
    break;
  case 'S':
    main_store(N);
    break;
  case 'L':
    main_load();
    break;
  case 'L2':
    main_load2();
    break;
  case 'L3':
    main_load3();
    break;
  default:
    console.log('Unknown mode=' + process.argv[2]);
    break;
  }
}

main();

Review: NUC vs Raspberry Pi

I like small, cheap, quiet computers… perhaps a little too much. For a long time I have used a Raspberry Pi V2 (quad core @900MHz and 1GB RAM) as a workstation. To be honest, I have not used it for web browsing; that is just too painful. But I have used it for programming and running multiple Node.js services, and a few other things.

Although there are so many single-board computers out there, it is hard to find really good alternatives to the Raspberry Pi. When I look into it, I find that Intel NUCs are very good options. So, I just decided to replace my RPi2 workstation with the cheapest NUC that money can currently buy: the NUC6CAY with a Celeron J3455 CPU. It sounds cheap, particularly for something server-like. The interesting thing about the J3455 CPU is that it is actually a quad core, with no hyper-threading. To me that sounds amazing!

I also have an older NUC, a 54250WYKH with an i5 CPU.

Model              CPU      Cores        Clock                    RAM
Raspberry Pi V2    ARMv7    4 Cores      900MHz                  1GB RAM
NUC                Celeron  4 Cores      1500MHz (2300 burst)    8GB RAM
NUC                i5       2 Cores (HT) 1300MHz (2600 burst)   16GB RAM

I/O is obviously superior for the NUCs (both using SSDs) compared to the RPi v2, which has a rotating disk connected over USB. But for my purposes I think I/O and the amount of RAM make little difference. I think it is more about raw CPU power.

Node.js / JavaScript
When it comes to different Node.js applications, it seems the older i5 is about twice as fast as the newer Celeron (for one core and one thread). I would say this is slightly disappointing (for the Celeron). On the other hand, the Celeron is about 10x faster than the RPi V2 when it comes to Node.js code, and that is a very good reason to use a NUC rather than a Raspberry Pi.