Tag Archives: Performance

Sort strings without case sensitivity

In JavaScript, I wanted to sort arrays of strings without caring about case. It was more complicated than I first thought.

The background is that I present lists like this in a GUI:

  • AMD
  • Apple
  • Gigabyte
  • IBM
  • Intel
  • Microsoft
  • MSI
  • Nokia
  • Samsung
  • Sony

I want AMD and MSI (spelled in all caps) to be sorted without respect to case. Standard sort() would put MSI before Microsoft.
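To illustrate the problem (the array is just an example):

// Default sort() compares UTF-16 code units, so every upper-case letter
// sorts before every lower-case letter: 'S' (83) < 'i' (105).
['Microsoft', 'MSI', 'Apple', 'AMD'].sort();
// -> [ 'AMD', 'Apple', 'MSI', 'Microsoft' ]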

Obviously I am not the first one wanting to do this and I found an article on stackoverflow. It suggests the following solution:

Use toLowerCase()
You can make your own string compare function that uses toLowerCase and send it as an argument to sort():

function cmpCaseless(a,b) {
    a = a.toLowerCase();
    b = b.toLowerCase();
    if ( a < b ) return -1;
    if ( a > b ) return  1;
    return 0;
}

myStringArray.sort(cmpCaseless);

This has a number of problems. The article above mentions that it is not stable. That is probably true in some cases, but I was of course worried about performance: creating two new String objects for every compare should keep the garbage collector quite busy, not to mention the waste of copying and lowercasing potentially quite long strings when usually the first character is enough.

When I started experimenting I found another, more critical flaw: in Swedish we have three extra letters in the alphabet – Å, Ä, Ö, in that order. The cmpCaseless above orders them Ä, Å, Ö, which may sound like a small problem, but it is simply unacceptable.

Use localeCompare
There is a more competent (or so I thought, read on) way to compare strings in JavaScript: the localeCompare function. This one simply treats A,Å,Ä and O,Ö as the same character, which is far more unacceptable than the toLowerCase problem.

However, it also has a “locales” option (a second optional argument). If I set it to ‘sv’ I get the sort order that I want, but performance is horrible. And I still have to use toLowerCase as well as localeCompare:

function localeCompare(a,b) {
    return a.toLowerCase().localeCompare(b.toLowerCase());
}

function localeCompare_sv(a,b) {
    return a.toLowerCase().localeCompare(b.toLowerCase(), 'sv');
}

localeCompare() also has an extra options argument with a “sensitivity” parameter, but it is no good for our purposes.
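For completeness: newer engines also expose Intl.Collator, which lets you set up the locale and sensitivity once and reuse the comparator. I have not benchmarked it against the functions in this post, so consider this just a sketch of the idea:

// Build the comparator once instead of parsing locale options on every call.
// With sensitivity 'accent', case differences are ignored while å/ä/ö stay
// distinct letters.
var svCollator = new Intl.Collator('sv', { sensitivity: 'accent' });
myStringArray.sort(svCollator.compare);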

Rolling my own
Of course, I ended up building my own function to do caseless string compare. The strategy is to compare one character at a time, not making any new String objects, and fall back to localeCompare only if both characters are outside the 7-bit ASCII range:

function custom(a,b) {
    var i, al, bl, l;
    var ac, bc;
    al = a.length;
    bl = b.length;
    l = al < bl ? al : bl;
        
    for ( i=0 ; i<l ; i++ ) {
        ac = a.codePointAt(i);  // or charCodeAt() for better compatibility
        bc = b.codePointAt(i);
        if ( 64 < ac && ac < 91 ) ac += 32;  // fold A-Z onto a-z
        if ( 64 < bc && bc < 91 ) bc += 32;
        if ( ac !== bc ) { 
            if ( 127 < ac && 127 < bc ) {
                ac = a.substr(i,1).toLowerCase();
                bc = b.substr(i,1).toLowerCase();
                if ( ac !== bc ) return ac.localeCompare(bc);
            } else {
                return ac-bc;
            }
        }
    }
    return al-bl;
}

One fascinating thing is that here I can use localeCompare() without 'sv'.
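A quick sanity check of the resulting order (the strings are just examples, and the expected output assumes a Swedish locale for the localeCompare fallback):

var names = ['MSI', 'Önos', 'Microsoft', 'AMD', 'Åhléns', 'Apple', 'Ärtsoppa'];
names.sort(custom);
// expected: AMD, Apple, Microsoft, MSI, Åhléns, Ärtsoppa, Önos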

Test for yourself
I built a simple webpage where you can test everything yourself.

Conclusion
Defining a string sort order is not trivial when you don't just have ASCII characters. If you look at the ASCII table you see that non-alphabetic characters are spread out:

  • SPACE, #, 1-9, and many more come before both A-Z and a-z
  • Underscore: _, and a few other characters come after A-Z but before a-z
  • Pipe: | and a few other characters come after A-Z and a-z

When it comes to characters beyond ASCII 127 it just gets more complicated: how do you sort European Latin letters, Greek letters, arrows and other symbols?

For this reason, I think it makes sense to define your own sorting function and clearly define the behaviour for the characters you know you care about – if it really matters in your application.

My function above is significantly faster than the other options.

Disclaimer
These results may well be inconsistent across different web browsers.

Raspberry PI performance and freezes

On a daily basis I use a Raspberry Pi v2 (4x900MHz) with Raspbian as a workstation and web server. It is connected to a big display, I edit multiple files, and it runs multiple Node.js instances. These Node.js processes serve HTTP and access (both read and write) local files.

I experienced regular freezes. Operations that should be instant – listing files in a directory, opening a file, saving a file and so on – could take 2-3 seconds.

I moved my working directory from my (high performance) SD-card to a regular spinning USB hard drive. That completely solved the problem. I experience zero freezes now, compared to plenty before.

My usual experience with Linux is that the block caching layer is highly effective: things get synced to disk when there is time to do so. I don't know if Linux handles SD cards fundamentally differently from other hard drives (syncing more often) or if the SD card (or the Raspberry Pi SD card hardware) is just slower.

So, for making real use of a Raspberry Pi I would clearly recommend a hard drive.

Node.js performance of Raspberry Pi 1 sucks

In several previous posts I have studied the performance of the Raspberry Pi (version 1) and Node.js to find out why the Raspberry Pi underperforms so badly when running Node.js.

The first two posts indicate that the Raspberry Pi underperforms about 10x compared to an x86/x64 machine, after compensation for clock frequency is made. The small cache size of the Raspberry Pi is often mentioned as a cause for its poor performance. In the third post I examine that, but it is not that horribly bad: about 3x worse performance for big memory needs compared to in-cache-situations. It appears the slow SDRAM of the RPi is more of a problem than the small cache itself.

The Benchmark Program
I wanted to relate the Node.js slowdown to some other scripted language. I decided Lua would be a nice reference. And I was lucky to find Mandelbrot implementations in several languages!

I modified the program(s) slightly, increasing the resolution from 80 to 160. I also made a version that did almost nothing (MAX_ITERATIONS=1) so I could measure and subtract the startup cost (which is significant for Node.js) from the actual benchmark values.
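The programs themselves come from the collection linked above; at the core of all of them is the classic escape-time loop. Roughly, in JavaScript (this is only a sketch of the inner loop, not the benchmarked code itself, and the coordinate window is my choice):

function mandelbrot(width, height, maxIterations) {
  var inside = 0;                          // pixels that never escaped
  for (var py = 0; py < height; py++) {
    for (var px = 0; px < width; px++) {
      // map the pixel to the complex plane, roughly [-2,0.5] x [-1.25,1.25]
      var cr = -2.0 + 2.5 * px / width;
      var ci = -1.25 + 2.5 * py / height;
      var zr = 0.0, zi = 0.0, i = 0;
      while (i < maxIterations && zr * zr + zi * zi <= 4.0) {
        var t = zr * zr - zi * zi + cr;    // z = z^2 + c
        zi = 2.0 * zr * zi + ci;
        zr = t;
        i++;
      }
      if (i === maxIterations) inside++;
    }
  }
  return inside;
}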

The Numbers
Below are the average of three runs (minus the average of three 1-iteration rounds), in ms. The timing values were very stable over several runs.

 (ms)                           C/Hard   C/Soft  Node.js     Lua
=================================================================
 QNAP TS-109 500MHz ARMv5                 17513    49376   39520
 TP-Link Archer C20i 560MHz MIPS          45087    65510   82450
 RPi 700MHz ARMv6 (Raspbian)       493             14660   12130
 RPi 700MHz ARMv6 (OpenWrt)        490    11040    15010   31720
 RPi2 900MHz ARMv7 (OpenWrt)       400     9130      770   29390
 Eee701 900MHz Celeron x86         295               500    7992
 3000MHz Athlon II X2 x64           56                59    1267

Notes on Hard/Soft floats:

  • Raspbian is armhf, only allowing hard floats (-mfloat-abi=hard)
  • OpenWrt is armel, allowing both hard floats (-mfloat-abi=softfp) and soft floats (-mfloat-abi=soft).
  • The QNAP has no FPU and generates runtime error with hard floats
  • The other targets produce linkage errors with soft floats

The Node.js versions are slightly different, and so are the Lua versions. This makes no significant difference.

Findings
Calculating the Mandelbrot with the FPU is basically “free” (<0.5s). Everything else is waste and overhead.

The cost of soft float is about 10s on the RPI. The difference between Node.js on Raspbian and OpenWrt is quite small – either both use the FPU, or none of them does.

Now, the interesting thing is to compare the RPi with the QNAP. For the C-program with the soft floats, the QNAP is about 1.5x slower than the RPi. This matches well with earlier benchmarks I have made (see 1st and 3rd link at top of post). If the RPi would have been using soft floats in Node.js, it would have completed in about 30 seconds (based on the QNAP 50 seconds). The only thing (I can come up with) that explains the (unusually) large difference between QNAP and RPi in this test, is that the RPi actually utilizes the FPU (both Raspbian and OpenWrt).

OpenWrt and FPU
The poor Lua performance in OpenWrt is probably due to two things:

  1. OpenWrt is compiled with -Os rather than -O2
  2. OpenWrt by default uses -mfloat-abi=soft rather than -mfloat-abi=softfp (which is essentially like hard).

It is important to notice that -mfloat-abi=softfp not only makes programs much faster, but also quite a bit smaller (10%), which would be valuable in OpenWrt.

Different Node.js versions and builds
I have been building Node.js many times for Raspberry Pi and OpenWrt. The above soft/softfp setting for building node does not affect performance much, but it does affect binary size. Node.js v0.10 is faster on Raspberry Pi than v0.12 (which needs some patching to build).

Lua
Apart from the un-optimized OpenWrt Lua build, Lua is consistently 20-25x slower than native for RPi/x86/x64. So it is not as if the small cache of the RPi, or some other limitation of the CPU, makes it worse at interpreted languages than x86/x64.

RPi ARMv6 VFPv2
While perhaps not the best FPU in the world, the VFPv2 floating point unit of the RPi ARMv6 delivers quite decent performance (slightly worse per clock cycle) compared to x86 and x64. It does not seem like the VFPv2 is to be blamed for the poor performance of Node.js on ARM.

Conclusion and Key finding
While Node.js (V8) for x86/x64 is near-native-speed, on the ARM it is rather near-Lua-speed: just another interpreted language, mostly. This does not seem to be caused by any limitation or flaw in the (RPi) ARM cpu, but rather the V8 implementation for x86/x64 being superior to that for ARM (ARMv6 at least).

Effects of cache on performance

It is not clear to me: why is Node.js so amazingly slow on a Raspberry Pi (article 1, article 2)?

Is it because of the small cache (16kb+128kb)? Is Node.js emitting poor code on ARM? Well, I decided to investigate the cache issue. The 128kb cache of the Raspberry Pi is supposed to be primarily used by the GPU; is it actually effective at all?

A suitable test algorithm
To understand what I test, and because of the fun of it, I wanted to implement a suitable test program. I can imagine a good test program for cache testing would:

  • be reasonably slow/fast, so measuring execution time is practical and meaningful
  • have working data sets in sizes 10kb-10Mb
  • the same problem should be solvable with different working set sizes, such that the theoretical execution time is the same and any measured difference is due to cache behaviour only
  • be reasonably simple to implement and understand, while not so trivial that the optimizer just gets rid of the problem entirely

Finally, I think it is fun if the program does something slightly meaningful.

I found that Bubblesort (and later Selectionsort) were good problems, if combined with a quasi twist. Original bubble sort:

Array to sort: G A F C B D H E   ( N=8 )
Sorted array:  A B C D E F G H
Theoretical cost: N²/2 = 64/2 = 32
Actual cost: 7+6+5+4+3+2+1     = 28 (compares and conditional swaps)

I invented the following cache-optimized Bubble-Twist-Sort:

Array to sort:                G A F C B D H E
Sort halves using Bubblesort: A C F G B D E H
Now, the twist:                                 ( G>B : swap )
                              A C F B G D E H   ( D>F : swap )
                              A C D B G F E H   ( C<E : done )
Sort halves using Bubblesort: A B C D E F G H
Theoretical cost = 16/2 + 16/2 (first two Bubblesorts)
                 + 4/2         (expected number of twist-swaps)
                 + 16/2 + 16/2 (second two Bubblesorts)
                 = 34
Actual cost: 4*(3+2+1) + 2 = 26

Anyway, for larger arrays the actual costs get very close. The idea here is that I can run a Bubblesort on 1000 elements (intensively using 1000 units of memory for ~500000 operations). But instead of doing that, I can replace it with 4 runs on 500 elements (4 × ~125000 operations + ~250 twist operations). So I am solving the same problem, using the same algorithm, but optimizing for smaller cache sizes.
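For reference, here is the idea as a JavaScript sketch (the actual benchmark is written in C, see the listings further down; the helper names are mine):

// Sort by bubblesorting each half, doing the mirrored "twist" compare-and-swap
// across the middle, and then bubblesorting each half again. Every Bubblesort
// run only touches half the data, which is the whole point: same algorithm,
// same result, smaller working set.
function bubbleTwistSort(arr) {
  var mid = arr.length >> 1;
  bubbleSortRange(arr, 0, mid);
  bubbleSortRange(arr, mid, arr.length);
  // the twist: pair element k from the end of the first half with element k
  // from the start of the second half; stop at the first pair not needing a swap
  for (var k = 0; k < mid; k++) {
    var a = mid - 1 - k, b = mid + k;
    if (arr[a] <= arr[b]) break;
    var tmp = arr[a]; arr[a] = arr[b]; arr[b] = tmp;
  }
  bubbleSortRange(arr, 0, mid);
  bubbleSortRange(arr, mid, arr.length);
  return arr;
}

function bubbleSortRange(arr, from, to) {   // plain Bubblesort of arr[from..to-1]
  for (var i = to; i > from + 1; i--) {
    for (var j = from + 1; j < i; j++) {
      if (arr[j - 1] > arr[j]) {
        var tmp = arr[j - 1]; arr[j - 1] = arr[j]; arr[j] = tmp;
      }
    }
  }
}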

Enough of Bubblesort… you are probably either lost in details or disgusted with this horribly stupid idea of optimizing and not optimizing Bubblesort at the same time.

I made a Selectionsort option. And for a given data size I allowed it either to sort bytes or 32-bit words (which is 16 times faster, for the same data size).

The test machines
I gathered 10 different test machines, with different cache sizes and instructions sets:

	QNAP	wdr3600	ac20i	Rpi	Rpi 2	wdr4900	G4	Celeron	Xeon	Athlon	i5
								~2007   ~2010   ~2013
============================================================================================
L1	32	32	32	16	?	32	64	32	32	128	32
L2				128	?	256	256	512	6M	1024	256
L3							1024				6M
Mhz	500	560	580	700	900	800	866	900	2800	3000	3100
CPU	ARMv5	Mips74K	Mips24K	ARMv6	ARMv7	PPC	PPC	x86	x64	x64	x64
OS	Debian	OpenWrt	OpenWrt	OpenWrt	OpenWrt	OpenWrt	Debian	Ubuntu	MacOSX	Ubuntu	Windows

Note that for the multi-core machines (Xeon, Athlon, i5) the L2/L3 caches may or may not be shared between cores, so the numbers above are a little ambiguous. The sizes should be for the data cache when it is separate from the instruction cache.

The benchmarks
I ran Bubblesort for sizes 1000000 bytes down to 1000000/512. For Selectionsort I just ran three rounds. For Bubblesort I also ran for 2000000 and 4000000 but those times are divided by 4 and 16 to be comparable. All times are in seconds.

Bubblesort

	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
============================================================================================
4000000	1248	1332	997	1120	396	833		507	120	104	93
2000000	1248	1332	994	1118	386	791	553	506	114	102	93
1000000	1274	1330	1009	1110	367	757	492	504	113	96	93
500000	1258	1194	959	1049	352	628	389	353	72	74	63
250000	1219	1116	931	911	351	445	309	276	53	61	48
125000	1174	1043	902	701	349	397	287	237	44	56	41
62500	941	853	791	573	349	373	278	218	38	52	37
31250	700	462	520	474	342	317	260	208	36	48	36
15625	697	456	507	368	340	315	258	204	35	49	35
7812	696	454	495	364	340	315	256	202	34	49	35
3906	696	455	496	364	340	315	257	203	34	47	35
1953	698	456	496	365	342	320	257	204	35	45	35

Selectionsort

	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
============================================================================================
1000000	1317	996	877	1056	446	468	296	255	30	45	19
31250	875	354	539	559	420	206	147	245	28	40	21
1953	874	362	520	457	422	209	149	250	30	41	23

Theoretically, all timings for a single machine should be equal. The differences can be explained much by cache sizes, but obviously there are more things happening here.

Findings
Mostly the data makes sense. The caches create plateaus, and the L1 size can almost be predicted from the data. I would have expected even bigger differences between best and worst cases; now the range is 180%-340%. The most surprising thing (?) is the Selectionsort results: they are sometimes a lot faster (G4, i5) and sometimes significantly slower! This is strange: I have no idea why.

I believe the i5 superior performance of Selectionsort 1000000 is due to cache and branch prediction.

I note that the QNAP and Archer C20i both have DDRII memory, while the RPi has SDRAM. This seems to make a difference when work sizes get bigger.

I have also made other benchmarks where the WDR4900 was faster than the G4 – not this time.

The Raspberry Pi
What did I learn about the Raspberry Pi? Well, memory is slow and branch prediction seems bad. It is typically 10-15 times slower than the modern (Xeon, Athlon, i5) CPUs. But for large selectionsort problems the difference is up to 40x. This starts getting close to the Node.js crap speed. It is not hard to imagine that Node.js benefits heavily from great branch prediction and large cache sizes – both things that the RPi lacks.

What about the 128k cache? Does it work? Well, compared to the L1-only machines, performance of the RPi degrades slightly more slowly, perhaps. Not impressed.

Bubblesort vs Selectionsort
It really puzzles me that Bubblesort ever beats Selectionsort:

void bubbelsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, jm1;
  uint32_t tmp;
  for ( i=len ; i>1 ; i-- ) {
    for ( j=1 ; j<i ; j++ ) {
      jm1 = j-1;
      if ( array[jm1] > array[j] ) {
        tmp = array[jm1];
        array[jm1] = array[j];
        array[j] = tmp;
      }
    }
  }
}

void selectionsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, best;
  uint32_t tmp;
  for ( i=1 ; i<len ; i++ ) {
    best = i-1;
    for ( j=i ; j<len ; j++ ) {
      if ( array[best] > array[j] ) {
        best = j;
      }
    }
    tmp = array[i-1];
    array[i-1] = array[best];
    array[best] = tmp;
  } 
}

Essentially, the difference is that the swap takes place outside the inner loop (once per outer iteration) instead of all the time. The Selectionsort should also be able to benefit from easier branch prediction and far fewer writes to memory. Perhaps compiling to assembly code would reveal something odd going on.

Power of 2 aligned data sets
I avoided using a data size that is an exact power of two: 1000×1000 rather than 1024×1024. I did this because caches are supposed to work better this way. Perhaps I will make some 1024×1024 runs some day.

JavaScript: switch options

Is the nicest solution also the fastest?

Here is a little thing I ran into that I found interesting enough to test it. In JavaScript, you get a parameter (from a user, perhaps a web service), and depending on the parameter value you will call a particular function.

The first solution that comes to my mind is a switch:

function test_switch(code) {
  switch ( code ) {
  case 'Alfa':
    call_alfa();
    break;
  ...
  case 'Mike':
    call_mike();
    break;
  default:
    call_default();
  }
}

That is good if you know all the labels when you write the code. A more compact solution that allows you to dynamically add functions is to let the functions just be properties of an object:

x1 = {
  Alfa:call_alfa,
  Bravo:call_bravo,
  Charlie:call_charlie,
...
  Mike:call_mike
};

function test_prop(code) {
  var f = x1[code];
  if ( f ) f();
  else call_default();
}

And as a variant – not really making sense in this simple example but anyway – you could loop over the properties (functions) until you find the right one:

function test_prop_loop(code) {
  var p;
  for ( p in x1 ) {
    if ( p === code ) {
      x1[p]();
      return;
    }
  }
  call_default();
}

And, since we are into loops, this construction does not make so much sense in this simple example, but anyway:

x2 = [
  { code:'Alfa'     ,func:call_alfa    },
  { code:'Bravo'    ,func:call_bravo   },
  { code:'Charlie'  ,func:call_charlie },
...
  { code:'Mike'     ,func:call_mike    }
];

function test_array_loop(code) {
  var i, o;
  for ( i=0 ; i<x2.length ; i++ ) {
    o = x2[i];
    if ( o.code === code ) {
      o.func();
      return;
    }
  }
  call_default();
}

Alfa, Bravo…, Mike and default
I created exactly 13 options, and labeled them Alfa, Bravo, … Mike. All the test functions accept an invalid code and fall back to a default function.

The loops should clearly be worse for more options. However it is not obvious what the cost is for more options in the switch case.

I will make three test runs: 5 options (Alfa to Echo), 13 options (Alfa to Mike) and 14 options (Alfa to November) where the last one ends up in default. For each run, each of the 5/13/14 options will be equally frequent.
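The harness is roughly like this (a simplified sketch, not the exact benchmark code; call_* and the test_* functions are the ones shown above):

var codes = ['Alfa', 'Bravo', 'Charlie', 'Delta', 'Echo'];    // 5, 13 or 14 labels

function benchmark(testFunc, iterations) {
  var start = Date.now();
  for (var i = 0; i < iterations; i++) {
    testFunc(codes[i % codes.length]);     // each option equally frequent
  }
  return Date.now() - start;               // elapsed ms
}

console.log('switch  ', benchmark(test_switch,     1000000));
console.log('prop    ', benchmark(test_prop,       1000000));
console.log('proploop', benchmark(test_prop_loop,  1000000));
console.log('loop    ', benchmark(test_array_loop, 1000000));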

Benchmark Results
I am benchmarking using Node.js 0.12.2 on a Raspberry Pi 1. The startup time for Node.js is 2.35 seconds, and I have subtracted that from all benchmark times. I also ran the benchmarks on a MacBook Air with Node.js 0.10.35. All benchmarks were repeated three times and the median has been used. Iteration count: 1000000.

(ms)       ======== RPi ========     ==== MacBook Air ====
              5      13      14         5      13      14
============================================================
switch     1650    1890    1930        21      28      30
prop       2240    2330    2890        22      23      37
proploop   2740    3300    3490        31      37      38
loop       2740    4740    4750        23      34      36

Conclusions
Well, most notable (and again), the RPi ARMv6 is not fast running Node.js!

Using the simple property construction seems to make sense from a performance perspective, although the good old switch is also fast. The loops have no advantages. Also, the penalty for the default case is quite heavy for the simple property case; if you know the “code” is valid, the property lookup scales very nicely.

It is however a little interesting that on the ARM the loop over properties is faster than the loop over the array. On the x64 it is the other way around.

Variants of Simple Property Case
The following are essentially equally fast:

function test_prop(code) {
  var f = x1[code];   
  if ( f ) f();       
  else call_x();                        
}   

function test_prop(code) {
  var f = x1[code];   
  if ( 'function' === typeof f ) f();
  else call_x();                        
}   

function test_prop(code) {
  x1[code]();                          
}   

So, it does not cost much to have a safety test and a default case (just in case), but it is expensive to use it. This one, however:

function test_prop(code) {
  try {
    x1[code]();
  } catch(e) {
    call_x();
  }
}

comes at a cost of 5ms on the MacBook, when the catch is never used. If the catch is used (1 out of 14) the run takes a full second instead of 37ms!

Node.js Benchmark on Raspberry Pi (v1)

I have experimented a bit with Node.js and Raspberry Pi lately, and I have found the performance… surprisingly bad. So I decided to run some standard tests: benchmark-octane (v9).

Octane is essentially run like:

$ npm install benchmark-octane
$ cd node_modules/benchmark-octane
$ node run.js

The distilled result of Octane is a total run time and a score. Here are a few results:

                         OS             Node.js                   Time    Score
QNAP TS-109 500MHz       Debian        v0.10.29 (Debian)         3350s      N/A
Raspberry Pi v1 700MHz   OpenWrt BB    v0.10.35 (self built)     2267s      140
Raspberry Pi v1 700MHz   Raspbian       v0.6.19 (Raspbian)       2083s      N/A
Raspberry Pi v1 700MHz   Raspbian       v0.12.2 (self built)     2176s      104
Eee701 Celeron 900MHz    Xubuntu       v0.10.25 (Ubuntu)          171s     1655
Athlon II X2@3GHz        Xubuntu       v0.10.25 (Ubuntu)           49s     9475
MacBook Air i5@1.4GHz    Mac OS X      v0.10.35 (pkgsrc)           47s    10896
HP 2560p i7@2.7GHz       Xubuntu       v0.10.25 (Ubuntu)           41s    15450

Score N/A means that one test failed and there was no final score.

When I first saw the RPi performance I thought I had done something wrong building (using a cross compiler) Node.js myself for RPi and OpenWRT. However Node.js with Raspbian is basically not faster, and also RPi ARMv6 with FPU is not much faster than the QNAP ARMv5 without FPU.

I think the Eee701 serves as a good baseline here. At first glance, possible reasons for the RPi underperformance relative to the Celeron are:

  • Smaller cache (16kb of L1 cache, and L2 only available to the GPU, I read) compared to the Celeron (512k)
  • Bad or not well utilised FPU (but there at least is one on the RPi)
  • Node.js (V8) less optimized for ARM

I found that I have benchmarked those two CPUs against each other before. That time the Celeron was twice as fast as the RPi, and the FPU of the RPi performed decently. Blaming the small cache makes more sense to me than blaming the people who implemented ARM support in V8.

The conclusion is that Raspberry Pi (v1 at least) is extremely slow running Node.js. Other benchmarks indicate that RPi v2 is significantly faster.

Storage and filesystem performance test

I have lately been curious about performance for low-end storage and asked myself questions like:

  1. Raspberry Pi or Banana Pi? Is the SATA of the Banana Pi a deal breaker? Especially now when the Raspberry Pi has 4 cores, and I don’t mind if one of them is mostly occupied with USB I/O overhead.
  2. For a Chromebook or a Mac Book Air where internal storage is fairly limited (or very expensive), how practical is it to use USB storage?
  3. Building OpenWRT buildroot requires a case sensitive filesystem (disqualifying the standard Mac OS X filesystem) – is it feasible to use a USB device?
  4. The journalling feature of HFS+ and ext4 is probably a good idea. How does it affect performance?
  5. For USB drives and Memory cards, what filesystems are better?
  6. Theoretical maximum throughput is usually not that interesting. I am more interested in actual performance (time to accomplish tasks), and I believe this is often limited more by latency and overhead than by throughput. Is it so?

Building OpenWRT on Mac Book Air
I tried building OpenWRT on a USB drive (with case sensitive HFS+), and it turned out to be very slow. I did some structured testing by checking out the code, putting it in a tarball, and repeating:

   $ cd /external/disk
1  $ time cp ~/openwrt.tar . ; time sync
2  $ time tar -xf ~/openwrt.tar ; time sync   (total 17k files)
   $ make menuconfig                          (not benchmarked)
3  $ time make tools/install                  (+38k files, +715MB)

I did this on the internal SSD (this first step of OpenWRT buildroot was not case sensitive-dependent), on an external old rotating 2.5 USB drive and on a cheap USB drive. I tried a few different filesystem combinations:

$ diskutil eraseVolume hfsx  NAME /dev/diskXsY   (non journaled case sensitive)
$ diskutil eraseVolume jhfsx NAME /dev/diskXsY   (journaled case sensitive)
$ diskutil eraseVolume ExFAT NAME /dev/diskXsY   (Microsoft ExFAT)

The results were (usually just a single run):

Drive and Interface                     Filesystem        time cp  time tar     time make
==========================================================================================
Internal 128GB SSD                      Journalled HFS+   5.4s                  16m13s
2.5′ 160GB USB2                         HFS+              3.1s     7.0s         17m44s
2.5′ 160GB USB2                         Journalled HFS+   3.1s     7.1s         17m00s
Sandisk Extreme 16GB USB Drive (USB3)   HFS+              2.0s     6.9s         18m13s
Kingston DTSE9H 8GB USB Drive (USB2)    HFS+              20-30s   1m40s-2m20s  1h
Kingston DTSE9H 8GB USB Drive (USB2)    ExFAT             28.5s    15m52s       N/A

Findings:

  • Timings on USB drives were quite inconsistent over several runs (while internal SSD and hard drive were consistent).
  • The hard drive is clearly not the limiting factor in this scenario, when comparing internal SSD to external 2.5′ USB. Perhaps a restart between “tar xf” and “make” would have cleared the buffer caches and the internal SSD would have come out better.
  • When it comes to USB drives: WOW, you get what you pay for! It turns out the Kingston is among the slowest USB drives that money can buy.
  • ExFAT? I don’t think so!
  • For HFS+ and OS X, journalling is not much of a problem

Building OpenWRT in Linux
I decided to repeat the tests on a Linux (Ubuntu x64) machine, this time building using two CPUs (make -j 2) to stress the storage a little more. The results were:

Drive and Interface                     Filesystem  real time             user time  system time
=================================================================================================
Internal SSD                            ext4        9m40s                 11m53s     3m40s
2.5′ 160GB USB2                         ext2        8m53s                 11m54s     3m38s
2.5′ 160GB USB2 (just after reboot)     ext2        9m24s                 11m56s     3m31s
Kingston DTSE9H 8GB USB Drive (USB2)    ext2        11m36s +3m48s (sync)  11m57s     3m44s

Findings:

  • Linux block device layer almost eliminates the performance differences of the underlying storage.
  • The worse real time for the SSD is probably because of other processes taking CPU cycles

My idea was to test connecting the 160GB drive directly via SATA, but given the results I saw no point in doing so.

More reading on flash storage performance
I found this very interesting article (linked to by the Gentoo people, of course). I think it explains a lot of what I have measured. I think even the slowest USB drives and memory cards would often be fast enough, if the OS handled them properly.

Conclusions
The results were not exactly what I expected. Clearly the I/O load during build is too low to affect performance in a significant way (except for Mac OS X and a slow USB drive). Anyway, USB2 itself has not proved to be the weak link in my tests.

Using float and double as integer

Traditionally computers work with integer types of different sizes. For scientific applications, media, gaming and other applications floating point numbers are needed. In old computers floating point numbers were handled in software, by special libraries, making them much slower than integers, but nowadays most CPUs have an FPU that can do fast float calculations.

Until recently I was under the impression that integers were still faster than floats and that floats have precision/rounding issues, making the integer datatype the natural and only sane choice for representing mathematical integers. Then I came to learn two things:

  1. In JavaScript, all numbers are 64bit floats (double), effectively allowing 52bit integers when used correctly.
  2. OpenSSL uses the double datatype instead of int in some situations (big numbers) for performance reasons.

Both these applications exploit the fact that if the cost of 64-bit float operations is (thanks to the FPU) roughly equal to the cost of 32-bit integer operations, then a double can be a more powerful representation of big integers than an int. It is also important to understand that (double) floating point numbers have precision problems only with decimal fractions (e.g. 0.1) and very big numbers, but handle real-world integers just fine.
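A small Node.js illustration of both points; integer values stay exact in a double up to the 52/53-bit limit, while decimal fractions do not:

var big = Math.pow(2, 53) - 1;      // 9007199254740991, the largest "safe" integer
console.log(big - 1 < big);         // true  - integer arithmetic is exact here
console.log(big + 1 === big + 2);   // true  - beyond the limit, precision is lost
console.log(0.1 + 0.2 === 0.3);     // false - fractions are where rounding bites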

Apart from this, there could be other possible advantages of using float instead of int:

  • If the FPU can execute instructions somewhat in parallel with the ALU/CPU, using floats when possible could benefit performance.
  • If there are dedicated floating point registers, making use of them could free up integer registers.

Well, I decided to make a test. I have a real world application:

  • written in C
  • that does calculations on integers (mostly in the range 0-1000000)
  • that has automated tests, so I can modify the program and confirm that it still works
  • that has built in performance/time measurement

Since I had used int to represent a real-world measurement (length in mm), I decided nothing is really lost if I use float or double instead of int. The values were small enough that a 32-bit float would probably be sufficiently precise (otherwise my automated tests would complain). While the program is rather computation heavy, it is not extremely calculation-intense, and the only mathematical operations I use are +, -, >, =, <. That is, even if float math were for "free" the program would still be heavy, just faster.

In all cases gcc is used with -O2 -ffast-math. The int column shows speed relative to the first line (Celeron 630MHz is my reference/baseline). The float/double columns show speed relative to the int speed of the same machine. Higher is better.

Machine                                int    float        double       Comment
================================================================================================
Eee701 Celeron 630MHz / Lubuntu        1.0    0.93         0.93
AMD Athlon II 3GHz / Xubuntu           5.93   1.02         0.97
PowerBook G4 PPC 867MHz / Debian       1.0    0.94         0.93
Linksys WDR4900 PPC 800MHz / OpenWRT   1.12   0.96 (0.87)  0.41 (0.89)  Values in parentheses using -mcpu=8548
Raspberry Pi ARMv6 700MHz / Raspbian   0.52   0.94         0.93
QNAP TS-109 ARMv5 500MHz / Debian      0.27   0.61         0.52
WRT54GL Mips 200MHz / OpenWRT          0.17   0.20         0.17

A few notes on this:

I have put together quite many measurements and runs to eliminate outliers and variance, to produce the figures above.

There was something strange about the results from the PowerBook G4, and the performance is not what should be expected. I don't know if my machine underperforms, or if there is something wrong with the time measurements. Nevertheless, I believe the int vs float comparison is still valid.

The Athlon is much faster than the other machines, giving shorter execution times, and the variance between different runs was bigger than for other machines. The 1.02/0.97 could very well be within error margin of 1.0.

The QNAP TS-109 ARM CPU does not have an FPU, which explains the lower performance for float/double. Other machines displayed similar float/double performance with “-msoft-float”.

The Linksys WDR4900 has an FPU that is capable of both single/double float precision. But with OpenWRT BB RC3 toolchain, gcc defaults to -mcpu=8540, which falls back to software float for doubles. With -mcpu=8548 the FPU is used also for doubles, but for some reason this lowers the single float performance.

Not tested
The situation could possibly change when the division operator is used, but division should be avoided anyway when it comes to optimization.

All tests are done on Linux and with GCC; it would surprise me if the results were very different on other platforms.

More tests could be made on more modern hardware, but the precision advantage of double over int is lost on 64-bit machines with native 64-bit long int support.

Conclusion
As a rule of thumb, integers are faster than floats, and replacing integers with floats does not improve performance. Use the datatype that describes your data the best!

Exploiting the 52-bit integer capacity of a double should be considered advanced and platform dependent optimization, and not a good idea in the general case.

Simple Integer Factorization and RSA key sizes

I have been playing with RSA lately; you know, the encryption protocol where you generate two primes, multiply them, give the product away to be used for encryption by others, and keep the primes for yourself since they are needed for decryption (that was very simplified).

I have read that the largest known number to ever have been factorized was 768 bits long. Such a number looks like this (in hex):

6d733229f36b1fe086f39b5ae05d0155e7658480fe7c987fe5da51734b6b8c02
a7b6b22e5da0afc0867d7315405bddd9062da121a3b0b2663397ae8a65085d53
927279c1c330247ff20c2f1bd4b4f95375a74beefac29c519a8193e2614ffc19

For encryption to be safe for the next decades, keys that are 2048 or 4096 bits long are used. Or even longer.

One feature of RSA is that the output of encryption is never smaller than the size of the key (well, again, very simplified). So, imagine you want to encrypt 4-digit pin codes, one-by-one, using RSA with 1024-bit key, each pin code would be several hundred bytes, instead of just a few characters. For almost obvious reasons, you can not make a stream cipher of RSA or some other smart mode of operation to work around this problem (please let me know if I am wrong). This makes me want to use RSA with a small enough key to be secure enough for my purposes.

My questions

  1. What key sizes are trivial to break?
  2. What sizes require some qualified effort?
  3. How hard is it really to factorize big integers?

I found a tool called Yafu. It appears to be rather competent. It would require years of effort and advanced skills in math to write a better tool. For integers 320 bits and larger, Yafu requires GGNFS – a seriously complicated piece of software that is also hard to compile. Luckily there are nice Windows binaries from Jeff Gilchrist. I also downloaded a binary version of Yafu for Windows. The examples below use Cygwin to have access to some Unix tools (bc, tr, openssl).
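To put the need for such tools in perspective: naive factorization by trial division (a JavaScript/BigInt sketch below, nothing like what Yafu actually does) needs on the order of the square root of the modulus in divisions, so it is already hopeless for a 128-bit product of two 64-bit primes.

// Trial division: fine for toy numbers, hopeless for RSA moduli, since the
// loop runs up to the smallest prime factor (~2^64 iterations for 128 bits).
function trialDivision(n) {
  var factors = [];
  for (var d = 2n; d * d <= n; d++) {
    while (n % d === 0n) {
      factors.push(d);
      n /= d;
    }
  }
  if (n > 1n) factors.push(n);      // whatever remains is prime
  return factors;
}

console.log(trialDivision(91n));    // [ 7n, 13n ]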

Generating a Prime product and factorizing it
There is a very nice JavaScript project for RSA. Set the bit size to whatever you want (I use 128 in this example), click generate, and obtain “Modulus”:

Modulus (hex): 81653c1536c42501a815431dac804899

Convert to upper case using tr:

$ echo 81653c1536c42501a815431dac804899 | tr '[:lower:]' '[:upper:]'
81653C1536C42501A815431DAC804899

Then use bc to convert to decimal:

$ bc -q
ibase=16
81653C1536C42501A815431DAC804899
171996052064283111843964589052488861849
quit

Finally, factorize using yafu:

$ echo "factor(171996052064283111843964589052488861849)" | ./yafu-x64.exe


06/14/14 13:24:28 v1.34.5 @ TOR, System/Build Info:
Using GMP-ECM 6.3, Powered by GMP 5.1.1
detected         Intel(R) Core(TM) i5-2400 CPU @ 3.10GHz
detected L1 = 32768 bytes, L2 = 6291456 bytes, CL = 64 bytes
measured cpu frequency ~= 3108.870930
using 20 random witnesses for Rabin-Miller PRP checks

===============================================================
======= Welcome to YAFU (Yet Another Factoring Utility) =======
=======             bbuhrow@gmail.com                   =======
=======     Type help at any time, or quit to quit      =======
===============================================================
cached 78498 primes. pmax = 999983


>>
fac: factoring 171996052064283111843964589052488861849
fac: using pretesting plan: normal
fac: no tune info: using qs/gnfs crossover of 95 digits
div: primes less than 10000
fmt: 1000000 iterations
rho: x^2 + 3, starting 1000 iterations on C39
rho: x^2 + 2, starting 1000 iterations on C39
rho: x^2 + 1, starting 1000 iterations on C39
pm1: starting B1 = 150K, B2 = gmp-ecm default on C39
ecm: 30/30 curves on C39, B1=2K, B2=gmp-ecm default

starting SIQS on c39: 171996052064283111843964589052488861849

==== sieving in progress (1 thread):     624 relations needed ====
====           Press ctrl-c to abort and save state           ====
546 rels found: 293 full + 253 from 2026 partial, (41408.50 rels/sec)

SIQS elapsed time = 0.0740 seconds.
Total factoring time = 0.2690 seconds


***factors found***

P20 = 14624642445740394983
P20 = 11760701343804754303

ans = 1

Now I wrote a little bash script that does everything:

#!/bin/bash

INTSIZE=$1

rm -f key.*
rm -f *.log
rm -f siqs.dat

BC_LINE_LENGTH=9999
export BC_LINE_LENGTH

openssl genrsa -out key.pem $INTSIZE
openssl rsa -in key.pem -noout -modulus | cut -f 2 -d '=' > key.hex
echo "ibase=16 ; $( cat key.hex )" | bc > key.dec
echo "factor($( cat key.dec ))" | ./yafu-x64.exe -threads 4

I am using yafu with default settings – very hard to imagine I can do anything better than the yafu author. For 320 bits and above, the GGNFS library is required. Performance for different sizes of integers to factorize (using the QuadCore Intel i5-2400 from the output above):

Bits  Time     Notes
======================================================================================================
128   0.28s
160   0.33s
192   1.86s
224   8.02s
256   52.6s
288   265s
320   3649s    ~30Mb Temp files
352   9291s
384   27261s   ~660Mb Temp files
512   73 days  http://lukenotricks.blogspot.se/2009/08/solo-desktop-factorization-of-rsa-512.html

Well, this means that a normal Windows desktop computer can break 512 bit RSA within days or weeks. Below 512 bits, brute force is not much of a challenge, and using 256-bit RSA for encrypting short messages is (unfortunately, since it was what I ultimately wanted to explore) just ridiculous.

As keys get larger, more temp disk space is required, and somewhere between 512-768 bits it gets seriously complex (I claim this, as a 768-bit integer is the largest known to ever have been factorized into two primes). You can read about General Number Field Sieves to get some background.

Not everyone is capable of extracting the modulus integer from an encrypted file or network stream, installing Yafu, waiting for a little while and then using the two obtained primes to actually generate a fake certificate or decrypt anything. So, if you want to encrypt data to prevent your boss or your wife from reading it, you can probably use any key size you like – or why not use an encrypted zip file or an encrypted MS Word file?

If you have a skilled and motivated enemy, who is willing to put some effort into breaking your encryption, I would not use anything close to 512 bits. I assume the police, or FRA (in Sweden) or the NSA, can break 512-bit RSA within hours or faster when they need to.

I am not giving any sources for any of my claims here. I am not an expert in factorizing large-prime-products, and I am certainly not an expert in Quantum Computers. But as I understand it; 1024 bits should be fine, but perhaps in 10-20 years using even larger keys may make sense, and I don’t expect to see Quantum computers practically breaking real RSA keys in the next 50 years.

It is fascinating that a 128-bit AES key is completely beyond hope to brute force for any power on earth, while 128-bit RSA keys are worthless.

I now wonder, are there other asymmetric ciphers that are secure with significantly shorter keys than RSA?

Testing ownCloud performance

Update 2014-04-28: Upgrading to ownCloud 7.0.1 has not changed the performance of the platform at all.

Update 2014-04-28: I have found that downloading files is quite fast. At about 10% CPU load, the server can saturate my 10MBit/s internet connection when I download my files to another computer over https. When uploading files, top shows mostly waiting time. When downloading, top shows mostly idle time. I suspect the SQL communication/overhead is limiting upload performance, and that ownCloud keeps a lot of bookkeeping in the database. If it does so for a good reason, and download is reasonably fast, I can live with it. I keep my original article below (on upload performance) anyway, but I find the performance quite acceptable for my real-world application, on my not so powerful hardware.

Update 2014-04-26: Tried FastCGI/mod-fcgid, see below.

Ubuntu announced that they will cancel the Ubuntu One service, and Condoleezza Rice will start working for Dropbox. So, how am I going to share my files among different computers and devices?

ownCloud appears like a nice option. It is like Dropbox, but I can run it myself, and it works not only for files, but also for contacts/calendars and smartphones.

Buying ownCloud as a service is possible, but as soon as I want to put my pictures there (and perhaps some video and music) it gets pretty expensive. If I host it myself, several hundred GB of disk is no problem.

So, I installed ownCloud (6.0.2) on my QNAP TS 109 running Debian (7.4). Horrible performance – it took a minute to log in. Ok – the QNAP has a 500MHz ARM, but even worse, just 128MB of RAM and quite slow disk access. What device to put ownCloud on? A new nice QNAP (TS-221) is quite pricy, and a Raspberry Pi accesses both disk and network over its USB bus. I came to think of buying a used G4 Mac Mini – they are really cheap! Then I came to think of my old Titanium PowerBook G4 that has been gathering dust last year, and I decided to try running ownCloud on it. Perhaps not as a long term solution, but as a learning/testing machine it could work fine.

ownCloud Server configuration
CPU: G4 866MHz
RAM: 1024Mb
HD: 320GB ATA
OS: Debian, 7.4 (PPC) fresh install on entire hard drive
DB: mysql 5.5 (the std version for Debian)
https: apache 2.2 (the std version for Debian)

To improve performance, I enabled APC for PHP, and disabled full text search.

Performance measurements
For the performance tests, I tried to transfer 1x100MB, 10x10MB and 100x1MB files. I measured the times with a regular stopwatch, and occasionally I repeated a test when the result was strange. The measurements below are not exactly accurate, but the big picture is there.

Transfers are made from a Windows 7 PC over a Gbit network.

                                        1x100MB  10x10MB  100x1MB
==================================================================
Encryption and checksum on G4 / server
(1): ssl encrypt aes ; sync             7s
(2): md5sum                             1s
File transfer using other methods
(3): ftp/Filezilla                      3s       3s       4s
(4): sftp/Filezilla                     14s      15s      17s
ownCloud
(5): No SSL, no APC                     15s      32s      234s
(6): No SSL, APC                        16s      27s      197s
(7): SSL, APC                           34s      43s      263s
(8): SSL, APC, encryption               46s      69s      438s

Comments on Performance
(1): tells me that the server is capable of encrypting 100MB of data, and syncing the output to local disk, in 7 seconds. The sync is less than a second.
(2): tells me that the server is capable of processing 100MB of data in a second.
(3): tells me that simply sending the files over the network with a proven protocol takes 3-4 seconds, slightly slower for more, smaller files.
(4): tells me that sending the files in an encrypted stream with a proven protocol over the network takes about 15 seconds, slightly slower for more, smaller files.
(5): shows that the overhead for many files in ownCloud is massive.
(6): shows that APC helps, but not in a very significant way.
(7): shows the extra cost of SSL (transferring over a secure channel).
(8): shows the extra cost of encrypting the files for the user, on the server (using openssl/AES, according to ownCloud documentation).

It makes sense to compare rows (3) and (6), indicating that with no encryption whatsoever the overhead of ownCloud is 5-50x the actual work. Or, the resources used for actually transferring and storing files are 20%-2%; the rest of the resources, 80%-98%, are “wasted”. Now ownCloud has some synchronization and error handling capabilities not found in FTP, but I don't think that justifies this massive overhead.

In the same way it makes sense to compare row (4) and (7), indicating a waste of 60%-94% of resources, for using a secure channel (and I believe that SSH uses stronger encryption than TLS).

For average file size smaller than 1MB, the waste will be even bigger.

I suspect the cost is related to executing php for each and every file. It could also be the use of the database for each file that is expensive. Somewhere I read that there are “hundreds” of database calls for each web server request handled by ownCloud.

Suggestions
It is of course a bit arrogant to suggest solutions to a problem in an Open Source project that I have never contributed to, without even reading the code. Anyway, here we go:

  • Find a way to upload small directories (<10MB, or <60s transfer) as tarballs or zipfiles. This should of course happen transparently to the user (and only work via the client, not the web). This way hundreds or thousands of small files could be uploaded in a few seconds instead of in a very long time - and the load on the server would decrease a lot.
  • Similar to the first suggestion, allow files to be uploaded in fragments, to allow upload of 2GB+ files on all server platforms (it is ridiculous that an ARM server, like a QNAP, can not handle 2GB+ files, as I have read in the documentation is the case).
  • Alternatively, allow ownCloud to use ssh/sftp as transfer protocol. It will not work in all situations, but when client and server are allowed to communicate on port 22, and ownCloud is installed on a server with ssh enabled, it could be an option.

I kind of presume that the problem is one-file-per-request and WebDAV limitations. Perhaps it is the database that is the problem? Nevertheless, I think some kind of batch handling of uploads/downloads is the solution in that case too.

LAMP
ownCloud is built on LAMP, and I doubt the performance problems are related to the LA in LAMP. Also, I don't think that the M should be the problem if the database calls are kept at a reasonable level. The problem must be with P(HP). I understand and appreciate that PHP is simple and productive, and probably 95% of ownCloud can be perfectly written in PHP. But perhaps there are a few things that should be written in something more high-performing (I am talking about C, of course)?

Conclusion
I really like the ambition of ownCloud, and mostly, the software is very nice. The server has many features, and the clients are not only nice, but also available for several platforms.

ownCloud is a quite mature product, at version 6. I wish some effort were put into improving performance. I believe there are possible strategies that would not require very much rewriting, and would not need to break compatibility. And I also believe it makes much sense to optimize the ownCloud server code: not only because people may run it on Raspberry Pis, QNAPs or old hardware, but also because it would improve the usefulness on more powerful servers.

2014-04-26: FastCGI / mod-fcgid
I decided to try PHP via FastCGI to see if it could improve performance. Very little difference – I disabled it and got back to “recommended” configuration. For details, read on.

I mostly followed this guide (as apache2-mod-fastcgi seems to be replaced by apache2-mod-fcgid lately, other high-ranking guides were out of date). The following options need to be added to /etc/apache2/apache2.conf:

FcgidFixPathinfo 1               (not in the site definition as suggested in guide)
FcgidMaxRequestLen 536870912     (effectively limits maximum file size)