Category Archives: Raspberry Pi

Syncthing crashes on RPi and Arch Linux

One of my Syncthing servers started crashing (again). It is a Raspberry Pi v2 running Arch Linux. Syncthing was at 0.14.44 (or thereabouts); I upgraded to 0.14.48.1, but it is still not stable.

So I downloaded the Syncthing binary from the Syncthing project itself instead of using the one that comes with Arch Linux. That seems to work better.

While trying different things I did a database reset:

$ syncthing -reset-database     (does not start syncthing)
$ syncthing

This is not the first time Syncthing has misbehaved on a Raspberry Pi, and I am beginning to question whether it is so smart to store my files on a Raspberry Pi with a USB drive.

Raspbian – kerberos not found

I have this very strange error on my RPi V2 with Raspbian (8.0). I suspect I will throw away the memory card and never fix it, but I will document the error for future reference.

My problem was that curl, ssh and sshd suddenly stopped working. When I started the web browser I got an “I/O error”. The screenshot showed (at least a symptom of) the problem: Kerberos not being found, hence the title of this post.

I tried to reinstall ssh and curl:

$ apt-get install --reinstall curl

and that did not help.

Apart from this the system works OK. It shuts down and starts properly. No I/O errors from dmesg. I doubt I will ever figure this one out. It seems to me the system is corrupt at the disk level, probably an SD-card problem, and that a new install on a new SD-card is the only way forward.

Syncthing v0.14.40, Raspberry Pi, 100% CPU

I think Syncthing is an amazing piece of software, but I ran into a problem last week.

I have a library of 10 different folders, 120000 files, 42000 directories and 428GB of data.

I thought that was a little bit too much for my RPi V1 (Syncthing 0.14.40, Arch Linux), because it constantly ran at 100%. I raised the Rescan Interval to several hours (so it would finish before starting over).

After startup it took about 10-15 minutes to get the web GUI up, and about an hour to scan all folders for the first time. Well, that is OK, but after that it still constantly used 100% CPU even though all folders were “up to date”.

It turned out it crashed and started over. I found panic logs in .config/syncthing and error messages in .config/syncthing/index-v0.14.0.db/LOG.

Some errors indicated Bad Magic Number and Checksum Corruption. The usual reason for this seems to be a hardware problem (!?!).

I upgraded my RPi V1 to an RPi V2, with little success. Then I found that I had similar problems on another RPi V2. So after shutting down Syncthing I tried the quite scary database reset:

  $ syncthing -reset-database      ( does not start syncthing )      
  $ syncthing                      ( start syncthing )

After several hours of scanning everything seems to work perfectly!
Let us see how long that lasts.

Review: NUC vs Raspberry Pi

I like small, cheap, quiet computers… perhaps a little too much. For a long time I have used a Raspberry Pi V2 (QuadCore@900MHz and 1GB RAM) as a workstation. To be honest, I have not used it for web browsing, that is just too painful. But I have used it for programming and running multiple Node.js services, and a few other things.

Although there are so many single board computers, it is hard to find really good alternatives to the Raspberry Pi. When I look into it, I find that Intel NUCs are very good options. So I just decided to replace my RPi2 workstation with the cheapest NUC money can currently buy: the NUC6CAY with a Celeron J3455 CPU. A Celeron sounds cheap, particularly for something server-like, but the interesting thing about the J3455 is that it is actually a quad core, with no hyper-threading. To me that sounds amazing!

I also have an older NUC, a 54250WYKH with an i5 CPU.

Raspberry Pi V2:    ARMv7    4 Cores       900MHz                  1GB RAM
NUC (NUC6CAY):      Celeron  4 Cores      1500MHz (2300 burst)     8GB RAM
NUC (54250WYKH):    i5       2 Cores (HT) 1300MHz (2600 burst)    16GB RAM

I/O is obviously superior for the NUCs (both using SSDs) versus the RPi v2 with a rotating disk connected over USB. But for my purposes I think I/O and (the amount of) RAM make little difference. I think it is more about raw CPU power.

Node.js / JavaScript
When it comes to different Node.js applications, it seems the older i5 is about twice as fast as the newer Celeron (for one core and one thread). I would say this is slightly disappointing (for the Celeron). On the other hand, the Celeron is about 10x faster than the RPi V2 when it comes to Node.js code, and that is a very good reason to use a NUC rather than a Raspberry Pi.

Update 2018-02-11: after a few months
I came back to my RPi2 from my cheap NUC. The difference is… everything. I really like Raspberry Pis. I have built cases for them, bought cases for them, worked on them, made servers of them. But I really must say that a NUC makes more sense: it contains everything nicely and it is so much more powerful.

You can get a Celeron NUC with 2GB RAM and a 2.5″ disk for quite little money. And from there you can go up to a Core i7, 32GB RAM and two drives: M.2 + 2.5″. And check out the Hades Canyon NUC.

It is a pity there is basically nothing on the market like a NUC with ARM, AMD, PowerPC or MIPS inside. The only competition is the 4-year-old Mac Mini, which is an Intel machine anyway. If you find something cool, NUC-like and not Intel, feel free to post below.

Update 2018-02-28
I ran into a new problem on my RPi. It could be anything. My guess, which I will never be able to prove, is that it is a glitch made possible by using an SD-card as the root device (and possibly questionable drivers/hardware for SD on the RPi).

Update 2018-04-09
Premier Farnell has introduced a Desktop Pi. Especially promising is that, together with a recent RPi, you can get rid of the SD-card entirely and use only an SSD/HDD or even mSATA (over USB, I presume).

Raspberry PI performance and freezes

On a daily basis I use a Raspberry Pi v2 (4x900MHz) with Raspbian as a workstation and web server. It is connected to a big display, I edit multiple files, and it runs multiple Node.js instances. These Node.js processes serve HTTP and access (both read and write) local files.

I experienced regular freezes: listing files in a directory, opening a file or saving a file could suddenly take 2-3 seconds.

I moved my working directory from my (high performance) SD-card to a regular spinning USB hard drive. That completely solved the problem. I experience zero freezes now, compared to plenty before.

My usual experience with Linux is that the block caching layer is highly effective: things get synced to disk when there is time to do so. I don't know if Linux handles SD-cards fundamentally differently from other drives (syncing more often) or if the SD card (or the Raspberry Pi SD card hardware) is just slower.
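
If I ever want to quantify this, a crude way would be to time synchronous writes to a file on the SD card and to a file on the USB drive and compare. Below is a minimal sketch of such a test; it is my own illustration (not something from the original setup), and the file path is simply an argument pointing at whatever location you want to measure.

/* sync_write_bench.c - a rough sketch for comparing synchronous write
 * latency on two locations, e.g. a file on the SD card vs a file on
 * the USB drive.
 * Compile: gcc -O2 -o sync_write_bench sync_write_bench.c
 * Run:     ./sync_write_bench /path/on/device/testfile
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv) {
  if (argc < 2) { fprintf(stderr, "usage: %s <testfile>\n", argv[0]); return 1; }
  int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) { perror("open"); return 1; }

  char buf[4096];
  memset(buf, 'x', sizeof(buf));

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < 256; i++) {             /* 256 * 4kB = 1MB in total */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) { perror("write"); return 1; }
    fsync(fd);                                /* force every block to the device */
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);
  close(fd);

  double ms = (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
  printf("256 synced 4kB writes: %.1f ms (%.2f ms per write)\n", ms, ms / 256.0);
  return 0;
}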

So, for making real use of a Raspberry Pi I would clearly recommend a hard drive.

Raspberry Pi Server

The Raspberry Pi has been around for some years now and it has been used in unbelievable projects. As a budget desktop computer it has not quite had the required performance (although v2 and v3 much improve the situation over v1). However, for simple hobby server tasks the RPi can work very well.

A simple RPi (any version) setup typically requires:

  • RPi
  • SD Card
  • USB PSU + USB cable
  • Network Cable
  • External USB Drive + USB Cable (+power adapter)
  • A case

That is without display, mouse and keyboard, and you don't have a power button. It gets a bit messy.

The market is full of RPi cases that all do the same thing: nothing. They just contain the board. The market is also full of mini/micro-towers for Mini-ITX, and there are rather expensive NAS devices that come without hard drives. Why are there no small tower cases that come with:

  • PSU
  • Slots for 1-2 hard drives (+USB to SATA converters)
  • Cabling that makes everything tidy and neat

Powering the RPi using an external hard drive
I happened to have an external USB drive with an integrated USB hub (an Iomega Minimax that was left over when its Mac Mini died). With some wood and glue I built a simple stand for the hard drive and the RPi:

(Photos DSCN5193, DSCN5194 and DSCN5196: the wooden stand holding the hard drive and the RPi.)

As you can see:

  • the hard drive powers the RPi, and I can even use the hard drive power switch
  • the Ethernet and USB ports are conveniently available on the back side
  • the footprint is just slightly larger (just taller) than the hard drive itself
  • the two USB cables between the RPi and the hard drive are nicely contained
  • heat/ventilation should be pretty good

I have experienced no problems powering the RPi from the same USB drive that it is itself connected to. It may not be a supported or recommended configuration, but for practical purposes it works for me.

Performance
I mostly run Syncthing on this RPi. The bottleneck is very much the 700MHz ARMv6 CPU, not the USB2-to-SATA-overhead.

hdparm gives me:

$ sudo /sbin/hdparm -t /dev/sda
/dev/sda:
 Timing buffered disk reads:  82 MB in  3.03 seconds =  27.09 MB/sec

$ sudo /sbin/hdparm -T /dev/sda
/dev/sda:
 Timing cached reads:   496 MB in  2.01 seconds = 247.36 MB/sec

Of course it sucks compared to what you can get in 2016, but it is not remarkably bad in any way. And it is not so fun to live on an SD card.

The Western Digital Kit
The other day Western Digital announced both a special 314GB hard drive and accessories to make it all nice.

Plusberry Pi
There is also the interesting Plusberry Pi project.

Picocluster
Picocluster is clearly bringing new options! Not so much focus on support for a USB hard drive, though. Some models have an HDMI port. I am not sure, but I think the idea is that you connect one RPi to the external USB/HDMI ports, and then that RPi can control the other ones, if you run Picocluster's custom distribution. Not so bad, but not a KVM either. Perhaps a little modification to make a serial switch, so RPi #2-5 can be controlled from #1 over serial?

Best Raspberry Pi Server Linux Distribution

Since I got my first Raspberry Pi I have wondered how to turn it into a proper server. Options that I have not been entirely satisfied with:

  • Arch Linux: probably a great option if you know Arch… I have been too lazy to learn.
  • Gentoo Linux: is Gentoo still relevant? Building everything on the RPi sounds very painful (slow)
  • OpenWrt: nice, but slightly too minimal for a server
  • Raspbian: nice, but a little bit too big a standard installation (perhaps it does not really matter, but every apt-get upgrade takes longer, and so on)
  • NetBSD: such a disappointment 🙁

I have now found, and tested, the Raspbian Unattended Netinstaller. For me, this is the shit.

It is really this simple:

  1. Format your SD-card with FAT32 (just as usual)
  2. Unpack (unzip) the raspbian-ua-netinst on your SD-card
  3. Connect the SD-card, ethernet and power to your Raspberry Pi
  4. Wait (about 25 minutes, they say, that was ok with me)
  5. SSH into your new lean Raspbian system (root/raspbian).
  6. Read under “first boot” what to do next

Clearly, you need a properly configured network (DHCP, allow fetching of packages, and you need to know what IP address it got).

The entire experience is much enhanced if you connect to your Raspberry Pi with a serial cable during the entire procedure. Jokes aside, I used a serial cable for my first installation. The second time, when I felt confident with the process, I did not bother with it.

First boot quick guide

# dpkg-reconfigure locales
# dpkg-reconfigure tzdata

/boot/config.txt: add the line
gpu_mem=16

Upgrade to jessie
For some reason, the Raspbian installation is still based on wheezy, not jessie (you don't get the latest version of Debian). I suggest upgrading immediately:

/etc/apt/sources.list (replace wheezy with jessie, two places)

# apt-get update
# apt-get dist-upgrade

It is almost as fast as the installation itself 😉

Conclusion
I think that, for the Raspberry Pi V1, Raspbian installed this way is the best server system you can have (perhaps Arch is better if you know it). For a Raspberry Pi V2, perhaps standard Debian is better (I have never used an RPi v2). Everything I have written applies perfectly to the RPi v2 as well.

Notes on Raspberry Pi and Serial

I experimented with my Raspberry Pi (v1 B) and a serial cable, a USB-serial identified as:

[85907.504415] usb 4-5: new full-speed USB device number 19 using ohci-pci
[85907.730850] usb 4-5: New USB device found, idVendor=0403, idProduct=6001
[85907.730863] usb 4-5: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[85907.730871] usb 4-5: Product: TTL232R-3V3
[85907.730877] usb 4-5: Manufacturer: FTDI
[85907.730882] usb 4-5: SerialNumber: ********
[85907.737978] ftdi_sio 4-5:1.0: FTDI USB Serial Device converter detected
[85907.738070] usb 4-5: Detected FT232RL
[85907.744057] usb 4-5: FTDI USB Serial Device converter now attached to ttyUSB1

My USB-serial device has six cables: black-brown-red-orange-yellow-green.
Connected to the RPi, counting from the corner pin: none, none, black, yellow, orange, and then none for the remaining eight pins.

At this point I have had no success with minicom. Screen works, though:

sudo minicom -b 115200 -o -D /dev/ttyUSB1
sudo screen /dev/ttyUSB1 115200

When serial works, my procedure is:

  1. Connect everything except power
  2. Start screen
  3. Connect power
  4. Within a few seconds I get output

If I start a fresh default NOOBS (v1.4):

Uncompressing Linux... done, booting the kernel.

Welcome to the rescue system
recovery login: 

You can log in with root/raspberry, but I don't know whether you are meant to (or can) install Raspbian this way.

NOTE: The Raspberry Pi itself prints nothing to the serial console. Only with a properly installed SD-card inserted do you get output.

Already installed System
For an already installed Raspbian, I got a normal login prompt over serial.
For an already installed OpenWRT (14.07), I got a root prompt, no password required, over serial.

Formatting SD-card using Linux
Sometimes it is hard to produce an SD-card that the Raspberry Pi wants to boot from.
This partitioning and formatting works:

$ sudo /sbin/fdisk -l /dev/sde

Disk /dev/sde: 7,4 GiB, 7948206080 bytes, 15523840 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x00055f28

Device     Boot Start      End  Sectors  Size Id Type
/dev/sde1        2048 15523839 15521792  7,4G  e W95 FAT16 (LBA)

gt@oden:~/Downloads$ sudo mkfs.vfat /dev/sde1
mkfs.fat 3.0.27 (2014-11-12)

To be on the safe side, before using fdisk:

$ sudo dd if=/dev/zero of=/dev/sde bs=1024 count=10240

Node.js performance of Raspberry Pi 1 sucks

In several previous posts I have studied the performance of the Raspberry Pi (version 1) and Node.js to find out why the Raspberry Pi underperforms so badly when running Node.js.

The first two posts indicate that the Raspberry Pi underperforms by about 10x compared to an x86/x64 machine, after compensating for clock frequency. The small cache of the Raspberry Pi is often mentioned as a cause of its poor performance. In the third post I examined that, and it is not that horribly bad: about 3x worse performance for large working sets compared to in-cache situations. It appears the slow SDRAM of the RPi is more of a problem than the small cache itself.

The Benchmark Program
I wanted to relate the Node.js slowdown to some other scripting language. I decided Lua was a nice choice, and I was lucky to find Mandelbrot implementations in several languages!

I modified the program(s) slightly, increasing the resolution from 80 to 160. I also made a version that did almost nothing (MAX_ITERATIONS=1) so I could measure and subtract the startup cost (which is significant for Node.js) from the actual benchmark values.
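
For reference, the C version looks roughly like this. It is my own generic sketch of the kind of program used, not the exact code from the implementations I found, and the iteration count here is just a placeholder. WIDTH/HEIGHT correspond to the resolution mentioned above, and setting MAX_ITERATIONS to 1 gives the do-almost-nothing version.

/* mandelbrot.c - a minimal sketch of the benchmark program.
 * Compile: gcc -O2 -o mandelbrot mandelbrot.c
 * (on ARM, add e.g. -mfloat-abi=hard, or -mfloat-abi=softfp -mfpu=vfp;
 *  see the notes on hard/soft floats below)
 */
#include <stdio.h>

#define WIDTH  160
#define HEIGHT 160
#define MAX_ITERATIONS 1000   /* set to 1 to measure startup cost only */

int main(void) {
  int inside = 0;
  for ( int py=0 ; py<HEIGHT ; py++ ) {
    for ( int px=0 ; px<WIDTH ; px++ ) {
      /* map the pixel to the complex plane, roughly [-2,1] x [-1.5,1.5] */
      double cr = -2.0 + 3.0 * px / WIDTH;
      double ci = -1.5 + 3.0 * py / HEIGHT;
      double zr = 0.0, zi = 0.0;
      int it = 0;
      while ( it < MAX_ITERATIONS && zr*zr + zi*zi <= 4.0 ) {
        double t = zr*zr - zi*zi + cr;
        zi = 2.0*zr*zi + ci;
        zr = t;
        it++;
      }
      if ( it == MAX_ITERATIONS ) inside++;   /* point did not escape */
    }
  }
  printf("%d of %d points inside the set\n", inside, WIDTH*HEIGHT);
  return 0;
}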

The Numbers
Below are the averages of three runs (minus the average of three 1-iteration runs), in ms. The timing values were very stable over several runs.

 (ms)                           C/Hard   C/Soft  Node.js     Lua
=================================================================
 QNAP TS-109 500MHz ARMv5                 17513    49376   39520
 TP-Link Archer C20i 560MHz MIPS          45087    65510   82450
 RPi 700MHz ARMv6 (Raspbian)       493             14660   12130
 RPi 700MHz ARMv6 (OpenWrt)        490    11040    15010   31720
 RPi2 900MHz ARMv7 (OpenWrt)       400     9130      770   29390
 Eee701 900MHz Celeron x86         295               500    7992
 3000MHz Athlon II X2 x64           56                59    1267

Notes on Hard/Soft floats:

  • Raspbian is armhf, only allowing hard floats (-mfloat-abi=hard)
  • OpenWrt is armel, allowing both hard floats (-mfloat-abi=softfp) and soft floats (-mfloat-abi=soft).
  • The QNAP has no FPU and generates a runtime error with hard floats
  • The other targets produce linkage errors with soft floats

The Node.js versions are slightly different, and so are the Lua versions. This makes no significant difference.

Findings
Calculating the Mandelbrot set with the FPU is basically “free” (<0.5s). Everything else is waste and overhead.

The cost of soft float is about 10s on the RPi. The difference between Node.js on Raspbian and OpenWrt is quite small – either both use the FPU, or neither of them does.

Now, the interesting thing is to compare the RPi with the QNAP. For the C program with soft floats, the QNAP is about 1.5x slower than the RPi. This matches well with earlier benchmarks I have made (see the 1st and 3rd links at the top of this post). If the RPi had been using soft floats in Node.js, it would have completed in about 30 seconds (scaling the QNAP's 50 seconds by that 1.5x factor). The only thing (that I can come up with) that explains the unusually large difference between the QNAP and the RPi in this test is that the RPi actually utilizes the FPU (on both Raspbian and OpenWrt).

OpenWrt and FPU
The poor Lua performance in OpenWrt is probably due to two things:

  1. OpenWrt is compiled with -Os rather than -O2
  2. OpenWrt by default uses -mfloat-abi=soft rather than -mfloat-abi=softfp (which is essentially like hard).

It is important to notice that -mfloat-abi=softfp not only makes programs much faster, but also quite a bit smaller (about 10%), which would be valuable in OpenWrt.

Different Node.js versions and builds
I have built Node.js many times for Raspberry Pi and OpenWrt. The soft/softfp setting above does not affect Node.js performance much, but it does affect binary size. Node.js v0.10 is faster on the Raspberry Pi than v0.12 (which needs some patching to build).

Lua
Apart from the un-optimized OpenWrt Lua build, Lua is consistently 20-25x slower than native on the RPi, x86 and x64 alike. It is not as if the small cache of the RPi, or some other limitation of the CPU, makes it worse at interpreted languages than x86/x64.

RPi ARMv6 VFPv2
While perhaps not the best FPU in the world, the VFPv2 floating point unit of the RPi's ARMv6 delivers quite decent performance (only slightly worse per clock cycle) compared to x86 and x64. It does not seem like the VFPv2 is to blame for the poor performance of Node.js on ARM.

Conclusion and Key finding
While Node.js (V8) for x86/x64 is near native speed, on ARM it is rather near Lua speed: mostly just another interpreted language. This does not seem to be caused by any limitation or flaw in the (RPi) ARM CPU, but rather by the V8 implementation for x86/x64 being superior to that for ARM (ARMv6 at least).

Effects of cache on performance

It is not clear to me why Node.js is so amazingly slow on a Raspberry Pi (article 1, article 2).

Is it because of the small cache (16kb+128kb)? Is Node.js emitting poor code on ARM? Well, I decided to investigate the cache issue. The 128kb cache of the Raspberry Pi is supposed to be primarily used by the GPU; is it actually effective at all?

A suitable test algorithm
To understand what I test, and for the fun of it, I wanted to implement a suitable test program myself. I imagine a good test program for cache testing would:

  • be reasonably slow/fast, so measuring execution time is practical and meaningful
  • have working data sets in sizes 10kb-10Mb
  • allow the same problem to be solved with different working set sizes, in such a way that the theoretical execution time is the same and any difference is due to the cache only
  • be reasonably simple to implement and understand, while not so trivial that the optimizer just gets rid of the problem entirely

Finally, I think it is fun if the program does something slightly meaningful.

I found that Bubblesort (and later Selectionsort) were good candidates, if combined with a twist. Original Bubblesort:

Array to sort: G A F C B D H E   ( N=8 )
Sorted array:  A B C D E F G H
Theoretical cost: N²/2 = 64/2 = 32
Actual cost: 7+6+5+4+3+2+1     = 28 (compares and conditional swaps)

I invented the following cache-optimized Bubble-Twist-Sort:

Array to sort:                G A F C B D H E
Sort halves using Bubblesort: A C F G B D E H
Now, the twist:                                 ( G>B : swap )
                              A C F B G D E H   ( D>F : swap )
                              A C D B G F E H   ( C<E : done )
Sort halves using Bubblesort: A B C D E F G H
Theoretical cost = 16/2 + 16/2 (first two Bubblesorts)
                 + 4/2         (expected number of twist-swaps)
                 + 16/2 + 16/2 (second two Bubblesorts)
                 = 34
Actual cost: 4*(3+2+1) + 2 = 26

Anyway, for larger arrays the actual costs get very close to the theoretical ones. The idea here is that instead of running one Bubblesort on 1000 elements (using 1000 memory cells intensively for ~500000 operations), I can replace it with 4 runs on 500 elements (4 × ~125000 operations, plus ~250 twist operations). So I am solving the same problem, using the same algorithm, but optimizing for smaller cache sizes.
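
In C, a sketch of the idea could look like the following. This is my own reconstruction from the description above (not the exact benchmark code), and it assumes an even number of elements.

/* bubble_twist.c - sketch of the cache-friendly Bubble-Twist-Sort:
 * bubblesort each half, swap across the middle until the halves are in
 * order relative to each other, then bubblesort each half again.
 * Each pass then only touches half the array, which suits a small cache.
 * Compile: gcc -O2 -o bubble_twist bubble_twist.c
 */
#include <stdint.h>
#include <stdio.h>

static void bubblesort(uint32_t* a, size_t len) {
  for ( size_t i=len ; i>1 ; i-- )
    for ( size_t j=1 ; j<i ; j++ )
      if ( a[j-1] > a[j] ) {
        uint32_t tmp = a[j-1];
        a[j-1] = a[j];
        a[j]   = tmp;
      }
}

static void bubble_twist_sort(uint32_t* a, size_t len) {
  size_t half = len / 2;                 /* assumes len is even   */
  bubblesort(a, half);                   /* sort first half       */
  bubblesort(a + half, half);            /* sort second half      */
  /* the twist: compare inwards across the middle, stop at first non-swap */
  for ( size_t i=0 ; i<half && a[half-1-i] > a[half+i] ; i++ ) {
    uint32_t tmp = a[half-1-i];
    a[half-1-i]  = a[half+i];
    a[half+i]    = tmp;
  }
  bubblesort(a, half);                   /* sort the halves again */
  bubblesort(a + half, half);
}

int main(void) {
  uint32_t a[] = { 6, 0, 5, 2, 1, 3, 7, 4 };   /* G A F C B D H E */
  bubble_twist_sort(a, 8);
  for ( size_t i=0 ; i<8 ; i++ )
    printf("%u ", a[i]);                       /* prints 0..7, i.e. A..H */
  printf("\n");
  return 0;
}

Note that the twist can stop at the first pair that is already in order: since both halves are sorted, all remaining pairs must then be in order too (the “C<E : done” step in the example above).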

Enough of Bubblesort… you are probably either lost in details or disgusted with this horribly stupid idea of optimizing and not optimizing Bubblesort at the same time.

I also made a Selectionsort option. And for a given data size I allowed it either to sort bytes or 32-bit words (which is 16 times faster for the same data size: 4 times fewer elements, and the cost is quadratic).

The test machines
I gathered 11 different test machines, with different cache sizes and instruction sets:

	QNAP	wdr3600	ac20i	Rpi	Rpi 2	wdr4900	G4	Celeron	Xeon	Athlon	i5
								~2007   ~2010   ~2013
============================================================================================
L1	32	32	32	16	?	32	64	32	32	128	32
L2				128	?	256	256	512	6M	1024	256
L3							1024				6M
Mhz	500	560	580	700	900	800	866	900	2800	3000	3100
CPU	ARMv5	Mips74K	Mips24K	ARMv6	ARMv7	PPC	PPC	x86	x64	x64	x64
OS	Debian	OpenWrt	OpenWrt	OpenWrt	OpenWrt	OpenWrt	Debian	Ubuntu	MacOSX	Ubuntu	Windows

Note that for the multi-core machines (Xeon, Athlon, i5) the L2/L3 caches may or may not be shared between cores, so the numbers above are a little ambiguous. The sizes given should be for the data cache where it is separate from the instruction cache.

The benchmarks
I ran Bubblesort for sizes from 1000000 bytes down to 1000000/512. For Selectionsort I just ran three sizes. For Bubblesort I also ran 2000000 and 4000000, but those times are divided by 4 and 16 respectively to be comparable (the algorithms are O(N²), so doubling N quadruples the time). All times are in seconds.

Bubblesort

	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
============================================================================================
4000000	1248	1332	997	1120	396	833		507	120	104	93
2000000	1248	1332	994	1118	386	791	553	506	114	102	93
1000000	1274	1330	1009	1110	367	757	492	504	113	96	93
500000	1258	1194	959	1049	352	628	389	353	72	74	63
250000	1219	1116	931	911	351	445	309	276	53	61	48
125000	1174	1043	902	701	349	397	287	237	44	56	41
62500	941	853	791	573	349	373	278	218	38	52	37
31250	700	462	520	474	342	317	260	208	36	48	36
15625	697	456	507	368	340	315	258	204	35	49	35
7812	696	454	495	364	340	315	256	202	34	49	35
3906	696	455	496	364	340	315	257	203	34	47	35
1953	698	456	496	365	342	320	257	204	35	45	35

Selectionsort

	QNAP	wdr3600	ac20i	rpi	rpi2	wdr4900	G4	Celeron	Xeon	Athlon	i5
============================================================================================
1000000	1317	996	877	1056	446	468	296	255	30	45	19
31250	875	354	539	559	420	206	147	245	28	40	21
1953	874	362	520	457	422	209	149	250	30	41	23

Theoretically, all timings for a single machine should be equal. The differences can largely be explained by cache sizes, but obviously there is more going on here.

Findings
Mostly the data makes sense. The caches create plateaus, and the L1 size can almost be predicted from the data. I would have expected even bigger differences between best and worst cases; now it is in the range 180%-340%. The most surprising thing (?) is the Selectionsort results. They are sometimes a lot faster (G4, i5) and sometimes significantly slower! This is strange: I have no idea why.

I believe the i5's superior performance on Selectionsort 1000000 is due to its cache and branch prediction.

I note that the QNAP and Archer C20i both have DDRII memory, while the RPi has SDRAM. This seems to make a difference when the working set gets bigger.

I have also made other benchmarks where the WDR4900 was faster than the G4 – not this time.

The Raspberry Pi
What did I learn about the Raspberry Pi? Well, memory is slow and branch prediction seems bad. It is typically 10-15 times slower than the modern (Xeon, Athlon, i5) CPUs. But for large Selectionsort problems the difference is up to 40x. This starts getting close to the crap speed of Node.js. It is not hard to imagine that Node.js benefits heavily from great branch prediction and large caches – both things the RPi lacks.

What about the 128k cache? Does it work? Well, compared to the L1-only machines, the RPi's performance degrades slightly more slowly, perhaps. Not impressed.

Bubblesort vs Selectionsort
It really puzzles me that Bubblesort ever beats Selectionsort:

void bubbelsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, jm1;
  uint32_t tmp;
  for ( i=len ; i>1 ; i-- ) {
    for ( j=1 ; j<i ; j++ ) {
      jm1 = j-1;
      if ( array[jm1] > array[j] ) {
        tmp = array[jm1];
        array[jm1] = array[j];
        array[j] = tmp;
      }
    }
  }
}

void selectionsort_uint32_t(uint32_t* array, size_t len) {
  size_t i, j, best;
  uint32_t tmp;
  for ( i=1 ; i<len ; i++ ) {
    best = i-1;
    for ( j=i ; j<len ; j++ ) {
      if ( array[best] > array[j] ) {
        best = j;
      }
    }
    tmp = array[i-1];
    array[i-1] = array[best];
    array[best] = tmp;
  } 
}

Essentially, the difference is that the swap takes place outside the inner loop (once) instead of all the time. The Selectionsort should also be able to benefit from easier branch prediction and far fewer writes to memory. Perhaps compiling to assembly code would reveal something odd going on.

Power of 2 aligned data sets
I avoided using data sizes that are exact powers of two: 1024×1024 vs 1000×1000. I did this because caches are supposed to work better that way. Perhaps I will make some 1024×1024 runs some day.