Sort strings without case sensitivity

In JavaScript, I wanted to sort arrays of strings without caring about case. It was more complicated than I first thought.

The background is that I present lists like this in a GUI:

  • AMD
  • Apple
  • Gigabyte
  • IBM
  • Intel
  • Microsoft
  • MSI
  • Nokia
  • Samsung
  • Sony

I want AMD and MSI (spelled in all caps) to be sorted without respect to case. The standard sort() would put MSI before Microsoft, because an uppercase S has a lower character code than a lowercase i.
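To make the problem concrete, here is a small sketch of what the default sort() does with a few of the names from the list. The variable names are only for illustration.

```javascript
// Default sort() compares UTF-16 code units, so all of A-Z sorts before a-z.
const names = ['Microsoft', 'MSI', 'Apple', 'AMD'];
const defaultSorted = names.slice().sort();

console.log(defaultSorted); // [ 'AMD', 'Apple', 'MSI', 'Microsoft' ]
```

AMD happens to land in the right place here, but MSI ends up before Microsoft, which is not what a reader of the list expects.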

Obviously I am not the first one wanting to do this, and I found an article on Stack Overflow. It suggests the following solution:

Use toLowerCase()
You can make your own string compare function that uses toLowerCase and send it as an argument to sort():

function cmpCaseless(a,b) {
    a = a.toLowerCase();
    b = b.toLowerCase();
    if ( a < b ) return -1;
    if ( a > b ) return  1;
    return 0;
}

myStringArray.sort(cmpCaseless);

This has a number of problems. The article above mentions that it is not stable. That is probably true in some cases, but I was of course worried about performance: making two String objects for each compare should keep the garbage collector quite busy, not to mention the waste of copying and lowercasing potentially quite long strings when usually the first character is enough. When I started experimenting I found another, more critical flaw though: in Swedish we have three extra letters in the alphabet, Å, Ä and Ö, in that order. The above cmpCaseless orders them Ä, Å, Ö, which sounds like a small problem, but it is simply unacceptable.
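The Swedish ordering problem can be reproduced with a few lines. This is a sketch that repeats cmpCaseless so it runs on its own; the letters are written as escapes so that the precomposed characters are used regardless of how the file is saved.

```javascript
function cmpCaseless(a, b) {
    a = a.toLowerCase();
    b = b.toLowerCase();
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

// \u00C5 = Å, \u00C4 = Ä, \u00D6 = Ö (Swedish alphabet order: Å, Ä, Ö)
const swedish = ['\u00D6', '\u00C5', '\u00C4'];
const byCodeUnit = swedish.slice().sort(cmpCaseless);

// The < and > operators compare code units: ä is 228, å is 229, ö is 246.
console.log(byCodeUnit); // [ 'Ä', 'Å', 'Ö' ]
```

The result follows the character codes, not the Swedish alphabet, so Ä comes out ahead of Å.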

Use localeCompare
There is a more competent (or so I thought, read on) way to compare strings in JavaScript: the localeCompare function. This one simply treats A, Å, Ä and O, Ö as the same character, which is even less acceptable than the toLowerCase problem.

However, it also has a “locales” option (a second optional argument). If I set it to ‘sv’ I get the sort order that I want, but performance is horrible. And I still have to use toLowerCase as well as localeCompare:

function localeCompare(a,b) {
    return a.toLowerCase().localeCompare(b.toLowerCase());
}

function localeCompare_sv(a,b) {
    return a.toLowerCase().localeCompare(b.toLowerCase(), 'sv');
}

localeCompare() has an extra options argument with a “sensitivity” parameter, but it is no good for our purposes.
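For reference, here is a rough sketch of what the sensitivity values do, assuming a JavaScript engine with Intl support (browsers and current Node.js builds). It uses the 'en' locale only to keep the example predictable, and it is not a recommendation.

```javascript
// 'accent': letters that differ only in case compare as equal,
//           letters that differ in accent do not.
// 'base':   differences in both case and accent are ignored.
const caseOnly = 'a'.localeCompare('A', 'en', { sensitivity: 'accent' });
const accentKept = 'a'.localeCompare('\u00E1', 'en', { sensitivity: 'accent' }); // a vs á
const accentDropped = 'a'.localeCompare('\u00E1', 'en', { sensitivity: 'base' });

console.log(caseOnly === 0);      // true
console.log(accentKept !== 0);    // true
console.log(accentDropped === 0); // true
```

Which letters count as "the same base letter" depends on the locale passed in, so the result for Å, Ä and Ö is not the same for 'en' as for 'sv'.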

Rolling my own
Of course, I ended up building my own function to do caseless string compare. The strategy is to compare one character at a time, not making any new String objects, and to fall back to localeCompare only if both characters are outside the ASCII range (character code above 127):

function custom(a,b) {
    var i, al, bl, l;
    var ac, bc;
    al = a.length;
    bl = b.length;
    l = al < bl ? al : bl;
        
    for ( i=0 ; i<l ; i++ ) {
        ac = a.codePointAt(i);  // or charCodeAt() for better compatibility
        bc = b.codePointAt(i);
        if ( 64 < ac && ac < 91 ) ac += 32;
        if ( 64 < bc && bc < 91 ) bc += 32;
        if ( ac !== bc ) { 
            if ( 127 < ac && 127 < bc ) {
                ac = a.substr(i,1).toLowerCase();
                bc = b.substr(i,1).toLowerCase();
                if ( ac !== bc ) return ac.localeCompare(bc);
            } else {
                return ac-bc;
            }
        }
    }
    return al-bl;
}

One fascinating thing is that here I can use localeCompare() without 'sv'.
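As a usage sketch, here is a variant of the function that follows the comment in the code and uses charCodeAt() (and charAt() instead of substr()). The name customCompat is made up for this example; the logic is otherwise the same as custom above.

```javascript
// Hypothetical variant of custom() using charCodeAt()/charAt().
function customCompat(a, b) {
    var al = a.length, bl = b.length;
    var l = al < bl ? al : bl;
    var i, ac, bc;

    for (i = 0; i < l; i++) {
        ac = a.charCodeAt(i);
        bc = b.charCodeAt(i);
        // Fold ASCII A-Z (65-90) onto a-z (97-122).
        if (64 < ac && ac < 91) ac += 32;
        if (64 < bc && bc < 91) bc += 32;
        if (ac !== bc) {
            if (127 < ac && 127 < bc) {
                // Both characters are outside ASCII: let the locale decide.
                ac = a.charAt(i).toLowerCase();
                bc = b.charAt(i).toLowerCase();
                if (ac !== bc) return ac.localeCompare(bc);
            } else {
                return ac - bc;
            }
        }
    }
    // Equal up to the shorter length: the shorter string sorts first.
    return al - bl;
}

var companies = ['Sony', 'MSI', 'Intel', 'AMD', 'Microsoft',
                 'IBM', 'Nokia', 'Apple', 'Samsung', 'Gigabyte'];
companies.sort(customCompat);

console.log(companies.join(', '));
// AMD, Apple, Gigabyte, IBM, Intel, Microsoft, MSI, Nokia, Samsung, Sony
```

This list only contains ASCII letters, so the localeCompare branch is never reached and the result does not depend on the browser's locale.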

Test for yourself
I built a simple webpage where you can test everything yourself.

Conclusion
Defining a string sort order is not trivial when you don't just have ASCII characters. Even within ASCII, if you look at the ASCII table you see that the non-alphabetic characters are spread out:

  • SPACE, #, 0-9, and many more come before both A-Z and a-z
  • Underscore: _, and a few other characters come after A-Z but before a-z
  • Pipe: | and a few other characters come after A-Z and a-z
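The three groups above can be checked with the default sort(), which orders single ASCII characters by their code. This is just a small illustration with one character picked from each group.

```javascript
// Codes: space 32, # 35, 1 49, A 65, _ 95, a 97, | 124
const asciiSample = ['a', '|', 'A', ' ', '_', '1', '#'];
const asciiSorted = asciiSample.slice().sort();

console.log(JSON.stringify(asciiSorted));
// [" ","#","1","A","_","a","|"]
```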

When it comes to characters beyond ASCII 127, it just gets more complicated: how do you sort European-language Latin letters, Greek letters, arrows and other symbols?

For this reason, I think it makes sense, if it really matters in your application, to define your own sorting function and clearly define the behaviour for the characters you know that you care about.

My function above is significantly faster than the other options.

Disclaimer
These results may well differ between web browsers.
