stristr, case insensitive string search

Download source code and demo - 27 Kb




I can't count how many times I have rambled over the lack of a case-insensitive implementation of strstr, the C run-time function that performs a search of a string within another. And, as I move slowly but firmly into the Windows search area, I thought that would have been good to implement just that.

A nice benefit is that not only the function can be used as is, it can now also be added to a MFC CString-derived class for instance, P/Invoked in a C# code, and so on.

Anytime I had to perform a case-insensitive lookup in the past, I would do one of those following 2 kind of horrible things (and that's a euphemism) only because of the absence of such function or method :

  • I would turn the searched buffer into lowercase, then the string to search for to lowercase, and then rely on the usual lookup method. With MFC code it goes something like this :
    long Lookup(CString &szOriginalBuffer, CString &szString)
      CString szBuffer = szOriginalBuffer;
      CString szStringLower = szString;
      return szBuffer.Find(szStringLower,0);
  • Or just try to match the characters by hand, one by one, resulting in an even more horrible piece of code

All of this is OVER.

What's odd is that not only the C run-time doesn't bring such function to the game, there is no such equivalent either in the MFC CString class, nor in the STL basicstring, nor in the C# String class. Simply put, amazing!


x86 assembly source code

With this implentation of stristr, I hope to fill the gap once for all. The current implementation is not really optimized, that's really a POC (Proof of concept), but the performance we have with it is orders of magnitude different than higher-level languages since the use of assembly registers is maximized, something that the compiler cannot achieve automagically. And of course, the bonus is also that you need not create a dummy buffer, copy your arbitrary size buffer in it, and turn it to lowercase only to be able to call a lookup method. The entire lookup in stristr is performed directly and in read-only mode over the passed parameters.

Here we go with the code :

#pragma warning(disable : 4035)

// stristr /////////////////////////////////////////////////////////
// performs a case-insensitive lookup of a string within another
// (see C run-time strstr)
// str1 : buffer
// str2 : string to search for in the buffer
// example char* s = stristr("Make my day","DAY");
// S.Rodriguez, Jan 11, 2004
char* stristr(char* str1, char* str2)
    mov ah, 'A'
    mov dh, 'Z'

    mov esi, str1
    mov ecx, str2
    mov dl, [ecx]
    test dl,dl ; NULL?
    jz short str2empty_label

    mov ebx, esi ; save esi
    inc ebx
    mov al, [esi]
    inc esi
    test al,al
    je short str2notfound_label ; not found!

    cmp dl,ah           ; 'A'
    jb short skip1
    cmp dl,dh           ; 'Z'
    ja short skip1
    add dl,'a' - 'A'    ; make lowercase the current character in str2

    cmp al,ah           ; 'A'
    jb short skip2
    cmp al,dh           ; 'Z'
    ja short skip2
    add al,'a' - 'A'    ; make lowercase the current character in str1

    cmp al,dl
    je short onecharfound_label
    mov esi, ebx ; restore esi value, +1
    mov ecx, str2 ; restore ecx value as well
    mov dl, [ecx]
    jmp short outerloop_label ; search from start of str2 again
    inc ecx
    mov dl,[ecx]
    test dl,dl
    jnz short innerloop_label
    jmp short str2found_label ; found!
    mov eax, esi // empty str2 ==> return str1
    jmp short ret_label
    dec ebx
    mov eax, ebx // str2 found ==> return occurence within str1
    jmp short ret_label
    xor eax, eax // str2 nt found ==> return NULL
    jmp short ret_label


#pragma warning(default : 4035)


Using it in your own source code

The demo code provided in the zip package contains a benchmark suite that find occurences of a given string in an html file using a couple ways. The benchmark purpose is only to know if the additional code branches in stristr are significantly slower than the implementation of strstr from the C run-time, and another implementation of strstr of my own. That said, when strstr, it doesn't perform case insensitive search, so it just a matter of speed comparison, not actual feature.

I also provide a reference C-based case insensitive search. Since it's not written with assembly code, it won't be compiled into code that maximizes the use of registry rather than the data segment. But this is how CPU with a great amount of L2 cache would come into play.



Stephane Rodriguez- February 6, 2004.