Fast string searchDownload source code (69 kb)This article provides an insight on performance over efficient algorithms meant to search strings within text (exact matching), and uses different flavor of languages including C, C++, C# and Java. The results are astonishing.
What for?Why spend time on string search? Isn't this supposed to be provided by the core run-times? Is it worth any investment? To all those questions the answer is pretty clear : it depends on what you are trying to achieve. Clearly, if you only want to provide search capabilities in the text edit component of your app, then you might just as well be off with the existing API available for you. Regardless the language you are using, you will always find basic algorithms for just that, so why bother? Bothering begins when you need high performing processes. It's utterly fine you can find content among your myriad of documents, but if this takes as much as 10 minutes before you get the results page, then you won't do it often. On the web, the success of google is just as much the relevant character of results as well as the response time. So yes, for some application it matters. If you are writing parsers, it matters. Almost 100% developers I can think of have to write parsers at one time or another in the development life cycle, so why not take the best algorithm for that matter, or why not rely on an efficient library? The additional reason for that article is that I wanted to benchmark technologies as they are evolving. Java introduced a new full object orientation in the mid 90s, and now Microsoft is taking this and reintroducing it in the .NET development platform. So it matters a lot, imagine that Java or .NET equivalent of the former algorithms are 10 or 20 slower, doesn't it hurt the adoption of these technologies? Why should people, including developers and business people invest in technologies that only add bloat to problems trying to get solved? In other words, does Java and .NET really add up anything? Are they part of the problem, or part of the solution?
Introducing algorithmsAlgorithms used are either standard, that is part of standard run-time libraries, or deliberately resulting from individual implementations. In fact, none of those individual implementations are optimized, the code was left as simple as it could be as a way to be able to assess the brute force comparison with highly optimized ones from standard libraries. No less than 4 languages have been used. Although Java and .NET code internally use Unicode to store text, they arguably compare with the C/C++ single-byte algorithms I have used since both can look-up characters arbitrarily, unlike UTF8 for instance. For the C language, I have used strstr from the MS C run-time. It's written with assembly, that is optimized code. In other words, it uses x86 registers instead of memory stacks or memory data segs any time it is possible. Although assembly code looks odd these days, no one (except teams working on games may be) would ever rely on that language anymore, it has produced tremendous results in the tests I have performed. For the C++ language, I have used 3 algorithms. They are not optimized. The first is naive string search, it looks characters one after the other, until the whole text to search is all traversed. The second is based on Donald Knuth - Morris - Pratt work. The third is based on Boyer - Moore work. The second and third algorithms deserve explanations, that you'll find here (iteration by iteration tutorial) and here (explanations, time/memory cost, descriptive algorithm, C++ algorithm). In the remainder of the article, the second algorithm shall be referenced by KMP, while the third algorithm shall be by BM. In essence, KMP takes advantage of former lookups to skip characters, but characters in the text to perform the search are never skipped. BM pushes that beyond by skipping characters whenever possible, for instance any time a character within reach is not part of the pattern being searched for, then the whole range of characters is skipped. I have deliberately not used variants of those algorithms, although they provide improvements. C++ algorithms provide two interesting comparisons : are clever algorithms worth it? is the C++ compiler able to produce code equivalent to the highly optimized C routine? In Java, I have used the default and standard way to look for a pattern within a text. That's implemented by the For .NET, I have used C# even though I could have taken any other .NET language. I rely on the String class from the base class library. I have used the
The test suiteThe test suite consists in choosing one algorithm and executing it a great amount of times, great enough so that the cost of initializing the variables is not significant. For instance, the time required to start the Java VM or the CLR VM is not taken into account (although it can account in the real world!). In the downloadable code, the test suite executes 1000 unit tests. The unit test consists in finding all occurences of 5 patterns in a text. The text is chosen to be generic enough so not to influence one algorithm or another. It's 48kb, providing sufficient lookup iterations to allow algorithms to play at their best. Patterns are chosen to allow further interesting comparisons. In terms of occurences, the patterns are as below :
Astonishing resultsThe test suite was executed on a Celeron 800Mhz with 256MB of RAM. No other process was running at the same time. Of course, the test suite was executing a release build of the code, for all languages. Results are below. The number of the duration for the test suite to perform 1000 times all unit tests.
First impressionsOf course results may vary with your machine if you do the test, but what matters is the ratio between algorithms, not the actual absolute value. For a simple and often used function call, the .NET code (whether being ngen'd or not) is 20 times slower than the most optimized (and most brutal) native code. So performance leaves a lot to desire. Other conclusions are :
Looking at the code.1 strstr implementationfindnext: mov esi,edi ; esi = edi = pointers to somewhere in str1 mov ecx,[esp + 14h] ; str2 ;use edi instead of esi to eliminate AGI mov al,[edi] ; al is next char from str1 add esi,1 ; increment pointer into str1 cmp al,dl je first_char_found test al,al ; end of str1? jz not_found ; yes, and no match has been found loop_start: mov al,[esi] ; put next char from str1 into al add esi,1 ; increment pointer in str1 in_loop: cmp al,dl je first_char_found test al,al ; end of str1? jnz loop_start ; no, go get another char from str1 not_found: pop esi pop ebx pop edi xor eax,eax ret If you don't know assembly code, let me tell you in short that closed brackets reflect memory access. The more memory access, the slower the code. Even though the .2 C++ asm code generated for the naive implementation(only an excerpt of the code, two lines of C++ code are dissected) ; 207 : LPTSTR p = m_cpPattern; 00030 8b 45 f8 mov eax, DWORD PTR _this$[ebp] 00033 8b 48 10 mov ecx, DWORD PTR [eax+16] 00036 89 4d ec mov DWORD PTR _p$64824[ebp], ecx ; 208 : long i = 0; 00039 c7 45 e0 00 00 00 00 mov DWORD PTR _i$64825[ebp], 0 $L64827: ; 209 : while (*(m_cpPtr+i) && *p == *(m_cpPtr+i)) 00040 8b 45 f8 mov eax, DWORD PTR _this$[ebp] 00043 8b 48 0c mov ecx, DWORD PTR [eax+12] 00046 8b 55 e0 mov edx, DWORD PTR _i$64825[ebp] 00049 0f be 04 11 movsx eax, BYTE PTR [ecx+edx] 0004d 85 c0 test eax, eax 0004f 74 2b je SHORT $L64828 00051 8b 45 ec mov eax, DWORD PTR _p$64824[ebp] 00054 0f be 08 movsx ecx, BYTE PTR [eax] 00057 8b 55 f8 mov edx, DWORD PTR _this$[ebp] 0005a 8b 42 0c mov eax, DWORD PTR [edx+12] 0005d 8b 55 e0 mov edx, DWORD PTR _i$64825[ebp] 00060 0f be 04 10 movsx eax, BYTE PTR [eax+edx] 00064 3b c8 cmp ecx, eax 00066 75 14 jne SHORT $L64828 .3 Rotor code (non commercial .NET VM and base class library)INT32 COMNlsInfo::FastIndexOfString(WCHAR *source, INT32 startIndex, INT32 endIndex, WCHAR *pattern, INT32 patternLength) { int endPattern = endIndex - patternLength + 1; if (endPattern<0) { return -1; } if (patternLength <= 0) { return startIndex; } WCHAR patternChar0 = pattern[0]; for (int ctrSrc = startIndex; ctrSrc<=endPattern; ctrSrc++) { if (source[ctrSrc] != patternChar0) continue; int ctrPat; for (ctrPat = 1; (ctrPat < patternLength) && (source[ctrSrc + ctrPat] == pattern[ctrPat]); ctrPat++) { ; // just loop } if (ctrPat == patternLength) { return (ctrSrc); } } return (-1); } .4 Java code (actual VM and base class library)
Stéphane Rodriguez December 22, 2003.
|
Home Blog |