Showing posts with label benchmark. Show all posts
Showing posts with label benchmark. Show all posts

Saturday, May 29, 2010

When linear search is faster than binary

It is often stated that a binary search algorithm performs better than linear search. Indeed, let's assume that the goal is to find the index of the largest integer in the sorted zero-based array A that is less than or equal to the given integer X. For simplicity, let's also assume that the size of array is N = 2K, that for all possible inputs A[0] <= X and that one can also put a sentinel value after the end, so that X < A[N] for all possible inputs X. E.g., if it is known that X always fits in 16 bits, one can put 65536 as the sentinel.

The assumptions above are true for the entropy decoder found in the Monkey's Audio codec. In that case, N = 64, X is unsigned and always fits in 16 bits, A[0] = 0, A[64] = 65536.

This is linear search:


A[N] = sentinel;
index = 0;
while (A[index + 1] <= X)
    ++index;


This is binary search:


int bit = N >> 1;  /* N is a power of two */
index = 0;
while (bit != 0) {
    if (A[index + bit] <= X)
        index += bit;
    bit >>= 1;
}


If the size of array to be searched is N, then the average case complexity is commonly stated to be O(log(N)) for binary search and O(N) for linear search. So, for the Monkey's Audio case presented above, one would expect, on average, 32 loop iterations for the linear search algorithm and 6 iterations for the binary search.

However, there is one key fact that makes the estimate above invalid. Such simple averaging is valid only if each result appears with the same probability. For Monkey's Audio entropy decoder, this is not the case.

In fact, out of the 64 possible results, "0" appears with 31% probability, and the total probability of the first 6 results is 87%. So, it is quite natural that linear search is faster than binary in this case, because it often stops very early.

Sunday, September 6, 2009

Matching multiple strings

Sometimes, it is necessary to check efficiently which of the many input strings contain one of the few pre-defined substrings.

The simplest approach



This task, of course, can be solved inefficiently by brute force, i.e. by passing each substring to strstr(), as demonstrated in the following benchmark that tries to search for known names of linux games in the same string (that actually doesn't match) 100000 times:


/* compile as follows: gcc -o benchmark1 benchmark1.c */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int check(const char* p)
{
int i;
for (i = 0; i < sizeof(words)/sizeof(words[0]); i++) {
if (strstr(p, words[i]) != NULL)
return 0;
}
return -1;
}

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
int result;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = check(text);

gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


On my desktop, this benchmark takes 2925 ms.

Aho-Corasick algorithm



Special fast algorithms exist for the same task. All of them preprocess the given set of substrings and use the result of such preprocessing when deciding whether the input string matches. One of such ahorithms is the Aho-Corasick algorithm. A free (BSD-licensed) implementation is available for download. Unfortunately, this implementation operates on wide-character strings and thus is not directly comparable with the strstr() approach. Also, if one of the substrings matches the very beginning of the input string, one cannot distinguish this situation with CSuffixTrie::SearchAhoCorasik() from non-match by looking only atCSuffixTrie::_DataFound::iFoundPosition. To compensate for this, I have extracted the SuffixTrie.{h,cpp} files from the archive and applied the following edits:


--- SuffixTrie.h
+++ SuffixTrie.h
@@ -51,7 +51,7 @@
{
public:
//Our string type
- typedef std::wstring SearchString;
+ typedef std::string SearchString;

//Data returned from our search
typedef struct _DataFound
@@ -107,7 +107,7 @@
virtual ~CSuffixTrie();
private:
//Our char search type
- typedef wchar_t SearchChar;
+ typedef char SearchChar;

//Forward declare the node
struct _Node;
--- SuffixTrie.cpp 2008-09-24 02:20:46.000000000 +0600
+++ SuffixTrie.cpp 2009-09-06 15:22:11.000000000 +0600
@@ -1,4 +1,3 @@
-#include "stdafx.h"
#include "SuffixTrie.h"

CSuffixTrie::CSuffixTrie()
@@ -136,7 +135,7 @@
void CSuffixTrie::BuildTreeIndex()
{
//Build index on the root
- BuildIndex(L"",
+ BuildIndex("",
&m_aRoot);
}

@@ -261,7 +260,7 @@
{
//Our data found
DataFound aData;
- aData.iFoundPosition=0;
+ aData.iFoundPosition=-1;

//The current string we match
SearchString sMatchedString;
@@ -295,7 +294,7 @@
pNode=&m_aRoot;

//Reset search string
- sMatchedString=L"";
+ sMatchedString="";

//Did we do a switch?
if (bSwitch)
@@ -386,7 +385,7 @@
pNode=&m_aRoot;

//Reset search string
- sMatchedString=L"";
+ sMatchedString="";

//Did we do a switch?
if (bSwitch)
@@ -439,7 +438,7 @@
iCount-=sMatchedString.length()-1;

//Reset the data
- sMatchedString=L"";
+ sMatchedString="";
}
}

@@ -495,7 +494,7 @@

//Start to build the trie
BuildStrings(aVector,
- L"",
+ "",
&m_aRoot);

//Done


Here is a benchmark:


/* compile as follows: g++ -O2 -o benchmark2 benchmark2.cpp SuffixTrie.cpp */
#include "SuffixTrie.h"
#include <stdio.h>
#include <string>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;

CSuffixTrie aTree;
int i;
for (i = 0; i < sizeof(words)/sizeof(words[0]); i++)
aTree.AddString(words[i]);

aTree.BuildTreeIndex();

std::string stext = text;
int pos;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++) {
pos = aTree.SearchAhoCorasik(stext).iFoundPosition;
}
gettimeofday(&end, NULL);

if (pos == -1)
printf("The string didn't match\n");
else
printf("The string matched\n");

usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}



On my desktop, this benchmark takes 326 ms. I.e., this is indeed faster than the simple strstr()-based approach.

Regular expressions



One may wonder why the Aho-Corasick algorithm is not available in the standard C library. The answer is that a more general mechanism, based on a similar idea, is available: regular expressions. One first constructs a regular expression by escaping and joining the given substrings using "|" as the separator, then compiles this regular expression with regcomp() and uses the result in regexec(). Here is a benchmark:


/* compile as follows: gcc -o benchmark3 benchmark3.c */
#include <stdio.h>
#include <regex.h>
#include <string.h>
#include <sys/time.h>

const char* re = "(actioncube|alien arena|astromenace|armagetron|"
"assault cube|atomorun2008|battle for wesnoth|blob wars|breakout|"
"briquolo|bzflag|celetania|conquest|enigma|freeciv|freecol|"
"freeorion|freetrain|glest|hedgewars|heretic|hexen|jardinanis|"
"lordsawar|lxdream|maniadrive|neverball|nexuiz|oolite|openarena|"
"opencity|openmw|open quartz|openttd|paintball 2|planeshift|"
"postal 2|sacred|scorched 3d|smokin' guns|supertux|supertuxkart|"
"teeworlds|tremulous|urban terror|vegastrike|warsow|wormux|x-moto)";

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
regex_t reg;
int result;

result = regcomp(&reg, re, REG_EXTENDED | REG_NOSUB);
if (result)
return 1;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = regexec(&reg, text, 0, NULL, 0);
gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


On this linux x86 system with glibc-2.10.1, it takes 312 ms, i.e., as good as the Aho-Corasick implementation referenced above.

On other platforms, the result may be different, because the system C library uses a different regular expression engine. E.g., the same benchmark, when compiled with TRE (replace <regex.h> with <tre/regex.h> and add -ltre to the gcc command line), takes 4158 ms, i.e., more than the brute-force approach. Since NetBSD will use TRE in the next release, this unsatisfactory result will likely manifest itself there. So, if you want to give the same performance guarantees on all platforms, don't use regcomp()/regexec().

The same note about unsatisfactory speed (8884 ms here) applies to PCRE. To repeat this test, replace <regex.h> with <pcreposix.h> and add -lpcreposix to gcc command line.

Generated code



It is possible to match multiple substrings much faster, if they are known at compile time. The trick is to convert the regular expression to C code with a tool such as re2c. Here is a benchmark of this approach:


/* compile as follows:
re2c benchmark4.re >benchmark4.c
gcc -O1 -o benchmark4 benchmark4.c */
#include <stdio.h>
#include <sys/time.h>

int check(const char* p)
{
const unsigned char* marker;
/*!re2c
re2c:define:YYCTYPE = "unsigned char";
re2c:define:YYCURSOR = p;
re2c:yyfill:enable = 0;
re2c:define:YYMARKER = marker;
re2c:yych:conversion = 1;
[\001-\377]*( "actioncube" | "alien arena" |
"astromenace" | "armagetron" |
"assault cube" | "atomorun2008" |
"battle for wesnoth" | "blob wars" |
"breakout" | "briquolo" | "bzflag" |
"celetania" | "conquest" | "enigma" |
"freeciv" | "freecol" | "freeorion" |
"freetrain" | "glest" | "hedgewars" |
"heretic" | "hexen" | "jardinanis" |
"lordsawar" | "lxdream" | "maniadrive" |
"neverball" | "nexuiz" | "oolite" |
"openarena" | "opencity" | "openmw" |
"open quartz" | "openttd" | "paintball 2" |
"planeshift" | "postal 2" | "sacred" |
"scorched 3d" | "smokin' guns" |
"supertux" | "supertuxkart" |
"teeworlds" | "tremulous" |
"urban terror" | "vegastrike" |
"warsow" | "wormux" |
"x-moto" ) { return 0; }
"" { return 1; }
*/
}

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
int result;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = check(text);
gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


When compiled with -O1 or -O3, this benchmark runs 12 ms here (i.e., almost 250 times faster than the brute-force approach). However, the -O2 optimization option makes the code slower (it runs 16 ms).