Sunday, September 6, 2009

Matching multiple strings

Sometimes, it is necessary to check efficiently which of the many input strings contain one of the few pre-defined substrings.

The simplest approach



This task, of course, can be solved inefficiently by brute force, i.e. by passing each substring to strstr(), as demonstrated in the following benchmark that tries to search for known names of linux games in the same string (that actually doesn't match) 100000 times:


/* compile as follows: gcc -o benchmark1 benchmark1.c */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int check(const char* p)
{
int i;
for (i = 0; i < sizeof(words)/sizeof(words[0]); i++) {
if (strstr(p, words[i]) != NULL)
return 0;
}
return -1;
}

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
int result;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = check(text);

gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


On my desktop, this benchmark takes 2925 ms.

Aho-Corasick algorithm



Special fast algorithms exist for the same task. All of them preprocess the given set of substrings and use the result of such preprocessing when deciding whether the input string matches. One of such ahorithms is the Aho-Corasick algorithm. A free (BSD-licensed) implementation is available for download. Unfortunately, this implementation operates on wide-character strings and thus is not directly comparable with the strstr() approach. Also, if one of the substrings matches the very beginning of the input string, one cannot distinguish this situation with CSuffixTrie::SearchAhoCorasik() from non-match by looking only atCSuffixTrie::_DataFound::iFoundPosition. To compensate for this, I have extracted the SuffixTrie.{h,cpp} files from the archive and applied the following edits:


--- SuffixTrie.h
+++ SuffixTrie.h
@@ -51,7 +51,7 @@
{
public:
//Our string type
- typedef std::wstring SearchString;
+ typedef std::string SearchString;

//Data returned from our search
typedef struct _DataFound
@@ -107,7 +107,7 @@
virtual ~CSuffixTrie();
private:
//Our char search type
- typedef wchar_t SearchChar;
+ typedef char SearchChar;

//Forward declare the node
struct _Node;
--- SuffixTrie.cpp 2008-09-24 02:20:46.000000000 +0600
+++ SuffixTrie.cpp 2009-09-06 15:22:11.000000000 +0600
@@ -1,4 +1,3 @@
-#include "stdafx.h"
#include "SuffixTrie.h"

CSuffixTrie::CSuffixTrie()
@@ -136,7 +135,7 @@
void CSuffixTrie::BuildTreeIndex()
{
//Build index on the root
- BuildIndex(L"",
+ BuildIndex("",
&m_aRoot);
}

@@ -261,7 +260,7 @@
{
//Our data found
DataFound aData;
- aData.iFoundPosition=0;
+ aData.iFoundPosition=-1;

//The current string we match
SearchString sMatchedString;
@@ -295,7 +294,7 @@
pNode=&m_aRoot;

//Reset search string
- sMatchedString=L"";
+ sMatchedString="";

//Did we do a switch?
if (bSwitch)
@@ -386,7 +385,7 @@
pNode=&m_aRoot;

//Reset search string
- sMatchedString=L"";
+ sMatchedString="";

//Did we do a switch?
if (bSwitch)
@@ -439,7 +438,7 @@
iCount-=sMatchedString.length()-1;

//Reset the data
- sMatchedString=L"";
+ sMatchedString="";
}
}

@@ -495,7 +494,7 @@

//Start to build the trie
BuildStrings(aVector,
- L"",
+ "",
&m_aRoot);

//Done


Here is a benchmark:


/* compile as follows: g++ -O2 -o benchmark2 benchmark2.cpp SuffixTrie.cpp */
#include "SuffixTrie.h"
#include <stdio.h>
#include <string>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;

CSuffixTrie aTree;
int i;
for (i = 0; i < sizeof(words)/sizeof(words[0]); i++)
aTree.AddString(words[i]);

aTree.BuildTreeIndex();

std::string stext = text;
int pos;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++) {
pos = aTree.SearchAhoCorasik(stext).iFoundPosition;
}
gettimeofday(&end, NULL);

if (pos == -1)
printf("The string didn't match\n");
else
printf("The string matched\n");

usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}



On my desktop, this benchmark takes 326 ms. I.e., this is indeed faster than the simple strstr()-based approach.

Regular expressions



One may wonder why the Aho-Corasick algorithm is not available in the standard C library. The answer is that a more general mechanism, based on a similar idea, is available: regular expressions. One first constructs a regular expression by escaping and joining the given substrings using "|" as the separator, then compiles this regular expression with regcomp() and uses the result in regexec(). Here is a benchmark:


/* compile as follows: gcc -o benchmark3 benchmark3.c */
#include <stdio.h>
#include <regex.h>
#include <string.h>
#include <sys/time.h>

const char* re = "(actioncube|alien arena|astromenace|armagetron|"
"assault cube|atomorun2008|battle for wesnoth|blob wars|breakout|"
"briquolo|bzflag|celetania|conquest|enigma|freeciv|freecol|"
"freeorion|freetrain|glest|hedgewars|heretic|hexen|jardinanis|"
"lordsawar|lxdream|maniadrive|neverball|nexuiz|oolite|openarena|"
"opencity|openmw|open quartz|openttd|paintball 2|planeshift|"
"postal 2|sacred|scorched 3d|smokin' guns|supertux|supertuxkart|"
"teeworlds|tremulous|urban terror|vegastrike|warsow|wormux|x-moto)";

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
regex_t reg;
int result;

result = regcomp(&reg, re, REG_EXTENDED | REG_NOSUB);
if (result)
return 1;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = regexec(&reg, text, 0, NULL, 0);
gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


On this linux x86 system with glibc-2.10.1, it takes 312 ms, i.e., as good as the Aho-Corasick implementation referenced above.

On other platforms, the result may be different, because the system C library uses a different regular expression engine. E.g., the same benchmark, when compiled with TRE (replace <regex.h> with <tre/regex.h> and add -ltre to the gcc command line), takes 4158 ms, i.e., more than the brute-force approach. Since NetBSD will use TRE in the next release, this unsatisfactory result will likely manifest itself there. So, if you want to give the same performance guarantees on all platforms, don't use regcomp()/regexec().

The same note about unsatisfactory speed (8884 ms here) applies to PCRE. To repeat this test, replace <regex.h> with <pcreposix.h> and add -lpcreposix to gcc command line.

Generated code



It is possible to match multiple substrings much faster, if they are known at compile time. The trick is to convert the regular expression to C code with a tool such as re2c. Here is a benchmark of this approach:


/* compile as follows:
re2c benchmark4.re >benchmark4.c
gcc -O1 -o benchmark4 benchmark4.c */
#include <stdio.h>
#include <sys/time.h>

int check(const char* p)
{
const unsigned char* marker;
/*!re2c
re2c:define:YYCTYPE = "unsigned char";
re2c:define:YYCURSOR = p;
re2c:yyfill:enable = 0;
re2c:define:YYMARKER = marker;
re2c:yych:conversion = 1;
[\001-\377]*( "actioncube" | "alien arena" |
"astromenace" | "armagetron" |
"assault cube" | "atomorun2008" |
"battle for wesnoth" | "blob wars" |
"breakout" | "briquolo" | "bzflag" |
"celetania" | "conquest" | "enigma" |
"freeciv" | "freecol" | "freeorion" |
"freetrain" | "glest" | "hedgewars" |
"heretic" | "hexen" | "jardinanis" |
"lordsawar" | "lxdream" | "maniadrive" |
"neverball" | "nexuiz" | "oolite" |
"openarena" | "opencity" | "openmw" |
"open quartz" | "openttd" | "paintball 2" |
"planeshift" | "postal 2" | "sacred" |
"scorched 3d" | "smokin' guns" |
"supertux" | "supertuxkart" |
"teeworlds" | "tremulous" |
"urban terror" | "vegastrike" |
"warsow" | "wormux" |
"x-moto" ) { return 0; }
"" { return 1; }
*/
}

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
struct timeval start, end;
int usec_spent;
int i;
int result;

gettimeofday(&start, NULL);
for (i = 0; i < 100000; i++)
result = check(text);
gettimeofday(&end, NULL);
if (result == 0)
printf("The string matched\n");
else
printf("The string didn't match\n");
usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
usec_spent += (end.tv_usec - start.tv_usec);
printf("Spent %d ms\n", usec_spent / 1000);
return 0;
}


When compiled with -O1 or -O3, this benchmark runs 12 ms here (i.e., almost 250 times faster than the brute-force approach). However, the -O2 optimization option makes the code slower (it runs 16 ms).

3 comments:

GByte said...

Very interest post! :)

Alexander Patrakov said...

Updated timings, now for x86_64 with gcc-4.4.3:

Brute-force approach: 1889 ms

Aho-Corasick algorithm: 267 ms

Regular expression with glibc: 223 ms

Regular expression with TRE: 3633 ms

Regular expression with PCRE: 6298 ms

[NEW] Regular expression with Google RE2: 62 ms

re2c at -O1: 13 ms

re2c at -O2 and -O3: 11 ms

Cstarter said...

Hello,

Nice post about re2c ! although I was wondering if you knew how to make a condition that returns the number of different words found in the text.

Thanks