My Blog: Matching multiple strings

Sometimes, it is necessary to check efficiently which of the many input strings contain one of the few pre-defined substrings.

The simplest approach

This task, of course, can be solved inefficiently by brute force, i.e. by passing each substring to strstr(), as demonstrated in the following benchmark that tries to search for known names of linux games in the same string (that actually doesn't match) 100000 times:


/* compile as follows: gcc -o benchmark1 benchmark1.c */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int check(const char* p)
{
    int i;
    for (i = 0; i < sizeof(words)/sizeof(words[0]); i++) {
        if (strstr(p, words[i]) != NULL)
            return 0;
    }
    return -1;
}

int main(int argc, char* argv[])
{
    struct timeval start, end;
    int usec_spent;
    int i;
    int result;

    gettimeofday(&start, NULL);
    for (i = 0; i < 100000; i++)
        result = check(text);

    gettimeofday(&end, NULL);
    if (result == 0)
        printf("The string matched\n");
    else
        printf("The string didn't match\n");
    usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
    usec_spent += (end.tv_usec - start.tv_usec);
    printf("Spent %d ms\n", usec_spent / 1000);
    return 0;
}

On my desktop, this benchmark takes 2925 ms.

Aho-Corasick algorithm

Special fast algorithms exist for the same task. All of them preprocess the given set of substrings and use the result of such preprocessing when deciding whether the input string matches. One of such ahorithms is the Aho-Corasick algorithm. A free (BSD-licensed) implementation is available for download. Unfortunately, this implementation operates on wide-character strings and thus is not directly comparable with the strstr() approach. Also, if one of the substrings matches the very beginning of the input string, one cannot distinguish this situation with CSuffixTrie::SearchAhoCorasik() from non-match by looking only atCSuffixTrie::_DataFound::iFoundPosition. To compensate for this, I have extracted the SuffixTrie.{h,cpp} files from the archive and applied the following edits:


--- SuffixTrie.h
+++ SuffixTrie.h
@@ -51,7 +51,7 @@
 {
 public:
     //Our string type
-    typedef std::wstring SearchString;
+    typedef std::string SearchString;
 
     //Data returned from our search
     typedef struct _DataFound
@@ -107,7 +107,7 @@
     virtual ~CSuffixTrie();
 private:
     //Our char search type
-    typedef wchar_t SearchChar;
+    typedef char SearchChar;
 
     //Forward declare the node
     struct _Node;
--- SuffixTrie.cpp    2008-09-24 02:20:46.000000000 +0600
+++ SuffixTrie.cpp    2009-09-06 15:22:11.000000000 +0600
@@ -1,4 +1,3 @@
-#include "stdafx.h"
 #include "SuffixTrie.h"
 
 CSuffixTrie::CSuffixTrie()
@@ -136,7 +135,7 @@
 void CSuffixTrie::BuildTreeIndex()
 {
     //Build index on the root
-    BuildIndex(L"",
+    BuildIndex("",
                &m_aRoot);
 }
 
@@ -261,7 +260,7 @@
 {
     //Our data found
     DataFound aData;
-    aData.iFoundPosition=0;
+    aData.iFoundPosition=-1;
 
     //The current string we match
     SearchString sMatchedString;
@@ -295,7 +294,7 @@
                     pNode=&m_aRoot;
 
                     //Reset search string
-                    sMatchedString=L"";
+                    sMatchedString="";
 
                     //Did we do a switch?
                     if (bSwitch)
@@ -386,7 +385,7 @@
                     pNode=&m_aRoot;
 
                     //Reset search string
-                    sMatchedString=L"";
+                    sMatchedString="";
 
                     //Did we do a switch?
                     if (bSwitch)
@@ -439,7 +438,7 @@
             iCount-=sMatchedString.length()-1;
 
             //Reset the data
-            sMatchedString=L"";
+            sMatchedString="";
         }
     }
 
@@ -495,7 +494,7 @@
 
     //Start to build the trie
     BuildStrings(aVector,
-                 L"",
+                 "",
                  &m_aRoot);
 
     //Done

Here is a benchmark:


/* compile as follows: g++ -O2 -o benchmark2 benchmark2.cpp SuffixTrie.cpp */
#include "SuffixTrie.h"
#include <stdio.h>
#include <string>
#include <sys/time.h>

const char* words[] = { "actioncube", "alien arena", "astromenace",
"armagetron", "assault cube", "atomorun2008", "battle for wesnoth",
"blob wars", "breakout", "briquolo", "bzflag", "celetania",
"conquest", "enigma", "freeciv", "freecol", "freeorion",
"freetrain", "glest", "hedgewars", "heretic", "hexen",
"jardinanis", "lordsawar", "lxdream", "maniadrive", "neverball",
"nexuiz", "oolite", "openarena", "opencity", "openmw",
"open quartz", "openttd", "paintball 2", "planeshift", "postal 2",
"sacred", "scorched 3d", "smokin' guns", "supertux", "supertuxkart",
"teeworlds", "tremulous", "urban terror", "vegastrike", "warsow",
"wormux", "x-moto" };

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
    struct timeval start, end;
    int usec_spent;

    CSuffixTrie aTree;
    int i;
    for (i = 0; i < sizeof(words)/sizeof(words[0]); i++)
        aTree.AddString(words[i]);

    aTree.BuildTreeIndex();

    std::string stext = text;
    int pos;

    gettimeofday(&start, NULL);
    for (i = 0; i < 100000; i++) {
        pos = aTree.SearchAhoCorasik(stext).iFoundPosition;
    }
    gettimeofday(&end, NULL);

    if (pos == -1)
        printf("The string didn't match\n");
    else
        printf("The string matched\n");

    usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
    usec_spent += (end.tv_usec - start.tv_usec);
    printf("Spent %d ms\n", usec_spent / 1000);
    return 0;
}

On my desktop, this benchmark takes 326 ms. I.e., this is indeed faster than the simple strstr()-based approach.

Regular expressions

One may wonder why the Aho-Corasick algorithm is not available in the standard C library. The answer is that a more general mechanism, based on a similar idea, is available: regular expressions. One first constructs a regular expression by escaping and joining the given substrings using "|" as the separator, then compiles this regular expression with regcomp() and uses the result in regexec(). Here is a benchmark:


/* compile as follows: gcc -o benchmark3 benchmark3.c */
#include <stdio.h>
#include <regex.h>
#include <string.h>
#include <sys/time.h>

const char* re = "(actioncube|alien arena|astromenace|armagetron|"
"assault cube|atomorun2008|battle for wesnoth|blob wars|breakout|"
"briquolo|bzflag|celetania|conquest|enigma|freeciv|freecol|"
"freeorion|freetrain|glest|hedgewars|heretic|hexen|jardinanis|"
"lordsawar|lxdream|maniadrive|neverball|nexuiz|oolite|openarena|"
"opencity|openmw|open quartz|openttd|paintball 2|planeshift|"
"postal 2|sacred|scorched 3d|smokin' guns|supertux|supertuxkart|"
"teeworlds|tremulous|urban terror|vegastrike|warsow|wormux|x-moto)";

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
    struct timeval start, end;
    int usec_spent;
    int i;
    regex_t reg;
    int result;

    result = regcomp(&reg, re, REG_EXTENDED | REG_NOSUB);
    if (result)
        return 1;

    gettimeofday(&start, NULL);
    for (i = 0; i < 100000; i++)
        result = regexec(&reg, text, 0, NULL, 0);
    gettimeofday(&end, NULL);
    if (result == 0)
        printf("The string matched\n");
    else
        printf("The string didn't match\n");
    usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
    usec_spent += (end.tv_usec - start.tv_usec);
    printf("Spent %d ms\n", usec_spent / 1000);
    return 0;
}

On this linux x86 system with glibc-2.10.1, it takes 312 ms, i.e., as good as the Aho-Corasick implementation referenced above.

On other platforms, the result may be different, because the system C library uses a different regular expression engine. E.g., the same benchmark, when compiled with TRE (replace <regex.h> with <tre/regex.h> and add -ltre to the gcc command line), takes 4158 ms, i.e., more than the brute-force approach. Since NetBSD will use TRE in the next release, this unsatisfactory result will likely manifest itself there. So, if you want to give the same performance guarantees on all platforms, don't use regcomp()/regexec().

The same note about unsatisfactory speed (8884 ms here) applies to PCRE. To repeat this test, replace <regex.h> with <pcreposix.h> and add -lpcreposix to gcc command line.

Generated code

It is possible to match multiple substrings much faster, if they are known at compile time. The trick is to convert the regular expression to C code with a tool such as re2c. Here is a benchmark of this approach:


/* compile as follows:
  re2c benchmark4.re >benchmark4.c
  gcc -O1 -o benchmark4 benchmark4.c */
#include <stdio.h>
#include <sys/time.h>

int check(const char* p)
{
    const unsigned char* marker;
/*!re2c
    re2c:define:YYCTYPE  = "unsigned char";
    re2c:define:YYCURSOR = p;
    re2c:yyfill:enable = 0;
    re2c:define:YYMARKER = marker;
    re2c:yych:conversion = 1;
    [\001-\377]*( "actioncube" | "alien arena" |
    "astromenace" | "armagetron" |
    "assault cube" | "atomorun2008" |
    "battle for wesnoth" | "blob wars" |
    "breakout" | "briquolo" | "bzflag" |
    "celetania" | "conquest" | "enigma" |
    "freeciv" | "freecol" | "freeorion" |
    "freetrain" | "glest" | "hedgewars" |
    "heretic" | "hexen" | "jardinanis" |
    "lordsawar" | "lxdream" | "maniadrive" |
    "neverball" | "nexuiz" | "oolite" |
    "openarena" | "opencity" | "openmw" |
    "open quartz" | "openttd" | "paintball 2" |
    "planeshift" | "postal 2" | "sacred" |
    "scorched 3d" | "smokin' guns" |
    "supertux" | "supertuxkart" |
    "teeworlds" | "tremulous" |
    "urban terror" | "vegastrike" |
    "warsow" | "wormux" |
    "x-moto" ) { return 0; }
    "" { return 1; }
*/
}

const char* text = "gnome uses gconf to store all of its configuration";

int main(int argc, char* argv[])
{
    struct timeval start, end;
    int usec_spent;
    int i;
    int result;

    gettimeofday(&start, NULL);
    for (i = 0; i < 100000; i++)
        result = check(text);
    gettimeofday(&end, NULL);
    if (result == 0)
        printf("The string matched\n");
    else
        printf("The string didn't match\n");
    usec_spent = 1000000 * (end.tv_sec - start.tv_sec);
    usec_spent += (end.tv_usec - start.tv_usec);
    printf("Spent %d ms\n", usec_spent / 1000);
    return 0;
}

When compiled with -O1 or -O3, this benchmark runs 12 ms here (i.e., almost 250 times faster than the brute-force approach). However, the -O2 optimization option makes the code slower (it runs 16 ms).

3 comments:

Unknown said...: Very interest post! :); September 6, 2009 at 11:51 AM
Alexander E. Patrakov said...: Updated timings, now for x86_64 with gcc-4.4.3:

Brute-force approach: 1889 ms

Aho-Corasick algorithm: 267 ms

Regular expression with glibc: 223 ms

Regular expression with TRE: 3633 ms

Regular expression with PCRE: 6298 ms

[NEW] Regular expression with Google RE2: 62 ms

re2c at -O1: 13 ms

re2c at -O2 and -O3: 11 ms; May 10, 2010 at 7:36 AM
Cstarter said...: Hello,

Nice post about re2c ! although I was wondering if you knew how to make a condition that returns the number of different words found in the text.

Thanks; June 13, 2013 at 7:27 AM

My Blog

Sunday, September 6, 2009

Matching multiple strings

The simplest approach

Aho-Corasick algorithm

Regular expressions

Generated code

3 comments:

Links