A while ago, in spring 1997, I started out to write some
tools that were supposed to make string handling and parsing
text faster than what the standard library has to offer. I
had a need for this since I was (and still am) working on a
WebService Framework that greatly simplifies building and
maintaining interactive web sites. After some initial
prototypes of what I call Tagging Engine written
totally in Python I started rewriting the main parts in C
and soon realized that I needed a little more sophisticated
searching tools.
I could walk through text pretty fast, but in many
situations I just needed to replace some text with some
other text.
The next step was to create a new types for fast searching
in text. I decided to code up an enhanced version of the
well known Boyer-Moore search algorithm. This made me think
a bit more about searching and how knowledge about the text
and the search pattern could be better used to make it work
even faster. The result was an algorithm that uses a suffix
skip array, which I call Fast Search Algorithm.
The two search types are built upon a small C lib I wrote
for this. The implementations are optimized for gcc/Linux
and from the tests I ran I can say that they out-perform
every other technique I have tried. Even the very fast
Boyer-Moore implementation of fgrep (1).
Then I reintegrated those search utilities into the Tagging
Engine and also added a fast variant for doing 'char out of
a string'-kind of tests. These are done using 'sets',
i.e. strings that contain one bit per character position
(and thus 32 bytes long).
All this got wrapped up in a nice Python package:
- a fast search mechanism,
- a state machine for doing fast tagging,
- a set of functions aiding in post-processing the output of the
two and
- a set of functions handling sets of characters.
One word about the word 'tagging'. This originated
from what is done in HTML to mark some text with a certain
extra information. I extended this notion to assigning
Python objects to text substrings. Every substring marked in
this way carries a 'tag' (the object) which can be used to
do all kinds of nifty things.
Tagging Engine
Marking certains parts of a text should not involve storing
hundreds of small strings. This is why the Tagging Engine
uses a specially formatted list of tuples to return the
results:
Tag List
A tag list is a list of tuples marking certain slices of
a text. The tuples always have the format
(object, left_index, right_index, sublist)
with the meaning: object
contains
information about the slice
[left_index:right_index]
in some text. The
sublist
is either another taglist created
by recursively invoking the Tagging Engine or
None
.
Tag Table
To create such taglists, you have to define a Tag Table
and let the Tagging Engine use it to mark the text. Tag
Tables are really just standard Python tuples containing
other tuples in a specific format:
tag_table = (('lowercase',AllIn,a2z,+1,+2),
('upper',AllIn,A2Z,+1),
(None,AllIn,white+newline,+1),
(None,AllNotIn,alpha+white+newline,+1),
(None,EOF,Here,-4)) # EOF
The tuples contained in the table use a very simple format:
(tagobj, command+flags, command_argument
[,jump_no_match] [,jump_match=+1])
Semantics
The Tagging Engine reads the Tag Table starting at the top
entry. While performing the command actions (see below for
details) it moves a read-head over the characters of the
text. The engine stops when a command fails to match and no
alternative is given or when it reaches a non-existing
entry, e.g. by jumping beyond the end of the table.
Tag Table entries are processed as follows:
If the command
matched, say the slice
text[l:r]
, the default action is to append
(tagobj,l,r,sublist)
to the taglist (this
behaviour can be modified by using special
flags
; if you use None
as tagobj,
no tuple is appended) and to continue matching with the
table entry that is reached by adding
jump_match
to the current position (think of
them as relative jump offsets). The head position of the
engine stays where the command left it (over index
r
), e.g. for (None,AllIn,'A')
right after the last 'A' matched.
In case the command
does not match, the
engine either continues at the table entry reached after
skipping jump_no_match
entries, or if this
value is not given, terminates matching the current
table and returns not matched. The head position is
always restored to the position it was in before the
non-matching command was executed, enabling
backtracking.
The format of the command_argument
is dependent
on the command. It can be a string, a set, a search object,
a tuple of some other wild animal from Python land. See the
command section below for details.
A table matches a string if and only if the Tagging Engine
reaches a table index that lies beyond the end of the
table. The engine then returns matched ok. Jumping
beyond the start of the table (to a negative table index)
causes the table to return with result failed to
match.
Tagging Commands
The commands and constants used here are integers defined in
Constants/TagTables.py and imported into the
package's root module. For the purpose of explaining the
taken actions we assume that the tagging engine was called
with tag(text,table,start=0,end=len(text))
. The
current head position is indicated by x
.
Command |
Matching Argument |
Action |
Fail |
Here |
Causes the engine to fail matching at the current head
position.
|
Jump |
To |
Causes the engine to perform a relative jump by
jump_no_match entries.
|
AllIn |
string |
Matches all characters found in text[x:end]
up to the first that is not included in string. At least
one character must match.
|
AllNotIn |
string |
Matches all characters found in text[x:end]
up to the first that is included in string. At least one
character must match.
|
AllInSet |
set |
Matches all characters found in text[x:end]
up to the first that is not included in the string
set. At least one character must match.
|
Is |
character |
Matches iff text[x] == character .
|
IsNot |
character |
Matches iff text[x] != character .
|
IsIn |
string |
Matches iff text[x] is in string .
|
IsNotIn |
string |
Matches iff text[x] is not in string .
|
IsInSet |
set |
Matches iff text[x] is in set .
|
Word |
string |
Matches iff text[x:x+len(string)] == string .
|
WordStart |
string |
Matches all characters up to the first occurance of
string in text[x:end] .
If string is not found, the command does not match and
the head position remains unchanged. Otherwise, the
head stays on the first character of string in the
found occurance.
At least one character must match.
|
WordEnd |
string |
Matches all characters up to the first occurance of
string in text[x:end] .
If string is not found, the command does not match and
the head position remains unchanged. Otherwise, the
head stays on the last character of string in the
found occurance.
|
sWordStart |
search object |
Same as WordStart except that the search object is used
to perform the necessary action (which can be much faster)
and zero matching characters are allowed.
|
sWordEnd |
search object |
Same as WordEnd except that the search object is used
to perform the necessary action (which can be much faster).
|
sFindWord |
search object |
Uses the search object to find the given substring.
If found, the tagobj is assigned only to the slice of
the substring. The characters leading up to it are
ignored.
The head position is adjusted to right after the
substring -- just like for sWordEnd.
|
Call |
function |
Calls the matching
function(text,x,end) .
The function must return the index y of
the character in text[x:end] right after
the matching substring.
The entry is considered to be matching, iff x !=
y . The engines head is positioned on
y in that case.
|
CallArg |
(function,[arg0,...]) |
Same as Call except that
function(text,x,end[,arg0,...]) is being
called. The command argument must be a tuple.
|
Table |
tagtable or ThisTable |
Matches iff tagtable matches text[x:end] .
This calls the engine recursively.
In case of success the head position is adjusted to
point right after the match and the returned taglist
is made available in the subtags field of this tables
taglist entry.
You may pass the special constant
ThisTable instead of a Tag Table if you
want to call the current table recursively.
|
SubTable |
tagtable or ThisTable |
Same as Table except that the subtable reuses this
table's tag list for its tag list. The
subtags entry is set to None.
You may pass the special constant
ThisTable instead of a Tag Table if you
want to call the current table recursively.
|
TableInList |
(list_of_tables,index) |
Same as Table except that the matching table to be used
is read from the list_of_tables at position
index whenever this command is
executed.
This makes self-referencing tables possible which
would otherwise not be possible (since Tag Tables are
immutable tuples).
Note that it can also introduce circular references,
so be warned !
|
SubTableInList |
(list_of_tables,index) |
Same as TableInList except that the subtable reuses this
table's tag list. The subtags entry is set
to None .
|
EOF |
Here |
Matches iff the head position is beyond end .
|
Skip |
offset |
Always matches and moves the head position to x +
offset .
|
Move |
position |
Always matches and moves the head position to
slice[position] . Negative indices move the
head to slice[len(slice)+position+1] ,
e.g. (None,Move,-1) moves to EOF. slice
refers to the current text slice being worked on by the
Tagging Engine.
|
Loop |
count |
Remains undocumented for this release.
|
LoopControl |
Break/Reset |
Remains undocumented for this release.
|
Some additional constants that can be used as argument or relative
jump position:
Internally, the Tag Table is used as program for a state
machine which is coded in C and accessible through the
package as tag()
function along with the
constants used for the commands (e.g. Allin, AllNotIn,
etc.). Note that in computer science one normally
differentiates between finite state machines, pushdown
automata and turing machines. The Tagging Engine offers all
these levels of complexity depending on which techniques you
use, yet the basic structure of the engine is best compared
to a finite state machine.
I admit, these tables don't look very elegant. In fact I
would much rather write them in some meta language that gets
compiled into these tables instead of handcoding them. But
I've never had time to do much research into this. Mike
C. Fletcher has been doing some work in this direction
recently. You may want to check out his SimpleParse
add-on for mxTextTools. Recently, Tony J. Ibbs has also
started to work in this direction. His meta-language
for mxTextTools aims at simplifying the task of writing
Tag Table tuples.
The packages includes a nearly complete Python emulation of
the Tagging Engine in the Examples subdirectory
(pytag.py). Though it is unsupported it might still provide
some use since it has a builtin debugger that will let you
step through the Tag Tables as they are executed. See the
source for further details.
As an alternative you can build a version of the Tagging
Engine that provides lots of debugging output. See
mxTextTools/Setup for explanations on how to do
this. When enabled the module will create several
.log files containing the debug information of
various parts of the implementation whenever the Python
interpreter is run with the debug flag enabled (python
-d). These files should give a fairly good insight into the
workings of the Tag Engine (though it still isn't as elegant
as it could be).