matt
--- text matching and extraction utility ---

Introduction

matt is, I believe, rather unusual among commonly available tools with somewhat similar functions, such as grep and awk. Like them, it is a command-line-invoked program that locates segments of a text by matching them against regular expressions. Unlike them, it is not 'line-bound' (or even 'record-bound'): an expression 'locates' exactly the text that it matches, rather than the entire record that contains it. The matched segment may be part of a line, or extend over several lines. (I assume that most readers know what a "regular expression" is. If not, this is not the place for an in-depth review, but you may get the idea from what follows.)

That seemingly minor difference in strategy means that matt can handle tasks that are difficult or impossible with the other programs. It can pull entire paragraphs that match some desired criterion out of a text, locate elements within an HTML or XML file, even modify desired text segments while outputting the rest unchanged.

There are several ways, depending on the exact command-line and options used to invoke it, in which matt can process a source text. At its simplest, matt will output each matching segment it finds. Alternatively it can output all the text that is not matched.

The power of matt really comes into play, however, with the addition of an "output template" to the command line. The template can slice out segments of each match to be output, interspersed with additional characters. Other items, such as the position of the match in the text or the filename, may also be included in the template. There are even "conditional items", which appear only if their corresponding sub-segment is matched (i.e. not empty).

Unlike other common string-matching applications, matt is UTF-8-aware: UTF-8 multibyte characters can occur anywhere in the regular expression or the text being matched. Alternatively it can be switched (with the '-8' option) to be "8-bit clean". In this mode it will scan any file (even binary) for a specified byte pattern.

The Command Line

The general command-line format is:

matt [options] <pattern> [-o <template>] file...
where <pattern> is the regular expression to match (see below), and <template> is an optional string to control the format of the output. To read text from standard input rather than a file, use "-".

The simplest version:

matt <pattern> file...
sends each match it finds to standard output. By default, a newline is output after each match. To suppress the newline, and get all the exact segments matched concatenated, use:
matt -n <pattern> file...
If you use the '-v' command line option, you will get all the text that doesn't match:
matt -v <pattern> file...
This is always verbatim -- no newlines are added. You can use a template if you want to add stuff.

The complete set of switch options is listed later.

To rearrange the matched segments for output, use the template string:

matt [opts] <pattern> -o <template> file...
See below for the format of this string. Not all the options are appropriate if you're using templates.

Instead of including the pattern and optional template as command-line arguments, you can create a short file containing them, and use the -f option to specify it. This can avoid conflicts with the shell in quoting odd characters and so on (and also makes for a shorter command line!). The pattern file itself can be made executable if you like (detailed later), so you can use it directly as a command to process text.

matt returns a status of 0 if it finds a match, 1 if it does not, and 2 if there was an error in the regular expression supplied.

Regular Expressions

Regular expressions in matt follow the same form as grep and other applications such as awk or perl but, as each of these has its own quirks, so does matt. It uses the set of special characters that has become standard, but doesn't have extensions like 'interval expressions' ( "{m,n}") or predefined range specifiers ("[:ALNUM:]"). A newline may be explicitly or implicitly included in an expression, allowing it to match across lines. Successive matches will never overlap -- the scan resumes at the character following a match.

The special characters are . * + ? ^ | $ ( ) [ \
The dash - and right-bracket ] are special after a left-bracket, but elsewhere they are simply themselves. All other characters, including all extended UTF8 characters, are literal, representing themselves.

A regular expression entered on the command line has to obey the rules of the shell, which means that it should at least always be quoted. Single quotes are usually preferable, as everything inside them — except another single quote! — is taken verbatim. If you need to match a single quote (apostrophe), use the '\q' special literal. Some other particular 'literal' characters also need special representation; one is of course 'newline' itself which can be represented with the usual \n pair. The complete list is given below. You can alternatively enclose the pattern in double-quotes, but you will then have to escape characters like '$' as well.

Each Special Character has meaning as follows:

Regular Expression Examples:

See 'Recipes' below for other possibilities.


Output Templates

By default, when a match to the regular expression is found in the incoming text stream, it is simply sent verbatim to standard output. Other information about each match is available, however, such as its byte position in the stream and the segments of the entire match that correspond to subgroups in the expression. You may want to output a formatted string containing some of this data, rather than just the match itself. A Template string ('-o' option) lets you do this. If you follow the regular-expression argument with '-o' and a template string it will be used to format the output.

If the '-v' switch is present as well, both the results of the template and the unmatched segments will be output, so you can modify the matched segments and leave the rest of the text unchanged.

The template string can have both plain text and 'data selector' elements. Each of the latter is a dollar-sign followed by a single selection character (except for conditional insertion selectors, which are multicharacter). When a match is found, each data selector is equated to its appropriate value for that match, and the template is sent in sequence to the output, with plain text going out as is, and each selector replaced by its value. You can use a particular selector more than once in the same template if you need to. (There is an overall limit of 20 selectors, but that should suffice...)

The possible data selectors are:


Options

You can modify matching behaviour in a number of ways through these command options:

-s  Shortest
By default, each match found is the longest possible one at that point. In many cases this isn't what you want, especially when matching across multiple lines. To get the minimum length match, use the -s switch.
-i  Case Insensitive
By default, all literal characters must match exactly. Include this option to ignore the case of characters.
-a  All matched by Period including Newline
By default, a period ('.') in a regular expression matches any character except a newline. Including this option makes it match newline as well. If you set this, you will probably want to use '-s' also, otherwise your matches may be longer than you expect! This option also controls whether newline will be matched by a negated character-class ("[^a-z]").
-z  $ (End-of-Line) matches End-of-Text
If you want to ensure that an end-of-line is always seen when end-of-data is read, even if no newline is actually there, use this option.
-n  No added Newline
By default, with no other switches or template present, a newline is added after each match. To inhibit this newline, use this option. (This switch is inactive if a template is used. No newlines except those in the template itself are ever added in this case.)
-t  Text Output Forced
A number of the other switch options (-p, -l, -c) normally suppress output of the matched text. This switch causes the matches to be output in these cases also. (This switch is inactive if a template is used.)
-p  Positions only
When supplied, this causes the start and end character positions of each match in the text to be output. Used by itself, it also suppresses output of the matched text, but adding the -t switch as well will restore this. (This switch is inactive if a template is used. Template selectors '$b' and '$e' provide equivalent output.)
-8  8-bit ASCII
Normally matt expects its texts (pattern, template, and source text) to be 7-bit ascii/UTF-8 unicode (identical when only ascii is involved). If instead you want to handle full 8-bit single bytes, use this switch. With this option, you can even scan arbitrary binary files for a pattern. The pattern itself needn't be ascii either — you can specify any byte value from 0 to 255 (octal 377) by using the "\nnn" convention. (If you aren't scanning a text you know to be UTF-8 or 7-bit ascii, it might be advisable to use this switch, in case it should try to treat an extended ISO character as multibyte!)
-v  Inverse Match
Setting this switch causes all the unmatched segments of text to be output. If used without a template, the matched portions are not output (the -t switch is inactive here), but if a template is present as well it behaves in the usual fashion, with template outputs interspersed appropriately with the unmatched segments.
-l  List Filename
Outputs the name of any file containing a match. The name is only output once, on a separate line before any other output for the file. By itself it also suppresses match output, but -t will reverse this, and all the other switches, or a template, will result in their usual output.
-c  Count number of matches only
By itself, this results in only the total count of matches in each file being output. The other switches, such as -t, have their usual effects, as will a template. The count will appear as the last output for that file.
-V  Print Version information and exit
-f filename  File for Pattern (and Template)
Rather than including pattern and template as arguments on the command line, a file can be used to hold them. The first line of this file must either be the regular expression pattern, or a comment beginning with '#'; if it a comment, the next line must be the pattern. An optional line immediately after this can hold the template. The format of these strings is identical to the command-line versions, except of course there must be no enclosing quotes, and an expression on the first line must not begin with '#' so as not to be confused with a comment. You also need not worry about conflicts with shell conventions. (The main reason for allowing a comment as the first line is to permit the pattern file itself to be executable — see below.)
-o string  Output Template
See Template section above.

Executable Pattern File:

It is normal for a command shell to check the first line of an executable script for a possible interpreter to execute the script rather than the shell itself. The convention is for the first two characters to be '#!' followed by the complete path to the location of the interpreter executable. Desired options may follow this, just as on a normal command line.

So, to make a pattern file self-executable, you set the 'execute' bit on the file (with the 'chmod' command), and make the first line something like:

#!/usr/local/bin/matt -f
assuming that there is where matt can be found on your system. The option '-f' should be the last item on the line, so that when the file gets passed as the first argument to the invoked matt it gets used as any other pattern file would. You should be able to add other options — such as '-v' perhaps — provided that they're placed before the '-f'. You may have to combine all the switches into a single term (like '-vf'); Linux, for instance, lumps everything after the interpreter itself into a single argument! Any other arguments passed to the script at invocation become additional arguments to matt as you would expect.
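As a concrete (hypothetical) sketch, an executable pattern file 'patfile' that performs the 'Pete'-to-'Peter' substitution shown later under Recipes might contain just these three lines, assuming matt is installed at /usr/local/bin:

```
#!/usr/local/bin/matt -vf
Pete([^r]|$)
Peter$1
```

The first line doubles as the comment line and the interpreter directive; the second is the pattern, the third the template. Note '-vf' combining the inverse-match switch with '-f' in a single term, as described above.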

Thus, if you have such an executable pattern file named 'patfile', giving the command:

patfile myfile.txt
would effectively execute:
/usr/local/bin/matt -f patfile myfile.txt

Real-life Recipes

Extracting Relevant Paragraphs:

One simple use of matt is to extract relevant paragraphs from a text. Thus suppose — for reasons of egotism — that I have some text that makes a number of favourable references to "Pete", and I want to extract all the paragraphs (assumed separated by blank lines) that contain my name. (This particular example could be done in awk, too, but it makes a good first illustration.) This command line would be the most basic way of doing this:
matt '^(.+\n)*.*Pete.*\n(.+\n)*' petesfile.txt
Here, the initial '^' ensures that the match always starts at the beginning of a line. Then the '(.+\n)' subexpression will match any non-blank line (i.e. containing at least one character before the newline). The '*' following allows any number of these — including none — to occur.  '.*Pete.*\n' matches a line that contains 'Pete' with any number of characters before or after, and the final subexpression specifies that any number of non-blank lines may follow.

Hence the match will begin at the first non-blank line it finds in the text, and continue through until it is blocked by the blank line that ends the paragraph. If anywhere in that span the string 'Pete' is found, the match succeeds and the paragraph is output. Otherwise it fails, and that paragraph is discarded. Either way, the scan resumes from that point, looking for the next non-blank line.
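If you want to experiment with this pattern without matt to hand, Python's re module behaves equivalently here (a sketch with made-up sample text; re.MULTILINE makes '^' anchor at line starts, as matt does):

```python
import re

# Hypothetical sample: three paragraphs separated by blank lines.
text = (
    "Pete wrote this tool.\nIt works on streams.\n"
    "\n"
    "This paragraph has no match.\n"
    "\n"
    "Another mention of Pete here.\n"
)

# The same pattern as the matt command above.
pat = re.compile(r'^(.+\n)*.*Pete.*\n(.+\n)*', re.MULTILINE)

# Print each matching paragraph verbatim.
for m in pat.finditer(text):
    print(m.group(0), end='')
```

Only the first and third paragraphs are output; the scan resumes after each matched paragraph, just as described above.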

You can see, though, that this simple pattern probably isn't good enough. Not only will it find 'Pete', it will also respond to 'Peter' or even 'Peterborough'. We should restrict it a bit further, perhaps like this:

matt '^(.+\n)*.*Pete([^r]|$).*\n(.+\n)*' petesfile.txt
The added alternative section '([^r]|$)' specifically excludes a terminating 'r', but it also provides for the word being at the very end of the line. Otherwise it would not find a match unless there was some character (other than newline) there. Obviously there are other choices for that part of the expression. Instead of excluding only 'r', we could have specifically looked for a space or punctuation: '([ ,.!\"\q;:]|$)' or whatever. (Note the escaped quotes used, and the '\q' to represent a single quote to keep the shell happy.)
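A quick check of the refined subexpression (again using Python's re as a stand-in; the sample strings are invented):

```python
import re

# 'Pete' must not be followed by 'r', but may sit at end-of-line.
pat = re.compile(r'Pete([^r]|$)', re.MULTILINE)

print(bool(pat.search("say Pete now")))       # matches
print(bool(pat.search("mention Peter")))      # no match
print(bool(pat.search("last word is Pete")))  # matches at end-of-line
```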

This basic scheme is easily adapted to pick out paragraphs containing any desired word or sets of words. The inverse may also be useful. Without going into extreme detail, I have some web-page access logs that are interesting to browse sometimes, to see what googling brought people to the page [no personal details... don't worry!]. Unfortunately, many of the hits are just robots, which are boring to wade through. The entries are multiline, so grep doesn't help. I now filter the log through a matt command that recognizes and discards any entries from robots, leaving a more compact list to browse.

Modifying Text:

As a final variant on the 'Pete' theme, imagine that I have a sudden fit of formality, and decide that every occurrence of 'Pete' should be changed to 'Peter'. This time, I'm not concerned with paragraphs, but I do have to pass the rest of the text through unchanged.
matt -v 'Pete([^r]|$)' -o 'Peter$1' petesfile.txt >petersfile.txt
The -v switch ensures that unmatched text gets passed on. The pattern is simpler this time, as we only have to find the string itself (avoiding any 'Peter' already present, of course). The template string just has the desired change, but also must naturally reproduce any character following 'Pete' that was also matched ('$1').
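The same substitution sketched in Python (re.sub plays the combined role of matt's -v switch and the output template; the sample line is invented):

```python
import re

text = "Pete said hi. Peter waved. Bye Pete"

# '\1' re-emits whatever character (possibly none) followed 'Pete',
# just as '$1' does in the matt template.
result = re.sub(r'Pete([^r]|$)', r'Peter\1', text)
print(result)  # -> Peter said hi. Peter waved. Bye Peter
```

The existing 'Peter' is left alone, and the trailing 'Pete' at end-of-string is still caught (the '$' alternative matched, so group 1 is empty).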

Working with HTML (XML too...):

Matt can make a good HTML manipulation tool, and of course does just as well with XML. I use it to make anchors and links in documents like this, and to create a Table of Contents when done. These scripts are a bit complex to reproduce here, but some simpler ideas can be demonstrated. (In all the following, output would normally be redirected to another file, but this part of the command line has been omitted for compactness.)

One simple job is to discard all the HTML tags, leaving just plain text — much more convenient for a spell-check, for instance. Here's the appropriate command line (note the 's', 'a' and 'v' switches — it should be clear why they're needed):

matt -sav '<.*>' file.html
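The Python equivalent uses a non-greedy '.*?' where matt uses the -s switch, and re.DOTALL where matt uses -a (sample HTML invented):

```python
import re

html = "<p>Hello <b>world</b></p>\n<br\n/>done"

# Remove every (shortest) '<...>' span, even those crossing newlines,
# and keep everything else -- the counterpart of matt -sav '<.*>'.
stripped = re.sub(r'<.*?>', '', html, flags=re.DOTALL)
print(stripped)
```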

Another pain that matt can alleviate is the handling of the special characters <, >, and & that need to be transformed into multicharacter 'entities' for inclusion in an HTML document. Passing an original plain text file through the following command line will make all the changes at once:

matt -v '(<)|(>)|(&)' -o '$(1&lt;)$(2&gt;)$(3&amp;)' text.txt
(And, yes, I did run the above command line through itself to get the conversion right!) This example illustrates the utility of the '$(n...)' "conditional" template selectors to insert completely new text where needed.
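In Python, a replacement function can stand in for the conditional selectors: it emits the entity corresponding to whichever character actually matched (sample input invented):

```python
import re

entities = {'<': '&lt;', '>': '&gt;', '&': '&amp;'}

# Each match is a single special character; look up its entity.
escaped = re.sub(r'[<>&]', lambda m: entities[m.group(0)], 'if a<b && b>c')
print(escaped)  # -> if a&lt;b &amp;&amp; b&gt;c
```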

Consider wanting to extract all the links referenced in the file. They're all going to be enclosed between '<a href=...>' and '</a>', so they're easy to find:

matt -sai '<a href=\".*\".*>.*</a>' file.html
Notice the switch options '-sai'; the 'i' because tags can be lower case or caps, and 's' and 'a' because anchor elements can extend across several lines (so '.' should match a newline) but we have to make sure we only capture one anchor element, so we choose the "shortest match" option.

If we just want the actual URL and the contents of the link, we can add subexpressions and a template to extract them separately:

matt -sai '<a href=\"(.*)\"[^>]*>(.*)</a>' -o '$2: $1\n' file.html
The template shown here prints the link description (the second matched subexpression) first, then the URL. Note also that the '.*' following the (escaped) quote at the end of the URL subexpression has been replaced by '[^>]*', which matches any attributes that might be in the same tag, but cannot run past the tag's closing '>' into other tags or the link description itself.
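A Python sketch of the same extraction (non-greedy groups and the IGNORECASE and DOTALL flags mirror matt's -s, -i and -a switches; the HTML fragment is invented):

```python
import re

html = ('<a href="http://example.com/a">First\nlink</a> and '
        '<A HREF="http://example.com/b">Second</A>')

pat = re.compile(r'<a href="(.*?)"[^>]*>(.*?)</a>',
                 re.IGNORECASE | re.DOTALL)

# Print description first, then URL, as the '$2: $1\n' template does.
for url, desc in pat.findall(html):
    print(f'{desc}: {url}')
```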

Scanning Binary Data:

Not so much recipes in this section, but a few comments... matt is fully capable of scanning arbitrary binary files rather than text, if the '-8' command-line option is used. If you're just looking for text strings, this may be no better than the standard 'strings' command (perhaps piped to grep), but if you want to locate patterns containing non-ascii bytes, matt may have the edge. Or if you need to look through all the files in a directory — text and otherwise — to find those containing a particular string or expression, matt with '-8l' (that's "eight-el") as its options can be a convenient way to do this.

Remember that any byte value can be included in the regular expression pattern using the '\nnn' octal specification. (Actually the hex specification '\xnn...' can be used equivalently in 8-bit mode, as all values are treated as bytes.)

Of course printing out non-ascii may not have much point (though you can always pipe the output directly to another file). If you simply need the positions of the bytes in the file, use the '-p' option; for example this will show where all the nulls appear in 'binfile':

matt -8p '\000+' binfile
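The byte-oriented behaviour is easy to mimic with Python's re on bytes objects (invented data; the offsets printed are analogous to matt's -p output, though matt's exact position convention may differ):

```python
import re

data = b'ab\x00\x00cd\x00ef'

# Report the start and end offset of each run of null bytes.
for m in re.finditer(rb'\x00+', data):
    print(m.start(), m.end())
```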

Subtleties and Cautions

Because matt is string- rather than line-oriented, it does not keep track of input lines. Therefore, though byte position is available, line position is not.

While it is determining a match, the application has to store the pending text. In the worst case this might mean holding on to the entire incoming stream, but in normal use the program can determine when a segment cannot match and will then discard it. It keeps a count of position as a 32-bit value, though, so it will eventually overflow that if fed an "infinite" stream.

Matt buffers its input and output, and as it is not line oriented cannot be expected to output matches interactively.

You can determine whether the overall longest or shortest match will be found with the ('-s') option, but if there is no unambiguous way in which the match should be divided into its subexpressions, there may be no easy way to tell which the program will choose. (It will always find one solution, but according to its own algorithm, which may not be obvious to the user.) In general, the first subexpression will be the longest possible, but not always. For example (.*)X(.*)$ will match the line "abcXdefXpqr" with $1 as "abcXdef" and $2 as "pqr" (whether or not '-s' is set). On the other hand, the pattern (.*)*X(.*)*$ will return "abc" and "defXpqr" respectively! You may be able to make use of this, um..., "feature", but you had better experiment first!
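Python's backtracking engine happens to divide the first example the same way, which makes it easy to experiment (the doubled-star variant is engine-dependent, so only the plain form is shown):

```python
import re

# The greedy first group claims as much as it can before the last 'X'.
m = re.match(r'(.*)X(.*)$', 'abcXdefXpqr')
print(m.group(1), m.group(2))  # -> abcXdef pqr
```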

matt is no speed demon compared to grep, but it's really doing quite different things, and in fact has a lot more work to do. If you have a lot of text to scan, and don't need matt's special features, you may be better off using grep. On the other hand, it doesn't pretend to supplant full-fledged languages such as awk or perl, though it also does some things easily that may be cumbersome with those tools. Choose your weapon wisely.


Background

Why the name "matt"? Why, it's short for Matthew, of course... (grin) Actually, it isn't really intended as an acronym, but if you insist, you can think of "MATching with Templates", or add an 'e' and be reminded of the technique used by film makers to extract and superimpose elements in a scene. I've been using an earlier more limited version 'in-house' for a few years, but I decided it was time to extend it, polish it up, and release it for general consumption.

This program owes a great deal to earlier work by Rob Pike and others at Bell Labs. Once upon a time there was a text editor called "Sam". (Well, I guess there still is, but it isn't very well known.) It made use of these "Structural Regular Expressions" as Pike calls them to locate and process segments of text — as well as employing the usual interactive cut and paste. I was never able to get comfortable with Sam (my impression was that the interactive side was a bit old-fashioned) but I was able to take Pike's freely available regular expression code and adapt it into a C++ class for my own use. The most effort was in adapting the original in-memory scanning to buffered UTF-8 character streams.

Environments like Python and Perl now also provide what are effectively Structural Regular Expressions, but they operate on text in memory rather than on 'streams'. If you need to do complex things to text, a full language is likely to be more appropriate, but I encounter many tasks where a matt command line seems more convenient.

The code is straightforwardly posix compliant and should compile without change for most platforms. (I use it heavily on our Linux server also.)

Copyrights

The matt program and this documentation are Copyright 1999-2006 by Peter Goodeve, All Rights Reserved.

The regular expression matching code is derived from original code written by Rob Pike for the 'sam' editor. This has the following Copyright notice:

/*
 * The authors of this software are Rob Pike and Howard Trickey.
 *		Copyright (c) 1998 by Lucent Technologies.
 * Permission to use, copy, modify, and distribute this software for any
 * purpose without fee is hereby granted, provided that this entire notice
 * is included in all copies of any software which is or includes a copy
 * or modification of this software and in all copies of the supporting
 * documentation for such software.
 * THIS SOFTWARE IS BEING PROVIDED "AS IS", WITHOUT ANY EXPRESS OR IMPLIED
 * WARRANTY.  IN PARTICULAR, NEITHER THE AUTHORS NOR LUCENT TECHNOLOGIES MAKE ANY
 * REPRESENTATION OR WARRANTY OF ANY KIND CONCERNING THE MERCHANTABILITY
 * OF THIS SOFTWARE OR ITS FITNESS FOR ANY PARTICULAR PURPOSE.
 */