PCRS(3)
NAME
pcrs - Perl-compatible regular substitution.
SYNOPSIS
#include <pcrs.h>
pcrs_job *pcrs_compile(const char *pattern,
const char *substitute, const char *options,
int *errptr);
pcrs_job *pcrs_compile_command(const char *command,
int *errptr);
int pcrs_execute(pcrs_job *job, char *subject,
int subject_length, char **result,
int *result_length);
int pcrs_execute_list (pcrs_job *joblist, char *subject,
int subject_length, char **result,
int *result_length);
pcrs_job *pcrs_free_job(pcrs_job *job);
void pcrs_free_joblist(pcrs_job *joblist);
char *pcrs_strerror(int err);
DESCRIPTION
The PCRS library is a supplement to the PCRE(3) library that implements
regular expression based substitution, as provided by Perl(1)'s 's'
operator. It uses the same syntax and semantics as Perl 5, with just a
few differences (see below).
In a first step, the components of a substitution, i.e. the pattern,
the substitute and the options, are compiled from Perl syntax to an
internal form called pcrs_job by using either the pcrs_compile() or
the pcrs_compile_command() function.
Once the job is compiled, it can be used on subjects, which are
arbitrary memory areas containing string or binary data, by calling
pcrs_execute(). Jobs can be chained to joblists and whole joblists can
be applied to a subject using pcrs_execute_list().
There are also convenience functions for freeing the jobs and for
errno-to-string conversion, namely pcrs_free_job(), pcrs_free_joblist()
and pcrs_strerror().
COMPILING JOBS
The function pcrs_compile() is called to compile a pcrs_job from a
pattern, substitute and options string. The resulting pcrs_job structure
is dynamically allocated and it is the caller's responsibility to call
pcrs_free_job() when it's no longer needed.
pcrs_compile_command() is a convenience wrapper function that parses a
Perl command of the form s/pattern/substitute/[options] into its
components and then calls pcrs_compile(). As in Perl, you are not bound
to the '/' character: whatever character follows the 's' is used as the
delimiter. Patterns or substitutes that contain the delimiter need to
quote it: s/th\/is/th\/at/ will replace th/is by th/at and can be
written more simply as s|th/is|th/at|.
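The delimiter rule can be illustrated with a small stand-alone sketch.
split_command() below is a hypothetical helper, not part of PCRS, and
makes no claim about how pcrs_compile_command() is implemented. It
assumes that a backslash protects the character after it and leaves the
backslash in place for a later stage to interpret.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRS function): split a command of the form
 * s<d>pattern<d>substitute<d>options into three newly allocated strings.
 * A backslash protects the following character, so a quoted delimiter
 * does not end a token. Returns 0 on success, -1 on a malformed command
 * or allocation failure. On success the caller frees out[0..2].
 */
int split_command(const char *command, char *out[3])
{
    assert(command != NULL && out != NULL);
    out[0] = out[1] = out[2] = NULL;

    if (command[0] != 's' || command[1] == '\0') {
        return -1;
    }

    const char delimiter = command[1];   /* whatever follows the 's' */
    const char *start = command + 2;
    int found = 0;

    for (const char *p = start; found < 2; p++) {
        if (*p == '\0') {
            goto fail;                   /* too few delimiters */
        }
        if (*p == '\\' && p[1] != '\0') {
            p++;                         /* skip the quoted character */
            continue;
        }
        if (*p == delimiter) {
            size_t len = (size_t)(p - start);
            out[found] = malloc(len + 1);
            if (out[found] == NULL) {
                goto fail;
            }
            memcpy(out[found], start, len);
            out[found][len] = '\0';
            found++;
            start = p + 1;
        }
    }

    /* Everything after the third delimiter is the options string. */
    out[2] = malloc(strlen(start) + 1);
    if (out[2] == NULL) {
        goto fail;
    }
    strcpy(out[2], start);
    return 0;

fail:
    free(out[0]);
    free(out[1]);
    free(out[2]);
    out[0] = out[1] = out[2] = NULL;
    return -1;
}
```

With this sketch, s|th/is|th/at| splits into th/is, th/at and an empty
options string, while s/a/b is rejected because the closing delimiter is
missing.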
pattern, substitute, options and command must be zero-terminated C
strings. substitute and options may be NULL, in which case they are
treated like the empty string.
Return value and diagnostics
On success, both functions return a pointer to the compiled job. On
failure, NULL is returned. In that case, the pcrs error code is written
to *errptr.
Patterns
For the syntax of the pattern, see the PCRE(3) manual page.
Substitutes
The substitute uses Perl syntax as documented in the perlre(1) manual
page, with some exceptions:
Most notably and evidently, since PCRS is not Perl, variable
interpolation or Perl command substitution won't work. The special
variables that do get interpolated are:
$1, $2, ..., $n
Like in Perl, these variables refer to what the nth capturing
subpattern in the pattern matched.
$& and $0
refer to the whole match. Note that $0 is deprecated in recent
Perl versions and now refers to the program name.
$+ refers to what the last capturing subpattern matched.
$` and $' (backtick and tick)
refer to the areas of the subject before and after the match,
respectively. Note that, like in Perl, the unmodified subject
is used, even if a global substitution previously matched.
Perl4-style references to subpattern matches of the form \1, \2, ...
which only exist in Perl5 for backwards compatibility, are not
supported.
Also, since the substitute is a double-quoted string in Perl, you might
expect all Perl syntax for special characters to apply. In fact, only
the following are supported:
\n newline (0x0a)
\r carriage return (0x0d)
\t horizontal tab (0x09)
\f form feed (0x0c)
\b backspace (0x08)
\a alarm, bell (0x07)
\e escape (0x1b)
\0 binary zero (0x00)
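The table above can be read as a simple lookup. The following sketch
uses two hypothetical helpers, decode_escape() and expand_escapes(),
which are not part of PCRS. How PCRS itself treats an escape that is not
in the table is not stated here; the sketch assumes the backslash is
dropped and the escaped character copied literally.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: map the letter after a backslash to the byte
 * listed in the table, or return -1 if the letter is not in the table.
 */
int decode_escape(char c)
{
    switch (c) {
    case 'n': return 0x0a;  /* newline */
    case 'r': return 0x0d;  /* carriage return */
    case 't': return 0x09;  /* horizontal tab */
    case 'f': return 0x0c;  /* form feed */
    case 'b': return 0x08;  /* backspace */
    case 'a': return 0x07;  /* alarm, bell */
    case 'e': return 0x1b;  /* escape */
    case '0': return 0x00;  /* binary zero */
    default:  return -1;
    }
}

/*
 * Hypothetical helper: expand the escapes in the string `in` into `out`,
 * which must hold at least strlen(in) bytes. Returns the number of bytes
 * written. The output may contain zero bytes, so it is not terminated
 * and must be handled by length.
 */
size_t expand_escapes(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in == '\\' && in[1] != '\0') {
            int byte = decode_escape(in[1]);
            /* Assumption: unlisted escapes yield the escaped character. */
            out[n++] = (char)(byte >= 0 ? byte : in[1]);
            in += 2;
        } else {
            out[n++] = *in++;
        }
    }
    return n;
}
```

Because \0 produces a real zero byte, the expanded substitute has to be
carried around with an explicit length rather than as a C string.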
Options
The options gmisx are supported. e is not, since it would require a
Perl interpreter, and neither is o, because the pattern is explicitly
compiled anyway. Additionally, PCRS honors the options U and T. Where
PCRE options are mentioned below, refer to PCRE(3) for the subtle
differences to Perl behaviour.
g Replace all instances of pattern in subject, not just the first
one.
i Match the pattern without respect to case. This translates to
PCRE_CASELESS.
m Treat the subject as consisting of multiple lines, i.e. '^'
matches immediately after, and '$' immediately before each
newline. Translates to PCRE_MULTILINE.
s Treat the subject as consisting of one single line, i.e. let
the scope of the '.' metacharacter include newlines. Translates
to PCRE_DOTALL.
x Allow extended regular expression syntax in the pattern,
enabling whitespace and comments in complex patterns.
Translates to PCRE_EXTENDED.
U Switch the default behaviour of the '*' and '+' quantifiers to
ungreedy. Note that appending a '?' switches back to greedy(!).
The explicit in-pattern switches (?U) and (?-U) remain
unaffected. Translates to PCRE_UNGREEDY.
T Consider the substitute trivial, i.e. do not interpret any
references or special character escape sequences in the substitute.
Handy for large user-supplied substitutes, which would otherwise
have to be examined and properly quoted.
Unsupported options are silently ignored.
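As an illustration of the option handling described above, the sketch
below turns an options string into a set of flag bits. The OPT_*
constants and parse_options() are hypothetical names made up for this
example; they are neither the PCRS nor the PCRE constants, and the real
library may organise its flags differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bits for this sketch only. */
enum {
    OPT_GLOBAL    = 1 << 0,  /* g */
    OPT_CASELESS  = 1 << 1,  /* i */
    OPT_MULTILINE = 1 << 2,  /* m */
    OPT_DOTALL    = 1 << 3,  /* s */
    OPT_EXTENDED  = 1 << 4,  /* x */
    OPT_UNGREEDY  = 1 << 5,  /* U */
    OPT_TRIVIAL   = 1 << 6,  /* T */
    OPT_ALL       = (1 << 7) - 1
};

/* Hypothetical helper: translate an options string into flag bits. */
unsigned parse_options(const char *options)
{
    unsigned flags = 0;

    if (options == NULL) {
        return 0;            /* NULL is treated like the empty string */
    }
    for (; *options != '\0'; options++) {
        switch (*options) {
        case 'g': flags |= OPT_GLOBAL;    break;
        case 'i': flags |= OPT_CASELESS;  break;
        case 'm': flags |= OPT_MULTILINE; break;
        case 's': flags |= OPT_DOTALL;    break;
        case 'x': flags |= OPT_EXTENDED;  break;
        case 'U': flags |= OPT_UNGREEDY;  break;
        case 'T': flags |= OPT_TRIVIAL;   break;
        default:                          break;  /* silently ignored */
        }
    }
    assert((flags & ~(unsigned)OPT_ALL) == 0);
    return flags;
}
```

Note that the options are case sensitive: 'U' and 'T' are recognised,
while 'u' and 't' fall through to the ignored case.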
EXECUTING JOBS
Calling pcrs_execute() produces a modified copy of the subject, in
which the first occurrence (or all occurrences, if the 'g' option was
given when compiling the job) of the job's pattern in the subject is
replaced by the job's substitute.
The first subject_length bytes following subject are processed, so a
subject_length that exceeds the actual subject is dangerous. Note that
for zero-terminated C strings, you should set subject_length to
strlen(subject), so that the dollar metacharacter matches at the end of
the string, not after the string-terminating null byte. For
convenience, an extra null byte is appended to the result so it can
again be used as a string.
The subject itself is left untouched, and the *result is dynamically
allocated, so it is the caller's responsibility to free() it when it's
no longer needed.
The result's length (excluding the extra null byte) is written to
*result_length.
If the job matched, the PCRS_SUCCESS flag in job->flags is set.
String subjects
If your
Return value and diagnostics
On success, pcrs_execute() returns the number of substitutions that
were made, which is limited to 0 or 1 for non-global searches. On
failure, a negative error code is returned and result is set to NULL.
FREEING JOBS
It is not sufficient to call free() on a pcrs_job, because it contains
pointers to other dynamically allocated structures. Use
pcrs_free_job() instead. It is safe to pass NULL pointers (or pointers
to invalid pcrs_jobs that contain NULL pointers to dependent
structures) to pcrs_free_job().
Return value
The value of the job's next pointer.
CHAINING JOBS
PCRS supports to some extent the chaining of multiple pcrs_job
structures by means of their next member.
Chaining the jobs is up to you, but once you have built a linked list
of jobs, you can execute a whole joblist on a given subject by a single
call to pcrs_execute_list(), which will sequentially traverse the
linked list until it reaches a NULL pointer, and call pcrs_execute()
for each job it encounters, feeding the result and result_length of
each call into the next as the subject and subject_length. As in the
single job case, the original subject remains untouched, but all
interim results are of course free()d. The return value is the
accumulated number of matches for all jobs in the joblist. Note that
while this is handy, it reduces the diagnostic value of a negative
return code, since you won't know which job failed.
In analogy, you can free all jobs in a given joblist by calling
pcrs_free_joblist().
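The traversal described above (feed each result into the next job, free
the interim results, leave the original subject alone) can be sketched
without PCRS. Everything in the block below is hypothetical: the stage
type stands in for pcrs_job, run_list() for pcrs_execute_list(), and the
two character-swapping stages for compiled jobs. It only illustrates the
control flow, not the library's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a job: a transform plus a next pointer. */
typedef struct stage {
    /* Returns the number of changes, or a negative code on failure.
       On success *result is newly allocated; on failure it is NULL. */
    int (*run)(const char *subject, size_t len,
               char **result, size_t *result_len);
    struct stage *next;
} stage;

/* Demo transform: copy the subject, replacing `from` by `to`. */
int swap_char(const char *subject, size_t len, char from, char to,
              char **result, size_t *result_len)
{
    int hits = 0;
    char *copy = malloc(len + 1);

    *result = NULL;
    if (copy == NULL) {
        return -1;
    }
    for (size_t i = 0; i < len; i++) {
        copy[i] = (subject[i] == from) ? to : subject[i];
        hits += (subject[i] == from);
    }
    copy[len] = '\0';          /* extra null byte for convenience */
    *result = copy;
    *result_len = len;
    return hits;
}

int b_to_a(const char *s, size_t n, char **r, size_t *rn)
{
    return swap_char(s, n, 'b', 'a', r, rn);
}

int a_to_upper(const char *s, size_t n, char **r, size_t *rn)
{
    return swap_char(s, n, 'a', 'A', r, rn);
}

/* Walk the list until NULL, feeding each result into the next stage.
   In this sketch an empty list yields a NULL result and returns 0. */
int run_list(const stage *list, const char *subject, size_t len,
             char **result, size_t *result_len)
{
    const char *current = subject;
    size_t current_len = len;
    char *interim = NULL;
    int total = 0;

    assert(subject != NULL && result != NULL && result_len != NULL);
    *result = NULL;
    *result_len = 0;

    for (const stage *s = list; s != NULL; s = s->next) {
        char *out = NULL;
        size_t out_len = 0;
        int hits = s->run(current, current_len, &out, &out_len);

        free(interim);         /* never the caller's subject */
        interim = NULL;
        if (hits < 0) {
            return hits;       /* which stage failed is not reported */
        }
        total += hits;
        interim = out;
        current = out;
        current_len = out_len;
    }
    *result = interim;
    *result_len = current_len;
    return total;
}
```

Running b_to_a followed by a_to_upper on "abba" gives "AAAA" with an
accumulated count of 6 (two changes in the first stage, four in the
second), which shows how the return value sums over all stages.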
QUOTING
The quote character is (surprise!) '\'. It quotes the delimiter in a
command, the '$' in a substitute, and, of course, itself. Note that
the '$' doesn't need to be quoted if it isn't followed by [0-9+'`&].
For quoting in the pattern, please refer to PCRE(3).
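When a literal string has to be passed as a substitute without the T
option, the quoting rules above can be applied mechanically. The sketch
below uses a hypothetical helper, quote_substitute(), that is not part
of PCRS. It quotes every backslash, and quotes a '$' only when it is
followed by one of [0-9+'`&]. It does not quote a command delimiter, so
it is meant for strings handed to pcrs_compile() rather than embedded in
a command.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a newly allocated copy of `literal` in
 * which '\' is always quoted and '$' is quoted when the next character
 * would turn it into a reference. Returns NULL if allocation fails.
 * The caller frees the result.
 */
char *quote_substitute(const char *literal)
{
    assert(literal != NULL);

    size_t len = strlen(literal);
    char *quoted = malloc(2 * len + 1);   /* worst case: all quoted */
    size_t n = 0;

    if (quoted == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < len; i++) {
        char c = literal[i];
        char next = literal[i + 1];       /* at worst the terminator */
        int needs_quote =
            c == '\\' ||
            (c == '$' && next != '\0' &&
             strchr("0123456789+'`&", next) != NULL);

        if (needs_quote) {
            quoted[n++] = '\\';
        }
        quoted[n++] = c;
    }
    quoted[n] = '\0';
    return quoted;
}
```

For example, the literal a$1b$xc\d becomes a\$1b$xc\\d: the first '$'
is quoted because a digit follows, the second is left alone, and the
backslash is doubled.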
DIAGNOSTICS
When compiling a job via either the pcrs_compile() or the
pcrs_compile_command() function, you know that something went wrong when
you are returned a NULL pointer. In that case, or in the event of
non-fatal warnings, the integer pointed to by errptr contains a nonzero
error code, which is either a passed-through PCRE error code or one
generated by PCRS. Under normal circumstances, it can take the
following values:
PCRE_ERROR_NOMEMORY
While compiling the pattern, PCRE ran out of memory.
PCRS_ERR_NOMEM
While compiling the job, PCRS ran out of memory.
PCRS_ERR_CMDSYNTAX
pcrs_compile_command() didn't find four tokens while parsing the
command.
PCRS_ERR_STUDY
A PCRE error occurred while studying the compiled pattern. Since
pcre_study() only provides textual diagnostic information, the
details are lost.
PCRS_WARN_BADREF
The substitute contains a reference to a capturing subpattern
that has a higher index than the number of capturing subpatterns
in the pattern or that exceeds the current hard limit of 33 (See
LIMITATIONS below). As in Perl, this is non-fatal and results in
substitutions with the empty string.
When executing jobs via pcrs_execute() or pcrs_execute_list(), a
negative return code indicates an error. In that case, *result is NULL.
Possible error codes are:
PCRE_ERROR_NOMEMORY
While matching the pattern, PCRE ran out of memory. This can
only happen if there are more than 33 backreferences in the
pattern(!) and memory is too tight to extend storage for more.
PCRS_ERR_NOMEM
While executing the job, PCRS ran out of memory.
PCRS_ERR_BADJOB
The pcrs_job* passed to pcrs_execute was NULL, or the job is
bogus (it contains NULL pointers to the compiled pattern, extra,
or substitute).
If you see any other PCRE error code passed through, you've either
messed with the compiled job or found a bug in PCRS. Please send me an
email.
Ah, and don't look for PCRE_ERROR_NOMATCH, since this is not an error
in the context of PCRS. Should there be no match, an exact copy of the
subject is found at *result and the return code is 0 (matches).
All error codes can be translated into human readable text by means of
the pcrs_strerror() function.
EXAMPLE
A trivial command-line test program for PCRS might look like:
#include <pcrs.h>
#include <stdio.h>
#include <stdlib.h>   /* free() */
#include <string.h>   /* strlen() */

int main(int Argc, char **Argv)
{
   pcrs_job *job;
   char *result;
   size_t newsize;
   int err;

   if (Argc != 3)
   {
      fprintf(stderr, "Usage: %s s/pattern/substitute/[options] subject\n", Argv[0]);
      return 1;
   }

   /* Compile the command; without a job there is nothing to execute. */
   if (NULL == (job = pcrs_compile_command(Argv[1], &err)))
   {
      fprintf(stderr, "%s: compile error: %s (%d).\n", Argv[0], pcrs_strerror(err), err);
      return 1;
   }

   /* Apply the job to the subject; a negative return value is an error. */
   if (0 > (err = pcrs_execute(job, Argv[2], strlen(Argv[2]), &result, &newsize)))
   {
      fprintf(stderr, "%s: exec error: %s (%d).\n", Argv[0], pcrs_strerror(err), err);
   }
   else
   {
      printf("Result: *%s*\n", result);
      free(result);
   }

   pcrs_free_job(job);
   return(err < 0);
}
LIMITATIONS
The number of matches that a global job can have is only limited by the
available memory. An initial storage for 40 matches is reserved, which
is dynamically resized by the factor 1.6 whenever it is exhausted.
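To get a feeling for the growth policy, the following sketch computes
how the match storage would grow from 40 entries by a factor of 1.6.
The names INITIAL_MATCHES, grown_capacity() and resizes_needed() are
hypothetical, and the factor is applied in integer arithmetic (times 8,
divided by 5, rounding down), which is an assumption of this example
and may differ from the rounding PCRS uses.

```c
#include <assert.h>
#include <stddef.h>

/* Initial storage as stated above; the name is made up for this sketch. */
enum { INITIAL_MATCHES = 40 };

/* Grow a capacity by the factor 1.6, rounding down (assumed rounding). */
size_t grown_capacity(size_t capacity)
{
    assert(capacity > 0);
    return capacity * 8 / 5;
}

/* Number of resize steps before `needed` matches fit into the storage. */
unsigned resizes_needed(size_t needed)
{
    size_t capacity = INITIAL_MATCHES;
    unsigned steps = 0;

    while (capacity < needed) {
        capacity = grown_capacity(capacity);
        steps++;
    }
    return steps;
}
```

Under these assumptions the capacity runs 40, 64, 102, 163, 260, so a
global job with 100 matches would need two resize steps.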
The number of capturing subpatterns is currently limited to 33, which
is a Bad Thing[tm]. It should be dynamically expanded until it reaches
the PCRE limit of 99.
This limitation is particularly embarrassing since PCRE 3.5 has raised
the capturing subpattern limit to 65K.
All of the above values can be adjusted in the "Capacity" section of
pcrs.h.
The Perl-style escape sequences for special characters \nnn, \xnn, and
\cX are currently unsupported.
BUGS
This library has only been tested in the context of one application and
should be considered high risk.
HISTORY
PCRS was originally written for the Privoxy project
(http://www.privoxy.org/).
SEE ALSO
PCRE(3), perl(1), perlre(1)
AUTHOR
PCRS is Copyright 2000 - 2003 by Andreas Oesterhelt
<andreas@oesterhelt.org> and is licensed under the terms of the GNU
Lesser General
Public License (LGPL), version 2.1, which should be included in this
distribution, with the exception that the permission to replace that
license with the GNU General Public License (GPL) given in section 3 is
restricted to version 2 of the GPL.
If it is missing from this distribution, the LGPL can be obtained from
http://www.gnu.org/licenses/lgpl.html or by mail: Write to the Free
Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA
02111-1307, USA.
pcrs-0.0.3 2 December 2003 PCRS(3)