## Example files for the title:
# Mastering Regular Expressions, by Jeffrey Friedl
[![Mastering Regular Expressions, by Jeffrey Friedl](http://akamaicovers.oreilly.com/images/9781565922570/cat.gif)](https://www.safaribooksonline.com/)
The following applies to example files from material published by O’Reilly Media, Inc. Content from other publishers may include different rules of usage. Please refer to any additional usage rights explained in the actual example files or refer to the publisher’s website.
O'Reilly books are here to help you get your job done. In general, you may use the code in O'Reilly books in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from our books does not require permission. Answering a question by citing our books and quoting example code does not require permission. On the other hand, selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Incorporating a significant amount of example code from our books into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN.
If you think your use of code examples falls outside fair use or the permission given here, feel free to contact us at <permissions@oreilly.com>.
Please note that the examples are not production code and have not been carefully tested. They are provided "as-is" and come with no warranty of any kind.
#
# Program to build a regex to match an internet email address,
# from Chapter 7 of _Mastering Regular Expressions_ (Friedl / O'Reilly)
# (http://www.oreilly.com/catalog/regex/)
#
# Optimized version.
#
# Copyright 1997 O'Reilly & Associates, Inc.
#
# Some things for avoiding backslashitis later on.
$esc = '\\\\'; $Period = '\.';
$space = '\040'; $tab = '\t';
$OpenBR = '\['; $CloseBR = '\]';
$OpenParen = '\('; $CloseParen = '\)';
$NonASCII = '\x80-\xff'; $ctrl = '\000-\037';
$CRlist = '\n\015'; # note: this should really be only \015.
# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist\"]/; # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/; # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >; # an escaped character
##############################################################################
# Items 22 and 23, comment.
# Impossible to do properly with a regex; I make do by allowing at most one level of nesting.
$ctext = qq< [^$esc$NonASCII$CRlist()] >;
# $Cnested matches one non-nested comment.
# It is unrolled, with normal of $ctext, special of $quoted_pair.
$Cnested = qq<
$OpenParen # (
$ctext* # normal*
(?: $quoted_pair $ctext* )* # (special normal*)*
$CloseParen # )
>;
# $comment allows one level of nested parentheses
# It is unrolled, with normal of $ctext, special of ($quoted_pair|$Cnested)
$comment = qq<
$OpenParen # (
$ctext* # normal*
(?: # (
(?: $quoted_pair | $Cnested ) # special
$ctext* # normal*
)* # )*
$CloseParen # )
>;
##############################################################################
# $X is optional whitespace/comments.
$X = qq<
[$space$tab]* # Nab whitespace.
(?: $comment [$space$tab]* )* # If comment found, allow more spaces.
>;
# Item 10: atom
$atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq<
$atom_char+ # some number of atom characters...
(?!$atom_char) # ..not followed by something that could be part of an atom
>;
# Item 11: doublequoted string, unrolled.
$quoted_str = qq<
\" # "
$qtext * # normal
(?: $quoted_pair $qtext * )* # ( special normal* )*
\" # "
>;
# Item 7: word is an atom or quoted string
$word = qq<
(?:
$atom # Atom
| # or
$quoted_str # Quoted string
)
>;
# Item 12: domain-ref is just an atom
$domain_ref = $atom;
# Item 13: domain-literal is like a quoted string, but [...] instead of "..."
$domain_lit = qq<
$OpenBR # [
(?: $dtext | $quoted_pair )* # stuff
$CloseBR # ]
>;
# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain = qq<
(?:
$domain_ref
|
$domain_lit
)
$X # optional trailing comments
>;
# Item 6: domain is a list of subdomains separated by dots.
$domain = qq<
$sub_domain
(?:
$Period $X $sub_domain
)*
>;
# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon.
$route = qq<
\@ $X $domain
(?: , $X \@ $X $domain )* # additional domains
:
$X # optional trailing comments
>;
# Item 5: local-part is a bunch of $word separated by periods
$local_part = qq<
$word $X
(?:
$Period $X $word $X # additional words
)*
>;
# Item 2: addr-spec is local@domain
$addr_spec = qq<
$local_part \@ $X $domain
>;
# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[
< $X # <
(?: $route )? # optional route
$addr_spec # address spec
> # >
];
# Item 3: phrase........
$phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab
# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char =
qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
# We've worked it so that $word, $comment, and $quoted_str do not consume trailing $X
# because we take care of it manually.
$phrase = qq<
$word # leading word
$phrase_char * # "normal" atoms and/or spaces
(?:
(?: $comment | $quoted_str ) # "special" comment or quoted string
$phrase_char * # more "normal"
)*
>;
## Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq<
$X # optional leading comment
(?:
$addr_spec # address
| # or
$phrase $route_addr # name and address
)
>;
###########################################################################
# Here's a little snippet to test it.
# Addresses given on the commandline are described.
#
my $error = 0;
my $valid;
foreach $address (@ARGV) {
$valid = $address =~ m/^$mailbox$/xo;
printf "`$address' is syntactically %s.\n", $valid ? "valid" : "invalid";
$error = 1 if not $valid;
}
exit $error;
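The comments in the optimized script describe several sub-expressions as "unrolled", in the shape `normal* (special normal*)*`. The following is a minimal sketch of that idea for a double-quoted string, not the book's code: the names `$normal`, `$special`, `$plain`, `$unrolled`, and `agree` are hypothetical, and the character classes are simplified from `$qtext` and `$quoted_pair`. It only checks that the plain alternation and the unrolled form accept and reject the same sample inputs.

```perl
use strict;
use warnings;

# Hypothetical, simplified building blocks (not the script's own variables).
my $normal  = qr/[^\\"]/;   # any character except backslash and double quote
my $special = qr/\\./s;     # a backslash followed by any one character

# Plain alternation: ( normal | special )*
my $plain    = qr/^"(?:$normal|$special)*"$/;

# Unrolled form: normal* ( special normal* )*
my $unrolled = qr/^"$normal*(?:$special$normal*)*"$/;

# Returns (forms_agree, matched) for one input string.
sub agree {
    my ($text) = @_;
    my $p = $text =~ $plain    ? 1 : 0;
    my $u = $text =~ $unrolled ? 1 : 0;
    return ($p == $u, $p);
}

for my $text ('"abc"', '"a\\"b"', '"unterminated', '"trailing\\"') {
    my ($same, $matched) = agree($text);
    printf "%-16s %s (forms %s)\n",
        $text,
        $matched ? 'matches' : 'fails',
        $same    ? 'agree'   : 'DISAGREE';
}
```

Both forms describe the same set of strings. The unrolled one lets the engine consume each run of ordinary characters in a single step instead of re-entering the alternation for every character.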
#
# Program to build a regex to match an internet email address,
# from Chapter 7 of _Mastering Regular Expressions_ (Friedl / O'Reilly)
# (http://www.oreilly.com/catalog/regex/)
#
# Unoptimized version.
#
# Copyright 1997 O'Reilly & Associates, Inc.
#
# Some things for avoiding backslashitis later on.
$esc = '\\\\'; $Period = '\.';
$space = '\040'; $tab = '\t';
$OpenBR = '\['; $CloseBR = '\]';
$OpenParen = '\('; $CloseParen = '\)';
$NonASCII = '\x80-\xff'; $ctrl = '\000-\037';
$CRlist = '\n\015'; # note: this should really be only \015.
# Items 19, 20, 21
$qtext = qq/[^$esc$NonASCII$CRlist\"]/; # for within "..."
$dtext = qq/[^$esc$NonASCII$CRlist$OpenBR$CloseBR]/; # for within [...]
$quoted_pair = qq< $esc [^$NonASCII] >; # an escaped character
# Item 10: atom
$atom_char = qq/[^($space)<>\@,;:\".$esc$OpenBR$CloseBR$ctrl$NonASCII]/;
$atom = qq<
$atom_char+ # some number of atom characters...
(?!$atom_char) # ..not followed by something that could be part of an atom
>;
# Items 22 and 23, comment.
# Impossible to do properly with a regex; I make do by allowing at most one level of nesting.
$ctext = qq< [^$esc$NonASCII$CRlist()] >;
$Cnested = qq< $OpenParen (?: $ctext | $quoted_pair )* $CloseParen >;
$comment = qq< $OpenParen
(?: $ctext | $quoted_pair | $Cnested )*
$CloseParen >;
$X = qq< (?: [$space$tab] | $comment )* >; # optional separator
# Item 11: doublequoted string, with escaped items allowed
$quoted_str = qq<
\" (?: # opening quote...
$qtext # Anything except backslash and quote
| # or
$quoted_pair # Escaped something (something != CR)
)* \" # closing quote
>;
# Item 7: word is an atom or quoted string
$word = qq< (?: $atom | $quoted_str ) >;
# Item 12: domain-ref is just an atom
$domain_ref = $atom;
# Item 13: domain-literal is like a quoted string, but [...] instead of "..."
$domain_lit = qq< $OpenBR # [
(?: $dtext | $quoted_pair )* # stuff
$CloseBR # ]
>;
# Item 9: sub-domain is a domain-ref or domain-literal
$sub_domain = qq< (?: $domain_ref | $domain_lit ) >;
# Item 6: domain is a list of subdomains separated by dots.
$domain = qq< $sub_domain # initial subdomain
(?: #
$X $Period # if led by a period...
$X $sub_domain # ...further okay
)*
>;
# Item 8: a route. A bunch of "@ $domain" separated by commas, followed by a colon
$route = qq< \@ $X $domain
(?: $X , $X \@ $X $domain )* # further okay, if led by comma
: # closing colon
>;
# Item 5: local-part is a bunch of $word separated by periods
$local_part = qq< $word # initial word
(?: $X $Period $X $word )* # further okay, if led by a period
>;
# Item 2: addr-spec is local@domain
$addr_spec = qq< $local_part $X \@ $X $domain >;
# Item 4: route-addr is <route? addr-spec>
$route_addr = qq[ < $X # leading <
(?: $route $X )? # optional route
$addr_spec # address spec
$X > # trailing >
];
# Item 3: phrase
$phrase_ctrl = '\000-\010\012-\037'; # like ctrl, but without tab
# Like atom-char, but without listing space, and uses phrase_ctrl.
# Since the class is negated, this matches the same as atom-char plus space and tab
$phrase_char =
qq/[^()<>\@,;:\".$esc$OpenBR$CloseBR$NonASCII$phrase_ctrl]/;
$phrase = qq< $word # one word, optionally followed by....
(?:
$phrase_char | # atom and space parts, or...
$comment | # comments, or...
$quoted_str # quoted strings
)*
>;
# Item #1: mailbox is an addr_spec or a phrase/route_addr
$mailbox = qq< $X # optional leading comment
(?: $addr_spec # address
| # or
$phrase $route_addr # name and address
) $X # optional trailing comment
>;
###########################################################################
# Here's a little snippet to test it.
# Addresses given on the commandline are described.
#
my $error = 0;
my $valid;
foreach $address (@ARGV) {
$valid = $address =~ m/^$mailbox$/xo;
printf "`$address' is syntactically %s.\n", $valid ? "valid" : "invalid";
$error = 1 if not $valid;
}
exit $error;
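The script's comments note that comments are handled by "allowing at most one level of nesting". The sketch below illustrates that limit in isolation, assuming simplified character classes. The names `$text_char`, `$pair`, `$flat`, `$one_level`, and `is_comment` are hypothetical stand-ins, not the script's own `$ctext`, `$quoted_pair`, `$Cnested`, and `$comment`.

```perl
use strict;
use warnings;

# Hypothetical, simplified stand-ins for the script's comment pieces.
my $text_char = qr/[^\\()]/;                             # ordinary comment text
my $pair      = qr/\\./s;                                # an escaped character
my $flat      = qr/\((?:$text_char|$pair)*\)/;           # comment with no nesting
my $one_level = qr/\((?:$text_char|$pair|$flat)*\)/;     # allows one nested level

# Returns 1 if the whole string is a comment under the one-level rule, else 0.
sub is_comment {
    my ($text) = @_;
    return $text =~ /^$one_level$/ ? 1 : 0;
}

for my $text ('(a)', '(a (b) c)', '(a \\( b)', '(a (b (c)))') {
    printf "%-14s %s\n", $text, is_comment($text) ? 'accepted' : 'rejected';
}
```

A second level of nesting is rejected because `$flat` cannot itself contain a parenthesized group. Supporting deeper nesting this way would take one more hand-written layer per level.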
<HTML>
<HEAD>
<TITLE>Mastering Regular Expressions - Examples</TITLE>
</head>
<BODY>
<H1>Examples</H1>
<p>At the moment, I have only the email address regex program (from Chapter 7
and Appendix B), as it's the most substantive bit of code.</p>
<P>There is an <A
href="email-opt.pl">optimized version</A> from Appendix B and the
<A href="email-unopt.pl">unoptimized version</A> discussed in Chapter 7.</p>
<P>DOS users may find these versions
(<A href="email-opt.txt">optimized</A> and
<A href="email-unopt.txt">unoptimized</A>) easier to download -- the Perl is
the same, but the lines end with CR/LF, and the URLs end with
``<TT>.txt</TT>''. </P>
</BODY></HTML>