Regular Expressions in PHP

Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are patterns that describe a set of strings and are used to match text. Many text editors, utilities, and programming languages use them to search and manipulate text. For example, Perl and Ruby build regular expressions directly into their syntax, and Tcl ships them as built-in commands.

Regular expressions are often used in validation classes, because they are a powerful way to check the format of e-mail addresses, telephone numbers, street addresses, zip codes, and more.

Regular expression types:

PHP has offered two types of regular expressions:

  • POSIX Extended
  • Perl Compatible

The ereg, eregi, ... functions are the POSIX versions, and preg_match, preg_replace, ... are the Perl-compatible versions. The POSIX functions were deprecated in PHP 5.3 and removed in PHP 7.0, so new code should use the preg_ functions. With Perl-compatible regular expressions, the pattern must be enclosed in delimiters, for example a pair of forward slashes (/). This version is also more powerful, and usually faster, than the POSIX one.
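As a minimal sketch of the delimiter rule, the snippet below runs preg_match with two different delimiters and with the i modifier. The sample strings and variable names are assumptions chosen for illustration only.

```php
<?php
// Sample subject string (illustrative only).
$subject = 'Learning PHP regular expressions';

// Forward slashes as delimiters; the trailing i makes the match case-insensitive.
$found = preg_match('/php/i', $subject);

// Without the i modifier the same pattern is case-sensitive and finds nothing here.
$caseSensitive = preg_match('/php/', $subject);

// Other delimiters are allowed. Using # avoids escaping the slashes in a path.
$isImagePath = preg_match('#^/images/[a-z]+\.jpg$#', '/images/logo.jpg');

var_dump($found, $caseSensitive, $isImagePath); // int(1) int(0) int(1)
```

preg_match returns 1 when the pattern matches, 0 when it does not, and false on error, so comparing the result with === 1 is the safest check.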

Literals and Metacharacters

Literals are normal text characters and can include whitespace (tabs, spaces, newlines, etc.). Unless modified by a metacharacter, a literal matches itself on a one-for-one basis. Metacharacters have a special meaning, and their power lies in how they are arranged and combined, for example in nested groups. To find an instance of a metacharacter itself, for instance a caret (^) or a backslash, escape it with a backslash (\).
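The effect of escaping can be sketched as follows. The search text is a made-up example, and preg_quote is PHP's built-in helper for escaping a whole string at once.

```php
<?php
// Hypothetical search text that happens to contain metacharacters.
$needle   = '1+1=2?';
$haystack = 'Is 1+1=2? Yes.';

// Unescaped, + and ? act as quantifiers, so the literal text is not found.
$unescaped = preg_match('/1+1=2?/', $haystack);

// Escaped by hand with backslashes.
$byHand = preg_match('/1\+1=2\?/', $haystack);

// Escaped by preg_quote(); the second argument is the delimiter in use.
$quoted = preg_match('/' . preg_quote($needle, '/') . '/', $haystack);

var_dump($unescaped, $byHand, $quoted); // int(0) int(1) int(1)
```

preg_quote is usually the better choice when the text to search for comes from user input.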

Metacharacter   Match
\               the escape character - used to find an instance of a metacharacter like a period, brackets, etc.
. (period)      match any character except newline
x               match any instance of x
[^x]            match any character except x
[x]             match any instance of x in the bracketed range - [abxyz] will match any instance of a, b, x, y, or z
| (pipe)        an OR operator - (x|y) will match an instance of x or y
()              used to group sequences of characters or matches
{}              used to define numeric quantifiers
{x}             match must occur exactly x times
{x,}            match must occur at least x times
{x,y}           match must occur at least x times, but no more than y times
?               preceding match is optional (zero or one occurrence), same as {0,1}
*               find 0 or more of preceding match, same as {0,}
+               find 1 or more of preceding match, same as {1,}
^               match the beginning of the line
$               match the end of a line

 

Detailed descriptions of regex operators

Within these descriptions, x is used as a placeholder for examples - x can be an actual x or it can be an entire sequence like href="http://www.evolt.org", <DIV>, or ((\.\.)?/[a-z]+\.jpg).

. - Matches any one character except newline and is generally used with quantifiers, which will be explained below. For instance, .{3} would match any three consecutive characters, not only three-letter words.

x - Matches any instance of x and can include specific character sets or ranges, for instance, [wxyz] would match any instance of w, x, y, or z, but not wz, yx, or other combinations of the given character set, unless it was followed by a quantifier.

[^x] - Matches any character that is not x. The caret must be the first character inside the brackets, and it can also be used with a range. For example, <[^abel]+> would match one or more characters that are not a, b, e, or l, and which are surrounded by < and >, thus it would match <font> but not <table>.

[x] - Matches any character in the given range. Examples of a range would be the expression [0-9], which would find a single digit, or [a-z], which would find a single lower case character. You can combine ranges as well - [A-Za-z0-9] will find a single upper or lower case character or digit. Do not separate ranges with commas or spaces: inside brackets they are literal characters, so [0-3, 5-8] would also match a comma or a space. To find any digit that isn't 4 or 9, write [0-35-8].
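A short sketch of ranges inside brackets, using made-up sample strings. It also shows what happens when a comma is placed between two ranges.

```php
<?php
// A single digit anywhere in the subject.
$hasDigit = preg_match('/[0-9]/', 'room 42');

// Combined ranges: the whole string must be letters or digits.
$onlyAlnum = preg_match('/^[A-Za-z0-9]+$/', 'abc123');

// Two ranges written back to back: 0-3 or 5-8, so 4 and 9 are excluded.
$four  = preg_match('/^[0-35-8]$/', '4');
$seven = preg_match('/^[0-35-8]$/', '7');

// A comma inside the brackets is just another character to match.
$comma = preg_match('/^[0-3, 5-8]$/', ',');

var_dump($hasDigit, $onlyAlnum, $four, $seven, $comma);
// int(1) int(1) int(0) int(1) int(1)
```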

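| - The OR operator separates alternatives, which can be single characters or whole sequences. (x|y) will find instances of x or y and you aren't limited to just two alternatives - (w|x|y|z) is perfectly valid. Inside square brackets the pipe is a literal character, so [x|y] would also match |.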

() - Parentheses are used to group operators much like basic algebra and are also used to delineate a backreference, which is the way you can do replaces with matches. (Backreferences get their own section below). A simple example would look something like: www\.([a-z]+)\.com.
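To illustrate grouping, here is a small sketch that uses the www\.([a-z]+)\.com pattern from above. The sample sentence and the replacement text are assumptions made for the example.

```php
<?php
$pattern = '/www\.([a-z]+)\.com/';
$text    = 'Visit www.example.com today';

// preg_match() fills $matches: index 0 is the whole match, index 1 the first group.
preg_match($pattern, $text, $matches);
echo $matches[0], "\n"; // www.example.com
echo $matches[1], "\n"; // example

// In a replacement string, $1 refers back to the first group.
$rewritten = preg_replace($pattern, '$1.org', $text);
echo $rewritten, "\n";  // Visit example.org today
```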

{} - Curly brackets (or braces) are used to define numeric quantifiers, which allow you to specify the optional, minimum, or maximum number of occurrences in the match. x{3} would find exactly 3 occurrences of x. x{3,} matches on at least 3 occurrences of x. x{3,5} matches at least 3 occurrences of x and no more than 5.

? - The preceding match is optional: it may occur once or not at all. An example would be: ((\.\.)?/[a-z]+\.jpg) which matches a path to an image file ending in .jpg and could start with a ../ or just a /. When the pattern is anchored with ^ and $, a path starting with ./ or ../../ would fail to match that particular expression.

* - Matches the preceding character or group 0 or more times. Note that this is not the same as the use of the ? listed above. z* can match no z, z, or for those readers who have already fallen asleep, zzzzzzzzzzzzzzzzzzzzzzz.

+ - Matches the preceding character or group 1 or more times. In comparison to the previous example, z+ would have to match at least z or zz or zzz and so on.
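The quantifiers described above can be compared side by side. This is only a sketch: the subjects are made-up samples, and the image pattern is anchored with ^ and $ so that the whole string has to match.

```php
<?php
$exactlyThree = preg_match('/^x{3}$/', 'xxx');       // 1: exactly three x's
$tooMany      = preg_match('/^x{3,5}$/', 'xxxxxx');  // 0: six x's is more than five
$optional     = preg_match('/^colou?r$/', 'color');  // 1: the u may be absent
$starEmpty    = preg_match('/^z*$/', '');            // 1: * accepts zero occurrences
$plusEmpty    = preg_match('/^z+$/', '');            // 0: + needs at least one

// The image path example, with # as delimiter to avoid escaping the slash.
$image   = '#^(\.\.)?/[a-z]+\.jpg$#';
$parent  = preg_match($image, '../photo.jpg');       // 1
$current = preg_match($image, './photo.jpg');        // 0
$twoUp   = preg_match($image, '../../photo.jpg');    // 0
```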

^ - Used to force a match to the beginning of a line. Note that this is not the same as a character exclusion such as [^xyz], which would match any characters that are not x, y, or z. ^Hello would match at the beginning of a line such as Hello Chris and would not match Chris said Hello.

$ - Used to force a match to the end of a line. end$ would match at the end of a line such as This is the end and would not match end this article already!
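A quick sketch of both anchors, reusing the sentences from the descriptions above. The last call assumes the m modifier, which makes ^ and $ apply to every line of a multi-line string.

```php
<?php
$startHit  = preg_match('/^Hello/', 'Hello Chris');              // 1
$startMiss = preg_match('/^Hello/', 'Chris said Hello');         // 0
$endHit    = preg_match('/end$/', 'This is the end');            // 1
$endMiss   = preg_match('/end$/', 'end this article already!');  // 0

// With the m modifier, ^ matches at the start of each line.
$count = preg_match_all('/^\w+/m', "first line\nsecond line", $firstWords);
print_r($firstWords[0]); // first, second
```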

An example pattern to check valid emails looks like this:

Code: 
^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$

In this expression we have:
'^' for the start of the string,
'[a-zA-Z0-9._-]' for any alphanumeric character, dot (.), underscore (_) or hyphen (-),
'+' for one or more of the previous expression,
'@' for a literal @ character,
'[a-zA-Z0-9-]+' for one or more alphanumeric characters or hyphens (-),
'\.' for a dot (.), preceded by a backslash that says we really want a dot, not the special (.) wildcard character,
'[a-zA-Z.]{2,5}' for two to five letters or dots,
'$' for the end of the string.
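As a rough sketch, the pattern can be wrapped in a small helper. The function name looksLikeEmail and the sample addresses are hypothetical. The pattern is a simplification that rejects some valid addresses, which the last lines show by comparing it with PHP's built-in filter_var.

```php
<?php
// Hypothetical helper around the pattern discussed above.
function looksLikeEmail(string $address): bool
{
    $pattern = '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/';
    return preg_match($pattern, $address) === 1;
}

var_dump(looksLikeEmail('john.doe@example.com')); // bool(true)
var_dump(looksLikeEmail('john@example'));         // bool(false): no dot after the domain

// A limitation: the final part may be at most five characters long.
var_dump(looksLikeEmail('user@example.museum'));  // bool(false)

// filter_var() returns the address when it is valid, or false otherwise.
$builtIn = filter_var('user@example.museum', FILTER_VALIDATE_EMAIL);
var_dump($builtIn !== false);                     // bool(true)
```

For production code, filter_var with FILTER_VALIDATE_EMAIL is usually a safer starting point than a hand-written pattern.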

 

Regular Expression Quick Reference:

Anchors

^ Start of string
\A Start of string
$ End of string
\Z End of string
\b Word boundary
\B Not word boundary
\< Start of word (not supported by PHP's PCRE functions; use \b)
\> End of word (not supported by PHP's PCRE functions; use \b)

Quantifiers

* 0 or more
+ 1 or more
? 0 or 1
{3} Exactly 3
{3,} 3 or more
{3,5} 3, 4 or 5

Groups and Ranges

. Any character except new line (\n)
(a|b) a or b
(...) Group
(?:...) Passive Group
[abc] Range (a or b or c)
[^abc] Not a or b or c
[a-q] Letter between a and q
[A-Q] Upper case letter between A and Q
[0-7] Digit between 0 and 7
\n nth group/subpattern, where n is a digit (e.g. \1)

Note: Ranges are inclusive.

Character Classes

\c Control character
\s White space
\S Not white space
\d Digit
\D Not digit
\w Word
\W Not word
\x Hexadecimal digit (not a class in PHP's PCRE; use [0-9A-Fa-f])
\O Octal digit (not a class in PHP's PCRE; use [0-7])

Special Characters

\n New line
\r Carriage return
\t Tab
\v Vertical tab
\f Form feed
\xxx Octal character xxx
\xhh Hex character hh

Pattern Modifiers

g Global match (not a PHP modifier; use preg_match_all or preg_replace)
i Case-insensitive
m Multiple lines
s Treat string as single line
x Allow comments and white space in pattern
e Evaluate replacement (removed in PHP 7.0; use preg_replace_callback)
U Ungreedy pattern

POSIX

[:upper:] Upper case letters
[:lower:] Lower case letters
[:alpha:] All letters
[:alnum:] Digits and letters
[:digit:] Digits
[:xdigit:] Hexadecimal digits
[:punct:] Punctuation
[:blank:] Space and tab
[:space:] White space characters
[:cntrl:] Control characters
[:graph:] Printed characters
[:print:] Printed characters and spaces
[:word:] Digits, letters and underscore

Assertions

?= Lookahead assertion
?! Negative lookahead
?<= Lookbehind assertion
?<! Negative lookbehind
?> Once-only Subexpression
?() Condition [if then]
?()| Condition [if then else]
?# Comment
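A few of the modifiers and assertions listed above can be tried out as follows. This is only a sketch with made-up subject strings, not a complete tour of the reference.

```php
<?php
// s modifier: the dot also matches a new line.
$dotDefault = preg_match('/a.c/', "a\nc");   // 0
$dotAll     = preg_match('/a.c/s', "a\nc");  // 1

// x modifier: white space and # comments inside the pattern are ignored.
$time = '/
    ^(\d{2})   # hours
    :          # separator
    (\d{2})$   # minutes
/x';
preg_match($time, '09:45', $parts);          // $parts[1] = "09", $parts[2] = "45"

// U modifier: quantifiers become ungreedy.
preg_match('/<.+>/', '<b>bold</b>', $greedy);   // $greedy[0] = "<b>bold</b>"
preg_match('/<.+>/U', '<b>bold</b>', $lazy);    // $lazy[0] = "<b>"

// Lookahead: digits only when followed by " USD", which is not part of the match.
preg_match('/\d+(?= USD)/', 'Total: 150 USD', $amount);  // $amount[0] = "150"

// A POSIX class is written inside a bracket expression.
$letters = preg_match('/^[[:alpha:]]+$/', 'Regex');      // 1
```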

 

Sample Patterns

([A-Za-z0-9-]+)                             Letters, numbers and hyphens
(\d{1,2}\/\d{1,2}\/\d{4})                   Date (e.g. 21/3/2006)
([^\s]+(?=\.(jpg|gif|png))\.\2)             jpg, gif or png image
(^[1-9]{1}$|^[1-4]{1}[0-9]{1}$|^50$)        Any number from 1 to 50 inclusive
(#?([A-Fa-f0-9]){3}(([A-Fa-f0-9]){3})?)     Valid hexadecimal colour code
((?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15})     String with at least one upper case letter, one lower case letter, and one digit (useful for passwords).
(\w+@[a-zA-Z_]+?\.[a-zA-Z]{2,6})            Email addresses
(\<(/?[^\>]+)\>)                            HTML Tags

 

Note: These patterns are intended for reference purposes and have not been extensively tested. Please use with caution and test thoroughly before use.
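As an example of testing before use, the sketch below tries two of the sample patterns. The anchors (^ and $) are added here so that the whole string must match. The helper isRealDate is hypothetical and assumes day/month/year order. It also shows that a pattern only checks the shape of a date, so checkdate is used for the calendar check.

```php
<?php
// Password sample pattern, anchored so the whole string must be 8 to 15 characters.
$password = '/^(?=.*\d)(?=.*[a-z])(?=.*[A-Z]).{8,15}$/';
$strong = preg_match($password, 'Secret123');  // 1
$weak   = preg_match($password, 'secret123');  // 0: no upper case letter

// Hypothetical helper: shape check with the date pattern, then a calendar check.
function isRealDate(string $text): bool
{
    if (preg_match('#^(\d{1,2})/(\d{1,2})/(\d{4})$#', $text, $m) !== 1) {
        return false;
    }
    // checkdate() expects month, day, year.
    return checkdate((int) $m[2], (int) $m[1], (int) $m[3]);
}

var_dump(isRealDate('21/3/2006'));  // bool(true)
var_dump(isRealDate('31/2/2006'));  // bool(false): matches the pattern, but no such day
var_dump(isRealDate('2006-03-21')); // bool(false): wrong shape
```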

 

Some Examples:

Here is a function that wraps a series of useful regular expressions.

<?php
// Check $var against the pattern registered for $type.
// Returns true on a match and false otherwise (also for an unknown $type).
// preg_match() is used because the ereg() family was removed in PHP 7.0.
function isValid($type, $var) {
    $patterns = array(
        // Four dot-separated groups of 1-3 digits (the 0-255 range is not checked)
        "IP"       => '/^([0-9]{1,3}\.){3}[0-9]{1,3}$/',
        // Host name ending in one of a few top-level domains
        "URL"      => '/^[a-zA-Z0-9.-]+\.(com|org|net|mil|edu)$/',
        // 123-45-6789, 123 45 6789 or 123456789; the group keeps ^ and $ on both alternatives
        "SSN"      => '/^([0-9]{3}[- ][0-9]{2}[- ][0-9]{4}|[0-9]{9})$/',
        // Four groups of four digits, or sixteen digits in a row
        "CC"       => '/^(([0-9]{4}[- ]){3}[0-9]{4}|[0-9]{16})$/',
        // Nine digits followed by a digit or X (ISBN-10 shape, checksum not verified)
        "ISBN"     => '/^[0-9]{9}[0-9Xx]$/',
        // YYYY-M-D or YYYY-MM-DD (shape only)
        "Date"     => '/^([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})$/',
        // US zip code with optional +4 part
        "Zip"      => '/^[0-9]{5}(-[0-9]{4})?$/',
        "Email"    => '/^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,5}$/',
        // Optional area code as (123) or 123-, then 123-4567
        "Phone"    => '/^((\([0-9]{3}\) ?)|([0-9]{3}-))?[0-9]{3}-[0-9]{4}$/',
        // Three or six hex digits with an optional leading #
        "HexColor" => '/^#?[a-fA-F0-9]{3}([a-fA-F0-9]{3})?$/',
        // 3 to 16 letters, digits or underscores
        "User"     => '/^[a-zA-Z0-9_]{3,16}$/',
    );

    if (!isset($patterns[$type])) {
        return false;
    }
    return preg_match($patterns[$type], $var) === 1;
}

# Example:
$phone = "789-1234";
if (isValid("Phone", $phone)) {
    echo "Valid Phone Number";
} else {
    echo "Invalid Phone Number";
}
?>

 

Here's a function I've created that returns an array of every substring found between an opening and a closing delimiter in a string.

<?php
// Collect every piece of $text that sits between $sopener and $scloser.
function Return_Substrings($text, $sopener, $scloser) {
    $result = array();

    // Empty delimiters cannot be searched for.
    if ($sopener === '' || $scloser === '') {
        return $result;
    }

    $offset = 0;
    // Look for the next opener, starting after the previous closer.
    while (($start = strpos($text, $sopener, $offset)) !== false) {
        $start += strlen($sopener);
        $end = strpos($text, $scloser, $start);
        if ($end === false) {
            break; // opener without a matching closer
        }
        $result[] = substr($text, $start, $end - $start);
        $offset = $end + strlen($scloser);
    }
    return $result;
}
?>
Example:
<?php
$string = "<b>bonjour</b> à tous, <b>comment</b> allez-vous ?";
$result = Return_Substrings($string, "<b>", "</b>");
// $result is array("bonjour", "comment")
print_r($result);
?>
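Since this article is about regular expressions, the same result can also be sketched with preg_match_all. This is an alternative to the function above, not a replacement for it, and the name returnSubstringsRegex is hypothetical.

```php
<?php
// Hypothetical regex-based variant: returns the text between each opener/closer pair.
function returnSubstringsRegex(string $text, string $opener, string $closer): array
{
    // preg_quote() escapes any metacharacters in the delimiters.
    // (.*?) is an ungreedy group, and the s modifier lets it span new lines.
    $pattern = '/' . preg_quote($opener, '/') . '(.*?)' . preg_quote($closer, '/') . '/s';
    preg_match_all($pattern, $text, $matches);
    return $matches[1];
}

$string = "<b>bonjour</b> à tous, <b>comment</b> allez-vous ?";
$found  = returnSubstringsRegex($string, '<b>', '</b>');
print_r($found); // bonjour, comment
```

The ungreedy (.*?) matters here: a greedy (.*) would run from the first opener to the last closer and return a single long match.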