Beginner's guide to Regular Expressions with PHP
What are regular expressions, you ask? Why, they are a wonderful way of matching patterns in strings. They're great for validating things like email addresses and postcodes and for doing things like find and replace and syntax highlighting.
Regular expressions are a fair bit slower than normal string functions (although they're still very fast) so if you can do something with a quick strpos() or substr() then you should use those. If you need to use several string functions to do a task, then it's sometimes faster and cleaner to use a regular expression.
I originally wrote this guide for the newgrounds.com programming forums and it was very well received. While this is aimed primarily at PHP, regular expressions are available in most programming languages (including Delphi) so much of the information is applicable in other environments.
I've tried to explain how regular expressions work by going step by step through creating a pattern for email addresses and postcodes. Neither is perfect, so use at your own risk. I will show you how to perfect them with more advanced regular expressions in a later guide, should this one prove popular. Thank you and I hope this is useful.
Things I will be covering:
- Character Groups
- Repetitions
- Anchors
- Quantifiers
- Grouping
- Alternation
- Using preg_* functions in PHP
What you should know:
- PHP at a reasonable standard (although most of this applies to other languages too)
- String manipulation
I heartily recommend that you try out EditPad Pro for testing your regular expressions. It has a really nice highlighting function that will highlight words as you type your regular expressions. Just press ctrl + f to bring up the find box, then tick the regular expressions box. It makes it a lot easier to understand what you are doing this way.
So let's get started with using regular expressions then. Now PHP has had 2 sets of regular expression functions: the ereg series and the preg_* series. The ereg functions were deprecated in PHP 5.3 and removed in PHP 7, so we'll be covering the preg_* functions (which are supposedly faster anyway). The preg_* functions use Perl Compatible Regular Expressions (PCRE), by the way.
So what functions have we got to play with? The three main ones are preg_match(), preg_split() and preg_replace(). They all do pretty much what they say on the tin: preg_match will tell you whether a string matches a regular expression (it returns 1 for a match and 0 for no match; its sibling preg_match_all() counts every match), preg_split will split a string into an array using a regular expression as a separator and finally preg_replace will replace bits of a string that match a regular expression.
Of course, knowing how to use these functions is a bit pointless if you don't understand regular expressions, which are their own little language, so let's cover some basics first.
Pretty much any character you type (except for a few special ones, which can be escaped by prefixing them with a backslash) will be matched literally. You could type hello world and it would just find any occurrences of the string hello world, which isn't terribly useful really. However, if you use character sets ([]s) you can make one position match any of several letters. If we typed h[ae]llo world, it would match hello world and hallo world. Starting to get more useful now, huh?
But what about if we want to match only numbers or only letters? Easily done with the - character! [A-Z] will match any uppercase letter and [0-9] will match any digit. You can use constructs like [1-9] or [a-q] if you so wish. You can have more than one range in one character group: [A-Z0-9] would work fine. If you want a literal - in your character group, put it at the start of the group. So every valid, non-special email character would be [-_0-9A-Z] (I know there's more, don't pull me up on this).
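If you'd like to see character groups doing something, here is a minimal PHP sketch. It borrows preg_match() and the surrounding slashes, both of which are explained properly further down; the sample strings and variable names are just made up for illustration.

```php
<?php
// Try the h[ae]llo world pattern against three greetings.
$results = array();
foreach (array('hello world', 'hallo world', 'hullo world') as $greeting) {
    // [ae] accepts exactly one character: either a or e.
    $results[$greeting] = preg_match('/h[ae]llo world/', $greeting);
}
print_r($results); // hello and hallo give 1, hullo gives 0

// A range inside a group: [0-9] accepts a single digit.
$hasDigit = preg_match('/[0-9]/', 'Room 101');   // 1
$noDigit  = preg_match('/[0-9]/', 'No numbers'); // 0
```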
We could match most UK postcodes like so:
- Code:
[A-Z][0-9A-Z][0-9A-Z] [0-9][A-Z0-9][A-Z0-9]
See how this is getting good for validating stuff?
Problem, of course, is that you don't always know the length of a string and some bits are optional (a UK postcode, for example, can be anywhere between 6 and 8 characters long, including the space). Not to worry though! Regular expressions give us a way of specifying how many times something occurs.
There are two ways to go about it: one for if you need it to occur a set number of times and one for when you don't. We'll leave the greedy method for later and cover the set number method first. All you have to do is type a {#} after a character or group (# is a whole number). We can now simplify our postcode to:
- Code:
[A-Z][0-9A-Z]{2} [0-9][A-Z0-9]{2}
Looks better already, right? But there's a problem. The starting bit of a UK postcode, while usually 3 characters, can be anywhere between 2 and 4 characters long. Not to worry though, we can specify a range of repetitions with {#,#} (no space after the comma). The first # is the minimum number of repetitions and the second is the maximum. The first character is already covered by [A-Z], so the rest of the starting bit needs between 1 and 3 characters. Now we can have:
- Code:
[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}
Still not a perfect UK postcode yet, but it's damned close. But here's the catch: it will get a match even if the string only contains a postcode somewhere inside it. If you want to validate that a string, in its entirety, is a postcode then you'll need to use anchors.
Anchors are dead simple: you put ^ at the start of a regular expression to say "this has to be at the start of the string" and you put a $ at the end of a regular expression to say "this bit has to be at the end of the string". If you use both, you're effectively saying "this is the whole string".
- Code:
^[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}$
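To see what the anchors actually change, here is a small PHP sketch that runs the same postcode pattern with and without them. The preg_match() call and the slashes are covered further down; the sample sentence and variable names are my own assumptions.

```php
<?php
// The postcode pattern from above, without anchors.
$pattern  = '[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}';
$sentence = 'Send it to SW1A 1AA please';

// Without anchors the postcode is found inside the sentence.
$unanchored = preg_match('/' . $pattern . '/', $sentence);    // 1

// With ^ and $ the whole string has to be a postcode.
$anchored   = preg_match('/^' . $pattern . '$/', $sentence);  // 0
$exact      = preg_match('/^' . $pattern . '$/', 'SW1A 1AA'); // 1

var_dump($unanchored, $anchored, $exact);
```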
Anyway, I'm going to leave postcodes alone until I can be bothered to cover more advanced stuff. Now we'll move onto email addresses and go back to our repetitions.
Email addresses can have a stupidly high number of characters, which makes the earlier example using {#}s a bit useless for them. We'll need to use what are called greedy quantifiers. These simply don't know when to stop matching and will always match as much as they can (hence why they're called greedy), so you need to be a bit more careful with these. There are 3 kinds of greedy quantifier: * (0 or more matches), + (1 or more matches) and ? (0 or 1 matches). You could think of them as ? meaning optional, * meaning any number (including none at all) and + meaning at least one required. Just bang one of these down after a character or group and they'll get to work.
If you wanted to match Dude, in a dude where's my car fashion (y'know, the Duuuuude, Sweeeet bit), we'd just go Du+de. Dude, Duude, Duuude etc. etc. will all now be matched. If we go Bes?t it'll match Best and Bet (because the s is now optional), and if you go hello*, it'll match hell, hello, helloo, hellooo etc.
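Here is a quick PHP sketch of those three quantifiers, assuming we anchor each pattern so the whole string has to match. The test words and the $tests / $results names are only there for illustration.

```php
<?php
// Each pattern is paired with two strings that should match and one that shouldn't.
$tests = array(
    'Du+de'  => array('Dude', 'Duuuude', 'Dde'),
    'Bes?t'  => array('Best', 'Bet', 'Besst'),
    'hello*' => array('hell', 'hellooo', 'hel'),
);

$results = array();
foreach ($tests as $pattern => $subjects) {
    foreach ($subjects as $subject) {
        // ^ and $ make sure the whole string is tested, not just part of it.
        $results[$pattern][$subject] = preg_match('/^' . $pattern . '$/', $subject);
    }
}

print_r($results); // the third string in each row gets 0, the others get 1
```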
We could put an email address down like so:
- Code:
[A-Z0-9]*\.?[A-Z0-9]+@[A-Z0-9]+\.com
Notice the backslash (\)? As pointed out earlier, that is used to tell the engine not to use a character's special meaning. An unescaped . is the regular expression wildcard; it will match nearly everything (any character except a newline, by default). Chances are we really, really don't want that.
Anyway, I'm sure one or two of you have noticed some problems with this, the most prominent being that it only matches .com email addresses. That won't do at all. We need to get it to do a special case for .co.uk/au/jp addresses. We can do an or in regular expressions with a | (the formal name is alternation, but I'll just call it a pipe).
If you were to use, say, hello|world it'll match hello or world. But what about hello world|people|newgrounds? It will match hello world or people or newgrounds, but it won't treat hello newgrounds or hello people as a single match. To do that, we'll need to take advantage of grouping.
Grouping is done with parentheses (brackets, ()s). Anything in a group is seen as its own little entity, so we could, for instance, go Hello( world)? and it'll match Hello or Hello world. If we combine these with a | we can get the example above: hello (world|people|newgrounds).
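A rough PHP sketch of the difference grouping makes. I've anchored every pattern so that each string has to match as a whole; in the first pattern the outer brackets are only there so the anchors apply to all three alternatives. The pattern labels and test strings are my own.

```php
<?php
$patterns = array(
    'alternatives' => '/^(hello world|people|newgrounds)$/',
    'grouped'      => '/^hello (world|people|newgrounds)$/',
    'optional'     => '/^Hello( world)?$/',
);

// Three whole-string alternatives: "hello people" is not one of them.
$a1 = preg_match($patterns['alternatives'], 'hello world');  // 1
$a2 = preg_match($patterns['alternatives'], 'people');       // 1
$a3 = preg_match($patterns['alternatives'], 'hello people'); // 0

// hello followed by one of three words.
$g1 = preg_match($patterns['grouped'], 'hello people');      // 1
$g2 = preg_match($patterns['grouped'], 'people');            // 0

// An optional group.
$o1 = preg_match($patterns['optional'], 'Hello');            // 1
$o2 = preg_match($patterns['optional'], 'Hello world');      // 1
$o3 = preg_match($patterns['optional'], 'Hello there');      // 0
```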
So let's fix our email address already! Firstly, we'll want it to sort out the .co* suffix.
- Code:
[A-Z0-9]*\.?[A-Z0-9]+@[A-Z0-9]+\.(com|co\.[A-Z]{2})
See what we're doing there? We're either going for .com or .co.(2 letters). But wait! We're still not done with our grouping here. Notice the other problem? The first bit only allows for address@ and prefix.address@; it doesn't support more than one prefix. We can solve this by grouping the prefix and letting that group repeat 0 or more times (stay with me!).
- Code:
([A-Z0-9]+\.)*[A-Z0-9]+@[A-Z0-9]+\.(com|co\.[A-Z]{2})
Now we can have prefix.prefix.address@domain.co.uk or address@domain.com. But we can't have subdomains. We can fix that up with grouping too:
- Code:
([A-Z0-9]+\.)*[A-Z0-9]+@([A-Z0-9]+\.)+(com|co\.[A-Z]{2})
And there we have a pattern for .co.* or .com email addresses. Now I'm going to go over each bit of this, as it's starting to get a bit complicated now.
- Code:
([A-Z0-9]+\.)*       Covers the hello. in hello.world@domain.com; the * at the end means it can happen 0 or more times.
[A-Z0-9]+            Text right before our @; has to have at least one character, so no hello.@domain.com
@                    Our @ symbol!
([A-Z0-9]+\.)+       Our domain name and any subdomains; has to have at least one character and one .
(com|co\.[A-Z]{2})   Our .co.XX or .com
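If you want to try the finished pattern out, here is a minimal PHP sketch. It assumes anchors around the whole thing plus the i modifier for case-insensitivity, and it uses preg_match(), which is explained in the PHP part further down. The addresses are made-up examples.

```php
<?php
$emailPattern = '/^([A-Z0-9]+\.)*[A-Z0-9]+@([A-Z0-9]+\.)+(com|co\.[A-Z]{2})$/i';

$addresses = array(
    'bob@example.com',
    'first.last@mail.example.co.uk',
    'hello.@example.com', // nothing between the . and the @
    'bob@example.org',    // .org isn't covered by this pattern
);

$results = array();
foreach ($addresses as $address) {
    $results[$address] = preg_match($emailPattern, $address);
}

print_r($results); // the first two give 1, the last two give 0
```

Remember this is still the deliberately simplified pattern from the walkthrough, so it rejects plenty of perfectly real addresses.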
So now you know how to do regular expressions, let's actually do some PHP!
Now first thing to note about the preg_* functions is that they use Perl syntax (hence why they're called Perl compatible regular expressions). Our postcode regular expression in Perl would look like this:
- Code:
m/^[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}$/i
The slashes are dividers (delimiters) for the data that Perl uses. The starting bit is which operation to use with this regular expression; m means match, essentially. Since we specify this bit by calling a particular function, we can skip it in PHP. The last bit is any modifiers. The i there means case-insensitive; if you don't use this, your character groups have to be [A-Za-z0-9] to match lower and upper case.
Now even though it uses Perl's syntax, it'll look slightly different in PHP. In PHP that would be expressed as:
- Code:
$result = preg_match( '/^[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}$/i', $string );
Notice that we still need that opening and ending slash even when we don't specify any modifiers! preg_match stops at the first match, so $result here will be 1 if $string is a postcode and 0 if it isn't. If you want to count every match in a string, use preg_match_all() instead.
You can make the anchors work line by line if you give the pattern the m modifier. Combined with preg_match_all, it looks like so:
- Code:
$result = preg_match_all( '/^[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}$/im', $string );
That example would count the number of lines that are postcodes (note the anchors). Basically the m modifier makes ^ and $ mean start and end of line rather than start and end of string.
You can also make preg_match_all copy all matches into an array by giving it an extra parameter to dump them into. If we wanted to get all postcodes in a string we'd go:
- Code:
$result = preg_match_all( '/[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}/i', $string, $matches );
$matches[0] will now be an array of every single postcode in the string.
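Here is a small PHP sketch that puts the two functions side by side. As far as the PHP manual describes them, preg_match() stops at the first match and returns 1 or 0, while preg_match_all() carries on and returns how many matches it found. The sample text and variable names are made up.

```php
<?php
$pattern = '/[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}/';
$text    = "Offices: SW1A 1AA and M1 1AE.\nWarehouse: B33 8TH";

// preg_match() stops at the first postcode it finds.
$first = preg_match($pattern, $text, $firstMatch);
// $first is 1 and $firstMatch[0] is "SW1A 1AA"

// preg_match_all() collects every postcode; the full matches land in index 0.
$count = preg_match_all($pattern, $text, $allMatches);
// $count is 3 and $allMatches[0] holds all three postcodes

// With the m modifier, ^ and $ apply to each line instead of the whole string.
$lines     = "SW1A 1AA\nnot a postcode\nM1 1AE";
$lineCount = preg_match_all('/^[A-Z][0-9A-Z]{1,3} [0-9][A-Z0-9]{2}$/m', $lines, $lineMatches);
// $lineCount is 2

print_r($allMatches[0]);
```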
Anyway, let's look into preg_replace(). This one is a bit more interesting as you can actually manipulate things with it. A very common use of preg_replace is to do things like convert BBCode in forums to the corresponding HTML. Let's use preg_replace to swap the 2 bits of our postcode around.
- Code:
$result = preg_replace( '/^([A-Z][0-9A-Z]{1,3}) ([0-9][A-Z0-9]{2})$/i', '$2 $1', $string );
Notice the grouping (parentheses) I've added around the 2 parts? When you use preg_replace it will collect all the groups in a match from left to right and make each one available in the replacement string as a numbered reference ($1 to $99). I've swapped them around by outputting our second group then our first, but you can do some other stuff with it too.
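A quick PHP sketch of what those numbered references give you. The outward/inward labels in the second replacement are just my own illustration, and note the single quotes around the replacement so PHP doesn't try to treat $1 as a variable.

```php
<?php
$pattern = '/^([A-Z][0-9A-Z]{1,3}) ([0-9][A-Z0-9]{2})$/i';

// Swap the two halves around.
$swapped = preg_replace($pattern, '$2 $1', 'SW1A 1AA');
// "1AA SW1A"

// The groups can be dropped anywhere into the replacement text.
$labelled = preg_replace($pattern, 'outward=$1, inward=$2', 'm1 1ae');
// "outward=m1, inward=1ae"

// If the pattern doesn't match, the string comes back unchanged.
$untouched = preg_replace($pattern, '$2 $1', 'not a postcode');
// "not a postcode"

echo $swapped, "\n", $labelled, "\n", $untouched, "\n";
```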
One of the fun things about preg_replace is that you can also give it arrays of patterns and replacements. Take this example of a quick BBCode converter:
- Code:
$string = '[bold]hello[/bold] [italic]world[/italic] [link src="http://exosoft.omgforum.net/"]click here[/link]';
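$patterns = array( '/\[bold\](.+?)\[\/bold\]/i', '/\[italic\](.+?)\[\/italic\]/i', '/\[link src="(.+?)"\](.+?)\[\/link\]/i' );
$replaces = array( '<strong>$1</strong>', '<em>$1</em>', '<a href="$1">$2</a>' );
$result = preg_replace( $patterns, $replaces, $string );
That will convert [bold]text[/bold] into <strong>text</strong>, [italic]text[/italic] into <em>text</em> and [link src="url"]text[/link] into <a href="url">text</a>. The ? after each + makes the quantifier lazy (see the tips at the end), so each pair of tags is matched on its own rather than everything between the first opening tag and the last closing tag being swallowed in one go.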
Lastly we have preg_split(). It will split a string into an array, starting a new array element whenever you get a match. It's explode with regular expressions, basically. If you wanted to get all words in a string without any punctuation (for a search string, for example) you could do something like this:
- Code:
$result = preg_split( '/[-_,.!? ]+/', 'Woo Hello, Punctuation! Is. Fun?', -1, PREG_SPLIT_NO_EMPTY );
The PREG_SPLIT_NO_EMPTY flag drops the empty element you'd otherwise get from the ? at the very end of the string (the -1 before it just means "no limit"). Pretty simple really.
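To see what comes out of the split, here is a short PHP sketch comparing a plain preg_split() call with one that passes the PREG_SPLIT_NO_EMPTY flag. The variable names are my own.

```php
<?php
$sentence = 'Woo Hello, Punctuation! Is. Fun?';

// Plain split: the ? at the very end leaves an empty string as the last element.
$withEmpties = preg_split('/[-_,.!? ]+/', $sentence);

// -1 means "no limit"; PREG_SPLIT_NO_EMPTY throws away empty pieces.
$wordsOnly = preg_split('/[-_,.!? ]+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

print_r($withEmpties); // Woo, Hello, Punctuation, Is, Fun and an empty string
print_r($wordsOnly);   // Woo, Hello, Punctuation, Is, Fun
```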
A few handy tips:
- If you want to paste a string into a regular expression, run it through preg_quote(). This will return a copy of the string with all special characters escaped. Pass your delimiter (usually /) as the second argument so that it gets escaped too.
- When using greedy quantifiers, you can make them "lazy" by adding a ? after them (so *?, +?, ??). When they are set to be lazy, the engine will do the repetition as few times as possible where it still gives a match. This is handy whenever a greedy quantifier would otherwise run on past the first closing tag or quote.
- Avoid regular expressions from the internet; they are invariably flawed. If you do use them, test the damned things properly. I have gotten in serious trouble for not doing this.
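The first two tips are easier to see in code, so here is a rough PHP sketch of both. The sample strings and variable names are assumptions of mine.

```php
<?php
// Tip 1: preg_quote() escapes the special characters in a literal string.
// The second argument is the delimiter, so a / in the input gets escaped too.
$userInput = 'Price: $5.00 (approx.)';
$quoted    = preg_quote($userInput, '/');
$found     = preg_match('/' . $quoted . '/', 'Total Price: $5.00 (approx.) today');
// $found is 1

// Tip 2: greedy versus lazy on a string with two pairs of tags.
$bbcode = '[bold]one[/bold] and [bold]two[/bold]';

// Greedy .+ runs all the way to the LAST [/bold].
$greedy = preg_replace('/\[bold\](.+)\[\/bold\]/', '<strong>$1</strong>', $bbcode);
// "<strong>one[/bold] and [bold]two</strong>"

// Lazy .+? stops at the FIRST [/bold] it can, so each pair is handled separately.
$lazy = preg_replace('/\[bold\](.+?)\[\/bold\]/', '<strong>$1</strong>', $bbcode);
// "<strong>one</strong> and <strong>two</strong>"

echo $greedy, "\n", $lazy, "\n";
```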
A great place for details on more advanced (or even just better written) tutorials on regular expressions is http://www.regular-expressions.info/
BoneIdol- Posts: 2
Join date: 2008-07-02
Age: 22
Location: Englandshire
Re: Beginner's guide to Regular Expressions with PHP
Fantastic tutorial. I used to work a lot with PHP just to save time. (Also for the things that are impossible in other ways, of course.)
Keep it up!
_________________
A.K.A. Victor V.
ExoSoft Co-Founder
Modeler, Mac Programmer, Website Creator, Concept Art and much more crap

shadowspy- Admin

- Posts: 37
Join date: 2008-06-29
Location: Here ;)

Re: Beginner's guide to Regular Expressions with PHP
This is amazing. It's the best thing I've read about regular expressions. All the other tutorials and guides didn't explain everything in depth, and they flipped from one thing to another, making it almost impossible to understand.
_________________
ExoSoft Co-Founder
Windows Programmer, financial organizer and marketing organizer.

Hein D.- ExoSoft Member

- Posts: 41
Join date: 2008-06-29
Location: The Netherlands

Re: Beginner's guide to Regular Expressions with PHP
Stickied! 
shadowspy- Admin

Re: Beginner's guide to Regular Expressions with PHP
::Off-Topic:: Why the hell Can't I do such things?
=@
Hein D.- ExoSoft Member
