preg_replace() madness

I’ve always considered myself very good with regular expressions. This one drove me crazy for a little while, but it turned out not to be a problem with the regex so much as with the way PHP’s PCRE functions were applying it.

Imagine you want to normalize a URL by adding a trailing slash, but you don’t want to double up a slash that is already there. You might use a regex like this:
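A minimal sketch of such a pattern, assuming an optional slash followed by end of string. The `#` delimiter, the variable names and the sample URL are illustrative assumptions, not taken from the original listing:

```php
<?php
// Assumed pattern: an optional slash, then end of string.
// '#' is used as the delimiter so the slash needs no escaping.
$pattern = '#/?$#';

// Hypothetical sample URL without a trailing slash.
$url = 'http://example.com/path';

// With no trailing slash, only the empty "end of string" match is replaced.
$normalized = preg_replace($pattern, '/', $url);
echo $normalized, "\n";
```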

So *if* there is a trailing ‘/’ replace it with a slash (in other words, do nothing); otherwise, just replace the “end of string” with a slash.

Let’s see how this works in practice with a couple of sample URLs:

Here’s the output:
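A short script along these lines reproduces the problem. The two URLs are hypothetical samples and the pattern `#/?$#` is the assumed form of the regex described above; the output is shown in the trailing comments:

```php
<?php
// Hypothetical sample URLs: one without and one with a trailing slash.
$urls = ['http://example.com', 'http://example.com/'];

$results = [];
foreach ($urls as $url) {
    // No $limit argument, so every match of the pattern is replaced.
    $results[$url] = preg_replace('#/?$#', '/', $url);
    echo $url, ' => ', $results[$url], "\n";
}
// http://example.com => http://example.com/
// http://example.com/ => http://example.com//
```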

Not exactly what we’re going for. At first blush, I thought that somehow the PCRE engine was choosing to match just “$” instead of “/$”, almost as if it were behaving in some sort of “anti-greedy” way. When I was at my wits’ end, I tried the same regex in Perl, and got the expected results:

Output:

So what’s the difference between these two pieces of code? By default, Perl’s regex substitution operator replaces only the first match of the regex. PHP’s preg_replace() replaces all matches. In Perl, you get the same replace-all behavior by adding the “g” modifier to the substitution.
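One way to see how many matches PHP finds is to run the same pattern through preg_match_all(). This is only a sketch, using a hypothetical URL and the assumed pattern `#/?$#`; PREG_OFFSET_CAPTURE is included so each match’s starting offset is visible:

```php
<?php
// Hypothetical URL that already ends in a slash.
$url = 'http://example.com/';

// PREG_OFFSET_CAPTURE records where each match starts in the subject.
$count = preg_match_all('#/?$#', $url, $matches, PREG_OFFSET_CAPTURE);

echo $count, "\n"; // 2
foreach ($matches[0] as [$text, $offset]) {
    printf("offset %d: '%s'\n", $offset, $text);
}
// offset 18: '/'
// offset 19: ''
```

The first match is the trailing slash itself; the second is an empty match at the very end of the string.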

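So if you think about it, my doubled-up slashes are the result of two replacements. The first replaces the “slash plus end of string”. The engine then resumes scanning where that match ended, which is the very end of the string, and there the optional slash matches nothing while “$” still matches, so a second, empty match gets replaced with another slash. How do we fix this? We use the $limit parameter of preg_replace() to limit the function to a single replacement: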

Output:
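A sketch of the fix, again with hypothetical sample URLs and the assumed pattern `#/?$#`. The only change from a plain call is the fourth argument to preg_replace(), which is $limit:

```php
<?php
// Hypothetical sample URLs: one without and one with a trailing slash.
$urls = ['http://example.com', 'http://example.com/'];

$fixed = [];
foreach ($urls as $url) {
    // Fourth argument is $limit: replace at most one match per subject string.
    $fixed[$url] = preg_replace('#/?$#', '/', $url, 1);
    echo $url, ' => ', $fixed[$url], "\n";
}
// http://example.com => http://example.com/
// http://example.com/ => http://example.com/
```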


 
