Intelligent Abbreviation of Text (pretty_substr)

Recently, we needed to limit the length of “headline” text on one of our sites for display purposes. The text in question was the title text for classified ads. Our classifieds system was simply truncating the title at an admin-configured number of characters. That preserved our design, but it led to awkward truncations where words were cut in half. Not the optimal way to do this.

I took some snippets from around the web to build a robust version of substr() that truncates strings only at word boundaries and can add an ellipsis at the truncation points (it even handles non-zero start indices the same way that substr() does).

Here’s a quick rundown of what it’s doing:

  1. builds a regular expression to break the string into the desired substring (the “keeper”) plus any text before it and any text after it
  2. if we’re chopping something off the front, checks whether the chopped text ends with a non-space character and the keeper begins with one; if so, we cut in the middle of a word, so the leftover word fragment is removed from the front of the keeper
  3. always prepends the indicator if we’ve chopped something off the front
  4. if we’re chopping something off the end, checks whether the chopped text starts with a non-space character and the keeper ends with one; if so, we cut in the middle of a word, so the leftover word fragment is removed from the end of the keeper
  5. always appends the indicator if we’ve chopped something off the end
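
The steps above can be sketched roughly as follows. This is a hypothetical reconstruction, not the original function: the signature, the default indicator of '...', and the argument normalization are all assumptions. It is written for PHP 7.1 or later and works on bytes, as substr() does.

```php
<?php
// Hypothetical reconstruction of the five steps; not the original function.
// Assumes offsets stay below PCRE's repeat limit of 65535.
function pretty_substr(string $text, int $start, ?int $length = null, string $indicator = '...'): string
{
    // Resolve negative or oversized arguments roughly the way substr() does.
    $total = strlen($text);
    if ($start < 0) {
        $start = max(0, $total + $start);
    }
    $start = min($start, $total);
    $rest = $total - $start;
    if ($length === null) {
        $length = $rest;
    } elseif ($length < 0) {
        $length = max(0, $rest + $length);
    }
    $length = min($length, $rest);

    // Step 1: one pattern captures the text before, the keeper, and the text after.
    $pattern = sprintf('/\A(.{%d})(.{0,%d})(.*)\z/s', $start, $length);
    if (!preg_match($pattern, $text, $m)) {
        return ''; // pattern failed to compile or match (e.g. offset too large)
    }
    [, $before, $keep, $after] = $m;

    // Step 2: cut mid-word at the front? Drop the fragment and one whitespace character.
    if (preg_match('/\S\z/', $before) && preg_match('/\A\S/', $keep)) {
        $keep = (string) preg_replace('/\A\S+\s?/', '', $keep);
    }

    // Step 4: cut mid-word at the end? Drop the fragment and one preceding whitespace character.
    if (preg_match('/\A\S/', $after) && preg_match('/\S\z/', $keep)) {
        $keep = (string) preg_replace('/\s?\S+\z/', '', $keep);
    }

    // Steps 3 and 5: mark each side where something was chopped off.
    if ($before !== '') {
        $keep = $indicator . $keep;
    }
    if ($after !== '') {
        $keep .= $indicator;
    }
    return $keep;
}
```

With the sentence 'The quick brown fox jumps over the lazy dog', pretty_substr($s, 0, 12) gives 'The quick...' where substr($s, 0, 12) gives 'The quick br', and pretty_substr($s, 6, 12) gives '...brown...'. The indicators are added only after both trims, so the end-of-keeper check never looks at the indicator itself.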

We can hook up a little test harness to the function:

And here’s the resulting output of our function next to the output generated by identical calls to substr():

Note that I originally intended to avoid the use of regular expression functions for performance reasons, but I quickly found that it is non-trivial to “chop off all non-whitespace characters and the first whitespace character from the front of a line” without using regular expressions. The ctype_space() function helps somewhat, but without the luxury of a strpos() function that is whitespace-aware, you would have to iterate character by character until you find your first whitespace character. I can’t imagine that this would be as fast as the regular expression engine that is implemented as a native binary.
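
For comparison, here is a minimal sketch of the character-by-character approach described above. The function name chop_leading_word is made up for illustration. The commented-out line shows that the built-in strcspn() can locate the first whitespace byte without an explicit loop; no claim is made here about how either variant performs against the regular expression version.

```php
<?php
// Hypothetical helper: removes the leading run of non-whitespace characters
// plus the first whitespace character that follows it.
function chop_leading_word(string $line): string
{
    $len = strlen($line);
    $i = 0;

    // Walk forward until the first whitespace character (or the end of the line).
    while ($i < $len && !ctype_space($line[$i])) {
        $i++;
    }
    // Loop-free alternative using the same whitespace set as ctype_space():
    // $i = strcspn($line, " \t\n\r\v\f");

    // Skip the single whitespace character we stopped on, if any.
    if ($i < $len) {
        $i++;
    }

    // Cast because substr() can return false on older PHP versions.
    return (string) substr($line, $i);
}
```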
