Tuesday, May 29, 2007

RegExp : "Lazy Star"

Hey, I learned something new today.  Cool.  I'm trying to scrape some HTML so that I can produce an XML document out of it, and I came to this piece of logic.  I want to get rid of <a> links and just leave the text.  So if I have  "Blah blah <a href=foo>bar</a> something else" then I would just be left with "Blah blah bar something else."

First attempt (this is in Ruby, by the way):

line.gsub! /<a href=(.*)>/, ""

line.gsub! /<\/a>/, ""

In other words just a brute force attempt to chop out the tags, leaving the middle.  This appeared to work.  Except when I had nested tags.  In a case like this:

<th class="header"><em>This is a <a href="link">link</a></em></th>

I would end up with this: 

<th class="head"><em>This is a

Because it turns out the first .* was being greedy and going all the way to the last occurence of the > character (the one with the th) and eating the whole line.

I recalled the term "greedy regular expressions" from lessons I used to get from a coworker (Hi SteveO!) so I went googling and came across this rather lovely reference page which told me exactly what I wanted.  Namely that a single star is "Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.", and that there is a syntax *? called the "lazy star" which is "Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item."

Perfect.  Changing my regular expression to:

line.gsub! /<a href=(.*?)>/, ""

Did exactly what I wanted, finding the first > and stopping.

I knew about the typical use of ? as a "0 or 1" matcher (as well as + for "at least one") but this was the first time that I'd had opportunity to use the * in conjunction with ?.  Learn something new every day.

 

3 comments:

steveo said...

[^>]+> is the same thing as .*?> and would have worked fine in this case without resorting to non-greedy matching. .*? is a very handy tool when there is a more complicated end condition.

Anonymous said...

That's exactly the RegEx I was looking for
Many thanks :)
Zorn

Wolf said...

Regular expression is really wonderful to parsing HTML or matching pattern. I use this a lot when i code. Actually when I learn any new langauge, first of all I first try whether it supports regex or not. I feel ezee when I found that.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

Here is about ruby regex. This was posted by me when I first learn ruby regex. So it will be helpfull for New coders.