Hey, I learned something new today. Cool. I'm trying to scrape some HTML so that I can produce an XML document out of it, and I came to this piece of logic. I want to get rid of <a> links and just leave the text. So if I have "Blah blah <a href=foo>bar</a> something else" then I would just be left with "Blah blah bar something else."
First attempt (this is in Ruby, by the way):
line.gsub! /<a href=(.*)>/, ""
line.gsub! /<\/a>/, ""
In other words just a brute force attempt to chop out the tags, leaving the middle. This appeared to work. Except when I had nested tags. In a case like this:
<th class="header"><em>This is a <a href="link">link</a></em></th>
I would end up with this:
<th class="head"><em>This is a
Because it turns out the first .* was being greedy and going all the way to the last occurence of the > character (the one with the th) and eating the whole line.
I recalled the term "greedy regular expressions" from lessons I used to get from a coworker (Hi SteveO!) so I went googling and came across this rather lovely reference page which told me exactly what I wanted. Namely that a single star is "Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all.", and that there is a syntax *? called the "lazy star" which is "Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item."
Perfect. Changing my regular expression to:
line.gsub! /<a href=(.*?)>/, ""
Did exactly what I wanted, finding the first > and stopping.
I knew about the typical use of ? as a "0 or 1" matcher (as well as + for "at least one") but this was the first time that I'd had opportunity to use the * in conjunction with ?. Learn something new every day.