Technology Answer: String delimited regular expression

I'm trying to build a BBCode parser, but I'm having trouble figuring out how to avoid matching too widely. For example, I want to implement a [list] to <ul> conversion like this:

\[list\](.*)\[/list\]

would be replaced by this:

<ul>$1</ul>

This works fine, except when I have two lists: the regular expression then matches the opening tag of the first list and the closing tag of the second. So this

[list]list1[/list] [list]list2[/list]

becomes this:

<ul>list1[/list] [list]list2</ul>

which produces really ugly output. Any idea on how to fix this?

From stackoverflow

The method you're using may not end up being a particularly good approach, but to solve that specific problem, just change to non-greedy matching:
```
\[list\](.*?)\[\/list\]
```
Note that the non-greedy version will instead have trouble with nested lists rather than back-to-back ones.
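
To make the difference concrete, here is a minimal Java sketch comparing the greedy and non-greedy patterns on the input from the question. The class name `ListTagDemo` and the helper `convert` are made up for illustration; only `java.util.regex` from the standard library is used.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class ListTagDemo {
    // Greedy: (.*) runs on to the LAST [/list] in the input.
    static final Pattern GREEDY = Pattern.compile("\\[list\\](.*)\\[/list\\]");
    // Non-greedy (reluctant): (.*?) stops at the FIRST [/list] it can reach.
    static final Pattern RELUCTANT = Pattern.compile("\\[list\\](.*?)\\[/list\\]");

    // Replace every match of the pattern with <ul>captured text</ul>.
    static String convert(Pattern pattern, String input) {
        return pattern.matcher(input).replaceAll("<ul>$1</ul>");
    }

    public static void main(String[] args) {
        String sideBySide = "[list]list1[/list] [list]list2[/list]";
        System.out.println(convert(GREEDY, sideBySide));
        // <ul>list1[/list] [list]list2</ul>
        System.out.println(convert(RELUCTANT, sideBySide));
        // <ul>list1</ul> <ul>list2</ul>

        // Nested lists still break: the inner [/list] closes the outer [list].
        String nested = "[list]a[list]b[/list]c[/list]";
        System.out.println(convert(RELUCTANT, nested));
        // <ul>a[list]b</ul>c[/list]
    }
}
```

Note that `.` does not match line terminators by default in Java, so a list spanning several lines would also need `Pattern.DOTALL`.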
If what you are doing is not just a lightweight hack but something more permanent, you probably want to move to a real parser. Regexps in Java are particularly slow (even with precompiled patterns), and matching nested constructs (especially different nested constructs like "foo [u][i] bar [s]baz[/s][/i][/u]") is going to be a royal pain.

Instead, try using a state-based parser that repeatedly cuts your input into sections like "foo " / (u) / "[i] bar [s]baz[/s][/i][/u]" and maintains a set of states that flip whenever you encounter the matching construct delimiter.
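
One possible reading of that suggestion, sketched in Java: scan the input once, emit literal text as-is, and keep a stack of currently open tags so that a closing tag is only honoured when it matches the innermost open one. This is an illustrative sketch under assumptions, not the answerer's actual code. The class name `BbcodeSketch`, the method `toHtml`, and the tag table are hypothetical, and HTML escaping and tag attributes are deliberately left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

// Hypothetical sketch of a single-pass, stack-based BBCode converter.
class BbcodeSketch {
    // Assumed tag table (BBCode name -> HTML element); extend as needed.
    static final Map<String, String> TAGS =
            Map.of("list", "ul", "b", "b", "i", "i", "u", "u", "s", "s");

    static String toHtml(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // currently open BBCode tags
        int pos = 0;
        while (pos < input.length()) {
            int lb = input.indexOf('[', pos);
            if (lb < 0) {                         // no more tags: copy the rest
                out.append(input, pos, input.length());
                break;
            }
            out.append(input, pos, lb);           // literal text before '['
            int rb = input.indexOf(']', lb);
            if (rb < 0) {                         // unterminated '[': copy the rest
                out.append(input, lb, input.length());
                break;
            }
            String name = input.substring(lb + 1, rb);
            boolean closing = name.startsWith("/");
            if (closing) {
                name = name.substring(1);
            }
            String html = TAGS.get(name);
            if (html == null) {                   // unknown tag: keep '[' literally
                out.append('[');
                pos = lb + 1;
                continue;
            }
            if (!closing) {
                open.push(name);
                out.append('<').append(html).append('>');
            } else if (name.equals(open.peek())) { // matches innermost open tag
                open.pop();
                out.append("</").append(html).append('>');
            } else {                              // stray close: keep it literally
                out.append(input, lb, rb + 1);
            }
            pos = rb + 1;
        }
        while (!open.isEmpty()) {                 // close anything left open
            out.append("</").append(TAGS.get(open.pop())).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHtml("[list]list1[/list] [list]list2[/list]"));
        // <ul>list1</ul> <ul>list2</ul>
        System.out.println(toHtml("[list]a[list]b[/list]c[/list]"));
        // <ul>a<ul>b</ul>c</ul>
    }
}
```

Because each character is visited a bounded number of times and no backtracking is involved, this handles both back-to-back and nested lists in one pass over well-formed input.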

cdecker : Thanks for the heads up, do you know some resources or a working example of such a parser? Speed really is my main concern ^^

Alan Moore : Java's built-in regexes are plenty fast enough if you know what you're doing. I agree that regexes are not the right tool for this job, but performance is not the reason.

Varkhan : Even something as simple as matching a prefix with a precompiled Java regexp is painfully slow (printing all matching lines from a file with a regexp like "^mystring" is two orders of magnitude slower than a simple startsWith, and an order of magnitude slower than the same program in Perl... wtf?)

Sunday, April 3, 2011
