I'm trying to build a BBCode parser, but I'm having trouble figuring out how to avoid matching too widely. For example, I want to implement a [list] to <ul> conversion, where this:
\[list\](.*)\[/list\]
would be replaced by this:
<ul>$1</ul>
This works fine, except when I have two lists: the regular expression then matches the opening tag of the first list and the closing tag of the second. So this
[list]list1[/list] [list]list2[/list]
becomes this:
<ul>list1[/list] [list]list2</ul>
which produces really ugly output. Any idea on how to fix this?
-
The method you're using may not end up being a particularly good approach, but to solve that specific problem, just change to non-greedy matching:
\[list\](.*?)\[\/list\]
Note that this way will have trouble with nested lists instead of back-to-back ones.
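To see the difference in practice, here is a minimal sketch of the non-greedy pattern, assuming Java (the language the later answers discuss). The class name ListTagDemo and the convert helper are made up for illustration, and the DOTALL flag is my addition so that a list body can span several lines.

```java
import java.util.regex.Pattern;

class ListTagDemo {
    // The reluctant quantifier (.*?) stops at the first [/list] it can reach.
    // DOTALL (an assumption, not in the original pattern) lets '.' match newlines.
    private static final Pattern LIST =
            Pattern.compile("\\[list\\](.*?)\\[/list\\]", Pattern.DOTALL);

    // Hypothetical helper: replaces every [list]...[/list] pair with <ul>...</ul>.
    static String convert(String input) {
        return LIST.matcher(input).replaceAll("<ul>$1</ul>");
    }

    public static void main(String[] args) {
        // Back-to-back lists are now handled separately.
        System.out.println(convert("[list]list1[/list] [list]list2[/list]"));
        // prints: <ul>list1</ul> <ul>list2</ul>

        // Nested lists still go wrong: the outer opening tag pairs with the inner closing tag.
        System.out.println(convert("[list]a[list]b[/list]c[/list]"));
        // prints: <ul>a[list]b</ul>c[/list]
    }
}
```

The second call shows the nested-list problem mentioned above: a single regular expression has no notion of how many lists are currently open.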
-
If what you are doing is not just a lightweight hack, but something more permanent, you probably want to move to a real parser. Regexps in Java are particularly slow (even with precompiled patterns), and matching nested constructs (especially different nested constructs like "foo [u][i] bar [s]baz[/s][/i][/u]") is going to be a royal pain.
Instead, try using a state-based parser that repeatedly cuts your sentence into sections like "foo " / (u) / "[i] bar [s]baz[/s][/i][/u]", and maintains a set of states that flip whenever you encounter the matching construct delimiter.
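As a rough illustration of that idea, here is one possible sketch in Java. It is not a complete BBCode implementation: the class name BbcodeSketch, the convert method and the small tag table are all hypothetical, a stack of open tags stands in for the "set of states", and HTML-escaping of the plain text is left out for brevity.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

class BbcodeSketch {
    // Hypothetical tag table: BBCode tag name -> HTML element name.
    private static final Map<String, String> TAGS =
            Map.of("list", "ul", "u", "u", "i", "i", "s", "s");

    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // the parser state: tags currently open
        int pos = 0;
        while (pos < input.length()) {
            int start = input.indexOf('[', pos);
            if (start < 0) {                      // no more brackets: copy the rest
                out.append(input, pos, input.length());
                break;
            }
            out.append(input, pos, start);        // plain text before the bracket
            int end = input.indexOf(']', start);
            String name = end < 0 ? "" : input.substring(start + 1, end);
            boolean closing = name.startsWith("/");
            if (closing) {
                name = name.substring(1);
            }
            String html = TAGS.get(name);
            if (html != null && !closing) {       // known opening tag: push state
                open.push(name);
                out.append('<').append(html).append('>');
                pos = end + 1;
            } else if (html != null && name.equals(open.peek())) {
                open.pop();                       // closing tag matches innermost open tag
                out.append("</").append(html).append('>');
                pos = end + 1;
            } else {                              // unknown or mismatched: keep the bracket literally
                out.append('[');
                pos = start + 1;
            }
        }
        while (!open.isEmpty()) {                 // close anything left open at end of input
            out.append("</").append(TAGS.get(open.pop())).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("foo [u][i] bar [s]baz[/s][/i][/u]"));
        // prints: foo <u><i> bar <s>baz</s></i></u>
        System.out.println(convert("[list]a[list]b[/list]c[/list]"));
        // prints: <ul>a<ul>b</ul>c</ul>
    }
}
```

Because the stack records which tag is innermost, nested lists come out correctly, and the input is scanned once from left to right without any regular expressions.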
cdecker: Thanks for the heads up, do you know some resources or a working example of such a parser? Speed really is my main concern ^^
Alan Moore: Java's built-in regexes are plenty fast enough if you know what you're doing. I agree that regexes are not the right tool for this job, but performance is not the reason.
Varkhan: Even something as simple as matching a prefix with a precompiled Java regexp is painfully slow (printing all matching lines from a file, with a regexp like "^mystring": two orders of magnitude compared to a simple startswith, an order of magnitude compared with the same program in Perl... wtf?)