Hi, I have the Lucene search extension (http://www.mediawiki.org/wiki/Extension_talk:Lucene-search) integrated with my MediaWiki installation. It's all working really well; however, Lucene seems to have indexed all the MediaWiki/HTML markup as well, and it is showing up in the results.
For example, searching for "green" returns results containing markup such as style="background:green; color:white
Is there a way to strip all the markup from the search results? I believe Wikipedia uses the same search plugin; how are they doing it?
-
You will probably have to transform the raw wiki markup before indexing it with Lucene. When dealing with pure XML content, it's possible to just use an XSL transform with
<xsl:value-of select="text()"/>
to extract the text content. I'm afraid that won't work for wiki markup, but maybe you can capture the page after it has been rendered to HTML?
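If you can get hold of the rendered HTML, stripping it down to plain text before indexing is fairly simple. Here is a rough sketch using only Python's standard html.parser module. The names TextExtractor and html_to_text are made up for illustration and are not part of MediaWiki or the Lucene-search extension, and how you fetch the rendered HTML and hand the text to the indexer is left out, since that depends on your setup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags and attributes."""

    # Contents of these elements are never useful as search text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes such as style="background:green" are simply ignored.
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(rendered_html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(rendered_html)
    parser.close()  # flush any text still buffered by the parser
    # Joining with a space keeps adjacent table cells from running together.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    cell = '<td style="background:green; color:white">Approved</td>'
    print(html_to_text(cell))
```

With this, the style attribute from your "green" example never reaches the index, because only the text between tags is kept. One trade-off: joining the pieces with a space means a word split by inline tags (part of it in bold, say) would be indexed as two tokens.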