Sunday, February 13, 2011

How can I manipulate the DOM from a string of HTML in C#?

So far, the best way I have found to manipulate the DOM from a string that contains HTML is:

WebBrowser webControl = new WebBrowser();
webControl.DocumentText = html;
HtmlDocument doc = webControl.Document;

There are two problems:

  1. It requires a WebBrowser object.
  2. It can't be used from multiple threads; I need something that works on a thread other than the main thread.

Any ideas?

  • Depending on what you are trying to do (maybe you can give us more details?) and depending on whether or not the HTML is well-formed, you could convert this to an XmlDocument:

    System.Xml.XmlDocument x = new System.Xml.XmlDocument();
    x.LoadXml(html); // as long as html is well-formed, i.e. XHTML
    

    Then you could manipulate it easily, without the WebBrowser instance. As for threads, I don't know enough about the implementation of XmlDocument to know the answer to that part.
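
    As a rough illustration of the manipulation step, here is a minimal sketch that uses only System.Xml. The class name, the method name and the sample markup are made up for the example, and it assumes the input is already well-formed XHTML:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Xml;

    public static class XhtmlDomSketch
    {
        // Hypothetical helper: sets a class attribute on every <p> element
        // and returns the modified markup.
        public static string AddClassToParagraphs(string xhtml, string className)
        {
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(xhtml); // throws XmlException if the markup is not well-formed

            // GetElementsByTagName returns a live list, so copy it before editing.
            List<XmlElement> paragraphs = new List<XmlElement>();
            foreach (XmlNode node in doc.GetElementsByTagName("p"))
            {
                paragraphs.Add((XmlElement)node);
            }

            foreach (XmlElement p in paragraphs)
            {
                p.SetAttribute("class", className);
            }

            return doc.OuterXml;
        }

        public static void Main()
        {
            string html = "<html><body><p>first</p><p>second</p></body></html>";
            Console.WriteLine(AddClassToParagraphs(html, "note"));
        }
    }
    ```

    Each call builds its own XmlDocument, so nothing here is shared between threads; whether a single instance can be shared safely is a separate question that this sketch does not address.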


    If the document isn't in proper form, you could use NTidy (.NET wrapper for HTML Tidy) to get it in shape first; I had to do this very thing for a project once and it really wasn't too bad.

    Daok : The document might not be well-formed, which is why the XmlDocument approach might not work, but I appreciate the alternative.
  • I searched Google for HTML and found the Html Agility Pack. I do not know yet whether it is meant for this; I am downloading it right now to give it a try.

    Mark Cidade : Html Agility Pack is awesome
    Jason Bunting : Ditto - I was actually about to recommend using HTML Tidy to get the document into good shape and then turn it into an XmlDocument, but perhaps you can skip that with the HTML Agility Pack. Good stuff.
    Daok : The Agility Pack works fine with HTML and with threads! I got my answer! Thanks, all!
    Stewart Johnson : Yeah +1 for the HtmlAgilityPack. Stand on the shoulders of giants!
    From Daok
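
    For readers who want to see roughly what this looks like, here is a minimal sketch using the Html Agility Pack. It assumes the HtmlAgilityPack assembly is referenced (it is a third-party library); the class name, the method name and the sample markup are invented for the example:

    ```csharp
    using System;
    using System.Threading.Tasks;
    using HtmlAgilityPack; // third-party library, not part of the .NET base class library

    public static class AgilityPackSketch
    {
        // Hypothetical helper: adds target="_blank" to every link that has an href.
        public static string AddTargetToLinks(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html); // accepts markup that is not well-formed

            // SelectNodes can return null rather than an empty collection when nothing matches.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links != null)
            {
                foreach (HtmlNode link in links)
                {
                    link.SetAttributeValue("target", "_blank");
                }
            }

            return doc.DocumentNode.OuterHtml;
        }

        public static void Main()
        {
            string html = "<p>unclosed paragraph <a href=foo.html>link";

            // Run the parsing on a worker thread; the document is created inside
            // the call, so no state is shared with the main thread.
            string result = Task.Run(() => AddTargetToLinks(html)).Result;
            Console.WriteLine(result);
        }
    }
    ```

    No WebBrowser control is involved, which is why this can run away from the UI thread.
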
  • Jason Bunting already posted this, but it really does work to use a .NET wrapper around HTML Tidy and load the result into an XmlDocument.

    I have used this .NET wrapper before:

    http://www.codeproject.com/KB/cs/ZetaHtmlTidy.aspx

    And implemented it somewhat like this:

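    // Deliberately malformed markup: unclosed <br, unquoted attribute, stray </div>
    string input = "<p>crappy html<br <img src=foo></div>";
    HtmlTidy tidy = new HtmlTidy();
    // Clean the markup and convert it to XHTML so that XmlDocument can parse it
    string output = tidy.CleanHtml(input, HtmlTidyOptions.ConvertToXhtml);
    XmlDocument doc = new XmlDocument(); // from System.Xml
    doc.LoadXml(output);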
    

    Sorry if this is considered a repost :)
