Parse the HTML as an XML document into a DOM object and use XPath to search it (this only works if the HTML is well-formed XML). If you just want to read it from start to end, you can simply traverse the DOM object.
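Here is a rough sketch of that approach in Python, assuming the markup is well-formed. It uses the standard library's `xml.etree.ElementTree`, which builds an in-memory tree (standing in for a W3C DOM here) and supports only a subset of XPath. The `SAMPLE` document and the helper names `find_links` and `tag_order` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed (XHTML-style) input. Real-world HTML is often
# not valid XML and would make ET.fromstring raise a ParseError.
SAMPLE = """<html>
  <body>
    <h1>Title</h1>
    <p>See <a href="https://example.com/a">A</a> and <a href="https://example.com/b">B</a>.</p>
    <p>No links here.</p>
  </body>
</html>"""


def find_links(markup):
    """Search the parsed tree with an XPath-style expression."""
    root = ET.fromstring(markup)
    # ".//a[@href]" selects every <a> descendant that has an href attribute.
    return [a.get("href") for a in root.findall(".//a[@href]")]


def tag_order(markup):
    """Walk the whole tree from start to end, in document order."""
    root = ET.fromstring(markup)
    return [el.tag for el in root.iter()]


links = find_links(SAMPLE)
tags = tag_order(SAMPLE)
```

If you need full XPath 1.0 or have to cope with messy real-world HTML, a third-party parser is probably a better fit than the standard library.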
If you have a really big XML document, go for SAX, an event-based parser. It never builds the whole tree in memory, so it keeps the memory footprint small and usually takes a lot less parsing time.
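A minimal SAX sketch with Python's standard `xml.sax` module could look like the following. The handler class `LinkCollector`, the helper `collect_hrefs`, and the tiny in-memory document are assumptions for illustration; with a real large file you would pass an open file or a filename instead of a `BytesIO`.

```python
import io
import xml.sax


class LinkCollector(xml.sax.ContentHandler):
    """Receives parser events and keeps only the data we care about."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def startElement(self, name, attrs):
        # Called once per opening tag; no tree is kept in memory.
        if name == "a" and "href" in attrs:
            self.hrefs.append(attrs["href"])


def collect_hrefs(stream):
    """Stream-parse the input and return all href values of <a> elements."""
    handler = LinkCollector()
    xml.sax.parse(stream, handler)
    return handler.hrefs


# Hypothetical small document standing in for a really big one.
doc = b"<root>" + b"".join(b'<a href="/item/%d">item</a>' % i for i in range(3)) + b"</root>"
hrefs = collect_hrefs(io.BytesIO(doc))
```

The trade-off is that you have to track any context yourself (for example, which element you are currently inside), since there is no tree to query afterwards.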