dpointer: HTML Parser (Kinda)

1% coding, 99% debugging

Wednesday, July 14, 2004

HTML Parser (Kinda)

Let's assume I have a bunch of HTML files. Given the task of creating an index of those files, listing each document's title (taken from its <title> tag) with a proper hyperlink, how could I do it? Iterating over the files is easy, but getting the title isn't so trivial.

If all those files were XHTML, an XML parser like the built-in QDom or QXml offered by Qt would be an elegant solution. Nevertheless, let's try to handle the much more general case of plain HTML files. The parser below, put together in only 10 minutes, was my quick-and-dirty solution. Well, maybe it's even very dirty.

First of all, the interface is crystal clear:

#ifndef HTMLDOC
#define HTMLDOC

#include <qstring.h>

class HTMLDocument
{
public:
  HTMLDocument();
  void open( const QString& filename );
  QString title() const;
  QString filename() const;
  
private:
  QString mFilename;
  QString mTitle;
};

#endif // HTMLDOC

The implementation of HTMLDocument follows. The constructor and the two accessors need no explanation. The main parsing routine walks through the file's content character by character and keeps track of whether we're inside a tag or not. When <title> is found (the tag name is compared case-insensitively, so <TITLE> matches as well), the text that follows it, up to the next tag, is stored as the document title.

#include "htmldoc.h"

#include <qfile.h>
#include <qstring.h>
#include <qtextstream.h>

HTMLDocument::HTMLDocument()
{
  mFilename = "";
  mTitle = "Untitled";
}

void HTMLDocument::open( const QString& filename )
{
  QFile file( filename );
  if( !file.open( IO_ReadOnly ) )
    return;
    
  mFilename = filename;  
    
  bool inTag = false;  
  bool inTitle = false;
  QString tagName;
  
  mTitle = "";
  QTextStream stream( &file );
  while ( !stream.atEnd() )
  {
    // readLine() returns the line without its line break
    QString line = stream.readLine();
    for( unsigned i = 0; i < line.length(); i++ )
    {
      if( !inTag && ( line[i] == '<' ) )
        inTag = true;                    // a tag starts here
      else if( inTag && ( line[i] != '>' ) )
        tagName.append( line[i] );       // collect everything up to '>'
      else if( inTag && ( line[i] == '>' ) )
      {
        // the tag is complete: did we just enter <title>?
        inTag = false;
        inTitle = tagName.lower() == "title";
        tagName = "";
      }
      else if( !inTag && inTitle )
        mTitle.append( line[i] );        // plain text inside <title>
    }
  }
  file.close();
}

QString HTMLDocument::title() const
{
  return mTitle;
}

QString HTMLDocument::filename() const
{
  return mFilename;
}
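
With the titles in hand, the other half of the task is writing the index itself. Below is a minimal sketch of that step in plain standard C++ rather than Qt, so it can be tried anywhere. The names escapeHtml and makeIndex are hypothetical helpers, not part of the class above, and the sketch assumes that the titles are already plain text and that the filenames are simple relative paths that need no URL encoding.

```cpp
#include <string>
#include <utility>
#include <vector>

// Escape the characters that are special in HTML text and attribute values.
std::string escapeHtml(const std::string& text)
{
  std::string out;
  for (char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '"': out += "&quot;"; break;
      default:  out += c;
    }
  }
  return out;
}

// Hypothetical helper: turn (filename, title) pairs into an HTML list of
// links. Assumes the filenames can be used as href values as they are.
std::string makeIndex(
    const std::vector<std::pair<std::string, std::string> >& docs)
{
  std::string html = "<ul>\n";
  for (const auto& doc : docs) {
    html += "  <li><a href=\"" + escapeHtml(doc.first) + "\">"
          + escapeHtml(doc.second) + "</a></li>\n";
  }
  html += "</ul>\n";
  return html;
}
```

In the Qt version, the pairs would come from calling filename() and title() on each HTMLDocument after open().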

As usual, this is just an illustration and therefore not suitable for real-world parsing. For correctness, you may want to follow the HTML syntax more closely, for example by checking for the <html> and <head> tags before taking anything from <title>. On the optimization side, if you're only interested in the title and not the rest of the content, you can stop as soon as you're finished with this tag or when you encounter <body>. QString and QRegExp can be used further to clean up the title string, say to convert special characters and to strip unnecessary white space and line feeds. Error handling, which isn't shown in the code above for simplicity, is actually an absolute requirement in this wild playing field. And last but not least, with proper handling of other important tags like <body>, <p>, etc., you can extend it into a simple HTML-to-text converter.
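
Two of these refinements, stopping early and cleaning up the title, can be sketched roughly as follows. This is again plain standard C++ working on a string already read into memory, not the Qt class above, and every function name here (tagNameOf, extractTitle, simplifyWhitespace, replaceAll, cleanTitle) is made up for the illustration. The same state machine is used, except that only the first word of a tag counts as its name, so a <title> tag with attributes also matches, and scanning stops at </title> or <body>. Only four common entities are decoded, which is an arbitrary choice for this sketch.

```cpp
#include <cctype>
#include <string>

// Lower-cased tag name: everything up to the first whitespace character.
std::string tagNameOf(const std::string& tag)
{
  std::string name;
  for (unsigned char c : tag) {
    if (std::isspace(c)) break;
    name += static_cast<char>(std::tolower(c));
  }
  return name;
}

// Raw text between <title ...> and the next tag, or "" if there is none.
std::string extractTitle(const std::string& html)
{
  bool inTag = false, inTitle = false;
  std::string tag, title;
  for (char c : html) {
    if (!inTag && c == '<') {
      inTag = true;
      tag.clear();
    } else if (inTag && c != '>') {
      tag += c;
    } else if (inTag) {                 // c == '>': the tag is complete
      inTag = false;
      const std::string name = tagNameOf(tag);
      if (name == "/title" || name == "body")
        break;                          // nothing more to learn here
      inTitle = (name == "title");
    } else if (inTitle) {
      title += c;
    }
  }
  return title;
}

// Collapse runs of whitespace into one space and trim both ends.
std::string simplifyWhitespace(const std::string& text)
{
  std::string out;
  bool pendingSpace = false;
  for (unsigned char c : text) {
    if (std::isspace(c)) {
      pendingSpace = !out.empty();      // never emit a leading space
    } else {
      if (pendingSpace) out += ' ';
      pendingSpace = false;
      out += static_cast<char>(c);
    }
  }
  return out;
}

// Replace every occurrence of `from` in `s` with `to`.
void replaceAll(std::string& s, const std::string& from, const std::string& to)
{
  std::string::size_type pos = 0;
  while ((pos = s.find(from, pos)) != std::string::npos) {
    s.replace(pos, from.length(), to);
    pos += to.length();
  }
}

// Decode a handful of entities, then tidy up the white space.
std::string cleanTitle(std::string text)
{
  replaceAll(text, "&lt;", "<");
  replaceAll(text, "&gt;", ">");
  replaceAll(text, "&quot;", "\"");
  replaceAll(text, "&amp;", "&");       // last, so "&amp;lt;" ends up as "&lt;"
  return simplifyWhitespace(text);
}
```

A caller would combine the two steps as cleanTitle(extractTitle(html)). In Qt 3 itself, QString::simplifyWhiteSpace() should cover the white space part without any hand-written loop.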

- posted by Ariya Hidayat @ 12:00 PM
