hitcounter
dpointer
Wednesday, July 14, 2004
 
HTML Parser (Kinda)

Let's assume I have a bunch of HTML files. Given a task to create an index of those files with the document title in the correspoding <title> tag, complete with proper hyperlink, how could I do it? Iterating over each file is easy, but to get the title isn't so trivial.

If all those files are XHTML, an XML parser like the built-in QDom or QXml offered by Qt would be an elegant ticket. Nevertheless let's try to face the much more general HTML files. The parser below, constructed only in 10 minutes, was my quick-and-dirty solution. Well, maybe it's even very dirty.

First of all, the interface is sharp clear:

#ifndef HTMLDOC
#define HTMLDOC

#include <qstring.h>

class HTMLDocument
{
public:
  HTMLDocument();
  void open( const QString& filename );
  QString title() const;
  QString filename() const;
  
private:
  QString mFilename;
  QString mTitle;
};

#endif // HTMLDOC

The implementation of HTMLDocument follows. Constructor and two accessors do not need explanation. The main parsing routine checks for the files content and keeps track whether we're inside a tag or not. When <title> is found, anything within the tag will be stored as the document title.

#include "htmldoc.h"

#include <qfile.h>
#include <qstring.h>
#include <qtextstream.h>

HTMLDocument::HTMLDocument()
{
  mFilename = "";
  mTitle = "Untitled";
}

void HTMLDocument::open( const QString& filename )
{
  QFile file( filename );
  if( !file.open( IO_ReadOnly ) )
    return;
    
  mFilename = filename;  
    
  bool inTag = false;  
  bool inTitle = false;
  QString tagName;
  
  mTitle = "";
  QTextStream stream( &file );
  while ( !stream.atEnd() ) 
  {
    QString line = stream.readLine();
    for( unsigned i = 0; i < line.length(); i++ )
    {
      if( !inTag && ( line[i] == '<' ) )
        inTag = true;
      else if( inTag && ( line[i] != '>' ) )
        tagName.append( line[i] );
      else if( inTag && ( line[i] == '>' ) )
      {
        inTag = false;
        inTitle = tagName.lower() == "title";
        tagName = "";
      }
      else if( !inTag && inTitle ) 
        mTitle.append( line[i] );
    }
  }
  file.close();
}

QString HTMLDocument::title() const
{
  return mTitle;
}

QString HTMLDocument::filename() const
{
  return mFilename;
}

As usual, this is just an illustration and therefore not suitable for real-world parsing. For correctness, you may want to follow the HTML syntax more closely, like for example the <html> and <head> tags, before taking anything from <title>. On optimization side, you can always stop whenever you're finished with this tag, or when you encounter <body>, if you're only interested in the title and not the rest of the content. QString and QRegExp can be further used to clean up the title string, say to convert special characters as well as to strip unncessary white spaces and line feed. Error handling, which aren't shown in the code above for simplicity, is actually an absolute requirement in this wild playing field. And last but not least, by proper handling of other important tags like <body>, <p>, etc, you can extend it into a simple HTML-to-text converter.


Tuesday, July 13, 2004
 
hexview

Every programmer, regardless of the language in use, should write his/her own hex viewer, right? Well, this is mine. Years ago whenever I taught someone the C language, this was my damn favourite. And still, it's useful whenever I do some reverse engineering on Microsoft Excel file format.

It should compile in any platform. And just for the sake of completeness, building it using gcc 3.3 (Linux), plus additional compression with UPX, gives less than 7 KB executable file. Enjoy!

#include <stdio.h>

int main( int argc, char** argv )
{
  FILE *f;
  unsigned char buf[16];
  int i, j, k;

  if( argc < 2 ) return 0;
  f = fopen( argv[1], "rb" );
  if( !f ) perror( "error" );
  else
  {
    k = 0;
    while( !feof( f ) )
    {
      j = fread( buf, 1, 16, f );
      printf("%08x   ", k ); k += j;
      for( i = 0; i < 16; i++ )
        if( i < j ) printf("%02x ", buf[i] ); else printf("   ");
      printf("    ");
      for( i = 0; i < 16; i++ )
        if( i < j ) printf("%c", (buf[i] > 31) ? buf[i] : '.'); 
          else printf("   ");
      printf("\n");
    }
    fclose( f );
  } 
  return 0;
}

Monday, July 12, 2004
 
ziplist

This small program ziplist I have mentioned before in my other blog entry (April 20, 2004). After reading the specification, I was interested in making short utility to show the content of a zip file, thus ziplist was born. It was pure C, portable, self-contained (no need for zlib), and already tested in Linux, Windows, and AIX. Ziplist is neither perfect (probably it can't handle few variants of zip files out there) or highly optimized for performance, it serves only as an example. So enjoy!

/* ziplist: small program to list files inside a .zip 
 * Copyright (C) 2004 Ariya Hidayat <ariya.hidayat@gmail.com>
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
 */

#include <stdio.h>
#include <stdlib.h>

#define readu16(ptr) ((ptr)[0]+((ptr)[1]<<8))
#define readu32(ptr) ((ptr)[0]+((ptr)[1]<<8)+((ptr)[2]<<16)+((ptr)[3]<<24))

void panic( char* msg )
{
  printf( "Fatal error: %s\n", msg );
  exit( -1 );
}

char* method( int n )
{
  char* s;
  switch( n )
  {
    case 0:  s = "Stored"; break;
    case 1:  s = "Shrunk"; break;
    case 2:  s = "Reduced1"; break;
    case 3:  s = "Reduced2"; break;
    case 4:  s = "Reduced3"; break;
    case 5:  s = "Reduced4"; break;
    case 6:  s = "Imploded"; break;
    case 8:  s = "Deflated"; break;
    default: s = "Unknown"; break;
  }  
  return s;
}

void ziplist( char* filename )
{
  FILE* f;
  int count = 0, i;
  unsigned char buffer[32];
  int comp_size, uncomp_size;
  int mod_date, mod_time;
  int name_length;
  int skip;
  
  /* open and check first few bytes for zip ide */  
  f = fopen( filename, "rb" );
  if( !f ) panic( "can't open input file" );  
  if( fread( buffer, 1, 30, f ) != 30 )
   panic( "not a zip file" );
   
  while( !feof( f ) )
  {
  
    /* magic id for file header file is: 0x50,0x4b,0x03,0x04 */
   if( ( buffer[0] != 0x50 ) || ( buffer[1] != 0x4b ) ||  
      ( buffer[2] != 0x03 ) || ( buffer[3] != 0x04 ) )
      {
        if( !count )  
          panic( "not a zip file" );
        else
          break;
      }
   
    /* pretty column header */    
    if( !count )
    {
      printf(" Length    Method    Size    Ratio   Date & Time     CRC-32    Name\n");
      printf("--------  -------- --------  ----- ----------------  --------  ------\n");
    }
  
    count++;
                    
    /* get necessary stuff */
    comp_size = readu32( buffer+18 );
    uncomp_size = readu32( buffer+22 );
    name_length = readu16( buffer+26 );
    mod_time = readu16( buffer+10 );
    mod_date = readu16( buffer+12 );
    skip = readu16( buffer+28 );
    if( buffer[7] & 0x40 )
      skip += 12;  
  
    /* print information */        
    printf("%8d  %-8s %8d  %3d%%  %02d-%02d-%04d %02d:%02d  %02X%02X%02X%02X  ", 
      uncomp_size, method( buffer[8] ), comp_size, !uncomp_size ? 0 : 100-comp_size*100/uncomp_size,
      (mod_date>>5)&15, mod_date&31, 1980+(mod_date>>9),
      (mod_time>>5)&63, (mod_time>>11),
      (unsigned char)buffer[17], (unsigned char)buffer[16], 
      (unsigned char)buffer[15], (unsigned char)buffer[14] );
      
    /* get filename */
    while( name_length > 0 )
    {
      i = fread( buffer, 1, ( name_length < 30 ) ? name_length : 30, f );
      if( i < 0 )
      {
        fclose( f );
        panic( "can't read input file" );
      }
      buffer[ i ] = '\0';
      if( name_length < i ) buffer[ name_length ] = '\0';
      printf("%s", buffer );
      name_length -= i;
    }
    printf( "\n" );
  
    /* skip = number of bytes to skip, including data descriptor (if any) */
    /* skip extra field + compressed data */
    fseek( f, skip + comp_size, SEEK_CUR );
  
    /* load another header */
    if( !feof( f ) )
      if( fread( buffer, 1, 30, f ) != 30 )
        panic( "can't read input file" );
  }
  
  /* that's all, folks */
  fclose( f );
}

int main(int argc, char *argv[])
{
  if( argc < 2 )
  {
    printf("usage: %s filename\n", argv[0] );
    return 0;
  }
  
  ziplist( argv[1] );

  return 0;
}

Sunday, July 11, 2004
 
Column Label of Spreadsheet

If you once work with spreadsheet application, like Microsoft Excel, you'll notice that unlike rows, columns are labelled not by numbers, but by text. Column 1 is A, column 2 is B, and so on. Since we have only A to Z, column 26 is still Z but the next one would be a two-character label, i.e starts from AA, AB, and so on. When this reaches ZZ, the label starts over with AAA, AAB, until ZZZ, and restarts over, and so on, and so on.

So if I give you a column index, say column 1977, how could you tell me its label? The following code is the answer:

QString columnLabel( unsigned column )
{
  QString str;
  unsigned digits = 1;
  unsigned offset = 0;
  
  column--;
  for( unsigned limit = 26; column >= limit+offset; limit *= 26, digits++ )
    offset += limit;
      
  for( unsigned c = column - offset; digits; --digits, c/=26 )
    str.prepend( QChar( 'A' + (c%26) ) );
    
  return str;
}

So it looks a lot simpler than it could be, right? Note that I wrote the code with Qt's QString and QChar classes in mind, because at the time I did it for KSpread. So column 1977 is actually BXA.

The corner case of course when there's integer overflow duing limit calculation. If you track the algorithm, you'll know that this means this code fragment works find until column 4058115285 (which is MCNIOCC). I doubt that at the moment there's an application which can handle so big worksheet, so in the mean time we're on the safe side.

The task to improve the code, for example handling the above mentioned corner case (possible solution: switch to floating-point number, with performance sacrifice), porting it to pure C, C++ string class, or even glib's g_string, are as usual given as exercises for the reader.


Friday, July 02, 2004
 
Simple Logging Facility

If you're familiar with programming, you'll know why it's "1% coding, 99% debugging". In order to ease debugging sessions, it's useful to be able to track what's going on in the debugged application. In C, you can have very simple, yet effective poor-man's logging using the following code snippet:

#include <stdarg.h>
#include <stdio.h>

static void log( char* string, ... )
{
  va_list parameter;
  char output[256];
  FILE *f;

  va_start( parameter, string );
  vsnprintf( output, 255, string, parameter );
  
  f = fopen( "output.log", "at" );
  fprintf( f, "%s\n", output );
  fclose( f );
}
Then, wherever you want to record something, call the function log above, for example log("writing to file %s", filename). Pretty easy, isn't it? Of course, there's still enough room for improvement. It's left as an exercise for the reader :-P


Powered by Blogger