Generating HTML from C++
(level : easy)


Contents

Introduction
CSS (Cascading Style Sheets)
The program
ParseLine function
Limitations
Appendix : CHTMLBuilder's code


Introduction

When you write articles like this one, or build a website about programming, you need to include source code excerpts, or even entire files, as is the case in the appendix. This code is of course more readable with syntax coloring, that is, when the keywords of the language are displayed in one color, strings in another, and so on for numbers, operators, and comments.

The aim of the provided program (exe+sources) (167 Kb) is to automatically generate, for a given .cpp or .h file, an HTML page that displays the code of this file with C++ syntax coloring. The source is provided in case you need to modify it : it's a Visual C++ 6 project that can easily be ported to other compilers, as its main job simply consists in analysing the lines of a text file.


CSS (Cascading Style Sheets)

The generated HTML code doesn't change the current color with <font color="#......"> tags, but with <span class=keyword> tags (for example). This tells the browser to use the styles of the specified class (here: "keyword"). Style classes (which have nothing to do with C++ classes) are described in a file with the .css extension, which must be referenced between the <head> and </head> tags by a line like :
<link rel="stylesheet" type="text/css" href="mystyles.css">.

The use of styles has the following advantage : if you decide one day to change the color of the comments (for example) in the C++ sources displayed by your html pages, all you have to do is change the class that defines the comment style in the .css file, and that's it. Moreover, it's not only about colors : you can choose to display the operators in bold, or the strings in italics, without touching your html files.

Let's sum up : the provided exe generates html pages that use a style file, mystyles.css (you can change the name), to color the C++ code they contain. This file has to be located in the same directory as the html pages (otherwise its path needs to be given in the <link> tag). It is generally shared by all pages (this is not a requirement), which makes it possible to redefine a style for a whole site by touching a single file. Finally, the structure of this file is very simple : a few lines of text correspond to each style, for example for my "keyword" class (C++ keyword) :

.keyword
{
 color: #0000ff;
 font-family: courier;
 font-size: 10pt;
}
Two complete examples are included in the above zip file.


The program

The program accompanying this article is a simple dialog box with a single button used to choose the source file. The generated file has the same name followed by the .htm extension, which gives for example MySource.cpp.htm or MySource.h.htm. The C++ project uses some classes of Fairy that I'm not going to detail here; the part we're interested in is handled by the CHTMLBuilder class, whose code is in the appendix.

This class has 3 main (and public) functions, which are called by CParseCppDlg::OnSelectfile in this way :

  CHTMLBuilder HTML(MemFile);
  HTML.WriteHeader(dlgLoad.GetPathName()+".htm");
  HTML.Parse();
  HTML.WriteFooter();

WriteHeader and WriteFooter deal with the html stuff surrounding the C++ code; Parse reads the source and calls the ParseLine function for each line. The latter uses a few methods to help it do its job :

- IsDigit tells whether a character is a digit
- IsKeyword compares the word about to be added to the .htm file with the keywords of the language, which are stored in a static array. This search is not optimized : the array is walked until the word is identified or the end is reached
- IsNumber checks if the current word is a number. For Visual C++ syntax coloring, a number is a string beginning with a digit or a dot (example : .5f). So 987aaz is a number, and 0xAB9F too (with this rule hexadecimal is taken into account)
- PutWord adds the current word accumulated in m_szWord to the html page, checking whether it's a keyword or a number. As we're going to see, it's ParseLine that cuts the line into a series of words, separated by spaces, tabs, or operators.
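
If the linear keyword search ever becomes a concern, the two tests made by PutWord could be sketched with a hash set, as below. This is only an illustration and not the code of the program : ClassifyWord is a hypothetical name, the keyword list is shortened (the complete one is in the appendix), and std::unordered_set assumes a C++11 compiler, which Visual C++ 6 is not.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical helper (not part of CHTMLBuilder) : returns the name of the
// style class to use for a word, or an empty string for ordinary text.
std::string ClassifyWord(const std::string& word)
{
  // shortened list for the example, the complete array is in the appendix
  static const std::unordered_set<std::string> keywords =
  {
    "for", "if", "else", "while", "return", "class", "#include"
  };

  if(word.empty()) return "";

  // hash lookup instead of walking the whole array
  if(keywords.count(word)) return "keyword";

  // same rule as IsNumber : the word starts with a digit or a dot
  const char first = word[0];
  if(((first >= '0') && (first <= '9')) || (first == '.')) return "number";

  return "";
}
```

With this sketch, "while" gives "keyword", both 987aaz and 0xAB9F give "number", and a name like m_szWord gives an empty string, which matches the order of the tests in PutWord (keyword first, then number).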


ParseLine function

To analyse code, for example of a scripting language, a very common practice is to build a graph defining the rules (= the grammar) of this language. This makes it possible to establish the operator priorities, the definition of what the language recognizes as a number or a variable, and so on ; to learn more about this, search for the word "compiler" on the internet, and you should find some interesting and sometimes very in-depth things. Since the goal of my program is not to "understand" what has been coded, but to color it, there's no need to go into such complex details, and the working of ParseLine stays simple.

The function starts by initializing a few variables for the new line, then enters a loop that runs until there are no more characters to deal with. The tests of this loop are the following :

- is the character a space or a tab ? If so, the current word ends.
- is a comment block (opened by /*) in progress ? If so, the character is sent to the html file; keywords, strings, numbers, etc. are not searched for in comments. If */ is detected, the block ends. A block can span multiple lines; this is tracked by the m_boCommBlock member variable.
- is a string (opened by ") in progress ? To close it, another " character has to be encountered, and it must not be escaped by a \ because \" is used to display the " character itself. The code counts the consecutive \ characters before the quote : an odd count means the quote is escaped, so a string ending with \\ is correctly seen as closed.
- same thing for a string delimited by single quotes : '.
- the beginning of a simple comment (//) is checked. Such a comment runs until the end of the line, so there's no need to test the remaining characters : they are written to the html page immediately.
- since we're in neither a comment nor a string, we check whether a comment block or a string starts here.
- last test : does the character belong to an operator ? It is compared to the content of a static array. Particular case : the dot, which at the beginning of a word can be either an operator or the start of a floating point number (in the latter case it is followed by a digit, that's how the difference is made).
- none of the previous tests has accepted the character : it's added to the current word, which can be a keyword, a number, or something else (variable name, class name, function...). PutWord will determine which when the end of this word is reached.
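
The least obvious of these tests is probably the end of a string, because of the \ characters. Here is a minimal sketch of that single test, isolated from the rest of the loop. FindClosingQuote is a hypothetical helper written for this explanation and using std::string; ParseLine itself does the same counting with its dwAnti variable while it walks the line.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper : returns the index of the quote that closes a string
// whose content starts at 'start', or std::string::npos if the string is
// not closed on this line.
std::size_t FindClosingQuote(const std::string& line, std::size_t start, char quote)
{
  std::size_t backslashes = 0;   // consecutive '\' just before the current char

  for(std::size_t i = start; i < line.size(); i++)
  {
    const char c = line[i];
    if(c == '\\')
    {
      backslashes++;
      continue;
    }
    // an even number of '\' means the quote itself is not escaped
    if((c == quote) && ((backslashes & 1) == 0)) return i;
    backslashes = 0;
  }
  return std::string::npos;
}
```

The parity test is what makes the difference between \" (one backslash, the quote is part of the string) and \\" (two backslashes, that is an escaped backslash followed by the real end of the string).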


Limitations

This program is by no means exhaustive; a few cases are not handled :

- the \ at the end of a line (used to continue macros on the next line)
- in Visual, a lone single quote (that is to say an unclosed string) on a line leaves the rest of the line in the color of ordinary words (variables etc.). I can't see any reason for that, so I handle single quotes like double ones.
- in Visual, 0...1.2 is a number, whereas in my program it is not (but who would type such things in C++ ?)
- be careful if your code contains html tags, in strings or comments : they're going to be written without modification in the html page, then interpreted by the browser ! This is normal, but leads to unpleasant results, for example when a <html> is encountered in the middle of the page. This case should be rare, but happens with HTMLBuilder.cpp, which has lines like :
  m_OutputFile.PutString("<html>\n");
The solution is to replace the < characters by &lt; (a & followed by lt; for "less than"). The opposite effect also exists : the browser replaces what it recognizes as html special characters (for example an entity such as &amp; appearing in a string) by the character they stand for, which is not necessarily wanted (but once more, very unlikely in C++ code).
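
The replacement described above could be done by a small function applied to the text before it is written to the page. The following is a sketch only : EscapeHtml is a hypothetical name, it uses std::string rather than the char buffers of CHTMLBuilder, and the provided program does not contain it.

```cpp
#include <string>

// Hypothetical helper : returns a copy of 'text' that can be written
// between <pre> and </pre> without being interpreted by the browser.
std::string EscapeHtml(const std::string& text)
{
  std::string out;
  out.reserve(text.size());

  for(char c : text)
  {
    switch(c)
    {
      case '&': out += "&amp;"; break;   // otherwise a "&lt;" present in the source would be decoded
      case '<': out += "&lt;";  break;
      case '>': out += "&gt;";  break;
      default : out += c;       break;
    }
  }
  return out;
}
```

Escaping the & as well takes care of the opposite effect : an entity written literally in a string of the source is then displayed as typed instead of being replaced by the browser.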


Appendix : CHTMLBuilder's code

HTMLBuilder.h
HTMLBuilder.cpp

HTMLBuilder.h
// HTMLBuilder.h: interface for the CHTMLBuilder class.
//
//////////////////////////////////////////////////////////////////////

#if !defined(AFX_HTMLBUILDER_H__CB40BC70_31B4_11D6_9CD7_444553540000__INCLUDED_)
#define AFX_HTMLBUILDER_H__CB40BC70_31B4_11D6_9CD7_444553540000__INCLUDED_

#if _MSC_VER > 1000
#pragma once
#endif // _MSC_VER > 1000

#include "global/typedefs.h"

class CHTMLBuilder  
{
  public:
                    CHTMLBuilder        (Mythos::MemFile& MemFile);
    virtual        ~CHTMLBuilder        (void);

    void            WriteHeader         (const char* pszFile);
    void            WriteFooter         (void);
    void            Parse               (void);

  protected:

    void            ParseLine           (char* pszLine);
    bool            IsDigit             (const char cChar);
    bool            IsKeyword           (void);
    bool            IsNumber            (void);
    void            PutWord             (void);

  protected:

    Mythos::MemFile&m_MemFile;
    Mythos::File    m_OutputFile;

    bool            m_boCommBlock;
    char            m_szWord[1024];                         // current accumulated word
    DWORD           m_dwWordLen;                            // nb accumulated chars
};

#endif // !defined(AFX_HTMLBUILDER_H__CB40BC70_31B4_11D6_9CD7_444553540000__INCLUDED_)

HTMLBuilder.cpp
// HTMLBuilder.cpp: implementation of the CHTMLBuilder class.
//
//////////////////////////////////////////////////////////////////////

#include "stdafx.h"
#include "HTMLBuilder.h"

#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[]=__FILE__;
#define new DEBUG_NEW
#endif

//

static char szOperator[] = "!%&()*+,-./:;<=>?[]^{|}~\0";

static char szKeywords[][32] =
{
  "for","if","else","continue","do","while","goto",
  "switch","case","break","default","return",
  "new","delete","inline",

  "bool","char","double","float","int","long","short","void",
  "false","true",
  "const","unsigned","signed","volatile","mutable",
  "auto","extern","static","register",

  "#include","#if","#ifdef","#ifndef","#else","#elif","#endif","#define","#undef",
  "#pragma","once","defined",

  "struct","union","enum","typedef","sizeof",
  "this","explicit","operator","private","public","protected","friend","class","virtual",
  "template","using","namespace","typename","typeid","uuid","__uuidof","interface",
  "const_cast","static_cast","dynamic_cast","reinterpret_cast",

  "__asm","__based","__cdecl","__declspec","__fastcall","__inline","__stdcall","naked",
  "__single_inheritance","__multiple_inheritance","__virtual_inheritance",
  "__int8","__int16","__int32","__int64",
  "dllexport","dllimport",
  "thread","throw","try","catch",
  "__try","__leave","__finally","__except",
  ""                                                        // empty string ends the list (tested by IsKeyword)
};

//////////////////////////////////////////////////////////////////////
// Construction/Destruction
//////////////////////////////////////////////////////////////////////

CHTMLBuilder::CHTMLBuilder(Mythos::MemFile& MemFile) : m_MemFile(MemFile)
{
}

CHTMLBuilder::~CHTMLBuilder()
{
}

//

void CHTMLBuilder::WriteHeader(const char* pszFile)
{
  if(!m_OutputFile.Open(pszFile,Mythos::IFile::_WRITE_TEXT_)) return;

  m_OutputFile.PutString("<html>\n");
  m_OutputFile.PutString("<head>\n");
  m_OutputFile.PutString("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\">\n");
  m_OutputFile.PutString("<link rel=\"stylesheet\" type=\"text/css\" href=\"mystyles.css\">\n");
  m_OutputFile.PutString("</head>\n");
  m_OutputFile.PutString("\n");
  m_OutputFile.PutString("<body>\n");
  m_OutputFile.PutString("<table border=0 cellspacing=0 cellpadding=0 width=\"100%\">\n");
  m_OutputFile.PutString("<tr><td bgcolor=\"#ffffff\">\n");
  m_OutputFile.PutString("<span class=source>\n");
  m_OutputFile.PutString("<pre>\n");
}

//

void CHTMLBuilder::WriteFooter()
{
  m_OutputFile.PutString("</pre>\n");
  m_OutputFile.PutString("</span>\n");
  m_OutputFile.PutString("</td></tr></table>\n");
  m_OutputFile.PutString("\n");
  m_OutputFile.PutString("</body>\n");
  m_OutputFile.PutString("</html>\n");

  m_OutputFile.Close();
}

//

void CHTMLBuilder::Parse()
{
  char szLine[1024];
  m_boCommBlock = false;

  while(true)
  {
    if(!m_MemFile.GetString(szLine,1024)) break;
    ParseLine(szLine);
    m_OutputFile.PutChar('\n');
  }
}

//

void CHTMLBuilder::ParseLine(char* pszLine)
{
  char* pszChar  = pszLine;
  bool  boString = false;
  bool  boQuote  = false;
  DWORD dwAnti   = 0;                                       // consecutive "\"
  char  cChar    = 0;
  char  cPrev;

  m_dwWordLen    = 0;

  while(*pszChar)
  {
    cPrev = cChar;
    cChar = *pszChar++;
    if(cPrev == '\\') dwAnti++;
    else              dwAnti = 0;
    
    if(cChar == ' ')
    {                                                       // space
      PutWord();
      m_OutputFile.PutChar(' ');
      continue;
    }
    if(cChar == 9)
    {                                                       // tab
      PutWord();
      m_OutputFile.PutChar(cChar);                          // tab length can be modified here, eg: m_OutputFile.PutString("  ");
      continue;
    }

    // comment block

    if(m_boCommBlock)
    {                                                       // can only end with "*/"
      if((cChar != '*') || (*pszChar != '/'))
      {
        m_OutputFile.PutChar(cChar);
        continue;
      }
      m_OutputFile.PutString("*/</span>");
      pszChar++;
      m_boCommBlock = false;
      continue;
    }

    // string block

    if(boString)
    {                                                       // can only end with '"'
      if((cChar != '"') || (dwAnti & 1))
      {                                                     // not ", or previous char is '\'
        m_OutputFile.PutChar(cChar);
        continue;
      }

      m_OutputFile.PutString("\"</span>");
      boString = false;
      continue;
    }

    // quote block

    if(boQuote)
    {                                                       // can only end with ' (and should, if we have started a block)
      if((cChar != '\'') || (dwAnti & 1))
      {
        m_OutputFile.PutChar(cChar);
        continue;
      }

      m_OutputFile.PutString("'</span>");
      boQuote = false;
      continue;
    }

    // comment starts

    if(cChar == '/')
    {
      if(*pszChar == '/')
      {                                                     // simple comment (//)
        PutWord();
        m_OutputFile.PutString("<span class=comment>//");
        m_OutputFile.PutString(pszChar+1);
        m_OutputFile.PutString("</span>");
        return;                                             // goes till the end of the line
      }

      if(*pszChar == '*')
      {                                                     // comment block starts (/*)
        PutWord();
        m_OutputFile.PutString("<span class=commblock>/*");
        m_boCommBlock = true;
        pszChar++;
        continue;
      }
    }

    // string starts

    if(cChar == '"')
    {
      PutWord();
      m_OutputFile.PutString("<span class=string>\"");
      boString = true;
      continue;
    }

    // quote starts

    if(cChar == '\'')
    {
      PutWord();
      m_OutputFile.PutString("<span class=string>'");
      boQuote = true;
      continue;
    }

    // operator

    if(strchr(szOperator,cChar))
    {
      if((cChar != '.') || !IsDigit(*pszChar))
      {                                                     // '.' can start a number
        PutWord();
        m_OutputFile.PutString("<span class=operator>");
        m_OutputFile.PutChar(cChar);
        while(*pszChar && strchr(szOperator,*pszChar))
        {
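          // special case : '.' can start a number
          if((*pszChar == '.') && IsDigit(*(pszChar+1))) break;
          // special case : '/' can start a comment (// or /*), the main loop handles it
          if((*pszChar == '/') && ((*(pszChar+1) == '/') || (*(pszChar+1) == '*'))) break;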

          cPrev = cChar;
          cChar = *pszChar++;
          m_OutputFile.PutChar(cChar);
        }
        m_OutputFile.PutString("</span>");
        continue;
      }
    }

    //

    m_szWord[m_dwWordLen++] = cChar;
  }

  PutWord();
}

//

bool CHTMLBuilder::IsDigit(const char cChar)
{
  return((cChar >= '0') && (cChar <= '9'));
}

//

bool CHTMLBuilder::IsKeyword()
{
  for(DWORD dwI = 0; szKeywords[dwI][0] != 0; dwI++)
  {
    if(!strcmp(szKeywords[dwI],m_szWord))
    {
      return true;
    }
  }
  return false;
}

//

bool CHTMLBuilder::IsNumber()
{
  return(IsDigit(m_szWord[0]) || (m_szWord[0] == '.'));
}

//

void CHTMLBuilder::PutWord()
{
  if(!m_dwWordLen) return;
  m_szWord[m_dwWordLen] = 0;

  // keyword

  if(IsKeyword())
  {
    m_OutputFile.PutString("<span class=keyword>");
    m_OutputFile.PutString(m_szWord);
    m_OutputFile.PutString("</span>");
  }

  // number

  else if(IsNumber())
  {
    m_OutputFile.PutString("<span class=number>");
    m_OutputFile.PutString(m_szWord);
    m_OutputFile.PutString("</span>");
  }

  // text

  else
  {
    m_OutputFile.PutString(m_szWord);
  }

  //

  m_dwWordLen = 0;
}
