Documenting Code with AWK

In this article I present an AWK script I've been using for quite some time to create HTML documentation from the headers in my source code.

The script fills a role similar to that of tools like Javadoc and Doxygen, but it is not as powerful.

The simplicity comes with some advantages though: AWK is widely available, so I can include the script with my source code and be sure that it will just work. Also, the input (and the output) tends to be simpler, which is ideal for the smaller hobby projects I've been spending my time on.

Introduction to AWK

AWK must certainly be one of the most underappreciated programming languages.

There is a lot to like about it.

For one, it has a good pedigree: its creators include Alfred Aho (co-author of the Dragon Book) and Brian Kernighan (co-author of The C Programming Language).

An AWK program consists of a series of constructs like condition { statements }, where the condition is checked against each line in an input file and if it evaluates to true the statements are executed. Conditions are typically regular expressions, but there are some special forms for other purposes.

The following example prints a line from the input file only if it matches the regular expression /foo/:

		
/foo/ {print}
		
AWK also conveniently splits each line into columns, which you refer to as $1, $2 and so on (columns are separated by whitespace by default, but you can change the separator). The following program prints the second column if the first column matches the regular expression /foo/:
		
$1~/foo/ {print $2}
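
As a small illustration of changing the separator, the shell snippet below feeds two comma-separated lines (made up for this example) to awk and uses the -F option so that columns are split on commas instead of whitespace:

```shell
# Hypothetical input: two comma-separated records.
# -F, tells awk to split columns on commas.
out=$(printf 'foo,1\nbar,2\n' | awk -F, '$1 ~ /foo/ { print $2 }')
printf '%s\n' "$out"
```

Only the first line has a first column matching /foo/, so only its second column is printed.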
		
AWK has many more features and I encourage you to read more about it if you're not already familiar with it.

Using AWK to document your code

Here's a link to the script's code: doc.awk on github. I don't want to use this article to advertise my particular script, but rather explain the idea so you can adapt it for your own needs.

The BEGIN section opens the <html> tag, generates the <head> with the CSS, and then opens the <body> tag. Likewise, the END section closes the <body> and <html> tags.
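
A minimal sketch of that skeleton might look like the following. It is not copied from doc.awk: the CSS rule is a placeholder, and I'm assuming the title variable (set with -v on the command line) ends up in the <title> tag. Reading from /dev/null means only the BEGIN and END sections run.

```shell
# Sketch only: the CSS rule and the use of title are assumptions for illustration.
page=$(awk -v title="demo.h" '
BEGIN {
    # Runs once, before any input is read.
    print "<html><head><title>" title "</title>"
    print "<style>body { font-family: sans-serif; }</style>"
    print "</head><body>"
}
END {
    # Runs once, after the last input line.
    print "</body></html>"
}
' /dev/null)
printf '%s\n' "$page"
```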

The core of the script starts with the construct

/\/\*/ { comment = 1; }
		
which matches the opening of a multi-line comment (the /*) and sets the variable comment to 1. In AWK, any non-zero number is true, so you'll see that the code is littered with if(!comment) next; statements, which ensure the script skips lines in the input file that happen to match a pattern in the script but aren't inside a comment.

Later when the end of the multi-line comment (the */) is encountered, the script sets comment back to 0 and closes any <div> tags that it may have opened while processing the input.
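
The state tracking can be sketched in isolation as follows. The input and the rule that prints marked lines are made up for this example; the real script does more work when a comment closes.

```shell
# Toy input: one comment with a marked line, then a marked line outside any comment.
seen=$(printf '/*\n *# inside\n */\n *# outside\n' | awk '
/\/\*/ { comment = 1 }                  # a comment opens on this line
/\*#/  {
    if (!comment) next                  # marker outside a comment: skip the line
    print "doc: " $2
}
/\*\// { comment = 0 }                  # a comment closes on this line
')
printf '%s\n' "$seen"
```

The last input line carries the same marker as the second one, but by then comment is 0 again, so it is skipped.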

The meat of the script then processes each line in the comments. Each comment line that forms part of the documentation starts with a glyph (an asterisk and some other symbol) that determines the output. If the line doesn't have a glyph then it is simply ignored, allowing you to have comments in the code that don't form part of the documentation.

The simplest glyph is *#, which outputs the remainder of the line after passing it through the filter() function (which I'll get to in a moment). If the rest of the line is blank, a <br> tag is output instead, which seems more intuitive.

Other glyphs are *1, *2 and *3, which add heading tags, *- and *= for <hr> tags, and *[ and *], which wrap everything between them in a <pre> tag.
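
To make the idea concrete, here is a stripped-down sketch of how three of these glyphs could be handled. It is my own simplification rather than the code in doc.awk: it skips filter(), handles only *1, *- and *#, and assumes the closing */ sits on a line of its own.

```shell
# Toy header comment using the *1, *# and *- glyphs.
html=$(printf '/*1 Title\n *# Some text\n *#\n *-\n */\n' | awk '
/\/\*/ { comment = 1 }
/\*1/  {
    if (!comment) next
    s = $0; sub(/^.*\*1[ \t]*/, "", s)   # drop everything up to the glyph
    print "<h1>" s "</h1>"
    next
}
/\*-/  { if (!comment) next; print "<hr>"; next }
/\*#/  {
    if (!comment) next
    s = $0; sub(/^.*\*#[ \t]*/, "", s)
    if (s == "") print "<br>"            # blank documentation line
    else print s
    next
}
/\*\// { comment = 0 }
')
printf '%s\n' "$html"
```

Note that the rule for /* does not end with next, so a glyph on the same line as the comment opener (as in /*1 Title) is still processed.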

I use the *@ glyph to document individual functions - it highlights the line in the output and then wraps the rest of the comment in a <div> tag that creates a box around it. It basically serves the same purpose as the initial /** sequence in a Javadoc comment before a Java method, except that you actually have to supply the method prototype because the script is not smart enough to figure it out.
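
The open-a-box, close-it-later behaviour can be sketched like this. The class names and the div variable are invented for the example, and the real script's markup may well differ; the point is only that the <div> opened by *@ is remembered and closed when the comment ends.

```shell
# Toy input: a documented function prototype followed by one line of text.
box=$(printf '/*@ int f(void)\n *# Does f.\n */\n' | awk '
/\/\*/ { comment = 1 }
/\*@/  {
    if (!comment) next
    s = $0; sub(/^.*\*@[ \t]*/, "", s)
    print "<div class=\"prototype\">" s "</div>"   # highlighted prototype line
    print "<div class=\"box\">"                    # box around the rest
    div = 1                                        # remember the open <div>
    next
}
/\*#/  { if (!comment) next; s = $0; sub(/^.*\*#[ \t]*/, "", s); print s; next }
/\*\// {
    if (div) print "</div>"                        # close the box, if one is open
    div = 0
    comment = 0
}
')
printf '%s\n' "$box"
```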

There are also *{ and *} glyphs for creating <ul> lists, with each <li> item indicated by a ** glyph.

There are also some glyphs like *X and *N for examples and notes respectively, but they are less useful than I initially thought they would be.

The last part of the script is the function filter(ss), which replaces special characters in the input with their HTML escape sequences (e.g. '<' with '&lt;') and replaces special formatting sequences with the appropriate HTML tags (e.g. {* with <strong>, *} with </strong>, and so on). It also replaces \n sequences with <br> tags to allow you to insert intentional newlines in the documentation.

It also generates <a href="..."> tags for hyperlinks, but this should only be used for simple URLs, because detecting URLs with a regex is complicated.
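
A reduced version of such a function could look like the sketch below. It covers only the escapes, the bold markers, the \n sequence and the {{...}} markers; mapping {{...}} to <tt> is my assumption here, and hyperlink detection is left out entirely. The one AWK subtlety is that & in a gsub() replacement stands for the matched text, so a literal ampersand has to be written as "\\&".

```shell
# Sample line with a special character, a literal \n sequence and two markers.
filtered=$(printf '%s\n' 'a < b,\n{*bold*} and {{code}}' | awk '
function filter(ss) {
    gsub(/&/, "\\&amp;", ss)        # escape & first so later entities survive
    gsub(/</, "\\&lt;", ss)
    gsub(/>/, "\\&gt;", ss)
    gsub(/\\n/, "<br>", ss)         # a literal backslash-n becomes a line break
    gsub(/\{\*/, "<strong>", ss)
    gsub(/\*\}/, "</strong>", ss)
    gsub(/\{\{/, "<tt>", ss)        # assumed tag for the {{...}} markers
    gsub(/\}\}/, "</tt>", ss)
    return ss
}
{ print filter($0) }
')
printf '%s\n' "$filtered"
```

The HTML escapes run before any tags are inserted; doing it the other way round would escape the angle brackets of the generated tags too.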

Here is an example of what a C header file that uses the script would look like:

/*1 doc.awk example
 *# This is an example of the output 
 *# generated by the {*doc.awk*} script.
 *#
 *# You can have text in {/italics/}, {*bold*}
 *# or in {{true type}}.
 *# You can also use simple URLs: http://example.com
 *# 
 *# I typically do something like the following to
 *# document my functions:
 *#
 *@ int blargle(int argle, char *blort)
 *# Blargles the {{argle}} into {{blort}},
 *# and returns the number of argles blargled.
 */ 
int blargle(int argle, char *blort) {
	/* normal comments are ignored */ 
}
		
This script will generate the following output:

I normally include the script as part of the distribution of my projects' code, but I have created a Gist to serve as the canonical version of the file.

The header files in my wernsey/miscsrc (miscellaneous source code) repository contain several more examples of how I use the script.

To use the script in a Makefile, simply add a target docs that runs the script on your header and redirects the output to an HTML file:

docs: docs.html
docs.html : file.h doc.awk
	awk -f doc.awk -vtitle=$< $< > $@