Werner's home on the Web

Documenting your code with AWK and Markdown


Here I present d.awk 1, an AWK script to generate documentation from Markdown-formatted comments in C, C++, JavaScript, Java, C# and any other language that uses the /* */ comment syntax.

/**
 * My Project
 * ==========
 *
 * This is some _Markdown documentation_ in a
 * `source file`.
 */
int main(int argc, char *argv[]) {
    printf("hello, world");
    return 0;
}

Using it is simple: put anything you want documented, as Markdown-formatted text, in a comment block that starts with /**.

Each line in the comment must start with an asterisk. Lines that don’t start with an asterisk are not processed (so that you can effectively comment your comments).

You can also use three slashes, ///, if your comment is only a single line.

You then run the source file through the d.awk script to generate your HTML documentation, like this:

awk -f d.awk source.c > doc.html

The GitHub repository contains the file demo.c that serves as an example and test of all the features.

This is what the output looks like if you run demo.c through the script: demo.html

In my own projects, I usually put a docs target in my Makefile, like this:

# Assumption: AWK is not defined elsewhere in the Makefile, so default it to awk
AWK ?= awk
docs: docs.html
docs.html: source.c d.awk
	$(AWK) -f d.awk $< > $@

Additional Scripts

I’ve added some additional scripts to the distribution:

mdown.awk is a script to generate HTML from a regular Markdown file.

It is basically the d.awk script without the parts that filter source code comments.

It was originally meant for cases where you want to generate an HTML file from the README.md document in your project that has the same styles as the other generated documentation. I used this script with a modified stylesheet for this website before I switched to Hugo.

It is obsolete for its original purpose now that d.awk has a Clean mode. The Clean mode will treat the input file as a regular Markdown document. This allows you to use a single script to document your source and non-source files. You enable it by setting the Clean variable in the script to a nonzero value:

awk -f d.awk -v Clean=1 README.md > README.html

hashd.awk is the same as d.awk, but for languages that use hash ('#') symbols for comments. You use it the same way as the other scripts, but you start your comment block with two hashes, like this:

##
# My Project
# ==========
#
# This is some _Markdown documentation_ in a `source file`.
if __name__ == "__main__":
	print("Hello World")
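
The extraction step for hash comments can be sketched in a few lines. This is only an illustration and not hashd.awk's actual code: the function name extract_hash_docs is hypothetical, and the rule that any line not starting with a hash ends the block is an assumption.

```sh
# Sketch only, not hashd.awk itself: print the Markdown found in "##" blocks.
# extract_hash_docs is a hypothetical name; it reads files or standard input.
extract_hash_docs() {
    awk '
    # A line holding only "##" opens a documentation block.
    /^[ \t]*##[ \t]*$/ { indoc = 1; next }
    # Inside a block, strip the leading "#" and one optional space.
    indoc && /^[ \t]*#/ {
        sub(/^[ \t]*#[ \t]?/, "")
        print
        next
    }
    # Assumption: any other line ends the block.
    { indoc = 0 }
    ' "$@"
}

hash_doc=$(printf '%s\n' '##' '# My Project' '# ==========' '#' \
    '# This is some _Markdown documentation_.' \
    'print("Hello World")' '# an ordinary comment' | extract_hash_docs)
printf '%s\n' "$hash_doc"
```

The ordinary comment after the print statement is skipped, because no "##" line reopened a documentation block before it.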

mdown.awk and hashd.awk’s code is the same as d.awk’s, except for the code at the start that feeds input lines to the

The xtract.awk script extracts the Markdown comments of a source file without processing it. I have this for a use case where you might, say, want to copy markdown comments in your source code into a tool that takes markdown input, such as a GitHub wiki.

wrap.awk is a pretty printer script for Markdown that word wraps a markdown document to fit the specified width of a page. It is basically like the Unix fmt utility, except that it respects Markdown. The lines in my README files tend to be as long as my mood makes them, so this script helps me keep them neat.

Background

Many years ago I experimented with using an AWK script to generate HTML documentation from comments in my source code.

A lot of the libraries and utilities I’ve written over the years have a single source file with a header, so tools such as Doxygen and Javadoc felt too heavy duty. I wanted a script that you can bundle with your source code and that generates a single HTML file.

I wrote about the idea in Documenting Code with AWK.

It worked, but the syntax I devised was an ad hoc affair, and I was never really satisfied with it2.

I had this idea long before I knew what Markdown was. If memory serves, my first exposure to Markdown was with StackOverflow around 2008, and my script predated even that.

At some point I decided that I wanted a more markdown-like syntax, at least for the basic things, like _italics_, **Bold**, and [hyperlinks](http://example.com).

I started a rewrite, where the initial version did just that, but then I decided I needed headings, ordered and unordered lists and code blocks. And the feature creep just kept creeping.

The result is the 1000+ line AWK script in d.awk.

Throughout the years it gained a lot of features:

  • Tables using the GitHub syntax,
  • Images, including images encoded as Data URIs,
  • Support3 for highlight.js syntax highlighting,
  • Support for Mermaid diagrams,
  • Support for MathJax mathematical typesetting4,
  • Footnotes and abbreviations,
  • Block quotes,
  • Task lists,
  • It can generate a table of contents,
  • You can also use HTML tags (with some limitations),
  • The HTML output has support for light and dark modes, and I took special care that it prints correctly.

I strive to stay as close to GitHub-flavoured Markdown syntax as possible, but I’ve taken the liberty to add some features where it seemed useful5.

I still hack on it from time to time: I recently added support for definition lists and replaced the old code-prettify syntax highlighter with highlight.js.

I once mentioned this on HackerNews, and someone asked why I used AWK and not Perl. I replied that AWK was basically available everywhere, to which they replied that so is Perl. That is a fair point. The honest truth is that I just like AWK and I haven’t used Perl in many, many years.

Internals

The parser revolves around two core functions:

  • filter(), which processes the document at the high level, dividing it into paragraphs, lists, code blocks and so on, and
  • scrub(), which deals with the inline items, like bold, italic and monospaced blocks, and inline HTML tags.

Each comment line from the input is fed to filter(). filter() implements a state machine that tracks its current state in a variable called Mode. Mode’s value will be a string that tracks the current type of HTML tag it’s processing, like “p”, “pre”, “ul” and so on. States can be pushed on a stack or popped to implement things like nested lists.
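
To make the idea concrete, here is a much-reduced sketch of such a state machine. It is not the real filter(): it only knows paragraphs and nested unordered lists, the names md_blocks, push(), pop(), pop_all() and ListDepth are hypothetical, and the two-spaces-per-level indentation rule is an assumption. Only the Mode variable and the capitalised globals follow the conventions described here.

```sh
# Sketch only, not the real filter(): a tiny block-level state machine.
# md_blocks and ListDepth are hypothetical names; Mode follows the article.
md_blocks() {
    awk '
    # Save the current state on the stack and enter a new one.
    function push(tag) { Stack[++Sp] = Mode; Mode = tag; Out = Out "<" tag ">\n" }
    # Close the current tag and return to the state saved before it.
    function pop() { Out = Out "</" Mode ">\n"; Mode = Stack[Sp--] }
    # Unwind every open state, closing the pending <li> of each list.
    function pop_all() {
        while (Mode != "none") {
            if (Mode == "ul") Out = Out "</li>\n"
            pop()
        }
        ListDepth = 0
    }
    function filter(line,    depth) {
        if (line ~ /^[ \t]*$/) { pop_all(); return }      # blank line ends a block
        if (match(line, /^ *- /)) {                        # list item
            depth = int((RLENGTH - 2) / 2) + 1             # two spaces per level
            if (depth > ListDepth + 1) depth = ListDepth + 1
            if (Mode == "p") pop()
            while (ListDepth > depth) { Out = Out "</li>\n"; pop(); ListDepth-- }
            if (ListDepth == depth) Out = Out "</li>\n"    # close previous sibling
            while (ListDepth < depth) { push("ul"); ListDepth++ }
            Out = Out "<li>" substr(line, RLENGTH + 1) "\n"
            return
        }
        if (Mode != "p") { pop_all(); push("p") }          # anything else: paragraph
        Out = Out line "\n"
    }
    BEGIN { Mode = "none" }
    { filter($0) }
    END { pop_all(); printf "%s", Out }
    ' "$@"
}

blocks_html=$(printf '%s\n' 'Intro text' '' '- one' '  - nested' '- two' | md_blocks)
printf '%s\n' "$blocks_html"
```

The nested item pushes a second "ul" state on top of the first; when the next top-level item arrives, that state is popped and the outer list carries on where it left off. The sketch does no escaping or inline formatting, since that belongs to the inline phase.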

The results of filter() are eventually passed to scrub() which uses a succession of regular expressions to replace words between underscores with italics, backticks with code tags, escape ampersands and less than or greater than symbols, remove HTML comments, and so on.
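
A succession of replacements like this might look as follows. Again, this is a sketch under assumptions rather than the actual scrub(): the helper wrap() and the wrapper scrub_demo are hypothetical, only three inline styles are handled, and it makes no attempt to protect underscores inside code spans or identifiers. POSIX AWK has no capture groups, so the sketch walks the string with match() instead of using a single gsub().

```sh
# Sketch only, not the real scrub(): escape HTML, then apply inline styles.
# scrub_demo and wrap() are hypothetical names.
scrub_demo() {
    awk '
    # Wrap the text between each pair of delimiters matched by re in <tag>.
    # dlen is the delimiter length: 1 for _ and backticks, 2 for **.
    function wrap(s, re, dlen, tag,    out) {
        out = ""
        while (match(s, re)) {
            out = out substr(s, 1, RSTART - 1) "<" tag ">"
            out = out substr(s, RSTART + dlen, RLENGTH - 2 * dlen) "</" tag ">"
            s = substr(s, RSTART + RLENGTH)
        }
        return out s
    }
    function scrub(s) {
        # Escape first, so that the tags added below are not escaped again.
        gsub(/&/, "\\&amp;", s)
        gsub(/</, "\\&lt;", s)
        gsub(/>/, "\\&gt;", s)
        s = wrap(s, "`[^`]+`", 1, "code")
        s = wrap(s, "\\*\\*[^*]+\\*\\*", 2, "strong")
        s = wrap(s, "_[^_]+_", 1, "em")
        return s
    }
    { print scrub($0) }
    ' "$@"
}

scrubbed=$(printf '%s\n' 'Use `a < b` with _care_ & **caution**' | scrub_demo)
printf '%s\n' "$scrubbed"
```

The order of the steps matters: the ampersand has to be escaped before the angle brackets, otherwise the ampersand in a freshly generated entity would be escaped a second time.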

scrub() may also pass through a selection of HTML tags to the output, so that your documentation can contain elements that the Markdown might not allow. These tags cannot have attributes, and will be styled the same as the tags generated from the Markdown.

Look at the global variable HTML_tags at the top of the script to see which HTML tags are allowed through like this.

There are a couple of functions whose names start with end_. They’re used to finish up the formatting of special blocks.

  • end_table() takes all the rows that filter() extracted for a table and formats them into an HTML <table>
  • end_pre() deals with indented or fenced code blocks. It can add syntax highlighting, or treat the block as a Mermaid diagram.
  • end_blockquote() adds an icon and a heading to the blockquote if the document contains GitHub-style alerts
  • end_dl() formats definition lists. When dealing with definition lists, filter() just concatenates all the input lines together. end_dl() then separates those lines and converts them to <dt> and <dd> tags.

These may all call scrub() again to format their output.

GitHub’s markdown documentation has an appendix called A parsing strategy. I wish I had known about this document when I started out. On the other hand, it suggests parsing the document into a tree of blocks, and since trees are complicated to work with in AWK, maybe if I’d known I wouldn’t have attempted it at all.

Because my script doesn’t use a tree, there are some things it just cannot do, like nested block quotes, but those are features I can live without, and if they are really needed they can be done in HTML.

Still, the filter() function corresponds to the Phase 1: block structure section in that appendix, and scrub() to Phase 2: inline structure.

The output of the filter() is appended to a global variable called Out (I use the convention of capitalising global variable names). The END {} block in the script puts everything together to create the final HTML document.

It uses three additional helper functions, fix_footnotes(), fix_links() and fix_abbrs() to format footnotes, hyperlinks and abbreviations (<abbr> tags) respectively.

It also generates the JavaScript to toggle the dark mode and to copy code blocks to the clipboard.

If highlight.js, Mermaid or MathJax are used, it also generates the necessary <script> tags to load and initialise those libraries.

The END block also calls a function make_css() (defined near the end of the script) that generates the CSS. It first puts all the CSS into an associative array with the CSS selector as key and the styles as value. It then iterates through this associative array to generate the CSS itself.
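
The array-then-iterate approach can be sketched like this. The selectors and styles below are invented for illustration and are not d.awk's real stylesheet, and make_css_demo is a hypothetical wrapper; only the idea of keying an associative array on the CSS selector comes from the description above.

```sh
# Sketch only: build CSS from an associative array keyed on the selector.
# The selectors and styles are made up; they are not the real stylesheet.
make_css_demo() {
    awk '
    function make_css(    css, sel, out) {
        css["body"] = "font-family:sans-serif;line-height:1.5"
        css["pre"]  = "background:#eee;padding:0.5em"
        css["code"] = "font-family:monospace"
        out = ""
        # for-in visits the keys in an unspecified order.
        for (sel in css)
            out = out sel " {" css[sel] "}\n"
        return out
    }
    BEGIN { printf "%s", make_css() }
    '
}

# Sort the rules here only to get a stable order for display.
css_out=$(make_css_demo | sort)
printf '%s\n' "$css_out"
```

Because each rule lives under its own key, changing a font or colour means overwriting one array entry before the loop runs. The order in which AWK walks the array is unspecified, which is harmless as long as no two selectors in the array depend on cascade order.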

This may seem overcomplicated, but back when I started, when the styles were simpler, I had a couple of different themes with different styles, each in its own array. Eventually the CSS grew and I abandoned the idea of themes, but I left the CSS script the way it was because it may help users of the script customise fonts and colours.

A Note on Licensing

I want something permissive, so that anyone can just copy the script into their project directory and generate documentation, regardless of what license they use for their project.

For that reason, the individual files are distributed under the Free Software Foundation’s simple all-permissive license, which you’ll find in the comments at the top of each script, so that you can distribute the files freely with your project:

Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty.

The all-permissive license is not an option when creating repositories on GitHub, though, so I decided to distribute the package as a whole under the MIT-0 license.

Further Reading

The Internet Archive has a copy of the book The AWK Programming Language (PDF) by the original authors of AWK (Aho, Kernighan and Weinberger). This discussion about it on Hacker News has some further links and makes for interesting reading.

Extra Credit

After creating all of that, I once stumbled upon GitHub user r-lyeh’s stddoc.c project, which achieves the same results as my project but in a very different way: His project just isolates the special comments in a source file and writes them to an HTML file. Then it appends the Markdeep tag to the end, and you have an HTML file with your code’s documentation.

If you’re unfamiliar with Markdeep, it is awesome. It is a piece of JavaScript code that you append to a Markdown document. You save the document with a .html extension and then when you open it in your browser the script renders everything to HTML. Since I discovered it I have used it for a lot of things.

Of course, stddoc.c is in C, but the idea is very simple to implement in AWK. In fact, this is the type of text processing that AWK excels at. This is what such a script would look like:

#! /usr/bin/awk -f
BEGIN { print "<meta charset=\"utf-8\">" }
/\/\*\*/ {
	sub(/^.*\/\*/,"");
	incomment=1;
}
incomment && /\*\// {
	incomment=0;
	sub(/[[:space:]]*\*\/.*/,"");
	sub(/^[[:space:]]*\*[[:space:]]?/,"");
	print
}
incomment && /^[[:space:]]*\*/ {
	sub(/^[[:space:]]*\*[[:space:]]?/,"");
	print
}
!incomment && /\/\/\// {
	sub(/.*\/\/\/[[:space:]]?/,"");
	print
}
END {
	print "<!-- Markdeep: -->";
	print "<style class=\"fallback\">body{visibility:hidden;white-space:pre;font-family:monospace}</style>";
	print "<script>markdeepOptions={tocStyle:\"auto\"};</script>";
	print "<script src=\"https://morgan3d.github.io/markdeep/latest/markdeep.min.js\" charset=\"utf-8\"></script>";
	print "<script>window.alreadyProcessedMarkdeep||(document.body.style.visibility=\"visible\")</script>"
}

You can even put it directly into your Makefile, then you don’t even need an extra script in your project:

docs.md.html : include/file.h src/file.c
	echo '<meta charset="utf-8">' > $@
	awk '/\/\*\*/{sub(/\/\*\*[[:space:]]*/,"");incomment=1} incomment && /\*\//{incomment=0;sub(/[[:space:]]*\*\/.*/,"");print} incomment && /^[[:space:]]*\*/{sub(/^[[:space:]]*\*[[:space:]]?/,""); print}' $^ >> $@
	echo '<!-- Markdeep: --><style class="fallback">body{visibility:hidden;white-space:pre;font-family:monospace}</style>' >> $@
	echo '<script>markdeepOptions={tocStyle:"auto"};</script>' >> $@
	echo '<script src="https://morgan3d.github.io/markdeep/latest/markdeep.min.js" charset="utf-8"></script>' >> $@
	echo '<script>window.alreadyProcessedMarkdeep||(document.body.style.visibility="visible")</script>' >> $@

I’ll still keep mine around, though, because it has some features I prefer: The output is a single HTML file with no external dependencies (as long as you don’t use the syntax highlighting, Mermaid and MathJax features), it can render fine without JavaScript (though things like the dark-mode toggle break), and it strives for compatibility with GitHub’s syntax6, so you can combine your code documentation with your README.md file.


  1. I pronounce it “dawk” to rhyme with “doc”. ↩︎

  2. If I had known better then, I might’ve been better off with a syntax based on troff because of the way troff also starts lines with special symbols. ↩︎

  3. The output HTML file won’t load highlight.js, MathJax or Mermaid if you don’t use them. ↩︎

  4. KaTeX is a popular alternative, but I went with MathJax because that’s what sites like MathOverflow and GitHub use. ↩︎

  5. It will not support maps and 3D models anytime soon, however. ↩︎

  6. Markdeep’s syntax diverges a bit from GitHub’s flavour, so you can’t necessarily just slap the Markdeep tags onto your README.md. ↩︎