Let's Build a Hexdump Utility in C

17th December 2020


Let's build a hexdump utility in C — this is a good beginner-level project if you're learning the language for the first time.

A hexdump utility lets you look at the individual byte values that make up a file. Here's some sample output from the utility we're going to build, using its own source file as input:

The value of each 8-bit byte is displayed as a pair of hexadecimal (i.e. base-16) digits. Why hexadecimal? The number of distinct values any sequence of digits can represent is given by the base to the power of the number of digits in the sequence. The 8 binary digits in a byte can represent 2^8 or 256 distinct values, usually interpreted as the numbers 0 to 255. Similarly, a two-digit hexadecimal number can represent 16^2 distinct values, which is also 256. This convenient correspondence is why programmers like hexadecimal notation so much: it gives us a compact way of representing all possible byte values with just two digits.[1]

Setup

I'm going to assume for this tutorial that you're relatively new to programming in C. I'm not going to try to teach you the language itself, as I'm assuming you know the basics already, but I am going to explain each step of the build process in detail, including how to set up and compile the project files. If you already have your own preferred way of doing things you should go ahead and use it; these instructions are intended for beginners.

I'm also going to assume that you're working on the command line on some kind of unixy system: Linux, Mac, or any flavour of BSD for example. If you're using Windows you can follow along by using WSL (the Windows Subsystem for Linux).

Note that you can find working demonstration code for this tutorial on Github. This code has been placed in the public domain.

First Steps

First let's make sure we can get something (anything!) compiling. Create a new directory for the project, cd into it, and create a new file called hexdump.c.

$ mkdir hexdump
$ cd hexdump
$ touch hexdump.c

(You don't need to type the $ symbols — I'm using them to indicate that this is input we're typing in at a shell prompt.)

Next, add some simple hello world code to the file so we can verify that our toolchain is working as expected:

#include <stdio.h>

int main(int argc, char** argv) {
    printf("hello hexdump\n");
}

On my system I can compile this file by typing the following command:

$ cc -o hexdump hexdump.c

This produces an executable binary called hexdump in the same directory as the source file. I can run this binary with the following command:

$ ./hexdump
hello hexdump

cc is the traditional name for the system's C compiler. On my system it's actually an alias for clang; on yours it might be an alias for gcc, or your system might not ship with a builtin C compiler and you'll have to install one yourself.

You should make sure that you can run this code (or something very like it) on your system before going any further!

If you haven't used a command-line C compiler before, don't worry — it looks intimidating at first but it's actually quite straightforward once you get used to it. I remember when I was learning C for the first time finding it frustratingly difficult to get my hands on a clear, simple guide to how the compiler worked, but you're in luck as I've written up an introductory cheatsheet which should help you get started.

Organising the Project Files

Okay, now we know we have a working toolchain that can compile C code. Next we're going to organise our project files and set up a makefile to run the compiler for us.

First, create a src directory and move the hexdump.c file inside it:

$ mkdir src
$ mv hexdump.c src

Next, create a file called makefile and add the following lines:

binary:
    @mkdir -p bin
    cc -o bin/hexdump src/hexdump.c

Don't just copy and paste the code above; it won't work! The indent in a makefile needs to be an actual tab character and my website replaces tabs in code samples with spaces.[2]

Now we can compile our project by running the make command from the project's root directory:

$ make

This will compile the src/hexdump.c file and put the output binary in a directory called bin. We can run the binary by typing:

$ ./bin/hexdump
hello hexdump

(I'm assuming that make is already installed on your system. It usually is, but if it isn't you can install it using your operating system's package manager.)

make is an extremely useful tool — it's really designed for managing the dependencies between files in complex projects and we're only going to scratch the surface of its capabilities here. If you haven't used it before this tutorial is a good overview for beginners. (I'd also recommend adding the manual to your reading list.)

So what just happened? Looking at our makefile, the line:

binary:

is called a target.[3] A makefile can have multiple targets, and we can tell make to run the recipe below a particular target by specifying its name as a command line argument:

$ make binary

If we don't specify a target make will default to running the first target in the file, which is why we can now compile our project just by typing make.

The target's recipe is a set of shell commands — we can put any commands here that we could type in at a shell prompt. The first command:

@mkdir -p bin

simply creates the bin directory if it doesn't already exist. make normally prints each command to the shell before running it; I've put an @ symbol at the beginning of this line to tell make not to print it as I don't want this kind of boring housekeeping code cluttering up my shell.

The next line is the actual compiler command:

cc -o bin/hexdump src/hexdump.c

We could have typed this line in the shell ourselves and achieved exactly the same result. One advantage of using make is that it's shorter to type. A more important advantage is that when we come back in six months and want to recompile our project we won't have to remember whatever complicated set of flags and commands we ended up using last time — it will all be written down for us in our makefile.

Adding a Library

Our hexdump utility is going to support a handful of command line options, e.g. a --num <int> option for specifying the number of bytes to read. Parsing command line options in C is a pain — there's no builtin way to do it in the standard library — so we're going to use a simple library I've written for the purpose called (imaginatively) Args.

Using the library is easy. You need to download two files, args.h and args.c, and add them to your project's src folder. You can download these files from the tutorial repository on Github.

Next, change the compiler command in your makefile to the following:

cc -o bin/hexdump src/hexdump.c src/args.c

This tells the compiler to compile the hexdump.c and args.c files individually and link the two resulting object files together into a single executable.

We can make sure the library is working properly by adding support for --help/-h and --version/-v flags to our executable.

Open the hexdump.c file and change its contents to the following:

#include "args.h"

char* helptext =
    "Usage: hexdump [file]\n"
    "\n"
    "Arguments:\n"
    "  [file]              File to read (default: STDIN).\n"
    "\n"
    "Options:\n"
    "  -l, --line <int>    Bytes per line in output (default: 16).\n"
    "  -n, --num <int>     Number of bytes to read (default: all).\n"
    "  -o, --offset <int>  Byte offset at which to begin reading.\n"
    "\n"
    "Flags:\n"
    "  -h, --help          Display this help text and exit.\n"
    "  -v, --version       Display the version number and exit.\n";

int main(int argc, char** argv) {
    // Instantiate a new ArgParser instance.
    ArgParser* parser = ap_new();
    ap_helptext(parser, helptext);
    ap_version(parser, "0.1.0");

    // Parse the command line arguments.
    ap_parse(parser, argc, argv);
    ap_free(parser);
}

I wouldn't normally advocate cutting and pasting sample code but that helptext literal is an exception. The C language developers haven't gotten around to supporting multi-line strings yet (any decade now) but we can hack it, kinda, sorta, by using the fact that C concatenates adjacent string literals.

If you recompile the code and run the binary with a -h or --help flag:

$ ./bin/hexdump --help

you should see the help text printed. Similarly if you use a -v or --version flag you should see the version number printed.

I'm not going to explain how the argument-parsing library works in detail here — you can read the documentation if you're interested. The important point for us is that the library will handle the messy process of parsing the command line arguments, checking if they're valid, and converting any option values into integers.

Writing the Code

It's been a long road but we're finally ready to begin writing our application code! Here are all the standard library #include statements we're going to need. You should add them to the top of your hexdump.c file, alongside the existing #include "args.h" line:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

And here's our finished main() function:

int main(int argc, char** argv) {
    // Instantiate a new ArgParser instance.
    ArgParser* parser = ap_new();
    ap_helptext(parser, helptext);
    ap_version(parser, "0.1.0");

    // Register our options, each with a default value.
    ap_int_opt(parser, "line l", 16);
    ap_int_opt(parser, "num n", -1);
    ap_int_opt(parser, "offset o", 0);

    // Parse the command line arguments.
    ap_parse(parser, argc, argv);

    // Default to reading from stdin.
    FILE* file = stdin;
    if (ap_has_args(parser)) {
        char* filename = ap_arg(parser, 0);
        file = fopen(filename, "rb");
        if (file == NULL) {
            fprintf(stderr, "Error: cannot open the file '%s'.\n", filename);
            exit(1);
        }
    }

    // Try seeking to the specified offset.
    int offset = ap_int_value(parser, "offset");
    if (offset != 0) {
        if (fseek(file, offset, SEEK_SET) != 0) {
            fprintf(stderr, "Error: cannot seek to the specified offset.\n");
            exit(1);
        }
    }

    int bytes_to_read = ap_int_value(parser, "num");
    int line_length = ap_int_value(parser, "line");
    dump_file(file, offset, bytes_to_read, line_length);

    fclose(file);
    ap_free(parser);
}

Most of this code is concerned with setting up and then processing our various command line options.

We default to reading from stdin if a filename hasn't been specified by the user. If a filename has been specified we try to open the file, exiting with an error message if anything goes wrong.

If the user has specified an offset using the --offset <int> option we try to seek to that offset in the file. If this fails for any reason we exit with an error message. (This code as I've written it only supports seeking forward to a positive offset from the beginning of the file. If you wanted to enhance it, you could interpret a negative offset value as meaning the user wants to seek backward from the end of the file.)

We hand the actual job of hexdumping the file off to a dump_file() function which we'll look at next. The bytes_to_read argument specifies the number of bytes we want this function to read — I'm using the 'magic' value of -1 to indicate that we want to read all the way to the end of the file.

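Here's the code for the dump_file() function. As main() calls it, it needs to be defined (or at least declared with a prototype) above main() in hexdump.c: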

void dump_file(FILE* file, int offset, int bytes_to_read, int line_length) {
    uint8_t* buffer = (uint8_t*)malloc(line_length);
    if (buffer == NULL) {
        fprintf(stderr, "Error: insufficient memory.\n");
        exit(1);
    }

    while (true) {
        int max_bytes;

        if (bytes_to_read < 0) {
            max_bytes = line_length;
        } else if (line_length < bytes_to_read) {
            max_bytes = line_length;
        } else {
            max_bytes = bytes_to_read;
        }

        int num_bytes = fread(buffer, sizeof(uint8_t), max_bytes, file);
        if (num_bytes > 0) {
            print_line(buffer, num_bytes, offset, line_length);
            offset += num_bytes;
            bytes_to_read -= num_bytes;
        } else {
            break;
        }
    }

    free(buffer);
}

We begin by allocating a buffer to hold a single line of input from the file. The loop then reads a single line of input per iteration into this buffer and hands it off to a print_line() function to display.

We have to do an elaborate little dance to figure out the maximum number of bytes we want to read per iteration. Generally we'll want to read up to one full line of bytes, but we may want to read fewer on the last iteration if the user has specified a particular number of bytes to read with the --num <int> option.

The fread() function returns the number of bytes read. If this value is zero then we've reached the end of the file (or the end of the block the user wanted to read), or a read error has occurred. In each case we break from the loop.

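Here's the code for the print_line() function that displays the output. As dump_file() calls it, it needs to be defined (or declared with a prototype) above dump_file() in the file: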

void print_line(uint8_t* buffer, int num_bytes, int offset, int line_length) {
    printf("%6X |", (unsigned int)offset);

    for (int i = 0; i < line_length; i++) {
        if (i > 0 && i % 4 == 0) {
            printf(" ");
        }
        if (i < num_bytes) {
            printf(" %02X", buffer[i]);
        } else {
            printf("   ");
        }
    }

    printf(" | ");

    for (int i = 0; i < num_bytes; i++) {
        if (buffer[i] > 31 && buffer[i] < 127) {
            printf("%c", buffer[i]);
        } else {
            printf(".");
        }
    }

    printf("\n");
}

We begin by printing a label for the line: the offset of its first byte within the file, in hexadecimal, as given by the offset variable. We then loop over the buffer and print each byte value formatted as a two-digit hexadecimal number (or a spacer if we've run out of bytes). We add an extra space before each group of four bytes to make the output easier to read.

The second loop prints the ASCII character corresponding to each byte value if it's in the printable range, otherwise it prints a dot.

That's it, we're done! If you run make one more time you should have a working hexdump utility in your bin folder.

Final Thoughts

I'm sure you can think of ways to improve and expand on this code.

I've built a slightly more sophisticated hexdump utility of my own called Hexbomb which might give you some ideas to work from.

You can find working demonstration code for this tutorial on Github. This code has been placed in the public domain.

Notes

1

Actually, it's even better than you might suspect at first. Each hexadecimal digit aligns cleanly with four bits of the corresponding byte so the hexadecimal number 0x12 corresponds to the byte 0001_0010 and the hexadecimal number 0x34 corresponds to the byte 0011_0100. This makes it really easy to read bit patterns directly from hex notation — at least after you've had a little practice!

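If you haven't met it before, the 0x prefix is used to indicate that a number is written in hexadecimal. Similarly, many languages use 0o to indicate octal (base-8) and 0b to indicate binary (base-2). C itself is a little different: octal literals are written with a plain leading 0 (so 017 is fifteen), and 0b binary literals were only standardised in C23, although gcc and clang have long accepted them as an extension. If you want to play around with different number bases you might find a little utility I've written called Intspector useful.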

2

Make's reliance on hard tabs has been annoying programmers for more than forty years at this point; it will probably continue annoying us for another forty years at least. It's been famously described (by Eric S. Raymond in The Art of Unix Programming) as "one of the worst design botches in the history of Unix".

3

Technically this is called a phony target as it doesn't correspond to a file name. (In general a make target is a filename and the recipe that follows is a set of instructions for building that file.) Phony targets are useful for handling project management tasks — common examples include make check for running a project's test suite, make clean for deleting temporary build files, and make install for building a binary and installing it on a user's system.