17th December 2020
Let's build a hexdump utility in C — this is a good beginner-level project if you're learning the language for the first time.
A hexdump utility lets you look at the individual byte values that make up a file. Here's some sample output from the utility we're going to build, using its own source file as input:
The value of each 8-bit byte is displayed as a pair of hexadecimal (i.e. base-16) digits. Why hexadecimal? The number of distinct values any sequence of digits can represent is given by the base raised to the power of the number of digits in the sequence. The 8 binary digits in a byte can represent 2^8 or 256 distinct values, usually interpreted as the numbers 0 to 255. Similarly, a two-digit hexadecimal number can represent 16^2 distinct values, which is also 256. This convenient correspondence is why programmers like hexadecimal notation so much: it gives us a compact way of representing all possible byte values with just two digits.[1]
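If you'd like to check that arithmetic for yourself, here's a minimal sketch in C. The helper names count_values() and show_byte_range() are my own inventions for illustration; they aren't part of the utility we're building.

```c
#include <assert.h>
#include <stdio.h>

// Hypothetical helper: how many distinct values can `digits` digits
// represent in the given base? This is the base raised to the power
// of the number of digits.
unsigned long count_values(unsigned base, unsigned digits) {
    assert(base >= 2);
    unsigned long total = 1;
    for (unsigned i = 0; i < digits; i++) {
        total *= base;
    }
    return total;
}

// Print the smallest and largest byte values as two hex digits each.
void show_byte_range(void) {
    printf("%02X .. %02X\n", 0u, 255u);
}
```

Both count_values(2, 8) and count_values(16, 2) come out at 256, and show_byte_range() prints 00 .. FF, the two ends of the range a byte can hold.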
I'm going to assume for this tutorial that you're relatively new to programming in C. I'm not going to try to teach you the language itself (I'm assuming you know the basics already), but I am going to explain each step of the build process in detail, including how to set up and compile the project files. If you already have your own preferred way of doing things you should go ahead and use it; these instructions are intended for beginners.
I'm also going to assume that you're working on the command line on some kind of unixy system (Linux, Mac, or any flavour of BSD, for example). If you're using Windows you can follow along by using WSL, the Windows Subsystem for Linux.
Note that you can find working demonstration code for this tutorial on GitHub. This code has been placed in the public domain.
First let's make sure we can get something (anything!) compiling. Create a new directory for the project, cd into it, and create a new file called hexdump.c.
$ mkdir hexdump
$ cd hexdump
$ touch hexdump.c
(You don't need to type the $ symbols. I'm using them to indicate that this is input we're typing in at a shell prompt.)
Next, add some simple hello world code to the file so we can verify that our toolchain is working as expected:
#include <stdio.h>

int main(int argc, char** argv) {
    printf("hello hexdump\n");
}
On my system I can compile this file by typing the following command:
$ cc -o hexdump hexdump.c
This produces an executable binary called hexdump in the same directory as the source file. I can run this binary with the following command:
$ ./hexdump
hello hexdump
cc is the traditional name for the system's C compiler. On my system it's actually an alias for clang; on yours it might be an alias for gcc, or your system might not ship with a builtin C compiler and you'll have to install one yourself.
You should make sure that you can run this code (or something very like it) on your system before going any further!
If you haven't used a command-line C compiler before, don't worry — it looks intimidating at first but it's actually quite straightforward once you get used to it. I remember when I was learning C for the first time finding it frustratingly difficult to get my hands on a clear, simple guide to how the compiler worked, but you're in luck as I've written up an introductory cheatsheet which should help you get started.
Okay, now we know we have a working toolchain that can compile C code.
Next we're going to organise our project files and set up a makefile to run the compiler for us.
First, create a src directory and move the hexdump.c file inside it:
$ mkdir src
$ mv hexdump.c src
Next, create a file called makefile and add the following lines:
binary:
    @mkdir -p bin
    cc -o bin/hexdump src/hexdump.c
Don't just copy and paste the code above, because it won't work! The indent in a makefile needs to be an actual tab character and my website replaces tabs in code samples with spaces.[2]
Now we can compile our project by running the make command from the project's root directory:
$ make
This will compile the src/hexdump.c file and put the output binary in a directory called bin. We can run the binary by typing:
$ ./bin/hexdump
hello hexdump
(I'm assuming that make is already installed on your system. It usually is, but if it isn't you can install it using your operating system's package manager.)
make is an extremely useful tool. It's really designed for managing the dependencies between files in complex projects and we're only going to scratch the surface of its capabilities here. If you haven't used it before this tutorial is a good overview for beginners. (I'd also recommend adding the manual to your reading list.)
So what just happened? Looking at our makefile, the line:
binary:
is called a target.[3] A makefile can have multiple targets. We can tell make to run the recipe below a particular target by specifying its name as a command line argument:
$ make binary
If we don't specify a target, make will default to running the first target in the file, which is why we can now compile our project just by typing make.
The target's recipe is a set of shell commands — we can put any commands here that we could type in at a shell prompt. The first command:
@mkdir -p bin
simply creates the bin directory if it doesn't already exist. make normally prints each command to the shell before running it; I've put an @ symbol at the beginning of this line to tell make not to print it as I don't want this kind of boring housekeeping code cluttering up my shell.
The next line is the actual compiler command:
cc -o bin/hexdump src/hexdump.c
We could have typed this line in the shell ourselves and achieved exactly the same result.
One advantage of using make is that it's shorter to type. A more important advantage is that when we come back in six months and want to recompile our project we won't have to remember whatever complicated set of flags and commands we ended up using last time. It will all be written down for us in our makefile.
Our hexdump utility is going to support a handful of command line options, e.g. a --num <int> option for specifying the number of bytes to read.
Parsing command line options in C is a pain (there's no builtin way to do it in the standard library), so we're going to use a simple library I've written for the purpose called (imaginatively) Args.
Using the library is easy. You need to download two files, args.h and args.c, and add them to your project's src folder. You can download these files from the tutorial repository on GitHub.
Next, change the compiler command in your makefile to the following:
cc -o bin/hexdump src/hexdump.c src/args.c
This tells the compiler to compile the hexdump.c and args.c files individually and link the two resulting object files together into a single executable.
We can make sure the library is working properly by adding support for --help/-h and --version/-v flags to our executable.
Open the hexdump.c file and change its contents to the following:
#include "args.h"

char* helptext =
    "Usage: hexdump [file]\n"
    "\n"
    "Arguments:\n"
    "  [file]              File to read (default: STDIN).\n"
    "\n"
    "Options:\n"
    "  -l, --line <int>    Bytes per line in output (default: 16).\n"
    "  -n, --num <int>     Number of bytes to read (default: all).\n"
    "  -o, --offset <int>  Byte offset at which to begin reading.\n"
    "\n"
    "Flags:\n"
    "  -h, --help          Display this help text and exit.\n"
    "  -v, --version       Display the version number and exit.\n";

int main(int argc, char** argv) {
    // Instantiate a new ArgParser instance.
    ArgParser* parser = ap_new();
    ap_helptext(parser, helptext);
    ap_version(parser, "0.1.0");

    // Parse the command line arguments.
    ap_parse(parser, argc, argv);

    ap_free(parser);
}
I wouldn't normally advocate cutting and pasting sample code but that helptext literal is an exception. The C language developers haven't gotten around to supporting multi-line strings yet (any decade now) but we can hack it, kinda, sorta, by using the fact that C concatenates adjacent string literals.
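If the concatenation trick is new to you, here's a small sketch of it in isolation. The names joined, single and literals_match() are made up for this illustration and don't appear in the tutorial code.

```c
#include <string.h>

// Adjacent string literals are joined by the compiler into one string,
// so these two declarations hold exactly the same text.
const char* joined = "Usage: hexdump [file]\n"
                     "\n"
                     "Arguments:\n";

const char* single = "Usage: hexdump [file]\n\nArguments:\n";

// Returns 1 if the two strings above are byte-for-byte identical.
int literals_match(void) {
    return strcmp(joined, single) == 0;
}
```

The joining happens at compile time, so there's no runtime cost to splitting a long literal across lines like this.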
If you recompile the code and run the binary with a -h or --help flag:
$ ./bin/hexdump --help
you should see the help text printed. Similarly if you use a -v or --version flag you should see the version number printed.
I'm not going to explain how the argument-parsing library works in detail here — you can read the documentation if you're interested. The important point for us is that the library will handle the messy process of parsing the command line arguments, checking if they're valid, and converting any option values into integers.
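To give a feel for the kind of work being done on our behalf, here's a rough sketch of converting an option's text into an integer with validation. To be clear, this is not how Args does it internally (I'm not reproducing the library here); parse_int() is a hypothetical helper built on the standard strtol() function.

```c
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper, not part of the Args library: convert `text` to an
// int, returning false unless it is a complete, in-range decimal integer.
bool parse_int(const char* text, int* out) {
    if (text == NULL || out == NULL || *text == '\0') {
        return false;
    }
    char* end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    // Reject overflow and any trailing junk such as "12abc".
    if (errno == ERANGE || *end != '\0') {
        return false;
    }
    if (value < INT_MIN || value > INT_MAX) {
        return false;
    }
    *out = (int)value;
    return true;
}
```

Even this small helper has to deal with empty strings, trailing characters and overflow, which is why it's nice to hand the whole job to a library.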
It's been a long road but we're finally ready to begin writing our application code!
Here are all the standard library #include statements we're going to need. You should add them to the top of your hexdump.c file:
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>
And here's our finished main() function:
int main(int argc, char** argv) {
    // Instantiate a new ArgParser instance.
    ArgParser* parser = ap_new();
    ap_helptext(parser, helptext);
    ap_version(parser, "0.1.0");

    // Register our options, each with a default value.
    ap_int_opt(parser, "line l", 16);
    ap_int_opt(parser, "num n", -1);
    ap_int_opt(parser, "offset o", 0);

    // Parse the command line arguments.
    ap_parse(parser, argc, argv);

    // Default to reading from stdin.
    FILE* file = stdin;
    if (ap_has_args(parser)) {
        char* filename = ap_arg(parser, 0);
        file = fopen(filename, "rb");
        if (file == NULL) {
            fprintf(stderr, "Error: cannot open the file '%s'.\n", filename);
            exit(1);
        }
    }

    // Try seeking to the specified offset.
    int offset = ap_int_value(parser, "offset");
    if (offset != 0) {
        if (fseek(file, offset, SEEK_SET) != 0) {
            fprintf(stderr, "Error: cannot seek to the specified offset.\n");
            exit(1);
        }
    }

    int bytes_to_read = ap_int_value(parser, "num");
    int line_length = ap_int_value(parser, "line");
    dump_file(file, offset, bytes_to_read, line_length);

    fclose(file);
    ap_free(parser);
}
Most of this code is concerned with setting up and then processing our various command line options.
We default to reading from stdin if a filename hasn't been specified by the user. If a filename has been specified we try to open the file, exiting with an error message if anything goes wrong.
If the user has specified an offset using the --offset <int> option we try to seek to that offset in the file. If this fails for any reason we exit with an error message.
(This code as I've written it only supports seeking forward to a positive offset from the beginning of the
file. If you wanted to enhance it, you could interpret a negative offset value as meaning the user
wants to seek backward from the end of the file.)
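Here's one way that enhancement might look, as a sketch rather than a finished implementation. The seek_to_offset() helper is hypothetical, and I'm assuming a regular file on a unixy system; the C standard doesn't promise that seeking relative to the end of a binary stream is meaningful everywhere, and a pipe like stdin can't be seeked at all.

```c
#include <assert.h>
#include <stdio.h>

// Hypothetical helper: seek forward from the start of the file for a
// non-negative offset, or backward from the end for a negative one.
// Returns the resulting absolute position, or -1 if the seek fails.
long seek_to_offset(FILE* file, long offset) {
    assert(file != NULL);
    int origin = (offset < 0) ? SEEK_END : SEEK_SET;
    if (fseek(file, offset, origin) != 0) {
        return -1;
    }
    return ftell(file);
}
```

Returning the absolute position matters because the offset we pass on to dump_file() is used to label each line of output, so for a negative offset we'd want to pass the position returned here rather than the raw option value.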
We hand the actual job of hexdumping the file off to a dump_file() function which we'll look at next.
The bytes_to_read argument specifies the number of bytes we want this function to read. I'm using the 'magic' value of -1 to indicate that we want to read all the way to the end of the file.
Here's the code for the dump_file() function:
void dump_file(FILE* file, int offset, int bytes_to_read, int line_length) {
    uint8_t* buffer = (uint8_t*)malloc(line_length);
    if (buffer == NULL) {
        fprintf(stderr, "Error: insufficient memory.\n");
        exit(1);
    }

    while (true) {
        int max_bytes;

        if (bytes_to_read < 0) {
            max_bytes = line_length;
        } else if (line_length < bytes_to_read) {
            max_bytes = line_length;
        } else {
            max_bytes = bytes_to_read;
        }

        int num_bytes = fread(buffer, sizeof(uint8_t), max_bytes, file);
        if (num_bytes > 0) {
            print_line(buffer, num_bytes, offset, line_length);
            offset += num_bytes;
            bytes_to_read -= num_bytes;
        } else {
            break;
        }
    }

    free(buffer);
}
We begin by allocating a buffer to hold a single line of input from the file. The loop then reads a single line of input per iteration into this buffer and hands it off to a print_line() function to display.
We have to do an elaborate little dance to figure out the maximum number of bytes we want to read per iteration. Generally we'll want to read up to one full line of bytes, but we may want to read fewer on the last iteration if the user has specified a particular number of bytes to read with the --num <int> option.
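That little dance can be pulled out into a function of its own, which makes the three cases easier to see and to test. This is just a sketch: chunk_size() is a hypothetical name, and the tutorial code keeps the logic inline in the loop.

```c
#include <assert.h>

// Hypothetical helper: how many bytes should the next read request?
// A negative bytes_to_read means "no limit, read to the end of the file".
int chunk_size(int bytes_to_read, int line_length) {
    assert(line_length > 0);
    if (bytes_to_read < 0 || line_length < bytes_to_read) {
        // No limit, or more than a full line still wanted: read a full line.
        return line_length;
    }
    // Fewer bytes wanted than a full line: read only what's left.
    return bytes_to_read;
}
```

With a 16-byte line, chunk_size(-1, 16) and chunk_size(100, 16) both give 16, chunk_size(5, 16) gives 5, and chunk_size(0, 16) gives 0, which is what eventually ends the loop once the requested number of bytes has been read.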
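The fread() function returns the number of items read, and since each item here is a single byte that's also the number of bytes. If this value is zero we've reached the end of the file (or the end of the block the user wanted to read), or a read error has occurred, so we break from the loop.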
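A zero from fread() doesn't say by itself whether we hit the end of the file or an error. If you want to tell the two apart, the standard ferror() function will do it. Here's a sketch using a made-up count_remaining() helper that reads in the same style as our loop.

```c
#include <assert.h>
#include <stdio.h>

// Hypothetical helper: count the bytes remaining in `file` by reading it
// in fixed-size chunks. Returns the count, or -1 if the loop ended
// because of a read error rather than the end of the file.
long count_remaining(FILE* file) {
    assert(file != NULL);
    unsigned char buffer[16];
    long total = 0;
    size_t num_bytes;
    while ((num_bytes = fread(buffer, 1, sizeof buffer, file)) > 0) {
        total += (long)num_bytes;
    }
    return ferror(file) ? -1 : total;
}
```

The same check could be added after the loop in dump_file() if you wanted to print an error message instead of stopping silently.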
Here's the code for the print_line() function that displays the output:
void print_line(uint8_t* buffer, int num_bytes, int offset, int line_length) {
    printf("%6X |", offset);

    for (int i = 0; i < line_length; i++) {
        if (i > 0 && i % 4 == 0) {
            printf(" ");
        }
        if (i < num_bytes) {
            printf(" %02X", buffer[i]);
        } else {
            // Spacer: three characters wide, matching " %02X" above.
            printf("   ");
        }
    }

    printf(" | ");

    for (int i = 0; i < num_bytes; i++) {
        if (buffer[i] > 31 && buffer[i] < 127) {
            printf("%c", buffer[i]);
        } else {
            printf(".");
        }
    }

    printf("\n");
}
We begin by printing the line's label, which is the file offset of its first byte (the offset variable) formatted in hexadecimal.
We then loop over the buffer and print each byte value formatted as a two-digit hexadecimal number
(or a spacer if we've run out of bytes).
We add an extra space before each group of four bytes to make the output easier to read.
The second loop prints the ASCII character corresponding to each byte value if it's in the printable range, otherwise it prints a dot.
That's it, we're done!
If you run make one more time you should have a working hexdump utility in your bin folder.
I'm sure you can think of ways to improve and expand on this code.
I've used ASCII lines and dots for maximum compatibility but you could use Unicode dots and box-drawing characters instead to make the output look prettier.
Adding support for negative offsets would let you use an option like --offset -128 to easily view the last 128 bytes of a file. This capability can sometimes be very useful.
I've built a slightly more sophisticated hexdump utility of my own called Hexbomb which might give you some ideas to work from.
You can find working demonstration code for this tutorial on GitHub. This code has been placed in the public domain.
[1] Actually, it's even better than you might suspect at first. Each hexadecimal digit aligns cleanly with four bits of the corresponding byte, so the hexadecimal number 0x12 corresponds to the byte 0001_0010 and the hexadecimal number 0x34 corresponds to the byte 0011_0100. This makes it really easy to read bit patterns directly from hex notation, at least after you've had a little practice!
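If you haven't met it before, the 0x prefix is used to indicate that a number is written in hexadecimal base. Similarly, many languages use 0o to indicate octal (base-8) and 0b to indicate binary (base-2). C itself is a little different here: it writes octal literals with a bare leading 0, and 0b binary literals were only standardised in C23, although GCC and Clang have long accepted them as an extension.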
If you want to play around with different number bases you might find a little utility
I've written called Intspector useful.
[2] Make's reliance on hard tabs has been annoying programmers for more than forty years at this point; it will probably continue annoying us for another forty years at least. It's been famously described (by Eric S. Raymond in The Art of Unix Programming) as "one of the worst design botches in the history of Unix".
[3] Technically this is called a phony target as it doesn't correspond to a file name. (In general a make target is a filename and the recipe that follows is a set of instructions for building that file.) Phony targets are useful for handling project management tasks. Common examples include make check for running a project's test suite, make clean for deleting temporary build files, and make install for building a binary and installing it on a user's system.