Configuring Docker data storage: a 101

This is a short walkthrough on configuring Docker storage options on your development machine.

I’ll use my preferred version of "Hello world" on Docker, “Setting up mongodb”, which lends itself nicely to a walkthrough of the storage options.

This walkthrough assumes basic familiarity with Docker. First, let’s look at setting everything up in a single container.

I started from the Dockerfile described here:
mongoDB Dockerfile for demoing Docker storage options

Create the image using:

docker build -t mongodb .

You will note that in this Dockerfile we use the VOLUME instruction to define the target data directory for mongoDB:

# Define the MongoDB data directory
VOLUME ["/data/db"]

I am walking through all this on my Mac, so I use the following lean and mean command to start a mongodb container as a background process (daemon) from the mongodb image built from the Dockerfile:

docker run -p 27017:27017 --name mongo_instance_001 -d mongodb --noprealloc --smallfiles

I can then add some data to a mongodb collection (see Data loading below). That is quick, and for some quick tests as part of an SDLC it might be fine, but having to recreate your database and reload it each time you create a container will eventually prove limiting.
We all know that you need representative datasets for a true test. It’s likely that your datasets are going to be more than 118 records, so reloading data every time you start a mongodb container is not going to be practical!

So we have two options for addressing the persistence requirements:

  1. Data volume
  2. Data volume container

Data Volume

We want to create a volume that maps to a folder on your local host. In my case I will be mounting a folder on my Mac called $HOME/mongodata (substitute your own folder path if you are following this through on another OS).

We then create the container from the image, but this time we get the container to mount the local folder:

docker run -v $HOME/mongodata/:/data/db -p 27017:27017 --name mongo_instance_001 -d mongodb --noprealloc --smallfiles

Note that because VirtualBox shared folders do not support fsync() on directories, mongodb will not actually start. You can still validate that mounting a shared folder from the host works: the logs will show the error, and you will see that mongodb created some files in the shared folder before it halted. This part of the walkthrough works as expected when running mongoDB on AWS EC2, for example, and the approach is perfectly valid under VirtualBox for applications that do not require fsync().
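
To see both halves of that check side by side, you can look at the container logs and then at the host folder. This is only a sketch: check_bind_mount is a helper name I have made up for illustration, and the container name and folder are the ones assumed earlier in this walkthrough.

```shell
# Hypothetical helper (not part of Docker): show the tail of a container's
# logs, then list the host folder that was bind-mounted into it.
check_bind_mount() {
  container="$1"   # e.g. mongo_instance_001
  host_dir="$2"    # e.g. $HOME/mongodata
  # Recent container output; on VirtualBox the fsync() error should show up here
  docker logs "$container" 2>&1 | tail -n 20
  # Files mongodb managed to create in the shared folder before it halted
  ls -la "$host_dir"
}
```

Call it as check_bind_mount mongo_instance_001 "$HOME/mongodata" once the container has been started.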

Data volume container

This option is, in my opinion, the most flexible.

First you need to create a data container:

docker run -v /data/db --name mongodata busybox

The above creates a data volume container based on the busybox image (it’s a small image).

Next you need to start the application container, this time mounting the volumes from the data container created earlier:

docker run -p 27017:27017 --name mongo_instance_001  --volumes-from mongodata -d mongodb --noprealloc --smallfiles

Load some data into mongoDB (see Data loading below).

To validate that this works as expected, stop container 1, then start another container using a similar start-up command that attaches the data volume container:

docker run -p 27017:27017 --name mongo_instance_002  --volumes-from mongodata -d mongodb --noprealloc --smallfiles

When you start mongoDB and look at the databases and collections, you can now check that the data you loaded using the previous container is available.
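
One quick way to do that check without opening an interactive session is to ask the mongo shell inside the container for a document count. This is a sketch under a few assumptions: the image includes the mongo shell, your Docker version supports docker exec, and the data went into the demo database and periodictable collection used in Data loading. The helper name count_elements is hypothetical.

```shell
# Hypothetical helper: count the documents in demo.periodictable by running
# the mongo shell inside a running container.
# Assumes the image ships the mongo shell and Docker supports `docker exec`.
count_elements() {
  container="$1"   # e.g. mongo_instance_002
  docker exec "$container" mongo demo --quiet --eval 'db.periodictable.count()'
}
```

Running count_elements mongo_instance_002 should report the same number of documents you loaded through the first container.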

You can remove the application containers whenever you like and create new ones as required, mounting the data volume container each time. Note that the docker ps command does not give you any indication of which containers have mounted the data volume container.
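
docker inspect can fill that gap, since each container records what it was started with via --volumes-from. A minimal sketch, with show_volumes_from being a helper name of my own for illustration:

```shell
# Hypothetical helper: for each named container, print its name followed by
# the containers it was started with via --volumes-from.
show_volumes_from() {
  docker inspect --format '{{ .Name }} {{ .HostConfig.VolumesFrom }}' "$@"
}
```

For example, show_volumes_from mongo_instance_001 mongo_instance_002 lists both application containers alongside the data container they mount.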
You can also tar the data volume and copy it to another Docker host; see the Docker docs for details on the process.
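
The approach in the Docker docs is to run a throwaway container that mounts both the data volume and a host directory, and to tar one into the other. Here is a sketch of that pattern adapted to this walkthrough. The helper name backup_data_volume and the default archive name are my own, and I am assuming it is safer to stop the mongodb container first so the files are not changing underneath tar.

```shell
# Hypothetical helper: archive /data/db from a data volume container into
# the current directory on the host, using a throwaway busybox container.
backup_data_volume() {
  container="$1"                          # e.g. mongodata
  archive="${2:-mongodata-backup.tar}"    # illustrative default file name
  docker run --rm --volumes-from "$container" -v "$(pwd)":/backup busybox \
    tar cvf "/backup/$archive" /data/db
}
```

backup_data_volume mongodata leaves mongodata-backup.tar in the current directory, ready to copy to another host and unpack into a fresh data volume container there.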

Data loading

I am assuming some familiarity with mongoDB. If you need a quick primer, have a look here: Getting started with mongodb

I am using a JSON file containing a dataset of the elements of the periodic table to populate my database. Here’s how I load my demo database with data:

mongoimport --db demo --collection periodictable  --type json --file periodictable.json  --jsonArray 

For the purposes of this walkthrough I am using images that are on my local machine rather than pushing up to a registry and pulling back down again.

This walkthrough has focused on the practicalities of storage with Docker. For a deeper dive on storage, have a read of this excellent post on the Red Hat developer blog: Overview of storage scalability in Docker

Loosely coupled Chef Cookbooks

I’ve been working on a mongodb installation and configuration cookbook which allows me to install mongodb and, if required, apply custom configurations. It allows me to install and configure a standalone mongodb installation or a replica set.

Developing this cookbook (still a work in progress) has led me to take a loosely coupled approach: I did not want to force a dependency on any previous recipe. This has meant that a number of rules need to be followed to use the cookbook properly, rather than the cookbook imposing any constraints.

So why did I come to the conclusion that flexibility, and thus loose coupling, was a requirement for this particular cookbook?

The use of a replica set, and the fact that you may want to seed the mongodb setup with data from a backup, gave me food for thought. When spiking the various configuration scenarios, I updated my current master via a data dump: I stopped the master mongodb instance while I copied the data into the data folder, and then found that the data wasn’t being replicated. This was because one of the slaves had taken over as the master in the meantime, and you can’t really force a master (without pain) in a replica set. (Maybe 10gen can advise on that one, although I guess if I’d made sure the owner of the files was correct before the copy I might not have hit the stage of mongo having a fit.) So I needed to cater for that one.

Seeding mongo before a replica set is created seems like a nice approach to me anyway.

In a standalone mongodb instance I’m probably not worried about RAID, so the recipe to create the RAID device shouldn’t be a constraint. And while I’m at it, why can’t you set up a replica set that just uses instances with local storage?

I want to add recipes to create the job that undertakes regular backups, and maybe one to do a restore. But I may not want to use them.

Suddenly the list of things I want my mongo cookbook to do is growing. So I have done what is required to deliver the functionality needed by the client I was doing the Chef work for, and now I can pimp my cookbook until I’m happy enough to share it with the community.

(The whole ‘just enough’ ethos is something I mean to talk about here but not now)

I want the recipes to be easy to use and understood by people new to Chef and also to mongodb. I do not believe that having a sophisticated tool like Chef should mean your cookbooks are overly complicated. Keeping it simple makes maintenance easy and encourages others to expand upon it appropriately if they follow the rules. Mongodb is very easy to get up and running, so why use a tool to make it suddenly obtuse?

So, the rules to date:

Obviously you need to have mongo installed as a starting point. I couldn’t really mandate the use of the installation recipe, as it may be an existing setup. (I have to modify this recipe so it brings down a specified version rather than just the latest version from the 10gen repository.)

Each recipe is to be used to carry out a single function, e.g. install mongodb, configure the configuration file, start mongodb, etc. Combining functions is discouraged.

Each subsequent recipe can be run independently of the others or be combined in a role. This meant making sure I had a recipe to start mongodb, so that it could be dropped in as, say, part of a role or workflow.

Templates and variables are used to encourage flexibility.

When I get to a point where I feel the cookbook is pimped appropriately, I’ll post a dissection and some guidance.