I’ve been meeting folks who are having a hard time understanding the practicality of a document database and dynamic schema. I often hear, “I’m just trying to wrap my head around the concept.”

Some of the main concerns are:

  • Shouldn’t we be normalizing?
  • What about joins?
  • What about ORM?

To help with the introduction and transition we’ll need to do a little unraveling than wrapping. Let’s kick off this short discussion with a literal example of why a document store (in this case MongoDB) works well.

Imagine for a moment that it’s time for your yearly visit to the doctor. You’ve been put on a few supplements to improve your health and the doctor is monitoring your progress. Time for a checkup.

You walk into the clinic and give your name. The secretary looks you up in their booking system and confirms your internal patient number. “There you are Mr. Solutions (Patient #000-00-0001). You’re right on time, please have a seat.” says the secretary. Since they have more than one Mr. Solutions its a good thing they have some sort of ID. The secretary goes to get your file as you wait patiently to be called in.

Let me clue you into something that happens behind the scenes. Having worked in a hospital I know the secretary goes into the back-office (database) and searches for the file #000-00-0001 (query). They only get one file. They don’t get a set of files that have a primary key and join them together. There are no joins. They don’t put your name/id number into a mapper and look you up.

If you took a peek into the back-office there are small files and big files. Some files have yellow sticky notes and others have blue. They are (indexed) by patient number. While a handful of files look very similar in structure (schema) others seem to be quite different. One way or another there is just one file per patient.

What’s the point? Welcome to a document database that works.

Many medical professionals made the choice  that a single file/folder/grouping of papers was sufficient to store your medical history. Some clinics are messy, others are neat, yet this simple system works for them. Back at the clinic it seems there is some information needed from a cardiologist that is in their files.

The secretary quickly calls and asks for a copy of the information. When he gets the copy he places it in your file.

You probably got the point a few paragraphs ago.

The same concept holds true with a document database. More often than not (yes there are exceptions) you’ll find that storing a majority of the pertinent data in a single document works well. What do I mean by a single document?

Taking our medical example, this is what it could look like if your medical history was digitized using the document database – MongoDB.

"_id" : ObjectId("4c537d4b82fd211170000000"),
	"name" : "Mr. LightCube Solutions"
	"date_created" : "Sat Jan 2 2010 21:32:58 GMT-0400 (EDT)",
        "DOB" : "Sat Jan 1 1970 00:01:01 GMT-0400 (EDT)",
	"billing" : {
		"description" : "Home",
		"telephone" : "",
		"address" : "43 Happy Lane",
		"address_2" : "",
		"city" : "Light Land",
		"state" : "NY",
		"zip" : "111222",
		"country" : "USA",
	"medications" : [
			"_id" : "4c537b3382fd21df6f040000",
			"name" : "Activase"
			"description" : " tissue plasminogen activator",
			"dose" : "100mg Vial",
			"date_started" : "Fri Jul 30 2010 21:32:58 GMT-0400 (EDT)"
	"appointments" : [
			"date" : "Fri Jul 30 2010 21:32:58 GMT-0400 (EDT)",
			"summary" : "Started patient on Activase due to heart disease."
			"scheduled_checkup" : "Mon Aug 30 2010 11:00:00 GMT-0400 (EDT)",

	"conditions" : {
		High Blood Pressure,
		Excessive Vitamin D

What would that document look like in a SQL database? I would guess 4 or 5 different tables requiring joins would be in place.

Why normalize all that valuable information?

If its not practical to do this in reality then why do it digitally?

Store your data in a way that is practical and dynamic. For instance, maybe Mr. Solutions will be put on some special program that will need to be monitored different than previous trials. Why build a new database just for that? Just adapt the schema and embed.

A document database offers the flexibility, speed and simplicity that we already live by in other systems.

Hopefully this brief trip to the doctor helped to clear things up. Now go take your MongoDB medicine and call me in the morning. :)

If you were like me you started throwing all kinds of files into MongoDB with GridFS. When you took a look at the db.fs.files collection you saw something like this for a document:

	"_id" : ObjectId("4c40affcce64e5e275c60100"),
	"filename" : "My First File.jpg",
	"uploadDate" : "Fri Jul 16 2010 15:16:12 GMT-0400 (EDT)",
	"length" : 55162,
	"chunkSize" : 262144,
	"md5" : "46aa378be7f6f1f3660efd7de5c1cbb6"

Did you see the MD5 hash? It’s there for a reason you know.

Since my PHP/MongoDB application has an administrative backend multiple people are loading up files. There is always a possibility that they will upload the same file. Of course this would be a very inefficient use of storage especially when the file is a video or picture.  That’s where the MD5 field in fs.files comes in handy.

In PHP you can use the md5_file() method to get the MD5 hash before you save the file to MongoDB. Running a findOne query using the md5 of your tmp file will let you know if  a document for that file already exists. If it does exist, then you’ll get back the fs.files document of the preloaded file. Then you can use the _id as a reference and don’t bother saving the file. Can you imagine all the money you save in storage fees on Amazon S3?

This is a very common and reliable way of doing things since byte for byte you know the files are the same. The sample script below is a snapshot of code in a Lithium application (Lithium is a new PHP 5.3+ framework). I’m basically running a findOne({“md5″ : “$md5″}) query:

protected function write() {
		$success = false;
		$grid = File::getGridFS();
		$this->fileName = $this->request->data['Filedata']['name'];
		$md5 = md5_file($this->request->data['Filedata']['tmp_name']);
		$file = File::first(array('conditions' => array('md5' => $md5)));
		if ($file) {
			$success = true;
			$this->id = (string) $file->_id;
		} else {
			$this->id = (string) $grid->storeUpload('Filedata', $this->fileName);
			if ($this->id) {
				$success = true;
		return $success;