If you were like me you started throwing all kinds of files into MongoDB with GridFS. When you took a look at the db.fs.files collection you saw something like this for a document:

	
{
	"_id" : ObjectId("4c40affcce64e5e275c60100"),
	"filename" : "My First File.jpg",
	"uploadDate" : "Fri Jul 16 2010 15:16:12 GMT-0400 (EDT)",
	"length" : 55162,
	"chunkSize" : 262144,
	"md5" : "46aa378be7f6f1f3660efd7de5c1cbb6"
}

Did you see the MD5 hash? It’s there for a reason you know.

Since my PHP/MongoDB application has an administrative backend multiple people are loading up files. There is always a possibility that they will upload the same file. Of course this would be a very inefficient use of storage especially when the file is a video or picture.  That’s where the MD5 field in fs.files comes in handy.

In PHP you can use the md5_file() method to get the MD5 hash before you save the file to MongoDB. Running a findOne query using the md5 of your tmp file will let you know if  a document for that file already exists. If it does exist, then you’ll get back the fs.files document of the preloaded file. Then you can use the _id as a reference and don’t bother saving the file. Can you imagine all the money you save in storage fees on Amazon S3?

This is a very common and reliable way of doing things since byte for byte you know the files are the same. The sample script below is a snapshot of code in a Lithium application (Lithium is a new PHP 5.3+ framework). I’m basically running a findOne({“md5″ : “$md5″}) query:

	
protected function write() {
		$success = false;
		$grid = File::getGridFS();
		$this->fileName = $this->request->data['Filedata']['name'];
		$md5 = md5_file($this->request->data['Filedata']['tmp_name']);
		$file = File::first(array('conditions' => array('md5' => $md5)));
		if ($file) {
			$success = true;
			$this->id = (string) $file->_id;
		} else {
			$this->id = (string) $grid->storeUpload('Filedata', $this->fileName);
			if ($this->id) {
				$success = true;
			}
		}
		return $success;
}

2 thoughts on “Use the MongoDB MD5 field in GridFS

  1. Pingback: List of useful Lithium (Li3) resources « Web mellom fjell

  2. You forget probably that by default there is no index on the md5 field so your query will be slow sooner or later. Furthermore you can have collisions that the another file without the same content can produce the same md5 has thus you will have a false condition. The right way would be to query for filename + md5 to make sure you have no duplicates though.

    Reply

Leave a reply

required

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>