If you're like me, you started throwing all kinds of files into MongoDB with GridFS. When you took a look at the db.fs.files collection, you saw a document that looked something like this:
{
    "_id" : ObjectId("4c40affcce64e5e275c60100"),
    "filename" : "My First File.jpg",
    "uploadDate" : "Fri Jul 16 2010 15:16:12 GMT-0400 (EDT)",
    "length" : 55162,
    "chunkSize" : 262144,
    "md5" : "46aa378be7f6f1f3660efd7de5c1cbb6"
}
Did you see the MD5 hash? It's there for a reason, you know.
My PHP/MongoDB application has an administrative backend, so multiple people are uploading files, and there is always a chance that two of them will upload the same file. Storing it twice would be a very inefficient use of storage, especially when the file is a video or a picture. That's where the md5 field in fs.files comes in handy.
In PHP you can use the md5_file() function to get the MD5 hash of the uploaded temp file before you save it to MongoDB. Running a findOne query on fs.files with that hash tells you whether a document for the file already exists. If it does, you get back the fs.files document of the previously stored file, so you can use its _id as a reference and skip saving the upload altogether. Can you imagine all the money you'll save in storage fees on Amazon S3?
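To make the lookup step concrete without any framework, here is a minimal sketch. It fakes fs.files with a plain PHP array so it runs without a database. The helper name findByMd5() and the sample data are made up for illustration. In a real application the loop would be a findOne query on the md5 field.

```php
<?php
// Hypothetical stand-in for the fs.files collection: a list of documents.
$fsFiles = array(
    array(
        '_id'      => '4c40affcce64e5e275c60100',
        'filename' => 'My First File.jpg',
        'md5'      => md5('first file bytes'),
    ),
);

// Hypothetical helper: return the first document whose md5 matches, or null.
// Against a real collection this would be a findOne on the md5 field.
function findByMd5(array $files, $md5) {
    foreach ($files as $doc) {
        if ($doc['md5'] === $md5) {
            return $doc;
        }
    }
    return null;
}

// Simulate an uploaded temp file containing the same bytes as the stored file.
$tmp = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($tmp, 'first file bytes');

// Hash the upload before storing anything, then look for an existing document.
$md5      = md5_file($tmp);
$existing = findByMd5($fsFiles, $md5);

// Reuse the existing _id when there is a match; null means "store the file".
$id = $existing ? $existing['_id'] : null;

unlink($tmp);
```

The point of hashing first is that the only cost of a duplicate upload is one md5_file() call and one query, instead of another full copy of the file.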
This is a very common and, in practice, reliable way of doing things: two files with the same MD5 hash are almost certainly the same byte for byte. The sample script below is a snapshot of code in a Lithium application (Lithium is a new PHP 5.3+ framework). I'm basically running a findOne({"md5" : "$md5"}) query:
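protected function write() {
    $success = false;
    $grid = File::getGridFS();
    $this->fileName = $this->request->data['Filedata']['name'];
    // Hash the uploaded temp file before storing anything.
    $md5 = md5_file($this->request->data['Filedata']['tmp_name']);
    // Look for an existing fs.files document with the same hash.
    $file = File::first(array('conditions' => array('md5' => $md5)));
    if ($file) {
        // Duplicate: reuse the _id of the file that is already stored.
        $success = true;
        $this->id = (string) $file->_id;
    } else {
        // New file: store the upload in GridFS and keep its new _id.
        $this->id = (string) $grid->storeUpload('Filedata', $this->fileName);
        if ($this->id) {
            $success = true;
        }
    }
    return $success;
}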
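One caveat: MD5 collisions can be constructed deliberately, so if uploads come from people you don't fully trust, you may want to confirm a hash match with a real byte comparison before reusing the stored file. Here is a rough sketch of that check. The helper name filesIdentical() is made up, and it assumes both files are available as local paths (for a file already in GridFS you would first have to write it out to a temp file).

```php
<?php
// Hypothetical helper: true only if both files have identical contents.
function filesIdentical($pathA, $pathB) {
    // Files of different sizes can never be identical, so check that first.
    if (filesize($pathA) !== filesize($pathB)) {
        return false;
    }
    $a = fopen($pathA, 'rb');
    $b = fopen($pathB, 'rb');
    $same = true;
    // Compare the two files chunk by chunk and stop at the first difference.
    while (!feof($a)) {
        if (fread($a, 8192) !== fread($b, 8192)) {
            $same = false;
            break;
        }
    }
    fclose($a);
    fclose($b);
    return $same;
}

// Small demo with temp files standing in for an upload and a stored file.
$upload = tempnam(sys_get_temp_dir(), 'up');
$stored = tempnam(sys_get_temp_dir(), 'st');
$other  = tempnam(sys_get_temp_dir(), 'ot');
file_put_contents($upload, 'same bytes');
file_put_contents($stored, 'same bytes');
file_put_contents($other, 'diff bytes');

$matchesStored = filesIdentical($upload, $stored);
$matchesOther  = filesIdentical($upload, $other);

unlink($upload);
unlink($stored);
unlink($other);
```

This only needs to run when the MD5 lookup finds a match, so the common case of a brand new file pays nothing extra.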
