If you’re like me, you started throwing all kinds of files into MongoDB with GridFS. When you took a look at the db.fs.files collection, you saw a document something like this:
{ "_id" : ObjectId("4c40affcce64e5e275c60100"), "filename" : "My First File.jpg", "uploadDate" : "Fri Jul 16 2010 15:16:12 GMT-0400 (EDT)", "length" : 55162, "chunkSize" : 262144, "md5" : "46aa378be7f6f1f3660efd7de5c1cbb6" }
Did you see the MD5 hash? It’s there for a reason, you know.
Since my PHP/MongoDB application has an administrative backend, multiple people are loading up files, and there is always a possibility that they will upload the same file. That would be an inefficient use of storage, especially when the file is a video or picture. That’s where the md5 field in fs.files comes in handy.
In PHP you can use the md5_file() function to compute the MD5 hash before you save the file to MongoDB. Running a findOne query on the md5 of your tmp file tells you whether a document for that file already exists. If it does, you get back the fs.files document of the previously loaded file; you can then use its _id as a reference and skip saving the file again. Can you imagine all the money you save in storage fees on Amazon S3?
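The lookup-before-store idea can be sketched without a database. In this minimal example the $index array is only a stand-in for the fs.files collection (md5 => id), storeOnce() is a hypothetical helper name, and uniqid() is a placeholder for the id GridFS would return. None of these are MongoDB APIs; only md5_file() is the real function discussed above.

```php
<?php
// Hypothetical helper: return the id of an already-stored file with the same
// MD5, or "store" the file and return a new id. $index stands in for fs.files.
function storeOnce($tmpPath, array &$index)
{
    $md5 = md5_file($tmpPath);          // hex digest of the file contents
    if ($md5 === false) {
        throw new RuntimeException("Could not read $tmpPath");
    }
    if (isset($index[$md5])) {
        return $index[$md5];            // duplicate: reuse the existing id
    }
    $id = uniqid('file_', true);        // placeholder for the id GridFS would return
    $index[$md5] = $id;
    return $id;
}

// Usage: two temp files with identical bytes end up sharing one id.
$a = tempnam(sys_get_temp_dir(), 'up');
$b = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($a, 'same bytes');
file_put_contents($b, 'same bytes');

$index  = array();
$first  = storeOnce($a, $index);
$second = storeOnce($b, $index);
var_dump($first === $second);           // bool(true)

unlink($a);
unlink($b);
```

In a real application the isset() check would be the findOne query on md5, and the placeholder id would be whatever the GridFS store call returns.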
This is a common approach, and in practice a matching MD5 means the files are byte-for-byte identical: accidental collisions are extremely unlikely, although MD5 cannot rule out deliberately crafted ones. The sample script below is a snapshot of code in a Lithium application (Lithium is a new PHP 5.3+ framework). I’m basically running a findOne({"md5" : "$md5"}) query:
protected function write() {
    $success = false;
    $grid = File::getGridFS();
    $this->fileName = $this->request->data['Filedata']['name'];

    // Hash the uploaded temp file and look for an fs.files document with the same md5.
    $md5 = md5_file($this->request->data['Filedata']['tmp_name']);
    $file = File::first(array('conditions' => array('md5' => $md5)));

    if ($file) {
        // Duplicate: reuse the existing file's _id instead of storing it again.
        $success = true;
        $this->id = (string) $file->_id;
    } else {
        // New file: store the upload in GridFS and keep the returned id.
        $this->id = (string) $grid->storeUpload('Filedata', $this->fileName);
        if ($this->id) {
            $success = true;
        }
    }
    return $success;
}