Download Files from MongoDB GridFS

Well, you’ve gone and done it now. You were so clever and used GridFS to store large files in MongoDB. It worked great, and you’ve had zero problems. But now you have a problem: let’s say you want a hard copy of those files, for reasons ranging from archiving to migrating to a different database system. Well, I have bad news for you: if your files collection holds more than a few dozen files, there is no convenient way to retrieve the actual files from that filesystem except rolling your own (or following the instructions here, that is).

TL;DR Here is the script

A quick review of the file system I am working with: Let’s say you have a People collection and a Files collection. People might look something like:

> db.people.findOne()
{
    "_id" : "XXXXXXXXXXXXXXXXX",
    "status" : "Active",
    "name" : "Joe Smith",
    "files" : [ 
        { "file" : "aaaaaaaaaaaaaaaaa" }, 
        { "file" : "bbbbbbbbbbbbbbbbb" }, 
        { "file" : "ccccccccccccccccc" } 
    ]
}

and Files is a little more complex, but the aspect we care about might look something like:

> db.cfs.files.filerecord.findOne()
{
    "_id" : "YYYYYYYYYYYYYYYYY",
    "original" : {
        "name" : "puppies.png",
        "updatedAt" : ISODate("2017-08-09T21:45:31Z"),
        "size" : 15645,
        "type" : "image/png"
    },
    "copies" : {
        "filesStore" : {
            "name" : "puppies.png",
            "type" : "image/png",
            "size" : 15645,
            "key" : "zzzzzzzzzzzzzzzzzzzzzzzz",
            "updatedAt" : ISODate("2017-08-09T21:45:31Z"),
            "createdAt" : ISODate("2017-08-09T21:45:31Z")
        }
    }
}

The important things to note are the _id of the file and copies.filesStore.key; those are what we’ll be using to access and associate our files.

One of the great things about GridFS and the way it stores files is that there is no restriction on file names. If you want to have 30,000 files all named puppies.png you can do that.

MongoDB also has an excellent utility for file access and manipulation, mongofiles. It gives you complete CRUD control over the files in GridFS from the command line. However, if you do have 30,000 files all named puppies.png, mongofiles won’t be a ton of help, as its primary features focus on using the name of the file for access. You could, for example, run

mongofiles get puppies.png

and the resulting puppies.png file would be exported to your local filesystem, but only one of them. In order to get a specific puppies.png file, you would need to pass the ObjectId of the file you want, like so:

mongofiles -d db get_id 'ObjectId("56feac751f417d0357e7140f")'

But this only works once you have the ObjectId of the file in question. There is a mongofiles command for listing all the files, but it only returns their names. Furthermore, the ObjectId you need to pass is not the _id from the above example, but the copies.filesStore.key field mentioned earlier (e.g., zzzzzzzzzzzzzzzzzzzzzzzz).
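Since mongofiles can’t hand you those keys, one way to collect them is to ask the database directly. Here is a minimal sketch, assuming the legacy mongo shell is on your PATH and the collection layout shown above. The function name list_file_keys is made up for illustration.

```shell
#!/bin/sh
# Sketch only. Prints one "<_id> <key>" line per file record.
# Assumption: the legacy `mongo` shell is installed and can reach the database.
list_file_keys() {
  # $1 = database name (e.g. meteor)
  mongo "$1" --quiet --eval '
    db.cfs.files.filerecord.find().forEach(function (f) {
      // Skip records that have no stored copy to fetch.
      if (f.copies && f.copies.filesStore) {
        print(f._id + " " + f.copies.filesStore.key);
      }
    });'
}
```

Calling list_file_keys meteor should print each record’s _id next to the key that mongofiles get_id expects.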

Of course, all of this is compounded by the fact that we don’t just want to download all 30,000 puppies.png files, every one of them holding a special and irreplaceable puppies image. We also want to maintain association with the Person object they are attached to, which does reference the _id of the File object.

The solution I’ve come up with was based originally on this gist, which I’ve modified to use the _id as a parent directory, with the file (under its original name) contained within. When run inside a backup_files dir, it will create a directory structure somewhat like this:

backup_files
├── YYYYYYYYYYYYYYYYY
│   └── puppies.png
├── IIIIIIIIIIIIIIIII
│   └── puppies.png
├── NNNNNNNNNNNNNNNNN
│   └── puppies.png
└── MMMMMMMMMMMMMMMMM
    └── puppies.png

The file’s copies.filesStore.key is used for retrieval via mongofiles get_id, the file’s _id is used as the directory name, and the file’s original name is preserved inside that dir. The dir’s name can then be referenced from the People collection to maintain a reference to a Person’s files (of puppies).
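To go from a Person back to their directories, you can print each person’s _id next to the file _id values they reference. This is a rough sketch, assuming the People layout shown earlier and the legacy mongo shell. The function name list_person_dirs is hypothetical.

```shell
#!/bin/sh
# Sketch only. Prints one "<person _id> <directory name>" line per attached file.
# Assumption: each entry in people.files stores the file record _id in `file`.
list_person_dirs() {
  # $1 = database name (e.g. meteor)
  mongo "$1" --quiet --eval '
    db.people.find().forEach(function (p) {
      // Some people may have no files array at all.
      (p.files || []).forEach(function (f) {
        print(p._id + " " + f.file);
      });
    });'
}
```

Each second column should match a directory name under backup_files.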

Here is the script:

Note that you will need to replace the name of your database (meteor in my case) on lines 10 & 12.
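For reference, here is a minimal sketch of the approach described above. It is an illustration under assumptions, not the script itself: the function names export_file and export_all are hypothetical, mongofiles is assumed to be on your PATH, and it is assumed that mongofiles needs no options beyond the database name to find your GridFS files. The database name comes from a DB variable that defaults to meteor.

```shell
#!/bin/sh
# Sketch only, not the original script.
# Assumption: DB names your database; "meteor" matches the article's example.
DB="${DB:-meteor}"

# Fetch one GridFS file by key into a directory named after the record's _id.
export_file() {
  # $1 = file record _id (directory name), $2 = copies.filesStore.key
  mkdir -p "$1" || return 1
  # The subshell keeps the cd from leaking into the caller. mongofiles is
  # expected to write the file under its stored name into the current dir.
  # stdin is redirected so mongofiles cannot eat the caller's input.
  ( cd "$1" && mongofiles -d "$DB" get_id "ObjectId(\"$2\")" </dev/null )
}

# Read "<_id> <key>" pairs on stdin and export each file in turn.
export_all() {
  while read -r id key; do
    # Ignore blank or incomplete lines.
    [ -n "$id" ] && [ -n "$key" ] || continue
    export_file "$id" "$key" || echo "failed: $id" >&2
  done
}
```

From inside backup_files, you would pipe a list of "_id key" lines (one pair per line, queried from cfs.files.filerecord) into export_all. Failures are reported on stderr and the loop carries on, so one bad key doesn’t stop the whole export.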

PS: If you know of an easier way of doing this, please tell me.
