Storing files in PostgreSQL

I have in mind an application that could end up storing thousands of files of varying sizes in PostgreSQL. The total size of the files could be hundreds of gigs. Now, I know it’s generally not good practice to store files directly in a database, but for simplicity’s sake, I am at least exploring the idea. Can it handle file storage like that? Or am I being totally stupid to even consider it? Or is there a way to send files to PostgreSQL and have it store them outside of the database? I’m not all that familiar with PostgreSQL and have just started to look into it.

Thanks,

Phil

I would recommend NOT storing files in the database and instead using a file-pointer system, where the files live outside the database and are referenced from it.

You could create a storage module in your application that does the following:

  1. Inserts file-pointer references into the database.
  2. Stores each file at the location its pointer references, outside of the database.
  3. Retrieves the file for the requester.

Of course, this assumes that you have some sort of middleware app on the server that communicates to client apps and can access the server file system. That way you can just store the reference to the file in the database (very small).
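The steps above can be sketched roughly as follows. This is a minimal illustration, not anything from the thread: sqlite3 stands in for PostgreSQL so the example is self-contained, and the table name, column names, and content-addressed store layout are all assumptions.

```python
import hashlib
import shutil
import sqlite3
from pathlib import Path


def init(db_path, store_root):
    """Create the pointer table and the on-disk store directory."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " stored_path TEXT NOT NULL,"
        " sha256 TEXT NOT NULL)"
    )
    Path(store_root).mkdir(parents=True, exist_ok=True)
    return conn


def store_file(conn, store_root, src):
    """Copy the file into the store; the database row is only a small pointer."""
    src = Path(src)
    digest = hashlib.sha256(src.read_bytes()).hexdigest()
    dest = Path(store_root) / digest  # content-addressed location (a choice, not a rule)
    shutil.copyfile(src, dest)
    cur = conn.execute(
        "INSERT INTO files (name, stored_path, sha256) VALUES (?, ?, ?)",
        (src.name, str(dest), digest),
    )
    conn.commit()
    return cur.lastrowid


def retrieve_file(conn, file_id):
    """Follow the pointer and hand the bytes back to the requester."""
    row = conn.execute(
        "SELECT stored_path, sha256 FROM files WHERE id = ?", (file_id,)
    ).fetchone()
    data = Path(row[0]).read_bytes()
    assert hashlib.sha256(data).hexdigest() == row[1]  # integrity check
    return data
```

With PostgreSQL the schema and flow would be the same, just through a driver such as psycopg2, with the middleware process being the only thing that needs file-system access.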

If you don’t have that, then you’ll have to store the files in the database. I don’t recommend it, but you can. I’d recommend having a separate table for ‘chunks’ of file data: when you save a file, break it up into smaller chunks (say 1 MB), and when a file is requested, have a routine that puts the chunks back together in the proper order. And of course check the length and the hash to make sure it was reassembled properly.
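The chunk-and-reassemble routine described above might look like this sketch; the 1 MB chunk size comes from the suggestion, while the use of SHA-256 for the integrity check is my own assumption.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size suggested above


def split_into_chunks(data):
    """Break a file's bytes into (sequence, bytes) pairs, i.e. rows for a 'chunks' table."""
    return [
        (seq, data[off:off + CHUNK_SIZE])
        for seq, off in enumerate(range(0, len(data), CHUNK_SIZE))
    ]


def reassemble(chunks, expected_len, expected_sha256):
    """Put the chunks back together in sequence order, verifying length and hash."""
    data = b"".join(part for _, part in sorted(chunks))
    if len(data) != expected_len:
        raise ValueError("length mismatch")
    if hashlib.sha256(data).hexdigest() != expected_sha256:
        raise ValueError("hash mismatch")
    return data
```

Storing the expected length and hash alongside the file's metadata row is what makes the final verification step possible.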

Another good reason to keep your files outside the database: backups.

If the files you are going to add to the DB are the size you say they are, you will need to calculate how long a full backup will take. If the database backup takes HOURS to perform, will that be acceptable?
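That calculation is simple back-of-envelope arithmetic: database size divided by sustained dump throughput. The 50 MB/s figure below is a hypothetical example, not a measured number; real dump rates depend heavily on hardware and the dump method.

```python
def backup_hours(db_size_gb, throughput_mb_per_s):
    """Rough full-backup duration: size in MB divided by sustained throughput."""
    seconds = db_size_gb * 1024 / throughput_mb_per_s
    return seconds / 3600


# e.g. hundreds of gigs at a (hypothetical) 50 MB/s sustained dump rate:
# backup_hours(400, 50) is a bit over 2 hours, before any restore testing
```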

One of the PG installations I take care of has 0.4 TB worth of data, 85% of which is PDFs stored internally. I had to mount a backup once; it took 2 days. Wasn’t fun.

Hmmm… I’m starting to really not like the idea of storing the files inside the database. The other problem is that this needs to be dead simple on the server end, and since most database servers require some administration, I’m leaning towards rolling my own server to handle the file storage. But I shudder at the amount of time that could take.

The application or service that runs the storage module is the one that needs access to the database and file store. This is no different than your original thought of storing files in a database… something would have to perform this task.

Without seeing the full view of your project scope, it’s hard to advise.

I recommend this wiki entry for your final decision:

https://wiki.postgresql.org/wiki/BinaryFilesInDB

[quote=137343:@Daniel Hüttenberger]I recommend this wiki entry for your final decision:

https://wiki.postgresql.org/wiki/BinaryFilesInDB[/quote]

This is a great article with a lot of good information.

The one subject it doesn’t cover is the hosting environment for the database. If you are going to use a database as a file dump, the hardware and storage requirements are going to be different than for a database with a file-pointer design.

Basically, what I’m trying to do is create a client-server backup system. By day I am a Tivoli Storage Manager administrator, and I guess backups are in my blood. :slight_smile:

This is not going to be an enterprise-class backup, more like a home or (very) small office backup system. I’ve looked at using rsync and I’m still considering it, but I would really like to do a true client-server system.

Thanks for the wiki article. I’ll check it out.

rsync would work fine.
You can sync to a remote system, then disconnect.
You’d have to do the same with a client-server backup system.

I want to make the server side as painless as possible. The target audience is home users and most home users are not skilled in maintaining databases, setting up ssh connections, etc…

So, I’m definitely ruling out storing files in PostgreSQL. Not a good idea. If I were doing this just for me, I think I would do it, as my backup needs are small. I rarely release my programs, but whenever I start a new project, I try to create it as if I’m going to release it, just in case I end up with something that could be useful to others.

Rsync requires the end user to do some things that could be tricky, at least in my opinion. But I do have some ideas on how to get around that.

Actually, the perfect solution could have been to use Studio Stable Database, as it has a file API that would be perfect. However, it’s not intended for large files because it sends the entire file in one shot. I do have a source code license and have looked into modifying it. I know it’s discontinued, but since I have the source I can maintain it, at least for my application.

The other option is to just write my own server. Not sure if that’s what I want to do because of the time involved, but it could also be fun. :slight_smile:

[quote=137397:@Rich Hatfield]This is a great article with a lot of good information.

The one subject it doesn’t cover is the hosting environment for the database. If you are going to use a database as a file dump, the hardware and storage requirements are going to be different than for a database with a file-pointer design.[/quote]

The requirements for storage are explained in the docs for “Large Objects”.
http://www.postgresql.org/docs/current/static/lo-implementation.html

[quote]The large object implementation breaks large objects up into “chunks” and stores the chunks in rows in the database. A B-tree index guarantees fast searches for the correct chunk number when doing random access reads and writes.

The chunks stored for a large object do not have to be contiguous. For example, if an application opens a new large object, seeks to offset 1000000, and writes a few bytes there, this does not result in allocation of 1000000 bytes worth of storage; only of chunks covering the range of data bytes actually written. A read operation will, however, read out zeroes for any unallocated locations preceding the last existing chunk. This corresponds to the common behavior of “sparsely allocated” files in Unix file systems.

As of PostgreSQL 9.0, large objects have an owner and a set of access permissions, which can be managed using GRANT and REVOKE. SELECT privileges are required to read a large object, and UPDATE privileges are required to write or truncate it. Only the large object’s owner (or a database superuser) can delete, comment on, or change the owner of a large object. To adjust this behavior for compatibility with prior releases, see the lo_compat_privileges run-time parameter.[/quote]
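The sparse-allocation behavior the quoted docs describe is easy to illustrate. With default build options the large-object chunk size (LOBLKSIZE) is BLCKSZ/4 = 2048 bytes; this small sketch just computes which chunk rows a given write would allocate.

```python
# LOBLKSIZE is the large-object chunk size; 2048 bytes with default
# PostgreSQL build options (BLCKSZ/4).
LOBLKSIZE = 2048


def chunks_touched(offset, length):
    """Chunk numbers a write of `length` bytes at `offset` would allocate."""
    if length <= 0:
        return []
    first = offset // LOBLKSIZE
    last = (offset + length - 1) // LOBLKSIZE
    return list(range(first, last + 1))


# Seeking to offset 1000000 and writing a few bytes allocates a single
# chunk row, not ~1 MB of storage, exactly as the docs describe:
# chunks_touched(1_000_000, 8) -> [488]
```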

I have PostgreSQL databases that are 10-100x that size with files inside them, and backing them up and restoring them takes forever. I have extremely fast hardware, and one was still down for over a week. Whenever a DBA tells me they are storing files in the database (it doesn’t matter which one), my heart skips a beat or a dozen.

I agree with Bob. I can shoot myself in the foot with my paintball gun, but I prefer not to (it hurts a LOT at that range). Same thing: you can save files in the database, but you probably shouldn’t.

I think storing files in a relational database is a bad idea.

Hadoop on the other hand… Different story

Hadoop is not a database… well, it sort of is and isn’t. It is a beast of its own.

I never defined Hadoop as a database :stuck_out_tongue: just a different technology better suited to large file storage.

OK… so I have been thinking about the additional info from the OP…

If the intent of this backup tool is to be user-friendly and designed for the non-tech head, then I wouldn’t even think about having to host a database.

Just an idea: if files and folders are what you are aiming to back up, why not create a tokenlist in each directory of the backup folder? Each tokenlist would record the folder’s file contents and the various changes that happen within the directory.

This would help keep things simple, with very little maintenance.
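One way to read the tokenlist idea is a small per-directory manifest of files and their hashes, diffed on each run to find changes, with no database at all. The manifest filename and JSON format below are illustrative choices, not anything specified in the thread.

```python
import hashlib
import json
from pathlib import Path

MANIFEST = ".tokenlist.json"  # hypothetical per-directory manifest name


def build_manifest(folder):
    """Map each file in the folder to a hash of its contents."""
    folder = Path(folder)
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in folder.iterdir()
        if p.is_file() and p.name != MANIFEST
    }


def changed_files(folder):
    """Compare the folder against its stored manifest, then refresh the manifest."""
    folder = Path(folder)
    manifest_path = folder / MANIFEST
    old = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new = build_manifest(folder)
    changed = sorted(name for name in new if old.get(name) != new[name])
    manifest_path.write_text(json.dumps(new))
    return changed
```

A backup run would only need to copy what `changed_files` reports for each directory, which keeps the server side to plain file storage.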

[quote=137450:@Phillip Smith]I want to make the server side as painless as possible. The target audience is home users and most home users are not skilled in maintaining databases, setting up ssh connections, etc…
[/quote]
A really nice front end on rsync so it’s easy to use is probably a ton less work & in the end just as reliable, or more reliable, than something you could invent.
You could set up everything, including ssh tunnels, from inside whatever front end you write.
I’ve done that.

I wouldn’t rule out rsync (etc.) just because it requires some technical know-how.
That’s where you come in :slight_smile:
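A front end along those lines mostly boils down to assembling the rsync invocation for the user. A minimal sketch, where the source, user, host, and destination parameters are hypothetical; the flags themselves are standard rsync options:

```python
import subprocess


def build_rsync_cmd(src, user, host, dest, ssh_port=22):
    """Assemble the rsync invocation a GUI front end would run for the user.

    -a        archive mode (recursion, permissions, times)
    -z        compress during transfer
    --delete  mirror deletions to the remote side
    -e        run the transfer over ssh on the given port
    """
    return [
        "rsync", "-az", "--delete",
        "-e", f"ssh -p {ssh_port}",
        src, f"{user}@{host}:{dest}",
    ]


def run_backup(src, user, host, dest, ssh_port=22):
    # The front end calls this; the user never touches rsync or ssh directly.
    subprocess.run(build_rsync_cmd(src, user, host, dest, ssh_port), check=True)
```

Key setup (e.g. generating and installing an ssh key pair) could be automated the same way, which is what hides the "technical know-how" from a home user.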

[quote=137617:@Norman Palardy]A really nice front end on rsync so it’s easy to use is probably a ton less work & in the end just as reliable, or more reliable, than something you could invent.
You could set up everything, including ssh tunnels, from inside whatever front end you write.
I’ve done that.

I wouldn’t rule out rsync (etc.) just because it requires some technical know-how.
That’s where you come in :)[/quote]

Yes, good point. :slight_smile:

I’m doing some more playing around with rsync now. I’m thinking that may be the better solution.

Thanks guys!

Phil