Who needs a DB?

My blog is still working, in spite of my best efforts to crash it. So that's good. But lately I've been thinking that an SQL database is a lot of overkill just to run a little blog like this.

My blog only has around 450 posts total (over the course of many years), and about an equal number of user comments (thanks to all commenters!). Why do I need a full-blown database for that? All of my posts plus comments plus all metadata come to only 2 MB as a flat text file, 700 KB gzipped.

By far the most complicated part of my blog engine is the part that stuffs data into the database and gets it back out again in a sane manner (translating Clojure data to SQL values, and back again; splitting up my Clojure data structures into rows for different tables, and then re-combining values joined from multiple tables into one data structure). Eliminating that mess would be nice.

Inevitably I ended up with some logic in the database too: enforcing uniqueness of primary keys, marking some fields as NOT NULL, giving default values and so on. But a lot of other logic was in my Clojure code, e.g. higher-level semantic checking, and some things I wanted to set as default values were impossible to implement in SQL.

Wouldn't it be nice for all the logic to be in Clojure? And the data store on disk to be a simple dump of a Clojure data structure? I can (and did) write a few macros to give me SQL-like field declaration and data validation, for uniqueness of IDs and data types etc. For my limited needs it works OK.
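To give a rough idea of what field declaration and validation in plain Clojure might look like, here is a minimal sketch. It uses ordinary functions rather than the macros mentioned above, and every name in it (post-fields, apply-defaults, field-errors, add-record) is hypothetical, not the blog engine's real code.

```clojure
;; Hypothetical field declarations: field name -> checks and optional default.
(def post-fields
  {:id    {:check integer? :required? true}
   :title {:check string?  :required? true}
   :draft {:check #(instance? Boolean %) :default false}})

(defn apply-defaults
  "Fill in declared defaults for fields the record does not supply."
  [fields record]
  (reduce-kv (fn [r k spec]
               (if (and (contains? spec :default) (not (contains? r k)))
                 (assoc r k (:default spec))
                 r))
             record fields))

(defn field-errors
  "Return the names of fields that are missing-but-required or fail their check."
  [fields record]
  (for [[k {:keys [check required?]}] fields
        :when (if (contains? record k)
                (not (check (get record k)))
                required?)]
    k))

(defn add-record
  "Validate `record` and add it to the vector `records`, enforcing unique :id."
  [records fields record]
  (let [record (apply-defaults fields record)
        bad    (field-errors fields record)]
    (cond
      (seq bad)
      (throw (IllegalArgumentException. (str "Invalid fields: " (vec bad))))

      (some #(= (:id %) (:id record)) records)
      (throw (IllegalArgumentException. (str "Duplicate id: " (:id record))))

      :else (conj records record))))

(add-record [] post-fields {:id 1 :title "Hello"})
;; => [{:id 1, :title "Hello", :draft false}]
```

Since the defaults are just Clojure values computed in Clojure code, anything that was awkward to express as an SQL default could go in the same place.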

The next question is what format to use for dumping to disk. Happily Clojure is Lisp, so dumping it as a huge s-exp via pr-str works fine, and reading it back in later via read-string is trivial.
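As a small sketch of that round trip: the data below is made up, and save-data and load-data are hypothetical helper names. It assumes a Clojure version where spit and slurp are in clojure.core.

```clojure
;; Hypothetical blog data; the keys and values are only for illustration.
(def posts
  [{:id 1 :title "Who needs a DB?" :tags #{"clojure" "blog"}}
   {:id 2 :title "Another post"    :tags #{}}])

(defn save-data
  "Print `data` in reader-friendly form and write it to the file at `path`."
  [path data]
  (spit path (pr-str data)))

(defn load-data
  "Read back the single form stored in the file at `path`."
  [path]
  (read-string (slurp path)))

;; Round trip through a temporary file.
(let [path (str (java.io.File/createTempFile "blog" ".clj"))]
  (save-data path posts)
  (= posts (load-data path)))
;; => true
```

read-string only reads the first form in the string, which is why everything goes out as one big data structure.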

Some Java data types can't be printed readably by default, for example java.util.Dates, which print like this:

#<Date Wed May 20 22:39:00 PDT 2009>

The #<> reader macro deliberately throws an error if you try to read that back in, because the reader isn't smart enough to craft Date objects from strings by default. But Clojure is extensible; you can specify a readable-print method for any data type like this:

;; Print Dates as a #=(...) read-eval form that the reader can turn back into a Date.
(defmethod clojure.core/print-method java.util.Date [o w]
  (.write w (str "#=" `(java.util.Date. ~(.getTime o)))))

Now dates print as

#=(java.util.Date. 1242884415044)

and if you try to read that via read-string, it'll create a Date object like you'd expect.

user> (def x (read-string "#=(java.util.Date. 1242884415044)"))
#'user/x
user> (class x)
java.util.Date
user> (str x)
"Wed May 20 22:40:15 PDT 2009"

Storing data in a plain file has the added benefit of letting me grep my data from the command line, or even edit the data in a text editor and re-load it into the blog (God help me if that's ever necessary).

Having multiple threads banging on a single file on disk is a horrible idea, but Clojure's refs, agents and transactions handle that easily. I do still have to work out how not to lose all my data if the server crashes in the middle of a file update. (I've lost data, recoverably, to a server crash in the middle of a MySQL update too, so this is a problem for everyone.) Perhaps I'll keep a running history of my data, each update being a new timestamped file, so old files can't possibly be corrupted. Or use the old write-to-tmp-file-and-rename-to-real-file routine. Or heck, I could keep my data in Git and use Git commands from Clojure. It'd be nice to have a history of edits.
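The write-to-tmp-file-and-rename routine might look roughly like the sketch below. It assumes a JVM with java.nio.file (Java 7 or later), and save-atomically is a hypothetical name. The data is written in full to a temporary file in the same directory, which is then moved over the real file in one step, so a crash mid-write should leave the old file untouched.

```clojure
(import '(java.io File)
        '(java.nio.file Files CopyOption StandardCopyOption))

(defn save-atomically
  "Write `data` to a temp file next to `path`, then rename it over `path`."
  [^String path data]
  (let [target (.getAbsoluteFile (File. path))
        ;; Same directory as the target, so the rename stays on one filesystem.
        tmp    (File/createTempFile "blog" ".tmp" (.getParentFile target))]
    (spit tmp (pr-str data))
    ;; ATOMIC_MOVE asks for an atomic rename. Whether an existing target is
    ;; replaced is implementation-specific; on POSIX systems it normally is.
    ;; Note that this sketch does not force the written bytes to disk first.
    (Files/move (.toPath tmp) (.toPath target)
                (into-array CopyOption [StandardCopyOption/ATOMIC_MOVE]))))
```

Calling it from a single agent or a watcher would keep writers from overlapping, but that part is left out here.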

If this idea works out I'll upload code for everything to GitHub, as usual.

May 20, 2009 @ 3:53 PM PDT
Category: Programming


Quoth Jose A Ortega on May 20, 2009 @ 8:53 PM PDT

You might be interested in tekuti, a blogging engine implemented in Scheme and using git as its backend.

Quoth Legooolas on May 21, 2009 @ 12:15 AM PDT

No reason why you couldn't have some sort of synchronized data structure as the store, and just write it out to disk occasionally. Maybe some transaction-based affair?

You'll have the same locking problem with the comments (or entries, if you're into parallel writing of blog entries) and adding them to the data structure, so it shouldn't be too much more of a problem to overcome when writing it to disk.

Quoth Dan Fego on May 21, 2009 @ 12:40 AM PDT

My only thought is scalability. I mean, your current needs don't demand a database, but what if you suddenly become a posting maniac? What if your blog gets really popular, or you get dugg? I'm aware there are variables other than how your data is stored, but you certainly don't want it to be a bottleneck. Databases are designed to handle data storage and retrieval efficiently. Of course, everything is fast for small "n". :)

I suppose the ideal case would be to craft your program such that you can swap out data-store backends easily. That way, if you decide to change your mind, or decide you're suddenly in love with sqlite, for example, you can just change it without much pain.

Quoth rzezeski on May 21, 2009 @ 2:46 AM PDT

For the "server crashes during file update" problem maybe you could glean something from CouchDB? Right now the details escape me, but CouchDB goes with the philosophy that unexpected behavior is the norm, and that the persisted data should not be allowed to enter a corrupted state just because of a server crash. It is so ingrained in the implementation that there is no way to "stop" CouchDB, you simply kill it.

Anyways, another great post, keep 'em coming!

Quoth Brian on May 21, 2009 @ 3:25 AM PDT

@Jose A Ortega thanks, I'll look at that.

@Legooolas Right, that's what Clojure refs would do. You can set a watcher on a ref so something is done every time the ref's value changes, in a concurrency-safe way. If everything is in one huge ref, there's no need to worry about locks or parallel comments/posts etc.

@Dan Fego I'm blogging as hard as I can right now, and still not producing much. Probably not much need to worry. But yeah I don't want it to be a bottleneck. Any update to site content will happen quickly in-memory and the disk write will be in a background thread, so no one should notice.

@rzezeski I'll have a look at CouchDB. Yeah, I need something foolproof; I'm pretty much certain I'm going to have a server crash sooner or later.

Quoth Brandon Gray on May 25, 2009 @ 12:29 PM PDT

Thanks for the great information for handling java.util.Date! In the past, I typically just saved off the time (.getTime) and did the conversion from there. I'll definitely be using this method in the future.

Quoth Phil on June 17, 2009 @ 4:35 AM PDT

I keep my blog posts on disk and manage them with Git; it's a great way to handle things. But I'd second the use of CouchDB if you're on a server where it can be easily installed. (I guess if you're running Clojure you can install whatever you want.)

Awesome date printing tip too; thanks!