This is a read-only archive!

Goodbye Tokyo Cabinet, hello PostgreSQL?

The first version of this blog used MySQL; then I switched to Tokyo Cabinet. But now I've switched back to PostgreSQL. Here's why.

Why did I switch to TC to begin with?

  1. There weren't any good ORM-type libraries for Clojure at the time (over a year ago). So there was a bit of an impedance mismatch trying to query and work with my data. In the DB I have separate tables for posts, comments, tags, categories. But 90% of the time I want to fetch a post and end up in Clojure with all related tags, comments etc. (A JOIN won't work here without a lot of work to un-mangle the results; you usually need multiple queries.)

    With TC I could store anything in the DB, so I just dumped a post into the DB as serialized hash-maps with all the comments, tags, and categories as sub-keys. So querying was easy. (Or was it? More below.)

  2. MySQL was really slow. This is largely because my queries were terrible, as I tried to solve the problem from #1 via brute force. TC on the other hand is fast.

  3. I tried to solve #2 by using a Clojure ref as a cache. But tying the STM to a database's transaction system is as far as I know difficult or impossible right now (per many threads on the mailing list). I had a lot of potential race conditions, which (as far as I know) never bit me, but probably would've eventually. I had to deal with keeping the cache up-to-date as comments were posted and posts were added and deleted and renamed. Remember:

"There are only two hard problems in Computer Science: cache invalidation and naming things." --Phil Karlton

So why did I stop using TC?

  1. I have no idea how to use a key/value store database properly. TC will take anything you dump into it, which is both a strength and a weakness.

    There's a lot of crap you have to do by hand that a proper database does for you. Consider checking for null values, for example; I ended up with a lot of nils in my data because my validations weren't 100% foolproof, or because I imported data via code that didn't run the validations and I never noticed.. Or enforcing uniqueness of values; I had tag objects in the database with the same key but different values (due to capitalization differences), which screwed up a lot of stuff.

    On the other hand, there's a lot of information about how to use RDBMS properly, and I have a lot of experience with it already. Constraints are easy to set up. Columns have types, which is nice. (Strange that I gravitate toward statically-type databases while I gravitate toward dynamically-typed programming languages.)

  2. I have to compile and install Tokyo Cabinet by hand on my Linux distro. It's probably not worth distro maintainers to maintain a package that so few people use. MySQL and PostgreSQL have lots of people working on keeping them running OK on most Linux distros.

  3. Some kinds of queries were still awkward in TC. "Give me post X" was great: I'd also get all the tags, categories etc. for free. But then how do you query to get all tags across all posts? Or all comments? Fetch all the posts and iterate over them, collecting their tags, then uniquify the resulting list? Not so pretty, and not so fast. So I was back to caching again, which still gave me nightmares about race conditions and dirty data.

So now why am I using an RDBMS again?

  1. An RDBMS is exactly what I really need, if I could just query the thing concisely and get it to run fast. Thankfully there are some ORM-like libraries for Clojure in the works nowadays, already usable for a hobby project like this blog. There are clj-record, Carte, ClojureQL, and my own Oyako, and possibly others in the works.

  2. For my tiny blog's database, Oyako gives me slightly slower performance than TC, but along the same order of magnitude, which is good enough.

  3. Via Oyako I can (fairly concisely) fetch posts and get the associated tags, comments etc. But I can also easily fetch all tags, or all comments, since they're in their own tables. The "relational" part of RDBMS does come in handy sometimes.

Summary version

I switched to TC to begin with because I was using SQL wrong, and it was too slow and clumsy. Once I figured out how to use SQL correctly, it was a no-brainer to go back.

June 29, 2010 @ 4:32 AM PDT
Cateogory: Programming


Quoth baz on June 29, 2010 @ 5:40 AM PDT

I have to compile and install Tokyo Cabinet by hand on my Linux distro.

What distro are you using? Seems like most of the major distros package it:

Quoth Brian on June 29, 2010 @ 5:51 AM PDT

A very old version of Debian, then a somewhat old version of Ubuntu. I don't think tokyocabinet-bin comes with the Java bindings that Clojure needs. The Java lib still needs to be compiled.

This could be different now, I haven't checked in around 6 months.

Quoth Tim on June 29, 2010 @ 6:43 AM PDT

For more complicated queries, you can use Tyrant with the Lua extension.... the Lua serverside scripts get JIT compiled so it's fast.

Quoth Achiel on June 29, 2010 @ 7:55 AM PDT

(note before reading the rest: I use couchdb with python/django mostly)

Have you tried couchdb? quickgoogle gave me, no idea on stability.

The map/reduce features that it offers will probably address your querying issues. As to your data validation issues (issue 1 on your list), the problem with key/value DBs is that they accept anything, that's where part of the performance boost comes from. I'm not really familiar with clojure, but I know many k/v datastores offer middleware for languages that'll enable you to automate this.

Quoth Brian on June 29, 2010 @ 8:15 AM PDT

I didn't try couchdb, no. I don't think that Clojure library existed when I wrote the TC version of this site, or else it existed and I never found it.

If I switch a fourth time, I'll keep it in mind though. I pray that won't be necessary. There are too many options to assess. I'd spend a year just trying them all to see which work the best.

Quoth Bill on June 29, 2010 @ 3:04 PM PDT

So why did you switch from MySQL over to PG?

Quoth Brian on June 29, 2010 @ 3:41 PM PDT

Just wanted to learn it. I like how Postgres is more strict in certain areas (case-sensitive string comparisons etc.) I like the user/role system it uses. But no real reason.

Quoth Anonymous on June 29, 2010 @ 8:38 PM PDT

I am sorry but you can't just blame bad programming practises for your failure of use tokyo cabinet.

Quoth greg on June 30, 2010 @ 12:04 AM PDT

I do fully agree with you, NoSQL and SQL databases are not meant to do the same things ! Key value stores are, from my point of view perfect to store flat data which need fast read access RDMBS (ie not mysql) are great for transactionnal processes and structured data

Quoth Anonymous on June 30, 2010 @ 5:31 PM PDT

IMHO you shouldn't have tried to use TC for this use case to begin with. TC or other non-relational databases are not a silver bullet that will solve any problem you throw at it. Each database was designed for a specific purpose. There are key-value stores, document stores, multi-column sotres and more. As always you need to select the right tool for the job, and it's really important to read the documentation and use the tools wisely, you wouldn't try to insert a screw with a hammer.

Chris W
Quoth Chris W on July 01, 2010 @ 4:09 AM PDT

Clojure is a JVM language, right? Couldn't you use an existing, industrial strength ORM like Hibernate?

Quoth Curtis on May 03, 2011 @ 3:52 AM PDT

Great post.

Thank you for being willing to make misyakes, try new tech and share your lessons.