Getting list of referers out of Apache logs

I use Google Analytics, but it has a noticeable lag in updating its information. When my site is being hammered, I'd like to see where all the traffic is coming from. It'd also be nice to see how many hits my RSS feed is getting, and how many images and static files are being direct-linked, which Google Analytics currently isn't tracking for me at all.

So this script will look in my Apache logs and print referers for some URL, thanks to ApacheLogRegex:

#!/usr/bin/ruby

require 'apachelogregex'

raise "USAGE: #{$0} log_filename desired_url" unless ARGV[0] and ARGV[1]

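# This must match the LogFormat directive that produced the log.
format = '%v:%p %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
parser = ApacheLogRegex.new(format)
pat = Regexp.new(ARGV[1])
refs = Hash.new(0)  # referer => hit count

# Read line by line so a big log isn't loaded into memory all at once.
File.foreach(ARGV[0]) do |line|
  x = parser.parse(line)
  next unless x  # parse returns nil for lines that don't fit the format
  if pat.match(x["%r"])
    refs[x["%{Referer}i"]] += 1
  end
end

# Print the most frequent referers first.
refs.sort_by{|k,v| -v}.each do |ref,count|
  puts "%s: %s" % [count,ref]
end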
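
For anyone who doesn't want to install the gem, here's a rough sketch of the same counting idea using a hand-rolled regex. The LOG_LINE pattern and the count_referers name are my own assumptions, not part of ApacheLogRegex. It only works if the request, referer and user agent are the last three quoted fields (as in the format string above), and it will trip over escaped quotes inside a field. The sample lines are made up.

```ruby
# Assumed pattern: request, status, size, referer, user agent at end of line.
LOG_LINE = /"([^"]*)" \d{3} \S+ "([^"]*)" "[^"]*"\s*\z/

# Count referers for requests whose request line matches url_pattern.
# Returns [[referer, count], ...], most frequent first (ties sorted by name).
def count_referers(lines, url_pattern)
  counts = Hash.new(0)
  lines.each do |line|
    m = LOG_LINE.match(line)
    next unless m                      # skip lines that don't fit the pattern
    request, referer = m[1], m[2]
    counts[referer] += 1 if url_pattern.match(request)
  end
  counts.sort_by { |ref, n| [-n, ref] }
end

# Made-up sample lines, for illustration only.
sample = [
  'example.com:80 10.0.0.1 - - [20/May/2009:22:39:00 -0700] "GET /feed HTTP/1.1" 200 512 "http://a.example/" "Mozilla"',
  'example.com:80 10.0.0.2 - - [20/May/2009:22:39:05 -0700] "GET /feed HTTP/1.1" 200 512 "http://a.example/" "Mozilla"',
  'example.com:80 10.0.0.3 - - [20/May/2009:22:39:09 -0700] "GET /feed HTTP/1.1" 200 512 "-" "Mozilla"',
  'example.com:80 10.0.0.4 - - [20/May/2009:22:40:00 -0700] "GET /about HTTP/1.1" 200 900 "http://b.example/" "Mozilla"',
  'this line is not a log entry'
]

count_referers(sample, /feed/).each { |ref, n| puts "#{n}: #{ref}" }
# prints:
# 2: http://a.example/
# 1: -
```

Passing a pattern like /\.(png|jpg|gif)/ instead of /feed/ would give a rough picture of who is direct-linking images.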

I used to use awstats for this, but it was too heavyweight and a hassle to set up and keep running. Google Analytics is a no-brainer to use, even though the accuracy isn't as good as parsing Apache logs. At least I get an idea of which of my blatherings people are most interested in.

Out of memory... ouch

* This page is related to "Deploying Clojure websites".

I've written before about how I'm running four Clojure-driven websites out of a single JVM on my VPS. No problems for many months, but today I tried to make a blog post and got all kinds of out-of-memory errors. Hopefully I didn't lose any / many user comments on this blog in the past couple days, but it's possible.

I restarted the JVM and gave it a bit more RAM to play with; I imagine this will fix things. But we'll see. It occurs to me now that there may be such a thing as too much caching.

Deploying Clojure websites

* This page is related to "Clojure and Compojure to the rescue, again".

On my server I'm running one Java process, which handles four of my websites on four different domains. These are all running on Clojure + Compojure. Some people asked for details of how to do this, so here's a rough outline. For the sake of brevity I'm only going to talk about two domains here, though it scales up to however many you want pretty easily.

This is surely not the only way to do this, and probably not the best way, but it's what I've arrived at after a year of goofing off.

Summary: Emacs + SLIME + Clojure running in GNU Screen; all requests are handled by Apache and mod_proxy sends them to the appropriate Jetty instance / servlet.

Clojure and Compojure to the rescue, again

I haven't posted here much recently because I've been hacking on another recently-sort-of-completed website. One of my favorite hobbies is old 8-bit video games. The first thing I ever programmed was a website about Final Fantasy for the old NES, and I've fiddled with it for the past 10 years or so.

A while back I decided to rewrite the whole thing using Clojure + Compojure with data in MySQL. This went really well. I know lines of code isn't that great a metric, but it can give a rough estimate: this whole website is done in 3,400 lines of Clojure, which includes all of the HTML "templates" and the DB layer I had to write. And it's Clojure all the way down. The only things not written in Clojure are a couple bits of JavaScript here and there and the stylesheet.

I suspect the target audience of this blog and the target audience of that website don't overlap that much, but I figured someone might be interested in some of the details of how it's implemented. A few things I learned...

Comments work again

I broke the ability to leave comments a couple days ago. Thanks to everyone who let me know. It's fixed now.

I broke it while uploading yet another website I finished a couple days ago. It's yet another Compojure/Clojure site, this time a bit more ambitious than my humble blog. I plan to write about that whole experience once I have a bit of time.

Lame comment spam management that works

It's been nine months since I ditched WordPress and moved to a blog system I wrote from scratch (in Clojure). This was a great move in so many ways. One of those ways is comment spam. My site is as popular now (or maybe slightly more popular now) as it was when I was running WordPress, so I think comparing before and after is valid.

With WordPress, every morning I'd do the ritual of deleting overnight spambot droppings. Typically I got between 1 and 5 every night. I had a default WordPress install and all I used for spam filtering was Akismet. Akismet did a surprisingly good job, catching literally hundreds if not thousands of spams every week which otherwise would've been ruining my site. But inevitably some would still get through. And what's worse, there were a lot more false positives than I could tolerate.

In the roughly 6 months since I started counting with my new system, to the best of my knowledge zero spambot-produced comments have made it through my filters. This is pleasant, to say the least.

The system I'm using is stupid. None of it is stuff I thought of myself; I got ideas from lots of other blogs and articles I read, but the implementation is mine and it's not sophisticated. It would take a bot author a few seconds to work around it. But no one has bothered. Why bother writing a bot for my one-man blog, when you can write a bot for WordPress and have it work on tens of thousands of blogs? And I can change my system to defeat the bots with a few lines of code just as easily as they can work around it.

So here's why I think it's working.

Godaddy sucks

I'm in the process of moving all my domains the heck off of Godaddy. I'm trying Namecheap which seems slightly less evil, if the sheer amount of ad banners and upselling bullcrap is any indication. But probably only slightly less evil.

Honestly Godaddy has so many ads I can't even find the button to renew my domains. The process of buying anything takes you through 6 or 7 pages of the most garish, fanatical sleaze-peddling that you are likely to encounter on a website.

Domain registrars are the used car salesmen of the internet.

Blog source code updated

I updated the source code of my blog on github. I'm too tired to write much about it at the moment.

Suffice it to say I rewrote it all from scratch for the purpose of sharing, because lots of people were asking for it. It uses Tokyo Cabinet instead of MySQL now, which is nice. I gutted the codebase so it's about 700 lines now (down from 1500, not bad). I plan to write up some posts later exploring various parts of it, for those who are interested.

Hope someone gets something out of it. Use at your own risk.

Moved to Linode

My web host for a good long while was Futurehosting. My OS was Debian 4.0 (Etch). Strike one: as of now there's still no option to upgrade to a newer version of Debian. Debian lags so much to begin with that it's really painful if you want to use anything released in the past two years.

I had an unmanaged VPS. I ran a bunch of funky non-standard stuff on there and it ran mostly OK. I had to upgrade to get more RAM just so SBCL would run on it, which sucked but I don't know that another host would've been any better.

The good thing about Futurehosting was that they responded very fast to tickets. The bad thing was the fact that I had ample opportunity to know this. The server would go down randomly once every month or two. I'd open a ticket saying "Hi my server is down", then things would be working again in a half hour, but why did this happen so often? I don't know. An awful lot of "failed switches". I wonder how often this happened without my knowing about it, given how often it happened in the middle of my using the server for something.

With all the hardware they were burning through I would've expected upgrades or price reductions over time, given that I was a steady customer for so long and that disk space and memory keep becoming cheaper and cheaper in the world. But the prices always stayed the same, which was another strike.

Being hosted there was annoying but never annoying enough to switch. And migrating all of my sites and data to another server seemed like a huge pain. Momentum: the worst enemy of progress.

I moved to a new host on a whim recently: Linode. It was far less painful than I expected. Thanks to Linux and plaintext config files, it was mostly an SCP-it-all-over and tweak process. It took me one evening and a bit of time the next morning. Linode offers a lot of OSes, which is also nice.

I pay less for Linode than I did at FH (and I get fewer resources at Linode, but I don't need much). Thus far I'm astonished how much faster things are running on the server. Even goofing off at a terminal, the shell is more responsive. My email loads instantly in kmail instead of lagging for a second. I never knew what I was missing. Linode's DNS control panel is also pretty braindead simple to use.

Futurehosting gets a C+ from me. It worked and my website existed, but it didn't knock my socks off. Hopefully Linode is better.

Who needs a DB?

My blog is still working, in spite of my best efforts to crash it. So that's good. But lately I've been thinking that an SQL database is a lot of overkill just to run a little blog like this.

My blog only has around 450 posts total (over the course of many years), and about an equal number of user comments (thanks to all commenters!). Why do I need a full-blown database for that? All of my posts plus comments plus all meta-data is only 2 MB as a flat text file, 700k gzipped.

By far the most complicated part of my blog engine is the part that stuffs data into the database and gets it back out again in a sane manner (translating Clojure data to SQL values, and back again; splitting up my Clojure data structures into rows for different tables, and then re-combining values joined from multiple tables into one data structure). Eliminating that mess would be nice.

Inevitably I ended up with some logic in the database too: enforcing uniqueness of primary keys, marking some fields as NOT NULL, giving default values and so on. But a lot of other logic was in my Clojure code, e.g. higher-level semantic checking, and some things I wanted to set as default values were impossible to implement in SQL.

Wouldn't it be nice for all the logic to be in Clojure? And the data store on disk to be a simple dump of a Clojure data structure? I can (and did) write a few macros to give me SQL-like field declaration and data validation, for uniqueness of IDs and data types etc. For my limited needs it works OK.

The next question is what format to use for dumping to disk. Happily Clojure is Lisp, so dumping it as a huge s-exp via pr-str works fine, and reading it back in later via read-string is trivial.

Some Java data types can't be printed readably by default, for example java.util.Dates, which print like this:

#<Date Wed May 20 22:39:00 PDT 2009>

The #<> reader macro deliberately throws an error if you try to read that back in, because the reader isn't smart enough to craft Date objects from strings by default. But Clojure is extensible; you can specify a readable-print method for any data type like this:

(defmethod clojure.core/print-method java.util.Date [o w]
  (.write w (str "#=" `(java.util.Date. ~(.getTime o)))))

Now dates print as

#=(java.util.Date. 1242884415044)

and if you try to read that via read-string, it'll create a Date object like you'd expect.

user> (def x (read-string "#=(java.util.Date. 1242884415044)"))
#'user/x
user> (class x)
java.util.Date
user> (str x)
"Wed May 20 22:40:15 PDT 2009"

Storing data in a plain file has another benefit of letting me grep my data from a command line, or even edit the data in a text editor and re-load it into the blog (God help me if that's ever necessary).

Having multiple threads banging on a single file on disk is a horrible idea, but Clojure refs and agents and transactions handle that easily. But I do have to work out how not to lose all my data in case the server crashes in the middle of a file update. (I've lost data (in a recoverable way) due to a server crash in the middle of a MySQL update too, so this is a problem for everyone.) Perhaps I'll keep a running history of my data, each update being a new timestamped file, so old files can't possibly be corrupted. Or use the old write-to-tmp-file-and-rename-to-real-file routine. Or heck, I could keep my data in Git and use Git commands from Clojure. It'd be nice to have a history of edits.
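
The write-to-tmp-file-and-rename routine is simple enough to sketch. This one is in Ruby (same as the referer script) rather than Clojure, and it's only a rough illustration, not what the blog actually runs. The atomic_write name and the blog-data.clj file name are made up for the example. The idea is that the temp file lives in the same directory as the real file, so the final rename stays on one filesystem, where POSIX systems treat it as atomic.

```ruby
require 'tmpdir'

# Write data to path so that a crash leaves either the complete old file or
# the complete new one, never a half-written mix (hypothetical helper).
def atomic_write(path, data)
  tmp = "#{path}.tmp.#{Process.pid}"   # same directory as the target
  File.open(tmp, 'w') do |f|
    f.write(data)
    f.flush
    f.fsync                            # push the bytes to disk before renaming
  end
  File.rename(tmp, path)               # replaces the old file in one step
ensure
  # If anything failed before the rename, don't leave the temp file around.
  File.delete(tmp) if tmp && File.exist?(tmp)
end

# Try it out in a throwaway directory.
Dir.mktmpdir do |dir|
  store = File.join(dir, 'blog-data.clj')
  atomic_write(store, '{:posts []}')
  atomic_write(store, '{:posts [{:id 1}]}')
  puts File.read(store)                # prints {:posts [{:id 1}]}
end
```

This only protects against half-written files. It doesn't give any history of edits, which is where the timestamped-file or Git ideas would come in.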

If this idea works out I'll upload code for everything to github, as usual.