Lame comment spam management that works

It's been nine months since I ditched Wordpress and moved to a blog system I wrote from scratch (in Clojure). This was a great move in so many ways. One of those ways is comment spam. My site is as popular now (or maybe slightly more popular now) as it was when I was running Wordpress, so I think comparing before and after is valid.

With Wordpress, every morning I'd do the ritual of deleting overnight spambot droppings. Typically I got between 1 and 5 every night. I had a default Wordpress install and all I used for spam filtering was Akismet. Akismet did a surprisingly good job, catching literally hundreds if not thousands of spams every week which otherwise would've been ruining my site. But inevitably some would still get through. And what's worse, there were more false positives than I could tolerate.

Since I started counting with my new system, which was around six months ago, to the best of my knowledge I've gotten zero spambot-produced comments that made it through my filters. This is pleasant, to say the least.

The system I'm using is stupid. None of it is stuff I thought of myself; I got the ideas from lots of other blogs and articles I read, but the implementation is mine and it's not sophisticated. It would take a bot author a few seconds to work around it. But no one has bothered. Why bother writing a bot for my one-man blog, when you can write a bot for Wordpress and have it work on tens of thousands of blogs? And I can change my system to defeat the bots with a few lines of code just as easily as they can work around it.

So here's why I think it's working.

1. It's not Wordpress

Just by using something slightly different from Wordpress, I think I'm already ahead. For example, if you have a blog where a form posts comments to /wp-comments-post.php, a bot doesn't even need to look at your site to spam you. It can blast your server with POST data at that URL in a format it already knows Wordpress will accept. My site is all custom code, so everything is different enough that default bot attempts fail immediately.

I think this is the reason that only 1853 spam comments have even been POSTed at me in the last six months. That's an improvement of one or two orders of magnitude already.

2. Honeypot text field

So what about the comments that are actually POSTed? They are presumably the result of bots that parse sites' HTML looking for comment forms and try to POST data that satisfies the form.

So in my comment form I have a field called referer. A "How did you find my site?" kind of thing. In fact I don't care how you found this site, this field is a honeypot. The div containing this field is hidden via CSS.

div#referer-row {
    display: none;
}

So you shouldn't ever see it if you're a human using a browser. But bots parsing the HTML see this field, unless they also bother to parse my CSS and see that it's hidden, which would be expensive and apparently they don't do it.

If you put anything in this referer field, your comment will be rejected as spam. Simple enough.
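
In Clojure, that check amounts to something like this (a simplified sketch rather than my actual handler code; assume params is the map of POSTed form fields, keyed by name):

(require '[clojure.string :as str])

;; Sketch: any non-blank value in the hidden "referer" field
;; means something filled in a field no human should ever see.
(defn honeypot-spam? [params]
  (not (str/blank? (get params "referer"))))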

Many blogs require you to fill in every field or else the comment is rejected, so it seems reasonable to expect most bots to fill in all of your form fields. (My blog actually requires you to fill in nothing but the comment text; author will be set to Anonymous Cow if you don't fill it in.)

In fact this seems to be the case; of 1853 spam comments since March, 1810 put something into this field. Most of the time it's a random string of letters. Not even a URL. Sometimes it's a couple words like "insurance quotes" or something about drugs or casinos.

The downside of this is if you're a human using a browser that doesn't understand CSS, you will see this field. Then if you type something into it and try to comment, it'll end up as spam. So Lynx users and time travelers from 1987 trying to leave me comments might be confused at first.

However as far as I can tell, no intelligible data has ever been entered into this field by a human, so I don't think it's a concern. Six times the word "None" was entered, but I don't think those were humans, because "None" is a nonsense answer to "How did you find this site?". But you never know.

3. Lame static CAPTCHA

That leaves 43 spam comments that made it this far. My other anti-spam measure is a word you have to type. But it's always the same word, and the word is COWS. This CAPTCHA caught the remaining 43. It looks like this:

[CAPTCHA image: the word "COWS" in plain black-and-white text on a cow-spot background]

There's a normal text field with a default value of <= Type this word specified right in the HTML.

<input type="text" value="<= Type this word" name="test" id="test"/>

There are no other instructions besides "Type this word". I'm assuming that commenters are either familiar enough with CAPTCHAs to know what I want, or can figure it out using common sense. Given that my target audience is computer geeks and programmers, this should be a safe assumption. In fact I've had fewer than a dozen false positives in the past six months from people failing this; see below for details.

To post a comment, the value of this field must contain the word "COWS" somewhere in it, case-insensitively. Otherwise it's spam. Easy enough to implement.
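
In other words, something like this (again a simplified sketch, not the actual handler code):

;; Sketch: accept the comment only if the "test" field contains
;; "cows" somewhere in it, ignoring case.
(defn captcha-ok? [params]
  (boolean (re-find #"(?i)cows" (get params "test" ""))))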

If you have Javascript enabled, clicking on this field will clear out the default value. If you unfocus the field without typing anything, Javascript will put the default value back in. This is only for convenience. If you don't have Javascript enabled, you have to highlight and backspace over the default text. I don't think this is a huge burden.

Of 1853 spam comments, here's the breakdown for what values end up in this field.

1131: "&lt;= Type this word"
691:  "<= Type this word"
21:   Random letters and numbers
6:    empty
2:    A bunch of URLs
2:    Human beings making typos, e.g. "COW" or "COS".

So most bots are too stupid even to remove the default value from the field, and none of them entered the correct value. 691 times the bot was somehow smart enough to un-escape &lt; into <, which is interesting, but didn't help it defeat the filter. A lot of the random words look like they were made by Markov chains, e.g. 'fridwolfur' and 'lyndonvolk' and 'calbertdom'. If I ever need to write a children's poem I'll know where to look for ideas. One time a bot or spammer managed to type "vows" somehow, but that might or might not be a coincidence.

I think this is better than a normal CAPTCHA because:

  1. It's always the word "COWS" in a normal font, so it requires no thought or eye strain to figure out.
  2. It's black and white, so hopefully people with minor vision problems and color blindness can see it.
  3. It fits thematically with my blog layout (it's COWS and it has cow spots).
  4. It's kind of silly, so hopefully people chuckle rather than become ticked off.

It's worse than a real CAPTCHA because it requires no effort to break. So it wouldn't work for Wordpress or VBulletin or something with a million users. But I wonder: if every Wordpress install had a single static word as a CAPTCHA, with a different word for every blog (generated at install time, maybe), would it work better or worse than the random mangled multi-color CAPTCHAs no one can read? Real randomly-generated CAPTCHAs don't work anyway; bots can already beat them via OCR or other means. A simple word would be less annoying for a human, to be sure.

The other downside is that this is not very accessible to the blind or other people using screen readers, or browsers without image support. This is unfortunate and I'm still trying to figure out how to get around this. Right now the ALT text for the CAPTCHA image is This says 'COWS'; I don't know if this is enough help for people in those situations.

Of course I'll never know how many people see my CAPTCHA and storm away in a rage without even trying to post a comment. But I've never heard a complaint. If this level of CAPTCHA ticks you off personally, please swallow your anger and leave me a comment here saying so, if you feel obliged; I'd love to hear it.

False positives

As best I can tell, there are no false positives from people filling in the honeypot field. But simple as it is, some people don't succeed at the CAPTCHA. Either they typo it or they ignore it entirely.

I just checked and I counted around 6 comments by real humans where the CAPTCHA was ignored and the default <= Type this word ended up in the spam DB. 4 of those people re-posted their comment successfully immediately afterwards by filling in the CAPTCHA. I'm not sure I'm ever going to get much better than that.

Spam that makes it through

I have still gotten spam. Maybe a dozen or so in the past six months. It's all been in the form of a human typing a normal-looking and relevant comment, about open source software or BASH for example, but with a spammy URL buried in it, e.g. a link to a really dodgy-looking blog trying to sell something, or some scummy SEO site. It's either a human or a very sophisticated (or lucky) bot; the comment text in these is indistinguishable from a real comment other than the spam URLs. I have to delete these by hand.

But I was getting these with Wordpress too. No automated anti-spam system is going to defeat a human being, so I don't worry about it.

That's it

The moral of this story is that it doesn't take much to protect yourself from comment spam if you write the code yourself. As long as it's unique, you'll probably be fine.

The other moral is that you don't have to annoy the hell out of your users to filter spam effectively. I'm making the assumption here that my COWS method is not that annoying; tell me if I'm wrong.

I don't know how well this scales. Probably not so well. My blog isn't that highly trafficked. If my site were more popular it might be worse for me. But the improvement over Wordpress is unquestionable.

I've seen all kinds of complicated measures suggested elsewhere, like trying to predict if it's a bot by how many milliseconds pass between page load and comment posting, or measuring keypress speed, or escaping the HTML of your forms and un-escaping it at load time via Javascript, or setting and retrieving cookies and such. But a lot of this stuff seems fragile, and if your browser doesn't support Javascript or cookies (or your users block them), you're screwed. I block these things myself, so I expect some of my visitors do the same.

If everyone wrote their own blog engines, the world would be a slightly less spammy place. Or else we'd have much smarter bots.

December 05, 2009 @ 6:34 PM PST
Category: Programming

8 Comments

Michael
Quoth Michael on December 05, 2009 @ 9:40 PM PST

Have you got an estimate of how long it took you to code the blog? And why did you decide not to invest that time in tweaking WP?

Brian
Quoth Brian on December 06, 2009 @ 4:08 AM PST

A couple months' worth of weekends. I went with something custom because

  1. I don't like PHP
  2. I wanted a challenge
  3. There were way too many things to change, like Markdown support in comments and properly escaping "code" tags and the comment spam and so on
  4. Tweaking Wordpress means rewriting all your tweaks nearly every time Wordpress is upgraded, so it's an endless maintenance nightmare
  5. Wordpress has too many features I didn't care about, the admin interface is too complicated, writing themes for it was too much work, etc. I wanted to get rid of the bloat.

Anonymous Cow
Quoth Anonymous Cow on December 06, 2009 @ 7:50 AM PST

I'm tempted to spam you now ;)

Bleys
Quoth Bleys on December 06, 2009 @ 7:38 PM PST

You should mock ./ by using "Anonymous COWard" as the default name.

((No, you shouldn't))

John
Quoth John on December 19, 2009 @ 12:13 PM PST

In the ChangeLog section of the READM for cows-blog, it says, "No more CRUD. Tokyo Cabinet. Removed cows." What do you mean by no more CRUD? You're still creating, reading, updating, and deleting, no? Not seeing the distinction here.

Also could you comment on why you chose Tokyo Cabinet over some other dbm implementation?

John
Quoth John on December 19, 2009 @ 12:17 PM PST

Typos:

s/READM/README/

s/cows-blog/cow-blog/

Also, I won't ask how many cows you removed, or how they got in there in the first place. Perhaps the Emacs Clojure major mode automatically inserts cows when it thinks you're not looking. :)

Brian
Quoth Brian on December 19, 2009 @ 12:50 PM PST

Previously I was using a CRUD library I wrote myself on top of mysql. I meant that I ditched that library. I could never figure out how to make it thread-safe; that was my main motivation for rewriting everything to use TC. The code now uses agents to do all the ref-persisting thread-safely and with a lot less effort.
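
Roughly the idea (a simplified sketch, not the actual blog code; store-put! is just a stand-in for the real Tokyo Cabinet write):

;; Funnel every write through a single agent, so saves happen
;; one at a time, off the request threads.
(def db-writer (agent nil))

(defn store-put! [key value]
  ;; Stand-in for the real Tokyo Cabinet write.
  (spit (str key ".edn") (pr-str value)))

(defn save! [key value]
  (send-off db-writer (fn [_] (store-put! key value))))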

I used TC because it's pretty much the first library I tried and it worked. It's extremely lightweight and simple. It's not distributed, you don't have to use any weird serialization format; you just throw a string into the DB under some key and fetch it out later. There's no server to run (unless you want one), just a flat file and some C libraries to play with it.

I don't doubt that another library would work just as well. If I had to scale this website up to be run on multiple machines or something, I might pick another, but my blog is small and TC is enough.

Removing cows was mostly simplifying the layout and removing the cow spot background and "COWS" CAPTCHA. I wish Emacs inserted cows into my code, that would clearly be awesome.

Roy
Quoth Roy on March 29, 2010 @ 2:19 PM PDT

Darn it, I missed the window, I liked the original concept of all-in-clojure simplicity. I forked cow-blog to try to use it on my own site as a very simple way to get myself continuously working with Clojure. But now I have to get a sense of Tokyo Cabinet and stuff but I'm not so sure that I want to learn Tokyo Cabinet and configuring stuff like that if there's any complexity to it. I'm sure it's simple an' all, but having to read through docs for the setup, bah.