5 Posts Tagged 'Regex'

Vim regex - remove kind-of-matching lines

I have a file where every line starts with a number (followed by whitespace and a bunch of other stuff). Every number appears on either one or two lines, and if two, the second line always has a b after the number.

I need to delete every line for which there's a corresponding b line. But if there's no corresponding b line I want to leave the original line there.

Before:

123 foo bar
456 blarg
789 quux
123b foo baz
789b quux blurble

After:

123b foo baz
456 blarg
789b quux blurble

Except in my real file, I have a thousand lines and it'd take a year to do by hand. Vim to the rescue:

:sort
:%s/^\v((\d+).*\n)(\2b.*)/\3/

And that is why Vim is awesome. Can you think of a shorter way to do this, in Vim or Emacs?
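For checking the result outside the editor, here's a rough Python sketch of the same rule. It is not a translation of the Vim command: it collects the numbers that have a b line and drops the plain line for each one, so it doesn't rely on sorting. The function name is made up for illustration, and it keeps the original line order instead of the sorted order the Vim version leaves behind.

```python
import re

def drop_superseded(lines):
    """Drop every 'N ...' line for which an 'Nb ...' line also exists."""
    # Collect the numbers that show up with a trailing b ("123b foo" -> "123").
    has_b = set()
    for line in lines:
        m = re.match(r"(\d+)b(\s|$)", line)
        if m:
            has_b.add(m.group(1))

    kept = []
    for line in lines:
        m = re.match(r"(\d+)(\s|$)", line)
        # A plain numbered line is dropped only if its b twin exists.
        if m and m.group(1) in has_b:
            continue
        kept.append(line)
    return kept
```

Sorting the returned list should give the same lines, in the same order, as the "After" listing above.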

June 20, 2010 @ 10:49 PM PDT
Category: Programming
Tags: Vim, Regex

Let's parse

Is there anything more fun than parsing strings? I submit to you that there is not. I'm currently reading my way through Parsing Techniques - A Practical Guide, which has a first edition free online. (I'm hoping Santa brings me a copy of the 2nd edition this year.)

This is a good book, with enough math to be rigorous but not so much that it's completely unreadable. It starts from the absolute basics ("What's a grammar?") and goes through the Chomsky hierarchy and then dives into parsing techniques in great detail, in a language-agnostic way.

Languages and grammars are fascinating. In high school I studied Spanish, French, Latin and German, largely in my spare time. When I was 16, if people asked what I wanted to do for a living, I said "translator".

The plan to become a translator failed partly because the quality of my early education was horrendous and partly because mastering a language is extremely difficult and at 16 I wasn't motivated enough. And then computers showed up in my life, which gave me a never-ending supply of languages to play with, while being fun (and profitable) in so many other ways. But I still took two years of Japanese classes in college for no reason other than enjoyment, and I'm still trying (and failing) to learn Japanese in my spare time 8 years later.

Perl was my first favorite language probably for no reason other than regular expressions. I can understand how people call PCRE syntax line-noise, but to me it's beautiful line noise. I live and breathe regular expressions nowadays. My favorite CS class in college was one where we went through and laboriously built finite-state automata and pushdown automata and Turing machines. Seeing the equivalence of these simple machines with the different classes of grammars was a huge epiphany. Such a simple concept with such huge consequences.

Dijkstra said:

Besides a mathematical inclination, an exceptionally good mastery of one's native tongue is the most vital asset of a competent programmer.

I strongly agree with that sentiment. People tell me at times that I'm good at written communication. I have my doubts, and anyways I find it funny because I'm so terrible at verbal communication. I think if I have any success at writing, it's because I view writing as a mechanical process.

I told a prof in college once that I felt like my papers wrote themselves once I had an idea in mind. There are rules of grammar and style, and you learn them and follow them, or break them deliberately if you have a good reason to. You write some prose, then you debug it until it "works" mentally. I don't care about typos and I split infinitives and comma-splice on purpose, but ambiguous or awkward phrases usually stand out to me like compiler bugs in my brain.

What's more important than language? Few things. Language is important enough to be nearly hard-wired into our brains. Children learn it instinctively. Human beings can still easily and effortlessly out-perform the best supercomputer at the task of parsing and interpreting speech. We think in words. The programming languages computers understand are dirt-simple by comparison, but writing code still feels like writing "thoughts for the computer" sometimes.

There are very few times you'll hear me say "What a wonderful world we live in". But one of those times is when I have the opportunity to explore an area of study like language. It's such an enjoyable experience to struggle and try to master such a thing. It's an amazing universe where we have these weird little rules and they work and we can understand them and manipulate them and produce things with them.

December 15, 2009 @ 8:46 PM PST
Category: Programming

Now I have two problems

I'm converting one of my websites from Ruby on Rails to Clojure in my spare time. I stupidly put a bunch of RoR-style links inline into certain bits of plaintext content, so in my DB there are a bunch of text fields with <%= link_to ... %> in the middle.

It was easy to fix with a regex though:

(defn clean [txt]
  (re-gsub #"<%=\s*link_to\s+(\"[^\"]+\"|'[^']+')\s*(?:,\s*'([^']+)'\s*)?(?:,\s*image_path\(['\"]([^'\"]+)['\"]\)\s*)?(?:,\s*:controller\s*=>\s*(?::(\S+)|['\"]([^\"']+)['\"])\s*)?(?:,\s*:action\s*=>\s*(?::(\S+)|['\"]([^\"']+)['\"])\s*)?(?:,\s*:id\s*=>\s*(?:(\d+)|:(\S+)|['\"]([^\"']+)['\"])\s*)?\s*%>"
           (fn [[_ s & parts]]
             (let [href (str-join "/" (filter identity parts))]
               (str "<a href=\"/" href "\">" (re-gsub #"^[\"']|[\"']$" "" s) "</a>")))
           txt))

And by easy, I mean not easy.

Note to self, try something other than a regex next time.

Note to self, don't bury some framework's funky-syntax DSL in the middle of plaintext content. Next time use HTML or do the conversion from DSL to HTML early rather than late.

Silly how two years ago I thought I'd be using Ruby for that site forever.

September 25, 2009 @ 11:02 PM PDT
Category: Programming

Vim regexes are awesome

Two years ago I wrote about how Vim's regexes were no fun compared to :perldo and :rubydo. Turns out I was wrong, it was just a matter of not being used to them.

Vim's regexes are very good. They have all of the good features of Perl/Ruby regexes, plus some extra features that don't make sense outside of a text editor, but are nonetheless very helpful in Vim.

Here are a few of the neat things you can do.

Very magic

Vim regexes are inconsistent when it comes to what needs to be backslash-escaped and what doesn't, which is the one bad thing. But Vim lets you put \v to make everything suddenly consistent: everything except letters, numbers and underscores becomes "special" unless backslash-escaped.

Without \v:

:%s/^\%(foo\)\{1,3}\(.\+\)bar$/\1/

With \v:

:%s/\v^%(foo){1,3}(.+)bar$/\1/

Far easier to read. Along with \c and \C to turn case sensitivity off and on, these are good options to make a habit of prepending to regexes when needed. It eventually becomes second nature. See also :h /\v
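For comparison, the \v form reads almost exactly like the Perl-style syntax other languages use. Here's a small Python sketch of the same substitution; the function name and test strings are invented for illustration.

```python
import re

# Python's (?:...) plays the role of Vim's %(...): a group that isn't captured.
PATTERN = re.compile(r"^(?:foo){1,3}(.+)bar$")

def strip_wrapping(line):
    """Keep only what sits between the leading foo's and the trailing bar."""
    return PATTERN.sub(r"\1", line)
```

A line like foofooXYZbar should come back as XYZ, and a line that doesn't match is returned untouched.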

Spanning newlines

One thing that :perldo and :rubydo can't do is span newlines; you can't combine two lines and you can't break one line into two.

But Vim's regexes can span newlines if you use \_. instead of a plain dot. I find this to be a lot more aesthetically pleasing than Perl's horrible s and m modifiers tacked onto the end of a regex. For example, this strips the <body> tags from a text document:

:%s@<body>\v(\_.+)\V</body>@\1@

(Note: in real life, never use a regex to parse HTML or XML. Down that path lies madness. The above is OK because I'd expect only one <body> tag to appear in any document.)

(Note^2: being able to turn on and off magic in the middle of a regex is awfully helpful.)

(Note^4: You can use arbitrary delimiters like @ for the regex, which is useful if your pattern includes literal /'s.)

See also :h \_.
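For anyone more used to the modifier style, this is roughly what the same substitution looks like in Python, where re.DOTALL does the job of Perl's s modifier. It's only a sketch: the function name is made up, and like the Vim version it assumes a single <body> element.

```python
import re

# re.DOTALL lets . match newlines too, which is what \_. does in Vim.
BODY = re.compile(r"<body>(.+)</body>", re.DOTALL)

def strip_body_tags(text):
    """Remove the <body> and </body> tags but keep everything between them."""
    return BODY.sub(r"\1", text)
```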

\zs

Vim lets you demand that some text match, but ignore that text when it comes to the substitution part. This is handy for certain specific kinds of regexes. Normally if you want to match some text and then leave it alone in the substitution, you have to capture it and then put it back manually; \zs lets you avoid this.

Say you want to chop some text off the end of a line, but leave the rest of the line alone. Normally you'd have to do this:

:%s/\v^(foobar)(baz)/\1/

to put the foobar back. Of course you can also use a zero-width lookbehind assertion:

:%s/\v(^foobar)@<=baz//

But that's even more line-noise. This is the easiest way:

:%s/^foobar\zsbaz//

See :h /\zs. (And :h /\@<= if you're so inclined.)
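Python's re module has no direct equivalent of \zs, so there you're stuck with the first two approaches. A quick sketch of both, with made-up function names:

```python
import re

def chop_with_capture(line):
    # Capture the part to keep and put it back by hand.
    return re.sub(r"^(foobar)baz", r"\1", line)

def chop_with_lookbehind(line):
    # Match baz only when foobar sits right before it at the start of the line.
    return re.sub(r"(?<=^foobar)baz", "", line)
```

Both should turn foobarbaz into foobar and leave any other line alone.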

Expressions

Using \=, you can put arbitrary expressions on the right side of a regex substitution. For example say you have this text:

~/foo ~/bar

If you do this:

:%s/\v(\S+)/\=expand(submatch(1))/g

You end up with:

/home/user/foo /home/user/bar

Because you can also call your own user-defined functions in the expression part, this can end up being pretty powerful. For example it can be used to insert incrementing numbers into arbitrary places in your text. See :h sub-replace-\=.
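The same idea exists in most regex libraries as a replacement callback. Here's a hedged Python sketch of both uses mentioned above. The function names and the # marker are invented for illustration, and os.path.expanduser only handles the ~ case, which is much less than Vim's expand() does.

```python
import itertools
import os.path
import re

def expand_paths(text):
    # Like \=expand(submatch(1)): the replacement is computed for each match.
    return re.sub(r"\S+", lambda m: os.path.expanduser(m.group(0)), text)

def number_markers(text, marker="#"):
    # Replace each marker with 1, 2, 3, ... in order of appearance.
    counter = itertools.count(1)
    return re.sub(re.escape(marker), lambda m: str(next(counter)), text)
```

Calling number_markers("item # and item # and item #") should give "item 1 and item 2 and item 3".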

And so on

Read :h regexp if you haven't already. Tons of other features in there that can make your life easy if you manage to internalize them. It is difficult to get used to Vim's funky syntax if you're very familiar with Perl/Ruby-style regexes, but I think it's worth it. Only took me two years! (OK, more like a couple days of concerted effort after a year-and-a-half delay.)

April 18, 2009 @ 2:47 PM PDT
Category: Programming
Tags: Vim, Regex

Vim - escaping quotes

This Vim regex escapes (by doubling) every double-quote on a line except the first one, last one, or any that are already doubled:

:s/\v(^[^"]*)@<!"@<!""@!([^"]*$)@!/""/g

Sometimes I kind of understand that old humorous quote: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. But regexes are still pretty darn useful. I can't imagine a good replacement for them that wouldn't have all the same problems with escaping and magic characters and whatnot, without the replacement being so verbose that no one would ever use it.
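As a point of comparison, Python's re module can't express that pattern directly, because its lookbehinds have to be fixed-width. So here is the same rule written as a plain loop. It's a sketch with a made-up function name, not a claim that this is the better way to do it.

```python
def double_inner_quotes(line):
    """Double every double-quote except the first, the last, or any
    quote that already sits next to another quote."""
    first, last = line.find('"'), line.rfind('"')
    out = []
    for i, ch in enumerate(line):
        is_inner = ch == '"' and i not in (first, last)
        prev_is_quote = i > 0 and line[i - 1] == '"'
        next_is_quote = i + 1 < len(line) and line[i + 1] == '"'
        if is_inner and not prev_is_quote and not next_is_quote:
            out.append('""')
        else:
            out.append(ch)
    return "".join(out)
```

It's a dozen lines instead of one, which is pretty much the verbosity problem in a nutshell.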

October 16, 2008 @ 4:52 PM PDT
Category: Programming
Tags: Vim, Regex