Vim and plaintext data files

I use Vim to work with plaintext datasets. Here are some habits and code snippets I've picked up which make data files a bit easier to deal with.

Tab-delimited files

If you have a file full of tab-separated values, the columns may not line up very well due to the variable display-length of tabs. You'll often end up with something that looks like this:

foo bar
longvaluehere    quux

But things will line up if you set tabstop to a value greater than the width of the longest field:

:setlocal tabstop=16

Then it'll look like this:

foo             bar
longvaluehere   quux

Now you can use visual-block mode to add and remove columns easily.
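Rather than guessing the tabstop value, it can be computed from the buffer. The following is only a sketch: FitTabstop is a hypothetical name that does not appear in the original post, and it assumes a Vim new enough to provide strdisplaywidth().

```vim
" Hypothetical helper: set 'tabstop' just past the widest tab-separated field.
function! FitTabstop()
    let widest = 0
    for text in getline(1, '$')
        " The third argument keeps empty fields, so consecutive tabs are handled.
        for field in split(text, "\t", 1)
            let widest = max([widest, strdisplaywidth(field)])
        endfor
    endfor
    " One extra column so even the widest field is followed by a gap.
    let &l:tabstop = widest + 1
endfunction
```

Run it with :call FitTabstop() after loading the file. It scans every line, so it may take a moment on very large files.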

If you're working with TSV files, you'll want to set the listchars option in such a way that tabs are displayed specially. I use this, stolen/adapted from here:

:set listchars=eol:\ ,tab:»-,trail:·,precedes:…,extends:…,nbsp:‗

Remember also that you can insert a literal tab into your file no matter what you have expandtab set to, by hitting CTRL-v first.

Those things are just about all I need to work with TSV files comfortably.
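If you want these settings applied automatically, something along these lines could go in your vimrc. This is a sketch under assumptions: the *.tsv extension and the tsv_settings group name are illustrative choices, and 16 is just the tabstop value used above.

```vim
" Assumed convention: tab-delimited files end in .tsv
augroup tsv_settings
    autocmd!
    " Keep real tabs, widen them, and show them via 'listchars'.
    autocmd BufRead,BufNewFile *.tsv setlocal noexpandtab tabstop=16 list
augroup END
```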

Dealing with long lines

Sometimes I'll run some data through a statistics program and it'll tell me "Invalid character at line 576, column 9438". How do you jump to that location in your data? Easily enough in normal mode:

576G9438|

G jumps you to a line, | jumps you to a column. A good mnemonic to remember these: "g = goto line" and "| looks like a column".
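The same jump can be wrapped in an Ex command if you prefer typing the numbers as arguments. This is a minimal sketch: Goto and GotoLineCol are hypothetical names, not anything built in. Note one difference from the normal-mode version: cursor() counts bytes within the line, while | counts screen columns, so the two can land in different places on lines containing tabs or multibyte characters.

```vim
" Hypothetical wrapper around the built-in cursor() function.
function! GotoLineCol(lnum, col)
    " {col} is a byte index here, not a screen column as with |
    call cursor(a:lnum, a:col)
endfunction

" Usage: :Goto 576 9438
command! -nargs=+ Goto call GotoLineCol(<f-args>)
```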

Suppose (like me) you turn scrollbars off in your GVim window, or you work from a terminal. How do you scroll horizontally? With zL and zH. But those are hard to type and harder to remember, so I map them to CTRL+Shift+Right and Left.

nnoremap <C-S-Right> zL
nnoremap <C-S-Left> zH
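Two related options can make unwrapped long lines more pleasant when you move the cursor past the edge of the window. The values below are just a starting point I would try, not settings from the original post.

```vim
" Scroll sideways one column at a time instead of recentering the cursor.
set sidescroll=1
" Keep a few columns of context visible to the left and right of the cursor.
set sidescrolloff=5
```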

If you have a lot of long lines, it behooves you to set up a mapping to turn soft line-wrapping on and off quickly:

nnoremap <Leader>w :setlocal wrap!<CR>

Now you can jump back and forth between wrapped and unwrapped views of your data via \w (assuming the default Leader key, backslash).

Diffs on long lines

When using vimdiff, ]c and [c will jump you to lines that contain differences. But if your lines are thousands of characters long, it doesn't always help to know that two lines are different: you want to know where they differ. ]c puts you at the first column in the line, which isn't helpful.

This function will jump you to the column where the difference between two lines starts:

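function! IsDiff(col)
    " diff_hlID() gives the diff highlight group at this column; changed text
    " inside a changed line uses DiffText. Comparing by name avoids the
    " hard-coded ID (24) used originally, which may differ between Vim versions.
    return synIDattr(diff_hlID(".", a:col), "name") ==# "DiffText"
endfunction

function! FindDiffOnLine()
    let c = 1
    " Scan the current line left to right and stop at the first changed column.
    while c < col("$")
        if IsDiff(c)
            call cursor(".", c)
            return
        endif
        let c += 1
    endwhile
endfunction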

This seems fragile, so if you know a better way please leave a comment and let me know.

Diff a buffer against itself

Sometimes I have files with lots of lines of data, and some lines might be duplicates. It's really easy to get rid of the duplicate lines:

:sort u

But what if I want to see what lines were just removed? I don't know of a built-in way to do it, so I use this simple function:

function! MarkDuplicateLines()
    " x records every distinct line seen so far
    let x = {}
    let count_dupes = 0
    for lnum in range(1, line('$'))
        let line = getline(lnum)
        if has_key(x, line)
            " Seen before: flag this line with asterisks
            exe lnum . 'norm I *****'
            let count_dupes += 1
        else
            let x[line] = 1
        endif
    endfor
    echomsg count_dupes . " dupe(s) found"
endfunction

That'll put a bunch of asterisks at the beginning of every line that's a duplicate of a previous line.
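If you would rather not modify the buffer, the same bookkeeping can simply return the line numbers of the duplicates. This is a sketch: DuplicateLineNumbers is a hypothetical name and not part of the original post.

```vim
" Hypothetical non-destructive variant of MarkDuplicateLines().
function! DuplicateLineNumbers()
    let seen = {}
    let dupes = []
    for lnum in range(1, line('$'))
        " Prefix the key so a blank line never becomes an empty dictionary
        " key, which some Vim versions reject.
        let key = 'x' . getline(lnum)
        if has_key(seen, key)
            call add(dupes, lnum)
        else
            let seen[key] = 1
        endif
    endfor
    return dupes
endfunction
```

Try it with :echo DuplicateLineNumbers() before running :sort u to see which lines are about to disappear.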

Fix broken punctuation

Ever had to work with textual data that someone else sent you in MS Word or Excel? I have. It hurts. If you copy/paste from an MS document into a plaintext file, you'll probably end up with a bunch of funky unreadable or undisplayable characters.

This function fixes the most common Word-spawned "smart" punctuation characters that you're likely to run across:

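function! FixInvisiblePunctuation()
    " Curly single quotes
    silent! %s/\%u2018/'/g
    silent! %s/\%u2019/'/g
    " Ellipsis
    silent! %s/\%u2026/.../g
    " Private-use arrow glyph
    silent! %s/\%uf0e0/->/g
    silent! %s/\%u0092/'/g
    " En and em dashes
    silent! %s/\%u2013/--/g
    silent! %s/\%u2014/--/g
    " Curly double quotes
    silent! %s/\%u201C/"/g
    silent! %s/\%u201D/"/g
    " UTF-8 right single quote mis-decoded as Windows-1252. The original post
    " had \%u0052 ("R") as the first character, presumably a typo for \%u00e2.
    silent! %s/\%u00e2\%u20ac\%u2122/'/g
    " Non-breaking space
    silent! %s/\%ua0/ /g
endfunction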

This function was built up during years of choking on special characters that crept up in my data, so I'm sure I'll be adding more to it in the future.

\%u#### here is a regex escape that represents a character by its hexadecimal Unicode codepoint.

Many of these characters show up as blank (whitespace) if you lack the font to display them. If you ever run across a character you can't see and you want to inspect it, put the cursor on it and do

:ascii

That's a bad name for the command, since it works on Unicode characters too. That'll give you the numeric code you can use in a regex to replace them all with something sane.
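To skip the manual step of copying the code into a pattern, a small function can build the escape for you. This is a sketch under assumptions: CharUnderCursorPattern is a hypothetical name, 'encoding' is assumed to be utf-8, and the character is assumed to fit in four hex digits (larger codepoints would need the \%U form instead).

```vim
" Hypothetical helper: return a \%uXXXX pattern for the character under the cursor.
function! CharUnderCursorPattern()
    " \%Nc anchors the match at byte column N; the dot grabs one whole character.
    let ch = matchstr(getline('.'), '\%' . col('.') . 'c.')
    " %% is a literal percent sign; %04x pads the codepoint to four hex digits.
    return printf('\%%u%04x', char2nr(ch))
endfunction
```

For example, :echo CharUnderCursorPattern() prints the pattern, which you can then paste into a :s command.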

July 12, 2010 @ 3:42 AM PDT
Category: Programming
Tags: Vim


Quoth Michał Marczyk on July 19, 2010 @ 8:39 AM PDT

Oh bother, I must have forgotten to provide the secret keyword last time round... I guess an ungulate mammal has eaten my (previous attempt at posting a) comment. Anyway:

You can use ga instead of :ascii. Also, seeing the impressive column number in the | example makes me want to ask if you're doing anything special to deal with super-long lines of this sort...? My experience is that both Vim and Emacs choke pretty badly when dealing with large files with very large average line length. (That's even when I turn syntax highlighting off and clear matchpairs.)

I'll be stealing your listchars and the punctuation fix, by the way -- thanks!

Quoth Brian on July 19, 2010 @ 9:22 AM PDT

Oops, sorry, I haven't checked my comment backlog in a couple days.

I'm looking at a file with 18k-character-long lines right now, and it seems to be running smoothly enough. I can jump to a line or column instantly. I can do searches OK etc. I make sure to leave filetype = blank (none), but I don't do anything special otherwise.

I only start seeing problems if the file itself is extremely large (hundreds of MB). My files tend to have very long lines but only ~1k lines total, which is only 10-20MB, so maybe that's why I don't have problems.

I've had problems in the past with screwy plugins causing massive lag in unexpected situations. I had a Ruby plugin once that would slow my machine to a crawl. It might be worth it to start Vim with a blank config file and see if it's more responsive.

Thanks for the ga reminder, I knew there had to be a normal mode command but I couldn't remember it.

Quoth TorbenH on August 05, 2010 @ 9:54 PM PDT

Like you, I needed a quick way to jump to the column where two files differ when using vimdiff, so I made up a function similar to yours. I see I'm not the only one in need of such a function.

My function requires one to navigate to the line in question (e.g. with ]c ) before using it.

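function! NextChangeCol()
    let l:col = ''
    let l:line = line('.')

    " Collect the listed, loaded buffers; exactly two are expected.
    let l:bufs = filter(range(1, bufnr('$')), 'buflisted(v:val) && bufloaded(v:val)')
    if len(l:bufs) != 2
        throw "Only supported with 2 loaded buffers"
    endif

    let l:str1 = getbufline(l:bufs[0], l:line)
    let l:str2 = getbufline(l:bufs[1], l:line)

    " getbufline() returns a list; give up if either buffer lacks this line.
    " (The body of this check is missing from the post; a plain return is assumed.)
    if len(l:str1) != 1 || len(l:str2) != 1
        return
    endif

    let l:col = Unmatch(l:str1[0], l:str2[0], 0)
    exec "normal" l:col . "|"
endfunction

function! Unmatch(str1,str2, ix)
    let l:max = min([strlen(a:str1),strlen(a:str2)])

    " Walk both strings until the first byte that differs.
    " ==# keeps the comparison case-sensitive even when 'ignorecase' is set.
    let l:ix = 0
    while l:ix < l:max && a:str1[l:ix] ==# a:str2[l:ix]
        let l:ix = l:ix + 1
    endwhile
    return l:ix + 1
endfunction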

I mapped the function to gc:

noremap <silent> gc :call NextChangeCol()<CR>