
Vim and plaintext data files

I use Vim to work with plaintext datasets. Here are some habits and code snippets I've picked up which make data files a bit easier to deal with.

Tab-delimited files

If you have a file full of tab-separated values, the columns may not line up very well due to the variable display-length of tabs. You'll often end up with something that looks like this:

foo bar
longvaluehere    quux

But things will line up if you set tabstop to a value greater than the length of the longest value in any column.

:setlocal tabstop=16

Then it'll look like this:

foo             bar
longvaluehere   quux

Now you can use visual-block mode to add and remove columns easily.
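Picking the tabstop value by eye gets tedious, so here is a minimal sketch that computes it from the buffer. FitTabstop is a made-up name, and the function assumes every field is short enough to fit within Vim's tabstop limit:

```vim
" Hypothetical helper: set 'tabstop' one wider than the widest
" tab-separated field in the current buffer.
function! FitTabstop()
    let widest = 0
    for text in getline(1, '$')
        " The third argument keeps empty fields so none are skipped
        for field in split(text, "\t", 1)
            let widest = max([widest, strdisplaywidth(field)])
        endfor
    endfor
    let &l:tabstop = widest + 1
endfunction
```

Run it with :call FitTabstop() after loading a file. It scans every line, so it may be slow on very large files.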

If you're working with TSV files, you'll want to turn on the list option and set listchars in such a way that tabs are displayed specially. I use this (stolen/adapted from elsewhere):

:set listchars=eol:\ ,tab:»-,trail:·,precedes:…,extends:…,nbsp:‗

Remember also that in insert mode you can insert a literal tab into your file no matter what you have expandtab set to, by hitting CTRL-V first.
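To avoid typing these settings for every file, they can go in an autocommand. This is only a sketch: the *.tsv pattern, the augroup name, and the tabstop value are assumptions to adjust for your own data.

```vim
" Hypothetical vimrc fragment: apply TSV-friendly settings to *.tsv files.
augroup tsv_settings
    autocmd!
    " 'list' makes listchars visible; 'noexpandtab' keeps typed tabs as tabs
    autocmd BufRead,BufNewFile *.tsv setlocal list noexpandtab tabstop=16
augroup END
```

Using setlocal keeps the settings confined to the TSV buffer, so other files are unaffected.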

Those things are just about all I need to work with TSV files comfortably.

Dealing with long lines

Sometimes I'll run some data through a statistics program and it'll tell me "Invalid character at line 576, column 9438". How do you jump to that location in your data? Easily enough in normal mode:

576G9438|

G jumps you to a line, | jumps you to a column. A good mnemonic to remember these: "g = goto line" and "| looks like a column".
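One caveat: | counts screen columns, so tabs and wide characters can shift where you land. If the program reports byte offsets instead (an assumption worth checking against its documentation), cursor() may get closer, since it counts columns in bytes. A minimal sketch, with GotoLineCol as a made-up name:

```vim
" Hypothetical helper: jump to a line and a byte column.
function! GotoLineCol(lnum, col)
    " Command arguments arrive as strings, so convert them to numbers
    call cursor(str2nr(a:lnum), str2nr(a:col))
endfunction

command! -nargs=+ GotoLineCol call GotoLineCol(<f-args>)
```

With this defined, :GotoLineCol 576 9438 should put the cursor on byte 9438 of line 576.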

Suppose (like me) you turn scrollbars off in your GVim window, or you work from a terminal. How do you scroll horizontally? With zL and zH, which scroll half a screen width right and left when wrap is off. But those are hard to type and harder to remember, so I map them to CTRL+Shift+Right and Left.

nnoremap <C-S-Right> zL
nnoremap <C-S-Left> zH

If you have a lot of long lines, it behooves you to set up a mapping to turn soft line-wrapping on and off quickly:

nnoremap <Leader>w :setlocal wrap!<CR>

Now you can jump back and forth between wrapped and unwrapped views of your data via \w.

Diffs on long lines

When using vimdiff, ]c and [c will jump you to lines that contain differences. But if your lines are thousands of characters long, it doesn't always help to know that two lines are different: you want to know where they differ. ]c puts you at the first column in the line, which isn't helpful.

This function will jump you to the column where the difference between two lines starts:

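" Return true if byte column a:col of the current line is highlighted as
" changed text (DiffText). Looking the group up by name avoids relying on
" a hard-coded highlight ID.
function! IsDiff(col)
    return synIDattr(diff_hlID('.', a:col), 'name') ==# 'DiffText'
endfunction

" Move the cursor to the first changed column on the current line, if any.
function! FindDiffOnLine()
    let c = 1
    while c < col('$')
        if IsDiff(c)
            " A line number of 0 means "stay on the current line"
            call cursor(0, c)
            return
        endif
        let c += 1
    endwhile
endfunction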

This seems fragile, so if you know a better way please leave a comment and let me know.
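To save a step, the jump can be chained onto the built-in diff motions. This is a sketch; ]C and [C are keys I picked for illustration, so check that they don't clash with your own mappings.

```vim
" Hypothetical mappings: go to the next/previous change, then move to the
" first differing column on that line.
nnoremap <silent> ]C ]c:call FindDiffOnLine()<CR>
nnoremap <silent> [C [c:call FindDiffOnLine()<CR>
```

Because these use nnoremap, the ]c and [c on the right-hand side always refer to the built-in commands.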

Diff a buffer against itself

Sometimes I have files with lots of lines of data, and some lines might be duplicates. It's really easy to get rid of the duplicate lines:

:sort u

But what if I want to see what lines were just removed? I don't know of a built-in way to do it, so I use this simple function:

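" Mark every line that duplicates an earlier line, then report the count.
function! MarkDuplicateLines()
    let seen = {}
    let count_dupes = 0
    for lnum in range(1, line('$'))
        let text = getline(lnum)
        if has_key(seen, text)
            " normal! bypasses user mappings; I inserts before the first non-blank
            execute lnum . 'normal! I *****'
            let count_dupes += 1
        else
            let seen[text] = 1
        endif
    endfor
    echomsg count_dupes . ' dupe(s) found'
endfunction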

That'll put a bunch of asterisks at the beginning of every line that's a duplicate of a previous line.
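Once the lines are marked, :global can list them or delete them, which also removes duplicates without re-sorting the file. A sketch, assuming no genuine data line starts with five asterisks:

```vim
:call MarkDuplicateLines()
" List every marked line for review
:global/^\s*\*\{5}/print
" Delete the marked lines once you're happy with the list
:global/^\s*\*\{5}/delete
```

The \s* in the pattern allows for the space that the marker starts with, plus any indentation in front of it.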

Fix broken punctuation

Ever had to work with textual data that someone else sent you in MS Word or Excel? I have. It hurts. If you copy/paste from an MS document into a plaintext file, you'll probably end up with a bunch of funky unreadable or undisplayable characters.

This function fixes the most common Word-spawned "smart" punctuation characters that you're likely to run across:

function! FixInvisiblePunctuation()
    silent! %s/\%u2018/'/g
    silent! %s/\%u2019/'/g
    silent! %s/\%u2026/.../g
    silent! %s/\%uf0e0/->/g
    silent! %s/\%u0092/'/g
    silent! %s/\%u2013/--/g
    silent! %s/\%u2014/--/g
    silent! %s/\%u201C/"/g
    silent! %s/\%u201D/"/g
    silent! %s/\%u00e2\%u20ac\%u2122/'/g
    silent! %s/\%ua0/ /g
    retab
endfunction

This function was built up during years of choking on special characters that crept up in my data, so I'm sure I'll be adding more to it in the future.

\%u#### here is a regex escape that matches a character by its hexadecimal Unicode codepoint.

Many of these characters show up as blank (whitespace) if you lack the font to display them. If you ever run across a character you can't see and you want to inspect it, put the cursor on it and do

:ascii

That's a bad name for the command, since it works on Unicode characters too. That'll give you the numeric code you can use in a regex to replace them all with something sane.
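Finding the odd characters in the first place can be the hard part when they render as blanks. A minimal sketch of a mapping that searches for the next character outside the ASCII range; the <Leader>n key is an arbitrary choice:

```vim
" Hypothetical mapping: jump to the next non-ASCII character.
nnoremap <Leader>n /[^\x00-\x7F]<CR>
```

After the jump, :ascii shows the codepoint of the character under the cursor, and n repeats the search.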

July 12, 2010 @ 3:42 AM PDT
Category: Programming
Tags: Vim

3 Comments

Michał Marczyk
Quoth Michał Marczyk on July 19, 2010 @ 8:39 AM PDT

Oh bother, I must have forgotten to provide the secret keyword last time round... I guess an ungulate mammal has eaten my (previous attempt at posting a) comment. Anyway:

You can use ga instead of :ascii. Also, seeing the impressive column number in the | example makes me want to ask if you're doing anything special to deal with super-long lines of this sort...? My experience is that both Vim and Emacs choke pretty badly when dealing with large files with very large average line length. (That's even when I turn syntax highlighting off and clear matchpairs.)

I'll be stealing your listchars and the punctuation fix, by the way -- thanks!

Brian
Quoth Brian on July 19, 2010 @ 9:22 AM PDT

Oops, sorry, I haven't checked my comment backlog in a couple days.

I'm looking at a file with 18k-character-long lines right now, and it seems to be running smoothly enough. I can jump to a line or column instantly. I can do searches OK etc. I make sure to leave filetype blank (none), but I don't do anything special otherwise.

I only start seeing problems if the file itself is extremely large (hundreds of MB). My files tend to have very long lines but only ~1k lines total, which is only 10-20MB, so maybe that's why I don't have problems.

I've had problems in the past with screwy plugins causing massive lag in unexpected situations. I had a Ruby plugin once that would slow my machine to a crawl. It might be worth it to start Vim with a blank config file and see if it's more responsive.

Thanks for the ga reminder, I knew there had to be a normal mode command but I couldn't remember it.

TorbenH
Quoth TorbenH on August 05, 2010 @ 9:54 PM PDT

Like you, I needed a quick way to jump to the column where two files differ when using vimdiff, so I made up a function similar to yours. I see I'm not the only one in need of such a function.

My function requires one to navigate to the line in question (e.g. with ]c ) before using it.

function! NextChangeCol()
    let l:line = line('.')

    " Compare the two listed, loaded buffers (assumed to be the diffed pair)
    let l:bufs = filter(range(1, bufnr('$')), 'buflisted(v:val) && bufloaded(v:val)')
    if len(l:bufs) != 2
        throw "Only supported with 2 loaded buffers"
    endif

    " getbufline() returns a list, which is empty if the line doesn't exist
    let l:str1 = getbufline(l:bufs[0], l:line)
    let l:str2 = getbufline(l:bufs[1], l:line)

    if len(l:str1) != 1 || len(l:str2) != 1
        return
    endif

    " Unmatch() returns a byte column, so use cursor() rather than |
    call cursor(0, Unmatch(l:str1[0], l:str2[0]))
endfunction

" Return the 1-based byte column of the first difference between two strings
function! Unmatch(str1, str2)
    let l:max = min([strlen(a:str1), strlen(a:str2)])

    let l:ix = 0
    " ==# compares case-sensitively regardless of 'ignorecase'
    while l:ix < l:max && a:str1[l:ix] ==# a:str2[l:ix]
        let l:ix = l:ix + 1
    endwhile
    return l:ix + 1
endfunction

I mapped the function to gc:

noremap <silent> gc :call NextChangeCol()<CR>