Vim and plaintext data files
I use Vim to work with plaintext datasets. Here are some habits and code snippets I've picked up which make data files a bit easier to deal with.
Tab-delimited files
If you have a file full of tab-separated values, the columns may not line up very well due to the variable display-length of tabs. You'll often end up with something that looks like this:
foo     bar
longvaluehere   quux
But things will line up if you set tabstop to a value greater than the longest column.
:setlocal tabstop=16
Then it'll look like this:
foo             bar
longvaluehere   quux
Now you can use visual-block mode to add and remove columns easily.
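If you'd rather not count characters by hand, a helper along these lines could pick the value for you. This is only a sketch: FitTabstop is a made-up name, and it assumes every field is separated by a single literal tab.

```vim
" Hypothetical helper: set 'tabstop' one wider than the widest field
" in the current buffer so that every column lines up.
function! FitTabstop()
  let widest = 0
  for text in getline(1, '$')
    " The third argument keeps empty fields, so blank cells are counted too.
    for field in split(text, "\t", 1)
      let widest = max([widest, strdisplaywidth(field)])
    endfor
  endfor
  " &l: sets the buffer-local value, like :setlocal does.
  let &l:tabstop = widest + 1
endfunction
```

Run it with :call FitTabstop() after loading the file. It scans every line, so it may take a moment on very large files.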
If you're working with TSV files, you'll want to set the listchars option so that tabs are displayed distinctly (the setting only shows up on screen once the list option is on). I use this, stolen/adapted from here:
:set listchars=eol:\ ,tab:»-,trail:·,precedes:…,extends:…,nbsp:‗
Remember also that you can insert a literal tab into your file no matter what you have expandtab set to, by hitting CTRL-v first.
Those things are just about all I need to work with TSV files comfortably.
Dealing with long lines
Sometimes I'll run some data through a statistics program and it'll tell me "Invalid character at line 576, column 9438". How do you jump to that location in your data? Easily enough in normal mode:
576G9438|
G jumps you to a line, | jumps you to a column. A good mnemonic to remember these: "g = goto line" and "| looks like a column".
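If you do this often, the same jump can be wrapped in a command. The following is a sketch with a made-up name (GotoLineCol). One difference worth knowing: | counts screen columns, while cursor() counts bytes, so the two can land in different places on lines that contain tabs or multibyte characters.

```vim
" Hypothetical command: jump to a line and a byte column in one step.
function! GotoLineCol(lnum, col)
  " <f-args> passes strings, so convert them to numbers first.
  call cursor(str2nr(a:lnum), str2nr(a:col))
endfunction
command! -nargs=+ GotoLineCol call GotoLineCol(<f-args>)
```

Usage would be :GotoLineCol 576 9438.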
Suppose (like me) you turn scrollbars off in your GVim window, or you work from a terminal. How do you scroll horizontally? With zL and zH. But those are hard to type and harder to remember, so I map them to CTRL+Shift+Right and Left.
nnoremap <C-S-Right> zL
nnoremap <C-S-Left> zH
If you have a lot of long lines, it behooves you to set up a mapping to turn soft line-wrapping on and off quickly:
nnoremap <Leader>w :setlocal wrap!<CR>
Now you can jump back and forth between wrapped and unwrapped views of your data via \w.
Diffs on long lines
When using vimdiff, ]c and [c will jump you to lines that contain differences. But if your lines are thousands of characters long, it doesn't always help to know that two lines are different: you want to know where they differ. ]c puts you at the first column in the line, which isn't helpful.
This function will jump you to the column where the difference between two lines starts:
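function! IsDiff(col)
  " DiffText is the group Vim uses for the changed text inside a changed line.
  " Look it up by name; the numeric highlight ID differs between Vim setups.
  return synIDattr(diff_hlID('.', a:col), 'name') ==# 'DiffText'
endfunction

function! FindDiffOnLine()
  " Walk the current line one byte at a time and stop at the first changed column.
  let c = 1
  while c < col('$')
    if IsDiff(c)
      call cursor('.', c)
      return
    endif
    let c += 1
  endwhile
endfunction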
This seems fragile, so if you know a better way please leave a comment and let me know.
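One way to wire this up is to chain it onto the jump commands, so that a single keystroke moves to the next changed line and then to the changed column. This sketch assumes the FindDiffOnLine() function above is already defined; the <Leader>n and <Leader>p keys are arbitrary choices.

```vim
" Jump to the next/previous change, then to the first differing column.
" nnoremap keeps ]c and [c bound to their built-in meaning.
nnoremap <silent> <Leader>n ]c:call FindDiffOnLine()<CR>
nnoremap <silent> <Leader>p [c:call FindDiffOnLine()<CR>
```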
Diff a buffer against itself
Sometimes I have files with lots of lines of data, and some lines might be duplicates. It's really easy to get rid of the duplicate lines:
:sort u
But what if I want to see what lines were just removed? I don't know of a built-in way to do it, so I use this simple function:
function! MarkDuplicateLines()
  " Remember every distinct line seen so far.
  let seen = {}
  let count_dupes = 0
  for lnum in range(1, line('$'))
    let text = getline(lnum)
    if has_key(seen, text)
      " normal! ignores user mappings; I inserts before the first non-blank.
      exe lnum . 'normal! I *****'
      let count_dupes += 1
    else
      let seen[text] = 1
    endif
  endfor
  echomsg count_dupes . " dupe(s) found"
endfunction
That'll put a bunch of asterisks at the beginning of every line that's a duplicate of a previous line.
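If you'd rather not modify the data, a variation could collect the duplicates in the location list instead. This is a sketch, not a drop-in replacement: ListDuplicateLines is a made-up name, and the '=' prefix on the keys is only there so that blank lines never produce an empty dictionary key.

```vim
" Hypothetical variant: list duplicate lines without editing the buffer.
function! ListDuplicateLines()
  let seen = {}
  let items = []
  for lnum in range(1, line('$'))
    let key = '=' . getline(lnum)
    if has_key(seen, key)
      " Record where the duplicate is and which earlier line it repeats.
      call add(items, {'bufnr': bufnr('%'), 'lnum': lnum,
            \ 'text': 'duplicate of line ' . seen[key]})
    else
      let seen[key] = lnum
    endif
  endfor
  " 0 means the location list of the current window.
  call setloclist(0, items)
  echomsg len(items) . ' dupe(s) found'
endfunction
```

After :call ListDuplicateLines(), :lopen shows the list and :lnext steps through the duplicates.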
Fix broken punctuation
Ever had to work with textual data that someone else sent you in MS Word or Excel? I have. It hurts. If you copy/paste from an MS document into a plaintext file, you'll probably end up with a bunch of funky unreadable or undisplayable characters.
This function fixes the most common Word-spawned "smart" punctuation characters that you're likely to run across:
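function! FixInvisiblePunctuation()
  " silent! hides the 'pattern not found' error for characters that are absent.
  " Curly single quotes
  silent! %s/\%u2018/'/g
  silent! %s/\%u2019/'/g
  " Ellipsis
  silent! %s/\%u2026/.../g
  " Private-use character (an arrow in Word's symbol fonts)
  silent! %s/\%uf0e0/->/g
  " C1 control character; byte 0x92 is an apostrophe in Windows-1252
  silent! %s/\%u0092/'/g
  " En and em dashes
  silent! %s/\%u2013/--/g
  silent! %s/\%u2014/--/g
  " Curly double quotes
  silent! %s/\%u201C/"/g
  silent! %s/\%u201D/"/g
  " R, euro sign, trademark sign (compare the mojibake â€™, which starts with \%u00e2)
  silent! %s/\%u0052\%u20ac\%u2122/'/g
  " Non-breaking space
  silent! %s/\%ua0/ /g
  retab
endfunction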
This function was built up during years of choking on special characters that crept up in my data, so I'm sure I'll be adding more to it in the future.
\%u#### here is a regex escape that matches a character by its hexadecimal Unicode codepoint.
Many of these characters show up as blank (whitespace) if you lack the font to display them. If you ever run across a character you can't see and you want to inspect it, put the cursor on it and do
:ascii
That's a bad name for the command, since it works on Unicode characters too. That'll give you the numeric code you can use in a regex to replace them all with something sane.
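To find such characters in the first place, a search for anything outside the 7-bit ASCII range can help. This is a sketch that assumes your data is supposed to be plain ASCII; legitimate accented characters will match too.

```vim
" Normal mode: jump to the next non-ASCII character (n repeats the search).
/[^\x00-\x7F]

" Ex command: count non-ASCII characters without changing the buffer.
" The n flag makes :s report matches instead of substituting.
:%s/[^\x00-\x7F]//gn
```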

3 Comments
Oh bother, I must have forgotten to provide the secret keyword last time round... I guess an ungulate mammal has eaten my (previous attempt at posting a) comment. Anyway:
You can use ga instead of :ascii. Also, seeing the impressive column number in the | example makes me want to ask if you're doing anything special to deal with super-long lines of this sort...? My experience is that both Vim and Emacs choke pretty badly when dealing with large files with very large average line length. (That's even when I turn syntax highlighting off and clear matchpairs.)
I'll be stealing your listchars and the punctuation fix, by the way -- thanks!

Oops, sorry, I haven't checked my comment backlog in a couple days.
I'm looking at a file with 18k-character-long lines right now, and it seems to be running smoothly enough. I can jump to a line or column instantly. I can do searches OK etc. I make sure to leave filetype= blank (none), but I don't do anything special otherwise.
I only start seeing problems if the file itself is extremely large (hundreds of MB). My files tend to have very long lines but only ~1k lines total, which is only 10-20MB, so maybe that's why I don't have problems.
I've had problems in the past with screwy plugins causing massive lag in unexpected situations. I had a Ruby plugin once that would slow my machine to a crawl. It might be worth it to start Vim with a blank config file and see if it's more responsive.
Thanks for the ga reminder, I knew there had to be a normal mode command but I couldn't remember it.

Like you I needed a quick way to jump to the column where two files differ when using vimdiff, so I made up a function similar to yours. I see I'm not the only one in need of such a function.
My function requires one to navigate to the line in question (e.g. with ]c ) before using it.
I mapped the function to gc: