JSON pattern matching with sed, perl and regular expressions

Why VIM?

Sooner or later the day comes when your easy-to-use IDE becomes useless for handling huge files. Few editors can work with very large files, such as production logs.

I recently had to analyze a 100 MB one-line JSON file, and once more VIM saved the day. VIM, like many other Unix utilities, is both tough and brilliant. Git interactive rebase requires you to know it, and if you’re still not convinced, maybe this great article will make you change your mind.

Let’s see how easily you can pretty print a JSON file with VIM. First we will download a one-line JSON file from Reddit.

$ wget http://www.reddit.com/r/programming.json
--2014-01-24 12:21:04--  http://www.reddit.com/r/programming.json
Resolving www.reddit.com (www.reddit.com)... 77.232.217.122, 77.232.217.113
Connecting to www.reddit.com (www.reddit.com)|77.232.217.122|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28733 (28K) [application/json]
Saving to: `programming.json'

100%[======================================>] 28,733      --.-K/s   in 0.03s

2014-01-24 12:21:04 (1021 KB/s) - `programming.json' saved [28733/28733]

This is what it looks like:

[Screenshot: programming.json opened in VIM as a single line]

Pretty printing

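Python ships with most Unix distributions, so the following VIM command does the trick. Typed in VIM’s command-line mode (after pressing the colon key), it filters the whole buffer, denoted by %, through Python’s json.tool module:

:%!python -m json.tool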

[Screenshot: the pretty-printed JSON in VIM]
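
For readers who want to see what the filter does outside VIM, here is a minimal Python sketch of the same idea: parse the one-line JSON and serialize it again with indentation. The helper name pretty_print and the sample string are made up for illustration; the sample is a tiny stand-in, not the real Reddit download.

```python
import json


def pretty_print(one_line_json: str) -> str:
    """Parse JSON text and re-serialize it with 4-space indentation."""
    return json.dumps(json.loads(one_line_json), indent=4)


# Hypothetical miniature of a listing, used only for illustration.
sample = (
    '{"kind": "Listing", "data": {"children": '
    '[{"data": {"domain": "github.com", "author": "krets"}}]}}'
)

pretty = pretty_print(sample)
print(pretty)
```

The parsed data is unchanged; only the layout differs, which is what makes the line-oriented tools in the next sections usable.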

Let’s save the pretty printed JSON file and put other Unix tools to work.

:w programming_pretty.json

Matching time

Let’s say we want to extract all “domain” related values:

"domain": "mameworld.info"

Sed to the rescue

$ sed -nr 's/^.*"domain":\s*"([^"]*)".*$/\1/p' <programming_pretty.json | sort -u
blog.safaribooksonline.com
chadfowler.com
cyrille.rossant.net
dot.kde.org
evanmiller.org
fabiensanglard.net
galileo.phys.virginia.edu
github.com
halffull.org
ibuildings.nl
jaxenter.com
jobtipsforgeeks.com
kilncode.com
libtins.github.io
mameworld.info
miguelcamba.com
minuum.com
notes.tweakblogs.net
perfect-pentago.net
periscope.io
reuters.com
tech.blog.box.com
tmm1.net
vocalbit.com
youtube.com
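
The same line-by-line extraction can be sketched in Python, which may help when reading the regular expression. This is only an illustration under assumptions: the function name unique_domains and the sample lines are hypothetical. The capture group uses [^"]* because POSIX regular expressions, as used by sed, have no lazy quantifiers, and a negated character class behaves the same way in every regex flavor.

```python
import re

# Capture the text between the quotes that follow "domain":.
# [^"]* cannot run past the closing quote of the value.
DOMAIN_RE = re.compile(r'"domain":\s*"([^"]*)"')


def unique_domains(lines):
    """Return the sorted, de-duplicated domain values found in the lines."""
    found = set()
    for line in lines:
        match = DOMAIN_RE.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


# Hypothetical lines laid out like the pretty-printed file.
sample_lines = [
    '            "domain": "mameworld.info",',
    '            "author": "justrelaxnow",',
    '            "domain": "github.com",',
    '            "domain": "github.com",',
]

print(unique_domains(sample_lines))
```

The set plays the role of sort -u: duplicates collapse and the result comes back ordered.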

Multi-line matching

Sed is line-oriented, and while it offers multi-line support, it’s no match for Perl. Let’s say I want to match all authors in the following JSON pattern:

"data": {   
   "author": "justrelaxnow", 
}

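This is how I do it. The -0777 switch makes perl read the whole file as a single string, and the /s modifier lets the dot match newlines, so one pattern can span several lines:

$ perl -0777 -ne 'print "$1\n" while /"data":\s*\{.*?"author":\s*"([^"]*)"/sg' programming_pretty.json | sort -u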
AmericanXer0
azth
bionicseraph
bit_shiftr
charles_the_hard
Gexos
jakubgarfield
johnwaterwood
joukoo
justrelaxnow
Kingvash
krets
mariuz
mopatches
nyphrex
pseudomind
rluecke3
sltkr
solidus-flux
steveklabnik1
sumstozero
swizec
vocalbit
Wolfspaw
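
To make the multi-line idea concrete, here is a rough Python sketch of the same approach: treat the whole text as one string and let the dot cross line boundaries. The name unique_authors and the sample fragment are hypothetical, and re.DOTALL stands in for Perl's /s modifier.

```python
import re

# re.DOTALL lets "." match newlines, so the lazy ".*?" can skip over
# the lines between the opening brace of "data" and the "author" key.
AUTHOR_RE = re.compile(r'"data":\s*\{.*?"author":\s*"([^"]*)"', re.DOTALL)


def unique_authors(text):
    """Return the sorted, de-duplicated authors found inside "data" objects."""
    return sorted(set(AUTHOR_RE.findall(text)))


# Hypothetical fragment laid out like the pretty-printed file.
sample = '''
{
    "kind": "t3",
    "data": {
        "domain": "github.com",
        "author": "krets"
    }
},
{
    "kind": "t3",
    "data": {
        "author": "azth",
        "domain": "youtube.com"
    }
}
'''

print(unique_authors(sample))
```

The pattern does not care where the "author" key sits inside the object, which is the point of matching across lines instead of line by line.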

Conclusion

Unix tools are old school; some of them were written forty years ago. The learning curve might be steep, but learning them is a great investment. A great software library stands the test of time, and Unix tools are a good reminder that tough jobs call for tough tools.

2 thoughts on “JSON pattern matching with sed, perl and regular expressions”

  1. I was confused about what you meant by the command !python -m json.tool, so I used cat programming.json | python -m json.tool > programming_formatted.json to create the formatted JSON file. Thanks for the great work!

  2. Oh now I see this is in the context of vim command mode, cool!