JSON pattern matching with sed, perl and regular expressions

Why VIM?

Sooner or later there comes the day when your easy-to-use IDE becomes useless for handling huge files. There aren’t many editors capable of working with very large files, such as production logs.

I’ve recently had to analyze a 100 MB one-line JSON file and once more VIM saved the day. VIM, like many other Unix utilities, is both tough and brilliant. Git interactive rebase requires you to know it, and if you’re still not convinced, maybe this great article will make you change your mind.

Continue reading “JSON pattern matching with sed, perl and regular expressions”

MongoDB and the fine art of data modeling


This is the third part of our MongoDB time series tutorial, and this post will emphasize the importance of data modeling. You might want to read the first part of this series to get familiar with our virtual project requirements, and the second part, which covers common optimization techniques.

When you first start using MongoDB, you’ll immediately notice its schema-less data model. But schema-less doesn’t mean skipping proper data modeling (satisfying your application business and performance requirements). As opposed to a SQL database, a NoSQL document model is focused more on querying than on data normalization. That’s why your design won’t be finished unless it addresses your data querying patterns.

Continue reading “MongoDB and the fine art of data modeling”

A beginner’s guide to MongoDB performance turbocharging


This is the second part of our MongoDB time series tutorial, and this post will be dedicated to performance tuning. In my previous post, I introduced you to our virtual project requirements.

In short, we have 50M time events, spanning from the 1st of January 2012 to the 1st of January 2013, with the following structure:

    {
        "_id" : ObjectId("52cb898bed4bd6c24ae06a9e"),
        "created_on" : ISODate("2012-11-02T01:23:54.010Z"),
        "value" : 0.19186609564349055
    }

We’d like to aggregate the minimum, the maximum, and the average value as well as the entries count for the following discrete time samples:

  1. all seconds in a minute
  2. all minutes in an hour
  3. all hours in a day
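To make the target aggregation concrete, here is a minimal in-memory sketch in plain Python. It is not the MongoDB implementation discussed in the post; it only shows what "min, max, average and count per time sample" means for events shaped like the one above. The function names `bucket_start` and `aggregate` and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def bucket_start(ts, granularity):
    """Truncate a timestamp to the start of its second, minute, or hour."""
    if granularity == "second":
        return ts.replace(microsecond=0)
    if granularity == "minute":
        return ts.replace(second=0, microsecond=0)
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported granularity: {granularity}")


def aggregate(events, granularity):
    """Group (created_on, value) pairs by time bucket and summarize each bucket."""
    buckets = defaultdict(list)
    for created_on, value in events:
        buckets[bucket_start(created_on, granularity)].append(value)
    return {
        start: {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "count": len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical sample events (not taken from the 50M data set).
events = [
    (datetime(2012, 11, 2, 1, 23, 54, 10000), 0.25),
    (datetime(2012, 11, 2, 1, 23, 54, 500000), 0.75),
    (datetime(2012, 11, 2, 1, 24, 1), 0.5),
]

per_second = aggregate(events, "second")
per_minute = aggregate(events, "minute")
per_hour = aggregate(events, "hour")
```

With 50M events this kind of client-side grouping would not be practical, which is why the series pushes the work into the database.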

Continue reading “A beginner’s guide to MongoDB performance turbocharging”

MongoDB time series: Introducing the aggregation framework

In my previous posts I talked about batch importing and the out-of-the-box MongoDB performance. Meanwhile, MongoDB was awarded DBMS of the year, so I decided to offer a more thorough analysis of its real-life usage.

Because a theory is better understood in a pragmatic context, I will first present our virtual project requirements.


Our virtual project has the following requirements:

  1. it must store valued time events represented as v=f(t)
  2. it must aggregate the minimum, maximum, average and record count by:
    • seconds in a minute
    • minutes in an hour
    • hours in a day
    • days in a year
  3. the seconds in a minute aggregation is calculated in real-time (so it must be really fast)
  4. all other aggregations are calculated by a batch processor (so they must be relatively fast)
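One possible way to express requirement 2 with the aggregation framework is a `$group` stage keyed by date parts. The sketch below only builds the pipeline as plain Python data; it assumes each event document has a `created_on` date field and a numeric `value` field, and the helper name `build_pipeline` is hypothetical. It is an illustration under those assumptions, not necessarily the pipeline the full post arrives at.

```python
# Date-part operators of the aggregation framework, from coarsest to finest.
DATE_PARTS = ["year", "dayOfYear", "hour", "minute", "second"]

# How many date parts form the group key for each granularity.
DEPTH = {"day": 2, "hour": 3, "minute": 4, "second": 5}


def build_pipeline(granularity):
    """Return an aggregation pipeline grouping events by the given granularity."""
    parts = DATE_PARTS[:DEPTH[granularity]]
    # e.g. {"year": {"$year": "$created_on"}, "dayOfYear": {...}, ...}
    group_id = {part: {"$" + part: "$created_on"} for part in parts}
    return [
        {
            "$group": {
                "_id": group_id,
                "min": {"$min": "$value"},
                "max": {"$max": "$value"},
                "avg": {"$avg": "$value"},
                "count": {"$sum": 1},
            }
        }
    ]


hourly_pipeline = build_pipeline("hour")
```

With PyMongo, a pipeline like this would be passed to `collection.aggregate(hourly_pipeline)`, usually preceded by a `$match` stage that restricts the time range being aggregated.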

Continue reading “MongoDB time series: Introducing the aggregation framework”

A beginner’s guide to ACID and database transactions


Transactions are omnipresent in today’s enterprise systems, providing data integrity even in highly concurrent environments. So let’s get started by first defining the term and the context where you might usually employ it.

A transaction is a collection of read/write operations succeeding only if all contained operations succeed.
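The all-or-nothing behavior in this definition can be sketched with Python's built-in `sqlite3` module. The `account` table, the `transfer` helper and the amounts are assumptions chosen for illustration; the point is only that when one statement in the transaction fails, the earlier ones are undone as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        # The connection context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src has insufficient funds.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return a {account_id: balance} snapshot."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


ok = transfer(conn, 1, 2, 30)       # both updates succeed and are committed
failed = transfer(conn, 1, 2, 500)  # the debit fails, so the credit is rolled back
final_balances = balances(conn)
```

In the failing transfer the credit to account 2 has already been applied when the debit is rejected, yet the rollback leaves both balances as they were before that transfer started.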


Inherently a transaction is characterized by four properties (commonly referred to as ACID):

  1. Atomicity
  2. Consistency
  3. Isolation
  4. Durability

Continue reading “A beginner’s guide to ACID and database transactions”