Archive for category Data Processing
A few weeks ago I introduced openpivot on this blog. It is a powerful command line utility to build pivot tables. Assume you have the following simple file :
company Name;year;NbEmployee;profit;sector Microsoft;2001;30000;1.5;IT Microsoft;2003;33001;1.5;IT Apple;2001;23001;1.5;IT Apple;2003;23001;1.5;IT Google;2001;43001;1.5;IT Google;2003;43001;1.5;IT
What a complex data set ! Pivot tables are very powerful tools allowing you to make summary of your data from the point of view of your choice. Suppose you want to have per year and per sector, the average number of employees accross all entries. Here it is implicit that these entries will be only different companies as here there is 1 line per company/year. Note also that there is only 1 sector. So as there are 2 different years represented, and only 1 sector, you will get 2 statistics. Doing this with openpivot is simple. You only need to describe what pivot table you want in an xml file configuration and it will look like this:
<?xml version="1.0" encoding="UTF-8"?> <problem> <defaultaccumulation value="sum"/> <columnlist> <col id="NbEmployee" accumulation"average"/> <col id="profit"/> <!-- will use default accumulation --> </columnlist> <rowsequence> <row id="year"/> <row id="sector"/> </rowsequence> </problem>
This configuration is almost the pivot table we want for our example, except that there is an additional column we want to have a statistics for. Note that what we call row here the variable sequence from which you want to pivot. There are taken into account in the order specified. Just like in excel, changing the order of variables will offer a diferent view on your data. You are now ready to have your pivot table, all you need is to call openpivot to compute this from your sample data :
% openpivot -p configfile.xml -o outputfile.csv inputfile.csv
Et voila ! This will write your pivot table in the outputfile.csv. Please drop me a line if you have problem to build openpivot. I have tested it on mac os 10.6 and various linux distribution, and it seemed to work
I am pleased to announce the release of my first open source project . Openpivot is designed to efficiently reproduce the functionality of MS excel pivot tables. The problem of MS excel is that old versions (priori to 2007) cannot deal files larger than 65 000 rows. At work a lot of people are always writing scripts to pre-aggregate the data they have to then be able to play with it in excel.
Openpivot was written in c++ an I have been focusing on performance so that it can been used in large amount of data processing.
The principle is that you have a [large] csv file with headers and you would like to have an aggregated view of this file. You may want to aggregate a certain column or see the sum of a certain columns by entry of another column.
Check it out on google code :
I will post a tutorial in a few days to show how to use it, and you will see that it is simple. So far it is a command line tool and only works on mac os and linux. I will try to have it compatible with windows when I get to a stable point