Sunday 24 October 2010

Too many choices for the modern analytics Solution Architect

We analytics practitioners have always had the luxury of alternatives to the RDBMS as part of our data architectures. OLAP of one form or another has been providing what one of my colleagues calls ‘query at the speed of thought’ for well over a decade. However, the range of options available to a solutions architect today is bordering on overwhelming.

First off, the good old RDBMS offers hashing, materialised views, bitmap indexes and other physical implementation options that don’t really require us to think too differently about the raw SQL. The columnar database, implemented in products like Sybase IQ, is another option. Its benefits are not necessarily obvious. We data geeks always used to think the performance issues were about joining, but then the smart people at InfoBright, Kickfire et al told us that shorter rows are the answer to really fast queries on large data volumes. There is some sense in this: disk I/O is an absolute bottleneck, so reading fewer columns means reading less redundant data. The Oracle and Microsoft hats are in the columnar ring (if you will excuse the mixed geometry and metaphor) with Exadata 2 and Gemini/Vertipaq, so columnar stores are becoming mainstream options.
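
To make the "fewer columns, less reading" point concrete, here is a minimal Python sketch. It is a toy model only, not how Sybase IQ, InfoBright or any other product actually stores data. The table, its column names and the two functions are all made up for illustration. It lays the same data out row-wise and column-wise, then counts how many bytes each layout has to scan to sum a single column.

```python
import struct

# Hypothetical table: 1,000 rows of (id, region_code, quantity, amount),
# each value stored as a 4-byte little-endian integer.
COLUMNS = ("id", "region_code", "quantity", "amount")
ROWS = [(i, i % 5, i % 7, i * 10) for i in range(1000)]

# Row store: all values of a row sit next to each other.
row_store = b"".join(struct.pack("<4i", *row) for row in ROWS)

# Column store: all values of a column sit next to each other.
column_store = {
    name: struct.pack(f"<{len(ROWS)}i", *(row[idx] for row in ROWS))
    for idx, name in enumerate(COLUMNS)
}


def sum_amount_row_store(data):
    """Sum 'amount' by scanning every full row; returns (total, bytes_read)."""
    row_size = struct.calcsize("<4i")
    total = 0
    for offset in range(0, len(data), row_size):
        # The whole row is read even though only one value is needed.
        _, _, _, amount = struct.unpack_from("<4i", data, offset)
        total += amount
    return total, len(data)


def sum_amount_column_store(store):
    """Sum 'amount' by scanning only that column; returns (total, bytes_read)."""
    data = store["amount"]
    count = len(data) // struct.calcsize("<i")
    values = struct.unpack(f"<{count}i", data)
    return sum(values), len(data)


if __name__ == "__main__":
    print("row store:   ", sum_amount_row_store(row_store))
    print("column store:", sum_amount_column_store(column_store))
```

Both layouts give the same total, but in this toy case the column layout touches a quarter of the bytes because the query needs one column out of four. Real columnar engines add compression and other tricks on top, which this sketch does not attempt to model.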


Data warehouse appliances are yet another option. The combined hardware, operating system and software solution, usually built on massively parallel processing (MPP), delivers high performance on really large volumes. And by large we probably mean Peta not Tera. Sorry NCR, Tera just doesn’t impress anyone anymore. And whilst we are on the subject of Teradata, it was probably one of the first appliances, but then NCR strategically decided to go open shortly before the data warehouse appliance market really opened up. The recent IBM acquisition of Netezza and the presence of Oracle and NCR are reshaping what was once considered niche and special into the mainstream.


We have established that the absolute bottleneck is disk I/O, so in-memory options should be a serious consideration. There are in-memory BI products but the action is really where the data is. In-memory databases include TimesTen (now Oracle’s) and IBM’s solidDB. Of course, TM1 fans will point out that they had in-memory OLAP when they were listening to Duran Duran CDs, and they would be right.

The cloud has to get a mention here because it is changing everything. We can’t ignore those databases that have grown out of the need for massive data volumes like Google’s BigTable, Amazon’s RDS and Hadoop. They might not have been built with analytics in mind but they are offering ways of dealing with unstructured and semi-structured data and this is becoming increasingly important as organisations include data from on-line editorial and social media sources in their analytics. All of that being said, large volumes and limited pipes are keeping many on-premises for now.

So, what’s the solution? Well, that is the job of the Solutions Architect. I am not sidestepping the question (well actually, I am a little). However, it’s time to examine the options and identify which information management technologies should form part of your data architecture. It is no longer enough to simply choose an RDBMS.