We analytics practitioners have always had the luxury of alternatives to the RDBMS as part of our data architectures. OLAP of one form or another has been providing what one of my colleagues calls ‘query at the speed of thought’ for well over a decade. However, the range of options available to a solutions architect today is bordering on overwhelming.
First off, the good old RDBMS offers hashing, materialised views, bitmap indexes and other physical implementation options that don’t really require us to think too differently about the raw SQL. The columnar database, implemented in products like Sybase IQ, is another option. The benefits are not necessarily obvious. We data geeks always used to think the performance issues were about joining, but then the smart people at Infobright, Kickfire et al told us that shorter rows are the answer to really fast queries on large data volumes. There is some sense in this: disk I/O is an absolute bottleneck, so a query that reads only the columns it needs reads less redundant data. The Oracle and Microsoft hats are in the columnar ring (if you will excuse the mixed geometry and metaphor) with Exadata 2 and Gemini/Vertipaq, so they are becoming mainstream options.
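To put a rough number on the disk I/O argument, here is a minimal sketch in Python. It is not how any of the products named above work internally; the table, its four columns and the function names are all made up for illustration. It lays the same data out row-wise and column-wise in byte buffers and counts how many bytes a simple "sum one column" query has to scan in each case.

```python
import struct

# Hypothetical table of 1,000 rows: (order_id, customer_id, quantity, amount).
rows = [(i, i % 50, i % 7, i * 1.5) for i in range(1000)]

# "<" means no padding: three 4-byte ints plus one 8-byte float = 20 bytes per row.
ROW_FORMAT = "<iiid"
ROW_SIZE = struct.calcsize(ROW_FORMAT)

# Row layout: every record is stored contiguously, all columns together.
row_store = b"".join(struct.pack(ROW_FORMAT, *r) for r in rows)

# Column layout: the values of one column are stored contiguously.
amount_column = struct.pack(f"<{len(rows)}d", *(r[3] for r in rows))


def sum_amount_row_store(data):
    """Sum 'amount' from the row layout; every full record must be read."""
    total = 0.0
    bytes_scanned = 0
    for offset in range(0, len(data), ROW_SIZE):
        record = struct.unpack_from(ROW_FORMAT, data, offset)
        bytes_scanned += ROW_SIZE
        total += record[3]
    return total, bytes_scanned


def sum_amount_column_store(column):
    """Sum 'amount' from the column layout; only that column is read."""
    values = struct.unpack(f"<{len(column) // 8}d", column)
    return sum(values), len(column)


row_total, row_bytes = sum_amount_row_store(row_store)
col_total, col_bytes = sum_amount_column_store(amount_column)
print(row_total == col_total, row_bytes, col_bytes)
```

Both layouts give the same answer, but the row layout scans every byte of every record while the column layout scans only the one column the query asks for. The wider the table, the bigger that gap becomes, and real columnar engines typically add compression on top.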
Data Warehouse appliances are yet another option. The combined hardware, operating system and software solution, usually using massively parallel processing (MPP), delivers high performance on really large volumes. And by large we probably mean Peta not Tera. Sorry NCR, Tera just doesn’t impress anyone anymore. And whilst we are on the subject of Teradata, it was probably one of the first appliances, but then NCR strategically decided to go open shortly before the data warehouse appliance market really opened up. The recent IBM acquisition of Netezza and the presence of Oracle and NCR are reshaping what was once considered niche and special into the mainstream.
We have established that the absolute bottleneck is disk I/O, so in-memory options should be a serious consideration. There are in-memory BI products but the action is really where the data is. Databases include TimesTen (now Oracle’s) and IBM’s solidDB. Of course, TM1 fans will point out that they had in-memory OLAP when they were listening to Duran Duran CDs and they would be right.
The cloud has to get a mention here because it is changing everything. We can’t ignore those databases that have grown out of the need for massive data volumes like Google’s BigTable, Amazon’s RDS and Hadoop. They might not have been built with analytics in mind but they are offering ways of dealing with unstructured and semi-structured data and this is becoming increasingly important as organisations include data from on-line editorial and social media sources in their analytics. All of that being said, large volumes and limited pipes are keeping many on-premises for now.
So, what’s the solution? Well, that is the job of the Solutions Architect. I am not sidestepping the question (well actually, I am a little). However, it’s time to examine the options and identify what information management technologies should form part of your data architecture. It is no longer enough to simply choose an RDBMS.
Hi Dale,
Great post on all the options you are discussing here. As the CTO of Infobright, I thought I would mention that you will find us a very good solution as an open columnar database that performs very fast on low-end commodity hardware. Our compression also requires less storage. Finally, our MySQL compatibility enables many MySQL-skilled professionals to take advantage of our solution.
Agreed that "the action is really where the data is." One more in-memory database system for your list is McObject's eXtremeDB (http://www.mcobject.com/extremedbfamily.shtml) which shares with TimesTen and SolidDB the distinction of having been created, from the ground up, AS an in-memory database system. So many DB vendors are jumping on the IMDS bandwagon, often on the basis of minor tweaks (or no tweaks at all) in their code.