Hortonworks’ Hadoop Hype is Misleading

Hortonworks is making a play for “Hadoop everywhere” by pushing Hadoop as an augmentation of the enterprise data warehouse, thereby creating a “data lake”.  See their report here: http://hortonworks.com/blog/enterprise-hadoop-journey-data-lake/

The concept of a data lake basically says “store everything” in Hadoop: you can access it at any time, apply a “schema on read” after the fact, and discover new insights at will.  What this ignores is that HDFS is optimized for MapReduce jobs over large, homogeneous data sets, which does not fit a “store everything” strategy.  Additionally, while the “schema on read” moniker gets thrown around a lot, I have yet to see many practical examples on the net that prove it out.  I understand that Hive uses it, but the Hive tutorials all still create the schema up front.  Granted, you could change that schema afterwards, because the schema is only enforced on the read – but enforcing on the read and defining on the read are two different things (the sketch below illustrates the distinction).  I will be delving into this in much more detail in the ensuing weeks…
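To make that distinction concrete, here is a minimal Python sketch of the idea – this is not Hive or HDFS, and the file contents, column names, and sample rows are invented purely for illustration.  Writes accept raw bytes with no validation; a schema that someone still has to define up front is applied only when the data is read, and bad values surface at read time (roughly the way Hive hands back NULLs) rather than being rejected at write time.

```python
# Minimal sketch of "schema-on-read": store raw bytes unchecked,
# apply a declared schema only when reading.  Illustrative only.
import csv
import io
from datetime import datetime

# The "lake" accepts whatever lands in it -- no schema enforced on write.
raw_landing_zone = io.StringIO()
raw_landing_zone.write("1,acme,2014-06-01\n")
raw_landing_zone.write("2,globex,not-a-date\n")   # bad value is stored anyway

def read_with_schema(raw_text, schema):
    """Parse raw text with a declared schema at read time.

    `schema` maps column names to parser functions; values that fail to
    parse come back as None instead of failing the write.
    """
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        parsed = {}
        for (name, parse), value in zip(schema.items(), record):
            try:
                parsed[name] = parse(value)
            except ValueError:
                parsed[name] = None
        rows.append(parsed)
    return rows

# The schema is still *defined* up front by whoever reads the data...
schema_v1 = {"id": int, "customer": str, "signup_date": str}
print(read_with_schema(raw_landing_zone.getvalue(), schema_v1))

# ...and can be redefined later without rewriting the stored bytes,
# because it is only *enforced* at read time; the second row's bad
# date now comes back as None.
schema_v2 = {"id": int, "customer": str,
             "signup_date": lambda v: datetime.strptime(v, "%Y-%m-%d").date()}
print(read_with_schema(raw_landing_zone.getvalue(), schema_v2))
```

Note that the stored bytes never change and nothing stopped the bad row from landing; all the schema work simply moved to the reader.  That is “enforce on read” – it is not the same as the schema somehow defining itself after the fact.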

Overall, this report by Hortonworks really stretches the boundaries of Hadoop’s current capabilities and, to me, crosses the line.  Senior executives don’t need this kind of far-fetched promise just to push a “Hadoop everywhere” agenda.  The simple truth should be: use technology where it is best suited and NOT where the marketeers want you to use it.  Buyer beware.