The big-data landscape just got a little more interesting with the release of EMC’s Pivotal HD distribution of Hadoop. Pivotal HD takes Apache Hadoop and extends it with a data loader and command center capabilities to configure, deploy, monitor and manage Hadoop. Pivotal HD, from EMC’s Pivotal Labs division, integrates with Greenplum Database, a massively parallel processing (MPP) database from EMC’s Greenplum division, and uses HDFS as the storage technology. The combination should help organizations realize a key part of the value of big data: information optimization.
Topics: EMC, MapR, HAWQ, HDFS, Pivotal HD, Business Analytics, Business Intelligence, Cloud Computing, Cloudera, Hadoop, Hortonworks, Information Applications, Information Management, Location Intelligence, Cirro, Hive, Tableau Software
Last week I attended the IBM Big Data Symposium at the Watson Research Center in Yorktown Heights, N.Y. The event was held in the auditorium where the recent Jeopardy shows featuring the computer called Watson took place and which still features the set used for the show – a fitting environment for IBM to put on another sort of “show” involving fast processing of lots of data. The same technology featured prominently in IBM’s big-data message, and the event was an orchestrated presentation more like a TV show than a news conference. Although it announced very little news at the event, IBM did make one very important statement: The company will not produce its own distribution of Hadoop, the open source distributed computing technology that enables organizations to process very large amounts of data quickly. Instead it will rely on and throw its weight behind the Apache Hadoop project – a stark contrast to EMC’s decision, announced earlier in the week, to produce its own distribution. As an indication of IBM’s approach, Anant Jhingran, vice president and CTO for information management, commented, “We have got to avoid forking. It’s a death knell for emerging capabilities.”
The event brought together organizations presenting interesting and diverse use cases, ranging from traditional big-data stories from Web businesses such as Yahoo to less well-known scenarios: informatics in life sciences and healthcare, presented by Illumina and the University of Ontario Institute of Technology (UOIT), respectively; low-latency financial services by eZly; and customer demographic data by Acxiom.
Eric Baldeschwieler, vice president of Hadoop development at Yahoo, shared some impressive statistics about its Hadoop usage, one of the largest in the world with over 40,000 servers. Yahoo manages 170 petabytes of data with Hadoop and runs more than 5 million Hadoop jobs every month. The models it uses to help prevent spam and others that do ad-targeting are in some cases retrained every five minutes to ensure they are based on up-to-date content. As a point of reference CPU utilization on Yahoo’s Hadoop computing resources averages greater than 30% and at its best is greater than 80%. It appears from these figures that the Hadoop clusters are configured with enough spare capacity to handle spikes in demand.
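To put those figures in perspective, here is a rough back-of-envelope sketch in Python. It uses only the numbers quoted above, all of which were stated as lower bounds. The decimal unit conversion (1 PB = 1,000 TB) and the helper function names are my own assumptions, and the talk did not say whether the 170 petabytes counts raw or replicated storage, so treat the results as approximations rather than measurements.

```python
"""Back-of-envelope arithmetic on the Yahoo figures quoted above."""

# Figures as quoted in the talk; each was given as a lower bound.
SERVERS = 40_000
DATA_PB = 170
JOBS_PER_MONTH = 5_000_000
AVG_CPU_UTILIZATION = 0.30
PEAK_CPU_UTILIZATION = 0.80

# Assumption (not from the talk): decimal units, 1 PB = 1,000 TB.
TB_PER_PB = 1_000


def storage_per_server_tb(data_pb: float, servers: int) -> float:
    """Average data volume per server, in terabytes."""
    return data_pb * TB_PER_PB / servers


def jobs_per_server(jobs: int, servers: int) -> float:
    """Average number of jobs per server over the period covered by `jobs`."""
    return jobs / servers


def headroom_factor(utilization: float) -> float:
    """How many times the given CPU load would fit into full capacity."""
    return 1.0 / utilization


if __name__ == "__main__":
    print(f"Data per server:     {storage_per_server_tb(DATA_PB, SERVERS):.2f} TB")
    print(f"Jobs per server:     {jobs_per_server(JOBS_PER_MONTH, SERVERS):.0f} per month")
    print(f"Headroom at average: {headroom_factor(AVG_CPU_UTILIZATION):.2f}x")
    print(f"Headroom at peak:    {headroom_factor(PEAK_CPU_UTILIZATION):.2f}x")
```

On these inputs the averages work out to about 4.25 TB of data and 125 jobs per server per month. At 30% average utilization the clusters could absorb roughly 3.3 times the average CPU load, and even at the 80% high-water mark about a quarter more load would still fit, which is consistent with the spare-capacity reading above.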
During the discussions, I detected a bit of a debate about who is the driving force behind Hadoop. According to Baldeschwieler, Yahoo has contributed 70% of the Apache Hadoop project code, but on April 12, Cloudera claimed in a press release, “Cloudera leads or is among the top three code contributors on the most important Apache Hadoop and Hadoop-related projects in the world, including Hadoop, HDFS, MapReduce, HBase, Zookeeper, Oozie, Hive, Sqoop, Flume, and Hue, among others.” Perhaps Yahoo wants to reestablish its credentials as it mulls whether to spin out its Hadoop software unit. If such a spinoff were to occur, it could further fracture the Hadoop market.
I found it interesting that the customers IBM brought to the event, while having interesting use cases, were not necessarily leveraging IBM products in their applications. This fact led me to the initial conclusion that the event was more of a show than a news conference. Reflecting further on IBM’s stated direction of supporting the Apache Hadoop distribution, I wondered what IBM Hadoop-related products they would use. IBM will be announcing version 1.1 of InfoSphere BigInsights in both a free basic edition and an enterprise edition. The product includes BigSheets, which can integrate large amounts of unstructured Web data. InfoSphere Streams 2.0, announced in April, adds Netezza TwinFin, Microsoft SQL Server and MySQL support to other SQL sources already supported. But this event was not about those products. It was about IBM’s presence in and knowledge of the big-data marketplace. Executives did say that the IBM product portfolio will be extended “in all the places you would expect” to support big data but offered few specifics.
IBM emphasized the combination of streaming data, via InfoSphere Streams, and big data more than other big-data vendors do. The company painted a context of “three V’s” (volume, velocity and variety) of data, which attendees, Twitter followers and eventually the IBM presenters expanded to include a fourth V, validity. To illustrate the potential value of combining streaming data and big data, Dr. Carolyn McGregor, chair in health informatics at UOIT, shared how the institute is literally saving lives in neonatal intensive care units by monitoring and analyzing neonatal data in real time.
Rob Thomas, IBM vice president of business development for information management, explained the role of partners in the IBM big-data ecosystem. As stated above, IBM will rely on Apache Hadoop as the foundation of its work, but will partner with vendors further up the stack. Datameer, Digital Reasoning and Karmasphere all participated in the event as examples of the types of partnerships IBM will seek.
IBM has already demonstrated, via Watson, that it knows how to deal with large-scale data and Hadoop, but to date, if you want those same capabilities from IBM, they will have to come mostly in the form of services. The event made it clear that IBM backs the Apache Hadoop effort but not in the form of new products. In effect, IBM used its bully pulpit (not to mention its size and presence in the market) to discourage others from fragmenting the market. The announcements may also have been intended to buy time for further product developments. I look for more definition from IBM on its product roadmap. If it wants to remain competitive in the big-data market, IBM needs to articulate how its products will interact with and support Hadoop. The Hadoop and Information Management benchmark research that I am completing, and which will be released soon, will provide some facts on whether or not IBM is making the right bet on Hadoop.
Earlier this week EMC announced it will create its own distribution for Apache Hadoop. Hadoop provides distributed computing capabilities that enable organizations to process very large amounts of data quickly. As I have written previously, the Hadoop market continues to grow and evolve. In fact, the rate of change may be accelerating. Let’s start with what EMC announced and then I’ll address what the announcement means for the market.
EMC announced three new offerings, slated for the third quarter of 2011, that leverage its acquisition of Greenplum last year, ranging from an open source version to incorporation in its data warehouse appliance.
The EMC Greenplum HD Community Edition is a free, open source version of the Apache Hadoop stack comprising HDFS, MapReduce, Zookeeper, Hive and HBase. EMC extends Hadoop with fault tolerance for the NameNode and JobTracker, both of which are well-known single points of failure in standard Hadoop implementations.
The EMC Greenplum HD Enterprise Edition, interface-compatible with the Apache Hadoop stack, provides several additional features including snapshots, wide-area replication, a Network File System (NFS) interface and some management tools. EMC also claims performance two to five times that of standard packaged versions of Apache Hadoop.
The EMC Greenplum HD Data Computing Appliance integrates Apache Hadoop with the Greenplum database and computing hardware. The appliance configuration provides SQL access and analytics to Hadoop data residing on the Hadoop Distributed File System (HDFS) as external tables, eliminating the need to materialize the data in the Greenplum database.
Until now Cloudera has dominated the emerging commercial Hadoop market and faced little or no competition since it introduced the Cloudera Distribution for Hadoop (CDH). The EMC announcements are both good and bad news for Cloudera. On the one hand they suggest – you might even say validate – that Cloudera has chosen a valuable market. EMC seems to be willing to invest heavily to try to get a share of it. On the other hand, Cloudera now faces a competitor that has significant resources. For customers competition is generally a good thing, of course, as it pushes vendors to innovate and improve their products to win more business.
EMC’s approach to the market differs dramatically from IBM’s strategy. IBM announced on Twitter at its Big Data Symposium held this week that it is putting all its weight behind Apache Hadoop in the hope of avoiding the fragmentation that plagued the UNIX market for years. EMC’s Enterprise Edition promises to tackle issues well known to the Hadoop market, but EMC faces competition from others who are also tackling these issues. If lower-cost or free competitive offerings adequately address these issues, they could seriously undercut the market for EMC’s Enterprise Edition. While EMC brings more enterprise credentials to the Hadoop market than Cloudera, it has less experience with Hadoop. Multiple vendors are attempting to bring enterprise-class capabilities to Hadoop, and it’s too soon to see who will succeed. However, overall, the Hadoop market will benefit from all the attention and investment.
I find it interesting and a little ironic that prior to its acquisition by EMC, Greenplum (along with Aster Data, now part of Teradata) helped popularize MapReduce, one of Hadoop’s most commonly used components, by embedding MapReduce as part of its databases. These proprietary implementations could be credited with helping to bring Hadoop into the mainstream big-data market because they combined data warehousing with MapReduce. They also spawned a debate in which database guru Mike Stonebraker at first dismissed MapReduce and then embraced it. The debate attracted attention, a key ingredient in building any new market. Now EMC Greenplum completes the circle by embracing Hadoop.
To its credit, EMC aligned a dozen partners around these announcements, creating an ecosystem of third-party products and services. Concurrent, CSC, Datameer, Informatica, Jaspersoft, Karmasphere, MicroStrategy, Pentaho, SAS, SnapLogic, Talend and VMware all announced their support for the EMC products in one form or another. Most of these companies also partner with Cloudera, so this is a good move but not a coup for EMC.
The Hadoop market continues to evolve. We are now analyzing the data collected in our benchmark research on the state of the large-scale data market, now commonly called the big-data market, including Hadoop. Stay tuned for the results. It will be interesting to see where the market ends up. I expect more changes and innovation driven in part by the increased competition.
The Hadoop market is no longer a one-elephant race.
David Menninger – VP & Research Director
Topics: Aster Data, Big Data, BigData, EMC, Social Media, Operational Performance, Apache Hadoop, Business Analytics, Business Collaboration, Business Intelligence, Cloud Computing, Cloudera, Customer & Contact Center, Greenplum, Hadoop, Information Applications, Information Management