

Big data has great promise for many organizations today, but they also need technology to facilitate integration of various data stores, as I recently pointed out. Our big data integration benchmark research makes it clear that organizations are aware of the need to integrate big data, but most have yet to address it: In this area our Performance Index analysis, which assesses competency and maturity of organizations, concludes that only 13 percent reach the highest of four levels, Innovative. Furthermore, while many organizations are sophisticated in dealing with the information, they are less able to handle the people-related areas, lacking the right level of training in the skills required to integrate big data. Most said that the training they provide is only somewhat adequate or inadequate.

Big data is still new to many organizations, and they face challenges in integrating big data that prevent them from gaining full value from their existing and potential investments. Our research finds that many lack confidence in processing large volumes of data. More than half (55%) of organizations characterized themselves as only somewhat confident or not confident in their ability to accomplish that task. They have even less confidence in their ability to process data that arrives at high velocity: Only 29 percent said they are somewhat confident or not confident in that. In dealing with the variety of big data, confidence is somewhat stronger, as more than half (56%) declared themselves confident or very confident. Assurance in one aspect is often found in others: 86 percent of organizations that said they are very confident in their ability to integrate the variety of big data are satisfied with how they manage the storage of big data. Similarly 91 percent of those that are confident or very confident with their data quality are satisfied with the way they manage the storage of big data.

Turning to the technology being used, we find only one-third (32%) of organizations satisfied with their current data integration technology, but twice as many (66%) are satisfied with their data integration processes for loading and creating big data. A substantial majority (86%) of those very confident in their ability to integrate the needed variety of big data are satisfied with their existing data integration processes. Those that are not satisfied said the process is too slow (61%), analytics are hard to build and maintain (50%) and data is not readily available (39%). These findings indicate that making a commitment to data integration, for big data and otherwise, can pay off in confidence and satisfaction with the processes for doing it. Additionally, organizations that use dedicated data integration technology (86%) are satisfied much more often than those that don’t use dedicated technology (52%).

New types of big data technologies are being introduced to meet expanding demand for storage and use of information across the enterprise. One of those fast-growing technologies is the open source Apache Hadoop and commercial enterprise versions of it that provide a distributed file system to manage large volumes of data. The research finds that currently 28 percent of organizations use Hadoop and about as many more (25%) plan to use it in the next two years. Nearly half (47%) have Hadoop-specific skills to support big data integration. For those that have limited resources, open source Hadoop can be affordable, and to automate and interface with it, adopters can use SQL in addition to its native interfaces; about three in five organizations now use each of these options. Hadoop can be a capable tool to implement big data but must be integrated with other information and operational systems.

Big data is not found only in conventional in-house information environments. Our research finds that data integration processes are most often applied between systems deployed on-premises (58%), but more than one-third (35%) are integrating cloud-based systems, which reflects the progress cloud computing has made. Nonetheless, cloud-to-cloud integration remains least common (18%). In the next year or two 20 to 25 percent of organizations plan additional support for all types of integration; those being considered most often are cloud-to-cloud (25%) and on-premises-to-cloud (23%), further reflecting movement into the cloud. In addition, nearly all (95%) organizations using cloud-to-cloud integration said they have improved their activities and processes. This finding confirms the value of integration of big data regardless of what types of systems hold it. With a growing number of organizations using cloud computing, data integration is a critical requirement for big data projects; more than one-quarter (28%) of organizations are deploying big data integration into cloud computing environments.

Because business units and processes have an intense need for big data, integration requires IT and business people to work together to build efficient processes. The largest percentage of organizations in the research (44%) have business analysts work with IT to design and deploy big data integration. Another one-third assign IT to build the integration, and half that many (16%) have IT use a dedicated data integration tool. The research finds some distrust in involving the business side. Almost one in four (23%) said they are resistant or very resistant to allowing business users to integrate big data that IT has not prepared first, and the majority (51%) resist somewhat. For more than half (58%) the IT group responsible for BI and data warehouse systems also is the key stakeholder for designing and deploying big data integration; no other option is used by more than 11 percent.

It is not surprising that IT is the department that most often facilitates big data and needs integration the most (55%). The most frequent issue arising between business units and IT is entrenchment of budgets and priorities (in 42% of organizations). Funding of big data initiatives most often comes from the general IT budget (50%); line-of-business IT budgets (38%) are the second-most commonly used. It is understandable that IT dominates this heavily technical function, but big data is beneficial only when it advances the organization’s goals for information that is needed by business. Management should ensure that IT works with the lines of business to enable them to get the information they need to improve business processes and decision-making and not settle for creating a more cost-effective and efficient method to store it.

Overcoming these challenges is a critical step in the planning process for big data. My analysis that big data won’t work well without integration is confirmed by the research. We urge organizations to take a comprehensive approach to big data and evaluate dedicated tools that can mitigate risks that others have already encountered.

Regards,

Mark Smith

CEO and Chief Research Officer


Organizations should consider multiple aspects of deploying big data analytics. These include the type of analytics to be deployed, how the analytics will be deployed technologically and who must be involved both internally and externally to enable success. Our recent big data analytics benchmark research assesses each of these areas. How an organization views these deployment considerations may depend on the expected benefits of the big data analytics program and the particular business case to be made, which I discussed recently.

According to the research, the most important capability of big data analytics is predictive analytics (64%), but among companies that have deployed big data analytics, descriptive analytic approaches of query and reporting (74%) and data discovery (64%) are more readily available than predictive capabilities (57%). Such statistics may be a function of big data technologies such as Hadoop and its associated distributions having prioritized the ability to run descriptive statistics through standard SQL, which is the most common method for implementing analysis on Hadoop. Cloudera’s Impala, Hortonworks’ Stinger (an extension of Apache Hive), MapR’s Drill, IBM’s Big SQL, Pivotal’s HAWQ and Facebook’s open-source contribution of Presto SQL all focus on accessing data through an SQL paradigm. It is not surprising then that the technology research participants use most for big data analytics is business intelligence (75%) and that the most-used analytic methods (pivot tables at 46%, classification at 39% and clustering at 37%) are descriptive and exploratory in nature. Similarly, participants said that visualization of big data allows analysts to perform faster analysis (49%), understand context better (48%), perform root-cause analysis (40%) and display multiple result sets (40%), but visualization does not provide more advanced analytic capabilities. While various vendors now offer approaches to run advanced analytics on big data, the research shows that in terms of big data, organizational capabilities still revolve around more basic analytic access.

For companies that are implementing advanced analytic capabilities on big data, there are further analytic process considerations, and many have not yet tackled those. Model building and model deployment should be manageable and timely, involve specialized personnel, and integrate into the broader enterprise architecture. While our research provides an in-depth look at adoption of the different types of in-database analytics, deployment of advanced analytic sandboxes, data mining, model management, integration with business processes and overall model deployment, that is beyond the topic here.

Beyond analytic considerations, a host of technological decisions must be made around big data analytics initiatives. One of these is the degree of customization necessary. As technology advances, customization is giving way to more packaged approaches to big data analytics. According to our research, the majority (54%) of companies that have already implemented big data analytics did custom builds using big data-specific languages and interfaces. Most of those that have not yet deployed are likely to purchase a dedicated or packaged application (44%), followed by a custom build (36%). We think that this pre- and post-deployment comparison reflects a maturing market.

The move from custom approaches to standardized ones has important implications for the skill sets needed for a big data analytics initiative. In comparing the skills that organizations said they currently have to the skills they need to be successful with big data analytics, it is clear that companies should spend more time building employees’ statistical, mathematical and visualization skills. On the flip side, organizations should make sure their tools can support skill sets that they already have, such as use of spreadsheets and SQL. This is consistent with other findings about training needs, which include applying analytics to business problems (54%), training on big data analytics tools (53%), analytic concepts and techniques (46%) and visualizing big data (41%). The data shows that as approaches become more standardized and the market focus shifts toward them from customized implementations, skill needs are shifting as well. This is not to say that demand is moving away from the data scientist completely. According to our research, organizations that involve cross-functional teams or data scientists in the deployment process are realizing the most significant impact. It is clear that multiple approaches for personnel, departments and current vendors play a role in deployments and that some approaches will be more effective than others.

Cloud computing is another key consideration with respect to deploying analytics systems as well as sandbox modelling and testing environments. For deployment of big data analytics, 27 percent of companies currently use a cloud-based method, while 58 percent said they do not and 16 percent do not know what is used. Not surprisingly, far fewer IT professionals (19%) than business users (40%) said they use cloud-based deployments for big data analytics. The flexibility and capability that cloud resources provide are particularly attractive for sandbox environments and for organizations that lack big data analytic expertise. However, for big data model building, the largest share of organizations (42%) still use a dedicated internal sandbox environment to build models, while fewer (19%) use a non-dedicated internal sandbox (that is, a container in a data warehouse used to build models) and others use a cloud-based sandbox either as a completely separate physical environment (9%) or as a hybrid approach (9%). From the difference between IT and business responses noted above, we infer that business users are sometimes using cloud-based systems to do big data analytics without the knowledge of IT staff. Among organizations that are not using cloud-based systems for big data analytics, security (45%) is the primary reason that they do not.

Perhaps the most important consideration for big data analytics is choosing vendors to partner with to achieve organizational objectives. When we understand the move from custom technological approaches to more packaged ones and the types of analytics currently being implemented for big data, it is not surprising that a majority of research participants (52%) are looking to their business intelligence systems providers to supply their big data analytics solution. However, a significant number of companies (35%) said they will turn to a specialist analytics provider or their database provider (34%). When evaluating big data analytics, usability is the most important vendor consideration but not by as wide a margin as in categories such as business intelligence. A look at criteria rated important and very important by research participants reveals usability is the highest ranked (94%), but functionality (92%) and reliability (90%) follow closely. Among innovative new technologies, collaboration is important (78%) while mobile access (46%) is much less so. Coupled with the finding that communication and knowledge sharing combined is an important benefit of big data analytics, it is clear that organizations are cognizant of the collaborative imperative when choosing a big data analytics product.

Deployment of big data analytics starts with forethought and a well-defined business case that includes the expected benefits I discussed in my previous analysis. Once the outcome-driven framework is established, organizations should consider the types of analytics needed, the enabling technologies and the people and processes necessary for implementation. To learn more about our big data analytics research, download a copy of the executive summary here.

Regards,

Tony Cosentino

VP & Research Director
