Big Data: Who Are The Best Hadoop Vendors In 2017?

Originally published by Bernard Marr on LinkedIn: Big Data: Who Are The Best Hadoop Vendors In 2017?

Hadoop is the open source software framework at the heart of much of the Big Data and analytics revolution. It provides solutions for enterprise data storage and analytics with almost unlimited scalability. Since its 1.0 release in 2011, it has grown rapidly in popularity, and a strong ecosystem of distributors, vendors and consultants has emerged to support its use across industry.

At its core, Hadoop is an open source system, which, among other considerations, means it is essentially free for anyone to use. However, the need to tailor it to individual organizations has led to the emergence of many commercial distributions. These generally come packaged with support or additional features designed to streamline deployment or to let users build additional analytics, security or data handling into their framework.

Competition in this market is fierce and the landscape is constantly shifting – for example, all the top distributions now include the Apache Spark parallel processing framework, which was not the case a few years ago. The growing prominence of Spark has led many vendors to dedicate more resources to Spark deployment and support.

One important factor to consider in choosing a Hadoop distribution is whether you want an on-premises or cloud-based solution. If there is no room to compromise when it comes to maintaining complete control and ownership of your data, an on-site solution still theoretically offers the highest level of security. In recent years, though, cloud solutions have become less expensive, more flexible and easier to scale.

Most of the vendor products here can be installed in the cloud or on-premises. However, some cannot be run on-site. These are generally products from web service providers, such as Amazon or Microsoft, running either their own distributions or Hadoop distributions from other, platform-focused vendors such as Hortonworks or MapR.

Beyond that, all of the top distributions have subtle differences which could make them more or less suitable for your business. Here’s a non-exhaustive guide to some of the most popular on the market today.

Amazon Elastic MapReduce

Amazon offers a cloud-only Hadoop-as-a-service platform through its Amazon Web Services arm. A key advantage of the pay-as-you-go model used by cloud-only service providers is scalability: storage and data processing can be ramped up or wound down as demands change. Amazon recently announced that customers can now use the Apache Flink stream processing framework for real-time data analytics on the platform, along with other popular tools such as Kafka and Presto. As you would expect, it also connects seamlessly with Amazon’s other cloud services, such as EC2 for cloud processing, Amazon S3 and DynamoDB for storage, and AWS IoT for collecting data from Internet of Things-enabled devices.
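
For readers who want to see what "ramping up" looks like in practice, here is a minimal sketch of how a cluster request for this service might be put together in Python. It only builds the arguments that the AWS SDK for Python (boto3) accepts in its EMR `run_job_flow` call and does not contact AWS. The helper name `build_emr_cluster_request`, the cluster name, the release label and the instance type are illustrative assumptions, not recommendations; the role names are the AWS defaults and may differ in your account.

```python
import json


def build_emr_cluster_request(name, core_nodes,
                              applications=("Hadoop", "Spark", "Flink", "Presto")):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    Hypothetical helper for illustration only. The release label and
    instance type below are assumptions; check which applications your
    chosen EMR release actually supports.
    """
    if core_nodes < 1:
        raise ValueError("core_nodes must be at least 1")
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.3.0",  # assumed release label
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m4.large",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m4.large",  # assumed instance type
                    # Pay-as-you-go capacity: raise or lower this number.
                    "InstanceCount": core_nodes,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # AWS default role names;
        "ServiceRole": "EMR_DefaultRole",      # yours may differ
    }


if __name__ == "__main__":
    request = build_emr_cluster_request("analytics-demo", core_nodes=2)
    print(json.dumps(request, indent=2))
    # Launching for real requires boto3, AWS credentials, and incurs charges:
    #   import boto3
    #   boto3.client("emr", region_name="us-east-1").run_job_flow(**request)
```

Changing `core_nodes` is the whole point of the sketch: the same request describes a two-node experiment or a much larger cluster, and you pay only for what is running.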