What is Hadoop?
Hadoop is an open-source project from the Apache Foundation, introduced in 2006 and developed in Java. Its objective is to offer a working environment suited to the demands of Big Data (the 4 V's). As such, Hadoop is designed to handle large Volumes of data, both structured and unstructured (Variety), and to process them securely and efficiently (Veracity and Velocity).
In the beginning, as the volume of data being created grew, the traditional solution was to invest in more powerful computers, with greater storage and processing capacity, which were of course more expensive. It soon became clear that this approach could not keep pace and that a better way was needed. The key was distribution, of both storage and processing: the work is spread across several computers operating together as "clusters". Each cluster has one or more master nodes in charge of managing the distributed files, whose information is stored in blocks across the cluster, and of coordinating and executing the different tasks among the cluster's members.
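To make the idea of block-based, replicated storage concrete, here is a minimal sketch using Hadoop's Java FileSystem API to ask HDFS how a file is laid out. It assumes a reachable cluster configured via the usual core-site.xml/hdfs-site.xml on the classpath, and the file path is purely hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
  public static void main(String[] args) throws Exception {
    // Connects to the cluster described by the configuration on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file, used only for illustration.
    Path file = new Path("/user/demo/measurements.csv");
    FileStatus status = fs.getFileStatus(file);

    System.out.println("Replication factor: " + status.getReplication());
    System.out.println("Block size (bytes): " + status.getBlockSize());

    // Each block reports the nodes that hold one of its replicas.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on " + String.join(", ", block.getHosts()));
    }
  }
}
```

Running this against a multi-node cluster shows the same file split into blocks and each block present on several machines, which is exactly the property the rest of this section builds on.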
When working in a distributed way, the key challenges are: accessing the data, processing it quickly, and avoiding information loss if one of the nodes fails.
Hadoop answered these problems by providing:
- Reliability. It distributes the data and tasks between different nodes. If one node fails, its task is automatically reassigned to another. Additionally, data is not lost, because it is replicated on other nodes of the cluster.
- Horizontal scalability. A program can be tested on one machine and then scaled out to 1,000 or even 4,000 machines.
- Simple APIs, both for processing and accessing data (see the WordCount sketch after this list).
- Power. Hadoop divides the processing between different machines, which allows users to process enormous volumes of data in reasonable time.
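As an illustration of the processing API mentioned in the list, below is a sketch of the classic WordCount job written against the Hadoop MapReduce Java API: mappers emit (word, 1) pairs for their input split, and reducers sum the counts per word across the cluster. Class names and paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word produced by all the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output paths are passed on the command line, e.g.
    //   hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same code runs unchanged whether the job is executed on a laptop or on a cluster of thousands of nodes; the framework handles splitting the input, scheduling tasks, and re-running any task whose node fails.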
Its ability to distribute storage and processing across a large number of machines, and to provide redundancy in software, is another key advantage of Hadoop. There is no need to buy specialized hardware or costly RAID systems; "commodity hardware" can be used instead, which offers great flexibility and significant savings.
Another advantage of Hadoop is its scalability, especially when run on cloud platforms such as Microsoft Azure, Amazon Web Services, and Google Compute Engine. This allows users to add and remove resources as business needs change. In these cases one normally uses the platform's own storage services in order to separate compute from storage. The compute nodes then focus on processing and analyzing data rather than on system maintenance tasks.
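Because Hadoop's FileSystem abstraction also covers cloud object stores, a job can read its input directly from the platform's storage service instead of HDFS. The sketch below assumes the hadoop-aws (s3a) connector is on the classpath and uses a made-up bucket and object name purely for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloudStorageRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Hypothetical bucket and object, used only for illustration.
    // With the s3a connector available, the same FileSystem API that
    // reads from HDFS can read directly from object storage.
    Path input = new Path("s3a://example-bucket/raw/events.json");
    FileSystem fs = FileSystem.get(URI.create("s3a://example-bucket"), conf);

    try (FSDataInputStream in = fs.open(input);
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}
```

Keeping data in the object store means compute clusters can be resized or shut down without moving the data, which is what makes the elastic, pay-as-you-go model practical.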
As with other open-source projects, several vendors offer stable distributions that bundle their own tools and support. The most common are Cloudera, Hortonworks, MapR, Microsoft HDInsight, IBM InfoSphere BigInsights, and AWS EMR (Elastic MapReduce).