
Hadoop Ecosystem And Core Components

The Hadoop ecosystem is an interconnected system made up of the Apache Hadoop framework, its core components, related open source projects, and commercial Hadoop distributions.

The Hadoop framework alone cannot handle every big data task. It becomes a complete platform only when combined with its core components and the surrounding open source projects.

These projects extend the capability of the Hadoop framework, so it is the Hadoop ecosystem as a whole that addresses big data problems.

Below we describe the core components of Hadoop, its open source projects, and the commercial Hadoop distributions that together form the Hadoop ecosystem.

Hadoop Core Components

Hadoop Common- The utilities and libraries that support the other Hadoop modules are collectively known as Hadoop Common. This module was originally called Hadoop Core and was renamed Hadoop Common in July 2009.

HDFS- The Hadoop Distributed File System (HDFS) stores large amounts of data across the machines of a cluster. Its key feature is fault tolerance: each block of data is replicated on several machines, so if data is lost on one machine it can be recovered from a replica on another.
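The replication idea can be illustrated with a small pure-Python sketch. This is not the HDFS API (real HDFS is accessed through Hadoop's Java or CLI interfaces); the function and node names below are hypothetical and only model the placement logic:

```python
# Toy illustration of HDFS-style block replication (NOT the real HDFS API):
# each block is copied to `replication` distinct nodes, so losing one node
# does not lose data as long as a replica survives elsewhere.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

def recoverable(placement, failed_node):
    """Data survives a node failure if every block has a replica elsewhere."""
    return all(any(n != failed_node for n in replicas)
               for replicas in placement.values())

placement = place_replicas(["blk_1", "blk_2", "blk_3"],
                           ["node-a", "node-b", "node-c", "node-d"])
print(recoverable(placement, "node-b"))  # True: every block still has replicas
```

With the default replication factor of 3 (also HDFS's default), any single machine failure leaves at least two copies of every block.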

MapReduce- A programming model for processing large amounts of data in parallel. MapReduce scales out by splitting a job into many map and reduce tasks that run across the cluster.
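The model itself is simple enough to mimic in a few lines of plain Python. This is only a single-process sketch of the map, shuffle, and reduce phases; real Hadoop jobs are written against the Java MapReduce API and run distributed:

```python
from collections import defaultdict

# Pure-Python sketch of the MapReduce programming model (word count).
# Only mimics the map -> shuffle -> reduce phases in one process.

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big data", "big cluster"])))
print(counts)  # {'big': 2, 'data': 1, 'cluster': 1}
```

Scalability comes from the fact that map tasks and reduce tasks are independent per input split and per key, so the framework can run them on as many machines as the cluster has.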

YARN- Yet Another Resource Negotiator(YARN) is used to manage cluster resources and for scheduling Hadoop jobs.

Hadoop Open Source Projects

Below we describe the Hadoop open source projects that extend the framework's ability to deal with big data problems.

Apache Spark

It is a distributed in-memory computing framework developed to overcome a key limitation of MapReduce, which writes intermediate results to disk between stages.

Keeping data in memory makes Spark considerably faster, and it provides a programming interface with fault tolerance and parallel data processing for executing machine learning, ETL, and SQL workloads.

Apache Spark can be integrated with Hadoop YARN or Apache Mesos for cluster management, and it can interface with HDFS, MapR-FS, and other systems for distributed storage of large datasets.

Apache Hive

Hive is open source data warehouse software for organizing, storing, and querying large amounts of data in a Hadoop cluster using SQL-like queries (HiveQL). A JDBC driver lets client applications connect to Hive to read, write, and manage the data.
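A HiveQL statement looks like ordinary SQL; the table, columns, and HDFS path below are illustrative, not from any particular deployment:

```sql
-- Illustrative HiveQL: define an external table over delimited files in HDFS,
-- then query it with familiar SQL syntax.
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

Under the hood, Hive compiles such queries into jobs (classically MapReduce) that run on the cluster, so analysts can query big data without writing Java.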

Apache Pig

Pig is a high-level scripting platform used for processing and analyzing a large amount of data stored in the Hadoop cluster.

It can perform the following big data operations:

  • ETL processing
  • Large-scale iterative data processing
  • Raw data processing for research purposes
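For a sense of the scripting style, here is an illustrative Pig Latin word-count script (the input and output paths are placeholders):

```pig
-- Illustrative Pig Latin: word count over a text file stored in HDFS.
lines   = LOAD '/data/input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
STORE counts INTO '/data/wordcount';
```

Each statement defines a data-flow step; Pig translates the whole script into jobs that run on the Hadoop cluster.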

Apache HBase

HBase is a distributed NoSQL database for the Hadoop framework, written in Java. It enables programmers to handle very large tables with huge numbers of rows and columns.

With Apache HBase, you can read and write data stored in Hadoop with random, real-time access. It is well suited to semi-structured data such as log files.
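For a quick feel of the data model, these are commands typed at the interactive HBase shell (table and column names are illustrative):

```
hbase> create 'weblogs', 'cf'                      # table with one column family
hbase> put 'weblogs', 'row1', 'cf:status', '200'   # write one cell
hbase> get 'weblogs', 'row1'                       # read the row back
hbase> scan 'weblogs', {LIMIT => 5}                # scan the first few rows
```

Rows are keyed and sorted by row key, and columns are grouped into column families, which is what lets HBase handle sparse, wide tables efficiently.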

Apache Oozie

Apache Oozie is a Java-based workflow scheduler for managing jobs within the Hadoop cluster. It integrates with YARN and can orchestrate MapReduce, Pig, Hive, and Sqoop jobs over data in HDFS.
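Workflows are defined in XML. Below is a minimal sketch of a workflow.xml with a single Pig action; the workflow name, script, and properties are illustrative:

```xml
<!-- Minimal sketch of an Oozie workflow definition (workflow.xml). -->
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="run-pig"/>
  <action name="run-pig">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>wordcount.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Each action declares where control flows on success (`ok`) and on failure (`error`), so Oozie can chain many Hadoop jobs into one managed workflow.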

Apache Mahout

Mahout is a library of machine learning algorithms, originally written using the MapReduce paradigm. Its data science tools extract meaningful information and patterns from big data stored in HDFS, and it can be used for collaborative filtering, classification, clustering, and more.

Apache ZooKeeper

ZooKeeper is a simple, reliable, fast, and scalable coordination service used for configuration management, naming, distributed synchronization, and notifications.

Several other Hadoop-related projects extend the framework's ability to handle large datasets. These are listed below.

  • Apache Sqoop
  • Apache Ambari
  • Apache Avro
  • Apache Cassandra
  • Apache Chukwa
  • Apache Impala
  • Apache Flume
  • Apache Tez
  • Apache Kafka
  • Apache Tajo
  • Apache Falcon
  • Apache Atlas
  • Apache Accumulo
  • Apache Storm
  • Apache Ranger
  • Apache Solr
  • Apache Knox
  • Apache Phoenix
  • Apache Nifi
  • Apache HAWQ
  • Apache Zeppelin
  • Apache Slider
  • Apache Metron

Commercial Hadoop Distributions

Listed below are vendors that provide big data solutions through their own Apache Hadoop distributions.

  • Cloudera’s Distribution including Apache Hadoop (CDH)
  • Hortonworks Data Platform and Data Flow (HDP and HDF)
  • MapR Hadoop Distribution
  • Amazon EMR (Elastic MapReduce)
  • IBM Open Platform
  • Microsoft HDInsight
  • Pivotal HD
  • Intel Distribution for Apache Hadoop
  • Datastax Enterprise Analytics
  • Teradata Enterprise for Hadoop
  • Dell, Cloudera Apache Hadoop Solutions
