Hadoop Architecture

Hadoop is composed of four core components: Hadoop Common, the Hadoop Distributed File System (HDFS), MapReduce and YARN.

Hadoop Common

A module containing the utilities that support the other Hadoop components.

HDFS

A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.

MapReduce

A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.
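The map, shuffle and reduce steps can be sketched in a few lines of plain Python. This is only a single-process illustration of the programming model, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real job would run the map and reduce tasks in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values collected for one key."""
    return key, sum(values)


def word_count(lines):
    # In Hadoop each line (or input split) could be mapped on a different node.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both steps can be spread across machines and rerun on another machine if one fails.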

Yet Another Resource Negotiator (YARN)

The resource management layer introduced with next-generation MapReduce, which assigns cluster resources such as CPU and memory to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
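The idea of handing out CPU and memory to applications can be illustrated with a toy allocator. This is a minimal sketch, assuming a simple first-fit policy; the `Node` class and `allocate` function are hypothetical, and real YARN schedulers use considerably richer policies such as queues and fairness.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_memory_mb: int
    free_vcores: int


def allocate(nodes, memory_mb, vcores):
    """Reserve a container on the first node with enough free resources.

    Returns the node name, or None if no node can host the request.
    """
    for node in nodes:
        if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
            node.free_memory_mb -= memory_mb
            node.free_vcores -= vcores
            return node.name
    return None


cluster = [Node("node-1", 4096, 2), Node("node-2", 8192, 4)]
placements = [
    allocate(cluster, 3072, 2),  # fits on node-1
    allocate(cluster, 2048, 1),  # node-1 is out of vcores, so node-2
    allocate(cluster, 8192, 1),  # no node has 8192 MB free any more
]
print(placements)
```

The requests here are not tied to MapReduce in any way, which is the point: any framework that can express its needs as resource requests can share the same cluster.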

Data Access Projects

Pig

A platform whose high-level language, Pig Latin, is designed to handle any type of data, helping users focus more on analyzing large data sets and less on writing map and reduce programs.

Hive

A Hadoop runtime component that allows users fluent in SQL to write Hive Query Language (HQL) statements, which are similar to SQL statements. These statements are broken down into MapReduce jobs and executed across the cluster.
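To see how a SQL-like statement can map onto a MapReduce job, consider a simple GROUP BY. The sketch below is only a conceptual illustration in plain Python, not how Hive is implemented; the `employees` table and its rows are invented for the example.

```python
from collections import defaultdict

# Statement being illustrated (shown only as a comment):
#   SELECT dept, COUNT(*) FROM employees GROUP BY dept;

# Hypothetical table contents.
employees = [
    {"name": "Ana", "dept": "sales"},
    {"name": "Raj", "dept": "engineering"},
    {"name": "Li", "dept": "sales"},
]


def map_row(row):
    """Map: the GROUP BY column becomes the key, each row counts as 1."""
    return row["dept"], 1


def reduce_group(dept, ones):
    """Reduce: COUNT(*) is the sum of the 1s emitted for that key."""
    return dept, sum(ones)


# Shuffle: collect the mapped values under their key.
groups = defaultdict(list)
for dept, one in map(map_row, employees):
    groups[dept].append(one)

result = dict(reduce_group(dept, ones) for dept, ones in groups.items())
print(result)
```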

Flume

A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. Its main goal is to deliver data from applications to HDFS.

HCatalog

A table and storage management service for Hadoop data that presents a table abstraction so the user does not need to know where or how the data is stored.

Jaql

A query language designed for JavaScript Object Notation (JSON) and used primarily to analyze large-scale semi-structured data. Core features include user extensibility and parallelism.

Avro

An Apache open source project that provides data serialization and data exchange services for Hadoop.

Spark

An open-source cluster computing framework with in-memory analytics performance that is up to 100 times faster than MapReduce, depending on the application.

Sqoop

A tool for the bulk transfer of data between Hadoop and structured data sources such as relational databases.

HBase

A column-oriented, non-relational (NoSQL) database that runs on top of HDFS and is often used for sparse data sets.
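Why a column-oriented store suits sparse data can be shown with a small in-memory model. This is a hedged sketch, not the HBase client API: the `SparseTable` class and its methods are hypothetical, and only the `family:qualifier` column naming follows HBase convention.

```python
class SparseTable:
    """Toy model: each row stores only the columns that actually have a value."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        # Columns are created on demand; there is no fixed schema to pad out.
        self._rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column, default=None):
        return self._rows.get(row_key, {}).get(column, default)

    def stored_cells(self):
        return sum(len(columns) for columns in self._rows.values())


table = SparseTable()
table.put("user1", "info:email", "a@example.com")
table.put("user2", "info:phone", "555-0100")
table.put("user2", "stats:logins", 7)

print(table.get("user1", "info:phone"))  # missing cell, nothing was stored for it
print(table.stored_cells())
```

Two rows and three distinct columns would occupy six cells in a fixed relational layout; here only the three cells that hold values take up space.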

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. It was originally built as infrastructure for the Apache Nutch web search engine project and is now an Apache Hadoop subproject.
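HDFS achieves its fault tolerance on low-cost hardware by splitting files into large blocks and keeping several copies of each block on different nodes. The sketch below illustrates that idea only. The helper names and the round-robin placement are assumptions made for this example (real HDFS placement is rack-aware), and the 128 MB block size and replication factor of 3 are common defaults rather than values taken from the text above.

```python
def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block needed to hold the file."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes) so the copies land on different nodes.
    """
    return {
        block: [nodes[(block + i) % len(nodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


MB = 1024 * 1024
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id] // MB, "MB ->", holders)
```

With three copies of every block, the loss of any single node still leaves two readable copies, which is what lets the file system tolerate routine hardware failures.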

