Things to Know About Hadoop Distributed File System

Tarun Manrai
3 min read · Mar 2, 2020


[Image: Hadoop technology stack]

The Hadoop Distributed File System (HDFS) is the distributed file system and data storage layer used by Hadoop applications. It is designed to run on commodity hardware. HDFS is an integral part of many Hadoop ecosystem technologies, as it provides a reliable way to manage large volumes of data and to support big data analytics applications. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.

Features:

· Distributed data storage.

· Blocks reduce seek time.

· High fault tolerance

· Data is highly available because every block is replicated on several different data nodes (see the configuration sketch after this list).

· Highly reliable: users can keep working even if some data nodes fail.
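
Block size and replication factor are ordinary configuration settings. Below is a minimal sketch using the Hadoop Java client; dfs.blocksize and dfs.replication are real HDFS property names, while the NameNode address hdfs://namenode:9000 is only a placeholder for your own cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        // Split files into 128 MB blocks (the common default) so seeks stay rare.
        conf.set("dfs.blocksize", "134217728");
        // Keep three copies of every block on different data nodes.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
        fs.close();
    }
}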

How does HDFS work?

HDFS supports the transfer of data between the nodes of a cluster and is closely coupled with the MapReduce programming framework. When HDFS receives data, it breaks the data into blocks and distributes those blocks across the nodes of the cluster. HDFS is specifically designed to be fault-tolerant: it replicates every block several times, spreads the copies across individual nodes, and places at least one copy on a different rack.
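
As a rough illustration (not the only way to do it), the sketch below writes a file through the HDFS Java API and requests a specific replication factor and block size; HDFS then handles splitting the stream into blocks and placing the copies. The file path and the sizes are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example/events.log");   // hypothetical path

        // Create the file with: overwrite flag, buffer size, replication
        // factor, and block size. HDFS splits the stream into 128 MB blocks
        // and places 3 copies of each, at least one on a different rack.
        FSDataOutputStream out = fs.create(
                file, true, 4096, (short) 3, 128L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();   // the file becomes complete and visible once closed
        fs.close();
    }
}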

HDFS Daemons:

Daemons are processes that run in the background. HDFS employs a NameNode and DataNode architecture to implement a distributed file system that provides high-performance access to data (a short metadata sketch follows the two lists below).

NameNodes:

· Run on the master node

· Store metadata such as file paths, the number of blocks, block IDs, and more.

· Require a large amount of RAM.

· Store metadata in RAM for quick retrieval.

DataNodes:

· Run on slave nodes

· Require a large amount of disk space, as the actual file data is stored on them.
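
To make this division of labour concrete, here is a small sketch that asks the NameNode for a file's metadata; the file contents themselves would be read from DataNodes. The path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowFileMetadata {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Metadata comes from the NameNode; no DataNode is contacted here.
        FileStatus status = fs.getFileStatus(new Path("/data/example/events.log"));
        System.out.println("Path:        " + status.getPath());
        System.out.println("Length:      " + status.getLen() + " bytes");
        System.out.println("Block size:  " + status.getBlockSize() + " bytes");
        System.out.println("Replication: " + status.getReplication());
        fs.close();
    }
}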

Assumptions and Goals

Hardware failure: Hardware failure is the norm rather than the exception. An HDFS instance consists of many server machines, each storing part of the file system's data. With such a large number of components, each with a non-trivial probability of failure, some component of HDFS is almost always non-functional. Therefore, one of the core architectural goals of HDFS is to detect faults and recover from them automatically.

Large data sets: Applications that run on HDFS work with large data sets, typically gigabytes to terabytes in size. HDFS is therefore tuned to support large files and tens of millions of files in a single instance.

Streaming data access: Applications that run on HDFS need streaming access to their data sets, so HDFS is designed for batch processing and emphasizes high throughput rather than low latency.
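
A minimal sketch of this access pattern, assuming a hypothetical file path: the file is read sequentially from start to finish, which is exactly the workload HDFS is tuned for.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class StreamWholeFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/data/example/events.log"));
        try {
            // Stream the file sequentially to stdout in 4 KB chunks;
            // overall throughput matters here, not the latency of one read.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}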

Simple coherency model: HDFS applications need a write-once-read-many access model for files. Once a file is created, written, and closed, it is not changed, an assumption that greatly simplifies data coherency issues.
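
Roughly sketched, the model looks like this: a file is written once, after which any number of readers can open it; there is no API for overwriting bytes in the middle of an existing file, and appending at the end is the only way to add data where the cluster supports it. The path is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceReadMany {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/example/immutable.txt");   // hypothetical path

        // Write once: after close(), the contents are fixed.
        FSDataOutputStream out = fs.create(file, true);
        out.writeUTF("written exactly once");
        out.close();

        // Read many: any number of readers can open the file,
        // but none of them can modify bytes that were already written.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();

        // The only way to add data is to append at the end (where supported):
        // fs.append(file);
        fs.close();
    }
}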

Moving computation is cheaper than moving data: HDFS provides interfaces for applications to move themselves closer to where the data is located, rather than pulling the data across the network to the computation.
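
A rough sketch of the interface that enables this: an application (MapReduce is the classic example) can ask where each block of a file physically lives and then schedule its work on or near those hosts. The path is again hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example/events.log"));

        // One entry per block; each lists the DataNodes holding a replica.
        // A scheduler can place its tasks on (or near) these hosts.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + ", length " + block.getLength()
                    + ", hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}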

Limitations of HDFS

Small file problem: HDFS is inefficient at storing and accessing large numbers of small files; each file adds metadata that the NameNode must keep in memory, and reading many small files means constantly hopping from one data node to another.

Low-latency data access: Applications that need low-latency access to data do not work well with HDFS, because HDFS is designed for high-throughput access at the expense of latency.

Why HDFS?

Because HDFS is typically deployed as part of large-scale installations, its support for low-cost commodity hardware is a particularly useful feature. Systems that run web search and related applications can span thousands of nodes and store hundreds of petabytes of data. LinkedIn, Facebook, and Twitter are some of the web companies that use HDFS for big data analytics.

Visit www.entradasoft.com to learn more about companies offering HDFS solutions.
