Hadoop Interview questions and answers

Top Hadoop Interview Questions and Answers

Hadoop is an open-source software framework that stores massive amounts of data and runs applications on clusters of commodity hardware. Hadoop can handle virtually unlimited concurrent jobs or tasks. It also provides a vast amount of storage space for any kind of data.

Big data and Hadoop offer excellent processing power. In the energy industry especially, analytics powered by Hadoop is used for predictive maintenance, with input from Internet of Things (IoT) devices feeding data into big data programs. The demand for Hadoop experts is steadily on the rise, and they earn lucrative salaries in the Big Data job market, so upskilling as a Hadoop expert is a sound idea. A common question is how to prepare for a Hadoop interview or a Hadoop admin interview.

Interview panels ask candidates who apply for Hadoop developer or Hadoop testing roles a few general questions as well as scenario-based Hadoop questions to check their knowledge. A comprehensive list of probable questions and their answers helps prospective applicants prepare for such interviews. We will also look into questions from the entire Hadoop ecosystem, which includes HDFS, MapReduce, YARN, Hive, Pig, HBase, and Sqoop.

Hadoop Interview Questions

1.Explain Big Data

A collection of massive and complicated data sets is known as big data. Such data is difficult to process with traditional data-processing tools. Big Data also gives companies an opportunity to improve their business decision-making capacity.

2.What are the Five V’s of Big Data?

Big Data helps companies derive value from their data. The Five V’s of Big Data are as follows:

  1. Volume: The amount of data, which is growing very fast, is known as volume. E.g., data measured in exabytes.
  2. Velocity: The rate at which data grows is known as velocity. Yesterday’s data is considered to be old data today. Of late, social media has become a significant contributor to the velocity of growing data.
  3. Variety: Data is available in a variety of formats, such as video and audio. These different formats represent the variety of data.
  4. Veracity: Available data is often inconsistent or incomplete, which surrounds it with uncertainty and makes it unreliable. Veracity refers to this uncertainty. With the many different forms of big data, it becomes challenging to control accuracy and quality.
  5. Value: Big data needs to add to the benefits of the organization and help it achieve a substantial return on investment.

3.What is Hadoop and components of Apache Hadoop?

Apache Hadoop is a framework that provides different tools and services to process Big Data. The tools and services also help people to store Big Data.

Apache Hadoop has two main components:

  • Processing framework: YARN
  • Storage unit: HDFS

Hadoop helps people analyze Big Data. Experts can make business decisions with Big Data through Hadoop. Traditional systems do not allow people to make such effective decisions with Big Data.

4.What do you mean by HDFS and YARN?

The storage unit of Hadoop is known as HDFS. It stands for Hadoop Distributed File System. It follows a master and slave topology. It stores different types of data as blocks in a distributed environment.

The components of HDFS are as follows:-

  • NameNode
  • DataNode

The processing framework of Hadoop is known as YARN. It stands for Yet Another Resource Negotiator. It manages the resources of the cluster.

The two components of YARN are as follows:-

  • ResourceManager
  • NodeManager

5.What do you mean by NameNode and DataNode?

NameNode and DataNode are the two components of HDFS.

The NameNode maintains the metadata information for the blocks of data stored in HDFS. It may be described as the master node in the distributed environment. It manages all the DataNodes.

DataNodes are responsible for storing data in HDFS. They are the slave nodes.

6.What do you mean by ResourceManager and NodeManager?

ResourceManager and NodeManager are the two components of YARN. The NodeManager is installed on every DataNode. It is responsible for executing the tasks on that DataNode.

The ResourceManager receives processing requests. It then passes the parts of the requests to the corresponding NodeManagers, where the actual processing occurs.

7.In which programming language is Hadoop written?

Hadoop is written in Java.

8.Why do companies prefer to use Python with Hadoop?

Many developers use Python because of its supporting libraries for data analytics tasks. Python is a flexible programming language with many libraries and resources, and its code is easy to write and read. Many software companies need their employees to be well-versed in Python. Many companies use Python with Hadoop in the following ways:

  • Product recommendation of Amazon:

Amazon recommends different products to its users, based on their past buying history. Python has been used to build the machine learning engine of Amazon. It interacts with the Hadoop ecosystem to deliver the best quality product recommendation system.

  • Face finder application on Facebook:

Facebook has gained unprecedented popularity in the field of social media. It uses HDFS to store and extract extensive image-based unstructured data. Then, it uses Python as the backend language for image processing applications. Facebook also uses the Hadoop streaming API to edit and access the data.

9.In which modes can Hadoop run?

Hadoop can run in three modes, which are as follows:

  1. Standalone mode: It is the default mode. It uses a single Java process and the local file system to run Hadoop services.
  2. Fully-distributed mode: It uses separate nodes to run the master and slave services of Hadoop.
  3. Pseudo-distributed mode: It uses a single-node Hadoop deployment to run all Hadoop services, each in its own Java process.

10. Name the types of metadata that a NameNode server holds.

A NameNode server holds two types of metadata. They are as follows:

  1. Metadata on disk: the FsImage and the edit log.
  2. Metadata in RAM: the information about the DataNodes and the block locations.


11.What is the difference between Hadoop 1 and Hadoop 2?

In Hadoop 1.x, the “NameNode” is a single point of failure. Hadoop 2.x has Active and Passive “NameNodes”: if the active NameNode fails, the passive NameNode takes over.

12.Why are nodes frequently added to and removed from a Hadoop cluster?

Hadoop utilizes commodity hardware, which leads to frequent “DataNode” crashes in a Hadoop cluster.

The Hadoop framework also scales easily to keep up with the ever-increasing volume of data. So, a Hadoop administrator has the critical duty of adding DataNodes to and removing them from a Hadoop cluster.

13.Explain speculative execution in Hadoop?

Sometimes, a node appears to execute a task slowly. In that case, the master node may redundantly launch another instance of the same task on another node. The task that finishes first is accepted, and the other one is killed. This process is known as “speculative execution” in Hadoop.
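The rule "first to finish wins" can be illustrated with a small Python sketch. It only simulates the decision with made-up finish times; the `Attempt` class, the node names, and the `resolve_speculative` function are hypothetical names for illustration and are not part of Hadoop.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Attempt:
    node: str           # hypothetical node name
    finish_time: float  # simulated seconds until this attempt completes


def resolve_speculative(original: Attempt, backup: Attempt) -> Tuple[Attempt, Attempt]:
    """Return (accepted, killed): the attempt that finishes first is accepted."""
    if backup.finish_time < original.finish_time:
        return backup, original
    return original, backup


# The original attempt runs on a slow node, so a backup attempt is launched.
accepted, killed = resolve_speculative(
    Attempt(node="node-a", finish_time=95.0),
    Attempt(node="node-b", finish_time=40.0),
)
print(f"accepted attempt on {accepted.node}, killed attempt on {killed.node}")
```

Here the backup attempt on the faster node is accepted and the slow original attempt is the one that gets killed.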

14.What is the function of “RecordReader” in Hadoop?

A “RecordReader” loads the data from its source. It then converts the data into suitable (key, value) pairs so that the “Mapper” task can read it.
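As a rough illustration, the Python sketch below mimics what a line-oriented record reader does for plain text input: each line becomes a pair whose key is the byte offset of the line and whose value is the line itself. The function name `read_records` is hypothetical, and the sketch ignores input splits entirely; it is not Hadoop's implementation.

```python
import io
from typing import BinaryIO, Iterator, Tuple


def read_records(stream: BinaryIO) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, line_text) pairs, one per line of input."""
    offset = 0
    for raw_line in stream:
        # The value is the line without its trailing newline characters.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)


data = io.BytesIO(b"hello hadoop\nhello hdfs\n")
records = list(read_records(data))
print(records)  # [(0, 'hello hadoop'), (13, 'hello hdfs')]
```

A mapper would then be called once per pair, receiving the offset as the key and the line as the value.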

15.How is a custom partitioner written?

A custom partitioner for a Hadoop job can be written by following the steps mentioned below:

  • A new class that extends the Partitioner class is written.
  • The getPartition method is overridden.
  • The custom partitioner is set on the job, either by calling the setPartitionerClass method or through the config files.
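A real custom partitioner is a Java class, but the logic that goes inside getPartition can be sketched in Python. The rule below (route each key by its first letter) and the sample keys are assumptions chosen for illustration; the function only mirrors the shape of getPartition(key, value, numPartitions).

```python
from collections import defaultdict


def get_partition(key: str, value: int, num_partitions: int) -> int:
    """Hypothetical rule: choose the partition from the key's first letter."""
    if not key:
        return 0
    return ord(key[0].lower()) % num_partitions


pairs = [("apple", 1), ("avocado", 1), ("banana", 1), ("cherry", 1)]

# Group the keys by the partition (that is, the reducer) they would be sent to.
partitions = defaultdict(list)
for key, value in pairs:
    partitions[get_partition(key, value, 2)].append(key)

print(dict(partitions))  # {1: ['apple', 'avocado', 'cherry'], 0: ['banana']}
```

The important property is that the same key always maps to the same partition, so all values for a key reach the same reducer.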

16.Name some companies that use Hadoop?

A few companies like Yahoo, Amazon, Netflix, and Twitter use Hadoop.

17.On which concept does the Hadoop Framework work?

The Hadoop framework works on two core components, as follows:

  1. Hadoop MapReduce: It provides scalability across different clusters of Hadoop. This Java-based programming paradigm of the Hadoop framework distributes the workload so that several tasks run in parallel.
  2. HDFS: HDFS stands for Hadoop Distributed File System. This Java-based file system provides reliable storage of large datasets in the form of blocks.

18.What do you mean by Hadoop streaming?

The Hadoop distribution includes a generic application programming interface for writing Map and Reduce jobs in programming languages like Python and Ruby. It is known as Hadoop streaming.
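A minimal word count sketch in Python shows the streaming style: the mapper emits tab-separated key and value lines, and the reducer receives those lines sorted by key. In a real streaming job the mapper and the reducer would be separate scripts that read standard input and write standard output; here they are plain functions, and the `sorted` call stands in for the framework's sort step, so that the example runs on its own.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts of each word; the input must be sorted by word."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


text = ["hadoop stores data", "hadoop processes data"]
shuffled = sorted(mapper(text))  # stands in for the sort between map and reduce
result = list(reducer(shuffled))
print(result)  # ['data\t2', 'hadoop\t2', 'processes\t1', 'stores\t1']
```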

19.Name some of the most commonly used input formats in Hadoop?

The most common input formats of Hadoop are as follows:-

  1. Text input format (TextInputFormat), which is the default
  2. Key-value input format (KeyValueTextInputFormat)
  3. Sequence file input format (SequenceFileInputFormat)

20.How do people choose different file formats to store and process data in Apache Hadoop?

Several factors determine which file format people use to store and process data in Apache Hadoop. The factors are as follows:

  1. Schema evolution, to add, edit, and rename fields.
  2. Read/write/transfer performance vs. block compression to save storage space.
  3. The usage pattern, such as accessing four columns out of 40 vs. accessing most of the columns.
  4. Splittability, so that the data can be processed in parallel.

21.What do you mean by edge nodes in Hadoop?

An edge node is the interface between the Hadoop cluster and the external network. It is used to run client applications and cluster administration tools.

22.What are the techniques of using side data distribution in Hadoop?

Side data refers to the extra read-only data that a Hadoop job needs to process the original data set. Hadoop has two side data distribution techniques:

  1. Distributed Cache: Experienced practitioners suggest distributing side data through the distributed cache mechanism of Hadoop.
  2. Using the job configuration: The job configuration technique should be used to transfer only a few kilobytes of data. Transferring more than that puts pressure on the memory usage of the Hadoop daemons, especially if the system runs several Hadoop jobs simultaneously.

23.What is the best hardware configuration to run Hadoop?

Dual processors with 4 GB or 8 GB of RAM that uses ECC memory form the best hardware configuration. ECC memory is the best for running Hadoop, because users of non-ECC memory have experienced checksum errors. The right hardware configuration also depends on the workflow requirements, which are liable to change.

24.What are the advantages of using Hadoop?

The advantages of using Hadoop are as follows:

  1. Availability: Despite frequent hardware crashing, data is highly available in Hadoop. In the case of hardware crashes, the data is available from somewhere else.
  2. Reliable: Hadoop dependably stores data even after a machine fails.
  3. High scalability: People may add any number of nodes, which improves performance drastically.
  4. Economic: Hadoop runs on a cluster of commodity hardware. The hardware is relatively economical.

25.Could you describe some features of Hadoop?

The different features of Hadoop are as follows:

  1. Fault tolerance: Apache Hadoop is fault-tolerant to a great extent. By default, every block is replicated three times across the cluster. The number of replicas may be changed as needed. So, if any node goes down, the data may be recovered from some other node. The framework recovers from node failures automatically.
  2. Reliable: Even if a machine fails, we may rely on Apache Hadoop to keep the data stored on the cluster.
  3. Scalability: People may easily add new hardware to the cluster as nodes. So, Apache Hadoop is considered highly scalable.
  4. Economic: People do not need any expensive, specialized machines to use Hadoop. It runs on clusters of commodity hardware that does not burn a hole in the pocket.
  5. Open source: Hadoop is an open-source software framework. Open source means that its source code is freely available. Users are allowed to change the source code as per their requirements.

26.How would you differentiate between RDBMS and Hadoop?

There are some fundamental differences between Hadoop and RDBMS. They are as follows:

| Criterion | RDBMS | Hadoop |
| --- | --- | --- |
| Data types | RDBMS relies on structured data, and the schema of the data is always known. | Any kind of data can be stored in Hadoop, be it structured, unstructured, or semi-structured. |
| Processing | RDBMS provides limited or no processing capabilities. | Hadoop allows us to process the data, which is distributed across the cluster, in a parallel fashion. |
| Schema on read vs. write | RDBMS is based on ‘schema on write’, where schema validation is done before loading the data. | Hadoop, on the contrary, follows the ‘schema on read’ policy. |
| Read/write speed | In RDBMS, reads are fast because the schema of the data is already known. | Writes are fast in HDFS because no schema validation happens during an HDFS write. |
| Cost | RDBMS is mostly licensed software, so the software has to be paid for. | Hadoop is an open-source framework, so there is no need to pay for the software. |
| Best fit use case | RDBMS is used for OLTP (Online Transactional Processing) systems. | Hadoop is used for data discovery, data analytics, or OLAP systems. |

27.What are the disadvantages of Hadoop?

Hadoop has a few disadvantages. They are as follows:

  1. Small files: Hadoop finds it difficult to handle a large number of small files, because the NameNode has to keep metadata for every file in memory.
  2. Vulnerability: Hadoop is written in Java, a widely used programming language that is a frequent target of cybercriminals. So, Hadoop can be vulnerable to hacking and security breaches.
  3. Security: Encryption at the storage and network levels is not enabled by default. Also, Hadoop relies on Kerberos authentication, which is challenging to manage. That is a point of concern.
  4. Iterative processing: Hadoop MapReduce does not support cyclic data flow, in which the output of one stage becomes the input of the next over repeated iterations. So, it is not well suited to iterative processing.

28.Differentiate Between Hadoop 2 and Hadoop 3

There are a few fundamental differences between Hadoop 2 and Hadoop 3. They are as follows:

| Hadoop 2 | Hadoop 3 |
| --- | --- |
| Hadoop 2 has features that help it overcome SPOF (Single Point of Failure). | Hadoop 3 doesn’t require manual intervention to overcome SPOF. |
| The minimum supported Java version is Java 7. | The minimum supported Java version is Java 8. |
| HDFS has a 200% overhead in storage space (with 3x replication). | HDFS has a 50% overhead in storage space (with erasure coding). |
| Hadoop 2 handles fault tolerance through replication. | Hadoop 3 can also handle fault tolerance through erasure coding. |
| Hadoop 2 uses the HDFS balancer for data balancing. | Hadoop 3 also provides an intra-DataNode balancer for data balancing. |
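The 200% and 50% figures follow from simple arithmetic, worked through in the Python sketch below. It assumes three replicas for replication and a Reed-Solomon layout with 6 data units and 3 parity units for erasure coding; the function names are hypothetical.

```python
def replication_overhead(replicas: int) -> float:
    """Extra storage as a percentage of the original data size."""
    # With N replicas, N - 1 of the copies are overhead.
    return (replicas - 1) * 100.0


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Extra storage as a percentage of the original data size."""
    # Only the parity units are overhead.
    return parity_units / data_units * 100.0


print(replication_overhead(3))        # 200.0
print(erasure_coding_overhead(6, 3))  # 50.0
```

Under these assumptions, 1 TB of data occupies 3 TB with replication but only 1.5 TB with erasure coding.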

29.How can you achieve security in Hadoop?

Hadoop uses Kerberos to achieve security. While using Kerberos, the client needs to take three steps to access a service. In each of the steps, a message is exchanged with a server. The steps are as follows:

  • Authentication: The client authenticates itself to the authentication server and receives a ticket-granting ticket, commonly known as TGT.
  • Authorization: The client uses the TGT to request a service ticket from the ticket-granting server.
  • Service Request: The client uses the service ticket to authenticate itself to the server that provides the service.

30.What do you mean by throughput in Hadoop?

 In Hadoop, throughput is the amount of work done within a specified time.

31.What is the function of the jps command in Hadoop?

In Hadoop, the jps command helps people check whether the Hadoop daemons, such as NameNode, DataNode, ResourceManager, and NodeManager, are running or not.

32.What do you mean by HDFS?

HDFS stands for Hadoop Distributed File System. It is the main storage system of Hadoop.

It stores large files on a cluster of commodity hardware. HDFS works on the principle of storing a small number of large files rather than a large number of small files.

33.Can many users simultaneously write in an HDFS file?

Multiple writers cannot write to an HDFS file simultaneously. Apache Hadoop follows what is known as a single-writer, multiple-reader model. The NameNode grants a lease to the client that opens a file for writing. If another client wants to write to that file, it asks the NameNode for permission to write. The NameNode then checks whether a lease on the file has already been granted to someone else. If it has, the NameNode rejects the second client’s write request.
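The lease check described above can be modelled with a toy Python class. This is only a sketch of the single-writer rule: the `LeaseManager` class, its methods, and the path and client names are all hypothetical, and details of real HDFS leases, such as expiry, are left out.

```python
from typing import Dict


class LeaseManager:
    """Toy model of a NameNode that allows one writer per file."""

    def __init__(self) -> None:
        self._leases: Dict[str, str] = {}  # file path -> client holding the lease

    def open_for_write(self, path: str, client: str) -> bool:
        """Grant the lease unless another client already holds it."""
        holder = self._leases.get(path)
        if holder is not None and holder != client:
            return False  # second writer is rejected
        self._leases[path] = client
        return True

    def close(self, path: str, client: str) -> None:
        """Release the lease if this client holds it."""
        if self._leases.get(path) == client:
            del self._leases[path]


namenode = LeaseManager()
print(namenode.open_for_write("/logs/app.log", "client-1"))  # True
print(namenode.open_for_write("/logs/app.log", "client-2"))  # False
namenode.close("/logs/app.log", "client-1")
print(namenode.open_for_write("/logs/app.log", "client-2"))  # True
```

Readers are not modelled here because, in the single-writer, multiple-reader model, they do not need a lease.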

34.Why is the block size 128 MB by default in HDFS?

A block is a contiguous location on the hard drive where data is stored. A file system generally stores data as a collection of blocks. HDFS splits each file into blocks and distributes them across the Hadoop cluster. The default size of a data block in HDFS is 128 MB, and we may configure it as per our requirement. The default is that large in order to reduce disk seeks. However, the block size cannot be so large that the system waits a long time for the last unit of data to finish processing.
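A short Python sketch shows how a file is divided with the default block size. The function name `split_into_blocks` and the 500 MB file are assumptions for illustration; sizes are in whole megabytes to keep the arithmetic simple.

```python
from typing import List

BLOCK_SIZE_MB = 128  # default HDFS block size discussed above


def split_into_blocks(file_size_mb: int, block_size_mb: int = BLOCK_SIZE_MB) -> List[int]:
    """Return the size in MB of each block needed to store the file."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds the data that is left over.
        blocks.append(remainder)
    return blocks


print(split_into_blocks(500))  # [128, 128, 128, 116]
```

A 500 MB file therefore needs four blocks, and the last one holds only the remaining 116 MB.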

35.Could you explain how to do indexing in HDFS?

Hadoop has its own way of indexing. First, the Hadoop framework stores the data according to the block size. Then, HDFS keeps the last part of each piece of data, which says where the next part of the data will be.

Conclusion: Hadoop is on its way to being the future of technology. The above-discussed Hadoop Interview questions and answers will help a candidate face the interview panel confidently. However, these questions just provide a basic overview of the interview. The candidates need to have a clear concept and an in-depth knowledge of Hadoop.

We spent many hours researching and deliberating on the best possible answers to these interview questions. We invite people from the IT industry, both freshers and experienced professionals, to use these interview FAQs to excel in their interviews.