Top 20 Hadoop Questions To Crack An Interview

Table of Contents

1. What are the different vendor-specific distributions of Hadoop?
2. Name different Hadoop configuration files.
3. How many modes are there where Hadoop can run?
4. What are the differences between the regular file system and HDFS?
5. Why is HDFS fault-tolerant?
6. What are the two types of metadata that a NameNode server holds?
7. If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?
8. How can you restart NameNode and all the daemons in Hadoop?
9. Which command will help you find the status of blocks and FileSystem health?
10. What would happen if you store too many small files in a cluster on HDFS?
11. How do you copy data from the local system onto HDFS?
12. When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?
13. Is there any way to change the replication of files on HDFS after they are already written to HDFS?
14. What is the distributed cache in MapReduce?
15. What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?
16. Why is MapReduce slower in processing data in comparison to other processing frameworks?
17. Name some Hadoop-specific data types that are used in a MapReduce program.
18. What are the major configuration parameters required in a MapReduce program?
19. What is the role of the OutputCommitter class in a MapReduce job?
20. How can you set the mappers and reducers for a MapReduce job?
Conclusion

Software, tools, and programming languages are all just as important as the technologies we use. Don’t you think?

So, what is Apache Hadoop, and what is it used for? And why is it so important to prepare for Apache Hadoop questions in your cloud computing and Big Data interviews?

Apache Hadoop is a collection of open-source software that provides a framework for distributed storage and processing of big data using the MapReduce programming model. It was originally designed for computer clusters built from commodity hardware, and it has been used for Yahoo’s search webmap and by social media sites like Facebook. And what about the cloud, you may ask? Well, the cloud allows organizations to deploy Hadoop without acquiring hardware or specific setup expertise.

Hadoop adoption has become pretty widespread since 2013. You may be surprised to know that 50,000 organizations around the world currently use Hadoop.

So, as you can see, knowing Hadoop well is an added advantage on your CV.

But first, you need to know which Hadoop questions you might be asked during the interview. That is why we have picked the top 20 questions that interviewers are most likely to ask in 2020. Have a look!

1. What are the different vendor-specific distributions of Hadoop?

Some vendor-specific distributions of Hadoop are mentioned below:

Cloudera
MapR
Amazon EMR
Microsoft Azure HDInsight
IBM InfoSphere BigInsights
Hortonworks

2. Name different Hadoop configuration files.

A few Hadoop configuration files are mentioned below:

hadoop-env.sh
mapred-site.xml
core-site.xml
yarn-site.xml
hdfs-site.xml
masters and slaves (the slaves file is named workers in Hadoop 3)

3. How many modes are there where Hadoop can run?

Hadoop can run in three modes:

Standalone mode: This is the default mode. It uses the local file system and a single Java process to run the Hadoop services.
Pseudo-distributed mode: This mode uses a single-node Hadoop deployment to run all Hadoop services.
Fully-distributed mode: This mode uses separate nodes to run the Hadoop master and slave services.

4. What are the differences between the regular file system and HDFS?

In a regular file system, data is maintained on a single system. If that machine crashes, data recovery is difficult because fault tolerance is low. Seek time is also higher, so the total time needed to process the data increases.

In HDFS, on the other hand, data is distributed and maintained across multiple systems. If a DataNode crashes, the data can still be recovered from other nodes in the cluster. The trade-off is that the time taken to read data is comparatively higher, because reads have to be coordinated across multiple systems.

5. Why is HDFS fault-tolerant?

HDFS is fault-tolerant because it replicates data on different DataNodes. By default, each block of data is replicated on three DataNodes, so copies of the same block are stored on different nodes. If one node crashes, the data can still be retrieved from the other DataNodes.
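
To see why replication gives fault tolerance, here is a small plain-Java sketch. It does not use the Hadoop API: the class name, the method names, and the round-robin placement are our own assumptions for illustration (real HDFS placement is rack-aware). It simply checks whether every block still has a live copy after one node fails.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicationSketch {
    // Place `replication` copies of each block on distinct nodes, round-robin.
    // Assumption: replication <= numNodes. Real HDFS placement is rack-aware.
    public static List<List<Integer>> placeBlocks(int numBlocks, int numNodes, int replication) {
        List<List<Integer>> placement = new ArrayList<>();
        for (int block = 0; block < numBlocks; block++) {
            List<Integer> nodes = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                nodes.add((block + r) % numNodes);
            }
            placement.add(nodes);
        }
        return placement;
    }

    // A block stays readable as long as at least one replica sits on a live node.
    public static boolean allBlocksReadable(List<List<Integer>> placement, int failedNode) {
        for (List<Integer> nodes : placement) {
            boolean hasLiveReplica = false;
            for (int node : nodes) {
                if (node != failedNode) {
                    hasLiveReplica = true;
                }
            }
            if (!hasLiveReplica) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 6 blocks on 4 nodes with 3 replicas each: any single failure is survivable.
        List<List<Integer>> placement = placeBlocks(6, 4, 3);
        for (int failed = 0; failed < 4; failed++) {
            System.out.println("node " + failed + " down, all blocks readable: "
                    + allBlocksReadable(placement, failed));
        }
    }
}
```

With a replication factor of 1 in the same sketch, losing a single node is enough to make some block unreadable, which is the situation a regular file system is in.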

6. What are the two types of metadata that a NameNode server holds?

The two types of metadata that a NameNode server holds are:

Metadata on disk - This holds the edit log and the FSImage
Metadata in RAM - This holds the information about DataNodes

7. If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?

By default, HDFS divides each file into blocks of 128 MB, so every block is 128 MB except the last one. For an input file of 350 MB, there will be 3 input splits in total, with sizes of 128 MB, 128 MB, and 94 MB.
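
The arithmetic can be checked with a few lines of plain Java. This is only a sketch of the calculation, assuming one split per block and the 128 MB default; the class and method names are ours, and the real split computation in Hadoop has further refinements that are ignored here.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSizes {
    // Returns the size in MB of each split for a file of fileSizeMb,
    // assuming one split per block of blockSizeMb.
    public static List<Long> splitSizes(long fileSizeMb, long blockSizeMb) {
        List<Long> sizes = new ArrayList<>();
        long remaining = fileSizeMb;
        while (remaining > 0) {
            long size = Math.min(blockSizeMb, remaining);
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(splitSizes(350, 128)); // prints [128, 128, 94]
    }
}
```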

8. How can you restart NameNode and all the daemons in Hadoop?

You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again with the ./sbin/hadoop-daemon.sh start namenode command.
You can stop all the daemons with the ./sbin/stop-all.sh command and then start them again with the ./sbin/start-all.sh command.

9. Which command will help you find the status of blocks and FileSystem health?

To check the status of the blocks, you can use the command:

hdfs fsck <path> -files -blocks

The following command is used to check the health status of the file system:

hdfs fsck / -files -blocks -locations > dfs-fsck.log

10. What would happen if you store too many small files in a cluster on HDFS?

Storing a large number of small files on HDFS generates a lot of metadata. Keeping all of this metadata in the NameNode's RAM is a challenge, because each file, block, or directory takes about 150 bytes of metadata. As a result, the cumulative size of the metadata becomes too large.
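
A rough back-of-the-envelope estimate makes the problem concrete. The sketch below is plain Java with names of our own choosing, and it assumes the 150-bytes-per-object figure quoted above plus one block per file. It compares the metadata needed for the same volume of data stored as 1 MB files and as 128 MB files.

```java
public class NameNodeMemoryEstimate {
    // Rough figure from the text: about 150 bytes per file, block, or directory object.
    static final long BYTES_PER_OBJECT = 150;

    // Each file contributes one file object plus one object per block.
    public static long metadataBytes(long numFiles, long blocksPerFile) {
        return numFiles * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 1,280,000 MB of data stored as 1 MB files (one block each).
        System.out.println(metadataBytes(1_280_000L, 1)); // prints 384000000

        // The same data stored as 10,000 files of 128 MB (one block each).
        System.out.println(metadataBytes(10_000L, 1));    // prints 3000000
    }
}
```

Under these assumptions, the small-file layout needs 128 times as much NameNode memory for the same amount of data.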

11. How do you copy data from the local system onto HDFS?

This command will help you to copy data from the local file system onto HDFS:

hadoop fs -copyFromLocal [source] [destination]

12. When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?

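Both commands are typically used when nodes are commissioned or decommissioned.

dfsadmin -refreshNodes is run through the HDFS client (hdfs dfsadmin -refreshNodes) and refreshes the node configuration for the NameNode, so that it re-reads which DataNodes may connect and which are to be decommissioned.

rmadmin -refreshNodes is run as yarn rmadmin -refreshNodes and performs the equivalent administrative task for the ResourceManager.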

13. Is there any way to change the replication of files on HDFS after they are already written to HDFS?

Yes. The following are ways to change the replication of files on HDFS:

You can change the dfs.replication value in the hdfs-site.xml file (older releases used $HADOOP_HOME/conf/hadoop-site.xml). The new replication factor then applies to all content written from that point on; files already stored keep their existing factor.
To change the replication factor of a particular file or directory that is already on HDFS, you can use:

$HADOOP_HOME/bin/hadoop fs -setrep -w 4 /path/of/the/file

14. What is the distributed cache in MapReduce?

The distributed cache is a mechanism by which data from the disk can be cached and made available to all worker nodes. When a MapReduce program runs, instead of reading the data from the disk every time, it picks up the data from the distributed cache, which speeds up the MapReduce processing.

15. What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?

RecordReader

This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read.

Combiner

This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works on it, and then passes its output to the reducer phase.

Partitioner

The partitioner decides which reduce task receives each intermediate key. It controls the partitioning of the keys of the intermediate map outputs, and therefore how the outputs from the combiners are sent to the reducers. The number of partitions is the same as the number of reduce tasks configured for the job.
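
Here is a small plain-Java sketch of the combiner and partitioner steps for a word count. It does not use the Hadoop classes: the class and method names are hypothetical, and only the partition formula mirrors what Hadoop's default HashPartitioner computes.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinePartitionSketch {
    // Combiner step: sum the counts produced by one map task locally,
    // so fewer key-value pairs have to travel to the reducers.
    public static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1, Integer::sum);
        }
        return combined;
    }

    // Partitioner step: same formula as Hadoop's default HashPartitioner.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    public static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Keys emitted by a single (imaginary) map task, each with count 1.
        List<String> mapOutput = List.of("hadoop", "hdfs", "hadoop", "yarn", "hdfs", "hadoop");
        Map<String, Integer> combined = combine(mapOutput);
        for (Map.Entry<String, Integer> entry : combined.entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue()
                    + " -> reducer " + partition(entry.getKey(), 2));
        }
    }
}
```

Because the partition depends only on the key, every occurrence of the same key ends up at the same reducer, whichever map task produced it.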

16. Why is MapReduce slower in processing data in comparison to other processing frameworks?

MapReduce is slower because:

It is batch-oriented when it comes to processing data. No matter what, you have to provide the mapper and reducer functions to work on the data.
During processing, whenever the mapper function delivers an output, it is written to the local disk of the node. This data is then shuffled and sorted, picked up by the reduce phase, and the final result is written to HDFS. All of this writing to and reading from disk makes MapReduce a lengthier process.
In addition to the above reasons, MapReduce programs are typically written in Java, which is verbose: even simple jobs take many lines of code.

17. Name some Hadoop-specific data types that are used in a MapReduce program.

Some Hadoop-specific data types that are used in a MapReduce program are:

IntWritable
FloatWritable
LongWritable
DoubleWritable
BooleanWritable
ArrayWritable
MapWritable
ObjectWritable

18. What are the major configuration parameters required in a MapReduce program?

The major parameters required in a MapReduce program are mentioned below:

Input location of the job in HDFS
Output location of the job in HDFS
Input and output formats
Classes containing the map and reduce functions
JAR file containing the mapper, reducer, and driver classes

19. What is the role of the OutputCommitter class in a MapReduce job?

MapReduce relies on the OutputCommitter for the following:

Setting up the job during initialization
Cleaning up the job after completion
Setting up the task’s temporary output
Checking whether a task needs a commit
Committing the task output
Discarding the task commit

20. How can you set the mappers and reducers for a MapReduce job?

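The number of mappers and reducers can be set on the command line using:

-D mapred.map.tasks=5 -D mapred.reduce.tasks=2

(In Hadoop 2 and later, these properties are named mapreduce.job.maps and mapreduce.job.reduces.)

In the code, one can configure JobConf variables:

job.setNumMapTasks(5); // 5 mappers (only a hint; the actual number follows the input splits)

job.setNumReduceTasks(2); // 2 reducers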

Conclusion

So, that was all about Apache Hadoop interview questions. What do you think? Are you prepared for your next Hadoop interview? No? Well, you can join our Apache Hadoop training sessions to get a clearer picture, along with Q&A sessions that will help you rock the interview. Want to give it a try?

Author Details

Novelvista

SME

NOVELVISTA LEARNING SOLUTIONS PRIVATE LIMITED - an Accredited Training Organization (ATO), is a professional training certification provider, helping professionals across the industry to develop skills and expertise to get recognition and growth in the corporate world. We’re one of the leading training providers and gradually spreading our training facility amongst candidates based at different geographies. We have gained recognition over the years in professional training certification in IT industry such as PRINCE2, DevOps, PMP, Six Sigma, ITIL and many other leading courses.
