Top 20 Hadoop Questions To Crack An Interview

 

NovelVista

Last updated 19/06/2020


Software, tools, and programming languages are all just as important as the technologies we use, don’t you think?

So, what is Apache Hadoop, and what is it used for? And why is it so important to prepare for Apache Hadoop questions in your cloud computing and Big Data interviews?

Apache Hadoop is a collection of open-source software that provides a framework for distributed storage and processing of big data using the MapReduce programming model. It was originally designed for computer clusters built from commodity hardware, and it has been used in production at large scale, for example for Yahoo’s search web map and by social media sites like Facebook. And what about the cloud, you may ask? The cloud allows organizations to deploy Hadoop without acquiring hardware or specific setup expertise.

Hadoop adoption has become pretty widespread since 2013. You may be surprised to learn that an estimated 50,000 organizations around the world currently use Hadoop.

So, as you can see, knowing Hadoop well is an added advantage on your CV.

But first, you need to know which Hadoop questions you might be asked during the interview. That is why we picked the top 20 questions that interviewers are most likely to ask in 2020. Have a look!

 

1. What are the different vendor-specific distributions of Hadoop?

Some vendor-specific distributions of Hadoop are mentioned below:

  • Cloudera
  • MapR
  • Amazon EMR
  • Microsoft Azure HDInsight
  • IBM InfoSphere BigInsights
  • Hortonworks

 

2. Name different Hadoop configuration files.

A few Hadoop configuration files are mentioned below:

  • hadoop-env.sh
  • mapred-site.xml
  • core-site.xml
  • yarn-site.xml
  • hdfs-site.xml
  • masters and slaves (the slaves file is named workers in Hadoop 3)

3. In how many modes can Hadoop run?

Hadoop can run in 3 modes:

  • Standalone mode: This is the default mode. It uses the local FileSystem and a single Java process to run the Hadoop services.
  • Pseudo-distributed mode: This mode uses a single-node Hadoop deployment to run all the Hadoop services.
  • Fully-distributed mode: This mode uses separate nodes to run the Hadoop master and slave services.

 

4. What are the differences between the regular file system and HDFS?

In a regular FileSystem, data is maintained on a single system. If the machine crashes, data recovery is difficult because fault tolerance is low. Seek times are also higher, so the total data processing time increases.

In HDFS, on the other hand, data is distributed and maintained across multiple systems. If a DataNode crashes, the data can still be recovered from other nodes in the cluster. The trade-off is that an individual read can take comparatively longer, because access to the data has to be coordinated across multiple systems.

5. Why is HDFS fault-tolerant?

HDFS is fault-tolerant because it replicates data on different DataNodes. By default, each block of data is replicated three times, and the copies are stored on different DataNodes. If one node crashes, the data can still be retrieved from the other DataNodes.
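To make the idea concrete, here is a small plain-Java sketch. It is a toy model, not the real HDFS placement policy: the class name ReplicationSketch, the node names, and the round-robin placement are all assumptions made for illustration. It places every block on three distinct nodes and then checks whether all blocks are still readable after some nodes fail.

```java
import java.util.*;

// Toy model of block replication (hypothetical; not HDFS's actual placement policy).
class ReplicationSketch {
    // Assign each block to `replication` distinct nodes, round-robin.
    // Assumes replication <= nodes.size().
    static Map<Integer, Set<String>> place(int blocks, List<String> nodes, int replication) {
        Map<Integer, Set<String>> placement = new TreeMap<>();
        for (int b = 0; b < blocks; b++) {
            Set<String> holders = new TreeSet<>();
            for (int r = 0; r < replication; r++) {
                holders.add(nodes.get((b + r) % nodes.size()));
            }
            placement.put(b, holders);
        }
        return placement;
    }

    // A block is readable as long as at least one of its holders has not failed.
    static boolean allReadable(Map<Integer, Set<String>> placement, Set<String> failed) {
        for (Set<String> holders : placement.values()) {
            if (failed.containsAll(holders)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4");
        Map<Integer, Set<String>> placement = place(6, nodes, 3);
        System.out.println(allReadable(placement, Set.of("dn1", "dn2")));        // true
        System.out.println(allReadable(placement, Set.of("dn1", "dn2", "dn3"))); // false
    }
}
```

With three copies of every block, losing any two nodes still leaves one copy of each block; data only becomes unreadable when every holder of some block is down.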

6. What are the two types of metadata that a NameNode server holds?

The two types of metadata that a NameNode server holds are:

  • Metadata on disk - This contains the edit log and the FSImage
  • Metadata in RAM - This contains the information about DataNodes
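How the two on-disk pieces fit together can be sketched in a few lines of plain Java. This is a simplified model under stated assumptions: the class CheckpointSketch, the Edit type, and the "CREATE"/"DELETE" operation names are hypothetical and do not reflect the real on-disk formats. The idea it illustrates is that the current namespace is the FSImage snapshot plus a replay of the changes recorded in the edit log.

```java
import java.util.*;

// Toy model: current namespace = FSImage snapshot + replay of the edit log.
class CheckpointSketch {
    // One logged namespace change (hypothetical format).
    static final class Edit {
        final String op;    // "CREATE" or "DELETE"
        final String path;
        Edit(String op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    // Start from the snapshot and apply every logged change in order.
    static Set<String> replay(Set<String> fsImage, List<Edit> editLog) {
        Set<String> namespace = new TreeSet<>(fsImage);
        for (Edit e : editLog) {
            if (e.op.equals("CREATE")) {
                namespace.add(e.path);
            } else if (e.op.equals("DELETE")) {
                namespace.remove(e.path);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        Set<String> fsImage = Set.of("/data/a.txt", "/data/b.txt");
        List<Edit> editLog = List.of(
                new Edit("CREATE", "/data/c.txt"),
                new Edit("DELETE", "/data/a.txt"));
        System.out.println(replay(fsImage, editLog)); // [/data/b.txt, /data/c.txt]
    }
}
```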

 

7. If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?

The default HDFS block size is 128 MB, so every block is 128 MB except possibly the last one. For an input file of 350 MB, there will be 3 input splits in total, with sizes of 128 MB, 128 MB, and 94 MB.
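The arithmetic can be checked with a short plain-Java sketch. The class and method names here are made up for illustration, and the sketch assumes one split per block, ignoring any refinements in the real split logic.

```java
import java.util.*;

// Minimal sketch: cut a file into block-sized splits (sizes in MB).
class SplitCalculator {
    static List<Long> splitSizesMb(long fileMb, long blockMb) {
        List<Long> sizes = new ArrayList<>();
        long remaining = fileMb;
        while (remaining > 0) {
            long size = Math.min(blockMb, remaining); // last split may be smaller
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(splitSizesMb(350, 128)); // [128, 128, 94]
    }
}
```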

8. How can you restart NameNode and all the daemons in Hadoop?

  • You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start it again with the ./sbin/hadoop-daemon.sh start namenode command.
  • You can stop all the daemons with the ./sbin/stop-all.sh command and then start them again with the ./sbin/start-all.sh command.

 

9. Which command will help you find the status of blocks and FileSystem health?

To check the status of the blocks, you can use the command:

hdfs fsck <path> -files -blocks

The following command checks the health of the whole FileSystem and saves the report to a log file:

hdfs fsck / -files -blocks -locations > dfs-fsck.log

 

10. What would happen if you store too many small files in a cluster on HDFS?

Storing many small files on HDFS generates a lot of metadata. Keeping this metadata in the NameNode’s RAM is a challenge, as each file, block, or directory takes roughly 150 bytes of metadata. With enough small files, the total size of all the metadata becomes too large.
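A rough back-of-the-envelope calculation shows the effect. The sketch below is only an estimate built on the 150-byte rule of thumb above; the class name and the sample file counts are assumptions, and directories are ignored.

```java
// Rough NameNode memory estimate using the ~150 bytes per object rule of thumb.
class SmallFilesEstimate {
    static final long BYTES_PER_OBJECT = 150;

    // Each file contributes one file object plus one object per block.
    static long metadataBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // 1 GB stored as 8 files of 128 MB, one block each.
        System.out.println(metadataBytes(8, 1));      // 2400
        // Roughly the same 1 GB stored as 10,000 files of about 100 KB each.
        System.out.println(metadataBytes(10_000, 1)); // 3000000
    }
}
```

The amount of data is about the same in both cases, but the small-file layout needs over a thousand times more NameNode memory in this estimate.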

11. How do you copy data from the local system onto HDFS? 

The following command copies data from the local file system onto HDFS:

hadoop fs -copyFromLocal [source] [destination]

 

12. When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?

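dfsadmin -refreshNodes (run as hdfs dfsadmin -refreshNodes) makes the NameNode re-read its lists of included and excluded hosts. You typically run it when commissioning or decommissioning DataNodes.

rmadmin -refreshNodes (run as yarn rmadmin -refreshNodes) performs the same administrative task for the ResourceManager, refreshing the list of nodes it accepts.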

 

13. Is there any way to change the replication of files on HDFS after they are already written to HDFS?

There are two ways to change the replication of files on HDFS:

  • You can change the dfs.replication value to a particular number in the hdfs-site.xml file (older releases used $HADOOP_HOME/conf/hadoop-site.xml). All newly written content is then replicated with that factor; files that already exist are not affected.
  • To change the replication factor for a particular file or directory that is already on HDFS, you can use:
$HADOOP_HOME/bin/hadoop fs -setrep -w 4 /path/to/file

14. What is the distributed cache in MapReduce?

A distributed cache is a mechanism by which data from disk can be cached and made available to all worker nodes. When a MapReduce program runs, instead of reading the data from disk every time, it picks up the data from the distributed cache, which speeds up MapReduce processing.
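The usage pattern is easier to see in code. The sketch below is a plain-Java analogy, not the Hadoop API: the class CachedLookupSketch, its methods, and the sample country codes are all hypothetical. It shows the typical shape of a job that uses cached side data: load a small lookup file once per task, then reuse it in memory for every record.

```java
import java.util.*;

// Plain-Java analogy of using cached side data in a map task (not the Hadoop API).
class CachedLookupSketch {
    // Stand-in for a task's setup step: parse the cached file once.
    static Map<String, String> loadOnce(List<String> cachedFileLines) {
        Map<String, String> lookup = new HashMap<>();
        for (String line : cachedFileLines) {
            String[] parts = line.split(",", 2);
            lookup.put(parts[0], parts[1]);
        }
        return lookup;
    }

    // Stand-in for map(): enrich one "code,value" record from the in-memory table.
    static String map(Map<String, String> lookup, String record) {
        String[] parts = record.split(",", 2);
        return lookup.getOrDefault(parts[0], "unknown") + "\t" + parts[1];
    }

    // Load the side data once, then process every record against it.
    static List<String> run(List<String> cachedFileLines, List<String> records) {
        Map<String, String> lookup = loadOnce(cachedFileLines);
        List<String> output = new ArrayList<>();
        for (String record : records) {
            output.add(map(lookup, record));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> cache = List.of("IN,India", "US,United States");
        System.out.println(run(cache, List.of("IN,42", "FR,7")));
    }
}
```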

15. What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?

RecordReader

This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read. 

Combiner

This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works on it, and then passes its output to the reducer phase. 

Partitioner

The partitioner decides which reduce task receives each intermediate key. It determines how outputs from the combiners are sent to the reducers by controlling the partitioning of the keys of the intermediate map outputs. The number of partitions is the same as the number of reduce tasks configured for the job.
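To see how the three pieces cooperate, here is a tiny word count written in plain Java. It is a sketch, not Hadoop code: the class MiniMapReduce and its method names are invented for illustration, and each input string stands in for one input split. The partition method applies the same formula as Hadoop's default HashPartitioner, here to String keys.

```java
import java.util.*;

// Tiny in-memory word count that mimics the RecordReader, Combiner, and Partitioner roles.
class MiniMapReduce {
    // RecordReader + map stand-in: turn one input line into word keys.
    static List<String> readAndMap(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Combiner: pre-aggregate one map task's output locally, like a mini reducer.
    static Map<String, Integer> combine(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Partitioner: pick the reduce task for a key (default hash partitioning formula).
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Run the whole flow; element i of the result is the output of reducer i.
    static List<Map<String, Integer>> run(List<String> splits, int numReduceTasks) {
        List<Map<String, Integer>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeMap<>());
        }
        for (String split : splits) { // one map task per split
            Map<String, Integer> combined = combine(readAndMap(split));
            for (Map.Entry<String, Integer> e : combined.entrySet()) {
                int p = partition(e.getKey(), numReduceTasks);
                reducers.get(p).merge(e.getKey(), e.getValue(), Integer::sum); // reduce step
            }
        }
        return reducers;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a b a", "b a"), 2)); // [{b=2}, {a=3}]
    }
}
```

Because the partition depends only on the key, every count for a given word ends up at the same reducer, no matter which map task produced it.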

 

16. Why is MapReduce slower in processing data in comparison to other processing frameworks?

MapReduce is slower because:

  • It is batch-oriented when it comes to processing data. No matter what, you have to provide the mapper and reducer functions to work on the data.
  • During processing, whenever the mapper function delivers an output, it is written to disk: intermediate map output goes to the local disks of the nodes, and the final reducer output goes to HDFS. The map output is shuffled and sorted, and then picked up for the reducing phase. This whole cycle of writing data to disk and reading it back makes MapReduce a lengthier process.
  • In addition to the above reasons, MapReduce jobs are typically written in Java, which tends to be verbose: even a simple job takes many lines of code.

 

17. Name some Hadoop-specific data types that are used in a MapReduce program.

Some Hadoop-specific data types that are used in a MapReduce program are:

  • IntWritable
  • FloatWritable 
  • LongWritable 
  • DoubleWritable 
  • BooleanWritable 
  • ArrayWritable 
  • MapWritable 
  • ObjectWritable 

 

18. What are the major configuration parameters required in a MapReduce program?

The major parameters required in a MapReduce program are mentioned below:

  • Input location of the job in HDFS
  • Output location of the job in HDFS
  • Input and output formats
  • Classes containing the map and reduce functions
  • JAR file for mapper, reducer and driver classes 

 

19. What is the role of the OutputCommitter class in a MapReduce job?

MapReduce relies on the OutputCommitter for the following:

  • Setting up the job during initialization
  • Cleaning up the job after completion
  • Setting up the task’s temporary output
  • Checking whether a task needs a commit
  • Committing the task output
  • Discarding the task commit
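The steps above can be sketched with ordinary files on the local disk. This is a simplified analogy, assuming a file-based committer: the class CommitSketch, its method names, and the attempt IDs are hypothetical, and the real OutputCommitter API has different signatures. Each task attempt writes into its own temporary directory, only committed attempts are promoted to the final output, and job cleanup removes the temporary area and leaves a _SUCCESS marker.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Local-filesystem analogy of the output commit protocol (hypothetical names).
class CommitSketch {
    // Task setup: give the attempt a private temporary output directory.
    static Path setupTask(Path out, String attemptId) throws IOException {
        return Files.createDirectories(out.resolve("_temporary").resolve(attemptId));
    }

    // A task needs a commit only if it actually produced output.
    static boolean needsCommit(Path attemptDir) throws IOException {
        try (Stream<Path> files = Files.list(attemptDir)) {
            return files.findAny().isPresent();
        }
    }

    // Task commit: promote the attempt's files into the final output directory.
    static void commitTask(Path out, Path attemptDir) throws IOException {
        List<Path> files;
        try (Stream<Path> s = Files.list(attemptDir)) {
            files = s.collect(Collectors.toList());
        }
        for (Path f : files) {
            Files.move(f, out.resolve(f.getFileName()));
        }
    }

    // Used for task abort and job cleanup: delete a directory tree, children first.
    static void deleteTree(Path dir) throws IOException {
        try (Stream<Path> walk = Files.walk(dir)) {
            for (Path p : walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList())) {
                Files.delete(p);
            }
        }
    }

    // Job commit: remove the temporary area and mark the job as successful.
    static void commitJob(Path out) throws IOException {
        deleteTree(out.resolve("_temporary"));
        Files.createFile(out.resolve("_SUCCESS"));
    }

    // One job with a successful attempt and a discarded one; returns the final file names.
    static List<String> demo() {
        try {
            Path out = Files.createTempDirectory("job-output");
            Path good = setupTask(out, "attempt_0");
            Path bad = setupTask(out, "attempt_1");
            Files.writeString(good.resolve("part-00000"), "a\t3\n");
            Files.writeString(bad.resolve("part-00001"), "partial");
            if (needsCommit(good)) {
                commitTask(out, good);
            }
            deleteTree(bad); // discard the failed attempt's output
            commitJob(out);
            List<String> names;
            try (Stream<Path> s = Files.list(out)) {
                names = s.map(p -> p.getFileName().toString()).sorted().collect(Collectors.toList());
            }
            deleteTree(out); // tidy up the demo directory
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [_SUCCESS, part-00000]
    }
}
```

The output of the discarded attempt never reaches the final directory, which is the point of routing every task's output through a temporary location first.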

 

20. How can you set the mappers and reducers for a MapReduce job?

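The number of mappers and reducers can be set on the command line (newer releases name these properties mapreduce.job.maps and mapreduce.job.reduces):

-D mapred.map.tasks=5 -D mapred.reduce.tasks=2

In the code, you can configure the same values on the JobConf object:

job.setNumMapTasks(5); // 5 mappers (a hint only)
job.setNumReduceTasks(2); // 2 reducers

Keep in mind that the number of map tasks is only a hint to the framework; the actual number is determined by the number of input splits.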

Conclusion:

So, that was all about Apache Hadoop interview questions. What do you think? Are you prepared for your next Hadoop interview? If not, you can join our Apache Hadoop training sessions to get a clearer picture, along with Q&A sessions that will help you rock the interview. Want to give it a try?

About Author

NovelVista Learning Solutions is a professionally managed training organization with specialization in certification courses. The core management team consists of highly qualified professionals with vast industry experience. NovelVista is an Accredited Training Organization (ATO) to conduct all levels of ITIL Courses. We also conduct training on DevOps, AWS Solution Architect associate, Prince2, MSP, CSM, Cloud Computing, Apache Hadoop, Six Sigma, ISO 20000/27000 & Agile Methodologies.
