How to configure a Hadoop cluster using an Ansible playbook


Automation is one of the most vital technologies in the industry today. Most companies deal with a large number of systems daily, all of which need to be configured and maintained for smooth functioning. Although system configuration can be done manually, doing it by hand makes the process lengthy, time-consuming, and prone to error.

First, let's talk about what Hadoop is.

So what is a Hadoop cluster?


A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Such clusters run Hadoop’s open-source distributed processing software on low-cost commodity computers.

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform these kinds of parallel computations on big data sets.

  • NameNode: The NameNode is the centrepiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a single point of failure for the HDFS cluster.
  • DataNode: It stores data in the Hadoop file system (HDFS). A functional filesystem has more than one DataNode, with data replicated across them. A DataNode responds to requests from the NameNode for filesystem operations, and client applications can talk to it directly once the NameNode has provided the location of the data.
  • Client node: Client nodes are in charge of loading the data into the cluster. They first submit MapReduce jobs describing how the data needs to be processed and then fetch the results once the processing is finished.

In this article, we will configure a Hadoop cluster using Ansible. The steps to achieve this configuration are:

→ Copy the required software from the controller node to the managed nodes

→ Install the Hadoop and JDK software on the managed nodes

→ Copy and configure the hdfs-site and core-site files

→ Create the namenode and datanode directories and format the namenode

→ Start the Hadoop services for the namenode and datanode

Step 1: Updating the inventory and configuration file on the Ansible controller node

vim /root/ip.txt


Here we have provided the username, the password, and the protocol by which Ansible can log in to each system for configuration management.
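A minimal sketch of what such an inventory file can look like (the IPs and password here are placeholders, and the namenode/datanode group names are my own convention, not something the article prescribes):

# /root/ip.txt -- placeholder IPs; replace with your managed nodes' addresses
[namenode]
192.168.1.10  ansible_user=root  ansible_ssh_pass=redhat  ansible_connection=ssh

[datanode]
192.168.1.11  ansible_user=root  ansible_ssh_pass=redhat  ansible_connection=ssh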

Updating Configuration file

This is a one-time process; after this, we only add the IPs of new nodes to the inventory file.

vim /etc/ansible/ansible.cfg

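For reference, the relevant part of ansible.cfg would look something like this (assuming the inventory path above; disabling host key checking is a common lab-setup convenience so SSH does not prompt on first connection):

[defaults]
# point Ansible at the inventory file we just created
inventory = /root/ip.txt
host_key_checking = False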

Now we have to check whether the IPs are reachable, and for that we will use Ansible's ping module.

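Assuming the inventory above, the connectivity check is a single ad hoc command:

ansible all -m ping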

Step 2: Copying the Hadoop and JDK software to both managed nodes

vim hadoop.yml


Here we used Ansible's copy module, which copies the required software from the controller node to the managed nodes. We specified src (where to copy the file from) and dest (where to place it on the target). Our managed nodes here are one namenode and one datanode.
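A sketch of what these copy tasks can look like. The RPM file names and paths are assumptions based on the Hadoop 1.2 and JDK 8 installers commonly used with this setup; adjust them to your own files:

- hosts: all
  tasks:
    # push the installers from the controller node to every managed node
    - name: Copy the Hadoop installer to the managed nodes
      copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm
        dest: /root/hadoop-1.2.1-1.x86_64.rpm

    - name: Copy the JDK installer to the managed nodes
      copy:
        src: /root/jdk-8u171-linux-x64.rpm
        dest: /root/jdk-8u171-linux-x64.rpm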

Step 3: Installing the required software on the managed nodes


For installing the JDK we used the package module, where we give the path of the software and state: present, which defines the final state we want this task to achieve.

For installing Hadoop we used Ansible's command module. The drawback of the command module is that it is OS-specific and not idempotent, but we have to use it here because Hadoop 1.2 conflicts with some existing files during a normal install, so we install it with the --force option.
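A sketch of the two install tasks under the same assumptions (the RPM names are placeholders, and the package module accepting a local RPM path assumes a yum-based distribution):

    # package module: declarative install of the JDK from the copied RPM
    - name: Install the JDK
      package:
        name: /root/jdk-8u171-linux-x64.rpm
        state: present

    # command module: force the install past Hadoop 1.2's file conflicts
    - name: Install Hadoop
      command: rpm -ivh /root/hadoop-1.2.1-1.x86_64.rpm --force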

Now let's run the playbook to check whether the above code works.


As you can see, the playbook completed successfully, which means the steps of copying and installing the software are done.

Step 4: Copying hdfs-site and core-site to the managed nodes from the controller node

1: For namenode:


This is the hdfs-site file, where we give the directory in which the namenode will map all the datanode storage; in my case I have given the “/nn” directory.
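In Hadoop 1.x this is set through the dfs.name.dir property, so a minimal hdfs-site.xml along these lines would be:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>  <!-- directory chosen in this article; use any path -->
  </property>
</configuration>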


This is the core-site file. Here we give information such as the port on which we want to run our services and who can connect to the namenode; for now we give 0.0.0.0, so that any IP can connect to the namenode.
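A minimal core-site.xml for the namenode, assuming the services run on port 9001 as they do later in this article (fs.default.name is the Hadoop 1.x property):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>  <!-- 0.0.0.0 lets any IP connect -->
  </property>
</configuration>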

We also had to write the code that copies these files to the managed node.


Here we have used the template module to copy the files from the controller node to the managed node; the template module has more functionality than Ansible's copy module.
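A sketch of those tasks. The source paths on the controller node are assumptions; /etc/hadoop is where the Hadoop 1.2 RPM keeps its configuration:

- hosts: namenode
  tasks:
    # template renders any Jinja variables in the files before copying
    - name: Copy hdfs-site.xml to the namenode
      template:
        src: /root/namenode/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    - name: Copy core-site.xml to the namenode
      template:
        src: /root/namenode/core-site.xml
        dest: /etc/hadoop/core-site.xml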

2: For datanode:


This is the hdfs-site.xml file, much like the one for the namenode but with some small changes: in place of name we use data, as this is the datanode, and we also specify the folder that the datanode will contribute to the namenode.
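A minimal version, assuming /dn as the directory the datanode contributes (the article does not name this directory; the Hadoop 1.x property is dfs.data.dir):

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>  <!-- placeholder; use any directory on the datanode -->
  </property>
</configuration>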


This is the core-site file, with one change: we tried to make it more dynamic by providing the IP of the namenode through Ansible's groups variable, which automatically picks up the namenode's IP from the inventory file and inserts it here.
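Since the file is pushed with the template module, it can reference the inventory directly. A sketch, assuming the namenode group from the inventory above:

<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- first host of the namenode inventory group, filled in by Ansible -->
    <value>hdfs://{{ groups['namenode'][0] }}:9001</value>
  </property>
</configuration>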


Here the template module comes into play: while copying a file it also parses the variables present in the core-site.xml file, which the copy module cannot do.
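The corresponding tasks for the datanode mirror the namenode ones (source paths are again assumed):

- hosts: datanode
  tasks:
    - name: Copy hdfs-site.xml to the datanode
      template:
        src: /root/datanode/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    # template is required here so groups['namenode'][0] gets rendered
    - name: Copy core-site.xml to the datanode
      template:
        src: /root/datanode/core-site.xml
        dest: /etc/hadoop/core-site.xml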

Now let us run the playbook and check whether the code works.


Step 5: Creating the namenode and datanode directories and formatting the namenode


Here we used the file module to create the directory on the managed node, and for formatting we used the shell module, because the command module doesn't support the pipe symbol, which is needed while formatting the directory.
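A sketch of the namenode part. Piping echo Y into the format command answers its confirmation prompt, which is exactly why the shell module is needed here:

- hosts: namenode
  tasks:
    - name: Create the namenode directory
      file:
        path: /nn
        state: directory

    # shell (not command) so the pipe works; Y answers the re-format prompt
    - name: Format the namenode directory
      shell: echo Y | hadoop namenode -format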

Likewise, we did the same on the datanode.

Step 6: Starting the services for both the namenode and the datanode


Here too we used the command module to start the Hadoop services, as there is no dedicated Ansible module for this operation. We also opened port 9001 in the firewall, as my Hadoop services run on that port.
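A sketch of these final tasks. The firewall-cmd call assumes a firewalld-based system; hadoop-daemon.sh is the Hadoop 1.x way of starting the daemons:

- hosts: namenode
  tasks:
    # open the port the namenode listens on (9001, per core-site.xml)
    - name: Open port 9001 in the firewall
      command: firewall-cmd --add-port=9001/tcp

    - name: Start the namenode daemon
      command: hadoop-daemon.sh start namenode

- hosts: datanode
  tasks:
    - name: Start the datanode daemon
      command: hadoop-daemon.sh start datanode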

Now let's run the playbook to check whether it actually works:

ansible-playbook hadoop.yml


Our playbook ran successfully without any errors, and with this our Hadoop cluster is ready.

So let's check the cluster status using the dfsadmin command:

hadoop dfsadmin -report | less


Here it shows that one datanode is connected, contributing 6.1 GB of storage to the Hadoop cluster.

We have now successfully configured a Hadoop cluster using Ansible automation.

Here is the GitHub link to the playbook and the other files used above.

Conclusion:

I will be writing more about Ansible, so stay tuned! Hopefully you learned something new from this article and enjoyed it.

Thanks for reading this article! Leave a comment below if you have any questions.
