Automation is one of the most vital technologies in today's industry. Most companies deal daily with a large number of systems that need to be configured and maintained for smooth functioning. System configuration can be done manually, but that makes the process lengthy, time-consuming, and prone to errors.
First, let's talk about what Hadoop is.
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
So what is a Hadoop cluster?
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment. Such clusters run Hadoop’s open-source distributed processing software on low-cost commodity computers.
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform these kinds of parallel computations on big data sets.
- NameNode: The NameNode is the centrepiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept; it does not store the data of these files itself. The NameNode is a single point of failure for the HDFS cluster.
- DataNode: It stores data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them. A DataNode responds to requests from the NameNode for filesystem operations, and client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
- Client node: Client nodes are in charge of loading the data into the cluster. Client nodes first submit MapReduce jobs describing how data needs to be processed and then fetch the results once the processing is finished.
In this article, we will configure a Hadoop cluster using Ansible. The steps to achieve this configuration are:
→ Copy the required software from the controller node to the managed nodes
→ Install the Hadoop and JDK software on the managed nodes
→ Copy and configure the hdfs-site and core-site files
→ Create the namenode and datanode directories and format the namenode directory
→ Start the Hadoop services for the namenode and datanode
Step 1: Updating the Inventory and Configuration File on the Ansible Controller Node
I have created my inventory at "/root/ip.txt" on the Controller Node, and it mainly consists of a few details about the Target Nodes.
Here we have provided the username, password, and the protocol by which Ansible can log in to that system for configuration management.
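A minimal sketch of such an inventory, assuming one namenode and one datanode (the IP addresses and credentials below are placeholders for your own Target Node details):

```ini
[namenode]
192.168.1.10 ansible_user=root ansible_ssh_pass=redhat ansible_connection=ssh

[datanode]
192.168.1.11 ansible_user=root ansible_ssh_pass=redhat ansible_connection=ssh
```

Grouping the nodes as [namenode] and [datanode] also lets the playbook target each role separately later on.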
Updating the Configuration File
This is a one-time process; after this, we only add the IPs of new nodes to the inventory file.
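The relevant part of /etc/ansible/ansible.cfg might look like this (disabling host-key checking is my assumption, a common convenience in lab setups):

```ini
[defaults]
inventory = /root/ip.txt
host_key_checking = false
```

After this, `ansible all -m ping` should succeed against every node listed in the inventory.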
Now we have to check whether the nodes are reachable, and for that we will use Ansible's ping module.
Step 2: Copying the Hadoop and JDK Software to Both Managed Nodes
We will write our code in a YAML file, as this is the format Ansible playbooks use.
Here we used the Ansible copy module, which copies the required software from the controller node to the managed nodes. We specified src (where to copy the file from) and dest (where to place it); our managed nodes are one namenode and one datanode.
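A sketch of the copy tasks, assuming the RPM filenames below (substitute the versions you actually downloaded):

```yaml
- hosts: all
  tasks:
    # Copy the Hadoop and JDK installers from the controller to every managed node
    - name: Copy Hadoop software
      copy:
        src: /root/hadoop-1.2.1-1.x86_64.rpm
        dest: /root/

    - name: Copy JDK software
      copy:
        src: /root/jdk-8u171-linux-x64.rpm
        dest: /root/
```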
Step 3: Installing the Required Software on the Managed Nodes
For installing the JDK we used the package module, where we give the path of the software and state: present, which declares the final state we want this task to achieve.
For installing Hadoop we used Ansible's command module. The drawback of the command module is that it is OS-specific and not idempotent, but we have to use it here: Hadoop 1.2 conflicts with some existing files during installation, so we install it with the --force option.
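The install tasks could be sketched as follows (same assumed RPM filenames as above):

```yaml
    # package is idempotent: it only acts if the declared final state is not yet reached
    - name: Install JDK
      package:
        name: /root/jdk-8u171-linux-x64.rpm
        state: present

    # command is not idempotent, but it lets us pass --force to work around the file conflicts
    - name: Install Hadoop 1.2
      command: rpm -i /root/hadoop-1.2.1-1.x86_64.rpm --force
```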
Now let's run the playbook to check whether the above code is working.
As you can see, the playbook completed successfully, which means the steps for copying and installing the software are done.
Step 4: Copying hdfs-site and core-site to the Managed Nodes from the Controller Node
The configuration files are different for the namenode and the datanode.
1: For the namenode:
This is the hdfs-site file, where we specify the directory in which the namenode will keep its metadata mapping of all the datanodes' storage; in my case I have given the "/nn" directory.
This is the core-site file. Here we give information such as the port on which we want to run our services and who can connect to the namenode; for now we give 0.0.0.0, which allows everyone to connect to the namenode.
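For Hadoop 1.x, the two namenode files could look like this (port 9001 matches the one the services run on):

hdfs-site.xml:

```xml
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>
```

core-site.xml:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://0.0.0.0:9001</value>
  </property>
</configuration>
```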
We also have to write the code to copy these files to the managed node.
Here we have used the template module to copy the files from the controller node to the managed node; the template module has more functionality than Ansible's copy module.
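A sketch of those tasks (the source paths are my assumptions; the Hadoop 1.2 RPM keeps its configuration in /etc/hadoop):

```yaml
- hosts: namenode
  tasks:
    - name: Copy hdfs-site.xml to the namenode
      template:
        src: /root/namenode/hdfs-site.xml
        dest: /etc/hadoop/hdfs-site.xml

    - name: Copy core-site.xml to the namenode
      template:
        src: /root/namenode/core-site.xml
        dest: /etc/hadoop/core-site.xml
```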
2: For the datanode
This is the hdfs-site.xml file, the same as for the namenode with some small changes: in place of the name directory we use a data directory, since this is a datanode, and we specify the folder that the datanode will share with the namenode.
This is the core-site file, with one change: to make the file more dynamic, we provide the IP of the namenode using Ansible's groups variable, which automatically finds the IP of the namenode in the inventory file and inserts it here.
Here the template module comes into play: while copying the file, it also renders the variables present in the core-site.xml file, an operation the copy module cannot perform.
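Assuming the inventory group is named [namenode], the datanode files could be sketched as follows.

hdfs-site.xml:

```xml
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
```

core-site.xml (the Jinja2 expression is rendered by the template module at copy time):

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ groups['namenode'][0] }}:9001</value>
  </property>
</configuration>
```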
Now let us run the playbook and check whether the code is working.
Step 5: Creating Directories for the Namenode and Datanode and Formatting the Namenode Directory
Here we used the file module to create the directory on the managed node, and for formatting we used the shell module, because the command module doesn't support the pipe symbol, which is needed while formatting the directory.
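A sketch of the namenode tasks (the echo Y answers the interactive confirmation prompt, which is why a pipe, and therefore the shell module, is needed):

```yaml
- hosts: namenode
  tasks:
    - name: Create the namenode directory
      file:
        path: /nn
        state: directory

    - name: Format the namenode directory
      shell: echo Y | hadoop namenode -format
```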
Likewise, we did the same for the datanode.
Step 6: Starting the Services for Both the Namenode and the Datanode
Here too we used the command module to start the Hadoop services, as there is no dedicated module for this operation. We also opened port 9001 in the firewall, since my Hadoop services run on that port.
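These tasks could be sketched as follows (hadoop-daemon.sh is the Hadoop 1.x start script; the datanode play would use `start datanode` instead):

```yaml
- hosts: namenode
  tasks:
    - name: Start the namenode daemon
      command: hadoop-daemon.sh start namenode

    - name: Open port 9001 in the firewall
      firewalld:
        port: 9001/tcp
        state: enabled
        permanent: yes
        immediate: yes
```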
Now let's run the playbook to check whether it's actually working.
Our playbook has run successfully without any errors, and with this our Hadoop cluster is ready.
So let's check the cluster status using the dfsadmin command:
hadoop dfsadmin -report | less
It shows that 1 datanode is connected, contributing 6.1 GB of storage to the Hadoop cluster.
We have now successfully configured a Hadoop cluster using Ansible automation.
Here is the GitHub link to the playbook and the other files used above.
So today we created an automated HDFS cluster, and the beautiful part of this playbook is that it can configure as many managed nodes as you want: just add the details of the new systems to the Ansible inventory file and re-run the playbook, and Ansible handles the rest. Ansible makes this whole setup easier and faster.
I will be writing more about Ansible, so stay tuned!! Hopefully you learned something new from this article and enjoyed it.
Thanks for reading this article! Leave a comment below if you have any questions.