- NameNode (NN)
- NN holds the metadata of the files in HDFS and maintains the entire metadata in RAM
- It is important to run NN from a machine that has lots of RAM at its disposal. The higher the number of files in HDFS, the higher the consumption of RAM
- In case the namenode daemon is restarted, it performs the following steps:
- Read fsimage from disk, load into RAM
- Read actions in the edits log and apply each action to the in-memory representation of the fsimage file.
- Write the modified in-memory representation to the fsimage file on the disk
- The NN is the single point of failure in Hadoop 1.x; Hadoop 2.x supports HA by deploying 2 NNs in an active/passive configuration
- Monitored URL port: 50070 (ports can be changed in the hdfs-site.xml and mapred-site.xml files)
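As a sketch of how such a port change might look (assuming Hadoop 2.x; the example port 50071 and bind address are arbitrary, and Hadoop 1.x uses the property name dfs.http.address instead):

```xml
<!-- hdfs-site.xml: move the NameNode web UI off the default port 50070 -->
<!-- Hadoop 2.x property name; Hadoop 1.x uses dfs.http.address -->
<property>
  <name>dfs.namenode.http-address</name>
  <value>0.0.0.0:50071</value>
</property>
```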
- Secondary Namenode (SNN)
- Responsible for performing periodic housekeeping functions for NN
- Creates checkpoints of the filesystem metadata (fsimage) present in NN by merging the edits logfile and the fsimage file from the NN daemon
- In case the NN daemon fails, this checkpoint could be used to rebuild the filesystem metadata
- NOT a failover node for the NN daemon
- Monitored URL port: 50090
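The checkpoint interval is configurable; a minimal sketch, assuming Hadoop 2.x (Hadoop 1.x uses fs.checkpoint.period in core-site.xml instead, and the 3600 here is just the documented default):

```xml
<!-- hdfs-site.xml (Hadoop 2.x): how often the SNN merges the edits log into fsimage -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- seconds between checkpoints -->
</property>
```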
- DataNode (DN)
- Acts as a slave node and is responsible for storing the actual files in HDFS
- The files are split into data blocks across the cluster. The blocks are typically 64 MB (Hadoop 1.x default) or 128 MB (Hadoop 2.x default) in size
- The file blocks in a Hadoop cluster are also replicated to other DNs for redundancy so that no data is lost in case a DN daemon fails
- The DN daemon (i) sends information to the NN daemon about the files and blocks stored in that node and (ii) responds to the NN daemon for all filesystem operations
- Monitored URL port: 50075
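As a back-of-the-envelope sketch of what block splitting and replication mean for storage (plain shell arithmetic, no running cluster assumed; the 1024 MB file size is a made-up example):

```shell
# Blocks needed for a 1 GB (1024 MB) file with 128 MB blocks (Hadoop 2.x default)
FILE_MB=1024
BLOCK_MB=128
REPLICATION=3   # HDFS default replication factor
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))   # ceiling division
RAW_MB=$(( FILE_MB * REPLICATION ))                 # raw storage across all replicas
echo "blocks=$BLOCKS raw_storage_mb=$RAW_MB"
```

So a 1 GB file occupies 8 blocks logically but consumes 3 GB of raw disk across the cluster at the default replication factor.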
- Jobtracker (JT)
- Responsible for accepting job requests from a client and scheduling/assigning tasktrackers (TT) with tasks to be performed
- Data locality: The JT daemon tries to assign tasks to the TT daemon on the datanode daemon where the data to be processed is stored
- Monitored URL port: 50030
- With the appearance of YARN in Hadoop 2, JT daemon has been removed and the following 2 new daemons have been introduced: (i) ResourceManager (RM) & (ii) NodeManager (NM)
- Tasktracker (TT)
- Accepts tasks (map, reduce, and shuffle) from the jobtracker daemon and performs the actual tasks during a MapReduce operation
- In small clusters, the NN and JT daemons reside on the same node. However, in larger clusters, there are dedicated nodes for the NN and JT daemons
- Monitored URL port: 50060
- YARN
- A general-purpose, distributed, application management framework for processing data in Hadoop clusters
- Is a part of Hadoop 2.x, to solve the following two important problems
- Support for large clusters (4000 nodes or more)
- The ability to run other applications apart from MapReduce to make use of data already stored in HDFS, for example, MPI and Apache Giraph
- ResourceManager (RM)
- A global master daemon that is responsible for managing the resources for the applications in the cluster.
- Consists of:
- ApplicationsManager:
- Accepts jobs from a client
- Creates the 1st container on one of the worker nodes to host the ApplicationMaster. A container, in simple terms, is a memory resource on a single worker node in the cluster
- Restarts the container hosting the ApplicationMaster in case of failure
- Scheduler: responsible for allocating the system resources to the various applications in the cluster; it is a pure scheduler and performs no monitoring or tracking of application status
- Monitored URL port: 8088
- NodeManager (NM)
- Runs on the worker nodes
- Responsible for monitoring the containers running on the node and their system resource usage (CPU, memory, and disk), and reporting it back to the RM daemon
- Each worker node will have exactly one NodeManager daemon running.
- Monitored URL port: 8042
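The resources an NM offers to the RM are set per node; a minimal sketch, assuming Hadoop 2.x (the 8192 MB and 4 vcores are example values, not defaults you should copy):

```xml
<!-- yarn-site.xml: resources this NodeManager advertises to the ResourceManager -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- RAM offered for containers on this node -->
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value> <!-- virtual cores offered for containers -->
</property>
```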
- Job submission in YARN (JHS)
HDFS Commands
- To start the hadoop namenode and datanode daemons in the background, we can use nohup:[bash]nohup hadoop namenode &
nohup hadoop datanode &[/bash] - When using the hdfs dfs command, prefix it with HADOOP_USER_NAME=hdfs in case you need access permissions in the Hadoop file system. For example:[bash]HADOOP_USER_NAME=hdfs hdfs dfs -copyFromLocal test.txt /user/root[/bash]
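The nohup pattern above can be tried without a Hadoop installation; here `sleep` stands in for the `hadoop namenode` process and nn.log is a hypothetical log file name:

```shell
# Start a long-running process in the background, immune to hangups,
# with stdout/stderr redirected to a log file (sleep stands in for hadoop namenode)
nohup sleep 30 > nn.log 2>&1 &
NN_PID=$!                       # remember the PID so we can check/stop the daemon later
if kill -0 "$NN_PID" 2>/dev/null; then RUNNING=yes; else RUNNING=no; fi
echo "daemon running: $RUNNING"
kill "$NN_PID"                  # clean up the stand-in process
```

Keeping the PID (`$!`) around is the same idea Hadoop's own start scripts use with their .pid files.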
- If you use a custom SSH port and want to run start-yarn.sh, update the environment variable HADOOP_SSH_OPTS as follows:[bash]export HADOOP_SSH_OPTS="-p YOUR_PORT"[/bash]
- Hive
- If you run Hive 2 and use schematool with -dbType derby, remember not to set hive.metastore.uris to a Thrift server, as there is no Thrift server in this setup :-). Also remember to remove lock files (rm -f metastore_db/*.lck) if commands in the Hive CLI fail with errors such as "FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient"
- To run the hive cli in debug mode, use the following command:[bash]hive -hiveconf hive.root.logger=DEBUG,console[/bash]
- To