Hadoop: A Beginner’s Note

  1.  NameNode (NN)
    1. The NN holds the metadata of the files in HDFS and maintains the entire metadata in RAM
    2. It is important to run the NN on a machine with plenty of RAM at its disposal. The more files there are in HDFS, the higher the RAM consumption
    3. In case the namenode daemon is restarted, it performs the following steps:
      1. Read fsimage from disk, load into RAM
      2. Read actions in the edits log and apply each action to the in-memory representation of the fsimage file.
      3. Write the modified in-memory representation to the fsimage file on the disk
    4. The NN is a single point of failure in Hadoop 1.x; Hadoop 2.x supports HA by deploying 2 NNs in an active/passive configuration
    5. Monitored URL port: 50070 (ports can be changed in the hdfs-site.xml and mapred-site.xml files; see the example below)
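    6. For example, to check which address/port the NN web UI is bound to (a quick sketch, assuming a Hadoop 2.x installation with hdfs on the PATH; in Hadoop 1.x the property is dfs.http.address):
      # print the NN web UI address and port (default 0.0.0.0:50070)
      hdfs getconf -confKey dfs.namenode.http-address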
  2. Secondary Namenode (SNN)
    1. Responsible for performing periodic housekeeping functions for NN
    2. Creates checkpoints of the filesystem metadata (fsimage) present in NN by merging the edits logfile and the fsimage file from the NN daemon
    3. In case the NN daemon fails, this checkpoint could be used to rebuild the filesystem metadata
    4. NOT a failover node for the NN daemon
    5. Monitored URL port: 50090
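    6. As a sketch (assuming Hadoop 2.x and a user with HDFS superuser rights), you can also force a fresh checkpoint by hand, which is essentially what the SNN does periodically:
      # enter safemode, persist the in-memory namespace to a new fsimage, then leave safemode
      hdfs dfsadmin -safemode enter
      hdfs dfsadmin -saveNamespace
      hdfs dfsadmin -safemode leave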
  3. DataNode (DN)
    1. Acts as a slave node and is responsible for storing the actual files in HDFS
    2. Files are split into data blocks across the cluster; the default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x
    3. File blocks are also replicated to other DNs for redundancy so that no data is lost if a DN daemon fails (see the example below)
    4. The DN daemon (i) sends information to the NN daemon about the files and blocks stored in that node and (ii) responds to the NN daemon for all filesystem operations
    5. Monitored URL port: 50075
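    6. To see how a file is split into blocks and which DNs hold the replicas (assuming a running cluster with hdfs on the PATH), for example:
      # list files, blocks, and replica locations under /user/root, plus overall health
      hdfs fsck /user/root -files -blocks -locations
      # summary of live DNs, their capacity, and block counts
      hdfs dfsadmin -report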
  4. Jobtracker (JT)
    1. Responsible for accepting job requests from a client and scheduling/assigning tasktrackers (TT) with tasks to be performed
    2. Data locality: The JT daemon tries to assign tasks to the TT daemon on the datanode daemon where the data to be processed is stored
    3. Monitored URL port: 50030
    4. With the introduction of YARN in Hadoop 2, the JT daemon has been removed and the following 2 new daemons have been introduced: (i) ResourceManager (RM) & (ii) NodeManager (NM)
  5. Tasktracker (TT)
    1. Accepts tasks (map, reduce, and shuffle) from the jobtracker daemon and performs the actual work during a MapReduce operation (see the example below)
    2. In small clusters, the NN and JT daemons reside on the same node. However, in larger clusters, there are dedicated nodes for the NN and JT daemons
    3. Monitored URL port: 50060
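    4. A classic MRv1 job submission goes through the JT and TT daemons; a minimal sketch (the examples jar name and the input/output paths are illustrative and vary by distribution):
      # the JT schedules the map/reduce tasks of this job on the TTs
      hadoop jar hadoop-examples.jar wordcount /user/root/input /user/root/output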
  6. YARN
    1. A general-purpose, distributed, application management framework for processing data in Hadoop clusters
    2. Part of Hadoop 2.x; introduced to solve the following two important problems
      1. Support for large clusters (4000 nodes or more)
      2. The ability to run other applications apart from MapReduce to make use of data already stored in HDFS, for example, MPI and Apache Giraph
    3. ResourceManager (RM): a global master daemon that is responsible for managing the resources for the applications in the cluster.
      1. Consists of:
        1. ApplicationsManager:
          1. Accepts jobs from a client
          2. Creates the 1st container on one of the worker nodes to host the ApplicationMaster. A container, in simple terms, is the memory resource on a single worker node in the cluster
          3. Restarts the container hosting the ApplicationMaster on failure
        2. Scheduler: responsible for allocating system resources to the various applications in the cluster; it is a pure scheduler and does not itself monitor or track application status (that is left to the per-application ApplicationMaster)
      2. Monitored URL port: 8088
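      3. For example (a sketch, assuming the RM runs on localhost with the default ports), the RM can be queried from the command line or its REST API:
        # list the applications known to the RM
        yarn application -list
        # basic cluster info from the RM REST API (same port as the web UI, 8088)
        curl http://localhost:8088/ws/v1/cluster/info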
    4. NodeManager (NM)
      1. Runs on the worker nodes
      2. Responsible for monitoring the containers running on the node and the node's system resources such as CPU, memory, and disk, and reporting this information back to the RM daemon
      3. Each worker node will have exactly one NodeManager daemon running.
      4. Monitored URL port: 8042
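      5. To list the NMs the RM knows about (assuming yarn is on the PATH), for example:
        # show all NMs with their state and number of running containers
        yarn node -list -all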
    5. Job submission in YARN: the client submits a job to the RM, an ApplicationMaster runs it in containers on the worker nodes, and the JobHistoryServer (JHS) keeps the history of completed jobs
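      1. A minimal sketch (assuming MapReduce on YARN and the examples jar shipped with your distribution; jar name and paths are illustrative):
        # submit a MapReduce job to YARN; the RM starts an ApplicationMaster for it
        yarn jar hadoop-mapreduce-examples.jar wordcount /user/root/input /user/root/output
        # after completion, the job history is served by the JHS web UI, by default on port 19888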
  7. HDFS Commands
    1. To start the hadoop namenode and datanode daemons in the background, we can use nohup:
      nohup hadoop namenode &
      nohup hadoop datanode &
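      With nohup, stdout/stderr go to nohup.out in the current directory unless redirected, for example:
      # keep the NN output in a file of your choice (the filename is just an example)
      nohup hadoop namenode > namenode.log 2>&1 &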
    2. When using the hdfs dfs command, prefix it with HADOOP_USER_NAME=hdfs if you need the hdfs user's permissions in the Hadoop filesystem. For example:
      HADOOP_USER_NAME=hdfs hdfs dfs -copyFromLocal test.txt /user/root
    3. If you use a custom SSH port and want to run start-yarn.sh, set the environment variable HADOOP_SSH_OPTS as follows:
      export HADOOP_SSH_OPTS="-p YOUR_PORT"
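      To make this permanent, you can put the export in hadoop-env.sh instead of your shell session, for example (2222 is just an example port):
      # in etc/hadoop/hadoop-env.sh (path relative to your Hadoop install)
      export HADOOP_SSH_OPTS="-p 2222"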
  8.  Hive
    1. If you run Hive 2 and use schematool to initialize the metastore with -dbType derby, remember not to set hive.metastore.uris to a Thrift server, as there is no Thrift metastore server in that setup :-). Also remember to remove the Derby lock files (rm -f metastore_db/*.lck) if hive CLI commands fail with errors such as “FAILED: SemanticException org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient”. See the sketch below.
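      A minimal sketch of that setup (run from the directory where you start hive, assuming an embedded Derby metastore):
      # initialize the embedded Derby metastore schema (Hive 2.x)
      schematool -dbType derby -initSchema
      # clear stale Derby lock files if the CLI cannot instantiate the metastore client
      rm -f metastore_db/*.lck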
    2. To run the hive CLI in debug mode, use the following command:
      hive -hiveconf hive.root.logger=DEBUG,console
    3. To
