Friday 27 December 2013

YARN: MapReduce 2

How does YARN overcomes shortcomings of “classic” MapReduce


– By splitting the responsibilities of the jobtracker into separate entities. Now a jobtracker takes care of both job scheduling and task progress monitoring. YARN separates these two roles into two daemons: A resource manager and application master. A resource manager manages the use of resources across the cluster and application manager manage the lifecycle of applications running on the cluster. Application master negotiates with the resource manager for cluster resources (number of containers) and then runs application specific processes in these containers. The containers are monitored by node managers running on cluster nodes, which make sure that only allocated resources re used not more than that.

Yarn is more general than MapReduce in fact MapReduce is just one type YARN application. The best thing about YARN design is that different YARN Applications can coexist on the same cluster. It is also possible for users to run different versions of MapReduce on the same YARN cluster, which makes the process of upgrading MapReduce more manageable. MapReduce on YARN involves more entities than classic MapReduce(MapReduce 1).They are:

  • The client
  • The YARN Resource Manager
  • The YARN Node Manager
  • The MapReduce Application Master
  • The Distributed Filesystem

The process of running a job is shown below.

Job Submission


  • Step 1 in figure:The submit() method on Job creates an internal Jobsubmitter instance and calls submitJobInternal() on it.The job submission process implemented by Jobsubmitter does the following.
  • Step 2:Asks the resource manager a new job Id.
  • Step 3:Checks the output specification of the job,Computes input splits,Copies job resources (job JAR ,configuration,and split information)to HDFS.
  • Step 4:Finally, the job is submitted by calling submitApplication() on the resource manager.

Job Intitialization


  • Step 5a and 5b: When the resource manager receives a call to its submitApplication(), it hands off the request to the scheduler.The scheduler allocates a container,and the resource manager then launches the application master's process there , under the node manager's management.
  • Step 6:The application master initializes job by creating a number of book keeping objects to keep track of the job's progress,as it will receive progress and completion reports from the tasks.
  • Step 7:Then it receives the input splits computed in the client from the shared filesystem.


Task Assignment


  • Step 8: If the job does not qualify for running as uber task, then the application master requestes containers for all the map and reduce rasks in the job from resource manager.
Note: All requests includes inforamation about each map task's data locality, in particular the hosts and coressponding racks that the input split resides on.The scheduler uses this info to make scheduling decesions.How? It attempts to place tasks on data-local nodes(in the ideal case), but if this is not possible, it prefers rack-local placement to non=local placement.

Task Execution


  • Step 9a and 9b: After a container has been assigned to the task by resource manager's scheduler, the application master starts the container by contacting the node manager.
  • Step 10:The task is executed by a Java application whose main class is YarnChild. Before running the task it localizes the resources that task needs, which includes the job configuration and JAR file and any files from distributed cache.
  • Step 11: Finally, it runs the map or reduce task
Note: Unlike MapReduce 1 YARN does not suppoert JVM reuse,so each task runs in a new JVM.Streaming and Pipes programs work in the same way as MapReduce 1.The YarnChild launches the Streaming or Pipes process and communicates with it using standard input/output or a socket(respectively).

Progress and status updates


  • When running under YARN , the task reports its progress and status back to its application master,which has an aggregate view of the job,every three seconds over the umbilical interface.The clients polls the appkiaction master every second to receive progress updates,which are usually displayed to the user.


Job Completion


  • Every five seconds the client checks whether the job has completed by calling the waitForCompletion() method on Job.The polling interval can be set via the mapreduce.client.completon.pollinterval configuration property. On job completion,the applcation master and the task containers clean up their working stste , and the OutputCommitter's job cleanup method is called.Job information is archived by the job history server to enable later interrogation b users if desired.

4 comments:

  1. wonderful piece of information from the definitive guide , please keep sharing the info of upgrade of MR 1 to Yarn/MR 2

    ReplyDelete
  2. Nice and good article. It is very useful for me to learn and understand easily. Thanks for sharing your valuable information and time. Please keep updating hadoop online training

    ReplyDelete
  3. Herpes Virus whether it is oral or genital. To control its symptoms, you usually do many things but it doesn’t give you the expected results. And sometimes some medicines can even give you side effects which can make your situation more critical. Personally I always prefer natural cure for herpes Or any Other Infection because they won’t give you side effects. You can cure your infection/Diseases smoothly and with less trouble with natural remedies. I Strongly Recommend Herbal doctor Razor's Traditional Medicine , Get in touch with him on his Facebook Page https://web.facebook.com/HerbalistrazorMedicinalcure He is blessed with the wisdom to get rid of this virus and other Diseases. I had suffered from this Virus since I was a child, I'd learnt to live with it but still wanted to get cured of it and DOC RAZOR simply helped me with that . All thanks To Doctor Razor Who Rescued Me. Contact him on email : drrazorherbalhome@gmail.com, . Reach Him directly on https://wa.me/message/USI4SETUUEW4H1

    ReplyDelete