2014/07/04

Control Cluster, Bootstrap and Step Statuses

In this post we are going to learn how to get the status of a cluster and of its components (bootstrap actions and step actions). In some cases the application has to wait for a step to finish or fail; this is achieved by implementing a status loop that checks the state of the desired element at regular intervals (Amazon has a request limit to avoid potential attacks, so the checks cannot be too frequent). This post covers how to implement this kind of loop for job flows, bootstrap actions, and step actions.

If you do not know how a cluster is created, a previous post covering the topic is accessible via this link. Even if you are not interested in how a cluster is created, reading it is recommended, as it makes the matters discussed here easier to follow.

Check Job Flow (or cluster) status

Without getting into too much detail, the following code shows how to check the status for a certain job flow (multiple job flows can be analysed at the same time):


String state;
String jobFlowId = runJobFlowResult.getJobFlowId(); //Created in the previous post

DescribeJobFlowsRequest jobFlowDescRequest;
DescribeJobFlowsResult jobFlowDescResult ; 
JobFlowDetail jobFlowDetail;

STATUS_LOOP: while(true){

   jobFlowDescRequest = new DescribeJobFlowsRequest(Arrays.asList(new String[] {jobFlowId }));
   jobFlowDescResult = emrClient.describeJobFlows(jobFlowDescRequest);
            
   jobFlowDetail = jobFlowDescResult.getJobFlows().get(0);
   state = jobFlowDetail.getExecutionStatusDetail().getState().toString();
     
   if(jobFlowIsDone(state)){
          break;    
   }else{
        try {
             Thread.sleep(5000); //To avoid making requests too frequently
        } catch (InterruptedException ex) {
             Thread.currentThread().interrupt(); //Restore the interrupt flag
             break; //Stop polling if the thread is interrupted
        }
   }
}

//****************************
//****** EXTRA FUNCTION ******

private static final List<JobFlowExecutionState> JOB_DONE_STATES = Arrays.asList(new JobFlowExecutionState[] {    

     JobFlowExecutionState.COMPLETED, 
     JobFlowExecutionState.FAILED, 
     JobFlowExecutionState.TERMINATED, 
     JobFlowExecutionState.WAITING,
     JobFlowExecutionState.RUNNING});

private boolean jobFlowIsDone (String value){
   return JOB_DONE_STATES.contains(JobFlowExecutionState.fromValue(value));
} 

As can be seen, the method consists of a simple request to the EMR server, which returns a state that clients can inspect. The logic is simple: after requesting the job flow status from the server, it checks whether that status is one of completed, failed, terminated, waiting, or running. Note that waiting and running are in the list, so the loop also ends as soon as the cluster is up, not only when it has finished. If the status matches an element of the list, the loop is broken and the application continues. Otherwise, the thread sleeps for 5000 ms (the value is configurable) and the loop runs once again.
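The loop shown above never ends if the job flow does not reach one of the listed states. The following is only a minimal sketch of how the same idea could be bounded with a timeout. It uses nothing but the JDK: the class `StatusPoller`, the method `waitFor`, and the fake state source in `main` are hypothetical names made up for this example, not part of the AWS SDK. In a real application the `Supplier` would wrap the `describeJobFlows` call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical helper: polls a state source until it reports a target state or time runs out. */
final class StatusPoller {

    private StatusPoller() { }

    /**
     * Polls stateSource every intervalMillis until it returns one of targetStates.
     *
     * @return the matching state, or null if the timeout elapsed or the thread was interrupted
     */
    static String waitFor(Supplier<String> stateSource, Set<String> targetStates,
                          long intervalMillis, long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            String state = stateSource.get();
            if (targetStates.contains(state)) {
                return state;
            }
            if (System.currentTimeMillis() + intervalMillis > deadline) {
                return null; // The next poll would land after the deadline: give up
            }
            try {
                Thread.sleep(intervalMillis); // To avoid making requests too frequently
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt(); // Preserve the interrupt for the caller
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // Fake state source standing in for the describeJobFlows request (assumption)
        final String[] fakeStates = {"STARTING", "BOOTSTRAPPING", "RUNNING"};
        final int[] calls = {0};
        Supplier<String> source = () -> fakeStates[Math.min(calls[0]++, fakeStates.length - 1)];

        Set<String> targets = new HashSet<>(Arrays.asList(
                "COMPLETED", "FAILED", "TERMINATED", "WAITING", "RUNNING"));
        System.out.println(waitFor(source, targets, 10, 1000));
    }
}
```

Returning null on timeout keeps the sketch short; depending on the application, throwing an exception may be a clearer way to signal that the cluster never reached the expected state.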

Check step (and bootstrap) status

In order to be able to obtain the status of the steps or bootstrap actions, only a small syntax change is needed. The changes are shown in the following code:


String state;
String jobFlowId = runJobFlowResult.getJobFlowId(); //Created in the previous post
StepDetail stepDetail;
DescribeJobFlowsRequest jobFlowDescRequest;
DescribeJobFlowsResult jobFlowDescResult ; 
JobFlowDetail jobFlowDetail;

STATUS_LOOP: while(true){

   jobFlowDescRequest = new DescribeJobFlowsRequest(Arrays.asList(new String[] {jobFlowId }));
   jobFlowDescResult = emrClient.describeJobFlows(jobFlowDescRequest);
          
   List<StepDetail> steps = jobFlowDescResult.getJobFlows().get(0).getSteps();
   stepDetail = steps.get(steps.size() - 1); //Last step of the job flow
   state = stepDetail.getExecutionStatusDetail().getState().toString();
     
   if(stepIsDone(state)){
          break;    
   }else{
        try {
             Thread.sleep(5000); //To avoid making requests too frequently
        } catch (InterruptedException ex) {
             Thread.currentThread().interrupt(); //Restore the interrupt flag
             break; //Stop polling if the thread is interrupted
        }
   }
}

//****************************
//****** EXTRA FUNCTION ******

private static final List<StepExecutionState> STEP_DONE_STATES = Arrays.asList(new StepExecutionState[] {    

     StepExecutionState.COMPLETED, 
     StepExecutionState.FAILED, 
     StepExecutionState.CANCELLED,
     StepExecutionState.INTERRUPTED});

private boolean stepIsDone (String value){
   return STEP_DONE_STATES.contains(StepExecutionState.fromValue(value));
} 

The code above gets the status of a single step, the last one in the job flow. It can be modified to check all the steps and bootstrap actions, active or not; it all depends on the application the user wants to develop.
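As an illustration of checking every step instead of only the last one, here is a small JDK-only sketch. It assumes the state of each step has already been read into a list of strings (for instance by calling `getExecutionStatusDetail().getState()` on each element of `getSteps()`). The class `StepStates` and its methods are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper that works on step states already read from the job flow. */
final class StepStates {

    // Same states as STEP_DONE_STATES, kept as plain strings for this sketch
    static final Set<String> DONE = new HashSet<>(Arrays.asList(
            "COMPLETED", "FAILED", "CANCELLED", "INTERRUPTED"));

    private StepStates() { }

    /** Returns true only when every step is in a done state (an empty list counts as done). */
    static boolean allDone(List<String> states) {
        for (String state : states) {
            if (!DONE.contains(state)) {
                return false;
            }
        }
        return true;
    }

    /** Returns the indexes of the steps that have not finished yet. */
    static List<Integer> pendingIndexes(List<String> states) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < states.size(); i++) {
            if (!DONE.contains(states.get(i))) {
                pending.add(i);
            }
        }
        return pending;
    }

    public static void main(String[] args) {
        List<String> states = Arrays.asList("COMPLETED", "RUNNING", "PENDING");
        System.out.println(allDone(states));
        System.out.println(pendingIndexes(states));
    }
}
```

Inside the status loop, `allDone` would replace the `stepIsDone(state)` condition, and `pendingIndexes` can be used to log which steps are still being waited for.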

In the next post we will see how to resize a running cluster using instance groups. If this post has been useful to you, please comment and share it so others can also take advantage of these pieces of code.




2014/07/03

Create an Elastic MapReduce (EMR) Cluster with the Java API

In this post we will learn how to start creating and managing clusters using the Java API provided by Amazon for its Elastic MapReduce service. To start with, we need to download and extract the Amazon Web Services SDK. At the time of writing, version 1.8.3 has been used (be aware that other versions may differ in several of the things implemented in this post). After downloading and extracting the SDK and adding the corresponding libraries to the chosen IDE (NetBeans IDE 7.4 in this case), we are ready to start creating the cluster.

As in CLI mode, creating a cluster consists of specifying several parameters that Amazon needs to understand. Let's analyse the parameters one by one:

AMI version

The AMI version determines the Amazon Machine Image to be used, which in turn determines the Hadoop version included in the image. Remember that clusters automatically install Hadoop on every chosen node, so it is important to choose wisely. The main recommendation is to choose an AMI that bundles the Hadoop version that best fits your needs.

In my project I have used a Hadoop app developed with its newest (at this time) version, 2.4.0, so I selected AMI version 3.1.0.

Logging

Amazon provides a useful logging infrastructure that stores information about the nodes in the desired location. Logs provide very useful information that can be used to investigate the cluster's behaviour and to diagnose errors. The most common practice is to gather all the log files in Amazon's Simple Storage Service (S3).

Bootstrap actions

This parameter lets the user execute scripts of any kind before Hadoop applications start running. Bootstrap actions are commonly used to install third-party applications and built-in ones such as Ganglia. They are also a good place to add libraries that the user's applications will consume later.

Step actions

Step actions are jobs run once the cluster is running. They consist mostly of custom jobs such as Hadoop or Spark applications. As it is explained, there are two types 

Instance Configuration

Amazon offers a wide range of instances to be used within EMR. It is important to create an instance configuration that determines the type and quantity of instances to be used, and that fits the user's requirements. The following link lists all the instance types supported by EMR.

Welcoming post - Motivation and objectives

Hi,

This is my first post on this blog. The main motivation for creating it has been the lack of official (and unofficial) information regarding the creation and administration of clusters and Hadoop applications in Amazon's Elastic MapReduce using its Java API. In fact, most of the official information refers to the use of the command line interface. My main aim is to provide programmers with tools and useful information to create clusters and deploy Hadoop applications as quickly and easily as possible.

In my personal case, I had to read almost the entire API documentation in order to be able to create clusters and start deploying Hadoop applications. I believe it's the best way to learn how to use a certain API, although I also believe a newbie never turns down some help. I hope readers will find the provided information useful for their purposes, and in any case I will try to be active in answering any question or suggestion.

As you have probably already noticed, English is not my native language, so expect to find several mistakes of all kinds; any correction will be gratefully accepted.