Wednesday, 8 June 2011

How to write a MapReduce job?

Skeleton of a basic MapReduce program:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyClass /*extends Configured implements Tool*/ {
/**
 * The Mapper class of MyClass
 */
public static class MyClassMapper
    extends Mapper<Object, Text, Text, IntWritable> {

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
            /* your code goes here */
            context.write(outputKey, outputValue); // outputKey and outputValue are objects of your output key/value types
    }
}
/**
 * The Reducer class of MyClass
 */
public static class MyClassReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
            /* your code goes here */
            context.write(outputKey, outputValue); // outputKey and outputValue are objects of your output key/value types
    }
}
/**
 * The main entry point.
 */
public static void main(String[] args) throws Exception {
    
      Configuration conf = new Configuration();
      Job job = new Job(conf, "skeleton");
      job.setJarByClass(MyClass.class);
      job.setMapperClass(MyClassMapper.class);
      job.setReducerClass(MyClassReducer.class);
      job.setOutputKeyClass(your_output_key_type.class);
      job.setOutputValueClass(your_output_value_type.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
   
      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}

Note: your_output_key_type and your_output_value_type can be any of the Writable data type classes provided by Hadoop, such as Text, IntWritable, LongWritable, etc. The Mapper's output (key, value) types must match the Reducer's input (key, value) types, but the Reducer's output (key, value) types need not be the same as the Mapper's output (key, value) types.
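To make the skeleton concrete, here is a minimal sketch of the classic word count job written against the same org.apache.hadoop.mapreduce API (the class names WordCount, WordCountMapper and WordCountReducer are only illustrative). The Mapper splits each line into tokens and emits (word, 1) pairs as (Text, IntWritable); the Reducer sums the counts per word. Note that the Mapper's output types match the Reducer's input types, exactly as described in the note above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    /** Emits (word, 1) for every token in each input line. */
    public static class WordCountMapper
        extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one); // output types: (Text, IntWritable)
            }
        }
    }

    /** Sums the counts for each word. */
    public static class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result); // output types: (Text, IntWritable)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Once compiled and packaged into a jar, it can be run with something like: hadoop jar wordcount.jar WordCount <input_dir> <output_dir>, where the two arguments are HDFS paths (the jar and path names here are placeholders).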

2 comments:

  1. I think you could write in more detail on MapReduce: how to handle it from start to end with a cluster of nodes and how the data is shared, with an example.
    Thank you
    Digbijayee

  2. It would help more if you could give the full code instead of skeleton code.
