Monday, June 11, 2012

Apache Pig over Hadoop

In the last three blog posts we looked at:
   
  • Hadoop and HDFS setup
       
  • Hive installation and example
       
  • Using the Jasper Hive plugin to generate Jasper reports


Pig is another such tool that exposes a structured language which runs over Hadoop and HDFS.

In this blog we will install Pig and run the same sample example: extracting the mobile phone
number and name of the persons whose id is less than or equal to 10. We will write Pig
statements for the same.



We start by installing Pig.

  • Download and install the Pig Debian package:
             
    • dpkg -i pig_0.10.0-1_i386.deb
       
  • Start the DFS and MapReduce services:
       
        
    • start-all.sh
  • If Pig is run in local mode, there is no need to perform the above step.
  • Connect to the Pig shell (we will connect locally here):
           
    • pig -x local
     
  • Once we are in the Pig shell (the prompt is named grunt :) .. funny ..), we will load the file from the local file system to HDFS using Pig:

          
    • copyFromLocal export.csv export.csv
          
  • We will now load the data from HDFS into a Pig relation (similar to a table in Hive):
       
    • person = LOAD 'export.csv' USING PigStorage(',') AS (PERSON_ID:int, NAME:chararray, FIRST_NAME:chararray, LAST_NAME:chararray, MIDDLE_NAMES:chararray, TITLE:chararray, STREET_ADDRESS:chararray, CITY:chararray, COUNTRY:chararray, POST_CODE:chararray, HOME_PHONE:chararray, WORK_PHONE:chararray, MOBILE_PHONE:chararray, NI_NUMBER:chararray, CREDITLIMIT:chararray, CREDIT_CARD:chararray, CREDITCARD_START_DATE:chararray, CREDITCARD_END_DATE:chararray, CREDITCARD_CVC:chararray, DOB:chararray);
     
  • We can see the contents of person using the dump command:
              
    • dump person;
  • Run a statement to filter out the persons whose person id is less than or equal to 10:
             
    • top_ten = FILTER person BY PERSON_ID <= 10;
    Dump top_ten to see the output.
   
  • Run a statement to extract the name and the mobile number from that list:

    • mobile_numbers = FOREACH top_ten GENERATE NAME, MOBILE_PHONE;

    Dump mobile_numbers to see the output.

   This is the output we desire. The same flow can also be driven from Java, as sketched below.
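
Pig also exposes a Java API (PigServer) that runs the same Pig Latin programmatically instead of through the grunt shell. Below is a minimal sketch, assuming the pig-0.10.0 jar is on the classpath and export.csv sits in the working directory; the class name TopTenMobileNumbers and the positional field references ($0 = PERSON_ID, $1 = NAME, $12 = MOBILE_PHONE in the 20-column file) are illustrative choices, not part of the walkthrough above.

package com.sanket;

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class TopTenMobileNumbers {
 public static void main(String[] args) throws Exception {
  // ExecType.LOCAL mirrors "pig -x local"; use ExecType.MAPREDUCE for a real cluster.
  PigServer pig = new PigServer(ExecType.LOCAL);

  // Load the CSV positionally: $0 is PERSON_ID, $1 is NAME, $12 is MOBILE_PHONE.
  pig.registerQuery("person = LOAD 'export.csv' USING PigStorage(',');");
  pig.registerQuery("top_ten = FILTER person BY (int)$0 <= 10;");
  pig.registerQuery("mobile_numbers = FOREACH top_ten GENERATE $1, $12;");

  // Equivalent of "dump mobile_numbers;" in the grunt shell.
  Iterator<Tuple> it = pig.openIterator("mobile_numbers");
  while (it.hasNext()) {
   System.out.println(it.next());
  }
 }
}

Running the class prints the same (NAME, MOBILE_PHONE) tuples that dump showed in the grunt shell.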

Friday, June 8, 2012

Develop Jasper report with Hive


In the last two blog posts we learned:
  • How to set up Hadoop and write simple map-reduce jobs
  • How to set up Hive and fire SQL queries over it

In this blog we will use Jasper Reports to generate a report which uses Hive as the data store.
We will generate a report from the list of customers who have a mobile phone.
It is assumed that you have Jaspersoft iReport Designer pre-installed.

  • Start Hive in server mode so that we can connect to it using a JDBC client:
      • hive --service hiveserver

  • Create the table and load the data into the Hive table from the Hive shell. This is done so that we can query it. Hadoop map-reduce programs will be called internally to fetch data from this table. The data will be distributed over HDFS and will be collected and returned according to the query.
      • hive -p 10000 -h localhost
      • CREATE TABLE person (PERSON_ID INT, NAME STRING, FIRST_NAME STRING, LAST_NAME STRING, MIDDLE_NAMES STRING, TITLE STRING, STREET_ADDRESS STRING, CITY STRING, COUNTRY STRING, POST_CODE STRING, HOME_PHONE STRING, WORK_PHONE STRING, MOBILE_PHONE STRING, NI_NUMBER STRING, CREDITLIMIT STRING, CREDIT_CARD STRING, CREDITCARD_START_DATE STRING, CREDITCARD_END_DATE STRING, CREDITCARD_CVC STRING, DOB STRING) row format delimited fields terminated by ',';
      • load data inpath 'export.csv' overwrite into table person;




  • Start the iReport Designer
    • Create a new datasource to connect to the Hive database. This is the first step, which adds the Hive database connection.


  • Create a new report. Refer to the screenshots for more details. A query is given to fetch the appropriate data from Hive.




    This way we now have a distributed file system (HDFS), a map-reduce engine above it (Hadoop), a data warehousing tool over these frameworks (Hive), and finally a reporting tool to extract meaningful data out of it and display it. Jasper Reports has built-in capabilities to communicate with Hive (via JDBC), as sketched below.
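
    For reference, here is a minimal sketch of the kind of JDBC connection iReport makes to Hive behind the scenes. It assumes HiveServer is running on localhost:10000 as started above and that the Hive 0.9 JDBC driver jars (plus hadoop-core) are on the classpath; the class name HiveJdbcCheck and the sample query are illustrative, not taken from the actual report definition.

package com.sanket;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcCheck {
 public static void main(String[] args) throws Exception {
  // Hive 0.9 (HiveServer1) ships this driver class; newer releases use
  // org.apache.hive.jdbc.HiveDriver with a jdbc:hive2:// URL instead.
  Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

  Connection con = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
  Statement stmt = con.createStatement();

  // The same kind of query the Jasper report fires against the person table.
  ResultSet rs = stmt.executeQuery("SELECT name, mobile_phone FROM person WHERE mobile_phone <> ''");
  while (rs.next()) {
   System.out.println(rs.getString(1) + "\t" + rs.getString(2));
  }

  rs.close();
  stmt.close();
  con.close();
 }
}

    The driver class and the jdbc:hive:// URL are essentially what the iReport datasource dialog asks for.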


    Peace.
    Sanket Raut

Thursday, June 7, 2012

Apache Hive example


Hive is a data warehousing solution which runs on top of HDFS and Hadoop, so it assumes you already have HDFS and Hadoop configured. I have used the same input file and fired Hive queries, which in turn fire Hadoop MapReduce jobs.

The following steps were done to install Hive.
  • Assume you have a Hadoop installation up and running (described in the earlier post).
  • Download the Hive binaries from the Apache site.
  • Unzip hive-0.9.0-bin.tar.gz into a directory.
  • cd to the unzipped directory and fire the following commands:
    • export HIVE_HOME=$PWD
    • export PATH=$PWD/bin:$PATH

Once all the above steps are done, we are ready to enter the Hive shell. This shell will let us enter Hive commands.
  • Enter the command:
    • hive

Once you are in the Hive shell, you are ready to fire HiveQL commands.
Since in the earlier post we had a CSV file, we will create a table for it. This creates a Hive table into which we will load the data. The data will be distributed over HDFS across all the nodes.
  • CREATE TABLE person (PERSON_ID INT, NAME STRING, FIRST_NAME STRING, LAST_NAME STRING, MIDDLE_NAMES STRING, TITLE STRING, STREET_ADDRESS STRING, CITY STRING, COUNTRY STRING, POST_CODE STRING, HOME_PHONE STRING, WORK_PHONE STRING, MOBILE_PHONE STRING, NI_NUMBER STRING, CREDITLIMIT STRING, CREDIT_CARD STRING, CREDITCARD_START_DATE STRING, CREDITCARD_END_DATE STRING, CREDITCARD_CVC STRING, DOB STRING) row format delimited fields terminated by ',';

Then we will load the data from the CSV file:
  • load data inpath '<PATH_TO_FILE>/export.csv' overwrite into table person;

Now we are ready to fire some HiveQL queries, which will call the corresponding map-reduce jobs:

  • select * from person where person_id=1;
  • select count(1) from person;
  • select * from person where name like '%Bob%';

Hive makes map-reduce programming simpler by providing warehousing and SQL capabilities.

Peace.
Sanket Raut

Wednesday, June 6, 2012

Hadoop Simple Example

Hadoop Example
This post will help you understand how to install Hadoop (standalone node) and run a sample map-reduce job on it. Although the example does not reflect the most realistic usage of map-reduce, it is a good way for a starter to begin learning and coding Hadoop.



Setting up Hadoop on Ubuntu (or any other Linux)
  • Download the Debian file "hadoop_1.0.3-1_i386.deb" from the Apache Hadoop site.
  • Create a group named hadoop. On Ubuntu you need to create the group explicitly, because the Debian package will try to create a group with id 123 and that group id usually already exists.
                    sudo groupadd -g 142 -r hadoop 
  • Install Hadoop using the Debian package:
                    sudo dpkg -i hadoop_1.0.3-1_i386.deb
  • Create passphraseless SSH (this is, I suppose, a limitation of the framework as it requires passwordless SSH to be enabled; maybe in an actual setup with multiple nodes this is not needed):
       sudo su -
       ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
       cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  • Create an HDFS and format it:
      The namenode has to be formatted to initialize the file system before first use.
                     hadoop namenode -format
  • Start the Hadoop node. Starting the DFS ensures the distributed file system service is started. We also need to start the MapReduce service, which will run the map-reduce programs.
                     start-dfs.sh
                     start-mapred.sh

  • Java code
This Java code will prefix each field of the CSV file with the text "Pudo".

package com.sanket;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PsudoFile {

 public static class Map extends Mapper<Text, Text, Text, Text> {

  private Text word = new Text();

  @Override
  public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
   // With KeyValueTextInputFormat the whole CSV line ends up in the key
   // (there is no tab separator), so the value is empty. Parse the key instead.
   String line = key.toString();
   StringTokenizer tokenizer = new StringTokenizer(line, ",");
   String strKey = tokenizer.nextToken(); // first column: PERSON_ID
   word.set(line);
   // Emit (person id, whole line).
   context.write(new Text(strKey), word);
  }
 }

 public static class Reduce extends Reducer<Text, Text, Text, Text> {

  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
   for (Text val : values) {
    // Prefix every comma-separated field with "Pudo" and rebuild the line.
    StringTokenizer st = new StringTokenizer(val.toString(), ",");
    StringBuffer sb = new StringBuffer();
    while (st.hasMoreTokens()) {
     sb.append("Pudo").append(st.nextToken()).append(",");
    }
    // Drop the trailing comma; the output key is left empty.
    context.write(new Text(), new Text(sb.substring(0, sb.length() - 1)));
   }
  }
 }

 public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job reverse = new Job(conf, "ReadCSV");
  reverse.setJarByClass(PsudoFile.class);
  reverse.setOutputKeyClass(Text.class);
  reverse.setOutputValueClass(Text.class);
  reverse.setMapOutputKeyClass(Text.class);
  reverse.setMapOutputValueClass(Text.class);
  reverse.setMapperClass(Map.class);
  reverse.setReducerClass(Reduce.class);
  // The regular TextInputFormat gives a class cast exception
  // (java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text)
  // because its keys are LongWritable offsets, so KeyValueTextInputFormat is used instead.
  reverse.setInputFormatClass(KeyValueTextInputFormat.class);
  reverse.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(reverse, new Path(args[0]));
  FileOutputFormat.setOutputPath(reverse, new Path(args[1]));
  reverse.waitForCompletion(true);
 }
}


  • Compile the class and jar it. We will use the jar to run the map-reduce program.
  javac -classpath hadoop-core-1.0.3.jar -d bin PsudoFile.java
  jar cvf FirstProgram.jar -C bin/ .
  • Add the input file to HDFS from the local file system:
       hadoop fs -mkdir inputfile
       hadoop fs -put export.csv inputfile
  • Run the program using the following command:
           hadoop jar FirstProgram.jar com.sanket.PsudoFile inputfile/export.csv outputcsv


  • Check the output and then delete the existing output directory. The output format gives an error if the output directory already exists:
      NOW=$(date +"%b-%d-%s")
      LOGFILE="log-$NOW.log"
      hadoop fs -cat outputcsv/part-r-00000 > $LOGFILE
      hadoop fs -rmr outputcsv 
  • The output will be present in the log-$NOW.log file.


The output will not be in the same sequence as the input because of the internal sort done by map-reduce. The workaround is to implement your own Key and override the compare method, as sketched below. This will ensure that the output order matches the input. Ideally, though, the data to be analyzed need not be in the same sequence to derive meaning out of a huge chunk of data.
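
A minimal sketch of such a custom key follows. It assumes each input line's byte offset is carried along as the ordering field; the class name OrderedKey is an illustrative choice, and the mapper and job configuration would still have to be changed to emit and accept this key (for example, setMapOutputKeyClass(OrderedKey.class)).

package com.sanket;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Custom key that sorts on the original byte offset of the input line,
// so the reducer output comes back in the same order as the input file.
public class OrderedKey implements WritableComparable<OrderedKey> {

 private long offset;   // byte offset of the line in the input file
 private String value;  // the original key text (e.g. the person id)

 public OrderedKey() {
 }

 public OrderedKey(long offset, String value) {
  this.offset = offset;
  this.value = value;
 }

 public void write(DataOutput out) throws IOException {
  out.writeLong(offset);
  out.writeUTF(value);
 }

 public void readFields(DataInput in) throws IOException {
  offset = in.readLong();
  value = in.readUTF();
 }

 // Comparing on the offset makes the map-reduce sort reproduce the file order.
 public int compareTo(OrderedKey other) {
  return (offset < other.offset) ? -1 : ((offset == other.offset) ? 0 : 1);
 }

 public int hashCode() {
  // Keys with the same offset land in the same reducer partition.
  return (int) (offset ^ (offset >>> 32));
 }

 public boolean equals(Object o) {
  return (o instanceof OrderedKey) && ((OrderedKey) o).offset == offset;
 }

 public String toString() {
  return value;
 }
}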

A very practical example is:
http://online.wsj.com/article/SB10001424052702303444204577460552615646874.html


Peace.
Sanket Raut