Monday, December 8, 2014

Hadoop & Big Data - Simple Introduction

When and where can you use Hadoop - Map/Reduce?

Relational databases are best suited for transactional data and are also widely used in data warehousing. Hadoop and map/reduce are useful when you need to process massive amounts of data quickly and at lower cost: the stack is open source, runs on commodity servers clustered together, and provides concurrent processing out of the box by sub-dividing data into smaller chunks that can be processed in parallel. Running analytics over a very large dataset is a good example of a Hadoop - map/reduce use case.


How do you design your system to use Hadoop - map/reduce?

The canonical map-reduce tutorial counts words in a large text file, or some other extremely large input. The framework splits and sorts the input for you. The map program takes this input and groups the data; for word count it outputs word1 freq1; word2 freq2; etc. The reducer then combines the mapper output, possibly over multiple cycles, to eventually produce the final output.
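That flow can be sketched in plain Python, with a dictionary standing in for the framework's shuffle/sort step (a simplification - the real framework distributes this across machines):

```python
# Minimal word-count sketch: map emits (word, 1) pairs, the "shuffle"
# groups them by key, and reduce sums the counts per word.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    return (word, sum(counts))

def map_reduce(lines):
    groups = defaultdict(list)          # stands in for shuffle/sort
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

print(map_reduce(["the quick fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```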

An easy way to think of the two main phases: map is essentially a data grouper - it groups the data you want to process by similarity - and the reducer processes the already-grouped data. Assume we have a very large input, e.g. an entire year's Lego store sales, and we are trying to find the best-selling product by finding the product that makes the most money.

For example, top-selling product by location (columns: province, store, product, units sold, unit price):
AB, EDM1, Lego Friends, 100, 50
AB, EDM2, Lego Creator, 70, 100
ON, GTA1, Duplo, 100, 45
ON, GTA2, bricks, 1000, 20
BC, VAN1, Lego Farm, 150, 35
BC, VAN2, bricks, 750, 10
ON, GTA3, Lego Friends, 400, 40
BC, VAN3, Lego Creator, 200, 110
..

The map step groups the input by province and passes it on to the reducer. The reducer finds the top-selling product in each province and outputs the result:

AB, EDM2, Lego Creator, 7000
ON, GTA2, bricks, 20000
BC, VAN3, Lego Creator, 22000
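The same grouping-then-reducing flow can be sketched in plain Python (the column meanings - province, store, product, units, unit price - are inferred from the sample data above):

```python
# Map: key each record by province, computing revenue = units * price.
# Reduce: within each province, keep the record with the highest revenue.
from collections import defaultdict

rows = [
    ("AB", "EDM1", "Lego Friends", 100, 50),
    ("AB", "EDM2", "Lego Creator", 70, 100),
    ("ON", "GTA1", "Duplo", 100, 45),
    ("ON", "GTA2", "bricks", 1000, 20),
    ("BC", "VAN1", "Lego Farm", 150, 35),
    ("BC", "VAN2", "bricks", 750, 10),
    ("ON", "GTA3", "Lego Friends", 400, 40),
    ("BC", "VAN3", "Lego Creator", 200, 110),
]

grouped = defaultdict(list)
for province, store, product, units, price in rows:
    grouped[province].append((store, product, units * price))

for province, records in grouped.items():
    store, product, revenue = max(records, key=lambda r: r[2])
    print(province, store, product, revenue)
# AB EDM2 Lego Creator 7000
# ON GTA2 bricks 20000
# BC VAN3 Lego Creator 22000
```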


What is the solution when a reducer runs out of memory? For example, in the word-count problem, if the file is very large and contains one word very frequently (like 'the'), that word's reducer can run out of memory.
Possible solutions:
Increase the number of reducers
Increase memory per reducer
Implement a combiner
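A combiner pre-aggregates each mapper's local output before it is sent over the network, so the reducer for a hot key like 'the' receives one partial sum per mapper instead of one record per occurrence. A minimal sketch:

```python
# Combiner for word count: runs on one mapper's local (word, 1) output
# and pre-sums the counts, shrinking what the reducer must hold.
from collections import Counter

def combiner(pairs):
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())

pairs = [("the", 1), ("cat", 1), ("the", 1), ("the", 1)]
print(combiner(pairs))  # [('the', 3), ('cat', 1)]
```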

PySpark code for the above example (WIP; column names col1..col5 as in the sample input - note `withColumn` is used because the pandas-style `assign` does not exist in PySpark, and `agg` takes a column-to-function dict):
store = sqlContext.read.parquet("...")
store = store.withColumn("col6", store.col4 * store.col5)
store.groupBy(store.col1).agg({"col6": "max"}).show()
(This gives the max revenue per province; identifying which product achieved it would additionally need a join or a window function.)

Wednesday, October 15, 2014

Shell script to generate comma-separated output from a file

#!/bin/bash
counter=1
invoice_clause=""
while read line
do
    if [ "$counter" -eq 1 ]
    then
        invoice_clause="'$line'"
    else
        invoice_clause="$invoice_clause,'$line'"
    fi
    counter=`expr $counter + 1`
done < invoices.num

echo $invoice_clause


Input file :
one
two
three
four
five

Output:
'one','two','three','four','five'

Simple Python script to convert name-value pairs to tabular format

#!/usr/bin/env python
import csv

data = dict()
keys = []  # remember first-seen order of the keys for output
with open("mapping.csv") as f:
    reader = csv.reader(f, delimiter=",", skipinitialspace=True)
    for key, prop, val in reader:
        if key in data:  # found it: append this value to the key's row
            data[key].append(val)
        else:
            data[key] = [val]
            keys.append(key)

with open('myfile.csv', 'w', newline='') as f:
    w = csv.writer(f, dialect='excel')
    w.writerows([key] + data[key] for key in keys)

Example mapping.csv file:

Key1, prop1, value1
Key1, prop2, value2
Key1, prop3, value3
Key2, prop1, value4
Key2, prop2, value5
Key2, prop3, value6

Output file:
Key1, value1, value2, value3
Key2, value4, value5, value6

Technical Interview questions

Q: What happens when you enter URL in the browser?
A:
1.    browser checks cache; if requested object is in cache and is fresh, skip to 9
2.    browser asks OS for server's IP address
3.    OS makes a DNS lookup and replies the IP address to the browser
4.    browser opens a TCP connection to server (this step is much more complex with HTTPS)
5.    browser sends the HTTP request through TCP connection
6.    browser receives HTTP response and may close the TCP connection, or reuse it for another request
7.    browser checks if the response is a redirect (3xx result status codes), authorization request (401), error (4xx and 5xx), etc.; these are handled differently from normal responses (2xx)
8.    if cacheable, response is stored in cache
9.    browser decodes response (e.g. if it's gzipped)
10.  browser determines what to do with response (e.g. is it an HTML page, is it an image, is it a sound clip?)
11.  browser renders response, or offers a download dialog for unrecognized types
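Step 7 above is easy to sketch: the browser branches on the class of the status code. A minimal illustration (the category names here are informal):

```python
# Classify an HTTP status code the way step 7 describes: redirects
# (3xx), authorization requests (401), other errors (4xx/5xx), and
# normal responses (2xx) are all handled differently.
def classify(status):
    if 200 <= status < 300:
        return "success"
    if 300 <= status < 400:
        return "redirect"
    if status == 401:
        return "authorization request"
    if 400 <= status < 600:
        return "error"
    return "unknown"

print(classify(301))  # redirect
print(classify(404))  # error
```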

Q: How do you implement a queue using stacks?
A: Hint: use two stacks
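A minimal sketch of the two-stack queue (Python lists used as stacks): enqueue pushes onto an inbox stack; dequeue pops from an outbox stack, refilling it from the inbox only when it is empty. Each element moves at most twice, so the amortized cost per operation is O(1).

```python
class QueueWithStacks:
    def __init__(self):
        self.inbox = []   # receives new items
        self.outbox = []  # serves items in FIFO order

    def enqueue(self, item):
        self.inbox.append(item)

    def dequeue(self):
        if not self.outbox:
            # Reverse the inbox into the outbox, restoring FIFO order.
            while self.inbox:
                self.outbox.append(self.inbox.pop())
        return self.outbox.pop()

q = QueueWithStacks()
q.enqueue(1); q.enqueue(2); q.enqueue(3)
print(q.dequeue(), q.dequeue())  # 1 2
```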

Q: Design an elevator system in a high-rise tower
A: Hint: use elevation and speed parameters to control stopping the elevator at a specific floor.

Q: How do you implement a HashMap, and which methods must be overridden?
A: Hint: override the equals and hashCode methods of the Object class.
The main methods to implement are put and get.
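A minimal hash-map sketch using separate chaining, showing why consistent equals/hashCode behavior matters (Python's built-in hash and == play those roles here):

```python
class SimpleHashMap:
    def __init__(self, capacity=16):
        self.buckets = [[] for _ in range(capacity)]

    def _bucket(self, key):
        # hashCode() picks the bucket; keys with equal hashes collide
        # into the same chain and are disambiguated by equals().
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # equals() decides replacement
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

m = SimpleHashMap()
m.put("a", 1)
m.put("a", 2)  # same key: value is replaced, not duplicated
print(m.get("a"))  # 2
```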

Q: What is dependency injection?
A: Supplying an object's dependencies from the outside (often via configuration) instead of having the object construct them itself - constructor injection, setter injection, etc., as used in the Spring framework.
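A minimal constructor-injection sketch (class names are made up for illustration; a container like Spring would do this wiring from configuration):

```python
class SmtpMailer:
    """A concrete dependency; a test can substitute a fake."""
    def send(self, to, body):
        print("sending to", to)

class WelcomeService:
    def __init__(self, mailer):   # dependency injected via constructor
        self.mailer = mailer

    def greet(self, user):
        self.mailer.send(user, "welcome!")

# Wiring is done by the caller (or a DI container), not the service.
service = WelcomeService(SmtpMailer())
service.greet("alice@example.com")
```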

Q: How does Spring AOP work?
A: Hint: it uses wrapper (proxy) classes. Another way is to modify bytecode, but that is not how Spring AOP itself works.
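The proxy idea can be sketched with a Python decorator standing in for the wrapper class: the advice (logging here) runs around the target method without the target knowing about it.

```python
import functools

def logged(func):
    """The 'proxy': wraps the target and runs advice around it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print("before", func.__name__)   # before-advice
        result = func(*args, **kwargs)   # the actual target call
        print("after", func.__name__)    # after-advice
        return result
    return wrapper

@logged
def transfer(amount):
    return amount

transfer(100)
```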

Q: How do you keep track of multiple requests on the backend modifying the same object?
A: Through versioning: increment a record version in the database to avoid race conditions/data corruption.
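A minimal sketch of that optimistic-locking scheme (in SQL this is typically `UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?`; the dictionary below stands in for the database row):

```python
record = {"id": 1, "balance": 100, "version": 3}

def update(record, new_balance, expected_version):
    # Only succeed if no one else changed the row since we read it.
    if record["version"] != expected_version:
        raise RuntimeError("stale data: reload and retry")
    record["balance"] = new_balance
    record["version"] += 1

update(record, 150, expected_version=3)   # succeeds, version becomes 4
# update(record, 175, expected_version=3) # would fail: version is now 4
```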

Q: Describe popular design patterns
A: Singleton, Factory (with the Spring framework you get these out of the box),
Visitor, Facade, Builder, Strategy