Monday, December 8, 2014

Hadoop & Big Data - Simple Introduction

When and where you can use Hadoop - Map/Reduce?

Relational databases are best suitable for transactional data and also used widely in data-warehousing. Hadoop and map/reduce is useful when you have massive amount of data processing at faster speed and lower costs because it uses open source technologies can run on commodity servers clustered together to perform faster. It comes with concurrent processing out of the box as it sub-divides data into smaller chunks which can be processed concurrently. Performing analytics for a large dataset is very good example of using hadoop - map/reduce.


How do you design your system to use hadoop - map/reduce?

Often map-reduce program has tutorial to count words in a large text file or a extremely large input. The framework splits the data input and sorts the input for you. The map program takes this input to combine the data, in case of word count it outputs word1 freq1; word2 freq2;etc. Reducer runs in multiple cycle by combining output from mapper and eventually coming up with final output.

Two main approaches - easy way to think of Map and reduce is - Map is nothing but data combiner, it combines data you want to process by similarities. Reducer processes already combined data. Assume, we have very large input for e.g entire years' lego store sells, trying to find out which one is best selling product by finding max money making product.

For e.g, top selling product by location.
AB, EDM1, Lego Friends, 100, 50
AB, EDM2, Lego Creator, 70, 100
ON, GTA1, Duplo, 100, 45
ON, GTA2, bricks, 1000, 20
BC, VAN1, Lego Farm, 150, 35
BC, VAN2, bricks, 750, 10
ON, GTA3, Lego Friends, 400, 40
BC, VAN3, Lego Creator, 200, 110
..

map will combine the input by state and pass it on to reducer. Reducer will find the top selling product in each state and output the result:

AB, EDM2, Lego Creator, 7000
ON, GTA2, bricks, 20000
BC, Lego Creator, 22000


What is solution when reducer runs out of memory? For e.g in word count problem if the file is very large and contains one word very frequently like 'the', the reducer can run out of memory.
Possible solutions:
Increase number or reducers
Increase memory per reducer
Implement a combiner.

PySpark code for above example (WIP):
store = sqlContext.read.parquet("...")
store = store.assign(col6 = lambda x: col4 * x.col5)
store.groupBy(store.col1,store.col2,store.col3).agg({store.col6, max}).show()