Monday, December 8, 2014

Hadoop & Big Data - Simple Introduction

When and where you can use Hadoop - Map/Reduce?

Relational databases are best suitable for transactional data and also used widely in data-warehousing. Hadoop and map/reduce is useful when you have massive amount of data processing at faster speed and lower costs because it uses open source technologies can run on commodity servers clustered together to perform faster. It comes with concurrent processing out of the box as it sub-divides data into smaller chunks which can be processed concurrently. Performing analytics for a large dataset is very good example of using hadoop - map/reduce.


How do you design your system to use hadoop - map/reduce?

Often map-reduce program has tutorial to count words in a large text file or a extremely large input. The framework splits the data input and sorts the input for you. The map program takes this input to combine the data, in case of word count it outputs word1 freq1; word2 freq2;etc. Reducer runs in multiple cycle by combining output from mapper and eventually coming up with final output.

Two main approaches - easy way to think of Map and reduce is - Map is nothing but data combiner, it combines data you want to process by similarities. Reducer processes already combined data. Assume, we have very large input for e.g entire years' lego store sells, trying to find out which one is best selling product by finding max money making product.

For e.g, top selling product by location.
AB, EDM1, Lego Friends, 100, 50
AB, EDM2, Lego Creator, 70, 100
ON, GTA1, Duplo, 100, 45
ON, GTA2, bricks, 1000, 20
BC, VAN1, Lego Farm, 150, 35
BC, VAN2, bricks, 750, 10
ON, GTA3, Lego Friends, 400, 40
BC, VAN3, Lego Creator, 200, 110
..

map will combine the input by state and pass it on to reducer. Reducer will find the top selling product in each state and output the result:

AB, EDM2, Lego Creator, 7000
ON, GTA2, bricks, 20000
BC, Lego Creator, 22000


What is solution when reducer runs out of memory? For e.g in word count problem if the file is very large and contains one word very frequently like 'the', the reducer can run out of memory.
Possible solutions:
Increase number or reducers
Increase memory per reducer
Implement a combiner.

PySpark code for above example (WIP):
store = sqlContext.read.parquet("...")
store = store.assign(col6 = lambda x: col4 * x.col5)
store.groupBy(store.col1,store.col2,store.col3).agg({store.col6, max}).show()

Wednesday, October 15, 2014

Shell Script to generat comma seperated output from a file

#!/bin/bash
counter=1
invoice_clause=""
while read line
do
    if [ $count == 1 ]
    then
    #invoice_clause="'$line'"
        echo "if then"
    else
    invoice_clause="$invoice_clause,'$line'"
    fi
    counter=`expr $counter + 1`
done < invoices.num

echo $invoice_clause


Input file :
one
two
three
four
five

Output:
'one','two','three','four','five'

Simple Python script to convert name value to tabular format

#!/usr/bin/env python
import csv

file1reader = csv.reader(open("mapping.csv"), delimiter=",")

header1 = file1reader.next() #header
header2 = file1reader.next() #header

data = dict()
result=[]
row=[]
for Key, Prop, Val, Flag in file1reader:
        if key in data: # found it
        x=data[Key]
        data[Key] = x + "," + Val
        row.append(Val)
    else:
        data[Key] = Val
        result.append(row)
        row=[]
        row.append(Key)
        row.append(Val)
   
with open('myfile.csv', 'wb') as f:
    w = csv.writer(f,dialect='excel')
    w.writerows(result)   

Example mapping.csv file:

Key1, prop1, value1
Key1, prop2, value2
Key1, prop3, value3
Key2, prop1, value4
Key2, prop2, value5
Key2, prop3, value6

Output file:
Key1, value1, value2, value3
Key2, value3, value4, value5

Technical Interview questions

Q: What happens when you enter URL in the browser?
A:
1.    browser checks cache; if requested object is in cache and is fresh, skip to 9
2.    browser asks OS for server's IP address
3.    OS makes a DNS lookup and replies the IP address to the browser
4.    browser opens a TCP connection to server (this step is much more complex with HTTPS)
5.    browser sends the HTTP request through TCP connection
6.    browser receives HTTP response and may close the TCP connection, or reuse it for another request
7.    browser checks if the response is a redirect (3xx result status codes), authorization request (401), error (4xx and 5xx), etc.; these are handled differently from normal responses (2xx)
8.    if cacheable, response is stored in cache
9.    browser decodes response (e.g. if it's gzipped)
10.  browser determines what to do with response (e.g. is it a HTML page, is it an image, is it a sound clip?)
11.  browser renders response, or offers a download dialog for unrecognized types

Q: How to implement queue using stack?
A: hint: To use two stacks

Q: Design a elevator system in a high rise tower
A: Hint: Use elevation and speed parameters to control stopping of elevator at specific floor.

Q: How to implement a HashMap, what are methods to override?
A: Hint : Override equals, hashCode method of object class
Main methods to implement are put and get

Q: What is dependency injection?
A: Is inject dependency through configuration files - constructor injection, setter injection etc used in spring framework

Q: How spring AOP works?
A: Hint: Uses wrapper (proxy) classes. Another way is to modify byte code but not supported in spring.

Q: How do you keep track of multiple requests on backhand modifying same object?
A: through versions, increment record version in db to avoid race conditions/data courrption

Q: Describe popular design patterns
A: Singleton, Factory (With spring-framwork, you get these out of the box)
visitor, facade, builder, strategy


Thursday, March 14, 2013

Jasper Reports tips and tricks

Jasper Reports basic steps:

1. Design report in iReport designer
2. Save the the template  - jrxml file
3. Compile the jrxml to jasper
4. Run from iReport or integrate it with a java program

Load the jasper report from .jasper file

//reportData is an instance of List
(JasperReport)jasperReport = JRLoader.loadObjectFromLocation(jasperReportName);

JasperPrint jasperPrint = JasperFillManager.fillReport(jasperReport,parameters,new JRBeanCollectionDataSource(reportData));


//Export into pdf format
byte[] pdfFile = JasperExportManager.exportReportToPdf(jasperPrint);


//Export into csv format
JRCsvExporter exporter = new JRCsvExporter();
StringBuffer buffer = new StringBuffer();
exporter.setParameter(JRExporterParameter.JASPER_PRINT, jasperPrint);
exporter.setParameter(JRExporterParameter.OUTPUT_STRING_BUFFER,buffer);
exporter.exportReport();

//Export into HTML format
JRHtmlExporter exporter=new JRHtmlExporter();
StringBuffer buffer = new StringBuffer();
  
exporter.setParameter(JRExporterParameter.JASPER_PRINT, jasperPrint);
exporter.setParameter(JRExporterParameter.OUTPUT_STRING_BUFFER, buffer);
exporter.setParameter(JRHtmlExporterParameter.IS_USING_IMAGES_TO_ALIGN, false);
//following parameters are used to avoid the page breaks, empty spaces and display tabular data better in html compare to pdf
exporter.setParameter(JRHtmlExporterParameter.IGNORE_PAGE_MARGINS, true);
exporter.setParameter(JRHtmlExporterParameter.ZOOM_RATIO, 1.5F);
exporter.setParameter(JRHtmlExporterParameter.BETWEEN_PAGES_HTML, "");
exporter.setParameter(JRHtmlExporterParameter.FRAMES_AS_NESTED_TABLES,true);
exporter.setParameter(JRHtmlExporterParameter.IS_REMOVE_EMPTY_SPACE_BETWEEN_ROWS, true);
exporter.exportReport();
   
iReport Designer tips n tricks:


Create paremeters, fields in the iReport designer and make sure the name match the exact names in the javabean if you plan to use java bean as datasource.

List is also very useful if you do not want to use a subreport.
To create a list, drag list component from the pallet and right-click select datasource :
new net.sf.jasperreports.engine.data.JRBeanCollectionDataSource($F{fieldname})
go to the dataset1 and create all necessary fields and drag and drop them to the area.


If you want to display certain area when some conditions are met, select the band properties for that particular group or section and modify "print when expression" to new Boolean($F{fieldname}>0) for e.g.

Wednesday, October 10, 2012

Tuning Weblogic - Case study of performance issue resolution

Case Study of how to analyze a slower performance or frequent out of memory exceptions of a web applications containing ejb deployed on oracle weblogic 

Enable jvm heap dump on app server

For e.g : export JAVA_OPTS=-verbose:gc -XX:+PrintClassHistogram -XX:+PrintGCDetails -Xloggc:/tmp/jvm_loggc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp

Using gcviewer monitor the heap usage and note the trend of increasing memory consumption even after the garbage collection is executed.


Take several snapshots/heap dumps and identify where the bottle neck is using tools like Eclipse MAT or visualvm
the heap dump can be taken manually using following command, can be analyzed using EclipseMAT or visualVM :
jmap -dump:format=b,file=  

Sort by object size in Dominator Tree or in histogram in EclipseMAT. Look for the top class/object and drill down to the classe name that look more familiar. If the identified class belongs to an ejb then there is a possibility that the ejb is retained in the in memory cache of jvm. To confirm this, check the cached beans current count in weblogic console. (refer to the how to monitor ejb cache section below for details)

One possible solution to above problem is described below :

Configure max-beans-in-cache and idle-timeout-seconds appropriate according to your need in weblogic-ejb-jar.xml as below use stateful/stateless as per your ejb:

<weblogic-enterprise-bean>
   <ejb-name>BeanName</ejb-name>
      <stateful-session-descriptor>
 <stateful-session-cache>
   <max-beans-in-cache>200</max-beans-in-cache>
   <idle-timeout-seconds>1200</idle-timeout-seconds>
 </stateful-session-cache>
      </stateful-session-descriptor>
</weblogic-enterprise-bean>

For testing purpose you can set the max-beans to 3-5 and idle-timeout to 60 secs and do some testing and notice the changes in weblogic console, like the ejb passivation count is increasing that means the ejbs are cleared from the cache once they are used.
Another important parameter is cache-type. Refer to the references for detailed explanations.

How to monitor the ejb cache on weblogic console :

1. Go to your deployed application
2. Expand it and go the the ejbs
3. Select the Monitoring tab.
4. Click and navigate through the pages to view monitoring statistics for deployed stateless, stateful, entity, and message-driven EJBs.
5. If you don't see the activation count or passivation count, click on customize table and add these columns.

References

1. http://docs.oracle.com/cd/E13222_01/wls/docs81/perform/EJBTuning.html
2. http://middlewaremagic.com/weblogic/?p=5665

Tuesday, September 25, 2012

Debugging JSP

1. Keep the generated java files

IBM websphere :
change WEB-INF\ibm-web-ext.xml as following :

<jspAttributes xmi:id="JSPAttribute_1" name="keepgenerated" value="true"/> <jspAttributes xmi:id="JSPAttribute_2" name="scratchdir" value="C:\temp\ibm"/>

Weblogic :
Change weblogic.xml as following :
<jsp-descriptor>
    <keepgenerated>true</keepgenerated>
    <verbose>true</verbose>
    <working-dir>c:/temp/bea</working-dir>
</jsp-descriptor>

2. Restart the server in debug or normal mode

3. When an exception happens in a jsp, it will show and line number. Now, you can find the java file and trace the line number and go from there.