Installing Apache Spark on Windows – Step-by-step approach

Apache Spark is a general-purpose, large-scale cluster computing engine that claims to be faster than Hadoop MapReduce. More background on Spark is available on the internet.

Here I will focus only on the installation steps for Apache Spark on Windows.

You need JDK 1.6 or later to proceed with the steps below (or in the PDF).

Step 1: Download & untar Spark

Download version 1.0.2 of Spark from the official website.

Untar the downloaded file to any location (say C:\spark-1.0.2).

Step 2: Download the sbt MSI installer (needed for Windows)

Download the sbt MSI installer and run it.

You may need to restart the machine so that the command line recognizes the sbt command.

Step 3: Package Spark using SBT

C:\spark-1.0.2>sbt assembly

Note: This step takes a very long time. Please be patient.

Step 4: Download Scala

Spark 1.0.2 needs Scala 2.10, and this is extremely important to note. The README.md file in the Spark folder lists the correct Scala version for your Spark release.

Download and unzip Scala to any location (say C:\scala-2.10.1).

Set the SCALA_HOME environment variable and add Scala's bin directory to the PATH variable.

Verify the Scala version (and thus the download) by running scala -version.

Step 5: Start the Spark shell

C:\spark-1.0.2\bin>spark-shell

Sample program in Spark

  • Create a data set of the integers 1 to 10000

              scala> val data = 1 to 10000

  • Use the Spark context (sc) to create an RDD [Resilient Distributed Dataset] from that data

              scala> val distData = sc.parallelize(data)

  • Apply a filter to that data and collect the result

             scala> distData.filter(_ < 10).collect()
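To see what the filter step computes without starting Spark, the same logic can be sketched with plain Scala collections. This is only a local analogue, not Spark's API: the names FilterSketch and smallValues are made up for illustration, and a Scala Range stands in for the RDD.

```scala
// Plain-Scala analogue of the spark-shell session above (no Spark required).
// FilterSketch and smallValues are hypothetical names used only for illustration.
object FilterSketch {
  // Keep only the elements below the given limit, as distData.filter(_ < 10) does.
  def smallValues(data: Seq[Int], limit: Int): Seq[Int] =
    data.filter(_ < limit)

  def main(args: Array[String]): Unit = {
    val data = 1 to 10000                 // same range as in the shell session
    val result = smallValues(data, 10)    // local stand-in for filter(...).collect()
    println(result.mkString(", "))        // prints 1, 2, 3, 4, 5, 6, 7, 8, 9
  }
}
```

In the Spark shell, the filter itself is lazy and nothing is computed until collect() brings the matching elements back to the driver; the local version above simply evaluates right away.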

 

 

 

 

Budget 2014 Top 50 words spoken

[Image: Budget 2014 word cloud]

 

Recently, India's 2014 finance budget was presented by the Finance Minister. I took the transcript of the speech from the Indian government's budget website and plotted a bubble chart of the top 50 words spoken (excluding, of course, stop words like I, is, was, them, etc.).

Some observations:

1) Government was the most spoken word. It was spoken 71 times (71x)

2) Tax (70x) & Taxes (18x) [Not surprising]

3) Development – 53x

Infrastructure – 33x

Growth – 31x

Investment – 28x

Banks – 24x

Economy – 23x

Coal – 21x

Agriculture – 20x

Manufacture – 18x
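A count like the one above can be reproduced with a few lines of Scala. The sketch below is not the code used for the chart; WordCountSketch, topWords, the short stop-word set and the sample sentence are all assumptions made for illustration. For the real speech you would pass in the transcript text and ask for the top 50.

```scala
// Word-frequency sketch in plain Scala. WordCountSketch, topWords and the
// tiny stop-word set below are illustrative assumptions, not the original analysis.
object WordCountSketch {
  val stopWords: Set[String] =
    Set("i", "is", "was", "them", "the", "a", "of", "and", "to", "in")

  // Lower-case the text, split on non-letters, drop stop words,
  // count occurrences and return the n most frequent words.
  def topWords(text: String, n: Int): Seq[(String, Int)] =
    text.toLowerCase
      .split("[^a-z]+")
      .filter(w => w.nonEmpty && !stopWords.contains(w))
      .groupBy(identity)
      .map { case (word, hits) => (word, hits.length) }
      .toSeq
      .sortBy { case (word, count) => (-count, word) } // most frequent first, ties alphabetical
      .take(n)

  def main(args: Array[String]): Unit = {
    // Made-up sample sentence, not taken from the budget speech.
    val sample = "The government will cut tax. Tax reform is a government priority for the government."
    topWords(sample, 3).foreach { case (word, count) => println(s"$word ${count}x") }
  }
}
```

Note that this simple version treats "tax" and "taxes" as different words, which matches how they are reported separately in the list above.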

 

Steps to Install Gradle

Introduction

Gradle is a build automation tool that uses a Groovy-based DSL (instead of the traditional XML-based configuration). It builds a Directed Acyclic Graph (DAG) of tasks to determine the order in which they are run.

For a detailed understanding, please read the wiki entry.
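To make the DAG idea concrete, here is a minimal sketch of how an execution order can be derived from task dependencies using a depth-first topological sort. It is not Gradle's implementation; TaskOrderSketch, executionOrder and the task names are hypothetical and chosen only for illustration.

```scala
// Minimal sketch of ordering tasks from a dependency DAG (depth-first topological sort).
// TaskOrderSketch, executionOrder and the task names are hypothetical; this is not Gradle's code.
object TaskOrderSketch {
  // deps maps each task to the tasks it depends on.
  // Returns an order in which every task appears after its dependencies.
  def executionOrder(deps: Map[String, List[String]]): List[String] = {
    var done = List.empty[String] // tasks already scheduled, most recent first

    def visit(task: String, inProgress: Set[String]): Unit = {
      // Meeting a task that is still being visited means the graph has a cycle.
      require(!inProgress.contains(task), s"cycle detected at task '$task'")
      if (!done.contains(task)) {
        deps.getOrElse(task, Nil).foreach(d => visit(d, inProgress + task))
        done = task :: done
      }
    }

    // Visit tasks in alphabetical order so the result is deterministic.
    deps.keys.toList.sorted.foreach(t => visit(t, Set.empty))
    done.reverse
  }

  def main(args: Array[String]): Unit = {
    val deps = Map(
      "compile" -> Nil,
      "test"    -> List("compile"),
      "jar"     -> List("compile"),
      "build"   -> List("test", "jar")
    )
    println(executionOrder(deps).mkString(" -> ")) // prints compile -> test -> jar -> build
  }
}
```

Because the graph must be acyclic, the sketch rejects circular dependencies with an error instead of looping forever.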

Steps to Install

1) Java 1.5 or above should be installed. Please verify this using the java -version command. If Java is installed, you should see the version; otherwise a message saying that java is not a recognized command will appear.

2) Download the latest binaries from the Gradle website.

3) Unzip the contents of gradle-<version>-bin.zip to a folder of your choice, say /User/gradle

4) For Mac, modify the PATH variable in the .profile file to include Gradle.

export PATH=/User/gradle/bin:$PATH

(NOTE: If .profile does not exist, please create one.)

For Windows, modify the PATH variable in the Environment Variables to include Gradle.

5) On the Terminal (or command prompt), type gradle -version

If everything went well, the Gradle version should appear.

 

MongoDB Installation Steps (MongoDB version: 2.4.6)

Recently I installed MongoDB on Mac OS X Lion 10.7.5. Here are the steps I followed.

1) Download the latest production release of MongoDB from its site 

2) Unpack the downloaded file to a location of your choice, say /Users/mongo

3) From the Terminal, go to /Users/mongo/bin

4) Create the following directory structure:

/Users/mongo/bin/data/db

5) Start the mongo server:

./mongod --dbpath data/db

Everything should start properly, and the output should show the server waiting for connections from the mongo shell.

Now open another terminal and enter the following command:

mongo