Installing Apache SPARK on Windows – Step-by-step approach

Apache Spark is a general-purpose, large-scale cluster computing framework that claims to be faster than Hadoop MapReduce and other frameworks built on HDFS. More background on Spark can be found on the internet.

Here I will focus only on the installation steps for Apache Spark on Windows.

You need JDK 1.6+ installed to proceed with the steps below.
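To confirm that a suitable JDK is available on the PATH, you can check the Java version from the command prompt (the exact version string will vary with your installation):

              C:\>java -version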

Step 1: Download & Untar SPARK

Download version 1.0.2 of Spark from the official website.

Untar the downloaded file to any location (say C:\spark-1.0.2)

Step 2: Download the SBT MSI (needed for Windows)

Download the sbt MSI installer and run it.

You may need to restart the machine so that the command prompt can recognize the sbt command.
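Once the PATH has been picked up, you can verify the installation by asking sbt to report its version details (assuming the MSI added sbt to the PATH):

              C:\>sbt about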

Step 3: Package Spark using SBT

C:\spark-1.0.2>sbt assembly

Note: This step takes an enormous amount of time. Please be patient.

Step 4: Download SCALA

Spark 1.0.2 needs Scala 2.10; this is extremely important to note. You can read the README.md file in the Spark folder to find the correct Scala version required by your Spark release.

Download and unzip Scala to any location (say C:\scala-2.10.1).

Set the SCALA_HOME environment variable and add Scala's bin directory to the PATH variable (see the example below).
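One way to do this is with the setx command from a command prompt; the install path below is an assumption, so adjust it to wherever you unzipped Scala. Note that setx only takes effect in newly opened command prompts, and setx can truncate a very long PATH, in which case the Environment Variables dialog in System Properties is the safer route:

              C:\>setx SCALA_HOME "C:\scala-2.10.1"
              C:\>setx PATH "%PATH%;C:\scala-2.10.1\bin"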

Verify the Scala version (and thus the download) from the command prompt.
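If Scala is installed correctly and on the PATH, the following should print the Scala version (a 2.10.x release in this case):

              C:\>scala -version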

Step 5: Start the spark shell

C:\spark-1.0.2\bin>spark-shell

Sample program in SPARK

  • Create a dataset of the integers 1…10000

              scala> val data = 1 to 10000

  • Use the Spark Context (sc) to create an RDD [Resilient Distributed Dataset] from that data

              scala> val distData = sc.parallelize(data)

  • Perform a filter operation on that data and collect the result

             scala> distData.filter(_ < 10).collect()
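
The filter above should return the array of integers 1 through 9. As a further illustration beyond the original walkthrough, you could also run a simple aggregation on the same RDD; summing the integers 1 to 10000 should return 50005000:

             scala> distData.reduce(_ + _)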