Use Spark Submit for the first time
July 1, 2018
This article explains how to set-up spark submit to use applications via spark.
This article assumes that you have successfully downloaded and set-up the apache spark. If not, then please refer to http://rjrajivjha.tech/Setup-Spark-Standalone-Master-and-Slave/
Also, before moving ahead, please comment these environment variable from bash_profile, as they might cause error while submitting a standalone spark application.
After commenting these path variables, you can go ahead and try the tutorial.
Write your wordcount.py file
"""Calculates the word count of the given file.
the file can be local or if you setup cluster.
It can be hdfs file path"""
from pyspark import SparkConf, SparkContext
from operator import add
APP_NAME = " HelloWorld of Big Data"
textRDD = sc.textFile(filename)
words = textRDD.flatMap(lambda x: x.split(',')).map(lambda x: (x, 1))
wordcount = words.reduceByKey(add).collect()
for wc in wordcount:
if __name__ == "__main__":
# Configure Spark
conf = SparkConf().setAppName(APP_NAME)
conf = conf.setMaster("local[*]")
sc = SparkContext(conf=conf)
filename = sys.argv
# Execute Main functionality
Submit the Application to spark
Navigate to spark directory where you have decompressed the downloaded tgz file.
Rajivs-Air:spark-2.3.1-bin-hadoop2.7 rjrajivjha$ Rajivs-Air:spark-2.3.1-bin-hadoop2.7 rjrajivjha$ bin/spark-submit --master spark://Rajivs-Air:7077 ~/Desktop/DataScienceChallenges/Sapient/wordcount.py ~/Desktop/DataScienceChallenges/Sapient/household.csv
I am using Spark URL, spark://Rajivs-Air:7077 , which can be found when you run standalone master and slave on spark. Also, the absolute path to my wordcount.py program and absolute path to my targetted csv file.
Hope this helps! Keep tuned for more blogs from my Spark and ML series.
Rajiv Jha :)