[getting started with spark2] #edu #pluralsight #janani
##########################
# course overview
# 2H.16M
# architecture overview
# spark 2 vs spark 1
# dataframes, spark sql
###########################
###plan
ch 0 course overview
ch 1 understand difference between spark 2.x and spark 1.x
ch 2 exploring and analyzing data with dataframes
ch 3 querying data using spark sql
#################
#ch Understand difference between spark 2 and spark 1
#################
## overview
* RDD: resilient distributed dataset
* 2nd gen tungsten engine 10x perf
* unified apis for datasets and dataframes
* higher-level ml apis
* unified batch and stream processing
#prereqs
* begin data exploration and analysis with apache spark (spark v1)
* handle fast data with apache spark sql and streaming (spark 2 w/ scala)
* comfortable with python 3, jupyter notebooks
* basics of distributed computing
## introducing spark
* spark is built on top of hadoop (hdfs, mapreduce, yarn)
* hdfs: distributed fs that stores data across multiple nodes
* mapreduce: a framework to define a data processing task
* yarn: a framework to run the data processing task (yet another resource negotiator)
* spark libs <--> spark core <--> [ hadoop: yarn, hdfs ]
* spark allows real-time as well as batch processing
* repl env
* support py, java, scala, R
## RDDs: basic building block
* introduced in spark 1, still the fundamental building block of spark
* all ops in spark are done on in-memory objects
* an RDD is an in-memory object, basically a collection of java objects
* rdds are special:
* partitioned (split across multiple machines)
* immutable
* resilient (can be reconstructed even if a node crashes)
* two ops on RDDs
* transformation (transforms into another RDD)
* executed only when you request a result (lazy evaluation)
* action (produces a result from rdd)
* eg. get first 10 rows, count, a sum
* rdds can be created in two ways (read a file, transform another rdd)
* all transforms are tracked as a lineage of rdds => an rdd can be reconstructed in case of a crash (see sketch below)
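# sketch: transformations vs actions in pyspark (a minimal example, assuming the pyspark shell where `sc` is the SparkContext; the data is made up)
nums = sc.parallelize(range(1, 101), numSlices=4)   # an RDD split into 4 partitions
squares = nums.map(lambda x: x * x)                 # transformation: lazy, just extends the lineage
evens = squares.filter(lambda x: x % 2 == 0)        # another transformation, nothing has run yet
print(evens.take(10))                               # action: triggers execution, first 10 rows
print(evens.count())                                # action: count
print(evens.sum())                                  # action: a sum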
* RDDs, DataFrames, Datasets
* RDD in spark similar to python collections (abstractions since initial release of spark)
* datasets added in spark 1.6 for typed langs: java, scala. columns are strongly typed
- have compile-time type safety
* a dataframe is a dataset with named columns, behaves like pandas. conceptually the same as a table in an RDBMS (see sketch after this list)
- no type safety at compile time
- available in all languages
* datasets and dataframes have unified APIs in spark 2
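# sketch: a dataframe with named columns (a minimal example, assuming the pyspark shell where `spark` is the SparkSession; the rows are made up)
from pyspark.sql import Row
people = spark.createDataFrame([Row(name="alice", age=34), Row(name="bob", age=29)])
people.printSchema()                      # column names and inferred types
people.select("name").show()              # access columns by name
people.filter(people["age"] > 30).show()  # behaves much like pandas / a table in an RDBMS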
* install spark 2 on local machine
- prereqs. python 3 java >= 1.8
- install spark release v >= 2.3.0 (2.4.4 aug 30 2019)
- download spark-2.4.4-bin-hadoop2.7.tgz
- tar xvf into /opt/tools
add to .bashrc
export SPARK_HOME=/opt/tools/spark-2.4.4-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
>source .bashrc
#test by starting spark interactive shell
>pyspark
>>> sc # context is already initialized
#put in .bashrc to make pyspark start a jupyter notebook
###vars to access spark from jupyter notebooks
export PYSPARK_SUBMIT_ARGS="pyspark-shell"
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
>pyspark #starts a jupyter notebook
* the architecture of spark
- master node (coordination): a jvm process running the driver program
launches tasks, hosts sparkcontext
runs several groups of services
SparkEnv, DAGScheduler, TaskScheduler, SparkUI
driver runs the application as a SparkContext: sc = new SparkContext(sparkConf)
- creates RDD directed acyclic graph (DAG)
- internally creates Stages (physical execution plan)
- each stage is split into operations on RDD partitions called Tasks
in spark 2 sparkContext is wrapped in SparkSession (see sketch after this list)
* Cluster manager (orchestrates jobs execution)
* hadoop yarn
* apache mesos
* spark standalone (used in this course)
* on worker nodes: executors (agents that execute tasks)
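# sketch: the spark 2 entry point in a standalone script (the app name is made up; in the pyspark shell `spark` and `sc` already exist)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("spark2-getting-started") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext            # the SparkContext is still there, wrapped by the session
print(sc.defaultParallelism)
spark.stop()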
* Demos
datasets available at google drive link https://goo.gl/seiSek
#################
#ch Explore, analyze data with dataFrames
#################
* overview
* RDDs used in limited cases (unstructured data)
otherwise use DataFrames
* demos using real-life datasets
* broadcast variables and accumulators
* sparkSession: single entry point, subsumes both hiveContext and sqlContext
* !! demo m2-demo1-LondonCrime.ipynb
grouping, slicing of dataframes (see sketch below)
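# sketch of the grouping/slicing idiom from the demo (file name and column names such as borough, major_category, value, year are assumptions, modelled loosely on the london crime dataset)
df = spark.read.csv("london_crime_by_lsoa.csv", header=True, inferSchema=True)
df.select("borough", "major_category", "value").show(5)     # slicing: pick columns
print(df.filter(df["year"] == 2016).count())                # slicing: pick rows
df.groupBy("borough").sum("value") \
  .orderBy("sum(value)", ascending=False).show(10)          # grouping: totals per borough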
* accumulators and broadcast variables
* spark is written in scala. heavy use of closures
* closure explained: a function that defines a nested function;
the nested function has access to local variables in the outer scope,
and this nested function, which can be returned, is called a closure.
* a closure contains copies of local vars from outer scope !!
* tasks on workers are closures. Every task has a copy of the variables that it works on
* master copies variables along with closures to workers. one copy per task
- broadcast variable: only 1 read-only copy per node (peer-to-peer copying between workers is enabled)
* held in an in-memory cache => cannot be huge datasets
- accumulators: like broadcast variables, but can be modified (see sketch after this list)
* RW vars
* added associatively a+(b+c) = (a+b)+c and commutatively (a+b = b + a)
* native support for Long, Double and Collection accumulators
* can extend by subclassing AccumulatorV2
* main use-case: a global count or a global sum
* workers can only modify the accumulator's state, only Driver can read the accumulator's value
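# sketch: a broadcast variable and an accumulator (assumes the pyspark shell; the lookup table and codes are made up)
lookup = sc.broadcast({"E": "england", "S": "scotland"})   # 1 read-only copy per node
bad_records = sc.accumulator(0)                            # counter only the driver can read
def to_country(code):
    # the task closure reads the broadcast value and updates the accumulator
    if code not in lookup.value:
        bad_records.add(1)
        return None
    return lookup.value[code]
codes = sc.parallelize(["E", "S", "X", "E"])
print(codes.map(to_country).collect())   # the action runs the tasks on the workers
print(bad_records.value)                 # driver reads the accumulated value: 1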
!! demo soccer players m2-demo2-Soccer.ipynb
#################
#ch Query Data using spark sql
#################