[getting started with spark2] #edu #pluralsight #janani
##########################
# course overview
# 2H.16M
# architecture overview
# spark 2 vs spark 1
# dataframes, spark sql
###########################
###plan
ch 0 course overview
ch 1 understand difference between spark 2.x and spark 1.x
ch 2 exploring and analyzing data with dataframes
ch 3 querying data using spark sql
#################
#ch Understand difference between spark 2 and spark 1
#################
## overview
* RDD: resilient distributed dataset
* 2nd gen tungsten engine 10x perf
* unified apis for datasets and dataframes
* higher-level ml apis
* unified batch and stream processing
#prereqs
* begin data exploration and analysis with apache spark (spark v1)
* handle fast data with apache spark sql and streaming (spark 2 w/ scala)
* comfortable with python 3, jupyter notebooks
* basics of distributed computing
## introducing spark
* spark is built on top of hadoop (hdfs, mapreduce, yarn)
* hdfs: distributed fs that stores data across multiple nodes
* mapreduce: a framework to define a data processing task
* yarn: a framework to run the data processing task (yet another resource negotiator)
* spark libs <--> spark core <--> [ hadoop: yarn, hdfs ]
* spark allows real-time as well as batch processing
* repl env
* support py, java, scala, R
## RDDs: basic building block
* introduced in spark 1, still the fundamental building block of spark
* all ops in spark are done on in-memory objects
* an RDD is an in-memory object, basically a collection of java objects
* rdds are special:
* partitioned (split across multiple machines)
* immutable
* resilient (can be reconstructed even if a node crashes)
* two ops on RDDs
* transformation (transforms into another RDD)
* executed only when you request a result (lazy evaluation)
* action (produces a result from rdd)
* eg. get first 10 rows, count, a sum
* rdds can be created in two ways (read a file, transform another rdd)
* all transforms are tracked as a lineage of rdds => an rdd can be reconstructed in case of a crash (see sketch below)
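# sketch: transformations vs actions in pyspark (a minimal example, assuming the pyspark shell where `sc` is the SparkContext; the data is made up)
nums = sc.parallelize(range(1, 101), numSlices=4)   # an RDD split into 4 partitions
squares = nums.map(lambda x: x * x)                 # transformation: lazy, just extends the lineage
evens = squares.filter(lambda x: x % 2 == 0)        # another transformation, nothing has run yet
print(evens.take(10))                               # action: triggers execution, first 10 rows
print(evens.count())                                # action: count
print(evens.sum())                                  # action: a sum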
* RDDs, DataFrames, Datasets
* RDD in spark similar to python collections (abstractions since initial release of spark)
* datasets added in spark 1.6 for typed langs: java, scala. columns are strongly typed
- have compile-time type safety
* a dataframe is a dataset with named columns, behaves like pandas. conceptually the same as a table in an RDBMS (see sketch after this list)
- no type safety at compile time
- available in all languages
* datasets and dataframes have unified APIs in spark 2
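# sketch: a dataframe with named columns (a minimal example, assuming the pyspark shell where `spark` is the SparkSession; the rows are made up)
from pyspark.sql import Row
people = spark.createDataFrame([Row(name="alice", age=34), Row(name="bob", age=29)])
people.printSchema()                      # column names and inferred types
people.select("name").show()              # access columns by name
people.filter(people["age"] > 30).show()  # behaves much like pandas / a table in an RDBMS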
* install spark 2 on local machine
- prereqs. python 3 java >= 1.8
- install spark release v >= 2.3.0 (2.4.4 aug 30 2019)
- download spark-2.4.4-bin-hadoop2.7.tgz
- tar xvf into /opt/tools
add to .bashrc
export SPARK_HOME=/opt/tools/spark-2.4.4-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
>source .bashrc
#test by starting spark interactive shell
>pyspark
>>> sc # context is already initialized
#put in .bashrc to make pyspark start a jupyter notebook
###vars to access spark from jupyter notebooks
export PYSPARK_SUBMIT_ARGS="pyspark-shell"
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
>pyspark #starts a jupyter notebook
* the architecture of spark
- master node (coordination): a jvm process running the driver program
launches tasks, hosts sparkcontext
runs several groups of services
SparkEnv, DAGScheduler, TaskScheduler, SparkUI
driver runs the application as a SparkContext: sc = new SparkContext(sparkConf)
- creates RDD directed acyclic graph (DAG)
- internally creates Stages (physical execution plan)
- each stage is split into operations on RDD partitions called Tasks
in spark 2 sparkContext is wrapped in SparkSession (see sketch after this list)
* Cluster manager (orchestrates jobs execution)
* hadoop yarn
* apache mesos
* spark standalone (used in this course)
* on worker nodes: executors (agents that execute tasks)
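# sketch: the spark 2 entry point in a standalone script (the app name is made up; in the pyspark shell `spark` and `sc` already exist)
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("spark2-getting-started") \
    .master("local[*]") \
    .getOrCreate()
sc = spark.sparkContext            # the SparkContext is still there, wrapped by the session
print(sc.defaultParallelism)
spark.stop()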
* Demos
datasets available at google drive link https://goo.gl/seiSek
#################
#ch Explore, analyze data with dataFrames
#################
* overview
* RDDs used in limited cases (unstructured data)
otherwise use DataFrames
* demos using real-life datasets
* broadcast variables and accumulators
* sparkSession: single entry point, subsumes both hiveContext and sqlContext
* !! demo m2-demo1-LondonCrime.ipynb
grouping, slicing of dataframes (see sketch below)
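# sketch of the grouping/slicing idiom from the demo (file name and column names such as borough, major_category, value, year are assumptions, modelled loosely on the london crime dataset)
df = spark.read.csv("london_crime_by_lsoa.csv", header=True, inferSchema=True)
df.select("borough", "major_category", "value").show(5)     # slicing: pick columns
print(df.filter(df["year"] == 2016).count())                # slicing: pick rows
df.groupBy("borough").sum("value") \
  .orderBy("sum(value)", ascending=False).show(10)          # grouping: totals per borough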
* accumulators and broadcast variables
* spark is written in scala. heavy use of closures
* closure explained: a function that defines a nested function;
the nested function has access to local variables in the outer scope,
and this nested function, which can be returned, is called a closure.
* a closure contains copies of local vars from outer scope !!
* tasks on workers are closures. Every task has a copy of the variables that it works on
* master copies variables along with closures to workers. one copy per task
- broadcast variable: only 1 read-only copy per node (peer-to-peer copying between workers is enabled)
* held in an in-memory cache => cannot be huge datasets
- accumulators: like broadcast variables, but can be modified (see sketch after this list)
* RW vars
* added associatively a+(b+c) = (a+b)+c and commutatively (a+b = b + a)
* native support for Long, Double and Collection accumulators
* can extend by subclassing AccumulatorV2
* main use-case: a global count or a global sum
* workers can only modify the accumulator's state, only Driver can read the accumulator's value
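# sketch: a broadcast variable and an accumulator (assumes the pyspark shell; the lookup table and codes are made up)
lookup = sc.broadcast({"E": "england", "S": "scotland"})   # 1 read-only copy per node
bad_records = sc.accumulator(0)                            # counter only the driver can read
def to_country(code):
    # the task closure reads the broadcast value and updates the accumulator
    if code not in lookup.value:
        bad_records.add(1)
        return None
    return lookup.value[code]
codes = sc.parallelize(["E", "S", "X", "E"])
print(codes.map(to_country).collect())   # the action runs the tasks on the workers
print(bad_records.value)                 # driver reads the accumulated value: 1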
!! demo soccer players m2-demo2-Soccer.ipynb
#################
#ch Query Data using spark sql
#################