Tutorial

Here is a quick tutorial on how to start your first Spark application:

Tutorial on partitions

Here is a quick tutorial on how partitioning affects Spark jobs:

Repartitioning vs Coalesce

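In a nutshell, use "repartition" when the data must be evenly distributed across partitions, and use "coalesce" when the goal is to reduce the number of partitions cheaply. Coalesce avoids a shuffle, so it performs better, but it leaves the data unevenly distributed.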

Comparison of the repartition and coalesce APIs
                     repartition    coalesce
Shuffle              YES            NO
Even distribution    YES            NO
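The shuffle row of the comparison can be checked directly on a small RDD. The following is a minimal sketch, assuming a local Spark session; the names spark, rdd, and the val names are illustrative only. It relies on two documented behaviors of the RDD API: coalesce accepts an optional shuffle flag, and without a shuffle it cannot increase the number of partitions.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimenting (assumption: running outside spark-shell;
// inside spark-shell, `spark` and `sc` already exist).
val spark = SparkSession.builder()
  .master("local[8]")
  .appName("RepartitionVsCoalesce")
  .getOrCreate()
val sc = spark.sparkContext

// Request 8 partitions explicitly instead of relying on the default parallelism.
val rdd = sc.parallelize(1 to 1000, 8)

val viaRepartition = rdd.repartition(4)                // full shuffle, even spread
val viaCoalesce = rdd.coalesce(4)                      // no shuffle, merges existing partitions
val viaCoalesceShuffle = rdd.coalesce(4, shuffle = true) // same effect as repartition(4)

println(viaRepartition.getNumPartitions)
println(viaCoalesce.getNumPartitions)

// Without a shuffle, coalesce cannot grow the partition count: it stays at 8.
val grown = rdd.coalesce(16)
println(grown.getNumPartitions)

// Growing the partition count requires a shuffle.
val grownShuffled = rdd.repartition(16)
println(grownShuffled.getNumPartitions)
```

In practice this means coalesce is the usual choice for shrinking the partition count before writing output, while repartition is needed to rebalance skewed data or to increase parallelism.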

A quick tutorial


    // A map from partition index to the number of elements to sample,
    // i.e. take 44 from partition 0, 68 from partition 1, and so on.
    val sampling = Map(
        0 -> 44, 1 -> 68, 2 -> 87, 3 -> 10,
        4 -> 13, 5 -> 104, 6 -> 77, 7 -> 85
    )
    // sc is the SparkContext provided by spark-shell.
    // Request 8 partitions explicitly so the indices match the sampling map.
    val x0 = sc.parallelize(1 to 1000, 8)
    // Each of the 8 partitions holds 125 elements.
    // Note: the printing happens on the executors, in no guaranteed order.
    x0.foreachPartition(x => println(s"====> ${x.size}"))
    // Run the sampling: keep only the first sampling(idx) elements of each partition
    val x1 = x0.mapPartitionsWithIndex((idx, data) => data.take(sampling(idx)))
    // Print the initial distribution of data
    x1.foreachPartition(x => println(s"====> ${x.size}"))
    // Repartition and print the distribution of data
    x1.repartition(4).foreachPartition(x => println(s"====> ${x.size}"))
    // Coalesce and print the distribution of data
    x1.coalesce(4).foreachPartition(x => println(s"====> ${x.size}"))

Here is the distribution of data after sampling.

Sample data before changing the number of partitions
Partition        0     1     2     3     4     5     6     7
# of elements    44    68    87    10    13    104   77    85
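Because println inside foreachPartition runs on the executors, the printed lines do not arrive in partition order and cannot be matched to partition numbers by position. A minimal sketch of one way around this, assuming a local Spark session (the names spark, sampled, and sizes are illustrative, not part of the tutorial code): emit one (partition index, size) pair per partition and collect the pairs to the driver.

```scala
import org.apache.spark.sql.SparkSession

// Local session (assumption); in spark-shell, `spark` and `sc` already exist.
val spark = SparkSession.builder()
  .master("local[8]")
  .appName("PartitionSizes")
  .getOrCreate()
val sc = spark.sparkContext

// Same sampling map as in the tutorial code.
val sampling = Map(
  0 -> 44, 1 -> 68, 2 -> 87, 3 -> 10,
  4 -> 13, 5 -> 104, 6 -> 77, 7 -> 85
)

val sampled = sc.parallelize(1 to 1000, 8)
  .mapPartitionsWithIndex((idx, data) => data.take(sampling(idx)))

// One (partitionIndex, elementCount) pair per partition, gathered on the driver.
val sizes = sampled
  .mapPartitionsWithIndex((idx, data) => Iterator((idx, data.size)))
  .collect()
  .sortBy(_._1)

sizes.foreach { case (idx, n) => println(s"partition $idx -> $n elements") }
```

The same helper can be applied to the result of repartition(4) or coalesce(4) to label those distributions by partition index as well.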

Here is the distribution of data after repartition. The data is evenly distributed.

The sample data after changing the number of partitions using the repartition API
Partition        0      1      2      3
# of elements    123    120    122    123

Here is the distribution of data after coalesce. The data is not evenly distributed; neighboring partitions are simply merged.

The sample data after changing the number of partitions using the coalesce API
Partition              0          1          2          3
Original partitions    0 and 1    2 and 3    4 and 5    6 and 7
# of elements          112        97         117        162
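The coalesce numbers can be reproduced with plain arithmetic on the per-partition sample sizes from the sampling map. The sketch below uses ordinary Scala collections, not Spark; the skew helper is a hypothetical measure introduced only for this comparison. Pairing neighbors mirrors the result observed here, but Spark's coalesce also takes data locality into account, so neighbor pairing should not be assumed in general.

```scala
// Sample sizes per partition, indexed by partition number (from the sampling map).
val before = Vector(44, 68, 87, 10, 13, 104, 77, 85)

// Merge neighboring partitions two at a time, as in the coalesce(4) table.
val afterCoalesce = before.grouped(2).map(_.sum).toVector
println(afterCoalesce) // Vector(112, 97, 117, 162)

// Sizes reported in the repartition(4) table.
val afterRepartition = Vector(123, 120, 122, 123)

// Hypothetical skew measure: largest partition relative to the mean size.
// A value of 1.0 would mean perfectly even partitions.
def skew(sizes: Seq[Int]): Double =
  sizes.max.toDouble / (sizes.sum.toDouble / sizes.size)

println(f"repartition skew: ${skew(afterRepartition)}%.3f")
println(f"coalesce skew:    ${skew(afterCoalesce)}%.3f")
```

Both results hold the same 488 elements; the difference is only in how evenly they are spread, which is what the shuffle in repartition pays for.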

If you need to quickly brush up on some of the math skills used in this class, I suggest taking a look at the Data Science Math Skills course prepared by Duke University on the Coursera platform.

A tutorial on linear regression using Excel

Linear regression analysis in Excel is a nice step-by-step tutorial to follow. Here is the dataset from the tutorial; we will use it in our Spark MLlib practice as well.

The tutorial code for Spark MLlib can be found on GitHub.
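For orientation, here is a minimal sketch of what a linear regression looks like in Spark MLlib. It is not the tutorial code from GitHub and does not use the tutorial dataset: the toy data (points on the line y = 2x + 1), the column names x and y, and the session setup are all assumptions made for illustration.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

// Local session (assumption); in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("LinearRegressionSketch")
  .getOrCreate()
import spark.implicits._

// Made-up toy data lying exactly on y = 2x + 1 (not the tutorial dataset).
val df = Seq((1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)).toDF("x", "y")

// MLlib estimators expect the predictors packed into a single vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(df)

// Fit an ordinary least-squares model with the default (zero) regularization.
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("y")
  .fit(assembled)

// For this toy data the slope should be close to 2 and the intercept close to 1.
println(s"slope = ${model.coefficients(0)}, intercept = ${model.intercept}")
```

The slope and intercept printed here play the same role as the coefficients reported by Excel's regression output, which makes the two tools easy to compare on the same dataset.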

Here is a notebook on Databricks to practice a few ML pipelines.