Tutorial
Here is a quick tutorial on how to start your first Spark application:
Tutorial on partitions
Here is a quick tutorial on how partitioning affects Spark jobs:
Repartitioning vs Coalesce
In a nutshell, use "repartition" when an even distribution of data is desired, and use "coalesce" when better performance is needed: repartition shuffles all of the data to balance it across partitions, while coalesce avoids a full shuffle by merging existing partitions.
 | repartition | coalesce |
---|---|---|
Shuffle | YES | NO |
Even distribution | YES | NO |
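The same trade-off applies at the DataFrame level. Here is a minimal sketch, assuming a SparkSession named spark; the dataset and partition counts are only illustrative:

// Build a small Dataset with the default number of partitions
val df = spark.range(0, 1000)
// Full shuffle: data is redistributed roughly evenly across 4 partitions
val evenly = df.repartition(4)
// No shuffle: existing partitions are merged into 4, so sizes may be uneven
val merged = df.coalesce(4)
println(evenly.rdd.getNumPartitions) // 4
println(merged.rdd.getNumPartitions) // 4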
A quick tutorial
// A map from partition number to the number of elements to sample
// i.e. we're going to take 44 from the first partition, 68 from the second, and so on.
val sampling = Map(
0 -> 44, 1 -> 68, 2 -> 87, 3 -> 10,
4 -> 13, 5 -> 104, 6 -> 77, 7 -> 85
)
val x0 = sc.parallelize(1 to 1000, 8)
// Each of the 8 partitions has 125 elements
x0.foreachPartition(x => println(s"====> ${x.toList.size}"))
// Run the sampling
val x1 = x0.mapPartitionsWithIndex((idx, data) => data.take(sampling(idx)))
// print the initial distribution of data
x1.foreachPartition(x => println(s"====> ${x.toList.size}"))
// repartition and print the distribution of data
x1.repartition(4).foreachPartition(x => println(s"====> ${x.toList.size}"))
// coalesce and print the distribution of data
x1.coalesce(4).foreachPartition(x => println(s"====> ${x.toList.size}"))
Here is the distribution of data after sampling. Note that foreachPartition prints the partitions in whatever order they happen to be processed, so the column labels below reflect the print order rather than the partition indices used in the sampling map.
Partition | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
# of elements | 10 | 44 | 13 | 68 | 77 | 87 | 104 | 85 |
Here is the distribution of data after repartition; the data is now evenly distributed.
Partition | 0 | 1 | 2 | 3 |
---|---|---|---|---|
# of elements | 123 | 120 | 122 | 123 |
Here is the distribution of data after coalesce. The distribution is not even; neighboring partitions are simply merged.
Partition | 0 (0 and 1 of original) | 1 (2 and 3 of original) | 2 (4 and 5 of original) | 3 (6 and 7 of original) |
---|---|---|---|---|
# of elements | 112 | 97 | 117 | 162 |
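To see which original partitions ended up together, we can also print each coalesced partition's index next to its size. This is just a small verification sketch building on the x1 RDD defined above:

// Pair each partition index with its element count and collect to the driver
x1.coalesce(4)
  .mapPartitionsWithIndex((idx, data) => Iterator((idx, data.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx has $n elements") }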
If you find yourself needing to quickly brush up on some of the math skills used in this class, I suggest taking a look at the Data Science Math Skills course prepared by Duke University on the Coursera platform.
A tutorial on linear regression using Excel
A nice step-by-step tutorial to follow on linear regression analysis in Excel. Here is the dataset from the tutorial; we will use it in our Spark MLlib practices as well.
The tutorial code in Spark MLlib can be found on GitHub.
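For orientation, here is a minimal Spark MLlib sketch of fitting a linear regression. The file path and the column names x and y are placeholders for whatever the Excel tutorial's dataset actually uses:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

// Load the dataset (placeholder path; schema inference for illustration)
val data = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("linear_regression_dataset.csv")

// Assemble the assumed predictor column "x" into a feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
val prepared = assembler.transform(data).withColumnRenamed("y", "label")

// Fit an ordinary least squares model and print the learned parameters
val model = new LinearRegression().fit(prepared)
println(s"coefficients: ${model.coefficients}, intercept: ${model.intercept}")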