How do I estimate the amount of resources a Spark cluster needs in order to run an MLlib algorithm on a huge 1 TB data set?
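For background on how I have been thinking about this: one common way to reason about sizing is simple arithmetic over partitions rather than total data size, since Spark processes a data set one partition per task. The numbers below (partition size, executor count, cores, memory) are all illustrative assumptions, not values Spark requires:

```python
# Back-of-envelope sizing for a 1 TB input.
# All concrete numbers here are assumptions for illustration.
data_size_gb = 1024          # ~1 TB of input data
partition_size_mb = 128      # a common target partition size (HDFS block size)
num_partitions = data_size_gb * 1024 // partition_size_mb   # 8192 partitions/tasks

num_executors = 32           # assumed cluster size
cores_per_executor = 4       # assumed executor sizing
parallel_tasks = num_executors * cores_per_executor          # 128 concurrent tasks

# Number of "waves" of tasks needed to stream through all partitions;
# executors only hold a handful of partitions at a time, so total RAM
# does not have to match total data size for a simple scan.
waves = -(-num_partitions // parallel_tasks)  # ceiling division

print(num_partitions, parallel_tasks, waves)
```

This only covers streaming-style passes over the data; stages that shuffle or cache (as iterative ML algorithms often do) put more pressure on memory, which is the part I am unsure about.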
Imagine that I need to run a classification algorithm on a huge data set, for example the decision tree classifier from the docs: https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#decision-tree-classifier
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer

data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])
model = pipeline.fit(trainingData)
predictions = model.transform(testData)
Many of the PySpark functions in this sample take the original data set or the training/test data sets as input arguments, and these data sets might be huge (TB scale). Would any of these functions require the whole data set to be loaded in memory while it is processed, forcing me to provision an amount of RAM that matches the data set size?
Is there a guarantee that these functions can make do with whatever amount of memory is available, even when they need to process TB-size data frames much larger than memory? Running slower due to a lack of resources is fine, as long as the job doesn't fail immediately because it has run out of resources.
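For reference, these are the knobs I assume would control that behavior when submitting the job. The sizes below are placeholders, not a recommendation; `spark.memory.fraction` and `spark.memory.storageFraction` govern how the unified memory region is split between execution and cached/storage data, which I understand can spill to disk rather than fail outright:

```shell
# Illustrative submit command; all sizes are hypothetical.
spark-submit \
  --num-executors 32 \
  --executor-cores 4 \
  --executor-memory 16g \
  --conf spark.memory.fraction=0.6 \
  --conf spark.memory.storageFraction=0.5 \
  --conf spark.sql.shuffle.partitions=8192 \
  my_decision_tree_job.py
```

My question is whether configurations like this merely tune performance, or whether there are MLlib operations that will still throw an out-of-memory error regardless of how the spill-to-disk settings are tuned.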