alathrop
8/31/2019 - 4:32 PM

Spark ML workflows

# ---
# Databricks training
# ---

# Import the data
bostonDF = (spark.read
  .option("HEADER", True)
  .option("inferSchema", True)
  .csv("/mnt/training/bostonhousing/bostonhousing/bostonhousing.csv")
)

display(bostonDF)

# Split the data into training and test sets
""" Conventions using other machine learning tools often entail creating 4 objects: X_train, y_train, X_test, and y_test where your features X are separate from your label y. Since Spark is distributed, the Spark convention keeps the features and labels together when the split is performed.
"""
trainDF, testDF = bostonDF.randomSplit([0.8, 0.2], seed=42)

print("We have {} training examples and {} test examples.".format(trainDF.count(), testDF.count()))

# Create a baseline model by calculating the average housing value in the training dataset.
from pyspark.sql.functions import avg

trainAvg = trainDF.select(avg("medv")).first()[0]

print("Average home value: {}".format(trainAvg))

# Take the average calculated on the training dataset and append it as the column prediction on the test dataset.
from pyspark.sql.functions import lit

testPredictionDF = testDF.withColumn("prediction", lit(trainAvg))

display(testPredictionDF)

# Define the evaluator with the prediction column, label column, and MSE metric.
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="medv", metricName="mse")

# Evaluate testPredictionDF using the evaluator's .evaluate() method.
testError = evaluator.evaluate(testPredictionDF)

print("Error on the test set for the baseline model: {}".format(testError))
# Error on the test set for the baseline model: 79.36094952409287
"""
This score indicates that the average squared distance between the true home value and the baseline's prediction is about 79. Taking the square root of that number gives the error in the units of the quantity being estimated. Since medv is expressed in thousands of dollars, the square root of 79 corresponds to an average error of about $8,890 (computed below). That's not great, but it's also not too bad for a naive approach.
"""

"""
Question: What does a data scientist's workflow look like?
Answer: Data scientists employ an iterative workflow that includes the following steps:

    Business and Data Understanding: ensures a rigorous understanding of both the business problem and the available data
    Data Preparation: involves cleaning the data so that it can be fed into algorithms and creating new features
    Modeling: entails training many models and many combinations of parameters for a given model (see the sketch after this note)
    Evaluation: compares model performance and chooses the best option
    Deployment: launches a model into production, where it is used to inform business decision-making
    
Question: How do I evaluate the performance of a regression model?
Answer: There are a number of ways to evaluate regression models. The most common is Mean Squared Error (MSE), which calculates the average squared distance between the predicted value and the true value. Because the error is squared, the result is always positive, so this metric does not care whether the prediction is above or below the true value. There are many alternatives, including Root Mean Squared Error (RMSE). RMSE is helpful because, by taking the square root of the MSE, it reports the error in the same units as the dependent variable.
"""