Ridge Regression with MLlib in PySpark
In the PySpark environment I set up earlier, I tried ridge regression as practice with map functions.
The source material is this edX course:
Scalable Machine Learning | edX
The data can be obtained from the MillionSongDataset.
To give away the result up front: even though I ran a grid search that included the default values, the regression run with the default values ended up more accurate, a result so baffling it nearly gave me a nosebleed.
First, the imports.
#library import
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
from test_helper import Test
import numpy as np

#data load
rawData = sc.textFile('millionsong.txt')
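The sc above assumes a SparkContext that already exists, as it does in the pyspark shell or the course notebook. As a minimal sketch (my assumption, not part of the original post), a standalone script would create it explicitly:

#Create the SparkContext explicitly when not running inside the pyspark shell
from pyspark import SparkContext
sc = SparkContext(appName='millionsong-ridge')  # appName is arbitrary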
Ridge Regression (Defaults)
I used LinearRegressionWithSGD.
samplePoints = rawData.take(2)

#Create LabeledPoint
def parsePoint(line):
    tokens = line.split(',')
    label = tokens[0]
    features = tokens[1:]
    return LabeledPoint(label, features)

parsedData = rawData.map(parsePoint)
print parsedData.take(3)

[LabeledPoint(2001.0, [0.884123733793,0.610454259079,0.600498416968,0.474669212493,0.247232680947,0.357306088914,0.344136412234,0.339641227335,0.600858840135,0.425704689024,0.60491501652,0.419193351817]),
 LabeledPoint(2001.0, [0.854411946129,0.604124786151,0.593634078776,0.495885413963,0.266307830936,0.261472105188,0.506387076327,0.464453565511,0.665798573683,0.542968988766,0.58044428577,0.445219373624]),
 LabeledPoint(2001.0, [0.908982970575,0.632063159227,0.557428975183,0.498263761394,0.276396052336,0.312809861625,0.448530069406,0.448674249968,0.649791323916,0.489868662682,0.591908113534,0.4500023818])]
Next, split the data into training, validation, and test sets.
#Divide rawData into training, validation and test sets
weights = [.8, .1, .1]
seed = 50
parsedTrainData, parsedValData, parsedTestData = parsedData.randomSplit(weights, seed)
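As an optional sanity check (not in the original post), the three splits can be cached and counted to confirm the 80/10/10 proportions:

#Optional: cache the splits and check their sizes
parsedTrainData.cache()
parsedValData.cache()
parsedTestData.cache()
print parsedTrainData.count(), parsedValData.count(), parsedTestData.count()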
# Fit the model with default values
fitModel = LinearRegressionWithSGD.train(parsedTrainData)
print fitModel

(weights=[348.391703677,312.158507888,303.514705245,268.768326368,260.265535915,321.082923267,345.636059404,369.96940298,414.587178279,328.611497772,389.972179858,348.42792115], intercept=0.0)
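For reference, to the best of my understanding of the Spark 1.x MLlib API (this spelled-out form is my addition, not part of the original post), the default call above corresponds roughly to:

# Presumed Spark 1.x defaults: 100 iterations, step size 1.0, no regularization, no intercept
fitModel = LinearRegressionWithSGD.train(parsedTrainData,
                                         iterations=100,
                                         step=1.0,
                                         regParam=0.0,
                                         intercept=False)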
Based on this model, let's compare an actual value with its prediction.
The error is about two years.
# Prediction
testPoint = parsedTrainData.take(1)[0]
print testPoint.label

testPrediction = fitModel.predict(testPoint.features)
print testPrediction

2001.0
2003.04838193
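This compares only a single training point; as a small extension (my addition, not in the original post), a handful of validation points can be checked the same way:

#Compare a few validation-set labels with their predictions
for lp in parsedValData.take(5):
    print lp.label, fitModel.predict(lp.features)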
Next, let's evaluate the model with the root mean squared error (RMSE).
We will then use RMSE as the evaluation metric for a grid search over the hyperparameters.
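For reference, the RMSE over n (label, prediction) pairs is

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \]

where y_i is the actual year and \hat{y}_i the predicted year; the functions below implement exactly this.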
# Define functions to calculate RMSE (Root Mean Squared Error)
def squaredError(label, prediction):
    ##Calculates the squared error for a single prediction.
    return (label - prediction)**2

def calcRMSE(labelsAndPreds):
    ##Calculates the root mean squared error for an `RDD` of (label, prediction) tuples.
    return np.sqrt(labelsAndPreds
                   .map(lambda (label, prediction): squaredError(label, prediction))
                   .mean())

#Create new RDD with actual label and predicted label
labelsAndPreds = parsedValData.map(lambda lp: (lp.label, fitModel.predict(lp.features)))

#Calculate RMSE
rmseValLR1 = calcRMSE(labelsAndPreds)
print rmseValLR1

126.788570325
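To judge how good an RMSE of about 126 is, one simple yardstick (my addition, not in the original post) is a baseline that always predicts the mean year of the training set:

#Baseline: always predict the training-set mean year
averageTrainYear = parsedTrainData.map(lambda lp: lp.label).mean()
labelsAndPredsBase = parsedValData.map(lambda lp: (lp.label, averageTrainYear))
print calcRMSE(labelsAndPredsBase)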
Ridge Regression (Grid Search)
The fit with the default values gives an RMSE of around 126.
Next, let's hunt for better hyperparameters with a grid search.
##Grid search
# Values to use when training the linear regression model
miniBatchFrac = 1.0    # miniBatchFraction
regType = 'l2'         # regType
useIntercept = True    # intercept

# Starting value for the minimum RMSE
min_rmseVal = 10**10

#Search for the best hyperparameters
modelRMSEs = []
for grid_alpha in [1e-5, 1.0, 10, 100]:
    for grid_numIters in [50, 100, 500]:
        for grid_reg in [1e-5, 0.0, 1.0]:
            model = LinearRegressionWithSGD.train(parsedTrainData,
                                                  iterations=grid_numIters,
                                                  step=grid_alpha,
                                                  regParam=grid_reg)
            labelsAndPreds = parsedValData.map(lambda lp: (lp.label, model.predict(lp.features)))
            rmseVal = calcRMSE(labelsAndPreds)
            if rmseVal < min_rmseVal:
                min_rmseVal = rmseVal
                best_alpha = grid_alpha
                best_numIter = grid_numIters
                best_reg = grid_reg

print "best_alpha:{},best_numIter:{}, best_reg:{}, best_rmseVal:{}".format(best_alpha, best_numIter, best_reg, min_rmseVal)

best_alpha:1.0,best_numIter:500, best_reg:1e-05, best_rmseVal:117.120806943
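Note that miniBatchFrac, regType, and useIntercept are defined at the top of this block but are never passed to train(), so the loop runs with those settings at their defaults. A minimal sketch of passing them explicitly, assuming the Spark 1.x train() signature (my addition, not the original code):

model = LinearRegressionWithSGD.train(parsedTrainData,
                                      iterations=grid_numIters,
                                      step=grid_alpha,
                                      regParam=grid_reg,
                                      regType=regType,                  # 'l2' -> ridge penalty
                                      intercept=useIntercept,           # fit an intercept term
                                      miniBatchFraction=miniBatchFrac)  # fraction of data per SGD step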
With the grid-searched parameters, the RMSE drops slightly to about 117, so we can hope for somewhat better accuracy.
Let's fit right away and compare actual and predicted values, just as with the first model.
#Fitting with hyperparameters fixed by grid search
Final_model = LinearRegressionWithSGD.train(parsedTrainData,
                                            iterations=best_numIter,
                                            step=best_alpha,
                                            regParam=best_reg)

#Labels comparison between Final_model and actual data.
Final_testPoint = parsedTrainData.take(1)[0]
print Final_testPoint.label

Final_Prediction = Final_model.predict(Final_testPoint.features)
print Final_Prediction

2001.0
1969.7210425
The error is now nearly 30 years, so the accuracy has gotten dramatically worse. Why would that be?
I can't pin down the cause right away, so I'll redo this and update the post.