Friday, February 24, 2017

NorDevCon 2017: code samples

This is the sample code, in Python and R, for the talk I gave yesterday as part of the NorDevCon 2017 pre-meeting talks.

To install h2o for Python, from the command line do:

  pip install h2o

To install it in R, from inside an R session do:

install.packages("h2o")

Either way, that should pull in all the dependencies you need.

The data was the “train.csv.zip” file found on Kaggle (you need to sign up to Kaggle to be allowed to download it). The following scripts assume you have unzipped it and put train.csv in the same directory as the scripts.

That Kaggle page is also where you will find the description of each field.
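
If you want to script the unzip step too, a minimal sketch in Python (assuming train.csv.zip has been downloaded to the same directory) would be:

import zipfile

# Unpack the archive (which should contain train.csv) into the current directory
with zipfile.ZipFile("train.csv.zip") as z:
    z.extractall()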

Here is how to prepare H2O, and the data, in Python:

import h2o

h2o.init()

data = h2o.import_file("train.csv")

data["Ht"].cor(data["Wt"])

factorsList = ['Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41']

data[factorsList] = data[factorsList].asfactor()

# Split off a random 10% to use to evaluate
# the models we build.
train, test = data.split_frame([0.9], seed=123)

# Sanity check
train.ncol
test.ncol
train.nrow
test.nrow

# What the data looks like:
train.head(rows=1)
test.head(rows=1)
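
The train() calls below need x, the predictor columns, and y, the response column, which the listing above does not define. Here is a minimal way to set them, matching the column choices in the R code further down (columns 2:127 as predictors, column 128, Response, as the target):

y = "Response"
x = train.names[1:-1]  # every column except the first (the Id) and the last (the response)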

Here is the very quick deep learning model:

m_DL = h2o.estimators.H2ODeepLearningEstimator(epochs=1)
m_DL.train(x, y, train)

(I made the powerful one-liner claim in the talk but, as you can see, in Python they are two-liners: one line to configure the model, and one to train it.)

Then to evaluate that model:

m_DL

m_DL.predict( test[0, x] )  # Ask for a prediction for the first test record (rows are 0-based in Python)

m_DL.predict( test[0:6, x] ).cbind( test[0:6, y] )  # Compare predictions with actuals for the first 6 records

m_DL.model_performance(test) #Average performance on all 6060 test records

m_DL.model_performance(train)  #For comparison: the performance on the data it was trained on

Here is the default GBM model:

m_GBM = h2o.estimators.H2OGradientBoostingEstimator()
m_GBM.train(x, y, train)

m_GBM.model_performance(test)

Then here is the tuned GBM model. Basically, the tuning is all about giving it more trees to play with, with early stopping so it knows when to stop adding them:

m_GBM_best = h2o.estimators.H2OGradientBoostingEstimator(
    sample_rate=0.95,
    ntrees=200,
    stopping_tolerance=0,
    stopping_rounds=4,
    stopping_metric="MSE"
    )
m_GBM_best.train(x, y, train, validation_frame=test)

m_GBM_best.model_performance(test)

And here is the tuned deep learning model:

m_DL_best = h2o.estimators.H2ODeepLearningEstimator(
    activation="RectifierWithDropout",
    hidden=[300,300,300],
    l1=1e-5,
    l2=0,
    input_dropout_ratio=0.2,
    hidden_dropout_ratios=[0.4, 0.4, 0.4],
    epochs=1000,
    stopping_tolerance=0,
    stopping_rounds=4,
    stopping_metric="MSE"
    )
m_DL_best.train(x, y, train, validation_frame=test)

m_DL_best.model_performance(test)

And here is the R code, which does the same as the above:

library(h2o)

h2o.init(nthreads=-1)


data = h2o.importFile("train.csv")
# View it on Flow (H2O's web UI, at http://localhost:54321 by default)

h2o.cor(data$Wt, data$BMI)


factorsList = c('Product_Info_1', 'Product_Info_2', 'Product_Info_3', 'Product_Info_5', 'Product_Info_6', 'Product_Info_7', 'Employment_Info_2', 'Employment_Info_3', 'Employment_Info_5', 'InsuredInfo_1', 'InsuredInfo_2', 'InsuredInfo_3', 'InsuredInfo_4', 'InsuredInfo_5', 'InsuredInfo_6', 'InsuredInfo_7', 'Insurance_History_1', 'Insurance_History_2', 'Insurance_History_3', 'Insurance_History_4', 'Insurance_History_7', 'Insurance_History_8', 'Insurance_History_9', 'Family_Hist_1', 'Medical_History_2', 'Medical_History_3', 'Medical_History_4', 'Medical_History_5', 'Medical_History_6', 'Medical_History_7', 'Medical_History_8', 'Medical_History_9', 'Medical_History_11', 'Medical_History_12', 'Medical_History_13', 'Medical_History_14', 'Medical_History_16', 'Medical_History_17', 'Medical_History_18', 'Medical_History_19', 'Medical_History_20', 'Medical_History_21', 'Medical_History_22', 'Medical_History_23', 'Medical_History_25', 'Medical_History_26', 'Medical_History_27', 'Medical_History_28', 'Medical_History_29', 'Medical_History_30', 'Medical_History_31', 'Medical_History_33', 'Medical_History_34', 'Medical_History_35', 'Medical_History_36', 'Medical_History_37', 'Medical_History_38', 'Medical_History_39', 'Medical_History_40', 'Medical_History_41')
data[,factorsList] <- as.factor(data[,factorsList])


splits <- h2o.splitFrame(data, 0.9, seed=123)
train <- h2o.assign(splits[[1]], "train")  #90% for training
test <- h2o.assign(splits[[2]], "test")  #10% to evaluate with

ncol(train)   #128
ncol(test)    #128

nrow(train)  #53321
nrow(test)  #6060

t(head(train, 1))
t( as.matrix(test[1,1:127]) )

# The first call uses all the defaults (about 42 secs - see the timing below);
# epochs = 1 is the very quick version used in the talk.
m_DL <- h2o.deeplearning(2:127, 128, train)
m_DL <- h2o.deeplearning(2:127, 128, train, epochs = 1)  #7 to 9 secs
#system.time( m_DL <- h2o.deeplearning(2:127, 128, train) )  #42.5 secs

h2o.predict(m_DL, test[1,2:127])

h2o.cbind(
  h2o.predict(m_DL, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 7.402184        8
# 2 5.414277        1
# 3 6.946732        8
# 4 6.542647        1
# 5 2.596471        6
# 6 6.224758        5


h2o.performance(m_DL, test)
# H2ORegressionMetrics: deeplearning
# 
# MSE:  3.770782
# RMSE:  1.94185
# MAE:  1.444321
# RMSLE:  0.4248774
# Mean Residual Deviance :  3.770782




######
m_GBM <- h2o.gbm(2:127, 128, train)  #7.3s

h2o.predict(m_GBM, test[1, 2:127])

h2o.cbind(
  h2o.predict(m_GBM, test[1:6, 2:127]),
  test[1:6, 128]
  )

#    predict Response
# 1 6.934054        8
# 2 5.231893        1
# 3 7.135411        8
# 4 5.906502        1
# 5 3.056508        6
# 6 5.049540        5

h2o.performance(m_GBM, test)

# MSE:  3.599897
# RMSE:  1.89734
# MAE:  1.433456
# RMSLE:  0.4225507
# Mean Residual Deviance :  3.599897


##########

#Takes 20-30secs
m_GBM_best <- h2o.gbm(
  2:127, 128, train,

  sample_rate = 0.95,
  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",
  ntrees = 200

  )

h2o.performance(m_GBM_best, test)
# That gave an MSE of 3.473637856428858 when I ran it.

plot(m_GBM_best)

h2o.scoreHistory(m_GBM_best)


####################

# 3-4 minutes  (204secs)
m_DL_best <- h2o.deeplearning(
  2:127, 128, train,
  epochs = 1000,

  validation_frame = test,
  stopping_tolerance = 0,
  stopping_rounds = 4,
  stopping_metric = "MSE",

  activation = "RectifierWithDropout",
  hidden = c(300, 300, 300),
  l1 = 1e-5,
  l2 = 0,
  input_dropout_ratio = 0.2,
  hidden_dropout_ratios = c(0.4, 0.4, 0.4)
  )  


h2o.performance(m_DL_best, test)

# MSE:  3.609624
# RMSE:  1.899901
# MAE:  1.444417
# RMSLE:  0.4164153
# Mean Residual Deviance :  3.609624

Finally, and not surprisingly, I can highly recommend my own book if you would like to learn more about how to use H2O. The examples in the book are built on three different data sets, go into more depth on the different machine learning algorithms that H2O offers, and give some ideas about how to tune them:

From O’Reilly here:
http://shop.oreilly.com/product/0636920053170.do

From Amazon UK:
https://www.amazon.co.uk/Practical-Machine-Learning-Darren-Cook/dp/149196460X

(And other good bookshops, of course!)

Thanks!