Friday, September 23, 2016

Hacking the H2O R API

H2O comes with a comprehensive R API, but sometimes you want to do something that it does not (yet) support. This article will show how to add a couple of functions for fetching and saving models. Beyond giving you these functions, I want to show how to approach hacking on the API, including using internals. (Code in this article has been tested on the 3.8.2.x, 3.8.3.x and 3.10.0.x releases.)
If you want to learn more about H2O, and machine learning, may I recommend my book: Practical Machine Learning with H2O, published by O’Reilly? (It is “coming really soon” as I write this!) And, my company, QQ Trend, are available for helping you with all your machine learning needs, everything from a few hours of H2O-related consulting to helping you build massive models to solve the mysteries of life. (Contact me at dc at qqtrend.com )

Saving it all for another day

Say you have 30 models stored on H2O, and you want to save them all. The scenario might be that you want to stop the cluster overnight, but want to use your current set of models as the starting point for better models tomorrow. Or in an ensemble. Or something. At the time of writing H2O does not offer this functionality, in any of its various APIs and front-ends. So I want to write an h2o.saveAllModels() function.
Breaking that down a bit, I’m going to need these two functions:
  • get a list of all models
  • save models, given a list of models or model IDs.
Let’s start with the “or model IDs” requirement. H2O’s API offers h2o.saveModel(), but that only takes a model object, so how can we use it when all we have is an ID?

Exposing The Guts…

I am a huge fan of open source. H2O is open source. R is open source. But there is open, and then there is open, and one of the things I like about R is if you want to see how something was implemented, you just type the function name.
Type h2o.saveModel (without parentheses) in an R session (where you’ve already done library(h2o), of course) and you will see the source code. Here it is; notice how the only part of object that it uses is the model id - that is a stroke of luck, because it means that (under the covers) the API works just the way we needed it to!
function (object, path = "", force = FALSE) 
{
    #... Error-checking elided ...
    path <- file.path(path, object@model_id)
    res <- .h2o.__remoteSend(
      paste0("Models.bin/", object@model_id),
      dir = path, force = force, h2oRestApiVersion = 99)
    res$dir
}
If you are new to H2O, you need to understand that all the hard work is done in a Java application (which can equally well be running on your machine or on a cluster the other side of the world), and the clients (whether R, Python or Flow’s CoffeeScript) are all using the same REST API to send commands to it. So it should be no surprise to see .h2o.__remoteSend there; it is making a call to the “Models.bin” REST endpoint.
.h2o.__remoteSend is a private function in the R API: the package does not export it, so you cannot call it by its bare name. Luckily, R doesn’t get in our way like Java or C++ would. We can use the package name followed by the triple colon operator to run it from normal user code: h2o:::.h2o.__remoteSend(...)
WARNING: Remember that hacking with the internals of an API is not future-proof. An upgrade might break everything you’ve written. …ooh, look at us, adrenaline flowing, living the life of danger. (Note to self: need to get out more.)

Let’s Write Some Code!

We now have enough to make the saveModels() function:
h2o.saveModels <- function(models, path, force = FALSE){
  sapply(models, function(id){
    # Accept either a model object or a model id string
    if(is.object(id)) id <- id@model_id
    res <- h2o:::.h2o.__remoteSend(
      paste0("Models.bin/", id),
      dir = file.path(path, id),
      force = force, h2oRestApiVersion = 99)
    res$dir
  }, USE.NAMES = FALSE)
}
The if(is.object(id)) id <- id@model_id line is what allows it to work with a mix of model id strings and model objects. The use of sapply(..., USE.NAMES = FALSE) means it returns a plain character vector containing the full path of each model file that was saved. Use it as follows:
h2o.saveModels(
  c("DL:defaults", "DL:200x200-500", "RF:100-40"),
  "/path/to/h2o_models/todays_hard_work",
  force = TRUE
  )
and it will output:
[1] "/path/to/h2o_models/todays_hard_work/DL:defaults"
[2] "/path/to/h2o_models/todays_hard_work/DL:200x200-500"
[3] "/path/to/h2o_models/todays_hard_work/RF:100-40"
(By the way, there is one irritating problem with this function: if any failure occurs, such as a file already existing, or a model ID not found, it stops with a long error message, and doesn’t attempt to save the other models. I’ll leave improving that to you. Hint: consider wrapping the h2o:::.h2o.__remoteSend() call with ?tryCatch.)
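The tryCatch hint above could be followed along these lines. This is only a sketch under assumptions: h2o.saveModelsSafely and its send argument are made-up names (send exists purely so the function can be tried without a running cluster), and returning NA for a model that failed is one design choice among several.

```r
# Sketch: like h2o.saveModels(), but a failure on one model does not stop
# the others. `send` is a hypothetical extra argument; by default it is the
# same internal function used above.
h2o.saveModelsSafely <- function(models, path, force = FALSE,
                                 send = h2o:::.h2o.__remoteSend){
  vapply(models, function(id){
    if(is.object(id)) id <- id@model_id
    tryCatch({
      res <- send(
        paste0("Models.bin/", id),
        dir = file.path(path, id),
        force = force, h2oRestApiVersion = 99)
      # Coerce defensively so an unexpected reply becomes NA, not an error
      as.character(res$dir)[1]
    }, error = function(e){
      # Report the problem, then carry on with the next model
      warning("Could not save ", id, ": ", conditionMessage(e), call. = FALSE)
      NA_character_
    })
  }, character(1), USE.NAMES = FALSE)
}
```

The result has one entry per model, in the same order as the input, so something like models[is.na(result)] would show which ones still need attention.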

What Models Have I Made?

Next, how to get a list of all models? The Flow interface has getModels, and the REST API has GET /3/Models, but the R and Python APIs do not; the closest they have is h2o.ls(), which returns the names of all data frames, models, prediction results, etc. with no (reliable) way to tell them apart. But GET /3/Models is not ideal either, because it returns everything about each model, whereas all we want is the model id. Having trawled through the H2O source, it appears we are stuck with this. (BTW, if you fancy submitting a patch to add a GET /3/ModelIds/ command, this looks like a good starting point, i.e. what we want is just the first half of that function.) It is unlikely to matter unless you have 1000s of models, or a slow connection to your remote H2O cluster.
Start the same way as before by typing h2o.getModel (no parentheses) into R. Ooh! That function is long, and it is doing an awful lot. If you want a generally useful h2o.getModels() function I leave that as another of those exercises for the reader. Instead I’m going to call my function h2o.getAllModelIds(), and limit the scope to just that, which makes the code much simpler. (Did you notice the pro tip there: just by calling my function “getAllModelIds” instead of “getModels” I saved myself hours of work. You see kids, naming really does matter.)
Here it is:
h2o.getAllModelIds <- function(){
d <- h2o:::.h2o.__remoteSend(method = "GET", "Models")
sapply(d[[3]], function(x) x$model_id$name)
}
Line 1 says get all the models. Line 2 says filter just the model id out, and throw the rest of it away. (Yeah, that d[[3]] bit is particularly fragile.)
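One way to make the d[[3]] step a little less fragile is to pick the field out by name rather than by position. The sketch below assumes the parsed response calls its list of models "models" (worth confirming with names(d) on your own H2O version); h2o.getAllModelIdsByName and its send argument are made-up names, with send there only so the function can be tried without a cluster.

```r
# Sketch: same idea as h2o.getAllModelIds(), but selecting by field name.
# Assumption: the reply from GET /3/Models has an element named "models".
h2o.getAllModelIdsByName <- function(send = h2o:::.h2o.__remoteSend){
  d <- send(method = "GET", "Models")
  # vapply() returns character(0) when there are no models at all
  vapply(d[["models"]], function(x) x$model_id$name, character(1))
}
```

It is still relying on internals, so the earlier warning about upgrades applies just as much here.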
Anyway, the final step is to simply put our two new functions together:
h2o.saveAllModels <- function(path){
h2o.saveModels(h2o.getAllModelIds(), path)
}
Use it, as shown here:
fnames <- h2o.saveAllModels("/path/to/todays_hard_work")
On one test, length(fnames) returned 154 (I’d been busy), and those 154 models totalled 150MB. However some models (e.g. random forest) are bigger than others, so make sure you have plenty of disk space to hand, just in case. Speaking of which, h2o.saveAllModels() should work equally well with S3 or HDFS destinations.

The day after the night before…

I could’ve done a dput(fnames) after running h2o.saveAllModels(), and saved the output somewhere. But as I’m not putting anything else in that particular directory, I can get the list again with Sys.glob(). So, I might start my next day’s session as follows.
 library(h2o)
 h2o.init(nthreads = -1)
 fnames <- Sys.glob("/path/to/todays_hard_work/*")
 models <- lapply(fnames, h2o.loadModel)
Voila! models will be an R list of the H2O model objects.

Clusters

If you are working on a remote cluster, with more than one node, there is a little twist to be aware of. h2o.saveModel() (and therefore our h2o.saveModels() extension) will create files on whichever node of the cluster your client is connected to. (At least, as of 3.10.0.7; I suspect this behaviour might change in future.)
But h2o.loadModel() will look for it on the file system of node 1 of the cluster. And node 1 is not (necessarily) the first node you listed in your flatfile. Instead it is the one listed first in h2o.clusterStatus().
This won’t concern you if you saved to HDFS or S3.

Bonus

What I actually use to load model files is shown below. It will get all the files from sub-directories. (Look at the list.files() documentation for how you can use pattern to choose just some files to be loaded in.)
h2o.loadModelsDirectory <- function(path, pattern=NULL, recursive=T, verbose=F){
fnames <- list.files(path, pattern = pattern, recursive = recursive, full.names = T, include.dirs = F)
h2o.loadModels(fnames, verbose)
}
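The function above calls h2o.loadModels(), which is not defined anywhere in this article. A minimal sketch of what it might look like follows; the body is a guess based on how it is called, not the original code, and the loader argument is a made-up addition so it can be tried without a cluster.

```r
# Sketch of the missing helper: load each file and return a list of models.
# Assumption: `verbose` simply controls whether progress is reported.
h2o.loadModels <- function(fnames, verbose = FALSE,
                           loader = h2o::h2o.loadModel){
  lapply(fnames, function(fname){
    if(verbose) message("Loading ", fname)
    loader(fname)
  })
}
```

As with the lapply(fnames, h2o.loadModel) line earlier, the result is an R list of H2O model objects, in the same order as fnames.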
(This code still has one problem: if I’m loading in work from both yesterday and two days ago, and I had made a model called “DF:default” on both days, I lose one of them. Sorting that out is my final exercise for the reader - please post your answer, or a link to your answer, in the comments!)