Simple hyperparameter and architecture search in tensorflow with ray tune
In a previous blog post I showed a clean way to implement efficient hyperparameter search in tensorflow from scratch. I presented population-based training, an evolutionary method that enables cheap and adaptive hyperparameter search by changing hyperparameters during training, instead of having to train to convergence before the resulting performance statistics can inform the choice of hyperparameters. That said, the vanilla version I presented had two major downsides: the training was not distributed out of the box, and it required the graph to be constructed once at the beginning and reused across different parameter settings.
In this blog post we look at the distributed computation framework ray and its little brother ray tune, which make hyperparameter search distributed and easy to implement. Ray tune not only supports population-based training but also other hyperparameter search algorithms. Ray and ray tune work with any autograd package, including tensorflow and PyTorch.
Setting up ray
Install ray on all your machines
pip install ray
If you only have a single machine at your disposal, initialize ray as follows (it will make use of all local CPU and GPU resources)
import ray
ray.init()
Otherwise, pick a machine to be the head node
ray start --head --redis-port=6379
and connect your other instances
ray start --redis-address=HEAD_HOSTNAME:6379
Finally, run your code on any node
import ray
ray.init(redis_address='HEAD_HOSTNAME:6379')
Having done that, we are ready to distribute tasks such as hyperparameter tuning. More information on how to set up a cluster can be found here.
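Before moving on, you can sanity-check the setup by dispatching a trivial remote task across the cluster. This sketch only uses ray's core API; the function square is just a made-up example:
import ray

@ray.remote
def square(x):
    return x * x

# Each call to .remote() is scheduled on some node in the cluster;
# ray.get blocks until all results are available
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]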
Implementing your model
We will need to implement a model using the following skeleton. You should be able to plug in your existing tensorflow code easily:
import ray.tune as tune
import tensorflow as tf

class Model:
    # TODO implement
    pass

class MyTrainable(tune.Trainable):
    def _setup(self):
        # Load your data
        self.data = ...
        # Set up your tensorflow model
        # Hyperparameters for this trial can be accessed
        # in the dictionary self.config
        self.model = Model(self.data, hyperparameters=self.config)
        # To save and restore your model
        self.saver = tf.train.Saver()
        # Start a tensorflow session and initialize the variables
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())

    def _train(self):
        # Run your training op for n iterations
        n = 100  # number of iterations per reporting step
        for _ in range(n):
            self.sess.run(self.model.training_op)
        # Report a performance metric to be used in your hyperparameter search
        validation_loss = self.sess.run(self.model.validation_loss)
        return tune.TrainingResult(timesteps_this_iter=n, mean_loss=validation_loss)

    def _stop(self):
        self.sess.close()

    # This function will be called if a population member
    # is good enough to be exploited
    def _save(self, checkpoint_dir):
        path = checkpoint_dir + '/save'
        return self.saver.save(self.sess, path, global_step=self._timesteps_total)

    # Population members that perform very well will be
    # exploited (restored) from their checkpoint
    def _restore(self, checkpoint_path):
        return self.saver.restore(self.sess, checkpoint_path)
For each trial or population member (each with its own hyperparameters), a new Trainable will be instantiated and executed on an available GPU in your cluster or on your machine.
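To make the skeleton concrete, here is a minimal sketch of what Model could look like (an illustrative stand-in, not part of ray tune): a one-hidden-layer regression network with the data baked into the graph as constants, so that training_op and validation_loss can be run without a feed_dict. The hyperparameter keys 'units' and 'learning_rate' match the train_spec below; the data format is an assumption:
import tensorflow as tf

class Model:
    def __init__(self, data, hyperparameters):
        # Assumption: data is a (train_x, train_y, val_x, val_y) tuple
        # of numpy arrays, with targets of shape [N, 1]
        train_x, train_y, val_x, val_y = data
        units = hyperparameters['units']

        # Training graph: the data lives in the graph as constants
        x = tf.constant(train_x, dtype=tf.float32)
        y = tf.constant(train_y, dtype=tf.float32)
        hidden = tf.layers.dense(x, units, activation=tf.nn.relu, name='hidden')
        prediction = tf.layers.dense(hidden, 1, name='out')
        loss = tf.losses.mean_squared_error(y, prediction)
        optimizer = tf.train.AdamOptimizer(hyperparameters['learning_rate'])
        self.training_op = optimizer.minimize(loss)

        # Validation graph: reuse the same weights on the validation split
        vx = tf.constant(val_x, dtype=tf.float32)
        vy = tf.constant(val_y, dtype=tf.float32)
        vhidden = tf.layers.dense(vx, units, activation=tf.nn.relu,
                                  name='hidden', reuse=True)
        vprediction = tf.layers.dense(vhidden, 1, name='out', reuse=True)
        self.validation_loss = tf.losses.mean_squared_error(vy, vprediction)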
Setting up ray tune
Next, register your trainable and specify your experiments
import numpy as np

tune.register_trainable('MyTrainable', MyTrainable)

train_spec = {
    'run': 'MyTrainable',
    # Specify the number of CPU cores and GPUs each trial requires
    'trial_resources': {'cpu': 1, 'gpu': 1},
    'stop': {'timesteps_total': 20000},
    # All your hyperparameters (variable and static ones)
    'config': {
        'batch_size': 20,
        'units': 100,
        'l1_scale': lambda cfg: np.random.uniform(1e-5, 1e-3),
        'learning_rate': tune.grid_search([1e-3, 1e-4]),
        # ...
    },
    # Number of trials
    'repeat': 4
}
The entry ‘repeat’ specifies the number of trials / population members. Each trial samples its ‘config’ from the specification above, i.e. it uses the predefined static values and evaluates the given lambdas. The instruction tune.grid_search multiplies the number of trials by the number of values it is given, effectively running a grid search: with ‘repeat’ set to 4 and the two learning rates above, 8 trials will be launched in total.
Finally, we have to define the kind of hyperparameter tuning we would like to perform and start our experiments. This is how it works for population-based training:
from ray.tune.pbt import PopulationBasedTraining

pbt = PopulationBasedTraining(
    time_attr='training_iteration',
    # Maximize the negated loss (tune derives neg_mean_loss
    # automatically from the mean_loss we report)
    reward_attr='neg_mean_loss',
    perturbation_interval=1,
    hyperparam_mutations={
        'l1_scale': lambda: np.random.uniform(1e-5, 1e-3),
        'learning_rate': [1e-2, 1e-3, 1e-4]
    }
)

tune.run_experiments({'population_based_training': train_spec}, scheduler=pbt)
The above example will save, explore, and exploit your population every time _train has been called on your Trainable, because we set ‘perturbation_interval’ to 1. Furthermore, both ‘l1_scale’ and ‘learning_rate’ will be perturbed or resampled during explore operations according to the schemes specified. You can even implement your own exploration function; to learn more, see here.
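As a sketch, such a function receives the mutated config and may post-process it before the trial continues. The clipping logic below is a made-up example, while custom_explore_fn itself is an actual argument of PopulationBasedTraining:
def custom_explore_fn(config):
    # Clip the mutated l1_scale back into a sensible range
    config['l1_scale'] = min(max(config['l1_scale'], 1e-6), 1e-2)
    return config

# Pass it to the scheduler in addition to the arguments shown above, e.g.
# pbt = PopulationBasedTraining(..., custom_explore_fn=custom_explore_fn)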
Because at every mutation step of the population-based training the entire model is saved to disk and restored (and possibly sent to a different machine), ray tune incurs significantly more overhead than our vanilla tensorflow version. You might therefore want to increase perturbation_interval depending on the length of each iteration.
Optional: Network morphisms
On the other hand, if your graph needs to be rebuilt anyway, for instance to make architectural changes based on hyperparameters, the graph reconstruction makes the process quite easy.
To give you an example, let’s say you want to increase the number of units in a layer.
You could then implement a network morphism that keeps your neural network function \(f\) identical but changes the number of units by padding your weight and bias variables with zeros.
You will have to redefine your _restore function
def _restore(self, checkpoint_path):
    reader = tf.train.NewCheckpointReader(checkpoint_path)
    # Note: _var_list is a private attribute of tf.train.Saver
    for var in self.saver._var_list:
        tensor_name = var.name.split(':')[0]
        # Skip variables that are not present in the checkpoint
        if not reader.has_tensor(tensor_name):
            continue
        saved_value = reader.get_tensor(tensor_name)
        # Zero-pad or truncate the saved value to the current variable shape
        resized_value = fit_to_shape(saved_value, var.shape.as_list())
        var.load(resized_value, self.sess)
where fit_to_shape is defined as
def fit_to_shape(array, target_shape):
    source_shape = np.array(array.shape)
    target_shape = np.array(target_shape)
    if len(target_shape) != len(source_shape):
        raise ValueError('Axes must match')
    size_diff = target_shape - source_shape
    if np.all(size_diff == 0):
        return array
    # Grow dimensions by zero-padding at the end of each axis
    if np.any(size_diff > 0):
        paddings = np.zeros((len(target_shape), 2), dtype=np.int32)
        paddings[:, 1] = np.maximum(size_diff, 0)
        array = np.pad(array, paddings, mode='constant')
    # Shrink dimensions by truncating each axis to the target size
    if np.any(size_diff < 0):
        slice_desc = tuple(slice(d) for d in target_shape)
        array = array[slice_desc]
    return array
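To illustrate the behavior with made-up shapes:
import numpy as np

w = np.ones((3, 4))
print(fit_to_shape(w, [5, 4]).shape)  # (5, 4): two zero rows appended
print(fit_to_shape(w, [2, 4]).shape)  # (2, 4): one row truncated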
Note that in the case where the number of units is reduced, the above code will not keep the function \(f\) identical! One might, for instance, derive a more intelligent algorithm that removes only the weights and biases with near-zero magnitude, which l1-regularization would encourage.
Waiting for results and visualizing
Finally, lean back and let the magic happen. You will see that ray tune outputs nice statistics along the way
== Status ==
PopulationBasedTraining: 42 checkpoints, 28 perturbs
Resources used: 3/12 CPUs, 3/3 GPUs
Result logdir: /home/louis/ray_results/population_based_training
PAUSED trials:
- Experiment_0: PAUSED [pid=20781], 971 s, 1600 ts, -1.64e+03 rew, 0.935 loss, 0.676 acc
RUNNING trials:
- Experiment_1: RUNNING [pid=23121], 1162 s, 1350 ts, -1.68e+03 rew, 0.994 loss, 0.665 acc
- Experiment_2: RUNNING [pid=18700], 979 s, 1550 ts, -1.63e+03 rew, 0.988 loss, 0.663 acc
- Experiment_3: RUNNING [pid=22593], 990 s, 1550 ts, -1.77e+03 rew, 0.959 loss, 0.671 acc
To visualize log data from both ray tune and your own tensorflow summaries, use TensorBoard
tensorboard --logdir ~/ray_results/population_based_training
If you want to learn more about ray tune, have a look at the documentation and examples.