CIFAR-10 Genetic Algorithm Hyperparameter Selection using TensorFlow
In machine learning, one of the key challenges is selecting the right hyperparameter values for your problem. This includes things like model complexity, learning rate, batch size, etc. Selection has largely been done using intuition, which is decidedly not scientific. At GTC 2017, there were two talks on using genetic algorithms to select the best hyperparameter values for neural networks.
I decided to try this out using one of the common image classification benchmarks, CIFAR-10. This is a standard classification problem, the goal being to place each image into one of 10 classes. I started with the tutorial implementation available in the TensorFlow repository.
I wrote a small library to do genetic optimization internally at Acceleware. It supports the following types of traits:
- Boolean
- Integers within a specified range
- Uniformly distributed floating point values within a specified range
- Geometrically distributed floating point values
- Geometrically distributed integers
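As a minimal sketch of what sampling each of these trait types might look like (the actual internal Acceleware library is not public, so these function names and signatures are assumptions):

```python
import math
import random

def sample_bool():
    """Boolean trait."""
    return random.random() < 0.5

def sample_int(lo, hi):
    """Integer trait, uniform within [lo, hi]."""
    return random.randint(lo, hi)

def sample_uniform(lo, hi):
    """Uniformly distributed float trait within [lo, hi]."""
    return random.uniform(lo, hi)

def sample_geometric(lo, hi):
    """Geometrically (log-uniformly) distributed float trait within [lo, hi]."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_geometric_int(lo, hi):
    """Geometrically distributed integer trait."""
    return round(sample_geometric(lo, hi))
```

The geometric traits are useful for parameters like filter counts, where it makes more sense to explore values on a multiplicative scale than a linear one.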
The next task was to determine the range of values for each hyperparameter. The network, as implemented in the sample code (the `inference()` function in `cifar10.py`), consists of two convolutional layers (each with pooling and local response normalization), followed by two fully connected layers and a final softmax layer.
I decided to vary the following parameters:
| Parameter | Description | Values |
|---|---|---|
| conv1_size | Filter size of the first convolutional layer | 3, 4, 5, 6 |
| conv1_count | Number of filters in the first convolutional layer | 32, 40, 51, 64, 81, 102 |
| conv2_size | Filter size of the second convolutional layer | 3, 4, 5, 6 |
| conv2_count | Number of filters in the second convolutional layer | 32, 40, 51, 64, 81, 102 |
| pool_size | Pooling window size | 3, 4 |
| pool_stride | Pooling stride | 1, 2, 3, 4 |
| local3_size | Number of neurons in the local3 fully connected layer | 256, 287, 323, 362, 406, 456 |
| local4_size | Number of neurons in the local4 fully connected layer | 128, 147, 169, 194, 223 |
In total there are 138240 combinations for all of the parameters in the table above. At an average of one hour per combination, on 4 GPUs, it would take almost 4 years to run all possible combinations! We can do better, using genetic algorithms.
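The figure of 138240 is simply the product of the number of candidate values for each parameter in the table:

```python
# Number of candidate values for each hyperparameter, in table order:
# conv1_size, conv1_count, conv2_size, conv2_count,
# pool_size, pool_stride, local3_size, local4_size
value_counts = [4, 6, 4, 6, 2, 4, 6, 5]

total = 1
for n in value_counts:
    total *= n
print(total)  # 138240
```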
I modified `inference()` to take these parameters in a config dictionary:
```python
with tf.variable_scope('conv1') as scope:
    kernel = _variable_with_weight_decay(
        'weights',
        shape=[config['conv1_size'], config['conv1_size'], 3,
               config['conv1_count']],
        stddev=5e-2, wd=0.0)
    conv = tf.nn.conv2d(images, kernel, [1, 1, 1, 1], padding='SAME')
    biases = _variable_on_cpu('biases', [config['conv1_count']],
                              tf.constant_initializer(0.0))
    pre_activation = tf.nn.bias_add(conv, biases)
    conv1 = tf.nn.relu(pre_activation, name=scope.name)
    _activation_summary(conv1)
```
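For reference, a config dictionary holding the baseline values from the original sample code would look like this (passing it directly to `inference()` is an assumption about the modified signature):

```python
# Baseline hyperparameter values from the original TensorFlow sample code
config = {
    'conv1_size': 5, 'conv1_count': 64,
    'conv2_size': 5, 'conv2_count': 64,
    'pool_size': 3, 'pool_stride': 2,
    'local3_size': 384, 'local4_size': 192,
}

# Hypothetical call into the modified model-building function:
# logits = inference(images, config)
```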
I then created a population as follows. The first generation consisted of 20 random entities. The parents of each subsequent generation consisted of:
- The top 10 members of the previous generation
- The top 10 entities of all time
- One new random entity
Fitness was determined using the classification accuracy for the validation set. The parents were randomly paired off to breed with each other until 20 offspring were created for the next generation. When breeding, each trait had a 5% chance of mutating.
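The breeding step described above can be sketched as follows (a minimal illustration, not the internal library's actual API; the uniform-crossover choice and `trait_space` argument are assumptions):

```python
import random

def breed(parents, trait_space, mutation_rate=0.05, offspring_count=20):
    """Produce the next generation from a pool of parent entities.

    parents:     list of dicts mapping trait name -> value
    trait_space: dict mapping trait name -> list of allowed values,
                 used when a trait mutates
    """
    offspring = []
    while len(offspring) < offspring_count:
        a, b = random.sample(parents, 2)  # random pairing of parents
        child = {}
        for trait in a:
            # Uniform crossover: each trait is copied from either parent
            child[trait] = random.choice([a[trait], b[trait]])
            # Each trait has a 5% chance of mutating to a fresh random value
            if random.random() < mutation_rate:
                child[trait] = random.choice(trait_space[trait])
        offspring.append(child)
    return offspring
```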
Using two NVIDIA Tesla P100 and two NVIDIA Tesla K20c GPUs, I ran this algorithm through 8 generations, and the top entities were as follows, with the original sample code values for comparison:
| conv1 size | conv1 count | conv2 size | conv2 count | pool size | pool stride | local3 size | local4 size | accuracy |
|---|---|---|---|---|---|---|---|---|
| 5 | 64 | 5 | 102 | 3 | 1 | 406 | 169 | 88.01% |
| 5 | 102 | 6 | 102 | 3 | 1 | 256 | 128 | 87.96% |
| 5 | 64 | 5 | 102 | 3 | 1 | 456 | 169 | 87.76% |
| 5 | 64 | 6 | 102 | 3 | 1 | 406 | 169 | 87.71% |
| 3 | 64 | 5 | 102 | 3 | 1 | 456 | 169 | 87.53% |
| **Sample code (baseline)** | | | | | | | | |
| 5 | 64 | 5 | 64 | 3 | 2 | 384 | 192 | 86.31% |
As you can see, we achieved an improvement over the baseline with minimal effort. Notably, the top-performing entities use different network topologies than the baseline.
What can we do to improve further? We could extend the algorithm beyond simple hyperparameter selection and allow it to choose different network topologies altogether. Alternatively, we could try Bayesian optimization instead of genetic algorithms, although that can be harder to parallelize. Finally, we could penalize the fitness function with the size of the network, in order to build a model that is not only accurate but also fast and memory-efficient.
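As a sketch of that last idea, a size-penalized fitness could be as simple as the following (the penalty weight is an arbitrary, assumed knob that would need tuning):

```python
def penalized_fitness(accuracy, num_params, penalty=1e-8):
    """Trade off validation accuracy against model size.

    accuracy:   validation-set classification accuracy in [0, 1]
    num_params: number of trainable parameters in the network
    penalty:    weight controlling how strongly size is punished
    """
    return accuracy - penalty * num_params
```

With this fitness, two entities with similar accuracy would be ranked by size, nudging the population toward smaller networks.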
- Scott Quiring's blog