Bayesian Optimization
Is it really the best hyperparameter tuning method? Comparing it with grid search, random search, and hyperband search
Hyperparameter tuning is a time-intensive and computationally expensive task. What we want is a tuning mechanism that finds the optimal hyperparameters as quickly as possible. There are a few methods for hyperparameter tuning, such as grid search, random search, and hyperband, and each has its own benefits and drawbacks. Then there is Bayesian optimization, which makes the other tuning techniques seem a bit old-fashioned. But does this fancy method beat all the other tuning approaches? Is it a golden rule of thumb? My quick answer is:
If you have limited computational resources and want to get presumably the best hyperparameters within a few iterations, then yes. However, is it drastically better than the other methods? No.
Let’s dig a little more into the Bayesian optimization.
What is Bayesian optimization?
In grid search, you try every possible option, digging holes even in places that are not promising at all. Although impractical, you won't accidentally miss the global optimum. However, you need potentially unlimited computational power to achieve this.
Random search, on the other hand, randomly pokes here and there to get a good estimate of where the optimal hyperparameters are. It is a much more practical way to do hyperparameter tuning, but you still need many iterations to get a good glimpse of where the global optimum lies.
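To make the contrast concrete, here is a minimal Python sketch with a hypothetical two-hyperparameter search space: grid search enumerates every combination, while random search draws a fixed budget of samples (and can draw from a continuous range, such as a log-uniform learning rate):

```python
import itertools
import random

# Hypothetical search space: two hyperparameters
learning_rates = [0.001, 0.01, 0.1]
batch_sizes = [16, 32, 64]

# Grid search evaluates every combination: 3 x 3 = 9 trials
grid_trials = list(itertools.product(learning_rates, batch_sizes))

# Random search evaluates a fixed budget of randomly drawn points,
# sampling the learning rate from a continuous (log-uniform) range
random.seed(0)
random_trials = [
    (10 ** random.uniform(-3, -1), random.choice(batch_sizes))
    for _ in range(5)
]
```

Note that the grid cost multiplies with every added hyperparameter value, while the random-search budget is whatever you choose to afford.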
Hyperband search is a variation of random search. Other search methods keep running clearly bad hyperparameter sets until the prescribed number of iterations is exhausted. Hyperband, instead, terminates those bad configurations after very few iterations and continues running only the parameter sets that look promising based on the history so far.
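The early-stopping idea at the heart of Hyperband is successive halving: give every configuration a small budget, keep only the better half, and multiply the budget for the survivors. The sketch below uses a made-up `fake_evaluate` function (an illustrative stand-in for actual model training, not a real training loop):

```python
def successive_halving(configs, evaluate, budget=1, eta=2):
    """Keep the best 1/eta of the configurations each round,
    multiplying the training budget for the survivors."""
    while len(configs) > 1:
        # Score every surviving configuration with the current budget
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[: max(1, len(configs) // eta)]
        budget *= eta  # survivors get a bigger budget next round
    return configs[0]

# Hypothetical stand-in for training: loss is lowest near lr = 0.1
# and shrinks as the budget (e.g. number of epochs) grows.
def fake_evaluate(lr, budget):
    return (lr - 0.1) ** 2 + 1.0 / budget

candidates = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.3, 0.5, 1.0]
best_lr = successive_halving(candidates, fake_evaluate)
```

With eight candidates and eta = 2, only three halving rounds are needed, and most of the budget is spent on the few configurations that survive.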
Bayesian optimization search is a completely different approach compared to the aforementioned search techniques. Bayesian optimization is a probabilistically principled method based on the Bayes theorem to find the global optimum of the black-box function.
What does this mean? If you know the ground-truth objective function, say f(x) = x², you also know where its minimum and maximum values are. However, when we use deep learning models, we can never know what the underlying objective function is. Therefore, all we can do is lots of trial and error until either you get exhausted or your resources get exhausted. No one wants either of these options! We want an intelligent way to explore the parameter space. Looking at the figures below, you can see how Bayesian optimization finds the global minimum with fewer iterations than the other search methods.
This approach comes in handy when the objective function is complex and expensive to evaluate. Since we do not know the objective function of most neural networks, and deep learning models are costly to evaluate, Bayesian optimization seems like a light at the end of the tunnel. This is how Bayesian optimization works:
- Bayes Theorem (conditional probability of an event):
P(A|B) = P(B|A) * P(A) / P(B)
Since P(B) does not depend on A, the posterior is proportional to the numerator: P(A|B) ∝ P(B|A) * P(A)
- Prior: Samples are drawn and evaluated with a random function. The outcomes are used to define the prior, which represents our initial belief about the behavior of the objective function.
- Posterior: The posterior is the conditional probability assigned after taking the prior and the observed evaluations (the likelihood) into account. It is therefore an approximation of the objective function and evaluates the cost of specific parameter configurations. The posterior becomes the prior when it is updated in the next iteration. The posterior is used in the acquisition function to determine the next points to sample in the search space, the ones most likely to return good results.
- Surrogate Function: The surrogate function is an approximation of the black-box objective function. A Gaussian process (a Gaussian distribution over functions) typically operates as the surrogate. The Gaussian process builds a probabilistic model of the objective function, assuming the function values at the sampled points follow a multivariate normal distribution. In short, the Gaussian process guides how to explore the parameter samples in the search space.
- Acquisition Function: It determines the next query points from the parameter space. It balances exploration (where might the optimal values be? Let's probe unknown or highly uncertain areas) and exploitation (where is the optimal value? Let's sample points that look best given previous results). There are multiple ways to do this; "Expected Improvement (EI)" and "Maximum Probability of Improvement" are popular choices. In short, the acquisition function decides what to sample next.
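Putting the pieces together, here is a minimal, self-contained sketch of the loop on a toy one-dimensional problem: a Gaussian-process posterior serves as the surrogate, and Expected Improvement as the acquisition function. The toy objective, kernel length scale, candidate grid, and iteration counts are all illustrative assumptions, not a production implementation:

```python
import numpy as np
from math import erf, sqrt, pi

def objective(x):
    # Toy "black-box" function we pretend not to know
    return np.sin(3 * x) + 0.5 * x ** 2

def rbf_kernel(a, b, length=0.5):
    # Squared-exponential covariance between two sets of 1-D points
    diff = np.asarray(a)[:, None] - np.asarray(b)[None, :]
    return np.exp(-0.5 * diff ** 2 / length ** 2)

def gp_posterior(x_obs, y_obs, x_query, jitter=1e-6):
    # Surrogate: Gaussian-process posterior mean and std at the query points
    K = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    K_s = rbf_kernel(x_obs, x_query)
    K_inv = np.linalg.inv(K)
    mu = K_s.T @ K_inv @ np.asarray(y_obs)
    var = 1.0 - np.sum((K_s.T @ K_inv) * K_s.T, axis=1)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, best_y):
    # Acquisition: expected improvement (for minimization)
    z = (best_y - mu) / sigma
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2)))
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * pi)
    return (best_y - mu) * cdf + sigma * pdf

rng = np.random.default_rng(0)
x_obs = list(rng.uniform(-2, 2, size=3))   # prior: a few random evaluations
y_obs = [objective(x) for x in x_obs]
candidates = np.linspace(-2, 2, 200)

for _ in range(15):
    mu, sigma = gp_posterior(x_obs, y_obs, candidates)
    ei = expected_improvement(mu, sigma, min(y_obs))
    x_next = candidates[int(np.argmax(ei))]  # most promising next sample
    x_obs.append(x_next)                     # posterior becomes the new prior
    y_obs.append(objective(x_next))

best_x = x_obs[int(np.argmin(y_obs))]
best_y = min(y_obs)
```

Each pass through the loop fits the surrogate to everything observed so far, then lets the acquisition function trade off exploring uncertain regions against exploiting the current best region, which is exactly the prior-to-posterior update described above.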
The gif image below simulates the Bayesian optimization process.
Bayesian optimization, however, may not always be the best tuning mechanism. Multiple studies have found that Bayesian optimization contributes only a negligible improvement compared to random search. Moreover, Bayesian optimization performs poorly in high-dimensional search spaces. Sometimes a simple, straightforward method is easier to implement and explain, and is efficient enough in many real-life use cases if the performance difference is negligible. Depending on your use case, simple is best!
One misconception is worth addressing: many people think that because Bayesian optimization uses a Gaussian process as the surrogate function, the optimization process must be sequential, which would also imply that it cannot leverage parallel resources. However, there are ways to enable parallel evaluation using tools such as SigOpt.