Bayesian Optimisation
for Efficient ML Pipeline Hyperparameter Tuning
Under a Cost Budget

Munachiso S. Nwadike, Navish Kumar, Kirill Vishniakov, Willie Neiswanger, Kun Zhang, Qirong Ho, Eric Xing


In Publication, Coming Soon.



Key Words: Bayesian Optimisation, ML Pipelines, Learned Cost, Memoisation, Cumulative Cost.



Problem Statement

  1. In ML, we often need to "stitch" multiple different components into a pipeline.

  2. Can we simultaneously and automatically tune hyperparameters of these pieces in an end-to-end fashion?

  3. Is it possible to tune for different types of objectives (e.g. inference speed, latency, compute cost, memory usage, accuracy)?

Overview

Bayesian optimization (BO) is often used for hyperparameter tuning, due to its ability to optimize black-box functions efficiently. Efficiency is usually measured in terms of the number of iterations required for convergence. However, at each iteration of the BO algorithm, different hyperparameters may incur different costs (i.e. the time units required to perform a query). Thus, measuring convergence speed in terms of the number of iterations may not be representative of the true cost incurred by a given BO acquisition function.

Furthermore, most BO acquisition functions are tailored to tune the hyperparameters of the ML model alone. For example, when tuning a random forest classifier, assume that we have access to an oracle which tells us that the ideal max depth hyperparameter is 100. We would like our BO algorithm to achieve a tradeoff between a small max depth (say 10), which may result in low accuracy at low latency, and a large max depth (say 10000), which could potentially result in high accuracy at high latency. However, this random forest classifier may depend on a preprocessing step such as TF-IDF vectorisation, which also has tunable hyperparameters. Indeed, ML models often depend on one or more preprocessing and postprocessing steps as part of an ML pipeline (see visualisation on the right). We propose a novel BO acquisition function for ML pipeline hyperparameter tuning, and demonstrate its cost-efficiency experimentally.
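To make the pipeline setting concrete, here is a minimal sketch of the TF-IDF plus random forest pipeline mentioned above, built with scikit-learn. The specific hyperparameter values are illustrative only, not the values used in our experiments.

```python
# Minimal sketch of a 2-stage text-classification pipeline: a TF-IDF
# preprocessing stage feeding a random forest classifier. Both stages
# expose tunable hyperparameters; the values below are illustrative.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000, ngram_range=(1, 2))),  # stage 1 hyperparameters
    ("clf", RandomForestClassifier(max_depth=100, n_estimators=200)),   # stage 2 hyperparameters
])

# A pipeline-aware tuner must search the joint space of stage 1 and stage 2
# hyperparameters, while accounting for the different cost each stage incurs.
```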

Example

Problem Visualisation

Visualisation of an ML pipeline for the object detection task. Here, the preprocessing step may include image resizing and deblurring. The resize width/height and Gaussian kernel represent preprocessing-stage hyperparameters. Postprocessing hyperparameters may include the NMS threshold.

Heterogeneous Cost

A multi-stage ML pipeline will have multiple cost subspaces. Consider, for example, the below 3-stage pipeline with 2 hyperparameters per stage. $[x_1, x_2]$, $[x_3, x_4]$, and $[x_5, x_6]$ represent the hyperparameters for stages 1, 2 and 3 respectively. Meanwhile, the cost functions for the 3 stages are represented by $|\sin(x_1^2)+\cos(x_1+x_2)|$, $|\sin(x_3)\cdot \cos(10+x_4)|$, and $|\cos(x_5/x_6) \cdot \sin(4-x_5x_6)\cdot \sin(0.5\cdot x_5)|$ respectively. If we consider the heatmaps of cost values for each stage, with the hyperparameter values restricted to $[-5,5]$, the cost function spaces appear as shown on the right hand side of the below image.

Cost

The overall cost function of the pipeline is therefore $|\sin(x_1^2)+\cos(x_1+x_2)| + |\sin(x_3)\cdot \cos(10+x_4)| + |\cos(x_5/x_6) \cdot \sin(4-x_5x_6)\cdot \sin(0.5\cdot x_5)|$. It is visualised in the top left corner of the above image. Note that while this is a function in $\mathbb{R}^6$, we provide a 2-dimensional visualisation of the function over the 6-dimensional domain. The addition of the costs is a direct result of the composition of pipeline stages.

Our goal, with our novel BO algorithm for ML pipelines, is to minimise total cost over all pipeline stages while optimising the objective function. In this case, the objective function is defined as $\mathrm{obj}(x) = |x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 + x_6|$, also represented in 2D. Observe that the high-cost and low-cost regions occur in different parts of the space from the high and low objective function regions.
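For concreteness, the stagewise costs, total cost, and objective above can be transcribed directly into code. The sketch below (Python/NumPy assumed) simply restates the formulas.

```python
import numpy as np

# Stagewise cost functions from the 3-stage example above
# (hyperparameter values restricted to [-5, 5] per dimension).
def stage1_cost(x1, x2):
    return abs(np.sin(x1**2) + np.cos(x1 + x2))

def stage2_cost(x3, x4):
    return abs(np.sin(x3) * np.cos(10 + x4))

def stage3_cost(x5, x6):
    return abs(np.cos(x5 / x6) * np.sin(4 - x5 * x6) * np.sin(0.5 * x5))

def total_cost(x):
    # Composing pipeline stages sums their costs.
    x1, x2, x3, x4, x5, x6 = x
    return stage1_cost(x1, x2) + stage2_cost(x3, x4) + stage3_cost(x5, x6)

def objective(x):
    # Objective as defined above (x6 enters unsquared).
    x1, x2, x3, x4, x5, x6 = x
    return abs(x1**2 + x2**2 + x3**2 + x4**2 + x5**2 + x6)

x = np.random.uniform(-5, 5, size=6)
print(total_cost(x), objective(x))
```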




Methodology

When performing hyperparameter search in an ML pipeline setting, we can cache hyperparameter combinations from earlier stages of a pipeline, once queried, to be reused with hyperparameter combinations of later stages, for a more efficient search. This cache-and-reuse is referred to as memoisation. For example, to the left of the image below, we have a 3-stage pipeline where $x_1=1.2$, $x_2=4.0$, and $x_3=1.0$ are the hyperparameters for the 3 respective ML pipeline stages. The cost associated with each hyperparameter is shown in orange, e.g. the cost for stage 1 is 5.1.

Memoisation

To the right of the image below, we have a "prefix pool", resulting from the queried hyperparameter combination. This prefix pool has two entries: the first is a hyperparameter combination where $x_1=1.2$ was stored from a prior run, and the second is one where $x_1=1.2$ and $x_2=4.0$ were both stored. Our proposed EEIPU acquisition function discounts the costs of such previously queried hyperparameters in both cases when calculating expected improvement per unit cost.
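Below is a minimal sketch of how such a prefix pool could be represented. The data structure, the stage-2 cost value, and the candidate point are illustrative assumptions rather than the exact implementation.

```python
# Illustrative prefix pool: cached prefixes are keyed by the tuple of
# hyperparameters of the stages already run, and store the accumulated
# cost of those stages (stage 1 cost of 5.1 comes from the example above;
# the stage 2 cost of 3.2 is an illustrative value).
prefix_pool = {
    (1.2,):     {"cached_output": "stage1_artifact", "cost_so_far": 5.1},
    (1.2, 4.0): {"cached_output": "stage2_artifact", "cost_so_far": 5.1 + 3.2},
}

def longest_memoised_prefix(candidate):
    """Return the longest cached prefix of a candidate hyperparameter tuple."""
    for k in range(len(candidate), 0, -1):
        prefix = tuple(candidate[:k])
        if prefix in prefix_pool:
            return prefix
    return ()

# A hypothetical new candidate sharing x1 and x2 with a prior run
# only needs to pay for stage 3.
print(longest_memoised_prefix((1.2, 4.0, 2.5)))  # -> (1.2, 4.0)
```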

eeipu_mem

Within our acquisition function, we discount the costs of memoised stages as shown above: the expected inverse cost excludes the costs of stages which have been memoised. $\delta$ is a threshold variable that tells us the earliest non-memoised stage from which to begin considering costs (i.e. where the discounting ends). Discounting memoised costs in this way may result in over-biasing towards previously queried points; our research finds the needed balance between exploration and exploitation.
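The sketch below illustrates the discounting idea in simplified form: expected improvement is divided by the expected cost of only the non-memoised stages (those with index at least $\delta$). This is a hedged approximation of the acquisition function, not its exact definition.

```python
# Simplified sketch of cost discounting for memoised pipeline stages.
def ei_per_unit_cost(expected_improvement, expected_stage_costs, delta):
    """
    expected_improvement : EI(x) of the candidate under the objective model
    expected_stage_costs : list of expected costs, one per pipeline stage
    delta                : index of the earliest non-memoised stage
    """
    # Only the stages that must actually be re-run contribute to the cost.
    discounted_cost = sum(expected_stage_costs[delta:])
    # Guard against a fully memoised candidate (zero remaining cost).
    discounted_cost = max(discounted_cost, 1e-8)
    return expected_improvement / discounted_cost

# Example: 3-stage pipeline where stages 1 and 2 are memoised (delta = 2),
# so only stage 3's expected cost counts.
print(ei_per_unit_cost(0.4, [5.1, 3.2, 2.0], delta=2))
```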

While we have analytical, closed-form expressions for the cost functions of the above pipeline, this is often not the case for real-world pipelines. For example, one usually does not know the relationship between the hyperparameters of a computer vision pipeline and pipeline latency/speed. As a result, we model the cost at each pipeline stage as a Gaussian process, and log latency measurements for queried hyperparameter combinations, to use in iteratively modelling the stagewise costs.
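As an illustration of stagewise cost modelling, the sketch below fits a GP to logged latencies for one stage. The choice of library (scikit-learn) and kernel is an assumption for demonstration purposes, not necessarily what we use.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Fit one cost model per stage on (stage hyperparameters -> log latency).
def fit_stage_cost_model(stage_hparams, stage_latencies):
    """stage_hparams: (n, d) array; stage_latencies: (n,) array of seconds."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(stage_hparams, np.log(stage_latencies))
    return gp

# Predicted stage cost of a candidate = exp of the predicted log latency, e.g.
# gp = fit_stage_cost_model(X_stage1, latencies_stage1)
# expected_cost = np.exp(gp.predict(x_candidate.reshape(1, -1)))
```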



Experiments

BO algorithms are often tested on a range of synthetic functions designed to assess convergence. Compositions of these functions can be used to form a pipeline. Below are results for a 2-stage pipeline in which the stage objectives are 2-dimensional negative Branin functions, while the stagewise costs are built from logistic functions. The overall objective is given by $-$Branin$(x_1,x_2)$ $-$ Branin$(x_3,x_4)$. Stagewise costs are $-\texttt{logistic}(x_1) -\texttt{logistic}(x_2)$ for stage 1, and $-\texttt{logistic}(x_3) + (1/x_4)$ for stage 2.
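For reference, the sketch below writes out the synthetic objective and stagewise costs of this benchmark; the Branin constants follow the standard form and may differ from the exact experimental configuration.

```python
import numpy as np

# Standard Branin function (commonly used BO benchmark).
def branin(x1, x2, a=1.0, b=5.1 / (4 * np.pi**2), c=5 / np.pi,
           r=6.0, s=10.0, t=1 / (8 * np.pi)):
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Overall objective of the 2-stage pipeline.
def objective(x1, x2, x3, x4):
    return -branin(x1, x2) - branin(x3, x4)

# Stagewise cost functions as stated above.
def stage1_cost(x1, x2):
    return -logistic(x1) - logistic(x2)

def stage2_cost(x3, x4):
    return -logistic(x3) + 1.0 / x4
```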

Cost


If our EEIPU algorithm works as intended, we would expect it to perform on par with EIPU. This is what we observe in our initial experiments: the green line (EEIPU) sits between the red and the yellow lines in the second plot from the left, above. Regular expected improvement (EI) is represented by the red line. EI, which does not factor in cost at all, naturally performs well in terms of the traditional number of iterations required for convergence, but underperforms EEIPU and EIPU when cumulative cost is considered. EIPU has the advantage of an oracle, which specifies the true cost at each point in the hyperparameter space. This is not realistic in real-world scenarios, where one does not know the cost of a hyperparameter combination until after it has been evaluated - an expensive operation. It is thus impressive that, without access to such an oracle, EEIPU is able to compete with EIPU in terms of objective value improvement per unit of cumulative cost incurred.