**A guide to hyperparameter tuning in machine learning - a critical step for model success**

## Introduction

It’s often said that 80% of a data scientist’s job is loading the data, understanding its underlying relationships, drawing insights, and preparing it for modeling. However, many people overlook a seemingly minor part of the remaining 20%: hyperparameter tuning.

Hyperparameter tuning is intricate yet crucial to a model’s success. Much like a Formula 1 car, a proper model tune can be the difference between a mediocre model and a highly efficient one. Essentially, hyperparameter tuning involves adjusting the settings (i.e., the hyperparameters) that control the model’s learning process.
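To make the distinction concrete, here is a minimal sketch that fits a one-weight linear model with plain gradient descent. The weight `w` is a parameter learned from the data, while `learning_rate` and `n_steps` are hyperparameters that we pick up front. The toy data and the function name `fit_weight` are illustrative assumptions and are not part of this article’s pipeline.

```python
def fit_weight(xs, ys, learning_rate, n_steps):
    """Fit y ~ w * x by gradient descent on the mean squared error."""
    w = 0.0  # learned parameter, updated from the data
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of mean((w * x - y)^2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true weight of 2

# Same data, same model, same number of steps: only the hyperparameter differs
w_good = fit_weight(xs, ys, learning_rate=0.05, n_steps=200)
w_slow = fit_weight(xs, ys, learning_rate=0.0001, n_steps=200)
print(f"learning_rate=0.05   -> w = {w_good:.4f}")
print(f"learning_rate=0.0001 -> w = {w_slow:.4f}")
```

With the larger learning rate the weight settles close to 2, while the smaller one is still far from it after the same number of steps. That gap is what tuning tries to close.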

That being said, finding the perfect combination of hyperparameters is not easy. In fact, it’s typically a cumbersome, exhaustive, and complex challenge. Nowadays, there are multiple hyperparameter tuning methods (including automated ones) that try to make this process easier to digest; nevertheless, the complexity around hyperparameters today is no longer only about finding the right settings but also about understanding the various techniques at your disposal.

The objective of this article is to introduce some of the more important hyperparameter tuning methods in machine learning that every data scientist or machine learning engineer should know. We’ll delve into their strengths and weaknesses and also provide an empirical comparison of each technique.

## Getting Started

For the sake of demonstration, we’ll be using the fairly common, publicly available Census Income dataset (source: https://archive.ics.uci.edu/dataset/2/adult).

The idea behind this dataset is to predict whether a specific individual earns more than $50K/yr.

We can directly load this data into Python:

```python
import pandas as pd

# Load and preprocess dataset
def load_and_preprocess_data(url):
    columns = ["age", "workclass", "fnlwgt", "education", "education_num",
               "marital_status", "occupation", "relationship", "race", "sex",
               "capital_gain", "capital_loss", "hours_per_week", "native_country",
               "income"]
    # Fields are separated by a comma plus whitespace; "?" marks missing values
    data = pd.read_csv(url, names=columns, sep=r',\s', na_values="?", engine='python')
    # Encode the target as 0 (<=50K) or 1 (>50K)
    data['income'] = data['income'].map({'<=50K': 0, '>50K': 1})
    X = data.drop('income', axis=1)
    y = data['income']
    return X, y

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
X, y = load_and_preprocess_data(url)
```

We’ll also need to impute missing values, one-hot encode the categorical features so that every column is numeric, and scale the numerical features.

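```python
from sklearn.preprocessing import StandardScaler

# Impute and encode features
def impute_and_encode_features(X):
    X = X.copy()  # work on a copy so the caller's DataFrame is not modified
    categorical_cols = X.select_dtypes(include=['object']).columns
    numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns

    # Fill missing categorical values with the mode, numerical ones with the median
    for col in categorical_cols:
        X[col] = X[col].fillna(X[col].mode()[0])
    for col in numerical_cols:
        X[col] = X[col].fillna(X[col].median())

    # One-hot encode the categorical columns, then standardize the numerical ones
    X = pd.get_dummies(X, columns=categorical_cols)
    scaler = StandardScaler()
    X[numerical_cols] = scaler.fit_transform(X[numerical_cols])
    return X

X = impute_and_encode_features(X)
```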

Next, we’ll split our dataset into training and testing sets. This is crucial to evaluate the model’s performance during hyperparameter tuning.

We can easily do this through sklearn.

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

As part of our experiments, we’ll also be using the LGBMClassifier model because it’s known to be quite robust and fast to train.

We can define the model as follows:

```python
import lightgbm as lgb

model = lgb.LGBMClassifier(learning_rate=0.05, n_estimators=100, verbose=-1)
```

We can also prepare some of the imports that we’ll need to evaluate the techniques.

```python
import time

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, balanced_accuracy_score,
                             make_scorer)
```

With this, we now have our environment all set up and ready to go. Let’s start with perhaps the most basic and fundamental approach: grid search.

## Technique 1: Grid Search

When it comes to hyperparameter tuning in machine learning, grid search is often the go-to method for many data scientists. Let’s explore why it is so popular and how it operates.

### How It Works

- **Exhaustive search:** Grid search evaluates all possible combinations from a specified range of hyperparameter values. It’s akin to systematically checking every possible configuration to identify the one that yields the best performance.
- **Cross-validation:** To avoid overfitting, grid search typically uses cross-validation. This process divides the dataset into a number of subsets, training the model on some subsets and validating it on others.
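The two ideas above can be sketched in plain Python. This is only an illustration of the mechanics, not how scikit-learn implements them: `iter_grid`, `kfold_indices`, `grid_search`, and the stand-in scoring function `toy_score` are hypothetical names, and `toy_score` replaces the real "train on the training folds, score on the validation fold" step.

```python
from itertools import product

def iter_grid(param_grid):
    """Yield every combination of values in the grid as a dict."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def grid_search(param_grid, score_fn, n_samples, n_folds=3):
    """Return the combination with the best mean validation score."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(param_grid):
        fold_scores = [score_fn(params, train_idx, val_idx)
                       for train_idx, val_idx in kfold_indices(n_samples, n_folds)]
        mean_score = sum(fold_scores) / len(fold_scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Hypothetical stand-in for fitting a model on train_idx and scoring it on val_idx
def toy_score(params, train_idx, val_idx):
    return -abs(params["num_leaves"] - 50) - abs(params["reg_alpha"] - 0.5)

toy_grid = {"num_leaves": [31, 50, 100], "reg_alpha": [0.1, 0.5]}
best_params, best_score = grid_search(toy_grid, toy_score, n_samples=10)
print(best_params, best_score)
```

Every combination is scored on every fold, which is exactly why the cost grows so quickly with the size of the grid.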

### Implementation Insights

**Results analysis:** This process yields the most effective combination of parameters for the model. However, there is one obvious cost here: grid search can be extremely time-consuming, especially with large datasets and numerous hyperparameters.

### Other Considerations

**Time efficiency:** For extensive searches, grid search can take considerable time. The bigger the search space, the longer it takes. In real-life production scenarios, it’s not unusual for this process to take several hours!

### Python Implementation

We start off by defining our search space (the hyperparameter grid).

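```python
param_grid = {
    'num_leaves': [31, 50, 100],
    'reg_alpha': [0.1, 0.5],
    'min_data_in_leaf': [30, 50, 100, 300],
    # Note: LightGBM treats reg_alpha as an alias of lambda_l1,
    # so these two entries target the same setting
    'lambda_l1': [0, 1],
    'lambda_l2': [0, 1]
}
```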

Next, we initialize the grid search object.

```python
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(model, param_grid, cv=3, scoring='accuracy', n_jobs=-1, verbose=1)
```
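Before fitting, it helps to estimate how much work this search involves. The short calculation below multiplies the number of candidate values per hyperparameter in `param_grid` by the number of cross-validation folds; the variable names are ours and purely illustrative.

```python
from math import prod

# Number of candidate values per hyperparameter in param_grid above
values_per_param = [3, 2, 4, 2, 2]
cv_folds = 3  # matches cv=3 in GridSearchCV

n_combinations = prod(values_per_param)
n_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {n_fits} model fits")
```

On top of this, `GridSearchCV` refits the best configuration once on the full training set by default (`refit=True`). Adding a single extra value to any one hyperparameter multiplies the total accordingly.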

We now just need to fit the object onto our data.

```python
start_time = time.time()
search.fit(X_train, y_train)
print(time.time() - start_time)
```

Once finished, we can access the best hyperparameters found using:

```python
best_params = search.best_params_
```

All that’s left now is to train the model using the best hyperparameters found and evaluate its performance.

```python
# Retrain on the full training set with the best hyperparameters
model.set_params(**best_params)
model.fit(X_train, y_train)

# Predict labels and positive-class probabilities on the held-out test set
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

metrics = {
    'accuracy': accuracy_score(y_test, y_pred),
    'precision': precision_score(y_test, y_pred),
    'recall': recall_score(y_test, y_pred),
    'f1_score': f1_score(y_test, y_pred),
    'roc_auc': roc_auc_score(y_test, y_pred_proba),
}
print(metrics)
```
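Since only about a quarter of the individuals in this dataset earn more than $50K, accuracy alone can look better than the model really is, which is why we also report precision, recall, and F1. As a reminder of what those numbers mean, here is a small pure-Python sketch on made-up labels; `binary_metrics` is a hypothetical helper and not a replacement for the scikit-learn functions used above.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}

# Made-up labels purely for illustration
y_true_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]
toy_metrics = binary_metrics(y_true_toy, y_pred_toy)
print(toy_metrics)
```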

## Technique 2: Randomized Search

Randomized search offers an alternative to the exhaustive nature of grid search. It’s designed to navigate the hyperparameter space more efficiently, providing a pragmatic balance between resource use and model performance.

### How It Works

- **Selective exploration:** Unlike grid search, randomized search samples a fixed number of hyperparameter settings from the specified distributions. This method allows for a more diversified search across the parameter space.
- **Balancing speed and accuracy:** The approach is particularly useful when dealing with large datasets or when computational resources are constrained. It can uncover high-performing hyperparameters with fewer iterations than grid search.
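The sampling idea can be sketched in a few lines of plain Python. This is a simplified illustration rather than scikit-learn’s implementation: `sample_params`, `randomized_search`, and `toy_score` are hypothetical names, and `toy_score` stands in for a real cross-validated score.

```python
import random

def sample_params(param_distributions, rng):
    """Draw one configuration: lists are sampled uniformly, callables are called."""
    return {name: dist(rng) if callable(dist) else rng.choice(dist)
            for name, dist in param_distributions.items()}

def randomized_search(param_distributions, score_fn, n_iter, seed=42):
    """Evaluate n_iter random configurations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_params(param_distributions, rng)
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

toy_distributions = {
    "num_leaves": [31, 50, 100],                             # discrete choices
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),   # log-uniform in [0.001, 1]
}

# Hypothetical stand-in for a cross-validated score
def toy_score(params):
    return -abs(params["num_leaves"] - 50) - abs(params["learning_rate"] - 0.05)

best_params, best_score = randomized_search(toy_distributions, toy_score, n_iter=20)
print(best_params, best_score)
```

Note that the budget (`n_iter`) is fixed in advance and does not grow with the size of the search space, and that continuous ranges can be sampled directly instead of being reduced to a handful of grid values.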

### Implementation Insights

**Results overview:** While it might not explore every possible combination like grid search, randomized search often finds a very competitive set of parameters in a fraction of the time.

### Other Considerations

**Time comparison:** The time taken is usually substantially less, because only a fixed number of configurations is evaluated. For example, a process that takes hours in grid search might only take minutes in randomized search.

### Python Implementation

For this one, we’ll be using the same param_grid.

We can initialize the randomized search object as follows:

```python
from sklearn.model_selection import RandomizedSearchCV

# n_iter controls how many configurations are sampled from param_grid
search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=60, cv=5,
                            scoring='accuracy', n_jobs=-1, verbose=1, random_state=42)
```

And we can follow the same process as before to fit and evaluate the results.

## Technique 3: Bayesian Optimization

This technique stands out with its probabilistic approach, offering a smarter and more efficient pathway to model optimization. It has gained significant interest in the machine learning community, especially in the competitive space, due to its speed and reliability.

### How It Works

- **Probability-driven decisions:** It uses probability models to predict the performance of hyperparameters. The main objective is to balance both **exploration** (trying new hyperparameters) and **exploitation** (refining promising hyperparameters).
- **Learning from past evaluations:** The Bayesian approach builds a model of the objective function (pre-defined by the user) and uses it to predict which hyperparameters are likely to yield better results, learning from previous iterations.
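To give a feel for how past evaluations guide the next one, here is a deliberately simplified, one-dimensional loop in the spirit of the Tree-structured Parzen Estimator. It is a teaching sketch under our own assumptions (Gaussian kernels with a fixed bandwidth, a fixed "good" fraction `gamma`, hypothetical function names) and it differs in many details from what hyperopt actually does.

```python
import math
import random

def kde(x, points, bandwidth):
    """Mean of Gaussian kernels centred on `points`, evaluated at x."""
    if not points:
        return 0.0
    norm = bandwidth * math.sqrt(2 * math.pi)
    total = sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)
    return total / (len(points) * norm)

def tpe_like_minimize(loss_fn, low, high, n_startup=10, n_iter=40,
                      gamma=0.25, n_candidates=24, bandwidth=0.5, seed=0):
    """Return (best_x, best_loss, history) for a 1-D search on [low, high]."""
    rng = random.Random(seed)
    history = []
    # Exploration: begin with uniformly random trials
    for _ in range(n_startup):
        x = rng.uniform(low, high)
        history.append((x, loss_fn(x)))

    for _ in range(n_iter):
        # Split past trials into a "good" group (lowest losses) and the rest
        ranked = sorted(history, key=lambda trial: trial[1])
        n_good = max(1, int(gamma * len(ranked)))
        good = [x for x, _ in ranked[:n_good]]
        bad = [x for x, _ in ranked[n_good:]]

        # Exploitation: propose candidates near good trials (clipped to the bounds)
        candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                      for _ in range(n_candidates)]

        # Evaluate only the candidate that looks most "good" relative to "bad"
        def ratio(x):
            return kde(x, good, bandwidth) / (kde(x, bad, bandwidth) + 1e-12)

        best_candidate = max(candidates, key=ratio)
        history.append((best_candidate, loss_fn(best_candidate)))

    best_x, best_loss = min(history, key=lambda trial: trial[1])
    return best_x, best_loss, history

# Toy loss with its minimum at x = 3
def toy_loss(x):
    return (x - 3.0) ** 2

best_x, best_loss, history = tpe_like_minimize(toy_loss, low=-10.0, high=10.0)
print(f"best x = {best_x:.3f}, loss = {best_loss:.4f} after {len(history)} trials")
```

The key design point is that the expensive function (`loss_fn`) is called only once per iteration, while the cheap density estimates are used to decide where that one call should go.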

### Implementation Insights

**Comparative advantage:** This is often faster and more effective than grid and randomized search, especially in high-dimensional spaces. It’s also beneficial when each evaluation of the objective function is costly or time-consuming.

### Other Considerations

**Efficiency gains:** Bayesian optimization generally requires fewer iterations to find high-quality hyperparameters, translating to significant time savings and resource efficiency.

### Python Implementation

The first thing we need to do to perform Bayesian optimization is to define our objective function.

The objective function is quite straightforward: it tells the Bayesian algorithm what we want to optimize. In our case, we want the lowest possible loss (prediction error). Because hyperopt’s `fmin` minimizes whatever the objective returns, we return the negative of the cross-validated accuracy.

```python
from sklearn.model_selection import cross_val_score
from hyperopt import hp, fmin, tpe, Trials

def objective(params):
    lgbm = lgb.LGBMClassifier(**params, verbose=-1)
    score = cross_val_score(lgbm, X_train, y_train, scoring='accuracy', cv=3, n_jobs=-1).mean()
    # fmin minimizes its objective, so return the negated accuracy as the loss
    return -score
```

For Bayesian optimization, we also need to slightly change the hyperparameter grid that we use. The reason behind this is to let the algorithm know how it should deal with the different parameter settings.

```python
space = {
    'num_leaves': hp.choice('num_leaves', range(20, 150, 5)),
    'min_data_in_leaf': hp.choice('min_data_in_leaf', range(20, 200, 10)),
    'learning_rate': hp.loguniform('learning_rate', -5, 0),
    'max_depth': hp.choice('max_depth', range(5, 30, 1)),
    'reg_alpha': hp.uniform('reg_alpha', 0.0, 1.0),
    'reg_lambda': hp.uniform('reg_lambda', 0.0, 1.0),
    'n_estimators': hp.choice('n_estimators', range(100, 1000, 50)),
}
```
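One detail in this space that is easy to misread is `hp.loguniform('learning_rate', -5, 0)`: the bounds are given on the log scale, so the sampled learning rate is `exp(u)` with `u` drawn uniformly from [-5, 0]. The standalone sketch below reproduces that sampling rule with the standard library only (it does not call hyperopt) to show the resulting range.

```python
import math
import random

rng = random.Random(0)
low, high = -5, 0  # the bounds passed to hp.loguniform above

# Log-uniform sampling: draw uniformly on the log scale, then exponentiate
samples = [math.exp(rng.uniform(low, high)) for _ in range(10_000)]

print(f"smallest possible value: {math.exp(low):.5f}")
print(f"largest possible value:  {math.exp(high):.1f}")

share_below_01 = sum(s < 0.1 for s in samples) / len(samples)
print(f"share of samples below 0.1: {share_below_01:.2f}")
```

Roughly half of the draws fall below 0.1, which is the point of a log-uniform prior: small learning rates get as much attention as large ones.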

Next, we need to define our trials object and run the search. The trials object stores the different hyperparameter sets tried out (in short, the past experiments), and the search algorithm drives the tuning process. The search algorithm that we’re going to use is the Tree-structured Parzen Estimator (TPE). We won’t go into much detail on this here since it’s outside the scope of this article. For further reading, see:

https://arxiv.org/pdf/2304.11127.pdf

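```python
from hyperopt import space_eval

trials = Trials()
start_time = time.time()
search = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=trials)
print(time.time() - start_time)

# fmin returns indices for hp.choice parameters; space_eval maps them back to actual values
best_params = space_eval(space, search)
```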

## Technique 4: DataHeroes Coreset Optimization

DataHeroes is a relatively new Python package that offers several useful features centered around the concept of coresets.

A coreset is essentially a small, weighted sample of the original dataset that retains the same explained variance as the full dataset and also captures its edge cases.

The result of this is that the training data is significantly reduced; therefore, training time is drastically reduced, allowing for a less demanding training and tuning process.

https://dataheroes.ai/introduction-to-coresets/
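To illustrate the general idea of replacing a dataset with a smaller weighted sample, here is a toy sketch that uses plain uniform sampling. This is only the simplest possible stand-in and is not DataHeroes’ algorithm: real coreset constructions choose and weight points by their importance so that edge cases are preserved. The function names and the toy data are our own assumptions.

```python
import random

def weighted_uniform_sample(data, sample_size, seed=0):
    """Pick `sample_size` items uniformly; each carries weight n / sample_size."""
    rng = random.Random(seed)
    weight = len(data) / sample_size
    return [(item, weight) for item in rng.sample(data, sample_size)]

def weighted_mean(weighted_items):
    """Mean of the items, with each item counted according to its weight."""
    total_weight = sum(w for _, w in weighted_items)
    return sum(x * w for x, w in weighted_items) / total_weight

full_data = [i % 10 for i in range(10_000)]  # toy "dataset" with a true mean of 4.5
subset = weighted_uniform_sample(full_data, sample_size=500)

print(f"full mean:   {sum(full_data) / len(full_data):.3f}")
print(f"subset mean: {weighted_mean(subset):.3f} from {len(subset)} weighted points")
```

The weights make the 500 sampled points "stand for" all 10,000 rows, so any statistic or loss computed on the subset approximates the one computed on the full data at a fraction of the cost.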

### How It Works

**Coreset sampling:** As explained above, it uses a sample of the original population that has the same properties and underlying relationships.

### Implementation Insights

**Faster execution:** This can be much faster than the standard grid search, since each iteration trains on far less data and the training time per iteration drops drastically.

### Python Implementation

The Python implementation for this is relatively easy.

The first thing that we need to do is define our coreset object.

```python
from dataheroes import CoresetTreeServiceDTC

n_instances = len(X_train)
chunk_size = int(n_instances / 4)    # standard sizes from the documentation
coreset_size = int(chunk_size / 2)   # standard sizes from the documentation

start = time.time()
service_obj = CoresetTreeServiceDTC(
    optimized_for='training',
    chunk_size=chunk_size,
    coreset_size=coreset_size,
    n_instances=n_instances,
    model_cls=lgb.LGBMClassifier  # LightGBM was imported as lgb earlier
)
# Build the coreset tree from the training data
service_obj.build(X=X_train.to_numpy(), y=y_train.to_numpy())
end = time.time()

coreset_build_time = end - start
print(f"CoresetTreeServiceDTC construction lasted {coreset_build_time:.2f} seconds")
```

We’re able to create a coreset from our original dataset in just 7 seconds.

The beauty of this is that we can reuse this coreset however we like; thus, these 7 seconds are only taken once.
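For reference, the sizes passed to the service above work out as follows. The calculation assumes that `adult.data` contains 32,561 rows and that the 80/20 split from earlier was used; both are assumptions on our side, so adjust the numbers if your copy of the data differs.

```python
import math

n_rows = 32_561   # assumed row count of adult.data
test_size = 0.2   # same test share as in train_test_split earlier

n_test = math.ceil(n_rows * test_size)  # scikit-learn rounds the test share up
n_train = n_rows - n_test

chunk_size = int(n_train / 4)
coreset_size = int(chunk_size / 2)
print(f"n_train={n_train}, chunk_size={chunk_size}, coreset_size={coreset_size}")
```

In other words, each chunk of roughly 6,500 training rows is summarized by a coreset of about half that size.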

Now, there are plenty of uses with this coreset - but of course, ours is hyperparameter tuning.

We’ll be using essentially the same parameter grid (with one extra `min_data_in_leaf` value and `verbose` set to -1), and we’ll need to define a scorer (similar to the objective function used in Bayesian optimization). However, this time, we can create a scorer directly using sklearn.

```python
from sklearn.metrics import make_scorer

# Define the parameter grid
param_grid = {
    'num_leaves': [31, 50, 100],
    'reg_alpha': [0.1, 0.5],
    'min_data_in_leaf': [30, 50, 100, 300, 400],
    'lambda_l1': [0, 1],
    'lambda_l2': [0, 1],
    'verbose': [-1]
}
scoring = make_scorer(accuracy_score)

# Run the grid search on the coreset tree instead of the full training set
start = time.time()
optimal_hyperparameters, scores, trained_model = service_obj.grid_search(
    param_grid=param_grid, scoring=scoring, refit=True, verbose=2
)
end = time.time()
coreset_grid_search_time = end - start

# Evaluate the refitted model on the held-out test set
y_pred = trained_model.predict(X_test.to_numpy())
coreset_best_params_test_score = balanced_accuracy_score(y_test.to_numpy(), y_pred)

print(f'\nThe balanced accuracy score on the test data for the best hyperparameters is: {round(coreset_best_params_test_score, 4)}')
print(f'The optimal hyperparameters are: {optimal_hyperparameters}')
print(f"Grid search on the coreset tree lasted {coreset_grid_search_time:.2f} seconds")
```

## Concluding Remarks

As part of this article, we’ve explored the main automated hyperparameter tuning methods in machine learning and discussed their main characteristics. We also introduced the relatively new concept of DataHeroes’ coresets and discussed its relevance to hyperparameter tuning.