Use Customized Estimators

Version v0.1.4 of deepforest adds support for:

  • using customized base estimators in the cascade layers of deep forest

  • using a customized predictor concatenated to the deep forest

This page gives a detailed introduction to using these new features.

Instantiate the deep forest model

To begin with, you need to instantiate a deep forest model. Notice that some parameters specified here can be overridden by downstream steps. For example, if the parameter use_predictor is set to False here but set_predictor() is called later, the internal attribute use_predictor will be changed to True.

from deepforest import CascadeForestClassifier
model = CascadeForestClassifier()
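
For instance, a model instantiated with use_predictor=False will still end up using a predictor once set_predictor() is called in a later step:

# use_predictor=False here will be flipped to True
# once set_predictor() is called below
model = CascadeForestClassifier(use_predictor=False)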

Instantiate your estimators

In order to use customized estimators in the cascade layers of deep forest, the next step is to instantiate the estimators and encapsulate them into a Python list:

n_estimators = 4  # the number of base estimators per cascade layer
estimators = [your_estimator(random_state=i) for i in range(n_estimators)]

Tip

Make sure that the estimators in the list use different random seeds if seeds are manually specified. Otherwise, they will behave identically on the dataset and make the cascade layers less effective.
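
For example, here is a minimal sketch using scikit-learn estimators; the mix of RandomForestClassifier and ExtraTreesClassifier is only an illustration, and any estimator that passes the checks below works:

from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

n_estimators = 4  # the number of base estimators per cascade layer
# alternate between two estimator types, each with its own seed
estimators = [
    RandomForestClassifier(random_state=i) if i % 2 == 0
    else ExtraTreesClassifier(random_state=i)
    for i in range(n_estimators)
]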

For the customized predictor, you only need to instantiate it, and there is no extra step:

predictor = your_predictor()
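
For instance, any classifier implementing the scikit-learn API can serve as the predictor; LogisticRegression below is just an illustrative choice:

from sklearn.linear_model import LogisticRegression

# any classifier offering fit() and predict_proba() works
predictor = LogisticRegression(max_iter=1000)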

Deep forest conducts internal checks to make sure that the estimators and predictor are valid for training and evaluation. To pass these checks, the class of your customized estimators or predictor should implement at least the methods listed below:

  • fit() for training

  • [Classification] predict_proba() for evaluating

  • [Regression] predict() for evaluating

The names of these methods follow the scikit-learn convention, and they are already implemented in many packages offering scikit-learn APIs (e.g., XGBoost, LightGBM, CatBoost). Otherwise, you have to implement a wrapper around your customized estimator to make these methods callable.
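
As an illustration, here is a minimal wrapper sketch for classification. The wrapped CustomModel and its fit_model()/predict_scores() methods are hypothetical placeholders for your own code:

import numpy as np

class SklearnWrapper:
    """Expose fit() and predict_proba() for a non-scikit-learn model."""

    def __init__(self, model):
        self.model = model  # the wrapped custom estimator

    def fit(self, X, y):
        self.model.fit_model(X, y)  # delegate to the custom training routine
        return self

    def predict_proba(self, X):
        # expected to return an array of shape (n_samples, n_classes)
        return np.asarray(self.model.predict_scores(X))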

Call set_estimator() and set_predictor()

The core step is to call set_estimator() and set_predictor() to override the estimators used by default:

# Customized base estimators
model.set_estimator(estimators)

# Customized predictor
model.set_predictor(predictor)

set_estimator() has another parameter n_splits, which determines the number of folds used by the internal cross-validation strategy. Its value should be at least 2, and the default is 5. Generally speaking, a larger n_splits leads to better generalization performance. If you are unsure about the effect of cross-validation here, please refer to the original paper for details on how deep forest adopts the cross-validation strategy to build cascade layers.
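
For example, to use 10-fold cross-validation when building the cascade layers:

# use 10-fold cross-validation instead of the default 5 folds
model.set_estimator(estimators, n_splits=10)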

Train and Evaluate

The remaining steps follow the original workflow of deep forest:

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
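
To measure performance, the usual scikit-learn metrics apply, for example:

from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, y_pred)  # fraction of correctly classified samples
print("Testing accuracy: {:.3f}".format(acc))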

Note

When using customized estimators via set_estimator(), deep forest adopts the cross-validation strategy to grow cascade layers. Suppose that n_splits is set to 5 when calling set_estimator(); then each estimator is trained five times to produce the full augmented features from a cascade layer. As a result, you may experience a drastic increase in running time and memory usage.