API Reference

Below is the class and function reference for deepforest. Note that the package is still under active development, so some features may not be stable yet.

CascadeForestClassifier

fit(X, y[, sample_weight])

Build a deep forest using the training data.

predict_proba(X)

Predict class probabilities for X.

predict(X)

Predict class for X.

clean()

Clean the buffer created by the model.

get_estimator(layer_idx, est_idx, estimator_type)

Get estimator from a cascade layer in the deep forest.

get_layer_feature_importances(layer_idx)

Return the feature importances of layer_idx-th cascade layer.

load(dirname)

Load the model from the directory dirname.

save([dirname])

Save the model to the directory dirname.

set_estimator(estimators[, n_splits])

Specify the custom base estimators for cascade layers.

set_predictor(predictor)

Specify the custom predictor concatenated to deep forest.

class deepforest.CascadeForestClassifier(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='gini', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)

Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.ClassifierMixin

Implementation of the deep forest for classification.

Parameters
  • n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.

  • bin_subsample (int, default=200000) – The number of samples used to construct the discrete feature bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.

  • bin_type ({"percentile", "interval"}, default="percentile") –

    The type of binner used to bin feature values into integer-valued bins.

    • If "percentile", each bin will have approximately the same number of distinct feature values.

    • If "interval", each bin will have approximately the same size.

  • max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Notice that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.

  • criterion ({"gini", "entropy"}, default="gini") – The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.

  • n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains one RandomForestClassifier and one ExtraTreesClassifier.

  • n_trees (int, default=100) – The number of trees in each estimator.

  • max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.

  • min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.

  • min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.

  • use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of deep forest.

  • predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter will have no effect.

  • predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from the deep forest. If use_predictor is False, this parameter will have no effect.

  • backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are "custom" for higher time and memory efficiency and "sklearn" for additional functionality.

  • n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set does not improve compared against the best validation performance achieved so far for n_tolerant_rounds rounds.

  • delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.

  • partial_mode (bool, default=False) –

    Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode.

    • If True, the partial mode is activated and all fitted estimators will be dumped in a local buffer;

    • If False, all fitted estimators are directly stored in the memory.

  • n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • random_state (int or None, default=None) –

    • If int, random_state is the seed used by the random number generator;

    • If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, default=1) –

    Controls the verbosity when fitting and predicting.

    • If <= 0, silent mode, which means no logging information will be displayed;

    • If 1, logging information on the cascade layer level will be displayed;

    • If > 1, full logging information will be displayed.
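
The following minimal sketch shows how the estimator is typically constructed and fitted. The dataset and metric come from scikit-learn and are illustrative only, not part of this API:

    from sklearn.datasets import load_digits
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    from deepforest import CascadeForestClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # Two estimators per layer (one random forest + one extra-trees),
    # at most 20 cascade layers, with internal early stopping.
    model = CascadeForestClassifier(n_estimators=2, max_layers=20, random_state=1)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print("Testing accuracy: {:.3f}".format(accuracy_score(y_test, y_pred)))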

fit(X, y, sample_weight=None)

Build a deep forest using the training data.

Note

Deep forest supports two kinds of modes for training:

  • Full memory mode, in which the training / testing data and all fitted estimators are directly stored in the memory.

  • Partial mode, in which after fitting each estimator using the training data, it will be dumped in the buffer. During the evaluating stage, the dumped estimators are reloaded into the memory sequentially to evaluate the testing data.

By setting partial_mode to True, the partial mode is activated, and a local buffer will be created in the current directory. The partial mode reduces the running memory cost when training the deep forest; a minimal sketch follows the parameter list below.

Parameters
  • X – The training data. Internally, it will be converted to np.uint8.

  • y (numpy.ndarray of shape (n_samples,)) – The class labels of input samples.

  • sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted.
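
A minimal sketch of the partial mode, reusing the training split from the example above; the buffer location and lifecycle are managed by the model itself:

    from deepforest import CascadeForestClassifier

    model_partial = CascadeForestClassifier(partial_mode=True)
    model_partial.fit(X_train, y_train)      # estimators are dumped to a local buffer
    y_pred = model_partial.predict(X_test)   # estimators are reloaded sequentially
    model_partial.clean()                    # remove the local buffer when done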

predict_proba(X)

Predict class probabilities for X.

Parameters

X – The input samples. Internally, its dtype will be converted to np.uint8.

Returns

proba – The class probabilities of the input samples.

Return type

numpy.ndarray of shape (n_samples, n_classes)

predict(X)

Predict class for X.

Parameters

X – The input samples. Internally, its dtype will be converted to np.uint8.

Returns

y – The predicted classes.

Return type

numpy.ndarray of shape (n_samples,)
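
Continuing with the fitted classifier from the first example, the sketch below shows how the two prediction methods relate; the argmax equivalence holds here because the digits labels are exactly the integers 0-9:

    import numpy as np

    proba = model.predict_proba(X_test)   # shape (n_samples, n_classes)
    y_pred = model.predict(X_test)        # shape (n_samples,)

    # The column with the highest probability matches the predicted class.
    assert np.array_equal(np.argmax(proba, axis=1), y_pred)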

clean()

Clean the buffer created by the model.

get_estimator(layer_idx, est_idx, estimator_type)

Get estimator from a cascade layer in the deep forest.

Parameters
  • layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].

  • est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators - 1].

  • estimator_type ({"rf", "erf", "custom"}) –

    Specify the forest type.

    • If "rf", return the random forest.

    • If "erf", return the extremely random forest.

    • If "custom", return the customized estimator, which is only applicable when customized estimators were set via set_estimator().

Returns

estimator – The estimator with the given index.

Return type

object
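
A short sketch on a fitted model; printing the returned object is illustrative only:

    # Retrieve the first random forest from the first cascade layer.
    rf = model.get_estimator(layer_idx=0, est_idx=0, estimator_type="rf")
    print(rf)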

get_layer_feature_importances(layer_idx)

Return the feature importances of layer_idx-th cascade layer.

Parameters

layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].

Returns

feature_importances_ – The impurity-based feature importances of the cascade layer. Note that the number of input features differs between the first cascade layer and the remaining cascade layers.

Return type

numpy.ndarray of shape (n_features,)

Note

  • This method is only applicable when the deep forest is built using the sklearn backend (see the sketch below).

  • The functionality of this method is not available when using customized estimators in deep forest.
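
A minimal sketch, assuming the model is rebuilt with the sklearn backend as the note requires; the comment on layer inputs reflects the cascade design, in which later layers also receive the augmented features produced by the previous layer:

    from deepforest import CascadeForestClassifier

    model_skl = CascadeForestClassifier(backend="sklearn")
    model_skl.fit(X_train, y_train)

    # Importances of the first cascade layer; later layers have longer
    # importance vectors because of the augmented input features.
    importances = model_skl.get_layer_feature_importances(0)
    print(importances.shape)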

load(dirname)

Load the model from the directory dirname.

Parameters

dirname (str) – The name of the input directory.

Note

The model restored by load() is not exactly the same as the model before saving, because many objects irrelevant to model inference are not saved.

save(dirname='model')

Save the model to the directory dirname.

Parameters

dirname (str, default="model") – The name of the output directory.

Warning

Other methods on model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.
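
A sketch of the intended round trip: save() on a fitted model, then load() on a freshly constructed instance:

    model.save("model")                  # writes to ./model

    new_model = CascadeForestClassifier()
    new_model.load("model")
    y_pred = new_model.predict(X_test)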

set_estimator(estimators, n_splits=5)

Specify the custom base estimators for cascade layers.

Parameters
  • estimators (list) – A list of your base estimators, which will be used in all cascade layers.

  • n_splits (int, default=5) – The number of folds, must be at least 2.

set_predictor(predictor)

Specify the custom predictor concatenated to deep forest.

Parameters

predictor (object) – The instantiated object of your predictor.
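
The sketch below combines set_estimator() and set_predictor() on an unfitted model; the scikit-learn estimator choices are illustrative only:

    from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

    from deepforest import CascadeForestClassifier

    model = CascadeForestClassifier()

    # Custom base estimators used in every cascade layer; n_splits sets the
    # number of internal folds (must be at least 2).
    model.set_estimator(
        [RandomForestClassifier(n_estimators=100),
         ExtraTreesClassifier(n_estimators=100)],
        n_splits=5,
    )

    # An instantiated predictor object, concatenated after the last layer.
    model.set_predictor(RandomForestClassifier(n_estimators=100))

    model.fit(X_train, y_train)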

CascadeForestRegressor

fit(X, y[, sample_weight])

Build a deep forest using the training data.

predict(X)

Predict regression target for X.

clean()

Clean the buffer created by the model.

get_estimator(layer_idx, est_idx, estimator_type)

Get estimator from a cascade layer in the deep forest.

get_layer_feature_importances(layer_idx)

Return the feature importances of layer_idx-th cascade layer.

load(dirname)

Load the model from the directory dirname.

save([dirname])

Save the model to the directory dirname.

set_estimator(estimators[, n_splits])

Specify the custom base estimators for cascade layers.

set_predictor(predictor)

Specify the custom predictor concatenated to deep forest.

class deepforest.CascadeForestRegressor(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='mse', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)

Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.RegressorMixin

Implementation of the deep forest for regression.

Parameters
  • n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.

  • bin_subsample (int, default=200000) – The number of samples used to construct the discrete feature bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.

  • bin_type ({"percentile", "interval"}, default="percentile") –

    The type of binner used to bin feature values into integer-valued bins.

    • If "percentile", each bin will have approximately the same number of distinct feature values.

    • If "interval", each bin will have approximately the same size.

  • max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Notice that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.

  • criterion ({"mse", "mae"}, default="mse") – The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.

  • n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains one RandomForestRegressor and one ExtraTreesRegressor.

  • n_trees (int, default=100) – The number of trees in each estimator.

  • max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.

  • min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.

  • min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.

  • use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of deep forest.

  • predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter will have no effect.

  • predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from the deep forest. If use_predictor is False, this parameter will have no effect.

  • backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are "custom" for higher time and memory efficiency and "sklearn" for additional functionality.

  • n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set does not improve compared against the best validation performance achieved so far for n_tolerant_rounds rounds.

  • delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.

  • partial_mode (bool, default=False) –

    Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode.

    • If True, the partial mode is activated and all fitted estimators will be dumped in a local buffer;

    • If False, all fitted estimators are directly stored in the memory.

  • n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

  • random_state (int or None, default=None) –

    • If int, random_state is the seed used by the random number generator;

    • If None, the random number generator is the RandomState instance used by np.random.

  • verbose (int, default=1) –

    Controls the verbosity when fitting and predicting.

    • If <= 0, silent mode, which means no logging information will be displayed;

    • If 1, logging information on the cascade layer level will be displayed;

    • If > 1, full logging information will be displayed.
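
A minimal regression sketch mirroring the classifier example above; the dataset and metric come from scikit-learn and are illustrative only:

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    from deepforest import CascadeForestRegressor

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    model = CascadeForestRegressor(random_state=1)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print("Testing MSE: {:.3f}".format(mean_squared_error(y_test, y_pred)))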

fit(X, y, sample_weight=None)

Build a deep forest using the training data.

Note

Deep forest supports two kinds of modes for training:

  • Full memory mode, in which the training / testing data and all fitted estimators are directly stored in the memory.

  • Partial mode, in which after fitting each estimator using the training data, it will be dumped in the buffer. During the evaluating stage, the dumped estimators are reloaded into the memory sequentially to evaluate the testing data.

By setting partial_mode to True, the partial mode is activated, and a local buffer will be created in the current directory. The partial mode reduces the running memory cost when training the deep forest.

Parameters
  • X – The training data. Internally, it will be converted to np.uint8.

  • y (numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)) – The target values of input samples.

  • sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted.

predict(X)

Predict regression target for X.

Parameters

X – The input samples. Internally, its dtype will be converted to np.uint8.

Returns

y – The predicted values.

Return type

numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)
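
The multi-output case referenced in fit() above, sketched with synthetic targets; the shapes follow the signatures documented here:

    import numpy as np

    from deepforest import CascadeForestRegressor

    rng = np.random.RandomState(1)
    X = rng.rand(200, 10)
    # Two synthetic targets stacked into shape (n_samples, n_outputs).
    y = np.stack([X.sum(axis=1), X.prod(axis=1)], axis=1)

    model = CascadeForestRegressor(random_state=1)
    model.fit(X, y)
    print(model.predict(X).shape)    # (200, 2)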

clean()

Clean the buffer created by the model.

get_estimator(layer_idx, est_idx, estimator_type)

Get estimator from a cascade layer in the deep forest.

Parameters
  • layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].

  • est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators - 1].

  • estimator_type ({"rf", "erf", "custom"}) –

    Specify the forest type.

    • If "rf", return the random forest.

    • If "erf", return the extremely random forest.

    • If "custom", return the customized estimator, which is only applicable when customized estimators were set via set_estimator().

Returns

estimator – The estimator with the given index.

Return type

object

get_layer_feature_importances(layer_idx)

Return the feature importances of layer_idx-th cascade layer.

Parameters

layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].

Returns

feature_importances_ – The impurity-based feature importances of the cascade layer. Note that the number of input features differs between the first cascade layer and the remaining cascade layers.

Return type

numpy.ndarray of shape (n_features,)

Note

  • This method is only applicable when the deep forest is built using the sklearn backend.

  • The functionality of this method is not available when using customized estimators in deep forest.

load(dirname)

Load the model from the directory dirname.

Parameters

dirname (str) – The name of the input directory.

Note

The model restored by load() is not exactly the same as the model before saving, because many objects irrelevant to model inference are not saved.

save(dirname='model')

Save the model to the directory dirname.

Parameters

dirname (str, default="model") – The name of the output directory.

Warning

Other methods on model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.

set_estimator(estimators, n_splits=5)

Specify the custom base estimators for cascade layers.

Parameters
  • estimators (list) – A list of your base estimators, which will be used in all cascade layers.

  • n_splits (int, default=5) – The number of folds, must be at least 2.

set_predictor(predictor)

Specify the custom predictor concatenated to deep forest.

Parameters

predictor (object) – The instantiated object of your predictor.