API Reference
Below is the class and function reference for deepforest. Note that the package is still under active development, and some features may not be stable yet.
CascadeForestClassifier
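fit(X, y[, sample_weight]) – Build a deep forest using the training data.
predict_proba(X) – Predict class probabilities for X.
predict(X) – Predict class for X.
clean() – Clean the buffer created by the model.
get_estimator(layer_idx, est_idx, estimator_type) – Get estimator from a cascade layer in the deep forest.
get_layer_feature_importances(layer_idx) – Return the feature importances of the layer_idx-th cascade layer.
load(dirname) – Load the model from the directory dirname.
save([dirname]) – Save the model to the directory dirname.
set_estimator(estimators[, n_splits]) – Specify the custom base estimators for cascade layers.
set_predictor(predictor) – Specify the custom predictor concatenated to deep forest.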
class deepforest.CascadeForestClassifier(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='gini', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)
Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.ClassifierMixin
Implementation of the deep forest for classification.
Parameters:
n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.
bin_subsample (int, default=200000) – The number of samples used to construct feature discrete bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.
bin_type ({"percentile", "interval"}, default="percentile") – The type of binner used to bin feature values into integer-valued bins.
- If "percentile", each bin will have approximately the same number of distinct feature values.
- If "interval", each bin will have approximately the same size.
max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Note that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.
criterion ({"gini", "entropy"}, default="gini") – The function to measure the quality of a split. Supported criteria are gini for the Gini impurity and entropy for the information gain. Note: this parameter is tree-specific.
n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains a RandomForestClassifier and an ExtraTreesClassifier.
n_trees (int, default=100) – The number of trees in each estimator.
max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.
min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.
min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.
use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of deep forest.
predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter will have no effect.
predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from deep forest. If use_predictor is False, this parameter will have no effect.
backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are custom for higher time and memory efficiency and sklearn for additional functionality.
n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set has not improved over the best validation performance achieved so far for n_tolerant_rounds rounds.
delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.
partial_mode (bool, default=False) – Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode.
- If True, the partial mode is activated and all fitted estimators will be dumped in a local buffer;
- If False, all fitted estimators are directly stored in the memory.
n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
random_state (int or None, default=None) –
- If int, random_state is the seed used by the random number generator;
- If None, the random number generator is the RandomState instance used by np.random.
verbose (int, default=1) – Controls the verbosity when fitting and predicting.
- If <= 0, silent mode, which means no logging information will be displayed;
- If 1, logging information on the cascade layer level will be displayed;
- If > 1, full logging information will be displayed.
fit(X, y, sample_weight=None)
Build a deep forest using the training data.
Note
Deep forest supports two kinds of modes for training:
- Full memory mode, in which the training / testing data and all fitted estimators are directly stored in the memory.
- Partial mode, in which each estimator is dumped in the buffer right after it is fitted on the training data. During the evaluating stage, the dumped estimators are reloaded into the memory sequentially to evaluate the testing data.
By setting partial_mode to True, the partial mode is activated, and a local buffer will be created at the current directory. The partial mode is able to reduce the running memory cost when training the deep forest.
Parameters:
X – The training data. Internally, it will be converted to np.uint8.
y (numpy.ndarray of shape (n_samples,)) – The class labels of input samples.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted.
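The conversion of X to np.uint8 presumably corresponds to the binning step configured by n_bins and bin_type. The sketch below illustrates the two bin_type strategies in pure Python as they are described in the parameter list. It is not deepforest's binner: the helper names (interval_edges, percentile_edges, to_bin) are hypothetical, and "same size" for "interval" is read here as "same width".

```python
from bisect import bisect_right


def interval_edges(values, n_bins):
    """Equal-width edges: every bin spans the same range of values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def percentile_edges(values, n_bins):
    """Edges chosen so each bin holds about the same number of distinct values."""
    distinct = sorted(set(values))
    step = len(distinct) / n_bins
    return [distinct[int(step * i)] for i in range(1, n_bins)]


def to_bin(value, edges):
    """Map a raw value to an integer bin index in [0, len(edges)]."""
    return bisect_right(edges, value)


# One outlier (100) dominates the value range.
values = [1, 2, 3, 4, 5, 6, 7, 100]
n_bins = 4

interval_bins = [to_bin(v, interval_edges(values, n_bins)) for v in values]
percentile_bins = [to_bin(v, percentile_edges(values, n_bins)) for v in values]

print(interval_bins)    # the outlier pushes all other values into bin 0
print(percentile_bins)  # two distinct values land in each bin
```

With skewed features such as this one, equal-width bins waste most of the available bin indices, which is one plausible reason for "percentile" being the default.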
predict_proba(X)
Predict class probabilities for X.
Parameters:
X – The input samples. Internally, its dtype will be converted to np.uint8.
Returns:
proba – The class probabilities of the input samples.
Return type:
numpy.ndarray of shape (n_samples, n_classes)
predict(X)
Predict class for X.
Parameters:
X – The input samples. Internally, its dtype will be converted to np.uint8.
Returns:
y – The predicted classes.
Return type:
numpy.ndarray of shape (n_samples,)
clean()
Clean the buffer created by the model.
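The buffer removed by clean() is the one created in partial mode. The following sketch shows the general idea of that mode using only the standard library: each fitted object is written to disk right away, reloaded one at a time for evaluation, and the directory is deleted at the end. The toy "estimator" dictionaries, the file names, and the use of pickle are all assumptions made for illustration; the layout of deepforest's real buffer is not described in this reference.

```python
import os
import pickle
import shutil
import tempfile


def toy_predict(estimator, x):
    """Stand-in for a forest prediction: add a stored offset."""
    return x + estimator["offset"]


# Training stage: "fit" three toy estimators and dump each one immediately,
# so that only one of them has to live in memory at any time.
buffer_dir = tempfile.mkdtemp(prefix="buffer_")
paths = []
for i, offset in enumerate([1, 10, 100]):
    estimator = {"offset": offset}
    path = os.path.join(buffer_dir, f"estimator_{i}.pkl")
    with open(path, "wb") as f:
        pickle.dump(estimator, f)
    paths.append(path)
    del estimator  # the fitted object is no longer held in memory

# Evaluating stage: reload the dumped estimators sequentially.
outputs = []
for path in paths:
    with open(path, "rb") as f:
        estimator = pickle.load(f)
    outputs.append(toy_predict(estimator, 5))

# Counterpart of clean(): delete the buffer once it is no longer needed.
shutil.rmtree(buffer_dir)
buffer_removed = not os.path.exists(buffer_dir)
```

The trade-off is extra disk I/O during training and prediction in exchange for a lower peak memory footprint.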
get_estimator(layer_idx, est_idx, estimator_type)
Get estimator from a cascade layer in the deep forest.
Parameters:
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators].
estimator_type ({"rf", "erf", "custom"}) – Specify the forest type.
- If rf, return the random forest.
- If erf, return the extremely random forest.
- If custom, return the customized estimator, only applicable when using customized estimators in deep forest via set_estimator().
Returns:
estimator – Estimator with the given index.
get_layer_feature_importances(layer_idx)
Return the feature importances of the layer_idx-th cascade layer.
Parameters:
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
Returns:
feature_importances_ – The impurity-based feature importances of the cascade layer. Note that the number of input features differs between the first cascade layer and the remaining cascade layers.
Return type:
numpy.ndarray of shape (n_features,)
Note
- This method is only applicable when deep forest is built using the sklearn backend.
- The functionality of this method is not available when using customized estimators in deep forest.
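Why the number of input features differs between layers can be sketched with a little arithmetic. The example assumes the usual cascade design in which every forest of the previous layer appends one probability per class to the original features; that design, and the helper names below, are assumptions for illustration and are not stated in this reference. The factor of 2 comes from the n_estimators description: each estimator holds a random forest and an extra-trees forest.

```python
def forests_per_layer(n_estimators):
    """Each estimator contains one random forest and one extra-trees forest."""
    return 2 * n_estimators


def layer_input_width(layer_idx, n_features, n_estimators, n_classes):
    """Assumed number of input features seen by a cascade layer."""
    if layer_idx == 0:
        # The first layer only sees the original features.
        return n_features
    # Assumption: later layers see the original features plus one
    # probability per class from every forest of the previous layer.
    return n_features + forests_per_layer(n_estimators) * n_classes


widths = [
    layer_input_width(i, n_features=20, n_estimators=2, n_classes=3)
    for i in range(3)
]
print(widths)
```

Under these assumptions, the importance vector returned for layer 0 would have 20 entries, while the vectors for later layers would have 32.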
load(dirname)
Load the model from the directory dirname.
Parameters:
dirname (str) – The name of the input directory.
Note
The model obtained after calling load() is not exactly the same as the model before saving, because many objects irrelevant to model inference will not be saved.
save(dirname='model')
Save the model to the directory dirname.
Parameters:
dirname (str, default="model") – The name of the output directory.
Warning
Other methods on model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.
set_estimator(estimators, n_splits=5)
Specify the custom base estimators for cascade layers.
Parameters:
estimators (list) – A list of your base estimators, which will be used in all cascade layers.
n_splits (int, default=5) – The number of folds, must be at least 2.
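The n_splits parameter suggests that custom estimators are fitted with K-fold cross-validation, although this reference does not say so explicitly. As a hedged illustration of what "folds" means, the pure-Python sketch below splits sample indices into n_splits contiguous folds and enforces the same lower bound of 2. The function kfold_indices is hypothetical and is not part of deepforest.

```python
def kfold_indices(n_samples, n_splits=5):
    """Split range(n_samples) into n_splits (train, held_out) index pairs."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")

    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]

    folds = []
    start = 0
    for size in fold_sizes:
        stop = start + size
        held_out = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, held_out))
        start = stop
    return folds


folds = kfold_indices(7, n_splits=3)
for train, held_out in folds:
    print(train, held_out)
```

Every sample is held out exactly once, which is why at least two folds are needed: with a single fold there would be nothing left to train on.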
set_predictor(predictor)
Specify the custom predictor concatenated to deep forest.
Parameters:
predictor (object) – The instantiated object of your predictor.
CascadeForestRegressor
fit(X, y[, sample_weight]) – Build a deep forest using the training data.
predict(X) – Predict regression target for X.
clean() – Clean the buffer created by the model.
get_estimator(layer_idx, est_idx, estimator_type) – Get estimator from a cascade layer in the deep forest.
get_layer_feature_importances(layer_idx) – Return the feature importances of the layer_idx-th cascade layer.
load(dirname) – Load the model from the directory dirname.
save([dirname]) – Save the model to the directory dirname.
set_estimator(estimators[, n_splits]) – Specify the custom base estimators for cascade layers.
set_predictor(predictor) – Specify the custom predictor concatenated to deep forest.
class deepforest.CascadeForestRegressor(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='mse', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)
Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.RegressorMixin
Implementation of the deep forest for regression.
Parameters:
n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.
bin_subsample (int, default=200000) – The number of samples used to construct feature discrete bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.
bin_type ({"percentile", "interval"}, default="percentile") – The type of binner used to bin feature values into integer-valued bins.
- If "percentile", each bin will have approximately the same number of distinct feature values.
- If "interval", each bin will have approximately the same size.
max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Note that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.
criterion ({"mse", "mae"}, default="mse") – The function to measure the quality of a split. Supported criteria are mse for the mean squared error, which is equal to variance reduction as feature selection criterion, and mae for the mean absolute error.
n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains a RandomForestRegressor and an ExtraTreesRegressor.
n_trees (int, default=100) – The number of trees in each estimator.
max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.
min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.
min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.
use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of deep forest.
predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter will have no effect.
predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from deep forest. If use_predictor is False, this parameter will have no effect.
backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are custom for higher time and memory efficiency and sklearn for additional functionality.
n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set has not improved over the best validation performance achieved so far for n_tolerant_rounds rounds.
delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.
partial_mode (bool, default=False) – Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode.
- If True, the partial mode is activated and all fitted estimators will be dumped in a local buffer;
- If False, all fitted estimators are directly stored in the memory.
n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
random_state (int or None, default=None) –
- If int, random_state is the seed used by the random number generator;
- If None, the random number generator is the RandomState instance used by np.random.
verbose (int, default=1) – Controls the verbosity when fitting and predicting.
- If <= 0, silent mode, which means no logging information will be displayed;
- If 1, logging information on the cascade layer level will be displayed;
- If > 1, full logging information will be displayed.
fit(X, y, sample_weight=None)
Build a deep forest using the training data.
Note
Deep forest supports two kinds of modes for training:
- Full memory mode, in which the training / testing data and all fitted estimators are directly stored in the memory.
- Partial mode, in which each estimator is dumped in the buffer right after it is fitted on the training data. During the evaluating stage, the dumped estimators are reloaded into the memory sequentially to evaluate the testing data.
By setting partial_mode to True, the partial mode is activated, and a local buffer will be created at the current directory. The partial mode is able to reduce the running memory cost when training the deep forest.
Parameters:
X – The training data. Internally, it will be converted to np.uint8.
y (numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)) – The target values of input samples.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted.
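During fitting, the cascade grows layer by layer until early stopping fires or max_layers is reached. The interplay of n_tolerant_rounds and delta described in the constructor parameters can be pictured as a simple counter. The sketch below is an illustration of that description, not deepforest's code: layers_before_stop is a hypothetical helper, and it assumes a score where higher is better. For an error metric such as the mean squared error, the comparison would be reversed.

```python
def layers_before_stop(scores, n_tolerant_rounds=2, delta=1e-5):
    """Return how many layers are fitted before early stopping fires.

    scores[i] is the validation score of layer i (higher is better).
    """
    if not scores:
        return 0

    best = scores[0]
    stalled = 0  # consecutive layers without a sufficient improvement
    for n_fitted, score in enumerate(scores[1:], start=2):
        if score > best + delta:
            # Improvement of more than delta: new best, reset the counter.
            best = score
            stalled = 0
        else:
            stalled += 1
            if stalled == n_tolerant_rounds:
                return n_fitted
    # Early stopping never fired; every layer was fitted.
    return len(scores)


# Layers 3 and 4 fail to beat the best score (0.85), so training stops
# after the fourth layer and the fifth score is never reached.
print(layers_before_stop([0.80, 0.85, 0.85, 0.84, 0.90]))
```

Whether the stalled layers are kept in the final model is not stated in this reference, so the sketch only counts fitted layers.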
predict(X)
Predict regression target for X.
Parameters:
X – The input samples. Internally, its dtype will be converted to np.uint8.
Returns:
y – The predicted values.
Return type:
numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)
clean()
Clean the buffer created by the model.
get_estimator(layer_idx, est_idx, estimator_type)
Get estimator from a cascade layer in the deep forest.
Parameters:
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators].
estimator_type ({"rf", "erf", "custom"}) – Specify the forest type.
- If rf, return the random forest.
- If erf, return the extremely random forest.
- If custom, return the customized estimator, only applicable when using customized estimators in deep forest via set_estimator().
Returns:
estimator – Estimator with the given index.
get_layer_feature_importances(layer_idx)
Return the feature importances of the layer_idx-th cascade layer.
Parameters:
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
Returns:
feature_importances_ – The impurity-based feature importances of the cascade layer. Note that the number of input features differs between the first cascade layer and the remaining cascade layers.
Return type:
numpy.ndarray of shape (n_features,)
Note
- This method is only applicable when deep forest is built using the sklearn backend.
- The functionality of this method is not available when using customized estimators in deep forest.
load(dirname)
Load the model from the directory dirname.
Parameters:
dirname (str) – The name of the input directory.
Note
The model obtained after calling load() is not exactly the same as the model before saving, because many objects irrelevant to model inference will not be saved.
save(dirname='model')
Save the model to the directory dirname.
Parameters:
dirname (str, default="model") – The name of the output directory.
Warning
Other methods on model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.
set_estimator(estimators, n_splits=5)
Specify the custom base estimators for cascade layers.
Parameters:
estimators (list) – A list of your base estimators, which will be used in all cascade layers.
n_splits (int, default=5) – The number of folds, must be at least 2.
set_predictor(predictor)
Specify the custom predictor concatenated to deep forest.
Parameters:
predictor (object) – The instantiated object of your predictor.