API Reference
Below is the class and function reference for deepforest. Note that the package is still under active development, and some features may not be stable yet.
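CascadeForestClassifier
fit(X, y, sample_weight=None) | Build a deep forest using the training data.
predict_proba(X) | Predict class probabilities for X.
predict(X) | Predict class for X.
clean() | Clean the buffer created by the model.
get_estimator(layer_idx, est_idx, estimator_type) | Get estimator from a cascade layer in the deep forest.
get_layer_feature_importances(layer_idx) | Return the feature importances of the layer_idx-th cascade layer.
load(dirname) | Load the model from the directory dirname.
save(dirname='model') | Save the model to the directory dirname.
set_estimator(estimators, n_splits=5) | Specify the custom base estimators for cascade layers.
set_predictor(predictor) | Specify the custom predictor concatenated to deep forest.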
- class deepforest.CascadeForestClassifier(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='gini', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)
Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.ClassifierMixin
Implementation of the deep forest for classification.
- Parameters
n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.
bin_subsample (int, default=200,000) – The number of samples used to construct feature discrete bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.
bin_type ({"percentile", "interval"}, default="percentile") – The type of binner used to bin feature values into integer-valued bins.
If "percentile", each bin will have approximately the same number of distinct feature values.
If "interval", each bin will have approximately the same size.
max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Note that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.
criterion ({"gini", "entropy"}, default="gini") – The function to measure the quality of a split. Supported criteria are gini for the Gini impurity and entropy for the information gain. Note: this parameter is tree-specific.
n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains a RandomForestClassifier and an ExtraTreesClassifier.
n_trees (int, default=100) – The number of trees in each estimator.
max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.
min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.
min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.
use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of deep forest.
predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter will have no effect.
predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from deep forest. If use_predictor is False, this parameter will have no effect.
backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are custom for higher time and memory efficiency and sklearn for additional functionality.
n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set does not improve compared against the best validation performance achieved so far for n_tolerant_rounds rounds.
delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.
partial_mode (bool, default=False) – Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode.
If True, the partial mode is activated and all fitted estimators will be dumped in a local buffer;
If False, all fitted estimators are directly stored in the memory.
n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
random_state (int or None, default=None) –
If int, random_state is the seed used by the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
verbose (int, default=1) – Controls the verbosity when fitting and predicting.
If <= 0, silent mode, which means no logging information will be displayed;
If 1, logging information on the cascade layer level will be displayed;
If > 1, full logging information will be displayed.
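The interplay of n_tolerant_rounds and delta described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the function name, the list of per-layer scores, and the choice to keep only the layers up to the best one are assumptions made for illustration.

```python
def layers_kept(layer_scores, n_tolerant_rounds=2, delta=1e-5):
    """Return how many cascade layers survive early stopping.

    `layer_scores` is a hypothetical list of validation scores
    (higher is better), one per candidate cascade layer.
    """
    best_score = float("-inf")
    best_layer = -1
    rounds_without_gain = 0

    for layer_idx, score in enumerate(layer_scores):
        if score > best_score + delta:
            # The layer improved on the best score by more than delta.
            best_score = score
            best_layer = layer_idx
            rounds_without_gain = 0
        else:
            # No sufficient improvement: count one tolerant round.
            rounds_without_gain += 1
            if rounds_without_gain >= n_tolerant_rounds:
                break  # stop growing the cascade

    # Assumption: layers fitted after the best one are discarded.
    return best_layer + 1


kept = layers_kept([0.80, 0.85, 0.85, 0.84, 0.90], n_tolerant_rounds=2)
```

With these scores the third and fourth layers fail to improve on 0.85, so training stops before the fifth layer is ever fitted. A larger n_tolerant_rounds would let the cascade reach it.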
- fit(X, y, sample_weight=None)
Build a deep forest using the training data.
Note
Deep forest supports two kinds of modes for training:
Full memory mode, in which the training / testing data and all fitted estimators are directly stored in the memory.
Partial mode, in which after fitting each estimator using the training data, it will be dumped in the buffer. During the evaluating stage, the dumped estimators are reloaded into the memory sequentially to evaluate the testing data.
By setting partial_mode to True, the partial mode is activated, and a local buffer will be created in the current directory. The partial mode can reduce the running memory cost when training the deep forest.
- Parameters
X – The training data. Internally, it will be converted to np.uint8.
y (numpy.ndarray of shape (n_samples,)) – The class labels of input samples.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted.
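The partial mode described in the note above can be pictured with a small standard-library sketch. Everything here is a stand-in: the "estimators" are plain dicts, the helper names are hypothetical, and pickle is used only to keep the example self-contained; it does not show how deepforest manages its buffer.

```python
import os
import pickle
import tempfile


def fit_partial(estimators, buffer_dir):
    """Dump each 'fitted' estimator to the buffer and keep only its path."""
    paths = []
    for idx, est in enumerate(estimators):
        path = os.path.join(buffer_dir, f"estimator_{idx}.pkl")
        with open(path, "wb") as f:
            pickle.dump(est, f)
        paths.append(path)
    return paths


def predict_partial(paths, x):
    """Reload estimators one at a time and average their outputs."""
    total = 0.0
    for path in paths:
        with open(path, "rb") as f:
            est = pickle.load(f)  # only one estimator is in memory at a time
        total += est["slope"] * x
    return total / len(paths)


with tempfile.TemporaryDirectory() as buffer_dir:
    # Toy stand-ins for fitted forests.
    toy_estimators = [{"slope": 1.0}, {"slope": 3.0}]
    paths = fit_partial(toy_estimators, buffer_dir)
    prediction = predict_partial(paths, x=2.0)
```

The trade-off is the usual one: peak memory is bounded by a single estimator, at the cost of extra disk I/O during evaluation.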
- predict_proba(X)
Predict class probabilities for X.
- Parameters
X – The input samples. Internally, its dtype will be converted to np.uint8.
- Returns
proba – The class probabilities of the input samples.
- Return type
numpy.ndarray of shape (n_samples, n_classes)
- predict(X)
Predict class for X.
- Parameters
X – The input samples. Internally, its dtype will be converted to np.uint8.
- Returns
y – The predicted classes.
- Return type
numpy.ndarray of shape (n_samples,)
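The shapes documented for predict_proba() and predict() suggest the usual relationship between the two: the predicted class is the one with the highest probability. The sketch below assumes that relationship and uses plain lists instead of numpy arrays; the helper name and class labels are hypothetical.

```python
def predict_from_proba(proba, classes):
    """Pick, for each sample, the class with the highest probability."""
    labels = []
    for row in proba:
        # Index of the largest probability in this row.
        best_idx = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_idx])
    return labels


# Two samples, three classes: shape (n_samples, n_classes).
proba = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
labels = predict_from_proba(proba, classes=["cat", "dog", "bird"])
```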
- clean()
Clean the buffer created by the model.
- get_estimator(layer_idx, est_idx, estimator_type)
Get estimator from a cascade layer in the deep forest.
- Parameters
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators - 1].
estimator_type ({"rf", "erf", "custom"}) – Specify the forest type.
If rf, return the random forest.
If erf, return the extremely random forest.
If custom, return the customized estimator, only applicable when using customized estimators in deep forest via set_estimator().
- Returns
estimator – The estimator with the given index.
- get_layer_feature_importances(layer_idx)
Return the feature importances of the layer_idx-th cascade layer.
- Parameters
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
- Returns
feature_importances_ – The impurity-based feature importances of the cascade layer. Note that the number of input features differs between the first cascade layer and the remaining cascade layers.
- Return type
numpy.ndarray of shape (n_features,)
Note
This method is only applicable when deep forest is built using the sklearn backend.
The functionality of this method is not available when using customized estimators in deep forest.
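Why the number of input features differs between layers can be sketched as follows. The formula is an assumption based on the general cascade design, in which each later layer sees the original features plus the outputs of every forest in the preceding layer; the function name is hypothetical and the exact bookkeeping in deepforest may differ.

```python
def layer_input_width(layer_idx, n_features, n_estimators, n_outputs):
    """Estimate the number of input features seen by a cascade layer.

    Assumption: every layer after the first receives the original
    features concatenated with one output vector per forest of the
    previous layer, and each estimator holds two forests.
    """
    if layer_idx == 0:
        return n_features
    n_forests = 2 * n_estimators
    return n_features + n_forests * n_outputs


first = layer_input_width(0, n_features=10, n_estimators=2, n_outputs=3)
later = layer_input_width(1, n_features=10, n_estimators=2, n_outputs=3)
```

Under this assumption the importance vector of any layer after the first is longer than that of the first layer, so the two should not be compared index by index.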
- load(dirname)
Load the model from the directory dirname.
- Parameters
dirname (str) – The name of the input directory.
Note
The model restored by calling load() is not exactly the same as the model before saving, because many objects irrelevant to model inference will not be saved.
- save(dirname='model')
Save the model to the directory dirname.
- Parameters
dirname (str, default="model") – The name of the output directory.
Warning
Other methods of model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.
- set_estimator(estimators, n_splits=5)
Specify the custom base estimators for cascade layers.
- Parameters
estimators (list) – A list of your base estimators, which will be used in all cascade layers.
n_splits (int, default=5) – The number of folds, must be at least 2.
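As a rough illustration of what a fold count such as n_splits controls, the sketch below splits sample indices into contiguous folds in pure Python. It assumes the folds are used for k-fold cross-validation of the custom estimators; the function is hypothetical and not part of deepforest.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Split range(n_samples) into (train, validation) index pairs."""
    if n_splits < 2:
        raise ValueError("n_splits must be at least 2")

    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    folds = []
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds


folds = k_fold_indices(10, n_splits=5)
```

Each sample appears in exactly one validation fold, which is why at least two folds are needed: with a single fold there would be no samples left to train on.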
- set_predictor(predictor)
Specify the custom predictor concatenated to deep forest.
- Parameters
predictor (object) – The instantiated object of your predictor.
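CascadeForestRegressor
fit(X, y, sample_weight=None) | Build a deep forest using the training data.
predict(X) | Predict regression target for X.
clean() | Clean the buffer created by the model.
get_estimator(layer_idx, est_idx, estimator_type) | Get estimator from a cascade layer in the deep forest.
get_layer_feature_importances(layer_idx) | Return the feature importances of the layer_idx-th cascade layer.
load(dirname) | Load the model from the directory dirname.
save(dirname='model') | Save the model to the directory dirname.
set_estimator(estimators, n_splits=5) | Specify the custom base estimators for cascade layers.
set_predictor(predictor) | Specify the custom predictor concatenated to deep forest.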
- class deepforest.CascadeForestRegressor(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='mse', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)
Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.RegressorMixin
Implementation of the deep forest for regression.
- Parameters
n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.
bin_subsample (int, default=200,000) – The number of samples used to construct feature discrete bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.
bin_type ({"percentile", "interval"}, default="percentile") – The type of binner used to bin feature values into integer-valued bins.
If "percentile", each bin will have approximately the same number of distinct feature values.
If "interval", each bin will have approximately the same size.
max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Note that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.
criterion ({"mse", "mae"}, default="mse") – The function to measure the quality of a split. Supported criteria are mse for the mean squared error, which is equal to variance reduction as feature selection criterion, and mae for the mean absolute error.
n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains a RandomForestRegressor and an ExtraTreesRegressor.
n_trees (int, default=100) – The number of trees in each estimator.
max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.
min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.
min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.
use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of deep forest.
predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter will have no effect.
predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from deep forest. If use_predictor is False, this parameter will have no effect.
backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are custom for higher time and memory efficiency and sklearn for additional functionality.
n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set does not improve compared against the best validation performance achieved so far for n_tolerant_rounds rounds.
delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.
partial_mode (bool, default=False) – Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode.
If True, the partial mode is activated and all fitted estimators will be dumped in a local buffer;
If False, all fitted estimators are directly stored in the memory.
n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
random_state (int or None, default=None) –
If int, random_state is the seed used by the random number generator;
If None, the random number generator is the RandomState instance used by np.random.
verbose (int, default=1) – Controls the verbosity when fitting and predicting.
If <= 0, silent mode, which means no logging information will be displayed;
If 1, logging information on the cascade layer level will be displayed;
If > 1, full logging information will be displayed.
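The difference between the two bin_type options can be seen on a small skewed feature. The sketch below is a simplified pure-Python illustration, not the library's binner: the helper names are hypothetical, it uses 4 bins instead of 255, and it ignores the extra bin reserved for missing values.

```python
from bisect import bisect_right


def percentile_edges(values, n_bins):
    """Edges that put roughly equal numbers of distinct values in each bin."""
    distinct = sorted(set(values))
    return [distinct[i * len(distinct) // n_bins] for i in range(1, n_bins)]


def interval_edges(values, n_bins):
    """Edges that cut the value range into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def to_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


values = [1, 2, 3, 4, 100, 200, 300, 400]
pct_bins = to_bins(values, percentile_edges(values, n_bins=4))
int_bins = to_bins(values, interval_edges(values, n_bins=4))
```

On this feature the percentile edges spread the values evenly across the four bins, while the equal-width edges push the five smallest values into bin 0. That is one reason equal-frequency binning tends to preserve more information on skewed features.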
- fit(X, y, sample_weight=None)
Build a deep forest using the training data.
Note
Deep forest supports two kinds of modes for training:
Full memory mode, in which the training / testing data and all fitted estimators are directly stored in the memory.
Partial mode, in which after fitting each estimator using the training data, it will be dumped in the buffer. During the evaluating stage, the dumped estimators are reloaded into the memory sequentially to evaluate the testing data.
By setting partial_mode to True, the partial mode is activated, and a local buffer will be created in the current directory. The partial mode can reduce the running memory cost when training the deep forest.
- Parameters
X – The training data. Internally, it will be converted to np.uint8.
y (numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)) – The target values of input samples.
sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, then samples are equally weighted.
- predict(X)
Predict regression target for X.
- Parameters
X – The input samples. Internally, its dtype will be converted to np.uint8.
- Returns
y – The predicted values.
- Return type
numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)
- clean()
Clean the buffer created by the model.
- get_estimator(layer_idx, est_idx, estimator_type)
Get estimator from a cascade layer in the deep forest.
- Parameters
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators - 1].
estimator_type ({"rf", "erf", "custom"}) – Specify the forest type.
If rf, return the random forest.
If erf, return the extremely random forest.
If custom, return the customized estimator, only applicable when using customized estimators in deep forest via set_estimator().
- Returns
estimator – The estimator with the given index.
- get_layer_feature_importances(layer_idx)
Return the feature importances of the layer_idx-th cascade layer.
- Parameters
layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_ - 1].
- Returns
feature_importances_ – The impurity-based feature importances of the cascade layer. Note that the number of input features differs between the first cascade layer and the remaining cascade layers.
- Return type
numpy.ndarray of shape (n_features,)
Note
This method is only applicable when deep forest is built using the sklearn backend.
The functionality of this method is not available when using customized estimators in deep forest.
- load(dirname)
Load the model from the directory dirname.
- Parameters
dirname (str) – The name of the input directory.
Note
The model restored by calling load() is not exactly the same as the model before saving, because many objects irrelevant to model inference will not be saved.
- save(dirname='model')
Save the model to the directory dirname.
- Parameters
dirname (str, default="model") – The name of the output directory.
Warning
Other methods of model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.
- set_estimator(estimators, n_splits=5)
Specify the custom base estimators for cascade layers.
- Parameters
estimators (list) – A list of your base estimators, which will be used in all cascade layers.
n_splits (int, default=5) – The number of folds, must be at least 2.
- set_predictor(predictor)
Specify the custom predictor concatenated to deep forest.
- Parameters
predictor (object) – The instantiated object of your predictor.