DF21 Documentation¶
DF21 is an implementation of Deep Forest 2021.2.1. It is designed to have the following advantages:
Powerful: Better accuracy than existing tree-based ensemble methods.
Easy to Use: Less effort on tuning parameters.
Efficient: Fast training speed and high efficiency.
Scalable: Capable of handling large-scale data.
DF21 offers an effective and powerful alternative to tree-based machine learning algorithms such as Random Forest or GBDT. This package is being actively developed, and any help would be welcome. Please check the homepage on Gitee or Github for details.
Guidepost¶
For a quick start, please refer to How to Get Started.
For guidance on tuning parameters for DF21, please refer to Parameters Tuning.
For a comparison between DF21 and other tree-based ensemble methods, please refer to Experiments.
Installation¶
DF21 can be installed from PyPI using pip, the package installer for Python. Refer to the pip documentation for details on installing packages from the Python Package Index and other indexes. Use the following command to install DF21:
$ pip install deep-forest
Quickstart¶
Classification¶
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deepforest import CascadeForestClassifier
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))
>>> Testing Accuracy: 98.667 %
Regression¶
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from deepforest import CascadeForestRegressor
X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestRegressor(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("\nTesting MSE: {:.3f}".format(mse))
>>> Testing MSE: 8.068
Resources¶
Reference¶
@article{zhou2019deep,
  title   = {Deep forest},
  author  = {Zhi-Hua Zhou and Ji Feng},
  journal = {National Science Review},
  volume  = {6},
  number  = {1},
  pages   = {74--86},
  year    = {2019}
}

@inproceedings{zhou2017deep,
  title     = {{Deep Forest:} Towards an alternative to deep neural networks},
  author    = {Zhi-Hua Zhou and Ji Feng},
  booktitle = {IJCAI},
  pages     = {3553--3559},
  year      = {2017}
}
How to Get Started¶
This is a quick start guide for you to try out deep forest. The full script is available at Example.
Installation¶
The package is available via PyPI. As a kind reminder, do not forget the hyphen (-) between deep and forest.
$ pip install deep-forest
Load Data¶
Deep forest assumes the data to be in the form of a 2D NumPy array of shape (n_samples, n_features). It conducts internal checks and transformations on the input data. For example, the code snippet below loads a toy dataset on digits classification:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
Define the Model¶
Deep forest provides unified APIs for binary classification and multi-class classification. For the demo dataset on classification, the corresponding model is CascadeForestClassifier:
from deepforest import CascadeForestClassifier
model = CascadeForestClassifier()
A key advantage of deep forest is its adaptive model complexity depending on the dataset. The default setting on hyper-parameters enables it to perform reasonably well across all datasets. Please refer to API Reference for the meaning of different input arguments.
Train and Evaluate¶
Deep forest provides Scikit-Learn-like APIs for training and evaluating. Given the training data X_train and labels y_train, the training stage is triggered with the following code snippet:
model.fit(X_train, y_train)
Once the model is trained, you can call predict() to produce prediction results on the testing data X_test.
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100 # classification accuracy
Save and Load¶
Deep forest also provides easy-to-use APIs for model serialization. Here, MODEL_DIR is the directory used to save the model.
model.save(MODEL_DIR)
Given the saved results, you can call load() to use deep forest for prediction:
new_model = CascadeForestClassifier()
new_model.load(MODEL_DIR)
Notice that new_model is not the same as model, because only the key information used for model inference was saved.
Example¶
Below is the full script on using deep forest for classification on a demo dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deepforest import CascadeForestClassifier
# Load data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
model = CascadeForestClassifier()
# Train and evaluate
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))
# Save the model
model.save("model")
Installation Guide¶
Stable Version¶
The stable version is available via PyPI using:
$ pip install deep-forest
The package is portable and has very few dependencies. It is recommended to use the package environment from Anaconda, since it already installs all required packages.
Notice that only the 64-bit Linux, Windows, and Mac-OS platforms are officially supported. To use deep forest on other platforms, you will need to build the entire package from source.
Building from Source¶
Building from source is required to work on a contribution (bug fix, new feature, code or documentation improvement).
Use Git to check out the latest source from the repository on Gitee or Github:
$ git clone https://gitee.com/lamda-nju/deep-forest.git
$ git clone https://github.com/LAMDA-NJU/Deep-Forest.git
$ cd Deep-Forest
Install a C compiler for your platform.
Note
The compiler is used to compile the Cython files in the package. Please refer to Installing Cython for details on choosing the compiler.
Optional (but recommended): create and activate a dedicated virtual environment or conda environment.
Build the project with pip in the editable mode:
$ pip install --verbose -e .
Note
The key advantage of the editable mode is that there is no need to re-install the entire package if you have modified a Python file. However, you still have to run the pip install --verbose -e . command again if the source code of a Cython file (ending with .pyx or .pxd) is updated.
Optional: run the tests on the module:
$ cd tests
$ pytest
Acknowledgement¶
The installation instructions were adapted from Scikit-Learn’s advanced installation instructions.
API Reference¶
Below is the class and function reference for deepforest. Notice that the package is still under active development, and some features may not be stable yet.
CascadeForestClassifier¶
Method | Description
---|---
fit(X, y[, sample_weight]) | Build a deep forest using the training data.
predict_proba(X) | Predict class probabilities for X.
predict(X) | Predict class for X.
clean() | Clean the buffer created by the model.
get_estimator(layer_idx, est_idx, estimator_type) | Get estimator from a cascade layer in the deep forest.
get_layer_feature_importances(layer_idx) | Return the feature importances of the layer_idx-th cascade layer.
load(dirname) | Load the model from the directory dirname.
save([dirname]) | Save the model to the directory dirname.
set_estimator(estimators[, n_splits]) | Specify the custom base estimators for cascade layers.
set_predictor(predictor) | Specify the custom predictor concatenated to deep forest.
class deepforest.CascadeForestClassifier(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='gini', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)¶

Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.ClassifierMixin

Implementation of the deep forest for classification.
Parameters

- n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.
- bin_subsample (int, default=200,000) – The number of samples used to construct feature discrete bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.
- bin_type ({"percentile", "interval"}, default="percentile") – The type of binner used to bin feature values into integer-valued bins. If "percentile", each bin will have approximately the same number of distinct feature values. If "interval", each bin will have approximately the same size.
- max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Notice that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.
- criterion ({"gini", "entropy"}, default="gini") – The function to measure the quality of a split. Supported criteria are "gini" for the Gini impurity and "entropy" for the information gain. Note: this parameter is tree-specific.
- n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains a RandomForestClassifier and an ExtraTreesClassifier, respectively.
- n_trees (int, default=100) – The number of trees in each estimator.
- max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.
- min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.
- min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.
- use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of the deep forest.
- predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter has no effect.
- predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from the deep forest. If use_predictor is False, this parameter has no effect.
- backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are "custom" for higher time and memory efficiency and "sklearn" for additional functionality.
- n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set does not improve compared against the best validation performance achieved so far for n_tolerant_rounds rounds.
- delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.
- partial_mode (bool, default=False) – Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode. If True, the partial mode is activated and all fitted estimators will be dumped into a local buffer; if False, all fitted estimators are directly stored in memory.
- n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- random_state (int or None, default=None) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- verbose (int, default=1) – Controls the verbosity when fitting and predicting. If <= 0, silent mode, which means no logging information will be displayed; if 1, logging information on the cascade layer level will be displayed; if > 1, full logging information will be displayed.
fit(X, y, sample_weight=None)¶

Build a deep forest using the training data.

Note

Deep forest supports two kinds of modes for training:

- Full memory mode, in which the training / testing data and all fitted estimators are directly stored in memory.
- Partial mode, in which each estimator is dumped into a local buffer after being fitted on the training data. During the evaluating stage, the dumped estimators are reloaded into memory sequentially to evaluate the testing data.

By setting partial_mode to True, the partial mode is activated and a local buffer will be created in the current directory. The partial mode reduces the running memory cost when training the deep forest.

Parameters
- X – The training data. Internally, it will be converted to np.uint8.
- y (numpy.ndarray of shape (n_samples,)) – The class labels of the input samples.
- sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, samples are equally weighted.

predict_proba(X)¶

Predict class probabilities for X.

Parameters
- X – The input samples. Internally, its dtype will be converted to np.uint8.

Returns
- proba (numpy.ndarray of shape (n_samples, n_classes)) – The class probabilities of the input samples.

predict(X)¶

Predict class for X.

Parameters
- X – The input samples. Internally, its dtype will be converted to np.uint8.

Returns
- y (numpy.ndarray of shape (n_samples,)) – The predicted classes.

clean()¶

Clean the buffer created by the model.

get_estimator(layer_idx, est_idx, estimator_type)¶

Get estimator from a cascade layer in the deep forest.

Parameters
- layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].
- est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators].
- estimator_type ({"rf", "erf", "custom"}) – Specify the forest type. If "rf", return the random forest. If "erf", return the extremely random forest. If "custom", return the customized estimator, only applicable when using customized estimators in deep forest via set_estimator().

Returns
- estimator – The estimator with the given index.

get_layer_feature_importances(layer_idx)¶

Return the feature importances of the layer_idx-th cascade layer.

Parameters
- layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].

Returns
- feature_importances_ (numpy.ndarray of shape (n_features,)) – The impurity-based feature importances of the cascade layer. Notice that the number of input features differs between the first cascade layer and the remaining cascade layers.

Note

- This method is only applicable when the deep forest is built using the sklearn backend.
- The functionality of this method is not available when using customized estimators in deep forest.

load(dirname)¶

Load the model from the directory dirname.

Parameters
- dirname (str) – The name of the input directory.

Note

The model obtained after calling load() is not exactly the same as the model before saving, because many objects irrelevant to model inference are not saved.

save(dirname='model')¶

Save the model to the directory dirname.

Parameters
- dirname (str, default="model") – The name of the output directory.

Warning

Other methods of model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.

set_estimator(estimators, n_splits=5)¶

Specify the custom base estimators for cascade layers.

Parameters
- estimators (list) – A list of your base estimators, which will be used in all cascade layers.
- n_splits (int, default=5) – The number of folds, must be at least 2.

set_predictor(predictor)¶

Specify the custom predictor concatenated to deep forest.

Parameters
- predictor (object) – The instantiated object of your predictor.
CascadeForestRegressor¶
Method | Description
---|---
fit(X, y[, sample_weight]) | Build a deep forest using the training data.
predict(X) | Predict regression target for X.
clean() | Clean the buffer created by the model.
get_estimator(layer_idx, est_idx, estimator_type) | Get estimator from a cascade layer in the deep forest.
get_layer_feature_importances(layer_idx) | Return the feature importances of the layer_idx-th cascade layer.
load(dirname) | Load the model from the directory dirname.
save([dirname]) | Save the model to the directory dirname.
set_estimator(estimators[, n_splits]) | Specify the custom base estimators for cascade layers.
set_predictor(predictor) | Specify the custom predictor concatenated to deep forest.
class deepforest.CascadeForestRegressor(n_bins=255, bin_subsample=200000, bin_type='percentile', max_layers=20, criterion='mse', n_estimators=2, n_trees=100, max_depth=None, min_samples_split=2, min_samples_leaf=1, use_predictor=False, predictor='forest', predictor_kwargs={}, backend='custom', n_tolerant_rounds=2, delta=1e-05, partial_mode=False, n_jobs=None, random_state=None, verbose=1)¶

Bases: deepforest.cascade.BaseCascadeForest, sklearn.base.RegressorMixin

Implementation of the deep forest for regression.
Parameters

- n_bins (int, default=255) – The number of bins used for non-missing values. In addition to the n_bins bins, one more bin is reserved for missing values. Its value must be no smaller than 2 and no greater than 255.
- bin_subsample (int, default=200,000) – The number of samples used to construct feature discrete bins. If the size of the training set is smaller than bin_subsample, then all training samples will be used.
- bin_type ({"percentile", "interval"}, default="percentile") – The type of binner used to bin feature values into integer-valued bins. If "percentile", each bin will have approximately the same number of distinct feature values. If "interval", each bin will have approximately the same size.
- max_layers (int, default=20) – The maximum number of cascade layers in the deep forest. Notice that the actual number of layers can be smaller than max_layers because of the internal early stopping stage.
- criterion ({"mse", "mae"}, default="mse") – The function to measure the quality of a split. Supported criteria are "mse" for the mean squared error, which is equal to variance reduction as feature selection criterion, and "mae" for the mean absolute error.
- n_estimators (int, default=2) – The number of estimators in each cascade layer. It will be multiplied by 2 internally because each estimator contains a RandomForestRegressor and an ExtraTreesRegressor, respectively.
- n_trees (int, default=100) – The number of trees in each estimator.
- max_depth (int, default=None) – The maximum depth of each tree. None indicates no constraint.
- min_samples_split (int, default=2) – The minimum number of samples required to split an internal node.
- min_samples_leaf (int, default=1) – The minimum number of samples required to be at a leaf node.
- use_predictor (bool, default=False) – Whether to build the predictor concatenated to the deep forest. Using the predictor may improve the performance of the deep forest.
- predictor ({"forest", "xgboost", "lightgbm"}, default="forest") – The type of the predictor concatenated to the deep forest. If use_predictor is False, this parameter has no effect.
- predictor_kwargs (dict, default={}) – The configuration of the predictor concatenated to the deep forest. Specifying this will extend/overwrite the original parameters inherited from the deep forest. If use_predictor is False, this parameter has no effect.
- backend ({"custom", "sklearn"}, default="custom") – The backend of the forest estimator. Supported backends are "custom" for higher time and memory efficiency and "sklearn" for additional functionality.
- n_tolerant_rounds (int, default=2) – Specify when to conduct early stopping. The training process terminates when the validation performance on the training set does not improve compared against the best validation performance achieved so far for n_tolerant_rounds rounds.
- delta (float, default=1e-5) – Specify the threshold on early stopping. The counting on n_tolerant_rounds is triggered if the performance of a fitted cascade layer does not improve by delta compared against the best validation performance achieved so far.
- partial_mode (bool, default=False) – Whether to train the deep forest in partial mode. For large datasets, it is recommended to use the partial mode. If True, the partial mode is activated and all fitted estimators will be dumped into a local buffer; if False, all fitted estimators are directly stored in memory.
- n_jobs (int or None, default=None) – The number of jobs to run in parallel for both fit() and predict(). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- random_state (int or None, default=None) – If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- verbose (int, default=1) – Controls the verbosity when fitting and predicting. If <= 0, silent mode, which means no logging information will be displayed; if 1, logging information on the cascade layer level will be displayed; if > 1, full logging information will be displayed.
fit(X, y, sample_weight=None)¶

Build a deep forest using the training data.

Note

Deep forest supports two kinds of modes for training:

- Full memory mode, in which the training / testing data and all fitted estimators are directly stored in memory.
- Partial mode, in which each estimator is dumped into a local buffer after being fitted on the training data. During the evaluating stage, the dumped estimators are reloaded into memory sequentially to evaluate the testing data.

By setting partial_mode to True, the partial mode is activated and a local buffer will be created in the current directory. The partial mode reduces the running memory cost when training the deep forest.

Parameters
- X – The training data. Internally, it will be converted to np.uint8.
- y (numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)) – The target values of the input samples.
- sample_weight (numpy.ndarray of shape (n_samples,), default=None) – Sample weights. If None, samples are equally weighted.

predict(X)¶

Predict regression target for X.

Parameters
- X – The input samples. Internally, its dtype will be converted to np.uint8.

Returns
- y (numpy.ndarray of shape (n_samples,) or (n_samples, n_outputs)) – The predicted values.

clean()¶

Clean the buffer created by the model.

get_estimator(layer_idx, est_idx, estimator_type)¶

Get estimator from a cascade layer in the deep forest.

Parameters
- layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].
- est_idx (int) – The index of the estimator, should be in the range [0, self.n_estimators].
- estimator_type ({"rf", "erf", "custom"}) – Specify the forest type. If "rf", return the random forest. If "erf", return the extremely random forest. If "custom", return the customized estimator, only applicable when using customized estimators in deep forest via set_estimator().

Returns
- estimator – The estimator with the given index.

get_layer_feature_importances(layer_idx)¶

Return the feature importances of the layer_idx-th cascade layer.

Parameters
- layer_idx (int) – The index of the cascade layer, should be in the range [0, self.n_layers_-1].

Returns
- feature_importances_ (numpy.ndarray of shape (n_features,)) – The impurity-based feature importances of the cascade layer. Notice that the number of input features differs between the first cascade layer and the remaining cascade layers.

Note

- This method is only applicable when the deep forest is built using the sklearn backend.
- The functionality of this method is not available when using customized estimators in deep forest.

load(dirname)¶

Load the model from the directory dirname.

Parameters
- dirname (str) – The name of the input directory.

Note

The model obtained after calling load() is not exactly the same as the model before saving, because many objects irrelevant to model inference are not saved.

save(dirname='model')¶

Save the model to the directory dirname.

Parameters
- dirname (str, default="model") – The name of the output directory.

Warning

Other methods of model serialization such as pickle or joblib are not recommended, especially when partial_mode is set to True.

set_estimator(estimators, n_splits=5)¶

Specify the custom base estimators for cascade layers.

Parameters
- estimators (list) – A list of your base estimators, which will be used in all cascade layers.
- n_splits (int, default=5) – The number of folds, must be at least 2.

set_predictor(predictor)¶

Specify the custom predictor concatenated to deep forest.

Parameters
- predictor (object) – The instantiated object of your predictor.
Parameters Tuning¶
This page contains parameters tuning guides for deep forest.
Better Accuracy¶
Increase model complexity¶
An intuitive way of improving the performance of deep forest is to increase its model complexity, below are some important parameters controlling the model complexity:
- n_estimators: Specify the number of estimators in each cascade layer.
- n_trees: Specify the number of trees in each estimator.
- max_layers: Specify the maximum number of cascade layers.
Using larger values for the parameters above may improve the performance of deep forest on complex datasets that require a larger model to perform well.
Add the predictor¶
In addition to increasing the model complexity, you can borrow the power of random forest or gradient boosting decision tree (GBDT), which can be helpful depending on the dataset:
- use_predictor: Decide whether to use the predictor concatenated to the deep forest.
- predictor: Specify the type of the predictor, should be one of "forest", "xgboost", "lightgbm".
Please make sure that XGBoost or LightGBM is installed if you are going to use it as the predictor.
Tip
A useful rule for deciding whether to add the predictor is to compare the performance of deep forest with a standalone predictor trained on the same data. If the predictor consistently outperforms deep forest, then the performance of deep forest is expected to improve by adding the predictor. In this case, the augmented features produced by deep forest also facilitate training the predictor.
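A minimal sketch of attaching a predictor; the choice of LightGBM is illustrative and assumes the lightgbm package is installed.

from deepforest import CascadeForestClassifier

# Concatenate a LightGBM predictor after the last cascade layer.
model = CascadeForestClassifier(use_predictor=True, predictor="lightgbm")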
Faster Speed¶
Parallelization¶
Using parallelization is highly recommended because deep forest is naturally suited to it.
- n_jobs: Specify the number of workers used. Setting its value to an integer larger than 1 enables parallelization. Setting it to -1 means all processors are used.
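For instance, a minimal sketch of enabling full parallelization:

from deepforest import CascadeForestClassifier

# Use all available processors for both fit() and predict().
model = CascadeForestClassifier(n_jobs=-1)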
Fewer Splits¶
- n_bins: Specify the number of feature discrete bins. A smaller value means fewer splitting cut-offs will be considered; it should be an integer in the range [2, 255].
- bin_type: Specify the binning type. Setting its value to "interval" enables fewer splitting cut-offs to be considered on dense intervals where the feature values accumulate.
Decrease model complexity¶
Setting parameters below to a smaller value decreases the model complexity of deep forest, and may lead to a faster speed on training and evaluating.
- max_depth: Specify the maximum depth of each tree. None indicates no constraint.
- min_samples_leaf: Specify the minimum number of samples required to be at a leaf node. The smallest value is 1.
- n_estimators: Specify the number of estimators in each cascade layer.
- n_trees: Specify the number of trees in each estimator.
- n_tolerant_rounds: Specify the number of tolerant rounds when handling early stopping. The smallest value is 1.
Warning
Since deep forest automatically determines the model complexity according to the validation performance on the training data, setting parameters above to a smaller value may lead to a deep forest model with more cascade layers.
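A sketch combining several of the speed-oriented settings above; the concrete values are illustrative, not recommendations.

from deepforest import CascadeForestClassifier

# Coarser binning, interval-based bins, shallower trees, and fewer trees
# per estimator, all of which trade accuracy for speed.
model = CascadeForestClassifier(
    n_bins=64,
    bin_type="interval",
    max_depth=10,
    min_samples_leaf=5,
    n_trees=50,
)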
Lower Memory Usage¶
Partial Mode¶
- partial_mode: Decide whether to train and evaluate the model in partial mode. If set to True, the model will actively dump fitted estimators into a local buffer. As a result, the memory usage of deep forest no longer increases linearly with the number of fitted cascade layers.
In addition, decreasing the model complexity also pulls down the memory usage.
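A minimal sketch of enabling the partial mode; calling clean() afterwards removes the local buffer created on disk.

from deepforest import CascadeForestClassifier

# Fitted estimators are dumped into a local buffer instead of being kept in memory.
model = CascadeForestClassifier(partial_mode=True)
# ... fit and predict as usual, then remove the buffer with model.clean()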
Experiments¶
Baseline¶
For all experiments, we used 5 popular tree-based ensemble methods as baselines. Details on the baselines are listed in the following table:
Name | Introduction
---|---
RF | An efficient implementation of Random Forest in Scikit-Learn
HGBDT | Histogram-based GBDT in Scikit-Learn
XGB EXACT | The vanilla version of XGBoost
XGB HIST | The histogram optimized version of XGBoost
LightGBM | Light Gradient Boosting Machine
Environment¶
For all experiments, we used a single Linux server. Details on the specifications are listed in the table below. All processors were used for training and evaluating.

OS | CPU | Memory
---|---|---
Ubuntu 18.04 LTS | Xeon E-2288G | 128GB
Setting¶
We kept the number of decision trees the same across all baselines, while the remaining hyper-parameters were set to their default values. Scripts for reproducing all experiment results are available; please refer to this Repo.
Classification¶
Dataset¶
We have collected a number of datasets for both binary and multi-class classification, as listed in the table below. They were selected based on the following criteria:
Publicly available and easy to use;
Cover different application areas;
Reflect high diversity in terms of the number of samples, features, and classes.
As a result, some baselines may fail on datasets with too many samples or features. Such cases are indicated by N/A in all tables below.
Name | # Training | # Testing | # Features | # Classes
---|---|---|---|---
ijcnn1 | 49,990 | 91,701 | 22 | 2
pendigits | 7,494 | 3,498 | 16 | 10
letter | 15,000 | 5,000 | 16 | 26
connect-4 | 67,557 | 20,267 | 126 | 3
sector | 6,412 | 3,207 | 55,197 | 105
covtype | 406,708 | 174,304 | 54 | 7
susy | 4,500,000 | 500,000 | 18 | 2
higgs | 10,500,000 | 500,000 | 28 | 2
usps | 7,291 | 2,007 | 256 | 10
mnist | 60,000 | 10,000 | 784 | 10
fashion mnist | 60,000 | 10,000 | 784 | 10
Classification Accuracy¶
The table below shows the testing accuracy of each method, with the best result on each dataset bolded. Each experiment was conducted over 5 independent trials, and the average result is reported.
Name | RF | HGBDT | XGB EXACT | XGB HIST | LightGBM | Deep Forest
---|---|---|---|---|---|---
ijcnn1 | 98.07 | 98.43 | 98.20 | 98.23 | **98.61** | 98.16
pendigits | 96.54 | 96.34 | 96.60 | 96.60 | 96.17 | **97.50**
letter | 95.39 | 91.56 | 90.80 | 90.82 | 88.94 | **95.92**
connect-4 | 70.18 | 70.88 | 71.57 | 71.57 | 70.31 | **72.05**
sector | 85.62 | N/A | 66.01 | 65.61 | 63.24 | **86.74**
covtype | 73.73 | 64.22 | 66.15 | 66.70 | 65.00 | **74.27**
susy | 80.19 | 80.31 | 80.32 | **80.35** | 80.33 | 80.18
higgs | N/A | 74.95 | 75.85 | 76.00 | 74.97 | **76.46**
usps | 93.79 | 94.32 | 93.77 | 93.37 | 93.97 | **94.67**
mnist | 97.20 | 98.35 | 98.07 | 98.14 | **98.42** | 98.11
fashion mnist | 87.87 | 87.02 | 90.74 | 90.80 | **90.81** | 89.66
Runtime¶
The runtime in seconds reported in the table below covers both the training stage and the evaluating stage.
Name | RF | HGBDT | XGB EXACT | XGB HIST | LightGBM | Deep Forest
---|---|---|---|---|---|---
ijcnn1 | 9.60 | 6.84 | 11.24 | 1.90 | 1.99 | 8.37
pendigits | 1.26 | 5.12 | 0.39 | 0.26 | 0.46 | 2.21
letter | 0.76 | 1.30 | 0.34 | 0.17 | 0.19 | 2.84
connect-4 | 5.17 | 7.54 | 13.26 | 3.19 | 1.12 | 10.73
sector | 292.15 | N/A | 632.27 | 593.35 | 18.83 | 521.68
covtype | 84.00 | 2.56 | 58.43 | 11.62 | 3.96 | 164.18
susy | 1429.85 | 59.09 | 1051.54 | 44.85 | 34.40 | 1866.48
higgs | N/A | 523.74 | 7532.70 | 267.64 | 209.65 | 7307.44
usps | 9.28 | 8.73 | 9.43 | 5.78 | 9.81 | 6.08
mnist | 590.81 | 229.91 | 1156.64 | 762.40 | 233.94 | 599.55
fashion mnist | 735.47 | 32.86 | 1403.44 | 2061.80 | 428.37 | 661.05
Some observations are listed as follows:

- Histogram-based GBDT implementations (e.g., HGBDT, XGB HIST, LightGBM) are typically faster, mainly because decision trees in GBDT tend to have a much smaller depth;
- As the number of input dimensions increases (e.g., on mnist and fashion mnist), random forest and deep forest can be faster.
Regression¶
Dataset¶
We have also collected five datasets on univariate regression for a comparison on the regression problem.
Name | # Training | # Testing | # Features
---|---|---|---
wine | 1,071 | 528 | 11
abalone | 2,799 | 1,378 | 8
cpusmall | 5,489 | 2,703 | 12
boston | 379 | 127 | 13
diabetes | 303 | 139 | 10
Testing Mean Squared Error¶
The table below shows the testing mean squared error of each method, with the best result on each dataset bolded. Each experiment was conducted over 5 independent trials, and the average result is reported.
Name | RF | HGBDT | XGB EXACT | XGB HIST | LightGBM | Deep Forest
---|---|---|---|---|---|---
wine | 0.35 | 0.40 | 0.41 | 0.41 | 0.39 | **0.34**
abalone | 4.79 | 5.40 | 5.73 | 5.75 | 5.60 | **4.66**
cpusmall | 8.31 | 9.01 | 9.86 | 11.82 | 8.99 | **7.15**
boston | **16.61** | 20.68 | 20.61 | 19.65 | 20.27 | 19.87
diabetes | 3796.62 | 4333.66 | 4337.15 | 4303.96 | 4435.95 | **3431.01**
Runtime¶
The runtime in seconds reported in the table below covers both the training stage and the evaluating stage.
Name | RF | HGBDT | XGB EXACT | XGB HIST | LightGBM | Deep Forest
---|---|---|---|---|---|---
wine | 0.76 | 2.88 | 0.30 | 0.30 | 0.30 | 1.26
abalone | 0.53 | 1.57 | 0.47 | 0.50 | 0.17 | 1.29
cpusmall | 1.87 | 3.59 | 1.71 | 1.25 | 0.36 | 2.06
boston | 0.70 | 1.75 | 0.19 | 0.22 | 0.20 | 1.45
diabetes | 0.37 | 0.66 | 0.14 | 0.18 | 0.06 | 1.09
Report from Users¶
This page collects user reports on using deep forest. Thanks to all of them for their nice work!
Competition¶
1st winning solution of the competition Insurance-Pricing-Game@AIcrowd: [Solution] | [Presentation]
Application¶
Model Architecture¶
This page introduces the model architecture, training stage, and evaluating stage of DF21. You may find it helpful for understanding the meanings of different parameters listed in API Reference.

Component¶
This section presents the meanings of key components in DF21, along with associated parameters.
- Binner: The class used to reduce the number of splitting candidates for building decision trees. Related parameters: n_bins, bin_subsample, bin_type.
- Estimator: Base estimators used in the cascade layers of DF21. Default estimators are RandomForestClassifier and ExtraTreesClassifier. Related parameters: n_trees, max_depth, min_samples_split, min_samples_leaf, criterion, backend.
- Layer: The cascade layer of DF21, which consists of multiple estimators. Related parameters: max_layers, n_estimators.
- Predictor: The optional predictor concatenated to the DF21 model. Related parameters: use_predictor, predictor, predictor_kwargs.
Training¶
The training stage of DF21 starts by discretizing the feature-wise values of the training samples into n_bins unique values, a commonly-used technique for accelerating the building of decision trees. After that, the first cascade layer in DF21, containing n_estimators estimators, is produced using the binned data (notice that by default n_estimators is multiplied by 2 internally). Furthermore, each estimator consists of n_trees decision trees that adopt the splitting criterion criterion while satisfying the constraints enforced by max_depth and min_samples_leaf.
After data binning and building the first cascade layer, DF21 enters the main training loop:

- Bin the out-of-bag predictions of the previous cascade layer (denoted by augmented features in the figure above) using a newly-fitted binner;
- Concatenate the augmented features to the binned training samples, serving as the new training data for the cascade layer to be built;
- Build a new layer using the concatenated training data, following the same training protocol as that used to build the first cascade layer;
- Get the out-of-bag predictions of the layer and estimate its generalization performance via out-of-bag estimation;
- If the estimated performance is better than that of all previously-built layers, DF21 continues to build a new layer. Otherwise, the early-stopping procedure is triggered, and DF21 terminates the training stage before reaching max_layers if the performance does not improve for n_tolerant_rounds rounds.
As an optional step, DF21 builds another predictor if use_predictor is set to True. This predictor takes as input the concatenated training data from the last cascade layer, and outputs predicted class probabilities for classification problems and predicted values for regression problems. One can use predictors like random forest or GBDT by setting predictor, and further configure the predictor by setting predictor_kwargs.
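For example, a sketch of attaching a forest predictor; the predictor_kwargs key shown is an assumption about what the underlying forest predictor accepts, not a documented setting.

from deepforest import CascadeForestClassifier

# Attach a forest predictor after the last cascade layer; the keyword passed
# through predictor_kwargs is assumed to be forwarded to the forest predictor.
model = CascadeForestClassifier(
    use_predictor=True,
    predictor="forest",
    predictor_kwargs={"n_estimators": 200},  # assumption: forwarded to the predictor
)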
Evaluating¶
The evaluating stage follows the sequential structure of DF21. First, the testing samples are binned using the first binner and passed into the first layer. DF21 then sets the augmented features to the output of the current cascade layer and bins them using the subsequent binner. After concatenating the augmented features to the binned testing samples, DF21 moves on to the next layer, until reaching the last cascade layer or the predictor.
Use Customized Estimators¶
Version v0.1.4 of deepforest added support for:

- using customized base estimators in the cascade layers of deep forest;
- using a customized predictor concatenated to the deep forest.

This page gives a detailed introduction on how to use this new feature.
Instantiate the deep forest model¶
To begin with, you need to instantiate a deep forest model. Notice that some parameters specified here will be overridden by downstream steps. For example, if the parameter use_predictor is set to False here, whereas set_predictor() is called later, then the internal attribute use_predictor will be altered to True.
from deepforest import CascadeForestClassifier
model = CascadeForestClassifier()
Instantiate your estimators¶
In order to use customized estimators in the cascade layer of deep forest, the next step is to instantiate the estimators and encapsulate them into a Python list:
n_estimators = 4 # the number of base estimators per cascade layer
estimators = [your_estimator(random_state=i) for i in range(n_estimators)]
Tip
You need to make sure that the instantiated estimators in the list use different random seeds if the seeds are manually specified. Otherwise, they will behave identically on the dataset and make the cascade layers less effective.
For the customized predictor, you only need to instantiate it, and there is no extra step:
predictor = your_predictor()
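As a concrete, purely illustrative instantiation, any scikit-learn compatible estimators could be used in place of the placeholders above, for example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

n_estimators = 4  # the number of base estimators per cascade layer
estimators = [RandomForestClassifier(random_state=i) for i in range(n_estimators)]
predictor = LogisticRegression()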
Deep forest will conduct internal checks to make sure that estimators and predictor are valid for training and evaluating. To pass the internal checks, the class of your customized estimators or predictor should at least implement the methods listed below:
- fit() for training;
- [Classification] predict_proba() for evaluating;
- [Regression] predict() for evaluating.
The names of these methods follow the convention in scikit-learn, and they are already implemented in many packages offering scikit-learn APIs (e.g., XGBoost, LightGBM, CatBoost). Otherwise, you have to implement a wrapper on your customized estimators to make these methods callable, as in the sketch below.
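A minimal sketch of such a wrapper; MyNativeClassifier and its methods are hypothetical placeholders for an estimator whose native API does not follow the scikit-learn convention.

class SklearnStyleWrapper:
    """Expose fit / predict_proba / predict on top of a hypothetical native API."""

    def __init__(self, **kwargs):
        self.model = MyNativeClassifier(**kwargs)  # hypothetical estimator

    def fit(self, X, y):
        self.model.train_on(X, y)                  # hypothetical native training call
        return self

    def predict_proba(self, X):
        return self.model.class_probabilities(X)   # hypothetical native call

    def predict(self, X):
        return self.predict_proba(X).argmax(axis=1)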
Call set_estimator() and set_predictor()¶
The core step is to call set_estimator()
and set_predictor()
to override estimators used by default:
# Customized base estimators
model.set_estimator(estimators)
# Customized predictor
model.set_predictor(predictor)
set_estimator() has another parameter n_splits, which determines the number of folds of the internal cross-validation strategy. Its value should be at least 2, and the default value is 5. Generally speaking, a larger n_splits leads to better generalization performance. If you are confused about the effect of cross-validation here, please refer to the original paper for details on how deep forest adopts the cross-validation strategy to build cascade layers.
Train and Evaluate¶
Remaining steps follow the original workflow of deep forest.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
Note
When using customized estimators via set_estimator(), deep forest adopts the cross-validation strategy to grow cascade layers. Suppose that n_splits is set to 5 when calling set_estimator(); each estimator will then be trained five times to obtain the full augmented features from a cascade layer. As a result, you may experience a drastic increase in running time and memory usage.
Contributors ✨¶
Thanks goes to these wonderful people (emoji key):
Alex-Medium 💻 ⚠️ |
Dwaipayan Munshi 📖 |
Hao Lyu 💻 |
Joey Gao 💻 |
Mr-memorandum 🐛 |
NiMaZi 💻 ⚠️ |
T-Allen-sudo 🚧 ⚠️ |
Wenjie Zhang 💻 ⚠️ |
Yi-Xuan Xu 💻 📖 ⚠️ |
tczhao 💻 📖 ⚠️ |
zhenlingcn 🐛 |
zshgostop 🐛 |
陈鼎彦 🐛 💻 |
This project follows the all-contributors specification. Contributions of any kind welcome!
Changelog¶
Badge | Meaning
---|---
Feature | Add something that cannot be achieved before.
Efficiency | Improve the efficiency on computation or memory.
Enhancement | Miscellaneous minor improvements.
Fix | Fix up something that does not work as expected.
API Change | You will need to change the code to have the same effect.
Version 0.1.*¶
- Feature: support the latest version of scikit-learn and drop support for Python 3.6 (#115) @xuyxu
- Feature | API Change: add support for pandas.DataFrame for X and y (#86) @IncubatorShokuhou
- Fix: fix missing functionality of _set_n_trees() @xuyxu
- Fix | API Change: add docstrings for parameter bin_type (#74) @xuyxu
- Feature | API Change: recover the parameter min_samples_split (#73) @xuyxu
- Fix: fix the breakdown under the corner case where no internal node exists (#70) @xuyxu
- Feature: support Python 3.9 (#69) @xuyxu
- Fix: fix inconsistency on array shape for CascadeForestRegressor in customized mode (#67) @xuyxu
- Fix: fix missing sample indices for parameter sample_weight in KFoldWrapper (#48) @xuyxu
- Feature | API Change: add support for customized estimators (#48) @xuyxu
- Enhancement: improve target checks for CascadeForestRegressor (#53) @chendingyan
- Fix: fix the prediction workflow with only one cascade layer (#56) @xuyxu
- Fix: fix inconsistency on predictor name (#52) @xuyxu
- Feature: add official support for ManyLinux-aarch64 (#47) @xuyxu
- Fix: fix accepted types of target for CascadeForestRegressor (#44) @xuyxu
- Feature | API Change: add multi-output support for CascadeForestRegressor (#40) @Alex-Medium
- Feature | API Change: add layer-wise feature importances (#39) @xuyxu
- Feature | API Change: add scikit-learn backend (#36) @xuyxu
- Feature: add official support for Mac-OS (#34) @T-Allen-sudo
- Feature | API Change: support configurable criterion (#28) @tczhao
- Feature | API Change: support regression prediction (#25) @tczhao
- Fix: fix accepted data types on the binner (#23) @xuyxu
- Feature | API Change: implement the get_estimator() method for efficient indexing (#22) @xuyxu
- Feature: support class label encoding (#18) @NiMaZi
- Feature | API Change: support sample weight in fit() (#7) @tczhao
- Feature | API Change: configurable predictor parameter (#9) @tczhao
- Enhancement: add base classes BaseEstimator and ClassifierMixin (#8) @pjgao