deeprob.spn.learning.splitting package

Submodules

deeprob.spn.learning.splitting.cluster module

deeprob.spn.learning.splitting.cluster.gmm(data, distributions, domains, random_state, n=2)[source]

Execute GMM clustering on some data.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding sample.

Return type

ndarray

deeprob.spn.learning.splitting.cluster.kmeans(data, distributions, domains, random_state, n=2)[source]

Execute K-Means clustering on some data.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding sample.

Return type

ndarray

deeprob.spn.learning.splitting.cluster.kmeans_mb(data, distributions, domains, random_state, n=2)[source]

Execute MiniBatch K-Means clustering on some data.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding sample.

Return type

ndarray

deeprob.spn.learning.splitting.cluster.dbscan(data, distributions, domains, random_state, n=2)[source]

Execute DBSCAN clustering on some data (only on discrete data).

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding sample.

Raises

ValueError – If the leaf distributions are NOT discrete.

Return type

ndarray

deeprob.spn.learning.splitting.cluster.wald(data, distributions, domains, random_state, n=2)[source]

Execute Ward (hierarchical) clustering on some data (only on discrete data).

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding sample.

Raises

ValueError – If the leaf distributions are NOT discrete.

Return type

ndarray

deeprob.spn.learning.splitting.cols module

deeprob.spn.learning.splitting.cols.SplitColsFunc

A signature for a columns splitting function.

alias of Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]

deeprob.spn.learning.splitting.cols.split_cols_clusters(data, clusters, scope)[source]

Split the data vertically given the clusters.

Parameters

data (ndarray) – The data.
clusters (ndarray) – The clusters.
scope (List[int]) – The original scope.

Returns

(slices, scopes) where slices is a list of partial data and scopes is a list of partial scopes.

Return type

Tuple[List[ndarray], List[List[int]]]
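
The vertical split described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name split_cols_by_clusters is hypothetical, and it assumes that clusters holds one cluster id per column of data.

```python
import numpy as np

def split_cols_by_clusters(data, clusters, scope):
    """Hypothetical helper: slice columns by cluster id and carry the scope along."""
    slices, scopes = [], []
    for c in np.unique(clusters):
        cols = np.flatnonzero(clusters == c)      # column positions in this cluster
        slices.append(data[:, cols])              # partial data
        scopes.append([scope[i] for i in cols])   # partial scope
    return slices, scopes

data = np.arange(12).reshape(3, 4)
clusters = np.array([0, 1, 0, 1])   # assumed: one cluster id per column
scope = [2, 5, 7, 9]                # original variable ids of the four columns
slices, scopes = split_cols_by_clusters(data, clusters, scope)
```

Each partial scope lists the original variable ids of the columns kept in the matching slice, so the two lists stay aligned.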

deeprob.spn.learning.splitting.cols.get_split_cols_method(split_cols)[source]

Get the columns splitting method given a string.

Parameters: split_cols (str) – The name of the method to get.
Returns: The corresponding columns splitting function.
Raises: ValueError – If the columns splitting method is unknown.
Return type: Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]

deeprob.spn.learning.splitting.entropy module

deeprob.spn.learning.splitting.entropy.entropy_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1)[source]

Entropy based column splitting method.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – Distributions of the features.
domains (List[Union[list, tuple]]) – Range of values of the features.
e (float) – The threshold at which the entropy is considered significant.
alpha (float) – The Laplace smoothing factor applied to the frequencies.
random_state (RandomState) – The random state.

Returns

A partitioning of features.

Return type

ndarray
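
As a rough illustration of how e and alpha could interact, the sketch below computes a Laplace-smoothed entropy per discrete feature and flags the features whose entropy falls below e. The helper names, the use of log base 2, and the direction of the comparison are assumptions made for the example, not details confirmed by this page.

```python
import numpy as np

def smoothed_entropy(column, domain, alpha=0.1):
    """Entropy (in bits) of a discrete column with Laplace-smoothed frequencies."""
    counts = np.array([np.sum(column == v) for v in domain], dtype=float)
    probs = (counts + alpha) / (counts.sum() + alpha * len(domain))
    return -np.sum(probs * np.log2(probs))

def low_entropy_partition(data, domains, e=0.3, alpha=0.1):
    """Assumed rule: label 1 for features with entropy below e, 0 otherwise."""
    entropies = np.array([
        smoothed_entropy(data[:, i], domains[i], alpha)
        for i in range(data.shape[1])
    ])
    return (entropies < e).astype(int)

# Column 0 is constant (very low entropy), column 1 is balanced (about 1 bit).
data = np.column_stack([np.zeros(100, dtype=int), np.arange(100) % 2])
domains = [[0, 1], [0, 1]]
partition = low_entropy_partition(data, domains)
```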

deeprob.spn.learning.splitting.entropy.entropy_adaptive_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1, size=None)[source]

Adaptive Entropy based column splitting method.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – Distributions of the features.
domains (List[Union[list, tuple]]) – Range of values of the features.
e (float) – The threshold at which the entropy is considered significant.
alpha (float) – The Laplace smoothing factor applied to the frequencies.
size (Optional[int]) – The size of the whole dataset.
random_state (RandomState) – The random state.

Returns

A partitioning of features.

Raises

ValueError – If the size of the data is missing.

Return type

ndarray

deeprob.spn.learning.splitting.gini module

deeprob.spn.learning.splitting.gini.gini_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1)[source]

Gini index column splitting method.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – Distributions of the features.
domains (List[Union[list, tuple]]) – Range of values of the features.
e (float) – The threshold at which the entropy is considered significant.
alpha (float) – The Laplace smoothing factor applied to the frequencies.
random_state (RandomState) – The random state.

Returns

A partitioning of features.

Return type

ndarray
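
For reference, the Gini index of a discrete feature with smoothed frequencies can be sketched as follows. This is only an illustration of the quantity itself: the helper name is hypothetical, and how the library compares the value against e is not shown on this page.

```python
import numpy as np

def smoothed_gini(column, domain, alpha=0.1):
    """Gini index 1 - sum(p^2) using Laplace-smoothed frequencies (hypothetical helper)."""
    counts = np.array([np.sum(column == v) for v in domain], dtype=float)
    probs = (counts + alpha) / (counts.sum() + alpha * len(domain))
    return 1.0 - np.sum(probs ** 2)

balanced = np.arange(100) % 2          # half zeros, half ones
constant = np.zeros(100, dtype=int)    # a single repeated value
g_balanced = smoothed_gini(balanced, [0, 1])
g_constant = smoothed_gini(constant, [0, 1])
```

A balanced binary feature sits at the maximum of 0.5, while a constant feature is close to 0.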

deeprob.spn.learning.splitting.gini.gini_adaptive_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1, size=None)[source]

Adaptive Gini index column splitting method.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – Distributions of the features.
domains (List[Union[list, tuple]]) – Range of values of the features.
e (float) – The threshold at which the entropy is considered significant.
alpha (float) – The Laplace smoothing factor applied to the frequencies.
size (Optional[int]) – The size of the whole dataset.
random_state (RandomState) – The random state.

Returns

A partitioning of features.

Raises

ValueError – If the size of the data is missing.

Return type

ndarray

deeprob.spn.learning.splitting.gvs module

deeprob.spn.learning.splitting.gvs.gvs_cols(data, distributions, domains, random_state, p=5.0)[source]

Greedy Variable Splitting (GVS) independence test.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The distributions.
domains (List[Union[list, tuple]]) – The domains.
random_state (RandomState) – The random state.
p (float) – The threshold for the G-Test.

Returns

A partitioning of features.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

ndarray

deeprob.spn.learning.splitting.gvs.rgvs_cols(data, distributions, domains, random_state, p=5.0)[source]

Random Greedy Variable Splitting (RGVS) independence test.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The distributions.
domains (List[Union[list, tuple]]) – The domains.
random_state (RandomState) – The random state.
p (float) – The threshold for the G-Test.

Returns

A partitioning of features.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

ndarray

deeprob.spn.learning.splitting.gvs.wrgvs_cols(data, distributions, domains, random_state, p=5.0)[source]

Wiser Random Greedy Variable Splitting (WRGVS) independence test.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The distributions.
domains (List[Union[list, tuple]]) – The domains.
random_state (RandomState) – The random state.
p (float) – The threshold for the G-Test.

Returns

A partitioning of features.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

ndarray

deeprob.spn.learning.splitting.gvs.gtest(data, i, j, distributions, domains, p=5.0, test=True)[source]

The G-Test independence test between two features.

Parameters

data (ndarray) – The data.
i (int) – The index of the first feature.
j (int) – The index of the second feature.
distributions (List[Type[Leaf]]) – The distributions.
domains (List[Union[list, tuple]]) – The domains.
p (float) – The threshold for the G-Test.
test (bool) – Whether to return the outcome of the test (True) or the value of the statistic (False). Defaults to True.

Returns

If test is True, False if the features are assumed to be dependent and True otherwise. If test is False, the value of the statistic.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

Union[bool, float]
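
The statistic behind a G-Test on two discrete features is G = 2 * sum(O * ln(O / E)), where O are the observed cell counts of the contingency table and E the counts expected under independence. The sketch below computes it with NumPy only. The helper name and its arguments are hypothetical, and the way the library compares the statistic with the threshold p is not specified on this page, so it is left out.

```python
import numpy as np

def g_statistic(x, y, n_x, n_y):
    """G statistic of two integer-coded features with n_x and n_y possible values."""
    observed = np.zeros((n_x, n_y))
    np.add.at(observed, (x, y), 1)    # build the contingency table
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    mask = observed > 0               # empty cells contribute nothing to the sum
    return 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

x = np.array([0, 0, 1, 1])
g_dependent = g_statistic(x, x, 2, 2)                        # y is a copy of x
g_independent = g_statistic(x, np.array([0, 1, 0, 1]), 2, 2)
```

Identical features give a large statistic (8 ln 2 here), while a perfectly uniform table gives exactly 0.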

deeprob.spn.learning.splitting.random module

deeprob.spn.learning.splitting.random.random_rows(data, distributions, domains, random_state, a=2.0, b=2.0)[source]

Randomly choose a binary horizontal partition (over the rows). The proportion of the split is sampled from a beta distribution.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions (not used).
domains (List[Union[list, tuple]]) – The data domains (not used).
random_state (RandomState) – The random state.
a (float) – The alpha parameter of the beta distribution.
b (float) – The beta parameter of the beta distribution.

Returns

A binary partition.

Return type

ndarray

deeprob.spn.learning.splitting.random.random_cols(data, distributions, domains, random_state, a=2.0, b=2.0)[source]

Randomly choose a binary vertical partition (over the columns). The proportion of the split is sampled from a beta distribution.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions (not used).
domains (List[Union[list, tuple]]) – The data domains (not used).
random_state (RandomState) – The random state.
a (float) – The alpha parameter of the beta distribution.
b (float) – The beta parameter of the beta distribution.

Returns

A binary partition.

Return type

ndarray
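
A beta-distributed random binary split, as described for random_rows and random_cols, might look like the following sketch. The helper name is hypothetical, and drawing each label from a Bernoulli distribution with the sampled proportion is an assumption about how the partition is produced.

```python
import numpy as np

def random_binary_partition(n_items, random_state, a=2.0, b=2.0):
    """Hypothetical helper: sample a split proportion, then one 0/1 label per item."""
    p = random_state.beta(a, b)                        # proportion of the split
    return random_state.binomial(1, p, size=n_items)   # assumed Bernoulli labels

rng = np.random.RandomState(42)
data = rng.rand(8, 5)
row_labels = random_binary_partition(data.shape[0], rng)   # horizontal split
col_labels = random_binary_partition(data.shape[1], rng)   # vertical split
```

With a = b = 2.0 the sampled proportion concentrates around 0.5, so very unbalanced splits are less likely than with a uniform proportion.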

deeprob.spn.learning.splitting.rdc module

deeprob.spn.learning.splitting.rdc.rdc_cols(data, distributions, domains, random_state, d=0.3, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)[source]

Split the features using the RDC (Randomized Dependency Coefficient) method.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
d (float) – The threshold value that regulates the independence tests among the features.
k (int) – The size of the latent space.
s (float) – The standard deviation of the gaussian distribution.
nl (Callable[[ndarray], ndarray]) – The non-linear function to use.

Returns

A partitioning of the features.

Return type

ndarray
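
One common way to turn a matrix of pairwise dependency scores into a feature partition is to link every pair whose score exceeds the threshold and take the connected components of the resulting graph. The sketch below follows that idea. It is an assumption about how d might be used, not a description of the library's code, and the helper name is hypothetical.

```python
import numpy as np

def partition_from_scores(scores, d):
    """Label features by connected components of the graph where scores > d."""
    n = scores.shape[0]
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue                      # already assigned to a component
        labels[start] = current
        stack = [start]
        while stack:                      # depth-first traversal of the component
            i = stack.pop()
            for j in np.flatnonzero(scores[i] > d):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

scores = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.1, 0.5, 1.0],
])
labels = partition_from_scores(scores, d=0.3)
```

Here features 0 and 1 end up in one group and features 2 and 3 in another, because only those pairs score above 0.3.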

deeprob.spn.learning.splitting.rdc.rdc_rows(data, distributions, domains, random_state, n=2, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)[source]

Split the samples using the RDC (Randomized Dependency Coefficient) method.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
n (int) – The number of clusters for KMeans.
k (int) – The size of the latent space.
s (float) – The standard deviation of the gaussian distribution.
nl (Callable[[ndarray], ndarray]) – The non-linear function to use.

Returns

A partitioning of the samples.

Return type

ndarray

deeprob.spn.learning.splitting.rdc.rdc_scores(data, distributions, domains, random_state, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)[source]

Compute the RDC (Randomized Dependency Coefficient) score for each pair of features.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
k (int) – The size of the latent space.
s (float) – The standard deviation of the gaussian distribution.
nl (Callable[[ndarray], ndarray]) – The non-linear function to use.

Returns

The RDC score matrix.

Return type

ndarray

deeprob.spn.learning.splitting.rdc.rdc_cca(i, j, features)[source]

Compute the RDC (Randomized Dependency Coefficient) using CCA (Canonical Correlation Analysis).

Parameters

i (int) – The index of the first feature.
j (int) – The index of the second feature.
features (List[ndarray]) – The list of the features.

Returns

The RDC coefficient (the largest canonical correlation coefficient).

Return type

float
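
The largest canonical correlation between two feature matrices can be computed with a QR decomposition followed by an SVD. The sketch below shows that standard technique on two matrices passed directly. Unlike rdc_cca, it does not take indices into a list of features, the helper name is hypothetical, and it assumes both matrices have full column rank.

```python
import numpy as np

def largest_canonical_correlation(fx, fy):
    """Largest canonical correlation of two (n_samples, n_features) matrices."""
    fx = fx - fx.mean(axis=0)             # center both views
    fy = fy - fy.mean(axis=0)
    qx, _ = np.linalg.qr(fx)              # orthonormal basis of each column space
    qy, _ = np.linalg.qr(fy)
    sv = np.linalg.svd(qx.T @ qy, compute_uv=False)
    return float(np.clip(sv[0], 0.0, 1.0))   # singular values are the correlations

rng = np.random.RandomState(0)
fx = rng.normal(size=(200, 3))
fy = fx @ rng.normal(size=(3, 3))         # spans the same column space as fx
r = largest_canonical_correlation(fx, fy)
```

Because fy is a linear transformation of fx, the two views are perfectly correlated and r is 1 up to rounding.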

deeprob.spn.learning.splitting.rdc.rdc_transform(data, distributions, domains, random_state, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)[source]

Execute the RDC (Randomized Dependency Coefficient) pipeline on some data.

Parameters

data (ndarray) – The data.
distributions (List[Type[Leaf]]) – The data distributions.
domains (List[Union[list, tuple]]) – The data domains.
random_state (RandomState) – The random state.
k (int) – The size of the latent space.
s (float) – The standard deviation of the gaussian distribution.
nl (Callable[[ndarray], ndarray]) – The non-linear function to use.

Returns

The transformed data.

Raises

ValueError – If an unknown distribution type is found.

Return type

List[ndarray]
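
In the usual formulation of RDC, each feature is mapped to its empirical CDF values, projected with random Gaussian weights, and passed through a non-linear function. The sketch below applies those steps to one continuous feature. It is a simplified illustration with a hypothetical helper name: it ignores the per-distribution handling that the ValueError above suggests the library performs.

```python
import numpy as np

def random_nonlinear_features(x, random_state, k=20, s=1.0 / 6.0, nl=np.sin):
    """Copula transform of one feature, then k random non-linear projections."""
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 1) / n      # empirical CDF values in (0, 1]
    z = np.column_stack([u, np.ones(n)])         # append a bias column
    w = random_state.normal(0.0, s, size=(z.shape[1], k))   # Gaussian weights
    return nl(z @ w)                             # shape (n, k)

rng = np.random.RandomState(0)
x = rng.normal(size=100)
features = random_nonlinear_features(x, rng)
```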

deeprob.spn.learning.splitting.rows module

deeprob.spn.learning.splitting.rows.SplitRowsFunc

A signature for a rows splitting function.

alias of Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]

deeprob.spn.learning.splitting.rows.split_rows_clusters(data, clusters)[source]

Split the data horizontally given the clusters.

Parameters

data (ndarray) – The data.
clusters (ndarray) – The clusters.

Returns

(slices, weights) where slices is a list of partial data and weights is a list of the proportions of the local data with respect to the original data.

Return type

Tuple[List[ndarray], List[float]]
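
The horizontal split and its weights can be illustrated with a short NumPy sketch. It is not the library's implementation: the helper name is hypothetical, and it assumes that clusters holds one cluster id per row of data.

```python
import numpy as np

def split_rows_by_clusters(data, clusters):
    """Hypothetical helper: slice rows by cluster id and compute each slice's share."""
    slices, weights = [], []
    for c in np.unique(clusters):
        part = data[clusters == c]                # rows assigned to this cluster
        slices.append(part)
        weights.append(len(part) / len(data))     # proportion of the original data
    return slices, weights

data = np.arange(10).reshape(5, 2)
clusters = np.array([0, 0, 1, 0, 1])   # assumed: one cluster id per row
slices, weights = split_rows_by_clusters(data, clusters)
```

The weights sum to 1, which is what allows them to serve as mixture weights over the row slices.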

deeprob.spn.learning.splitting.rows.get_split_rows_method(split_rows)[source]

Get the rows splitting method given a string.

Parameters: split_rows (str) – The name of the method to get.
Returns: The corresponding rows splitting function.
Raises: ValueError – If the rows splitting method is unknown.
Return type: Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]
