deeprob.spn.learning.splitting package

Submodules

deeprob.spn.learning.splitting.cluster module

deeprob.spn.learning.splitting.cluster.gmm(data, distributions, domains, random_state, n=2)

Execute GMM clustering on some data.

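Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding data point.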

Return type

ndarray

deeprob.spn.learning.splitting.cluster.kmeans(data, distributions, domains, random_state, n=2)

Execute K-Means clustering on some data.

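Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding data point.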

Return type

ndarray

deeprob.spn.learning.splitting.cluster.kmeans_mb(data, distributions, domains, random_state, n=2)

Execute MiniBatch K-Means clustering on some data.

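Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding data point.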

Return type

ndarray

deeprob.spn.learning.splitting.cluster.dbscan(data, distributions, domains, random_state, n=2)

Execute DBSCAN clustering on some data (only on discrete data).

Parameters
Returns

An array in which each element is the cluster assigned to the corresponding data point.

Raises

ValueError – If the leaf distributions are NOT discrete.

Return type

ndarray

deeprob.spn.learning.splitting.cluster.wald(data, distributions, domains, random_state, n=2)

Execute Ward (Hierarchical) clustering on some data (only discrete data).

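Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • n (int) – The number of clusters.

Returns

An array in which each element is the cluster assigned to the corresponding data point.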

Raises

ValueError – If the leaf distributions are NOT discrete.

Return type

ndarray

deeprob.spn.learning.splitting.cols module

deeprob.spn.learning.splitting.cols.SplitColsFunc

A signature for a columns splitting function.

alias of Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]

deeprob.spn.learning.splitting.cols.split_cols_clusters(data, clusters, scope)

Split the data vertically given the clusters.

Parameters
  • data (ndarray) – The data.

  • clusters (ndarray) – The clusters.

  • scope (List[int]) – The original scope.

Returns

(slices, scopes) where slices is a list of partial data and scopes is a list of partial scopes.

Return type

Tuple[List[ndarray], List[List[int]]]
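
As a rough illustration of the vertical split described above, the following sketch groups the columns by cluster label and carries the matching scope indices along. The helper name split_cols_sketch is hypothetical; this is one plausible reading of the documented inputs and outputs, not the library's implementation.

```python
import numpy as np

def split_cols_sketch(data, clusters, scope):
    """Hypothetical helper: split columns by cluster label (illustration only)."""
    slices, scopes = [], []
    for c in np.unique(clusters):
        mask = clusters == c                 # columns assigned to cluster c
        slices.append(data[:, mask])         # partial data for this cluster
        scopes.append([s for s, m in zip(scope, mask) if m])  # partial scope
    return slices, scopes

data = np.arange(12).reshape(3, 4)
clusters = np.array([0, 1, 0, 1])
slices, scopes = split_cols_sketch(data, clusters, scope=[10, 11, 12, 13])
```

Here the first slice holds columns 0 and 2 of the data, and its scope keeps the original variable indices 10 and 12.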

deeprob.spn.learning.splitting.cols.get_split_cols_method(split_cols)

Get the columns splitting method given a string.

Parameters

split_cols (str) – The name of the method to get.

Returns

The corresponding columns splitting function.

Raises

ValueError – If the columns splitting method is unknown.

Return type

Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]

deeprob.spn.learning.splitting.entropy module

deeprob.spn.learning.splitting.entropy.entropy_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1)

Entropy based column splitting method.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – Distributions of the features.

  • domains (List[Union[list, tuple]]) – Range of values of the features.

  • e (float) – Threshold at which the entropy is considered significant.

  • alpha (float) – Laplace smoothing factor applied to the frequencies.

  • random_state (RandomState) – The random state.

Returns

A partitioning of features.

Return type

ndarray
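
To make the roles of e and alpha more concrete, here is a minimal sketch of a Laplace-smoothed entropy for a single discrete feature, compared against a threshold. The function name smoothed_entropy is hypothetical, and both the direction of the comparison and the way the library turns the result into a partition are assumptions; only the smoothing and entropy formulas are standard.

```python
import numpy as np

def smoothed_entropy(column, domain_size, alpha=0.1):
    """Entropy (in bits) of one discrete feature with Laplace smoothing."""
    counts = np.bincount(column, minlength=domain_size).astype(float)
    # Add alpha to every count so that unseen values get non-zero probability.
    probs = (counts + alpha) / (counts.sum() + alpha * domain_size)
    return float(-np.sum(probs * np.log2(probs)))

# Column 0 is constant, column 1 is perfectly balanced.
data = np.array([[0, 0], [0, 1], [0, 0], [0, 1]])
entropies = [smoothed_entropy(data[:, j], domain_size=2) for j in range(2)]
low_entropy = [h < 0.3 for h in entropies]  # assumed comparison against e=0.3
```

The constant column falls below the threshold, while the balanced column has an entropy of one bit.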

deeprob.spn.learning.splitting.entropy.entropy_adaptive_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1, size=None)

Adaptive Entropy based column splitting method.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – Distributions of the features.

  • domains (List[Union[list, tuple]]) – Range of values of the features.

  • e (float) – Threshold at which the entropy is considered significant.

  • alpha (float) – Laplace smoothing factor applied to the frequencies.

  • size (Optional[int]) – Size of the whole dataset.

  • random_state (RandomState) – The random state.

Returns

A partitioning of features.

Raises

ValueError – If the size of the data is missing.

Return type

ndarray

deeprob.spn.learning.splitting.gini module

deeprob.spn.learning.splitting.gini.gini_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1)

Gini index column splitting method.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – Distributions of the features.

  • domains (List[Union[list, tuple]]) – Range of values of the features.

  • e (float) – Threshold at which the entropy is considered significant.

  • alpha (float) – Laplace smoothing factor applied to the frequencies.

  • random_state (RandomState) – The random state.

Returns

A partitioning of features.

Return type

ndarray
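
For comparison with the entropy-based criterion, the Gini index of a discrete feature is one minus the sum of its squared value probabilities. The sketch below computes it with Laplace smoothing; the name smoothed_gini is hypothetical and the code is an illustration of the formula, not the library's implementation.

```python
import numpy as np

def smoothed_gini(column, domain_size, alpha=0.1):
    """Gini index of one discrete feature with Laplace smoothing."""
    counts = np.bincount(column, minlength=domain_size).astype(float)
    probs = (counts + alpha) / (counts.sum() + alpha * domain_size)
    return float(1.0 - np.sum(probs ** 2))

balanced = smoothed_gini(np.array([0, 1, 0, 1]), domain_size=2)
constant = smoothed_gini(np.array([0, 0, 0, 0]), domain_size=2)
```

A balanced binary feature reaches the maximum of 0.5, while a constant feature stays close to zero.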

deeprob.spn.learning.splitting.gini.gini_adaptive_cols(data, distributions, domains, random_state, e=0.3, alpha=0.1, size=None)

Adaptive Gini index column splitting method.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – Distributions of the features.

  • domains (List[Union[list, tuple]]) – Range of values of the features.

  • e (float) – Threshold at which the entropy is considered significant.

  • alpha (float) – Laplace smoothing factor applied to the frequencies.

  • size (Optional[int]) – Size of the whole dataset.

  • random_state (RandomState) – The random state.

Returns

A partitioning of features.

Raises

ValueError – If the size of the data is missing.

Return type

ndarray

deeprob.spn.learning.splitting.gvs module

deeprob.spn.learning.splitting.gvs.gvs_cols(data, distributions, domains, random_state, p=5.0)

Greedy Variable Splitting (GVS) independence test.

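Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The distributions.

  • domains (List[Union[list, tuple]]) – The domains.

  • random_state (RandomState) – The random state.

  • p (float) – The threshold for the G-Test.

Returns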

A partitioning of features.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

ndarray

deeprob.spn.learning.splitting.gvs.rgvs_cols(data, distributions, domains, random_state, p=5.0)

Random Greedy Variable Splitting (RGVS) independence test.

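Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The distributions.

  • domains (List[Union[list, tuple]]) – The domains.

  • random_state (RandomState) – The random state.

  • p (float) – The threshold for the G-Test.

Returns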

A partitioning of features.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

ndarray

deeprob.spn.learning.splitting.gvs.wrgvs_cols(data, distributions, domains, random_state, p=5.0)

Wiser Random Greedy Variable Splitting (WRGVS) independence test.

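Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The distributions.

  • domains (List[Union[list, tuple]]) – The domains.

  • random_state (RandomState) – The random state.

  • p (float) – The threshold for the G-Test.

Returns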

A partitioning of features.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

ndarray

deeprob.spn.learning.splitting.gvs.gtest(data, i, j, distributions, domains, p=5.0, test=True)

The G-Test independence test between two features.

Parameters
  • data (ndarray) – The data.

  • i (int) – The index of the first feature.

  • j (int) – The index of the second feature.

  • distributions (List[Type[Leaf]]) – The distributions.

  • domains (List[Union[list, tuple]]) – The domains.

  • p (float) – The threshold for the G-Test.

  • test (bool) – Whether to return the outcome of the test (True) or the value of the statistic (False). Defaults to True.

Returns

False if the features are assumed to be dependent, True otherwise. If test is False, the value of the statistic is returned instead.

Raises

ValueError – If the leaf distributions are a mix of discrete and continuous.

Return type

Union[bool, float]
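
For reference, the G statistic underlying a G-Test between two discrete features can be sketched as follows. The function name g_statistic is hypothetical and the code only computes the textbook statistic, G = 2 Σ O ln(O / E), from a contingency table. How the library converts the statistic and the threshold p into a boolean decision is not documented here, so the sketch stops at the statistic.

```python
import numpy as np

def g_statistic(x, y, n_x, n_y):
    """G statistic of two integer-coded discrete features (illustration only)."""
    observed = np.zeros((n_x, n_y))
    np.add.at(observed, (x, y), 1)                      # contingency table
    total = observed.sum()
    # Expected counts under independence: outer product of the marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
    nz = observed > 0                                   # empty cells contribute 0
    return float(2.0 * np.sum(observed[nz] * np.log(observed[nz] / expected[nz])))

x = np.array([0, 0, 1, 1])
g_independent = g_statistic(x, np.array([0, 1, 0, 1]), 2, 2)
g_dependent = g_statistic(x, x, 2, 2)
```

The statistic is zero when the observed table matches the independence assumption exactly, and grows as the two features become more strongly associated.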

deeprob.spn.learning.splitting.random module

deeprob.spn.learning.splitting.random.random_rows(data, distributions, domains, random_state, a=2.0, b=2.0)

Randomly choose a binary horizontal partition (a split of the rows). The proportion of the split is sampled from a beta distribution.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions (not used).

  • domains (List[Union[list, tuple]]) – The data domains (not used).

  • random_state (RandomState) – The random state.

  • a (float) – The alpha parameter of the beta distribution.

  • b (float) – The beta parameter of the beta distribution.

Returns

A binary partition.

Return type

ndarray
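
The idea of a beta-distributed split proportion can be sketched as below. The function name random_rows_sketch is hypothetical, and forcing both sides of the partition to be non-empty is an assumption of this sketch rather than documented behaviour.

```python
import numpy as np

def random_rows_sketch(data, random_state, a=2.0, b=2.0):
    """Hypothetical helper: random binary row partition (illustration only)."""
    n_samples = len(data)
    p = random_state.beta(a, b)                  # proportion of rows in cluster 0
    k = int(round(p * n_samples))
    k = min(max(k, 1), n_samples - 1)            # assumption: keep both sides non-empty
    clusters = np.ones(n_samples, dtype=int)
    clusters[random_state.permutation(n_samples)[:k]] = 0
    return clusters

data = np.zeros((10, 3))
clusters = random_rows_sketch(data, np.random.RandomState(42))
```

With a = b = 2.0 the beta distribution is symmetric around 0.5, so splits tend to be roughly balanced while still varying from call to call.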

deeprob.spn.learning.splitting.random.random_cols(data, distributions, domains, random_state, a=2.0, b=2.0)

Randomly choose a binary vertical partition (a split of the columns). The proportion of the split is sampled from a beta distribution.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions (not used).

  • domains (List[Union[list, tuple]]) – The data domains (not used).

  • random_state (RandomState) – The random state.

  • a (float) – The alpha parameter of the beta distribution.

  • b (float) – The beta parameter of the beta distribution.

Returns

A binary partition.

Return type

ndarray

deeprob.spn.learning.splitting.rdc module

deeprob.spn.learning.splitting.rdc.rdc_cols(data, distributions, domains, random_state, d=0.3, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)

Split the features using the RDC (Randomized Dependency Coefficient) method.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • d (float) – The threshold value that regulates the independence tests among the features.

  • k (int) – The size of the latent space.

  • s (float) – The standard deviation of the Gaussian distribution.

  • nl (Callable[[ndarray], ndarray]) – The nonlinear function to use.

Returns

A partitioning of the features.

Return type

ndarray
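
One way a threshold such as d can turn a matrix of pairwise scores into a feature partition is to link every pair of features whose score reaches the threshold and take the connected components of the resulting graph. The sketch below illustrates that idea; the function name partition_by_threshold is hypothetical, and whether the library follows exactly this procedure is an assumption.

```python
import numpy as np

def partition_by_threshold(scores, d=0.3):
    """Label features by connected components of the graph scores >= d."""
    n = len(scores)
    labels = np.full(n, -1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue                              # already assigned to a component
        labels[start] = current
        stack = [start]
        while stack:                              # depth-first traversal
            i = stack.pop()
            for j in np.flatnonzero(scores[i] >= d):
                if labels[j] == -1:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

scores = np.array([[1.00, 0.80, 0.10],
                   [0.80, 1.00, 0.05],
                   [0.10, 0.05, 1.00]])
labels = partition_by_threshold(scores)
```

In this toy matrix the first two features are strongly dependent and end up in the same group, while the third is placed on its own.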

deeprob.spn.learning.splitting.rdc.rdc_rows(data, distributions, domains, random_state, n=2, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)

Split the samples using the RDC (Randomized Dependency Coefficient) method.

Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • n (int) – The number of clusters for KMeans.

  • k (int) – The size of the latent space.

  • s (float) – The standard deviation of the Gaussian distribution.

  • nl (Callable[[ndarray], ndarray]) – The nonlinear function to use.

Returns

A partitioning of the samples.

Return type

ndarray

deeprob.spn.learning.splitting.rdc.rdc_scores(data, distributions, domains, random_state, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)

Compute the RDC (Randomized Dependency Coefficient) score for each pair of features.

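Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • k (int) – The size of the latent space.

  • s (float) – The standard deviation of the Gaussian distribution.

  • nl (Callable[[ndarray], ndarray]) – The nonlinear function to use.

Returns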

The RDC score matrix.

Return type

ndarray

deeprob.spn.learning.splitting.rdc.rdc_cca(i, j, features)

Compute the RDC (Randomized Dependency Coefficient) using CCA (Canonical Correlation Analysis).

Parameters
  • i (int) – The index of the first feature.

  • j (int) – The index of the second feature.

  • features (List[ndarray]) – The list of the features.

Returns

The RDC coefficient (the largest canonical correlation coefficient).

Return type

float

deeprob.spn.learning.splitting.rdc.rdc_transform(data, distributions, domains, random_state, k=20, s=0.16666666666666666, nl=<ufunc 'sin'>)

Execute the RDC (Randomized Dependency Coefficient) pipeline on some data.

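Parameters
  • data (ndarray) – The data.

  • distributions (List[Type[Leaf]]) – The data distributions.

  • domains (List[Union[list, tuple]]) – The data domains.

  • random_state (RandomState) – The random state.

  • k (int) – The size of the latent space.

  • s (float) – The standard deviation of the Gaussian distribution.

  • nl (Callable[[ndarray], ndarray]) – The nonlinear function to use.

Returns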

The transformed data.

Raises

ValueError – If an unknown distribution type is found.

Return type

List[ndarray]
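
The RDC functions above can be read as a pipeline: transform each feature, project it through random nonlinear units, then take the largest canonical correlation between pairs of transformed features. The following is a simplified NumPy sketch of that general idea for two continuous features. All function names are hypothetical, the rank-based copula transform and the SVD-based CCA are choices made for this sketch, and the library's handling of discrete features and its exact scaling are not reproduced.

```python
import numpy as np

def random_features(x, rng, k=20, s=1.0 / 6.0, nl=np.sin):
    """Copula-transform one feature, then map it through k random nonlinear units."""
    n = len(x)
    u = (np.argsort(np.argsort(x)) + 1) / n        # empirical CDF values in (0, 1]
    augmented = np.column_stack([u, np.ones(n)])   # append a bias column
    weights = rng.normal(0.0, s, size=(2, k))      # Gaussian weights with std s
    return nl(augmented @ weights)

def orthonormal_basis(features, tol=1e-10):
    """Orthonormal basis of the centered columns, dropping near-null directions."""
    centered = features - features.mean(axis=0)
    u, sv, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, sv > tol * sv.max()]

def largest_canonical_correlation(fx, fy):
    """Largest canonical correlation between two sets of features."""
    qx, qy = orthonormal_basis(fx), orthonormal_basis(fy)
    if qx.shape[1] == 0 or qy.shape[1] == 0:
        return 0.0                                 # a constant feature carries no signal
    return float(np.linalg.svd(qx.T @ qy, compute_uv=False)[0])

rng = np.random.RandomState(0)
x = rng.normal(size=1000)
noise = rng.normal(size=1000)
dependent = largest_canonical_correlation(
    random_features(x, rng), random_features(x ** 3, rng))
independent = largest_canonical_correlation(
    random_features(x, rng), random_features(noise, rng))
```

Because x ** 3 is a monotone function of x, the two features share the same ranks and the coefficient is close to one, whereas the coefficient for the unrelated noise feature is much smaller.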

deeprob.spn.learning.splitting.rows module

deeprob.spn.learning.splitting.rows.SplitRowsFunc

A signature for a rows splitting function.

alias of Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]

deeprob.spn.learning.splitting.rows.split_rows_clusters(data, clusters)

Split the data horizontally given the clusters.

Parameters
  • data (ndarray) – The data.

  • clusters (ndarray) – The clusters.

Returns

(slices, weights) where slices is a list of partial data and weights is a list of the proportions of the local data with respect to the original data.

Return type

Tuple[List[ndarray], List[float]]
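
The horizontal split can be illustrated in the same spirit as the vertical one: rows are grouped by cluster label and each group is weighted by its share of the samples. The helper name split_rows_sketch is hypothetical and the code is a sketch of the documented inputs and outputs, not the library's implementation.

```python
import numpy as np

def split_rows_sketch(data, clusters):
    """Hypothetical helper: split rows by cluster label (illustration only)."""
    slices, weights = [], []
    n_samples = len(data)
    for c in np.unique(clusters):
        local = data[clusters == c]               # rows assigned to cluster c
        slices.append(local)
        weights.append(len(local) / n_samples)    # proportion of the original data
    return slices, weights

data = np.arange(8).reshape(4, 2)
clusters = np.array([0, 1, 1, 1])
slices, weights = split_rows_sketch(data, clusters)
```

The weights sum to one, which is what allows them to act as mixture proportions over the row partitions.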

deeprob.spn.learning.splitting.rows.get_split_rows_method(split_rows)

Get the rows splitting method given a string.

Parameters

split_rows (str) – The name of the method to get.

Returns

The corresponding rows splitting function.

Raises

ValueError – If the rows splitting method is unknown.

Return type

Callable[[ndarray, List[Type[Leaf]], List[Union[list, tuple]], RandomState, Any], ndarray]

Module contents