brainiak.fcma package

Full correlation matrix analysis

The implementation is based on the work in [Wang2015-1] and [Wang2015-2].

Wang2015-1

Full correlation matrix analysis (FCMA): An unbiased method for task-related functional connectivity”, Yida Wang, Jonathan D Cohen, Kai Li, Nicholas B Turk-Browne. Journal of Neuroscience Methods, 2015.

Wang2015-2

“Full correlation matrix analysis of fMRI data on Intel® Xeon Phi™ coprocessors”, Yida Wang, Michael J. Anderson, Jonathan D. Cohen, Alexander Heinecke, Kai Li, Nadathur Satish, Narayanan Sundaram, Nicholas B. Turk-Browne, Theodore L. Willke. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 2015.

Submodules

brainiak.fcma.classifier module

Full Correlation Matrix Analysis (FCMA)

Correlation-based training and prediction

class brainiak.fcma.classifier.Classifier(clf, num_processed_voxels=2000, epochs_per_subj=0)

Bases: BaseEstimator

Correlation-based classification component of FCMA

The classifier first computes correlation of the input data, and normalizes them if needed, then uses the given classifier to train and/or predict the correlation data. NOTE: if the classifier is sklearn.svm.SVC with precomputed kernel, the test data may be provided in the fit method to compute the kernel matrix together with the training data to save the memory usage, but the test data will NEVER be seen in the model training.

Parameters
  • clf (class) – The classifier used, normally a classifier class of sklearn

  • num_processed_voxels (int, default 2000) – Used for SVM with precomputed kernel, every time it only computes correlation between num_process_voxels and the whole mask to aggregate the kernel matrices. This is to save the memory so as to handle correlations at a larger scale.

  • epochs_per_subj (int, default 0) – The number of epochs of each subject within-subject normalization will be performed during classifier training if epochs_per_subj is specified default 0 means no within-subject normalization

training_data_

training_data_ is None except clf is SVM.SVC with precomputed kernel, in which case training data is needed to compute the similarity vector for each sample to be classified. However, if the test samples are also provided during the fit, the similarity vectors can be precomputed too and then training_data_ is None

Type

2D numpy array in shape [num_samples, num_features]

test_raw_data_

default None test_raw_data_ is set after a prediction is called, if the new input data equals test_raw_data_, test_data_ can be reused

Type

a list of 2D array in shape [num_TRs, num_voxels]

test_data_

default None test_data_ is set after a prediction is called, so that the test data does not need to be regenerated in the subsequent operations, e.g. getting decision values of the prediction. test_data_ may also be set in the fit method if sklearn.svm.SVC with precomputed kernel and the test samples are known. NOTE: the test samples will never be used to fit the model.

Type

2D numpy array in shape [num_samples, num_features]

num_voxels_

The number of voxels of the first brain region used in the classifier. The first brain region is always large. When training, this region may be divided to compute the correlation portion by portion. The brain regions are defined by the applied mask, e.g. the top voxels selected by FCMA voxel selection

Type

int

num_features_

The dimension of correlation data, normally is the product of the number of voxels of brain region 1 and the number of voxels of brain region 2. num_features_ must be consistent in both training and classification

Type

int

num_samples_

The number of samples

Type

int

num_digits_

The number of digits of the first value of the kernel matrix, for normalizing the kernel values accordingly

Type

int

decision_function(X=None)

Output the decision value of the prediction.

if X is not equal to self.test_raw_data_, i.e. predict is not called, first generate the test_data after getting the test_data, get the decision value via self.clf. if X is None, test_data_ is ready to be used

Parameters

X (Optional[list of tuple (data1, data2)]) – data1 and data2 are numpy array in shape [num_TRs, num_voxels] to be computed for correlation. default None, meaning that the data to be predicted have been processed in the fit method. Otherwise, X contains the activity data filtered by ROIs and prepared for correlation computation. len(X) is the number of test samples. if len(X) > 1: normalization is done on all test samples. Within list, all data1s must have the same num_voxels value, all data2s must have the same num_voxels value.

Returns

confidence

Return type

the predictions confidence values of X, in shape [len(X),]

fit(X, y, num_training_samples=None)

Use correlation data to train a model.

First compute the correlation of the input data, and then normalize within subject if more than one sample in one subject, and then fit to a model defined by self.clf.

Parameters
  • X (list of tuple (data1, data2)) – data1 and data2 are numpy array in shape [num_TRs, num_voxels] to be computed for correlation. They contain the activity data filtered by ROIs and prepared for correlation computation. Within list, all data1s must have the same num_voxels value, all data2s must have the same num_voxels value.

  • y (1D numpy array) – labels, len(X) equals len(y)

  • num_training_samples (Optional[int]) – The number of samples used in the training. Set it to construct the kernel matrix portion by portion so the similarity vectors of the test data have to be computed here. Only set num_training_samples when sklearn.svm.SVC with precomputed kernel is used. If it is set, only those samples will be used to fit the model.

Returns

self.

Return type

Classifier

predict(X=None)

Use a trained model to predict correlation data.

first compute the correlation of the input data, and then normalize across all samples in the list if there are more than one sample, and then predict via self.clf. If X is None, use the similarity vectors produced in fit to predict

Parameters

X (Optional[list of tuple (data1, data2)]) – data1 and data2 are numpy array in shape [num_TRs, num_voxels] to be computed for correlation. default None, meaning that the data to be predicted have been processed in the fit method. Otherwise, X contains the activity data filtered by ROIs and prepared for correlation computation. len(X) is the number of test samples. if len(X) > 1: normalization is done on all test samples. Within list, all data1s must have the same num_voxels value, all data2s must have the same num_voxels value.

Returns

y_pred

Return type

the predicted label of X, in shape [len(X),]

score(X, y, sample_weight=None)

Returns the mean accuracy on the given test data and labels.

NOTE: In the condition of sklearn.svm.SVC with precomputed kernel when the kernel matrix is computed portion by portion, the function will ignore the first input argument X.

Parameters
  • X (list of tuple (data1, data2)) – data1 and data2 are numpy array in shape [num_TRs, num_voxels] to be computed for correlation. They are test samples. They contain the activity data filtered by ROIs and prepared for correlation computation. Within list, all data1s must have the same num_voxels value, all data2s must have the same num_voxels value. len(X) is the number of test samples.

  • y (1D numpy array) – labels, len(X) equals len(y), which is num_samples

  • sample_weight (1D array in shape [num_samples], optional) – Sample weights.

Returns

score – Mean accuracy of self.predict(X) wrt. y.

Return type

float

set_fit_request(*, num_training_samples: Union[bool, None, str] = '$UNCHANGED$') Classifier

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

num_training_samples (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for num_training_samples parameter in fit.

Returns

self – The updated object.

Return type

object

set_score_request(*, sample_weight: Union[bool, None, str] = '$UNCHANGED$') Classifier

Request metadata passed to the score method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

New in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns

self – The updated object.

Return type

object

brainiak.fcma.mvpa_voxelselector module

Full Correlation Matrix Analysis (FCMA)

Activity-based voxel selection

class brainiak.fcma.mvpa_voxelselector.MVPAVoxelSelector(data, mask, labels, num_folds, sl)

Bases: object

Activity-based voxel selection component of FCMA

Parameters
  • data (4D array in shape [brain 3D + epoch]) – contains the averaged and normalized brain data epoch by epoch. It is generated by .io.prepare_searchlight_mvpa_data

  • mask (3D array) –

  • labels (1D array) – contains the labels of the epochs. It is generated by .io.prepare_searchlight_mvpa_data

  • num_folds (int) – the number of folds to be conducted in the cross validation

  • sl (Searchlight) – the distributed Searchlight object

run(clf)

run activity-based voxel selection

Sort the voxels based on the cross-validation accuracy of their activity vectors within the searchlight

Parameters

clf (classification function) – the classifier to be used in cross validation

Returns

  • result_volume (3D array of accuracy numbers) – contains the voxelwise accuracy numbers obtained via Searchlight

  • results (list of tuple (voxel_id, accuracy)) – the accuracy numbers of all voxels, in accuracy descending order the length of array equals the number of voxels

brainiak.fcma.preprocessing module

FCMA preprocessing.

class brainiak.fcma.preprocessing.RandomType(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Define the random types as enumeration

NORANDOM means do not randomize the image data; REPRODUCIBLE means randomize the image data with a fixed seed so that the permutation holds between different runs; UNREPRODUCIBLE means truly randomize the image data which returns different results in different runs.

NORANDOM = 0
REPRODUCIBLE = 1
UNREPRODUCIBLE = 2
brainiak.fcma.preprocessing.generate_epochs_info(epoch_list)

use epoch_list to generate epoch_info defined below

Parameters

epoch_list (list of 3D (binary) array in shape [condition, nEpochs, nTRs]) – Contains specification of epochs and conditions, assuming 1. all subjects have the same number of epochs; 2. len(epoch_list) equals the number of subjects; 3. an epoch is always a continuous time course.

Returns

epoch_info – label is the condition labels of the epochs; sid is the subject id, corresponding to the index of raw_data; start is the start TR of an epoch (inclusive); end is the end TR of an epoch(exclusive). Assuming len(labels) labels equals the number of epochs and the epochs of the same sid are adjacent in epoch_info

Return type

list of tuple (label, sid, start, end).

brainiak.fcma.preprocessing.prepare_fcma_data(images, conditions, mask1, mask2=None, random=RandomType.NORANDOM, comm=<mpi4py.MPI.Intracomm object>)

Prepare data for correlation-based computation and analysis.

Generate epochs of interests, then broadcast to all workers.

Parameters
  • images (Iterable[SpatialImage]) – Data.

  • conditions (List[UniqueLabelConditionSpec]) – Condition specification.

  • mask1 (np.ndarray) – Mask to apply to each image.

  • mask2 (Optional[np.ndarray]) – Mask to apply to each image. If it is not specified, the method will assign None to the returning variable raw_data2 and the self-correlation on raw_data1 will be computed

  • random (Optional[RandomType]) – Randomize the image data within subject or not.

  • comm (MPI.Comm) – MPI communicator to use for MPI operations.

Returns

  • raw_data1 (list of 2D array in shape [epoch length, nVoxels]) – the data organized in epochs, specified by the first mask. len(raw_data) equals the number of epochs

  • raw_data2 (Optional, list of 2D array in shape [epoch length, nVoxels]) – the data organized in epochs, specified by the second mask if any. len(raw_data2) equals the number of epochs

  • labels (list of 1D array) – the condition labels of the epochs len(labels) labels equals the number of epochs

brainiak.fcma.preprocessing.prepare_mvpa_data(images, conditions, mask)

Prepare data for activity-based model training and prediction.

Average the activity within epochs and z-scoring within subject.

Parameters
  • images (Iterable[SpatialImage]) – Data.

  • conditions (List[UniqueLabelConditionSpec]) – Condition specification.

  • mask (np.ndarray) – Mask to apply to each image.

Returns

  • processed_data (2D array in shape [num_voxels, num_epochs]) – averaged epoch by epoch processed data

  • labels (1D array) – contains labels of the data

brainiak.fcma.preprocessing.prepare_searchlight_mvpa_data(images, conditions, data_type=<class 'numpy.float32'>, random=RandomType.NORANDOM)

obtain the data for activity-based voxel selection using Searchlight

Average the activity within epochs and z-scoring within subject, while maintaining the 3D brain structure. In order to save memory, the data is processed subject by subject instead of reading all in before processing. Assuming all subjects live in the identical cube.

Parameters
  • images (Iterable[SpatialImage]) – Data.

  • conditions (List[UniqueLabelConditionSpec]) – Condition specification.

  • data_type – Type to cast image to.

  • random (Optional[RandomType]) – Randomize the image data within subject or not.

Returns

  • processed_data (4D array in shape [brain 3D + epoch]) – averaged epoch by epoch processed data

  • labels (1D array) – contains labels of the data

brainiak.fcma.util module

Full Correlation Matrix Analysis (FCMA)

Correlation related high performance routines

brainiak.fcma.util.compute_correlation(matrix1, matrix2, return_nans=False)

compute correlation between two sets of variables

Correlate the rows of matrix1 with the rows of matrix2. If matrix1 == matrix2, it is auto-correlation computation resulting in a symmetric correlation matrix. The number of columns MUST agree between set1 and set2. The correlation being computed here is the Pearson’s correlation coefficient, which can be expressed as

\[corr(X, Y) = \frac{cov(X, Y)}{\sigma_X\sigma_Y}\]

where cov(X, Y) is the covariance of variable X and Y, and

\[\sigma_X\]

is the standard deviation of variable X

Reducing the correlation computation to matrix multiplication and using BLAS GEMM API wrapped by Scipy can speedup the numpy built-in correlation computation (numpy.corrcoef) by one order of magnitude

\[\begin{split}corr(X, Y) &= \frac{\sum\limits_{i=1}^n (x_i-\bar{x})(y_i-\bar{y})}{(n-1) \sqrt{\frac{\sum\limits_{j=1}^n x_j^2-n\bar{x}}{n-1}} \sqrt{\frac{\sum\limits_{j=1}^{n} y_j^2-n\bar{y}}{n-1}}}\\ &= \sum\limits_{i=1}^n(\frac{(x_i-\bar{x})} {\sqrt{\sum\limits_{j=1}^n x_j^2-n\bar{x}}} \frac{(y_i-\bar{y})}{\sqrt{\sum\limits_{j=1}^n y_j^2-n\bar{y}}})\end{split}\]

By default (return_nans=False), returns zeros for vectors with NaNs. If return_nans=True, convert zeros to NaNs (np.nan) in output.

Parameters
  • matrix1 (2D array in shape [r1, c]) – MUST be continuous and row-major

  • matrix2 (2D array in shape [r2, c]) – MUST be continuous and row-major

  • return_nans (bool, default:False) – If False, return zeros for NaNs; if True, return NaNs

Returns

corr_data – continuous and row-major in np.float32

Return type

2D array in shape [r1, r2]

brainiak.fcma.voxelselector module

Full Correlation Matrix Analysis (FCMA)

Correlation-based voxel selection

class brainiak.fcma.voxelselector.VoxelSelector(labels, epochs_per_subj, num_folds, raw_data, raw_data2=None, voxel_unit=64, process_num=4, master_rank=0)

Bases: object

Correlation-based voxel selection component of FCMA.

Parameters
  • labels (list of 1D array) – the condition labels of the epochs len(labels) labels equals the number of epochs

  • epochs_per_subj (int) – The number of epochs of each subject

  • num_folds (int) – The number of folds to be conducted in the cross validation

  • raw_data (list of 2D array in shape [epoch length, nVoxels]) –

    Assumption: 1. all activity data contains the same number of voxels
    1. the activity data has been z-scored, ready to compute correlation as matrix multiplication

    2. all subjects have the same number of epochs

    3. epochs belonging to the same subject are adjacent in the list

    4. if MPI jobs are running on multiple nodes, the path used must be on a filesystem shared by all nodes

  • raw_data2 (Optional, list of 2D array in shape [epoch length, nVoxels]) – raw_data2 shares the data structure of the assumptions of raw_data If raw_data2 is None, the correlation will be computed as raw_data by raw_data. If raw_data2 is specified, len(raw_data) MUST equal len(raw_data2), the correlation will be computed as raw_data by raw_data2.

  • voxel_unit (int, default 64) – The number of voxels assigned to a worker each time

  • process_num (Optional[int]) – The maximum number of processes used in cross validation. If None, the number of processes will equal the number of available hardware threads, considering cpusets restrictions. If 0, cross validation will not use python multiprocessing.

  • master_rank (int, default 0) – The process which serves as the master

run(clf)

Run correlation-based voxel selection in master-worker model.

Sort the voxels based on the cross-validation accuracy of their correlation vectors

Parameters

clf (classification function) – the classifier to be used in cross validation

Returns

results – the accuracy numbers of all voxels, in accuracy descending order the length of array equals the number of voxels

Return type

list of tuple (voxel_id, accuracy)