Representational Similarity Analysis

V.0.2 - beta, contributions

Oranges and orange ping-pong balls are not kept next to each other in the grocery store. The oranges are usually placed with the other fruits, and the ping-pong balls with other sports items. This helps us shop easily, because we usually group our shopping by item categories: fruits, vegetables, drinks, and finally, somewhere far away from the food items, the ping-pong balls. Does this conceptual grouping have a neural basis, i.e., are patterns of neural activity for oranges and apples more similar to each other than the patterns for oranges and ping-pong balls?

One could conceivably group items based on other attributes, such as shape and color. This grouping would place oranges and orange ping-pong balls close to each other in neural representations. In a brain region that preferentially encodes color, the neural similarity would be greater for oranges and orange ping-pong balls than for oranges and red apples. How can we determine these similarities, and how far apart are different groups of categories in their neural representations?

Representational Similarity Analysis (RSA) is a way to compare and contrast different brain states and to quantify the space in which they are embedded. In RSA, we compute a similarity measure (often a correlation) between the patterns of neural activity evoked by each pair of items. To examine whether neural patterns in a brain region are grouped by color, by shape, or by edibility, we can reorder the similarity measures according to our model: items that the model places next to each other should show higher similarities than items placed far apart.

RSA is a highly versatile tool: it can be used to compare brain activity to models, to compare data across brain imaging techniques, and even to make cross-species comparisons. You can learn more about RSA by reading the following articles:

Goal of this script

  1. Learn to compute RSA on a dataset

    Calculate and plot Pearson and Spearman correlations in ROIs
    Order these similarity matrices in a meaningful way
    Interpret a (dis)similarity matrix

  2. Learn to work with Multi-Dimensional Scaling (MDS) as a tool to visualize similarity results

Table of Contents

1. Load Data

1.1 Load the data for one subject
1.2 Helper Functions
1.3 Visualize the data

2. Create a similarity matrix

2.1 Reorder data into categories
2.2 How to read a similarity matrix
2.3 Representational dissimilarity
2.4 Comparison of representations in monkeys and humans

3. Manifolds and multi-dimensional scaling (MDS)

3.1 Plotting RDM in 2d
3.2 Plotting RDM in 3d

Exercises

Exercises 1-10

Novel contribution

Dataset

The dataset we will be using for this exercise is from Kriegeskorte et al. (2008), called 'Ninetysix' in the datasets folder. 96 visual stimuli, from the 6 categories listed below, were presented to participants. The image stimuli are stored in the subfolder Stimuli.

The data have 6 categories:

1. artificial inanimate (object/scene)
2. human bodypart 
3. human face 
4. natural inanimate (object/scene)
5. nonhuman bodypart
6. nonhuman face

Self-study: Explore the data

In [1]:
import warnings
import sys 
if not sys.warnoptions:
    warnings.simplefilter("ignore")

import os
import numpy as np
import pandas as pd
import scipy.io
from scipy import stats
from sklearn.manifold import MDS
import scipy.spatial.distance as sp_distance

import matplotlib.pyplot as plt
import seaborn as sns 
from mpl_toolkits.mplot3d import Axes3D

# %matplotlib notebook
%matplotlib inline
%autosave 5
sns.set(style = 'white', context='poster', rc={"lines.linewidth": 2.5})
Autosaving every 5 seconds

1.1 Load the data for one subject

The data for these participants are stored as MATLAB files (it was 2008 after all...). Python can load MATLAB files using a scipy utility. However, the file formats can be tricky and may require transformations to make them compatible with typical Python data organization.

The data from MATLAB are loaded as a dictionary, where the variables in the MATLAB file become keys in the dictionary:

The ROI names are listed in roinames.
The category label indices for each condition are listed in labels.
The label indices correspond to entries in categoryNames.
The data for each ROI, in the order of roinames, is stored in roi_data.

Each row of roi_data represents a stimulus (as defined in labels) and each column represents a voxel (there will be different numbers of voxels in different ROIs). These data have been preprocessed and each entry is stored in terms of t-values. There is no time information and no baseline signal to be used for reference.
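A minimal sketch of peeking at this structure directly, assuming the dataset path and subject ('BE') used later in this notebook:

In [ ]:
# Sketch: inspect the raw .mat contents for one subject (the path is an assumption; adjust to your setup)
import os
import scipy.io

ns_data_dir = '/home/NEU480/datasets/NinetySix/'
data = scipy.io.loadmat(os.path.join(ns_data_dir, 'BE_roi_data.mat'))

# the MATLAB variables appear as keys of the dictionary
print([k for k in data.keys() if not k.startswith('__')])

# each ROI's data matrix is stimuli x voxels
roi_data_all = data['roi_data']
print(roi_data_all[0, 0].shape)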

Self-study: What do you think these t-values reflect?

The last 4 rows of the dataset have unknown labels (dun dun dunnnnn!). We'll use only the first 92 rows for analysis, for now.

In the analyses that follow we are going to explore the data of subject 'BE'.

1.2 Helper Functions

To make it easy for you to achieve the main goals of this notebook, we have created helper functions that do the heavy lifting of extracting the data from the MATLAB files and converting them into numpy arrays.

In [2]:
"""constants"""

# Load the data
# Make sure you edit the following line to reflect the directory where you are accessing the dataset
# dir = '/gpfs/milgram/data/cmhn-s18/datasets/Ninetysix/'  # Yale
# dir = '/jukebox/pniintel/brainiak_edu/datasets/NinetySix/'  # Princeton
ns_data_dir = '/home/NEU480/datasets/NinetySix/'

all_subj_initials = {'BE', 'KO', 'SN', 'TI'}
rois_to_remove = ['lLO', 'rLO']
rois_to_keep = ['lFFA', 'rFFA', 'lPPA', 'rPPA']


"""helper funcs
"""

def load_data_for_a_subj(subj_initials):
    assert subj_initials in all_subj_initials
    images = scipy.io.loadmat(
        os.path.join(ns_data_dir, '%s_images.mat' % (subj_initials))
    )['images']  
    data = scipy.io.loadmat(
        os.path.join(ns_data_dir, '%s_roi_data.mat' % (subj_initials))
    ) 
    # unpack metadata 
    roi_data_all = data['roi_data']
    roi_names = data['roinames']
    labels = np.array(data['labels'])
    categoryNames = data['categoryNames']

    # re-format metadata labels and ROIs
    n_categories = categoryNames.shape[1]
    n_rois = roi_names.shape[1]
    categories = [categoryNames[0, i][0] for i in range(n_categories)]
    roi_names = [roi_names[0, i][0] for i in range(n_rois)]
    labels = np.squeeze(labels) 
    label_dict = {categories[i]: i+1 for i in range(len(categories))}

    # remove r/lLO
    roi_data = []
    for r in range(n_rois): 
        if roi_names[r] in rois_to_keep: 
            roi_data.append(roi_data_all[0, r])
    roi_names = rois_to_keep
    n_rois = len(rois_to_keep)
    return images, roi_data, roi_names, n_rois, categories, n_categories, labels, label_dict


def digitize_rdm(rdm_raw, n_bins = 10): 
    """Digitize an input matrix to n bins (10 bins by default)
    rdm_raw: a square matrix 
    """
    # compute the bins 
    
    rdm_bins = [np.percentile(np.ravel(rdm_raw), 100/n_bins * i) for i in range(n_bins)]
    # compute the vectorized digitized value 
    rdm_vec_digitized = np.digitize(np.ravel(rdm_raw), bins = rdm_bins) * (100 // n_bins) 
    # reshape to matrix
    rdm_digitized = np.reshape(rdm_vec_digitized, np.shape(rdm_raw)) 
    return rdm_digitized
In [3]:
# load the data for one subject
subj_initials = 'BE'
data_pack = load_data_for_a_subj(subj_initials)
images, roi_data, roi_names, n_rois, categories, n_categories, labels, label_dict = data_pack
n_stimuli = len(images)

Exercise 1: Inspect the data.

  • Print the name of each category.
  • Check the shape of the data for each ROI.
In [4]:
# Insert code here
# Print the labels
print('Category name:')
print(categories)
print()

# Print the ROIs
print('ROIs:')
for i in range(n_rois):
    print('%s\t %s' % (roi_names[i], np.shape(roi_data[i])))
Category name:
['artificial inanimate', 'human bodypart', 'human face', 'natural inanimate', 'nonhuman bodypart', 'nonhuman face']

ROIs:
lFFA	 (96, 34)
rFFA	 (96, 70)
lPPA	 (96, 348)
rPPA	 (96, 273)

1.3 Visualize the data

It is always a good idea to plot data as a sanity check before starting analysis.

We also want to see the corresponding labels. Notice the category order is random.

In [5]:
plt.figure(figsize=(9, 7))

# Data plot
plt.subplot(2,1,1)
# Plot the activity per stimulus for a single voxel
plt.plot(roi_data[0][:, 0])  
plt.ylabel('t-Value', fontsize=16)
plt.xlim(0, 96)

# Label plot
plt.subplot(2,1,2)
plt.plot(labels)
plt.xlabel('Stimuli', fontsize=16)
plt.ylabel('Category', fontsize=16)
Out[5]:
Text(0,0.5,'Category')
Notice that the category order is random, i.e., neighboring stimuli typically come from different categories.
In [6]:
# Display the distribution of t-values across stimuli for a single voxel in this ROI.
plt.figure()
plt.ylabel('Stimulus frequency', fontsize=16)
plt.xlabel('t-Value', fontsize=16)
plt.hist(roi_data[0][:, 0])
Out[6]:
(array([ 2., 11., 13., 12., 23., 17., 10.,  4.,  3.,  1.]),
 array([-3.07862902, -2.30191212, -1.52519522, -0.74847832,  0.02823858,
         0.80495548,  1.58167238,  2.35838928,  3.13510618,  3.91182308,
         4.68853998]),
 <a list of 10 Patch objects>)

2. Create a similarity matrix

Let's examine the similarity of the neural representation of each image with the neural representation of every other image in the dataset. If the neural patterns of two images are similar, we will see a high similarity value; if they are dissimilar, we will see a low similarity value.

There are many ways to compute similarity. We start with one of the most common measures, which you are already familiar with: the Pearson correlation (see notebook-04). We compute the Pearson correlation between the neural pattern for each image and the pattern for every other image. We can compute this for each of the ROIs that we have just loaded (left and right FFA, and left and right PPA). For each ROI, the computation results in a 92 x 92 matrix (we only have labels for 92 images). This matrix shows how similar the neural activity patterns are between images and is called the representational similarity matrix (RSM).

In [7]:
roi_names
Out[7]:
['lFFA', 'rFFA', 'lPPA', 'rPPA']
In [8]:
# choose your ROI here!
roi_id = 1

# plot figure of these correlations
f, ax = plt.subplots(1,1, figsize=(8, 7))

plt.imshow(
    np.corrcoef(roi_data[roi_id]), 
    cmap='jet', 
)
plt.colorbar()
ax.set_title('RSM, unsorted, %s' % (roi_names[roi_id])) 
ax.set_xlabel('stimuli id')
ax.set_ylabel('stimuli id')
Out[8]:
Text(0,0.5,'stimuli id')

Exercise 2: In the plot above you used Pearson correlation to compute similarity. An alternative metric is a Spearman correlation.

  • Explain the difference and redo the analysis above with Spearman correlation.
  • Visualize the RSM based on Spearman correlation.

A:

In [9]:
# Insert code here
from scipy.stats import spearmanr

# plot figure of these correlations
f, ax = plt.subplots(1,1, figsize=(8, 7))


# Note: spearmanr treats columns as variables by default; axis=1 makes each row (stimulus) a variable,
# so the result is a stimulus-by-stimulus matrix, matching np.corrcoef above.
rank_srm, _ = spearmanr(roi_data[roi_id], axis=1)

plt.imshow(
    rank_srm, 
    cmap='jet', 
)
plt.colorbar()
ax.set_title('RSM, unsorted, %s' % (roi_names[roi_id])) 
ax.set_xlabel('stimuli id')
ax.set_ylabel('stimuli id')
Out[9]:
Text(0,0.5,'stimuli id')

2.1 Reorder data into categories

Although the plot above is useful, it is hard to see any structure because the order of the stimuli is random. To make the structure visible, let's reorganize the stimuli into their label groups.

In [10]:
# Sort the data by stimulus condition label, collecting rows from the same condition together.
sort_ids = labels.argsort()
lffa_sorted = roi_data[0][sort_ids, :]

plt.figure(figsize=(9,7))

# Plot the new sorted results
plt.subplot(2,1,1)
plt.plot(lffa_sorted[:,0])
plt.ylabel('t-Value', fontsize=16)
plt.xlim(0, 96)

plt.subplot(2,1,2)
plt.plot(labels[sort_ids])
plt.xlabel('Stimuli', fontsize=16)
plt.ylabel('Category', fontsize=16)
plt.show()
In [11]:
# choose your ROI here! 
roi_id = 1

# calculate the RSM
rsm = np.corrcoef(roi_data[roi_id][sort_ids, :][:92,])

# plot 
f, ax = plt.subplots(1,1, figsize=(10, 8))
plt.imshow(
    rsm, cmap='jet', 
    vmin=-.6, vmax=.6
)
plt.colorbar()

# Pull out the bin edges between the different categories
binsize = np.histogram(labels[:92,], 6)[0]
edges = np.concatenate([np.asarray([0]), np.cumsum(binsize)])[:-1]
ax.set_xticks(list(np.array(edges)+8))
ax.set_xticklabels(categories, rotation = 30)
ax.set_yticks(list(np.array(edges)+8))
ax.set_yticklabels(categories)
ax.vlines(edges,0,92)
ax.hlines(edges,0,92)
ax.set_title('RSM, sorted, %s' % roi_names[roi_id])
    
# f.tight_layout()
Out[11]:
Text(0.5,1,'RSM, sorted, rFFA')
Binning the data: In Figure 1 of Kriegeskorte et al. (2008), the raw correlation values were binned into ten bins based on the percentile score of the dissimilarity value, and the percentile value was plotted. We have created a function `digitize_rdm` to perform the same calculation here and make the plots similar to Figure 1 in Kriegeskorte et al. (2008). The `digitize_rdm` function works in the following manner:

> 1. Create `n_bins` of percentile values.
> 2. Take in the matrix of correlations and reshape it into a single row.
> 3. Determine the percentile value of every correlation point and assign it to a bin (`np.digitize` does this).
> 4. Reshape the assigned percentile values into the original correlation matrix shape.

Finally, plot the percentile values.
In [12]:
# plot 
f, ax = plt.subplots(1,1, figsize=(10, 8))
plt.imshow(
    digitize_rdm(rsm), cmap='jet', 
)
plt.colorbar()

# Pull out the bin edges between the different categories
binsize = np.histogram(labels[:92,], 6)[0]
edges = np.concatenate([np.asarray([0]), np.cumsum(binsize)])[:-1]
ax.set_xticks(list(np.array(edges)+8))
ax.set_xticklabels(categories, rotation = 30)
ax.set_yticks(list(np.array(edges)+8))
ax.set_yticklabels(categories)
ax.vlines(edges,0,92)
ax.hlines(edges,0,92)
ax.set_title('RSM, sorted, %s' % roi_names[roi_id])
    
# f.tight_layout()
Out[12]:
Text(0.5,1,'RSM, sorted, rFFA')

Exercise 3: This new organization is helpful but could be improved (based on our knowledge of the brain). Order the datapoints so that the categories are as follows: human face, human body part, non-human face, non-human body part, natural inanimate and artificial inanimate. This will make for a nicer looking correlation matrix and will help you see any structure within and between categories.

  • Write a function to re-order the data.
  • Recompute the RSM based on the re-ordered data and visualize it.
  • Visualize the digitized RSM using the digitization function provided earlier.
In [13]:
# Insert code here

def reorder_data(X, y, new_order):
    X_ordered = np.zeros(np.shape(X))
    y_ordered = []
    m = 0 
    for i in new_order: 
        # get masked data 
        i_mask_temp = y == i
        X_i = X[i_mask_temp, :]
        y_i = list(y[i_mask_temp])
        # collect the data in order
        m_, _ = np.shape(X_i)
        X_ordered[m:m+m_,: ] = X_i
        y_ordered += [int(y_i[j]) for j in range(len(y_i))]
        # increment the range
        m+=m_
    return X_ordered, y_ordered


# pick an ROI
roi_id = 1
new_order = [2, 3, 5, 6, 4, 1]
categories_new_order = [categories[new_order[i]-1] for i in range(n_categories)]

X = roi_data[roi_id][:92,]
y = labels[:92] 
X_ordered, y_ordered = reorder_data(X, y, new_order)

print(y_ordered)
print(categories_new_order)
[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
['human bodypart', 'human face', 'nonhuman bodypart', 'nonhuman face', 'natural inanimate', 'artificial inanimate']
In [14]:
# calculate the RSM
rsm = np.corrcoef(X_ordered)

# plot 
f, ax = plt.subplots(1,1, figsize=(10, 8))
plt.imshow(
    digitize_rdm(rsm, n_bins=10), 
    cmap='jet', 
)
plt.colorbar()

# Pull out the bin edges between the different categories
binsize = np.histogram(y, 6)[0]
edges = np.concatenate([np.asarray([0]), np.cumsum(binsize)])[:-1]
ax.set_xticks(list(np.array(edges)+8))
ax.set_xticklabels(categories_new_order, rotation = 30)
ax.set_yticks(list(np.array(edges)+8))
ax.set_yticklabels(categories_new_order)
ax.vlines(edges,0,92)
ax.hlines(edges,0,92)
ax.set_title('RSM, sorted, %s' % roi_names[roi_id])
Out[14]:
Text(0.5,1,'RSM, sorted, rFFA')

2.2 How to read a similarity matrix

It is important to be able to read a similarity matrix at a glance. There are a couple of features to look out for, and we will highlight them with some toy data.

Imagine a scenario where we have 12 trial types (e.g., images) grouped into 4 categories (e.g., faces, objects, body parts, and scenes). We are going to simulate some data that has similar structure within category but different structure between categories.

2.2.1. Create simulated data.

In [15]:
trial_types = 12
n_sim_categories = 4
repetitions_per_trial = 24
trial_noise = 0.5 # multiplying factor for the noise for each trial.
trials_per_category = int(trial_types / n_sim_categories)

# The template per category. One common signal per category. 
#This will be the similarity within category.
category_templates = np.random.randn(n_sim_categories, repetitions_per_trial)

# Add some noise to each trial and add the category template to each trial.
# This will create the trial activity.
trials = np.zeros((trial_types, repetitions_per_trial))
for category_counter in range(n_sim_categories):
    category_trials = np.random.randn(trials_per_category, repetitions_per_trial) * trial_noise
    for trial_counter in range(trials_per_category):
        trials[(trials_per_category * category_counter) + trial_counter, :] = category_templates[category_counter, :] + category_trials[trial_counter, :]

2.2.2. Compute correlation and plot the result.

In [16]:
# Create the correlation matrix    
sim_r1 = np.corrcoef(trials)
plt.figure(figsize=(8, 8))
plt.imshow(sim_r1, interpolation='none', cmap='jet')
plt.colorbar()
Out[16]:
<matplotlib.colorbar.Colorbar at 0x7f4298390710>

In the above plot you should see some clear blocking structure along the diagonal: items within a category are more similar to one another than they are to items in different categories.

2.2.3. Make two categories similar and observe changes in the similarity matrix

Below we create a plot where there is off-diagonal structure. High similarity in off-diagonal parts of a similarity matrix means that elements that are far apart in the ordering have similar structure. In this toy simulation we create an example where the first and third categories are similar to one another (i.e., faces and body parts).

In [17]:
# Overwrite the template for the 3rd category with the template for 1st category. 
#Python indexing begins at [0].
category_templates[2, :] = category_templates[0, :]

# Create the trial activity
trials = np.zeros((trial_types, repetitions_per_trial))
for category_counter in range(n_sim_categories):
    category_trials = np.random.randn(trials_per_category, repetitions_per_trial) * trial_noise
    for trial_counter in range(trials_per_category):
        trials[(trials_per_category * category_counter) + trial_counter, :] = category_templates[category_counter, :] + category_trials[trial_counter, :]

# Create the correlation matrix    
sim_r2 = np.corrcoef(trials)
plt.figure(figsize=(8, 8))
plt.imshow(sim_r2, interpolation='none', cmap='jet')
plt.colorbar()
Out[17]:
<matplotlib.colorbar.Colorbar at 0x7f4298336438>

Exercise 4: Now use your new knowledge about reading a similarity matrix to interpret the matrix you created for the real data in Exercise 3.

A:

2.3 Representational dissimilarity

The previous analyses framed everything in terms of similarity between items. However, people sometimes prefer to think about this type of data in terms of dissimilarity. This close cousin of the similarity matrix is called the representational dissimilarity matrix (RDM). The dissimilarity matrix is computed simply as 1 - correlation.
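As a minimal sketch (reusing the reordered data matrix `X_ordered` from Exercise 3):

In [ ]:
# Sketch: convert a correlation-based RSM into an RDM
rdm = 1 - np.corrcoef(X_ordered)
print(np.allclose(np.diag(rdm), 0))  # every item is identical to itself
print(np.allclose(rdm, rdm.T))       # dissimilarity is symmetric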

Exercise 5: Plot the RDM for the right FFA ROI using the new order you created in Exercise 3.

In [18]:
# Insert code here

# calculate the RSM
rsm = np.corrcoef(X_ordered)

# plot 
f, ax = plt.subplots(1,1, figsize=(10, 8))
plt.imshow(
    digitize_rdm(1-rsm, n_bins=10), 
    cmap='jet', 
)
plt.colorbar()

# Pull out the bin edges between the different categories
binsize = np.histogram(y, 6)[0]
edges = np.concatenate([np.asarray([0]), np.cumsum(binsize)])[:-1]
ax.set_xticks(list(np.array(edges)+8))
ax.set_xticklabels(categories_new_order, rotation = 30)
ax.set_yticks(list(np.array(edges)+8))
ax.set_yticklabels(categories_new_order)
ax.vlines(edges,0,92)
ax.hlines(edges,0,92)
ax.set_title('RDM, sorted, %s' % roi_names[roi_id])
Out[18]:
Text(0.5,1,'RDM, sorted, rFFA')

Exercise 6: For RDM plots based on correlation values, what does an RDM value greater than 1 correspond to in terms of a correlation?

A:

2.4 Comparison of representations in monkeys and humans

RSA can be used to compare information not just within humans, but across species too. Below is a comparison of the RDMs for monkeys and humans in inferior temporal cortex (Figure 1 in Kriegeskorte et al. (2008)).

In [19]:
# load the data, and bin to percentile
monkeyRDM = pd.read_csv(os.path.join(ns_data_dir, 'RDM_mIT_fig1.txt'), header=None)
humanRDM = pd.read_csv(os.path.join(ns_data_dir, 'RDM_hIT_fig1_manoj.txt'), header=None)

monkey_rsm_digitized = digitize_rdm(1 - monkeyRDM)
human_rsm_digitized = digitize_rdm(1 - humanRDM)
f, axes = plt.subplots(1,2, figsize = (14, 6))
axes[0].imshow(
    monkey_rsm_digitized, 
    cmap='jet', 
)
axes[1].imshow(
     human_rsm_digitized, 
     cmap='jet', 
)
# plt.colorbar()
axes[0].set_title('Monkey RSM')
axes[1].set_title('Human RSM')

#for i in range(2): 
#    axes[i].set_xlabel('stimuli id')
#    axes[i].set_ylabel('stimuli id')
Out[19]:
Text(0.5,1,'Human RSM')
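The plots above give a qualitative comparison. A sketch of one way to quantify the correspondence is to correlate the two RDMs over their off-diagonal entries (the choice of a Spearman correlation here is our own addition, not part of the original figure):

In [ ]:
# Sketch: quantify monkey-human RDM correspondence over the upper triangles
monkey_rdm = monkeyRDM.values
human_rdm = humanRDM.values
iu = np.triu_indices_from(monkey_rdm, k=1)  # each off-diagonal pair counted once
rho, p = stats.spearmanr(monkey_rdm[iu], human_rdm[iu])
print('Spearman rho = %.3f, p = %.3g' % (rho, p))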

3. Manifolds and multi-dimensional scaling (MDS)

The correlation matrix for the 92 images describes how similar each item is to every other item. If two items have a high positive correlation, they can be thought of as eliciting very similar activation patterns across voxels. We can reframe this as a distance in a high-dimensional space: items that are similar to one another are grouped close together, and far away from items they are dissimilar to.

Multi-Dimensional Scaling (MDS) allows us to visualize the similarity of our data in a different way than plotting the matrices as we did above. This method is built into scikit-learn.

In [20]:
#Create a 2-D MDS
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)  # Create the MDS object
results = mds.fit(digitize_rdm(1 - rsm))  # Use the dissimilarity matrix

Exercise 7: How does changing the order of the data (e.g., shuffling the rows/columns) in your RDM affect the distance between points calculated by MDS?

A:

3.1 Plot the 2D structure of the RDM

We'll plot the 92 images on a "map" that shows how close or far apart the images are from each other. We use different colors for the 6 categories of images.

In [21]:
coords = results.embedding_

plt.figure(figsize=(10, 7))
color_list = sns.color_palette('colorblind')
for label_counter in np.unique(labels[:92]):
    label_idxs = (labels[:92] == label_counter)[:]
    plt.scatter(
        coords[label_idxs, 0], coords[label_idxs, 1], 
        marker = 'o', s = 50, c=color_list[int(label_counter - 1)]
        )
plt.legend(categories, bbox_to_anchor=(1, .8), loc="upper left")
plt.title('MDS, 2D')
Out[21]:
Text(0.5,1,'MDS, 2D')

Self-study: On the MDS plot above, each item is currently plotted as a point. You could instead load each image and plot it directly at its MDS coordinates to get a feel for which images are being grouped together.

3.2 Plot the 3D structure of the RDM

MDS tries to find a k-dimensional embedding that minimizes the stress (a measure of how much the embedded distances distort the original dissimilarities). This means we can compute the embedding in arbitrarily many dimensions to try to capture the structure of the data. Below we make a 3D plot.

In [22]:
# Multi-dimensional scaling
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
results = mds.fit(digitize_rdm(1 - rsm))

coords = results.embedding_

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
for label_counter in np.unique(labels[:92]):
    label_idxs = (labels[:92] == label_counter)[:]
    ax.scatter(
        coords[label_idxs, 0], coords[label_idxs, 1], coords[label_idxs, 2], 
        marker = 'o', s = 50, c=color_list[int(label_counter - 1)]
    )
plt.legend(categories, bbox_to_anchor=(1,.7), loc="upper left")
plt.title('MDS, 3D')
plt.tight_layout()

There are tools available to evaluate the appropriate dimensionality for visualizing our data (i.e., the dimensionality at which the stress is sufficiently low). We can compute the stress of the MDS fit for different numbers of components (i.e., different dimensionalities) and determine what dimensionality of the data is appropriate.

Let's start with a toy problem to get a handle on what this should look like. We are going to make points that lie on a line in 3-dimensional space. Because a line carries only one dimension of information (position along its length), the data ought to be reducible to a single dimension. We will run MDS on these data to see if that is the case.

In [23]:
coords = np.linspace(1, 30, 30)
coords = np.vstack((coords, coords, coords)).T

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(coords[:, 0], coords[:, 1], coords[:, 2])
Out[23]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x7f42a02a69e8>
In [24]:
# Calculate the euclidean distance of every point from every other point
dist = sp_distance.squareform(sp_distance.pdist(coords))

# Iterate through different numbers of components
stress = []
for components in range(1, 6):
    mds = MDS(n_components=components, dissimilarity="precomputed", random_state=0)
    # Pull out the stress of the MDS fit
    stress.append(mds.fit(dist).stress_)

# Plot the stress
plt.figure(); 
plt.plot(range(1, 6), stress)
plt.ylabel('Stress')
plt.xlabel('Dimensionality')
Out[24]:
Text(0.5,0,'Dimensionality')

Exercise 8: It is not typical for real data to be described accurately in a low-dimensional space: stress generally decreases as the number of components increases. Perform a similar analysis below to estimate the appropriate dimensionality for visualizing the RDM of the left FFA data from this participant. What is an appropriate lower-dimensional representation of the data? Note: Make sure you compute the MDS fit (and its stress) from the RDM itself, not from an MDS embedding.

A:

In [25]:
# Insert code here
# Compute the RDM for the left FFA (roi_data[0]); the stimulus order does not affect the stress
rdm_lffa = 1 - np.corrcoef(roi_data[0][:92, ])

# Iterate through different numbers of components
stress = []
n_max_components = 10
for components in range(1, n_max_components):
    mds = MDS(n_components=components, dissimilarity="precomputed", random_state=0)
    # Pull out the stress of the MDS fit to the RDM
    stress.append(mds.fit(rdm_lffa).stress_)

# Plot the stress
plt.figure(); 
plt.plot(range(1, n_max_components), stress)
plt.ylabel('Stress')
plt.xlabel('Dimensionality')
Out[25]:
Text(0.5,0,'Dimensionality')

Exercise 9: Compute RDMs and create MDS plots for the left PPA and right PPA using the reordering you created above.

In [26]:
roi_names
Out[26]:
['lFFA', 'rFFA', 'lPPA', 'rPPA']
In [27]:
# Insert code here
lPPA_id = 2
rPPA_id = 3
X_lPPA = roi_data[lPPA_id][:92,]
X_rPPA = roi_data[rPPA_id][:92,]
y = labels[:92] 

# re-order data
X_lPPA_ordered, y_ordered = reorder_data(X_lPPA, y, new_order)
X_rPPA_ordered, y_ordered = reorder_data(X_rPPA, y, new_order)

# calculate the RSMs
rsm_lPPA = np.corrcoef(X_lPPA_ordered)
rsm_rPPA = np.corrcoef(X_rPPA_ordered)
In [28]:
f, axes = plt.subplots(1, 2, figsize = (10, 5))

axes[0].imshow(digitize_rdm(1 - rsm_lPPA), cmap='jet')
axes[0].set_title('RDM, sorted, %s' % roi_names[lPPA_id])
axes[1].imshow(digitize_rdm(1 - rsm_rPPA), cmap='jet')
axes[1].set_title('RDM, sorted, %s' % roi_names[rPPA_id])
Out[28]:
Text(0.5,1,'RDM, sorted, rPPA')
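The exercise also asks for MDS plots; here is a sketch of that remaining step, reusing the 2D plotting approach from section 3.1 (our own completion, not from the original notebook):

In [ ]:
# Sketch: 2-D MDS plots for lPPA and rPPA, based on their digitized RDMs
f, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, rsm_roi, roi_lbl in zip(axes, [rsm_lPPA, rsm_rPPA], [roi_names[lPPA_id], roi_names[rPPA_id]]):
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    coords = mds.fit(digitize_rdm(1 - rsm_roi)).embedding_
    for label_counter in np.unique(y_ordered):
        label_idxs = (np.array(y_ordered) == label_counter)
        ax.scatter(coords[label_idxs, 0], coords[label_idxs, 1],
                   marker='o', s=50, c=color_list[int(label_counter - 1)])
    ax.set_title('MDS, 2D, %s' % roi_lbl)
axes[1].legend(categories, bbox_to_anchor=(1, .8), loc="upper left")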

Exercise 10: The last four rows in the dataset for subject BE have unknown labels. One of them is a human face. Using the techniques outlined here and your knowledge of the brain regions (LOC, PPA, and FFA), analyze the data to make a best guess about which one of the 4 data points is a human face. Show the work and reasoning that led you to this conclusion. Even if you aren't right, it is more about your process. Hint: It would really help to be able to visualize these 4 points amongst the points you already have.

In [50]:
# Insert code here
from sklearn.neighbors import NearestCentroid

roi_id = 1

# load data 
X = roi_data[roi_id]
y = labels
# sep data 
X_train = X[:92,:]
y_train = y[:92]
X_test = X[92:, :]
# fit model 
nc_model = NearestCentroid(metric='cosine')
nc_model.fit(X_train, y_train)
y_hats = nc_model.predict(X_test)
In [52]:
print(label_dict)
print(label_dict['human face'])
print(y_hats)
{'artificial inanimate': 1, 'human bodypart': 2, 'human face': 3, 'natural inanimate': 4, 'nonhuman bodypart': 5, 'nonhuman face': 6}
3
[3. 1. 1. 6.]
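Following the hint above, a sketch of one way to visualize the four unlabeled stimuli among the labeled ones (our own addition, reusing the MDS approach from section 3):

In [ ]:
# Sketch: embed all 96 stimuli (including the 4 unlabeled ones) with MDS and highlight the unknowns
rsm_all = np.corrcoef(roi_data[roi_id])  # rFFA, all 96 stimuli
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords_all = mds.fit(digitize_rdm(1 - rsm_all)).embedding_

plt.figure(figsize=(10, 7))
for label_counter in np.unique(labels[:92]):
    label_idxs = (labels[:92] == label_counter)
    plt.scatter(coords_all[:92][label_idxs, 0], coords_all[:92][label_idxs, 1],
                marker='o', s=50, c=color_list[int(label_counter - 1)])
plt.scatter(coords_all[92:, 0], coords_all[92:, 1], marker='x', s=100, c='k')
plt.legend(categories + ['unknown'], bbox_to_anchor=(1, .8), loc="upper left")
plt.title('MDS with the 4 unlabeled stimuli')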

Novel contribution: be creative and make one new discovery by adding an analysis, visualization, or optimization.

Some examples:

  • visualize the average RDM across subjects
  • compare the empirical RDM to some theoretical RDM (see the sketch after this list)
  • check the consistency between hierarchical clustering vs. ground truth label ordering
  • use other dimensionality reduction methods to visualize the data (PCA, tSNE, etc.)
  • perform some classification on this data set
  • apply RSA on previous datasets (e.g. VDC, the simulated dataset used in the 1st notebook)
  • your own ideas!
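For instance, a minimal sketch of the second idea above, comparing the empirical rFFA RDM to a simple categorical model RDM (the model construction and the use of a Spearman correlation are our own illustration, not part of the original notebook):

In [ ]:
# Sketch: compare the empirical rFFA RDM to a "same vs. different category" model RDM
empirical_rdm = 1 - np.corrcoef(X_ordered)            # 92 x 92, from the reordered data of Exercise 3
y_arr = np.array(y_ordered)
model_rdm = (y_arr[:, None] != y_arr[None, :]).astype(float)  # 0 = same category, 1 = different category

# correlate the off-diagonal entries of the two RDMs
iu = np.triu_indices_from(empirical_rdm, k=1)
rho, p = stats.spearmanr(empirical_rdm[iu], model_rdm[iu])
print('model-brain Spearman rho = %.3f, p = %.3g' % (rho, p))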

Contributions

M. Kumar, C. Ellis and N. Turk-Browne produced the initial notebook.
T. Meissner made minor edits.
Q. Lu improved plot aesthetics, added the digitization function, modularized functions, added the re-ordering, and replicated the original paper's figure.