de Bivort Lab: Systematic exploration of unsupervised methods for mapping behavior

Companion Data Page Systematic exploration of unsupervised methods for mapping behavior Jeremy Todd¹, Jamey Kain², Benjamin de Bivort^1,2 ¹Center for Brain Science and Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, USA. ²Rowland Institute at Harvard, Cambridge, Massachusetts, USA. Manuscript PDF Manuscript (hosted at Physical Biology) Pre-print at bioRxiv Archive of all raw data at Zenodo Code repository at GitHub
Supplemental movies All raw movies are available in this archive, hosted by Zenodo. All publication supplementary movies can be viewed with the links below, at YouTube.
Supplemental Movie M1	Data and clustering visualization for fly experiment 37₁: shows raw time-domain and wavelet-domain data, movie, and density maps with overlaid cluster assignments for two mapping methods.
Supplemental Movie M2	Selected behavioral motifs for co-fit data: shows frames from each of several cluster assignments, useful for comparing behavioral motifs to each other.
Supplemental Movie M3 Supplemental Movie M4 Supplemental Movie M5	Inter-fly comparisons for co-fit data: shows frames from each of several fly experiments for a single cluster assignment, useful for assessing variability within a single behavioral motif.
120826 fly37.avi 120827 fly37.avi 120828 fly37.avi 120829 fly38.avi 120829 fly39.avi 120829 fly40.avi 120830 fly38.avi 120830 fly39.avi 120916 fly42.avi 130211 fly53 day6.avi 130211 fly55.avi 130212 fly53 day7.avi 130214 fly53 day9.avi	The supplemental movies are generated by several of the .m files below, and the .m files take raw movies as input. Raw movies are available as separate downloads (~1GB each), hosted by Zenodo.
Data .mats (MATLAB 2015b) All raw data is available for download as a 335MB .zip archive, hosted by Zenodo.
legTrackerAllRawData.mat	Raw instrument-generated data for all wild type flies analyzed. This data was previously published. Flies are distinguished by different f numbers, _1 indicates the first recording of animals recorded on more than one day, etc.
f130520.txt f130522.txt f130615.txt f130616.txt f130617.txt f130621.txt f130626.txt f130627.txt	Raw instrument-generated data for nanchung mutant flies. The f numbers on these files correspond to the date the experiment was run.
datastarts/*.mat	Each raw data file contains a period at the beginning which must be discarded. These files contain the offset of valid data start, one file per fly experiment.
moviecroprects/*.mat	When preparing supplemental movies we cropped the original movie files to fit more on screen. These files contain the crop rects used for each movie.
moviestarts/*.mat	Offset of movie start, used to synchronize movies with data (use in conjunction with datastarts/*).
varthresholds/*.mat	We manually determined a cutoff between high and low variance frames for each fly experiment. These files contain the cutoffs, one file per fly experiment.
Analysis .m functions (MATLAB 2015b) All code is available on GitHub
raw data preparation and pre-processing
prep/allFlies.m	Return a cell array with flies for which we have data
prep/allMovieFilenames.m	Return all of our movie filenames by fly
prep/dataAndMovieFrameRates.m	Return our data and movie frame rates, in Hz
prep/describeVarThresholds.m	Print the number of high/low variance frames for each of our flies
prep/loadCFSPCData.m	Load PCA-compressed high-variance frame-normalized wavelet data for the given fly
prep/loadFlyData.m	Load the given fly's raw data and preprocess it, return NFramesx15 data matrix
prep/loadHighVarFNData.m	Load high-variance frame-normalized wavelet data for the given fly
prep/loadVarThreshold.m	Load high/low-variance indices and the variance threshold for the given fly
prep/pcaShuffles.m	Compute number of PCs to keep from each fly's high-var frame-normalized wavelet data. We use the shuffling procedure described in the t-SNE mapping paper
prep/plotNNeighbor.m	Plot nearest neighbor metrics for the given job, metrics must first be computed by runMetricNNeighbors()
prep/prepCFSData.m	Prepare and save wavelet data for all flies. We do this (as opposed to computing the CFS data from the raw data when we need it) because the wavelet transform was not producing exactly the same results on all platforms for us, and we want the option of running fully deterministic analyses.
prep/prepCFSPCData.m	Prepare and save PCA-compressed high-variance frame-normalized wavelet data for all flies. We do this (as opposed to computing the PCA-compressed data from the HVFN data when we need it) in case PCA isn't fully deterministic across platforms (this wasn't observed)
prep/prepErrorCheckData.m	Takes interpolated data, checks for errors, produces raw data struct.
prep/prepFilterAndInterpData.m	Takes data from VI, interpolates and filters it, produces raw data struct.
prep/selectMovieCropRects.m	Prompt the user for the crop rect for each movie
prep/selectVarThresholds.m	Prompt the user for the variance threshold for each fly, save results
prep/standardDimNames.m	Return the names for our dimensions in standard order (used after prepping our data)
mapping methods
mapping/pca2gmmMatchSWFlies.m	Run PCA2 GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run
mapping/pca2gmmMatchTMFlies.m	Run PCA2 GMM clustering on all of our flies, take k from previous t-SNE mapping run
mapping/pca2gmmSingleFly.m	Run PCA2 GMM clustering on a single fly
mapping/pcagmmAllFlies.m	Run PCA20 GMM clustering on all of our flies simultaneously
mapping/pcagmmMatchSWFlies.m	Run PCA20 GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed run, use lockfiles for synchronization
mapping/pcagmmMatchTMFlies.m	Run PCA-GMM clustering on all of our flies, take k from previous t-SNE mapping run, use lockfiles for synchronization
mapping/pcagmmRunSingleFlies.m	Run GMM clustering on all of our flies, use lockfiles for synchronization
mapping/pcagmmSingleFly.m	Run PCA20 GMM clustering on a single fly
mapping/pcagmmswaRunSingleFlies.m	Run PCA20 GMM sparse watershed (all data points) clustering on all of our flies, use lockfiles for synchronization
mapping/pcagmmswaSingleFly.m	Run PCA20 GMM sparse watershed (all data points) clustering on a single fly
mapping/pcagmmswGatherResults.m	Gather results from our pcagmmswRunSingleFlies run
mapping/pcagmmswPosteriors.m	Compute posterior probabilities for PC20-GMM-SW results
mapping/pcagmmswRunSingleFlies.m	Run PCA20 GMM sparse watershed (all data points) clustering on all of our flies, use lockfiles for synchronization
mapping/pcagmmswSingleFly.m	Run PCA20 GMM sparse watershed (mean-only) clustering on a single fly
mapping/pcawshedMatchSWFlies.m	Run PCA2 watershed clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run
mapping/pcawshedRunSingleFlies.m	Run PCA-watershed clustering on all of our flies
mapping/pcawshedSingleFly.m	Run PCA-watershed clustering on a single fly
mapping/randomMatchSWFlies.m	Run random clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run
mapping/randomMatchTMFlies.m	Run random clustering on all of our flies, take k from previous t-SNE mapping run
mapping/randomSingleFly.m	Run random clustering on a single fly (each frame gets a random cluster assignment)
mapping/runAllPostTM.m	Run all of our mappings once we have our initial t-SNE mapping results
mapping/tsnegmmMatchSWFlies.m	Run t-SNE GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run, use lockfiles for synchronization
mapping/tsnegmmMatchTMFlies.m	Run t-SNE GMM clustering on all of our flies, take k from previous t-SNE mapping run, use lockfiles for synchronization
mapping/tsnegmmSingleFly.m	Run t-SNE GMM clustering on a single fly
mapping/tsneProcess.m	Run t-SNE mapping on a single fly
mapping/tsneRunSingleFlies.m	Run t-SNE mapping on all of our flies, use lockfiles for synchronization
mapping/tsneSetParameters.m	Set our standard t-SNE mapping parameters
mapping/tsnewshedMatchSWFlies.m	Run t-SNE2 watershed clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run
mapping/tsnewshedRunSingleFlies.m	Run t-SNE watershed clustering on all of our flies, use embedded values from previous t-SNE mapping run
mapping/tsnewshedSingleFly.m	Run t-SNE watershed clustering on a single fly
mapping/wshedProcess.m	Return watersheds, 2-D density map and high-variance cluster assignments for the given embedded data points
t-SNE mapping methods
MotionMapper	The code for running t-SNE mapping methods is from Berman et al 2014, and it is hosted at this GitHub repository
metric calculations
metrics/expandClusters.m	Take the given high-variance clusters and produce cluster assignments for selected frames: -1 is an indeterminate cluster (when watershed returns a 0, indicating a point in the density map spans more than one watershed) 0 is the fixed cluster assignment for all low-variance frames 1:k we use these cluster labels for high-variance frames
metrics/gatherMarkovCofit.m	Gather Markov transition matrices for each of our flies PCA20-GMM-SW co-fit data
metrics/gatherMetrics.m	Gather metrics comparing results from various clustering algorithms
metrics/gatherMetricsSW.m	Gather metrics comparing results from various clustering algorithms, use results from our sparse watershed mapping methods
metrics/loadClusters.m	Load cluster data for the given job/fly
metrics/loadClustersSW.m	Load cluster data for the given job/fly, uses our sparse watershed mapping results
metrics/markovTransitionMatrices.m	Compute 0th and 1st order Markov transition matrices
metrics/metricEntropy.m	Measure entropy in state distributions for the given sequence of cluster assignments
metrics/metricExitStates.m	Determine the mean number of exit states per cluster in the first-order Markov transition matrix
metrics/metricMarkovLLRatio.m	Compute Markov order 1-0 log-likelihood ratio on the given sequence of cluster assignments
metrics/metricMeanClusterDist.m	Measure mean distance to cluster mean in each high-variance cluster, return mean across all clusters
metrics/metricNNeighbor.m	Compute P(ij in same clusters \| j is l'th nearest neighbor of i) for the given job/fly. We only analyze high-variance frames when computing this since low-variance frames aren't submitted to the clustering methods
metrics/metricStateTransitions.m	Count state transitions in the given cluster assignments
metrics/reportMetrics.m	Produce a report combining metrics from the given jobs, metrics must first be computed by gatherMetrics()
metrics/runMetricNNeighbors.m	Compute nearest neighbor metric for all flies in the given job
plot generation
plots/cmapStandard1.m	Return our standard colormap for plots and movies
plots/dataFrameToMovieFrame.m	Convert the given data frame to a movie frame number
plots/exploreCofitData.m	Plot the raw and prepped data for the given fly
plots/exploreData.m	Plot the raw and prepped data for the given fly
plots/extractPairwiseClusterMeanDists.m	Extract pairwise distances between co-fit cluster means for fitting in Figure 5A
plots/graphEdgeClassPlotter.m	Make a figure of a graph, with vertex position specified by a MATLAB graph object, directed edges specified by a weighted adjacency matrix, and colors on vertices indicated by an nx3 RGB array.
plots/labelSequences.m	Split the given cluster assignments into a set of sequence assignments. A sequence is a group of consecutive frames with the same cluster assignment. For each cluster we identify all sequences, and then assign numbers to sequences longest-to-shortest. This way we can render movies with the longest sequences from each cluster by taking the first N sequences from each cluster
plots/motif2DGMMMonteCarlo.m	Perform Markov Chain Monte Carlo to position standard 2d Gaussians in a plane to match 1) the distribution of pairwise distances between their centroids and 2) the distribution of maximum posterior probabilities for points randomly sampled from that plane, to the distributions observed in our PCA20-GMM fits. Produces Figure 5A.
plots/movieFrameToDataFrames.m	Convert the given movie frame number into a range of analyzed data frames.
plots/patchRects.m	Draw a set of rectangles using patch, this is much faster than calling rectangle for each one individually
plots/plotBIC.m	Plot BIC curves for all available GMM (independent or co-fit) results
plots/plotClusterData.m	Plot time-domain and wavelet data with the given cluster assignments
plots/plotClusterRunLengths.m	Plot distribution of run lengths for the given clusters
plots/plotCofit.m	Plot cluster size distribution for each fly for PCA20-GMM-SW co-fit data
plots/plotCofitFrameEnergies.m	Plot cluster energies by dimension for co-fit data
plots/plotCovarianceMatrices.m	Figure showing shape of covariance matrices for the given fly's PCA20 GMM results
plots/plotDataPrep.m	Plot our data preparation figure for the given fly
plots/plotFlyData.m	Plot time-domain and wavelet data for the given fly, we plot the given frames, defaults to all frames
plots/plotFrameEnergies.m	Plot cluster energies by dimension for PCA20-GMM-SW on the given fly
plots/plotFrameNormalization.m	Figures illustrating our frame normalization procedure
plots/plotGMMPosteriors.m	Figure showing distribution of GMM posterior probability ratios
plots/plotGMMPosteriorsCofit.m	Figure showing distribution of GMM posterior probability ratios for co-fit data
plots/plotMarkov.m	Figure showing histogram of state occupancies and markov transition matrix for a single fly
plots/plotMetrics.m	Plot metrics for the given jobs, metrics must first be computed by gatherMetrics()
plots/plotMetricsSW.m	Plot metrics for the given jobs, metrics must first be computed by gatherMetrics()
plots/plotNanClusters.m	Plot nan vs wt fly cluster size distributions for each cluster in PCA20-GMM-SW co-fit data
plots/plotPosturalTrace.m	Plot our postural trace figure with the given fly using PCA20 GMM SW clusters
plots/plotPreNormalizedData.m	Plot data after resampling and error correction but before normalization
plots/plotSigmaNumWatersheds.m	Plot num watersheds vs sigma for the given job and fly
plots/plotSparseWatershedAllDataPoints.m	Plot results from sparse watershed algorithm applied to all data points (not just means)
plots/plotStepSizeDistribution.m	Plot the step size distribution in PCA20 space for the given fly
plots/plotTMDensities.m	Plot t-SNE mapping densities for the given results
plots/plotTMDensity.m	Plot the given density map, shows t-SNE quantized space
plots/renderCofitFlyGrid.m	Render a movie with a single cluster and one grid cell for each fly, we cycle through sequences and then repeat until a fixed output movie frame count is reached
plots/renderCofitMotifGrid.m	Render a movie with one major grid cell for each of the given motifs, and one minor grid cell for each of the given flies
plots/renderDataMovie.m	Render movie of fly and t-SNE watershed density map with PCA20-GMM-SW overlay
plots/setFigureZoomMode.m	Sets the zoom mode for the given figure, also updates pan mode
plots/setIntegralXAxisLabels.m	Sets the x axis labels on the given axis (or set of axes) to integers, prevents MATLAB from using scientific notation. Note that axis labels will be incorrect during a zoom/pan operation! They'll correct themselves after mouse button is released (not sure how to fix this - is there a callback that gets called continously during zoom/pan operations?)
plots/splitMovies.m	Split each fly's movie by cluster assignment
plots/testRenderDataMovieWiggle.m	The movies produced by renderDataMovie() had their axes change width as tick labels entered from the right, this is the best workaround we've found so far
plots/waveletInfo.m	Gather wavelet scales and their corresponding frequencies used in our wavelet transforms
support for Harvard Odyssey cluster
odyssey/odysseyEnvInfo.m	Return SLURM info from our environment: our core count for this node and our SLURM task id. In non-SLURM environments, return defaults for testing locally

Supplemental movies

All raw movies are available in this archive, hosted by Zenodo.

All publication supplementary movies can be viewed with the links below, at YouTube.

Data .mats (MATLAB 2015b)

All raw data is available for download as a 335MB .zip archive, hosted by Zenodo.

Analysis .m functions (MATLAB 2015b)

All code is available on GitHub