Companion Data Page Systematic exploration of unsupervised methods for mapping behavior Jeremy Todd1, Jamey Kain2, Benjamin de Bivort1,2 1Center for Brain Science and Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, USA. 2Rowland Institute at Harvard, Cambridge, Massachusetts, USA. Manuscript PDF Manuscript (hosted at Physical Biology) Pre-print at bioRxiv Archive of all raw data at Zenodo Code repository at GitHub |
|
Supplemental moviesAll raw movies are available in this archive, hosted by Zenodo.All publication supplementary movies can be viewed with the links below, at YouTube. |
|
Supplemental Movie M1 | Data and clustering visualization for fly experiment 371: shows raw time-domain and wavelet-domain data, movie, and density maps with overlaid cluster assignments for two mapping methods. |
Supplemental Movie M2 | Selected behavioral motifs for co-fit data: shows frames from each of several cluster assignments, useful for comparing behavioral motifs to each other. |
Supplemental Movie M3 Supplemental Movie M4 Supplemental Movie M5 |
Inter-fly comparisons for co-fit data: shows frames from each of several fly experiments for a single cluster assignment, useful for assessing variability within a single behavioral motif. |
120826 fly37.avi 120827 fly37.avi 120828 fly37.avi 120829 fly38.avi 120829 fly39.avi 120829 fly40.avi 120830 fly38.avi 120830 fly39.avi 120916 fly42.avi 130211 fly53 day6.avi 130211 fly55.avi 130212 fly53 day7.avi 130214 fly53 day9.avi | The supplemental movies are generated by several of the .m files below, and the .m files take raw movies as input. Raw movies are available as separate downloads (~1GB each), hosted by Zenodo. |
Data .mats (MATLAB 2015b)All raw data is available for download as a 335MB .zip archive, hosted by Zenodo. |
|
legTrackerAllRawData.mat | Raw instrument-generated data for all wild type flies analyzed. This data was previously published. Flies are distinguished by different f numbers, _1 indicates the first recording of animals recorded on more than one day, etc. |
f130520.txt f130522.txt f130615.txt f130616.txt f130617.txt f130621.txt f130626.txt f130627.txt | Raw instrument-generated data for nanchung mutant flies. The f numbers on these files correspond to the date the experiment was run. |
datastarts/*.mat | Each raw data file contains a period at the beginning which must be discarded. These files contain the offset of valid data start, one file per fly experiment. |
moviecroprects/*.mat | When preparing supplemental movies we cropped the original movie files to fit more on screen. These files contain the crop rects used for each movie. |
moviestarts/*.mat | Offset of movie start, used to synchronize movies with data (use in conjunction with datastarts/*). |
varthresholds/*.mat | We manually determined a cutoff between high and low variance frames for each fly experiment. These files contain the cutoffs, one file per fly experiment. |
Analysis .m functions (MATLAB 2015b)All code is available on GitHub |
raw data preparation and pre-processing |
prep/allFlies.m | Return a cell array with flies for which we have data |
prep/allMovieFilenames.m | Return all of our movie filenames by fly |
prep/dataAndMovieFrameRates.m | Return our data and movie frame rates, in Hz |
prep/describeVarThresholds.m | Print the number of high/low variance frames for each of our flies |
prep/loadCFSPCData.m | Load PCA-compressed high-variance frame-normalized wavelet data for the given fly |
prep/loadFlyData.m | Load the given fly's raw data and preprocess it, return NFramesx15 data matrix |
prep/loadHighVarFNData.m | Load high-variance frame-normalized wavelet data for the given fly |
prep/loadVarThreshold.m | Load high/low-variance indices and the variance threshold for the given fly |
prep/pcaShuffles.m | Compute number of PCs to keep from each fly's high-var frame-normalized wavelet data. We use the shuffling procedure described in the t-SNE mapping paper |
prep/plotNNeighbor.m | Plot nearest neighbor metrics for the given job, metrics must first be computed by runMetricNNeighbors() |
prep/prepCFSData.m | Prepare and save wavelet data for all flies. We do this (as opposed to computing the CFS data from the raw data when we need it) because the wavelet transform was not producing exactly the same results on all platforms for us, and we want the option of running fully deterministic analyses. |
prep/prepCFSPCData.m | Prepare and save PCA-compressed high-variance frame-normalized wavelet data for all flies. We do this (as opposed to computing the PCA-compressed data from the HVFN data when we need it) in case PCA isn't fully deterministic across platforms (this wasn't observed) |
prep/prepErrorCheckData.m | Takes interpolated data, checks for errors, produces raw data struct. |
prep/prepFilterAndInterpData.m | Takes data from VI, interpolates and filters it, produces raw data struct. |
prep/selectMovieCropRects.m | Prompt the user for the crop rect for each movie |
prep/selectVarThresholds.m | Prompt the user for the variance threshold for each fly, save results |
prep/standardDimNames.m | Return the names for our dimensions in standard order (used after prepping our data) | mapping methods |
mapping/pca2gmmMatchSWFlies.m | Run PCA2 GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run |
mapping/pca2gmmMatchTMFlies.m | Run PCA2 GMM clustering on all of our flies, take k from previous t-SNE mapping run |
mapping/pca2gmmSingleFly.m | Run PCA2 GMM clustering on a single fly |
mapping/pcagmmAllFlies.m | Run PCA20 GMM clustering on all of our flies simultaneously |
mapping/pcagmmMatchSWFlies.m | Run PCA20 GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed run, use lockfiles for synchronization |
mapping/pcagmmMatchTMFlies.m | Run PCA-GMM clustering on all of our flies, take k from previous t-SNE mapping run, use lockfiles for synchronization |
mapping/pcagmmRunSingleFlies.m | Run GMM clustering on all of our flies, use lockfiles for synchronization |
mapping/pcagmmSingleFly.m | Run PCA20 GMM clustering on a single fly |
mapping/pcagmmswaRunSingleFlies.m | Run PCA20 GMM sparse watershed (all data points) clustering on all of our flies, use lockfiles for synchronization |
mapping/pcagmmswaSingleFly.m | Run PCA20 GMM sparse watershed (all data points) clustering on a single fly |
mapping/pcagmmswGatherResults.m | Gather results from our pcagmmswRunSingleFlies run |
mapping/pcagmmswPosteriors.m | Compute posterior probabilities for PC20-GMM-SW results |
mapping/pcagmmswRunSingleFlies.m | Run PCA20 GMM sparse watershed (all data points) clustering on all of our flies, use lockfiles for synchronization |
mapping/pcagmmswSingleFly.m | Run PCA20 GMM sparse watershed (mean-only) clustering on a single fly |
mapping/pcawshedMatchSWFlies.m | Run PCA2 watershed clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run |
mapping/pcawshedRunSingleFlies.m | Run PCA-watershed clustering on all of our flies |
mapping/pcawshedSingleFly.m | Run PCA-watershed clustering on a single fly |
mapping/randomMatchSWFlies.m | Run random clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run |
mapping/randomMatchTMFlies.m | Run random clustering on all of our flies, take k from previous t-SNE mapping run |
mapping/randomSingleFly.m | Run random clustering on a single fly (each frame gets a random cluster assignment) |
mapping/runAllPostTM.m | Run all of our mappings once we have our initial t-SNE mapping results |
mapping/tsnegmmMatchSWFlies.m | Run t-SNE GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run, use lockfiles for synchronization |
mapping/tsnegmmMatchTMFlies.m | Run t-SNE GMM clustering on all of our flies, take k from previous t-SNE mapping run, use lockfiles for synchronization |
mapping/tsnegmmSingleFly.m | Run t-SNE GMM clustering on a single fly |
mapping/tsneProcess.m | Run t-SNE mapping on a single fly |
mapping/tsneRunSingleFlies.m | Run t-SNE mapping on all of our flies, use lockfiles for synchronization |
mapping/tsneSetParameters.m | Set our standard t-SNE mapping parameters |
mapping/tsnewshedMatchSWFlies.m | Run t-SNE2 watershed clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run |
mapping/tsnewshedRunSingleFlies.m | Run t-SNE watershed clustering on all of our flies, use embedded values from previous t-SNE mapping run |
mapping/tsnewshedSingleFly.m | Run t-SNE watershed clustering on a single fly |
mapping/wshedProcess.m | Return watersheds, 2-D density map and high-variance cluster assignments for the given embedded data points | t-SNE mapping methods |
MotionMapper | The code for running t-SNE mapping methods is from Berman et al 2014, and it is hosted at this GitHub repository | metric calculations |
metrics/expandClusters.m | Take the given high-variance clusters and produce cluster assignments for selected frames: -1 is an indeterminate cluster (when watershed returns a 0, indicating a point in the density map spans more than one watershed) 0 is the fixed cluster assignment for all low-variance frames 1:k we use these cluster labels for high-variance frames |
metrics/gatherMarkovCofit.m | Gather Markov transition matrices for each of our flies PCA20-GMM-SW co-fit data |
metrics/gatherMetrics.m | Gather metrics comparing results from various clustering algorithms |
metrics/gatherMetricsSW.m | Gather metrics comparing results from various clustering algorithms, use results from our sparse watershed mapping methods |
metrics/loadClusters.m | Load cluster data for the given job/fly |
metrics/loadClustersSW.m | Load cluster data for the given job/fly, uses our sparse watershed mapping results |
metrics/markovTransitionMatrices.m | Compute 0th and 1st order Markov transition matrices |
metrics/metricEntropy.m | Measure entropy in state distributions for the given sequence of cluster assignments |
metrics/metricExitStates.m | Determine the mean number of exit states per cluster in the first-order Markov transition matrix |
metrics/metricMarkovLLRatio.m | Compute Markov order 1-0 log-likelihood ratio on the given sequence of cluster assignments |
metrics/metricMeanClusterDist.m | Measure mean distance to cluster mean in each high-variance cluster, return mean across all clusters |
metrics/metricNNeighbor.m | Compute P(ij in same clusters | j is l'th nearest neighbor of i) for the given job/fly. We only analyze high-variance frames when computing this since low-variance frames aren't submitted to the clustering methods |
metrics/metricStateTransitions.m | Count state transitions in the given cluster assignments |
metrics/reportMetrics.m | Produce a report combining metrics from the given jobs, metrics must first be computed by gatherMetrics() |
metrics/runMetricNNeighbors.m | Compute nearest neighbor metric for all flies in the given job | plot generation |
plots/cmapStandard1.m | Return our standard colormap for plots and movies |
plots/dataFrameToMovieFrame.m | Convert the given data frame to a movie frame number |
plots/exploreCofitData.m | Plot the raw and prepped data for the given fly |
plots/exploreData.m | Plot the raw and prepped data for the given fly |
plots/extractPairwiseClusterMeanDists.m | Extract pairwise distances between co-fit cluster means for fitting in Figure 5A |
plots/graphEdgeClassPlotter.m | Make a figure of a graph, with vertex position specified by a MATLAB graph object, directed edges specified by a weighted adjacency matrix, and colors on vertices indicated by an nx3 RGB array. |
plots/labelSequences.m | Split the given cluster assignments into a set of sequence assignments. A sequence is a group of consecutive frames with the same cluster assignment. For each cluster we identify all sequences, and then assign numbers to sequences longest-to-shortest. This way we can render movies with the longest sequences from each cluster by taking the first N sequences from each cluster |
plots/motif2DGMMMonteCarlo.m | Perform Markov Chain Monte Carlo to position standard 2d Gaussians in a plane to match 1) the distribution of pairwise distances between their centroids and 2) the distribution of maximum posterior probabilities for points randomly sampled from that plane, to the distributions observed in our PCA20-GMM fits. Produces Figure 5A. |
plots/movieFrameToDataFrames.m | Convert the given movie frame number into a range of analyzed data frames. |
plots/patchRects.m | Draw a set of rectangles using patch, this is much faster than calling rectangle for each one individually |
plots/plotBIC.m | Plot BIC curves for all available GMM (independent or co-fit) results |
plots/plotClusterData.m | Plot time-domain and wavelet data with the given cluster assignments |
plots/plotClusterRunLengths.m | Plot distribution of run lengths for the given clusters |
plots/plotCofit.m | Plot cluster size distribution for each fly for PCA20-GMM-SW co-fit data |
plots/plotCofitFrameEnergies.m | Plot cluster energies by dimension for co-fit data |
plots/plotCovarianceMatrices.m | Figure showing shape of covariance matrices for the given fly's PCA20 GMM results |
plots/plotDataPrep.m | Plot our data preparation figure for the given fly |
plots/plotFlyData.m | Plot time-domain and wavelet data for the given fly, we plot the given frames, defaults to all frames |
plots/plotFrameEnergies.m | Plot cluster energies by dimension for PCA20-GMM-SW on the given fly |
plots/plotFrameNormalization.m | Figures illustrating our frame normalization procedure |
plots/plotGMMPosteriors.m | Figure showing distribution of GMM posterior probability ratios |
plots/plotGMMPosteriorsCofit.m | Figure showing distribution of GMM posterior probability ratios for co-fit data |
plots/plotMarkov.m | Figure showing histogram of state occupancies and markov transition matrix for a single fly |
plots/plotMetrics.m | Plot metrics for the given jobs, metrics must first be computed by gatherMetrics() |
plots/plotMetricsSW.m | Plot metrics for the given jobs, metrics must first be computed by gatherMetrics() |
plots/plotNanClusters.m | Plot nan vs wt fly cluster size distributions for each cluster in PCA20-GMM-SW co-fit data |
plots/plotPosturalTrace.m | Plot our postural trace figure with the given fly using PCA20 GMM SW clusters |
plots/plotPreNormalizedData.m | Plot data after resampling and error correction but before normalization |
plots/plotSigmaNumWatersheds.m | Plot num watersheds vs sigma for the given job and fly |
plots/plotSparseWatershedAllDataPoints.m | Plot results from sparse watershed algorithm applied to all data points (not just means) |
plots/plotStepSizeDistribution.m | Plot the step size distribution in PCA20 space for the given fly |
plots/plotTMDensities.m | Plot t-SNE mapping densities for the given results |
plots/plotTMDensity.m | Plot the given density map, shows t-SNE quantized space |
plots/renderCofitFlyGrid.m | Render a movie with a single cluster and one grid cell for each fly, we cycle through sequences and then repeat until a fixed output movie frame count is reached |
plots/renderCofitMotifGrid.m | Render a movie with one major grid cell for each of the given motifs, and one minor grid cell for each of the given flies |
plots/renderDataMovie.m | Render movie of fly and t-SNE watershed density map with PCA20-GMM-SW overlay |
plots/setFigureZoomMode.m | Sets the zoom mode for the given figure, also updates pan mode |
plots/setIntegralXAxisLabels.m | Sets the x axis labels on the given axis (or set of axes) to integers, prevents MATLAB from using scientific notation. Note that axis labels will be incorrect during a zoom/pan operation! They'll correct themselves after mouse button is released (not sure how to fix this - is there a callback that gets called continously during zoom/pan operations?) |
plots/splitMovies.m | Split each fly's movie by cluster assignment |
plots/testRenderDataMovieWiggle.m | The movies produced by renderDataMovie() had their axes change width as tick labels entered from the right, this is the best workaround we've found so far |
plots/waveletInfo.m | Gather wavelet scales and their corresponding frequencies used in our wavelet transforms | support for Harvard Odyssey cluster |
odyssey/odysseyEnvInfo.m | Return SLURM info from our environment: our core count for this node and our SLURM task id. In non-SLURM environments, return defaults for testing locally |