Companion Data Page Systematic exploration of unsupervised methods for mapping behavior Jeremy Todd^{1}, Jamey Kain^{2}, Benjamin de Bivort^{1,2} ^{1}Center for Brain Science and Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts, USA. ^{2}Rowland Institute at Harvard, Cambridge, Massachusetts, USA. Manuscript PDF Manuscript (hosted at Physical Biology) Preprint at bioRxiv Archive of all raw data at Zenodo Code repository at GitHub 

Supplemental moviesAll raw movies are available in this archive, hosted by Zenodo.All publication supplementary movies can be viewed with the links below, at YouTube. 

Supplemental Movie M1  Data and clustering visualization for fly experiment 37_{1}: shows raw timedomain and waveletdomain data, movie, and density maps with overlaid cluster assignments for two mapping methods. 
Supplemental Movie M2  Selected behavioral motifs for cofit data: shows frames from each of several cluster assignments, useful for comparing behavioral motifs to each other. 
Supplemental Movie M3 Supplemental Movie M4 Supplemental Movie M5 
Interfly comparisons for cofit data: shows frames from each of several fly experiments for a single cluster assignment, useful for assessing variability within a single behavioral motif. 
120826 fly37.avi 120827 fly37.avi 120828 fly37.avi 120829 fly38.avi 120829 fly39.avi 120829 fly40.avi 120830 fly38.avi 120830 fly39.avi 120916 fly42.avi 130211 fly53 day6.avi 130211 fly55.avi 130212 fly53 day7.avi 130214 fly53 day9.avi  The supplemental movies are generated by several of the .m files below, and the .m files take raw movies as input. Raw movies are available as separate downloads (~1GB each), hosted by Zenodo. 
Data .mats (MATLAB 2015b)All raw data is available for download as a 335MB .zip archive, hosted by Zenodo. 

legTrackerAllRawData.mat  Raw instrumentgenerated data for all wild type flies analyzed. This data was previously published. Flies are distinguished by different f numbers, _1 indicates the first recording of animals recorded on more than one day, etc. 
f130520.txt f130522.txt f130615.txt f130616.txt f130617.txt f130621.txt f130626.txt f130627.txt  Raw instrumentgenerated data for nanchung mutant flies. The f numbers on these files correspond to the date the experiment was run. 
datastarts/*.mat  Each raw data file contains a period at the beginning which must be discarded. These files contain the offset of valid data start, one file per fly experiment. 
moviecroprects/*.mat  When preparing supplemental movies we cropped the original movie files to fit more on screen. These files contain the crop rects used for each movie. 
moviestarts/*.mat  Offset of movie start, used to synchronize movies with data (use in conjunction with datastarts/*). 
varthresholds/*.mat  We manually determined a cutoff between high and low variance frames for each fly experiment. These files contain the cutoffs, one file per fly experiment. 
Analysis .m functions (MATLAB 2015b)All code is available on GitHub 

raw data preparation and preprocessing  
prep/allFlies.m  Return a cell array with flies for which we have data 
prep/allMovieFilenames.m  Return all of our movie filenames by fly 
prep/dataAndMovieFrameRates.m  Return our data and movie frame rates, in Hz 
prep/describeVarThresholds.m  Print the number of high/low variance frames for each of our flies 
prep/loadCFSPCData.m  Load PCAcompressed highvariance framenormalized wavelet data for the given fly 
prep/loadFlyData.m  Load the given fly's raw data and preprocess it, return NFramesx15 data matrix 
prep/loadHighVarFNData.m  Load highvariance framenormalized wavelet data for the given fly 
prep/loadVarThreshold.m  Load high/lowvariance indices and the variance threshold for the given fly 
prep/pcaShuffles.m  Compute number of PCs to keep from each fly's highvar framenormalized wavelet data. We use the shuffling procedure described in the tSNE mapping paper 
prep/plotNNeighbor.m  Plot nearest neighbor metrics for the given job, metrics must first be computed by runMetricNNeighbors() 
prep/prepCFSData.m  Prepare and save wavelet data for all flies. We do this (as opposed to computing the CFS data from the raw data when we need it) because the wavelet transform was not producing exactly the same results on all platforms for us, and we want the option of running fully deterministic analyses. 
prep/prepCFSPCData.m  Prepare and save PCAcompressed highvariance framenormalized wavelet data for all flies. We do this (as opposed to computing the PCAcompressed data from the HVFN data when we need it) in case PCA isn't fully deterministic across platforms (this wasn't observed) 
prep/prepErrorCheckData.m  Takes interpolated data, checks for errors, produces raw data struct. 
prep/prepFilterAndInterpData.m  Takes data from VI, interpolates and filters it, produces raw data struct. 
prep/selectMovieCropRects.m  Prompt the user for the crop rect for each movie 
prep/selectVarThresholds.m  Prompt the user for the variance threshold for each fly, save results 
prep/standardDimNames.m  Return the names for our dimensions in standard order (used after prepping our data) 
mapping methods  
mapping/pca2gmmMatchSWFlies.m  Run PCA2 GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run 
mapping/pca2gmmMatchTMFlies.m  Run PCA2 GMM clustering on all of our flies, take k from previous tSNE mapping run 
mapping/pca2gmmSingleFly.m  Run PCA2 GMM clustering on a single fly 
mapping/pcagmmAllFlies.m  Run PCA20 GMM clustering on all of our flies simultaneously 
mapping/pcagmmMatchSWFlies.m  Run PCA20 GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed run, use lockfiles for synchronization 
mapping/pcagmmMatchTMFlies.m  Run PCAGMM clustering on all of our flies, take k from previous tSNE mapping run, use lockfiles for synchronization 
mapping/pcagmmRunSingleFlies.m  Run GMM clustering on all of our flies, use lockfiles for synchronization 
mapping/pcagmmSingleFly.m  Run PCA20 GMM clustering on a single fly 
mapping/pcagmmswaRunSingleFlies.m  Run PCA20 GMM sparse watershed (all data points) clustering on all of our flies, use lockfiles for synchronization 
mapping/pcagmmswaSingleFly.m  Run PCA20 GMM sparse watershed (all data points) clustering on a single fly 
mapping/pcagmmswGatherResults.m  Gather results from our pcagmmswRunSingleFlies run 
mapping/pcagmmswPosteriors.m  Compute posterior probabilities for PC20GMMSW results 
mapping/pcagmmswRunSingleFlies.m  Run PCA20 GMM sparse watershed (all data points) clustering on all of our flies, use lockfiles for synchronization 
mapping/pcagmmswSingleFly.m  Run PCA20 GMM sparse watershed (meanonly) clustering on a single fly 
mapping/pcawshedMatchSWFlies.m  Run PCA2 watershed clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run 
mapping/pcawshedRunSingleFlies.m  Run PCAwatershed clustering on all of our flies 
mapping/pcawshedSingleFly.m  Run PCAwatershed clustering on a single fly 
mapping/randomMatchSWFlies.m  Run random clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run 
mapping/randomMatchTMFlies.m  Run random clustering on all of our flies, take k from previous tSNE mapping run 
mapping/randomSingleFly.m  Run random clustering on a single fly (each frame gets a random cluster assignment) 
mapping/runAllPostTM.m  Run all of our mappings once we have our initial tSNE mapping results 
mapping/tsnegmmMatchSWFlies.m  Run tSNE GMM clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run, use lockfiles for synchronization 
mapping/tsnegmmMatchTMFlies.m  Run tSNE GMM clustering on all of our flies, take k from previous tSNE mapping run, use lockfiles for synchronization 
mapping/tsnegmmSingleFly.m  Run tSNE GMM clustering on a single fly 
mapping/tsneProcess.m  Run tSNE mapping on a single fly 
mapping/tsneRunSingleFlies.m  Run tSNE mapping on all of our flies, use lockfiles for synchronization 
mapping/tsneSetParameters.m  Set our standard tSNE mapping parameters 
mapping/tsnewshedMatchSWFlies.m  Run tSNE2 watershed clustering on all of our flies, take k from previous PCA20 GMM sparse watershed mapping run 
mapping/tsnewshedRunSingleFlies.m  Run tSNE watershed clustering on all of our flies, use embedded values from previous tSNE mapping run 
mapping/tsnewshedSingleFly.m  Run tSNE watershed clustering on a single fly 
mapping/wshedProcess.m  Return watersheds, 2D density map and highvariance cluster assignments for the given embedded data points 
tSNE mapping methods  
MotionMapper  The code for running tSNE mapping methods is from Berman et al 2014, and it is hosted at this GitHub repository 
metric calculations  
metrics/expandClusters.m  Take the given highvariance clusters and produce cluster assignments for selected frames: 1 is an indeterminate cluster (when watershed returns a 0, indicating a point in the density map spans more than one watershed) 0 is the fixed cluster assignment for all lowvariance frames 1:k we use these cluster labels for highvariance frames 
metrics/gatherMarkovCofit.m  Gather Markov transition matrices for each of our flies PCA20GMMSW cofit data 
metrics/gatherMetrics.m  Gather metrics comparing results from various clustering algorithms 
metrics/gatherMetricsSW.m  Gather metrics comparing results from various clustering algorithms, use results from our sparse watershed mapping methods 
metrics/loadClusters.m  Load cluster data for the given job/fly 
metrics/loadClustersSW.m  Load cluster data for the given job/fly, uses our sparse watershed mapping results 
metrics/markovTransitionMatrices.m  Compute 0th and 1st order Markov transition matrices 
metrics/metricEntropy.m  Measure entropy in state distributions for the given sequence of cluster assignments 
metrics/metricExitStates.m  Determine the mean number of exit states per cluster in the firstorder Markov transition matrix 
metrics/metricMarkovLLRatio.m  Compute Markov order 10 loglikelihood ratio on the given sequence of cluster assignments 
metrics/metricMeanClusterDist.m  Measure mean distance to cluster mean in each highvariance cluster, return mean across all clusters 
metrics/metricNNeighbor.m  Compute P(ij in same clusters  j is l'th nearest neighbor of i) for the given job/fly. We only analyze highvariance frames when computing this since lowvariance frames aren't submitted to the clustering methods 
metrics/metricStateTransitions.m  Count state transitions in the given cluster assignments 
metrics/reportMetrics.m  Produce a report combining metrics from the given jobs, metrics must first be computed by gatherMetrics() 
metrics/runMetricNNeighbors.m  Compute nearest neighbor metric for all flies in the given job 
plot generation  
plots/cmapStandard1.m  Return our standard colormap for plots and movies 
plots/dataFrameToMovieFrame.m  Convert the given data frame to a movie frame number 
plots/exploreCofitData.m  Plot the raw and prepped data for the given fly 
plots/exploreData.m  Plot the raw and prepped data for the given fly 
plots/extractPairwiseClusterMeanDists.m  Extract pairwise distances between cofit cluster means for fitting in Figure 5A 
plots/graphEdgeClassPlotter.m  Make a figure of a graph, with vertex position specified by a MATLAB graph object, directed edges specified by a weighted adjacency matrix, and colors on vertices indicated by an nx3 RGB array. 
plots/labelSequences.m  Split the given cluster assignments into a set of sequence assignments. A sequence is a group of consecutive frames with the same cluster assignment. For each cluster we identify all sequences, and then assign numbers to sequences longesttoshortest. This way we can render movies with the longest sequences from each cluster by taking the first N sequences from each cluster 
plots/motif2DGMMMonteCarlo.m  Perform Markov Chain Monte Carlo to position standard 2d Gaussians in a plane to match 1) the distribution of pairwise distances between their centroids and 2) the distribution of maximum posterior probabilities for points randomly sampled from that plane, to the distributions observed in our PCA20GMM fits. Produces Figure 5A. 
plots/movieFrameToDataFrames.m  Convert the given movie frame number into a range of analyzed data frames. 
plots/patchRects.m  Draw a set of rectangles using patch, this is much faster than calling rectangle for each one individually 
plots/plotBIC.m  Plot BIC curves for all available GMM (independent or cofit) results 
plots/plotClusterData.m  Plot timedomain and wavelet data with the given cluster assignments 
plots/plotClusterRunLengths.m  Plot distribution of run lengths for the given clusters 
plots/plotCofit.m  Plot cluster size distribution for each fly for PCA20GMMSW cofit data 
plots/plotCofitFrameEnergies.m  Plot cluster energies by dimension for cofit data 
plots/plotCovarianceMatrices.m  Figure showing shape of covariance matrices for the given fly's PCA20 GMM results 
plots/plotDataPrep.m  Plot our data preparation figure for the given fly 
plots/plotFlyData.m  Plot timedomain and wavelet data for the given fly, we plot the given frames, defaults to all frames 
plots/plotFrameEnergies.m  Plot cluster energies by dimension for PCA20GMMSW on the given fly 
plots/plotFrameNormalization.m  Figures illustrating our frame normalization procedure 
plots/plotGMMPosteriors.m  Figure showing distribution of GMM posterior probability ratios 
plots/plotGMMPosteriorsCofit.m  Figure showing distribution of GMM posterior probability ratios for cofit data 
plots/plotMarkov.m  Figure showing histogram of state occupancies and markov transition matrix for a single fly 
plots/plotMetrics.m  Plot metrics for the given jobs, metrics must first be computed by gatherMetrics() 
plots/plotMetricsSW.m  Plot metrics for the given jobs, metrics must first be computed by gatherMetrics() 
plots/plotNanClusters.m  Plot nan vs wt fly cluster size distributions for each cluster in PCA20GMMSW cofit data 
plots/plotPosturalTrace.m  Plot our postural trace figure with the given fly using PCA20 GMM SW clusters 
plots/plotPreNormalizedData.m  Plot data after resampling and error correction but before normalization 
plots/plotSigmaNumWatersheds.m  Plot num watersheds vs sigma for the given job and fly 
plots/plotSparseWatershedAllDataPoints.m  Plot results from sparse watershed algorithm applied to all data points (not just means) 
plots/plotStepSizeDistribution.m  Plot the step size distribution in PCA20 space for the given fly 
plots/plotTMDensities.m  Plot tSNE mapping densities for the given results 
plots/plotTMDensity.m  Plot the given density map, shows tSNE quantized space 
plots/renderCofitFlyGrid.m  Render a movie with a single cluster and one grid cell for each fly, we cycle through sequences and then repeat until a fixed output movie frame count is reached 
plots/renderCofitMotifGrid.m  Render a movie with one major grid cell for each of the given motifs, and one minor grid cell for each of the given flies 
plots/renderDataMovie.m  Render movie of fly and tSNE watershed density map with PCA20GMMSW overlay 
plots/setFigureZoomMode.m  Sets the zoom mode for the given figure, also updates pan mode 
plots/setIntegralXAxisLabels.m  Sets the x axis labels on the given axis (or set of axes) to integers, prevents MATLAB from using scientific notation. Note that axis labels will be incorrect during a zoom/pan operation! They'll correct themselves after mouse button is released (not sure how to fix this  is there a callback that gets called continously during zoom/pan operations?) 
plots/splitMovies.m  Split each fly's movie by cluster assignment 
plots/testRenderDataMovieWiggle.m  The movies produced by renderDataMovie() had their axes change width as tick labels entered from the right, this is the best workaround we've found so far 
plots/waveletInfo.m  Gather wavelet scales and their corresponding frequencies used in our wavelet transforms 
support for Harvard Odyssey cluster  
odyssey/odysseyEnvInfo.m  Return SLURM info from our environment: our core count for this node and our SLURM task id. In nonSLURM environments, return defaults for testing locally 