Abstract

Motivation: Computational analysis of datasets generated by treating cells with pharmacological and genetic perturbagens has proven useful for the discovery of functional relationships. Facilitated by technological improvements, perturbational datasets have grown in recent years to include millions of experiments. While initial studies, such as our work on Connectivity Map, used gene expression readouts, recent studies from the NIH LINCS consortium have expanded to a more diverse set of molecular readouts, including proteomic and cell morphological signatures. Sharing these diverse data creates many opportunities for research and discovery, but the unprecedented size of the data generated and the complex metadata associated with experiments have also created fundamental technical challenges regarding data storage and cross-assay integration.

Results: We present the GCTx file format and a suite of open-source packages for the efficient storage, serialization, and analysis of dense two-dimensional matrices. The utility of this format is not just theoretical; we have used it extensively in the Connectivity Map to assemble and share massive datasets comprising 1.7 million experiments. We anticipate that the generalizability of the GCTx format, paired with the code libraries that we provide, will stimulate wider adoption and lower barriers for integrated cross-assay analysis and algorithm development.

Availability: Software packages (available in Matlab, Python, and R) are freely available at https://github.com/cmap

Details

Title
The GCTx format and cmap{Py, R, M} packages: resources for the optimized storage and integrated traversal of dense matrices of data and annotations
Author
Enache, Oana M; Lahr, David L; Natoli, Ted E; Litichevskiy, Lev; Wadden, David; Flynn, Corey; Gould, Joshua; Asiedu, Jacob K; Narayan, Rajiv; Subramanian, Aravind
University/institution
Cold Spring Harbor Laboratory Press
Section
New Results
Publication year
2018
Publication date
Jan 3, 2018
Publisher
Cold Spring Harbor Laboratory Press
ISSN
2692-8205
Source type
Working Paper
Language of publication
English
ProQuest document ID
2071121452
Copyright
© 2018. This article is published under http://creativecommons.org/licenses/by/4.0/ (“the License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.