About the TCGA Data

The ISB-CGC hosts approximately 1 petabyte of TCGA data in Google Cloud Storage (GCS) and in BigQuery.

The data hosted by the ISB-CGC was obtained from the two main TCGA data repositories:

  • TCGA DCC: the TCGA Data Coordinating Center, which provides a Data Portal from which users may download open-access or controlled-access data. This portal provides access to all TCGA data except for the low-level sequence data.
  • CGHub: the Cancer Genomics Hub, NCI’s current secure data repository for all TCGA BAM and FASTQ sequence data files.

The ISB-CGC platform is one of NCI’s Cancer Genomics Cloud Pilots. Our mission is to host the TCGA data in the cloud so that researchers around the world can work with it without needing to download and store it at their own institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level (“Level 1”) data has been processed through a set of standardized pipelines, and the resulting high-level (“Level 3”) data is what most downstream analyses use. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.

More details about the TCGA data can be found in the sections below.

