ETL for BigQuery Tables¶
The open-access TCGA data has been uploaded into a set of consistent tables
in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha,
tables
which can be accessed
via the BigQuery web interface (by anyone with an active GCP project).
In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform and load) steps that were performed for each of the data types.
Before we go into data-type-specific details, a few general notes on formatting and data curation:
- All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.
- All missing information value strings such as:
none,None,NONE,null,Null,NULL, ,NA,\_\_UNKNOWN\_\_,<blank>, and?; are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema). - Numbers are stored as integer or floating point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.
- End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.
- Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.
- Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).