2 Download RNA-Seq data from GEO

Here is an example of how to download data from GEO, using the Leucegene CBF-AML RNA-Seq data available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49642. The first step in downloading data from GEO is to download the SRA files. For that, we first need the SRX entries, each of which corresponds to a specific sample in the RNA-Seq cohort. Publications usually provide accession numbers that can be used to look up the data available online, and each accession number links to several SRX entries. Below is an example using one accession number from the Leucegene data.

2.1 Get SRX sample names

The GEOquery package can be used to extract the SRX entries linked to an accession number.

library(GEOquery)
library(tidyverse)
library(knitr)
library(stringr)

Below, the accession number GSE49642 is used as an example.

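# Save the sample annotation (title, GSM number, SRA link) of a GEO series as a CSV file
series_matrix_info <- function(gse){
  # Download the series matrix; getGEO() returns a list of ExpressionSets
  gsed <- getGEO(gse, GSEMatrix = TRUE)
  # Sample annotation of the first element of that list
  gse.mat <- pData(phenoData(gsed[[1]]))
  reduced <- gse.mat[, c("title", "geo_accession", "relation.1")]
  # Make sure the output directory exists before writing
  dir.create("data", showWarnings = FALSE)
  write.csv(reduced, file.path("data", paste0(gse, "_", nrow(gse.mat), ".csv")), row.names = FALSE)
}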

series_matrix_info("GSE49642") # 43 samples

Every row in Table 2.1 contains a sample name (title) and a GSM number (geo_accession). To download a particular sample we need its SRA entry, which is the name starting with SRX in the relation.1 column. The structure of the matrix may differ between studies, but the SRX entries should be found somewhere in the GSEMatrix.

matrix_file <- list.files(path = file.path("data"),pattern = "GSE",full.names = TRUE)
GSEmatrix <- read_csv(matrix_file)

kable(GSEmatrix[1:5,],caption="SRX sample names linked to the accession number GSE49642.")
Table 2.1: SRX sample names linked to the accession number GSE49642.
title geo_accession relation.1
02H053 GSM1203305 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332625
02H066 GSM1203306 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332626
03H041 GSM1203307 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332627
03H116 GSM1203308 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332628
03H119 GSM1203309 SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332629
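
Since the column that holds the SRA link can differ between studies, it can help to search every column for the SRX pattern rather than rely on the name relation.1. The following is a minimal sketch on a toy table. The data frame pheno, its extra relation column, and the placeholder text in that column are made up for illustration and only mimic the layout of Table 2.1.

```r
# Toy stand-in for the sample annotation of a GEO series (hypothetical data)
pheno <- data.frame(
  title         = c("02H053", "02H066"),
  geo_accession = c("GSM1203305", "GSM1203306"),
  relation      = c("placeholder without SRA link", "placeholder without SRA link"),
  relation.1    = c("SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332625",
                    "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332626"),
  stringsAsFactors = FALSE
)

# For each column, test whether at least one value contains an SRX accession
has_srx <- vapply(
  pheno,
  function(col) any(grepl("SRX[0-9]+", as.character(col))),
  logical(1)
)

# Names of the columns that carry SRX entries
srx_cols <- names(pheno)[has_srx]
srx_cols
```

On a real study, the same check could be applied to the full sample annotation before selecting columns.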

With some string processing we can extract the SRX entries.

GSEmatrix$SRX <- stringr::str_extract(string = GSEmatrix$relation.1, pattern = "SRX[0-9]+")
GSEmatrix$relation.1 <- NULL
kable(head(GSEmatrix))
title geo_accession SRX
02H053 GSM1203305 SRX332625
02H066 GSM1203306 SRX332626
03H041 GSM1203307 SRX332627
03H116 GSM1203308 SRX332628
03H119 GSM1203309 SRX332629
04H024 GSM1203310 SRX332630
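
Not every row is guaranteed to carry an SRA link, and such rows would end up without an SRX entry. The sketch below shows one way to extract the accessions in base R and flag rows without a match. It assumes that an SRX accession is the prefix SRX followed by digits. The helper extract_srx and the third input string are hypothetical.

```r
# Toy input: two links as in Table 2.1 plus one made-up entry without an SRX
relation <- c(
  "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332625",
  "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332626",
  "no SRA link available"
)

# Return the first SRX accession of each string, or NA if there is none
extract_srx <- function(x) {
  hit <- regexpr("SRX[0-9]+", x)          # -1 where nothing matches
  out <- rep(NA_character_, length(x))
  matched <- !is.na(hit) & hit > 0
  out[matched] <- regmatches(x, hit)      # regmatches() returns matches only
  out
}

srx <- extract_srx(relation)
srx

# Rows that need a manual look because no SRX entry was found
missing_rows <- which(is.na(srx))
missing_rows
```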

2.2 Create NCBI query

search_ncbi <- paste(GSEmatrix$SRX,collapse=" OR ")
search_ncbi
## [1] "SRX332625 OR SRX332626 OR SRX332627 OR SRX332628 OR SRX332629 OR SRX332630 OR SRX332631 OR SRX332632 OR SRX332633 OR SRX332634 OR SRX332635 OR SRX332636 OR SRX332637 OR SRX332638 OR SRX332639 OR SRX332640 OR SRX332641 OR SRX332642 OR SRX332643 OR SRX332644 OR SRX332645 OR SRX332646 OR SRX332647 OR SRX332648 OR SRX332649 OR SRX332650 OR SRX332651 OR SRX332652 OR SRX332653 OR SRX332654 OR SRX332655 OR SRX332656 OR SRX332657 OR SRX332658 OR SRX332659 OR SRX332660 OR SRX332661 OR SRX332662 OR SRX332663 OR SRX332664 OR SRX332665 OR SRX332666 OR SRX332667"
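
The query string can also be turned into a search link. This is a small sketch that assumes the SRA search accepts the query through the same ?term= URL pattern that appears in the relation.1 links; only three SRX entries are used to keep the example short.

```r
# Three SRX entries from the table above
srx <- c("SRX332625", "SRX332626", "SRX332627")

# Same construction as search_ncbi
query <- paste(srx, collapse = " OR ")

# Percent-encode the query (spaces become %20) and append it to the SRA URL
search_url <- paste0(
  "https://www.ncbi.nlm.nih.gov/sra?term=",
  utils::URLencode(query, reserved = TRUE)
)
search_url
```

In an interactive session, utils::browseURL(search_url) should open the link in the default browser.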

Paste the search string printed above (the value of search_ncbi) into the NCBI SRA search at https://www.ncbi.nlm.nih.gov/sra and follow the instructions in "Download sequence data files using SRA Toolkit" (https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/#download-sequence-data-files-usi) to download the list of SRR run names and the information about the runs.

# Files are saved in the home directory under ncbi/public/sra
prefetch --option-file SraAccList_CBF-AML_Leucegene.txt

SRA files can then be converted to fastq files with fastq-dump --split-files.
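
To convert many runs at once, the fastq-dump calls can be generated from R. The sketch below only builds the command strings and does not run them. The helper fastq_dump_cmds, the output directory fastq, and the example file names are hypothetical, and the --outdir option is an addition to the command mentioned above.

```r
# Build one fastq-dump command per .sra file (commands are not executed here)
fastq_dump_cmds <- function(sra_files, outdir = "fastq") {
  # paste() would return a single incomplete command for empty input
  if (length(sra_files) == 0) return(character(0))
  paste("fastq-dump --split-files --outdir", shQuote(outdir), shQuote(sra_files))
}

# Hypothetical file names, used only to show the generated commands
example_files <- c("SRR000001.sra", "SRR000002.sra")
cmds <- fastq_dump_cmds(example_files)
cmds

# With real data: list the files that prefetch saved under ~/ncbi/public/sra
sra_dir <- file.path(path.expand("~"), "ncbi", "public", "sra")
sra_files <- list.files(sra_dir, pattern = "\\.sra$", full.names = TRUE)
real_cmds <- fastq_dump_cmds(sra_files)
```

The resulting strings could be written to a shell script with writeLines(real_cmds, "convert.sh") or passed one by one to system().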