2 Download RNA-Seq data from GEO
Here is an example of how to download data from GEO, using the Leucegene CBF-AML RNA-Seq data available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49642.
The first step in downloading data from GEO is to download the SRA files. For that, we first need the SRX entries, each of which corresponds to a specific sample in the RNA-Seq cohort. Publications usually provide accession numbers that are used to look up the data available online, and each accession number can link to several SRX entries. Below is an example using one accession number from the Leucegene data.
2.1 Get SRX sample names
The GEOquery package can be used to extract the SRX entries linked to an accession number.
library(GEOquery)
library(tidyverse)
library(knitr)
library(stringr)
Below, the accession number GSE49642 is used as an example.
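# Save the sample annotation (phenotype data) of a GEO series as a CSV file
series_matrix_info <- function(gse){
  # Make sure the output directory exists before writing to it
  dir.create("data", showWarnings = FALSE)
  # getGEO() returns a list of ExpressionSets; the first one is used here
  gsed <- getGEO(gse, GSEMatrix = TRUE)
  gse.mat <- pData(phenoData(gsed[[1]]))
  # Keep the sample title, the GSM accession and the column holding the SRA link
  reduced <- gse.mat[, c("title", "geo_accession", "relation.1")]
  # The file name combines the accession number and the number of samples
  write.csv(reduced, file.path("data", paste0(gse, "_", nrow(gse.mat), ".csv")), row.names = FALSE)
}
series_matrix_info("GSE49642") # 43 samples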
Every row in Table 2.1 contains a sample name (title) and a GSM number (geo_accession). To download a particular sample we need the SRA terms, which are the names starting with SRX in the relation.1 column. The structure of the matrix might change across studies, but you should be able to find the SRX entries somewhere in the GSEMatrix.
matrix_file <- list.files(path = file.path("data"),pattern = "GSE",full.names = TRUE)
GSEmatrix <- read_csv(matrix_file)
kable(GSEmatrix[1:5,],caption="SRX sample names linked to the accession number GSE49642.")
title | geo_accession | relation.1 |
---|---|---|
02H053 | GSM1203305 | SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332625 |
02H066 | GSM1203306 | SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332626 |
03H041 | GSM1203307 | SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332627 |
03H116 | GSM1203308 | SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332628 |
03H119 | GSM1203309 | SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332629 |
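Because the column holding the SRX links is not guaranteed to be called relation.1 in every study, it can help to search all columns for the pattern. The following is a minimal sketch on a toy table, not part of the original workflow; the helper name find_srx_columns and the placeholder values in the relation column are assumptions made for illustration.

```r
# Toy phenotype table standing in for pData(phenoData(gsed[[1]])).
# The values in `relation` are placeholders, not real GEO content.
pheno <- data.frame(
  title = c("02H053", "02H066"),
  geo_accession = c("GSM1203305", "GSM1203306"),
  relation = c("BioSample: (placeholder)", "BioSample: (placeholder)"),
  relation.1 = c(
    "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332625",
    "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332626"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the names of all columns in which
# at least one value contains an SRX accession.
find_srx_columns <- function(df) {
  has_srx <- vapply(
    df,
    function(col) any(grepl("SRX[0-9]+", as.character(col))),
    logical(1)
  )
  names(df)[has_srx]
}

srx_columns <- find_srx_columns(pheno)
srx_columns
```

The returned column name can then be used in place of the hard-coded relation.1 when subsetting the matrix.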
With some string processing we can extract the SRX entries.
GSEmatrix$SRX <- stringr::str_extract(string = GSEmatrix$relation.1, pattern = "SRX[0-9]+")
GSEmatrix$relation.1 <- NULL
kable(head(GSEmatrix))
title | geo_accession | SRX |
---|---|---|
02H053 | GSM1203305 | SRX332625 |
02H066 | GSM1203306 | SRX332626 |
03H041 | GSM1203307 | SRX332627 |
03H116 | GSM1203308 | SRX332628 |
03H119 | GSM1203309 | SRX332629 |
04H024 | GSM1203310 | SRX332630 |
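Before building a query from the extracted values, it may be worth checking that every sample yielded an SRX entry and that none is repeated. The sketch below uses base R only and a small made-up input vector (the third element is invented to show a failed match); extract_srx is a hypothetical helper that mirrors what the str_extract call above does.

```r
# Example input: two real-looking links and one made-up entry without an SRX term.
relation <- c(
  "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332625",
  "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX332626",
  "no SRA link"
)

# Hypothetical helper: first SRX accession in each string, NA when there is none.
extract_srx <- function(x) {
  m <- regexpr("SRX[0-9]+", x)
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

srx <- extract_srx(relation)

# Simple sanity checks: samples without an SRX entry and repeated entries.
n_missing <- sum(is.na(srx))
n_duplicated <- sum(duplicated(srx[!is.na(srx)]))
```

A non-zero n_missing suggests that the SRX links sit in a different column, or are formatted differently, for some samples.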
2.2 Create NCBI query
search_ncbi <- paste(GSEmatrix$SRX,collapse=" OR ")
search_ncbi
## [1] "SRX332625 OR SRX332626 OR SRX332627 OR SRX332628 OR SRX332629 OR SRX332630 OR SRX332631 OR SRX332632 OR SRX332633 OR SRX332634 OR SRX332635 OR SRX332636 OR SRX332637 OR SRX332638 OR SRX332639 OR SRX332640 OR SRX332641 OR SRX332642 OR SRX332643 OR SRX332644 OR SRX332645 OR SRX332646 OR SRX332647 OR SRX332648 OR SRX332649 OR SRX332650 OR SRX332651 OR SRX332652 OR SRX332653 OR SRX332654 OR SRX332655 OR SRX332656 OR SRX332657 OR SRX332658 OR SRX332659 OR SRX332660 OR SRX332661 OR SRX332662 OR SRX332663 OR SRX332664 OR SRX332665 OR SRX332666 OR SRX332667"
Paste the search string printed above into the NCBI SRA search at https://www.ncbi.nlm.nih.gov/sra and follow the instructions under "Download sequence data files using SRA Toolkit" (https://www.ncbi.nlm.nih.gov/sra/docs/sradownload/#download-sequence-data-files-usi) to download the SRR run names and the run information.
# Files are saved in the home directory under ncbi/public/sra
prefetch --option-file SraAccList_CBF-AML_Leucegene.txt
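The file passed to --option-file is a plain text file with one accession per line. If the run accessions are already available in R, such a file could be written as in the sketch below. The SRR values are placeholders, not real Leucegene runs, and the output goes to a temporary file here rather than to the file name used above.

```r
# Placeholder run accessions; replace them with the SRR names obtained from NCBI.
runs <- c("SRR0000001", "SRR0000002", "SRR0000003")

# Write one accession per line (a temporary file is used for illustration).
acc_file <- tempfile(pattern = "SraAccList_", fileext = ".txt")
writeLines(runs, acc_file)

# Read the file back to confirm its content.
written <- readLines(acc_file)
```

The resulting path can then be given to prefetch --option-file in the same way as the downloaded accession list.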
SRA files can then be converted to fastq files with fastq-dump --split-files.
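When there are many runs, the conversion commands can be generated from R instead of being typed by hand. This is a rough sketch under a few assumptions: the .sra files are named after their run accession, they sit in the directory mentioned above, and the run names and the output directory fastq are placeholders. build_fastq_dump_cmd is a hypothetical helper; it only builds the command strings and does not run them.

```r
# Assumed location of the prefetched files (see the comment in the prefetch step).
sra_dir <- path.expand(file.path("~", "ncbi", "public", "sra"))

# Placeholder run accessions, not real Leucegene runs.
runs <- c("SRR0000001", "SRR0000002")

# Hypothetical helper: build one fastq-dump command for a single run.
build_fastq_dump_cmd <- function(run, sra_dir, out_dir) {
  sra_file <- file.path(sra_dir, paste0(run, ".sra"))
  paste(
    "fastq-dump --split-files --outdir",
    shQuote(out_dir, type = "sh"),
    shQuote(sra_file, type = "sh")
  )
}

cmds <- vapply(
  runs, build_fastq_dump_cmd, character(1),
  sra_dir = sra_dir, out_dir = "fastq",
  USE.NAMES = FALSE
)
```

Each element of cmds could then be passed to system(), or written to a shell script with writeLines(), once the paths have been checked.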