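41--RNA-Seq Data Analysis Preparation: Downloading and Organizing SRA Data [Original] - Shanxi Key Laboratory of Big Data for Clinical Decision Research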

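This article was written by Zhang Haomin (张皓旻), a member of the 医信融合 team, and is also published on the WeChat public accounts "医信融合创新沙龙" and "表观精准治疗". Follow them for more content!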

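I recently tried to download SRA data from inside the Windows Subsystem for Linux, and it went very badly, so I gave up on that route. Below are two methods that I have tested myself and found to work well.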

Method 1: Downloading with SRA Toolkit on Windows

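First, download the Windows version of SRA Toolkit from the official website.

Then extract the archive and set it up.

Run the following command in the Windows command prompt (CMD), replacing <install_path> with the folder where you extracted the toolkit:

<install_path>\sratoolkit.2.11.0-win64\bin\vdb-config --interactive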

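This opens the configuration screen.

To keep the individual components from misbehaving, the safe choice is to leave everything at its default.
Use the up and down arrow keys to move between items, press "s" to save, then choose "exit" to quit.

Then run a command to check that the setup succeeded.

Output like that in the screenshot means the installation succeeded.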

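Downloading data is straightforward: go to the SRA database, select the runs you want, download their SRR_Acc_List.txt file, and run the following command in the directory where the data should be stored (<install_path> and the version number in the folder name should match your own installation):

<install_path>\sratoolkit.2.11.1-win64\bin\prefetch.exe --option-file SRR_Acc_List.txt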

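To find the SRR_Acc_List.txt file, proceed as follows.

Step 1: Open the GEO page of the target dataset and click the item marked with the red box in the screenshot. (We will cover how to use the GEO database in detail in a later post; please help me remember that promise.)

Step 2: On the page that opens, click the item marked with the red box and save the file as SRR_Acc_List.txt in your data directory.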

Once the download starts, the output looks like the screenshot; then simply wait for it to finish.
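The accession list is saved on Windows and reused later in shell loops under WSL, where stray carriage returns or blank lines can break file names. As a minimal sketch (the helper name `clean_acc_list` and the output file name are my own assumptions, not part of SRA Toolkit), the list could be normalized like this:

```bash
#!/usr/bin/env bash
# Sketch only: clean_acc_list is a hypothetical helper, not an SRA Toolkit command.
clean_acc_list() {
    # $1: path to the accession list (one run accession per line)
    # 1. remove Windows carriage returns
    # 2. trim leading/trailing whitespace
    # 3. keep only lines that look like run accessions (SRR/ERR/DRR + digits)
    # 4. drop duplicates while preserving the original order
    tr -d '\r' < "$1" \
        | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        | grep -E '^[SED]RR[0-9]+$' \
        | awk '!seen[$0]++'
}

# Example call (commented out so the sketch has no side effects):
# clean_acc_list SRR_Acc_List.txt > SRR_Acc_List.clean.txt
```

The cleaned file can then be used wherever SRR_Acc_List.txt is read line by line.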

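Method 2: Downloading with sra-explorer

SRA Explorer (https://sra-explorer.info/) can generate download commands for SRA data.

Continuing from above: once you have chosen a dataset, enter its identifier in the search box. Either the GSE number or the SRA study number works; for the example above, that is GSE176393 or SRP323246. The steps are shown in the figure below.

After completing the three steps above, the following appears.

Here you can see URLs for the various data types. You can download the FASTQ files directly or download the SRA files. I chose to download the FASTQ files directly, which is more convenient.

Once the download commands appear, there are two options: 1. the clumsy way, running them one by one in the Linux subsystem; 2. copying the commands into a .sh file and running it as a shell script to download everything in batch.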

vim download.sh  # paste the generated download commands into this file and save
nohup bash download.sh &  # run in the background; output is written to nohup.out
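Long batch downloads tend to fail on individual files when the connection drops. The following is a rough sketch, assuming each line of download.sh is one independent download command; `run_with_retry` and the `RETRY_DELAY` variable are hypothetical names introduced here for illustration. It runs the commands one at a time and retries any that fail:

```bash
#!/usr/bin/env bash
# Sketch only: run_with_retry and RETRY_DELAY are hypothetical, not part of SRA Explorer.
run_with_retry() {
    # $1: file with one command per line; $2: max attempts per command (default 3)
    local file=$1 max=${2:-3} cmd attempt failed=0
    while IFS= read -r cmd || [ -n "$cmd" ]; do
        # Skip blank lines and comment lines (including the shebang)
        case "$cmd" in ''|'#'*) continue ;; esac
        attempt=1
        # stdin is redirected so the command cannot swallow the remaining lines
        until bash -c "$cmd" < /dev/null; do
            if [ "$attempt" -ge "$max" ]; then
                echo "FAILED after $max attempts: $cmd" >&2
                failed=$((failed + 1))
                break
            fi
            attempt=$((attempt + 1))
            sleep "${RETRY_DELAY:-5}"   # wait before retrying
        done
    done < "$file"
    # Return non-zero if any command never succeeded
    return "$((failed > 0))"
}

# Example call (commented out so the sketch has no side effects):
# run_with_retry download.sh 3
```

Commands that still fail after the last attempt are reported on standard error, so they can be rerun by hand later.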

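Which of the two methods you use is a matter of preference; both work as long as the network connection is good.

Organizing the data

If you downloaded with Method 2, the files can be used directly in downstream analysis.

If you downloaded with Method 1, each .sra file is stored in a folder named after its accession, so first gather all the files into a single folder, which makes the following steps much easier.

Here is the code: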

## Loop over the accession list to move all files in one batch
mkdir -p download
while read -r line
do
  mv "$line/$line.sra" "download/$line.sra"
done < SRR_Acc_List.txt
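Before converting anything, it can be worth confirming that every accession in the list actually has a non-empty .sra file in the collected folder. A small sketch of such a check follows; `list_missing_sra` is a hypothetical helper name, and the layout `download/<accession>.sra` is taken from the loop above:

```bash
#!/usr/bin/env bash
# Sketch only: list_missing_sra is a hypothetical helper for illustration.
list_missing_sra() {
    # $1: accession list; $2: directory expected to hold <accession>.sra files
    local list=$1 dir=$2 acc
    # tr removes Windows carriage returns so names match the files on disk
    tr -d '\r' < "$list" | while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue
        # -s is true only if the file exists and is not empty
        [ -s "$dir/$acc.sra" ] || echo "$acc"
    done
}

# Example call (commented out so the sketch has no side effects):
# list_missing_sra SRR_Acc_List.txt download
```

If the function prints nothing, every listed run is present; any accession it prints would need to be downloaded again.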

Next, use fasterq-dump to convert the .sra files to .fastq files.

fasterq-dump needs to be installed in WSL (for how to set up WSL, see our earlier post "windows linux 子系统平民生信利器"). Run the following command to install it:

conda install -c bioconda sra-tools=2.11.0

## After installation, check the fasterq-dump help text to see how to use it
fasterq-dump -h

Usage: fasterq-dump [ options ] [ accessions(s)... ]

Parameters:

  accessions(s)                    list of accessions to process

Options:

  -o|--outfile <path>              full path of outputfile (overrides usage
                                   of current directory and given accession)
  -O|--outdir <path>               path for outputfile (overrides usage of
                                   current directory, but uses given
                                   accession)
  -b|--bufsize <size>              size of file-buffer (dflt=1MB, takes
                                   number or number and unit where unit is
                                   one of (K|M|G) case-insensitive)
  -c|--curcache <size>             size of cursor-cache (dflt=10MB, takes
                                   number or number and unit where unit is
                                   one of (K|M|G) case-insensitive)
  -m|--mem <size>                  memory limit for sorting (dflt=100MB,
                                   takes number or number and unit where
                                   unit is one of (K|M|G) case-insensitive)
  -t|--temp <path>                 path to directory for temp. files
                                   (dflt=current dir.)
  -e|--threads <count>             how many threads to use (dflt=6)
  -p|--progress                    show progress (not possible if stdout used)
  -x|--details                     print details of all options selected
  -s|--split-spot                  split spots into reads
  -S|--split-files                 write reads into different files
  -3|--split-3                     writes single reads into special file
  --concatenate-reads              writes whole spots into one file
  -Z|--stdout                      print output to stdout
  -f|--force                       force overwrite of existing file(s)
  -N|--rowid-as-name               use rowid as name (avoids using the name
                                   column)
  --skip-technical                 skip technical reads
  --include-technical              explicitly include technical reads
  -P|--print-read-nr               include read-number in defline
  -M|--min-read-len <count>        filter by sequence-lenght
  --table <name>                   which seq-table to use in case of pacbio
  --strict                         terminate on invalid read
  -B|--bases <bases>               filter output by matching against given
                                   bases
  -A|--append                      append to output-file, instead of
                                   overwriting it
  --ngc <path>                     <path> to ngc file
  --perm <path>                    <path> to permission file
  --location <location>            location in cloud
  --cart <path>                    <path> to cart file
  -V|--version                     Display the version of the program
  -v|--verbose                     Increase the verbosity of the program
                                   status messages. Use multiple times for
                                   more verbosity.
  -L|--log-level <level>           Logging level as number or enum string.
                                   One of
                                   (fatal|sys|int|err|warn|info|debug) or
                                   (0-6) Current/default is warn
  --option-file file               Read more options and parameters from the
                                   file.
  -h|--help                        print this message

"fasterq-dump" version 2.11.0

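The help text tells us what we need. Our data is paired-end, so each run has to be split into two files, hence the --split-files option. fasterq-dump cannot write .gz compressed files directly, so the output has to be compressed manually afterwards to save disk space during analysis.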

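mkdir -p rawdata # create a folder to hold the converted data

## Choose options to suit your goal, then convert in batch
## (-e 12 uses 12 threads; adjust to your CPU)
while read -r line
do
  fasterq-dump -e 12 --split-files -O rawdata "download/$line.sra"
done < SRR_Acc_List.txt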
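With paired-end data, a quick sanity check after conversion is that every `_1.fastq` has a matching `_2.fastq` with the same number of reads. The sketch below assumes the `<accession>_1.fastq` / `<accession>_2.fastq` naming produced by --split-files and standard four-line FASTQ records; `check_pairs` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# Sketch only: check_pairs is a hypothetical helper for illustration.
check_pairs() {
    # $1: directory containing <accession>_1.fastq and <accession>_2.fastq
    local dir=$1 r1 r2 n1 n2 status=0
    for r1 in "$dir"/*_1.fastq; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        r2=${r1%_1.fastq}_2.fastq
        if [ ! -e "$r2" ]; then
            echo "MISSING MATE: $r1"; status=1; continue
        fi
        # One FASTQ record is assumed to span 4 lines, so reads = lines / 4
        n1=$(( $(wc -l < "$r1") / 4 ))
        n2=$(( $(wc -l < "$r2") / 4 ))
        if [ "$n1" -eq "$n2" ]; then
            echo "OK: ${r1##*/} and ${r2##*/} ($n1 reads each)"
        else
            echo "MISMATCH: $r1 ($n1) vs $r2 ($n2)"; status=1
        fi
    done
    return "$status"
}

# Example call (commented out so the sketch has no side effects):
# check_pairs rawdata
```

A non-zero return status means at least one run is incomplete and is probably worth converting again.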

fasterq-dump runs quickly. Next, a single command compresses all the files in batch.

gzip rawdata/*.fastq
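gzip handles one file at a time, so with many large FASTQ files the compression step can take a while. As an optional sketch, several gzip processes can be run side by side with `find` and `xargs`; `compress_fastq` is a hypothetical helper name, and the `-P` and `-r` options assume GNU xargs as found in a typical WSL distribution:

```bash
#!/usr/bin/env bash
# Sketch only: compress_fastq is a hypothetical helper for illustration.
compress_fastq() {
    # $1: directory with .fastq files; $2: number of parallel gzip jobs (default 4)
    local dir=$1 jobs=${2:-4}
    # -print0 / -0 keep file names intact even if they contain spaces
    # -P runs up to $jobs gzip processes at once, -n 1 gives each one file
    # -r skips running gzip when no .fastq files are found (GNU xargs)
    find "$dir" -maxdepth 1 -type f -name '*.fastq' -print0 \
        | xargs -0 -r -P "$jobs" -n 1 gzip
}

# Example call (commented out so the sketch has no side effects):
# compress_fastq rawdata 4
```

As with plain gzip, each .fastq file is replaced by a .fastq.gz file in the same folder.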

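With the data in this form, the downstream analysis becomes very convenient.

Simple, isn't it? Next we will start learning transcriptome data analysis step by step. Come and learn with us!

Text and figures: Zhang Haomin (张皓旻)

Editor: Li Chenlong (李晨龙)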
