本文由医信融合团队成员“陈浩然”撰写,已同步至微信公众号“医信融合创新沙龙”与“研究生学生信”,更多精彩内容欢迎关注!

名人名言
面对这个问题,我们需要拿出新水平、达到新境界,通过新举措、新发展,形成新突破,为此,我们必须重视新方法、看清新形式、理准新要求,只有这样,我们才能在新期待、新关系中,用好新本领、展现新风貌、走出新高度,新知识造就新事物、新实践获得新成果。
——佈兹稲誰硕得
blood的前两集:
刚上研一,复现blood
刚上研一,复现blood(二)

Article name:A comprehensive transcriptome signature of murine hematopoietic stem cell aging
Journal:blood
Doi:10.1182/blood.2020009729
IF:23.629
Position:Figure 1E

买家秀(左)与卖家秀(右)
1
先解释一下
作图背景:
•
1. 研究一共纳入12项研究和1000 +基因
•
1. 每项研究包含1000 +基因中的部分基因
• 1+2:有一个12*1000的矩阵,矩阵中含有NA值
• 每个基因在12项研究中出现的次数作为consistency
• 12项研究从1个子集到11个子集**完全排列组合**
• 需要求每个子集的基因平均consistency和基因表达量的CV
如果按照文字描述的话,实际上只能画出左边那张图出来。而右图的漏洞也很明显,原文作者既然分析了all possiblee combinations,那么图上的点怎么就那么少,实际的点的数量应该是C(1,12)+C(2,12)+C(3,12)+C(4,12)+C(5,12)+...+C(11,12)=4094个,每个点的x轴对应每种组合的GMC(gene mean consistency score,也就是这个基因在12篇文献中平均**出现的次数),y轴对应的是数据集中基因的变异系数(CV,sd(consistency)/mean(consistency)),但是右图能有200个点就不错了,问题出在哪呢?**

原文对于此图的描述
2
结论分析

先说右图吧,原文中得到的结论为:
12项研究在不同的排列中为相同的基因产生了多重一致性。当使用少量的数据集进行分析时,我们观察到高度的变化,当数据集的数量增加时,这种变化开始减小。当使用> 6个数据集时,基因的数量和一致性稳定。
然鹅实际考虑一下,其实这并不好解释,单看1个subset研究的点,CV大的很大,小的也接近0,而纳入11项研究后,研究数量上去了,基因表达的CV反而下降了,这科学吗?
可能说起来比较晦涩,举个栗子
> matrix(data = c(100,0,100,0,100,0,100,0,100,0),nrow = 5,ncol = 2)
[,1] [,2]
[1,] 100 0
[2,] 0 100
[3,] 100 0
[4,] 0 100
[5,] 100 0
> tem<-matrix(data = c(100,0,100,0,100,0,100,0,100,0),nrow = 5,ncol = 2)
> sd(tem[,1])/mean(tem[,1])
[1] 0.9128709
> sd(tem[,2])/mean(tem[,2])
[1] 1.369306
> sd(c(100,0,100,0,100,0,100,0,100,0))/mean(c(100,0,100,0,100,0,100,0,100,0))
[1] 1.054093
例:一个矩阵,五行两列,单独计算每列CV,第一列CV为0.913,第二列CV为1.369,合并起来之后,1.369>CV(合并)>0.913,为1.054。
因此,实际上右图中subset=11的CV,能小于subset=1的CV的最小值吗???
反观左图,其实就很能说明这个概念,subset越大(例如subset=11),它的点的CV值就更加无法超过subset小(例如subset=1)的点的CV的区域,其实也就是,subset越大,其点越集中,在左图中也就显得越聚集。
欢迎各位大佬批评指正,我就是觉得blood这张图出错了。但是20多分的blood怎么会出错呢???肯定是我自己的问题。
3
上代码
library(ggplot2)
library(dplyr)
library(tidyr)
library(reshape)
library(corrplot)
library(ggpubr)
library(do)
rm(list = ls())
setwd("./file")
blood_output <- read.table("output.txt",sep = "\t",header = T)
data <- blood_output[,2:13]
# Fig 1C
plot_matrix <- matrix(nrow = ncol(data),ncol = 3,dimnames = list(NULL,c("name","Upregulated","Downregulated")))
for (x in 1:ncol(data)) {
data_name <- colnames(data)[x]
non_na_data <- na.omit(data[,x])
Upregulated <- length(non_na_data[non_na_data>0])
Downregulated <- length(non_na_data[non_na_data<0])
plot_matrix[x,] <- c(data_name,Upregulated,Downregulated)
}
plot_matrix <- as.data.frame(plot_matrix)
plot_matrix <- melt(plot_matrix,id.vars = c("name"))
plot_matrix$value <- as.numeric(plot_matrix$value)
pl <- ggplot(data=plot_matrix, aes(x=reorder(name,-value), y=value)) +
geom_bar(stat = "identity",aes(fill=variable))+
scale_fill_manual(values=c("#005187","#e5082c"))+
scale_y_continuous(breaks=c(1000,2000,3000),
labels=c("1000", "2000", "3000"))+
theme_bw()+
theme(panel.grid=element_blank())+
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 1))+
labs(x="",y="# of reported DE genes",title = "Reanalysis")+
theme(text = element_text(family = "Arial",face = "bold"))
pl
ggsave(pl, filename = "blood_figure_1c.pdf", device = cairo_pdf,
width = 8, height = 7, units = "in")
# supple Fig1D
multidata <- read.table("output.txt",sep = "\t",header = T,check.names = F)
length_compute <- function(x){
length(x[!is.na(x)])-1
}
genematrix_count <- apply(multidata,1,length_compute)
multidata <- cbind(multidata,count=genematrix_count)
point_plot_data <- as.data.frame(table(multidata$count))
point_plot_data[,1] <- as.numeric(point_plot_data[,1])
supple_Fig1D <- ggplot(data = point_plot_data, aes(x=Var1,y=Freq)) +
geom_smooth(mapping = aes(x=Var1,y=Freq),color="#4baf49",
method = "loess",se = F,span=1,formula = y~I(1/x),size=1.5) +
geom_point(shape=1,size=3,color="#4baf49",stroke=2)+
labs(x="number of overlaps across studies",y="DE genes")+
ylim(0,5000)+
scale_x_continuous(breaks = as.numeric(point_plot_data$Var1))+
theme_bw()+
theme(panel.grid=element_blank())+
theme(axis.title = element_text(size = 15,face = "bold",colour = "black"),
axis.text = element_text(size=15,face = "bold",colour = "black"),
panel.border = element_rect(fill=NA,color="black", size=3, linetype="solid"),
axis.ticks = element_line(size = 2))
ggsave(supple_Fig1D, filename = "supple Fig1D.PDF", device = cairo_pdf,
width = 5, height = 4, units = "in")
#supple Fig1F
rm(list = ls())
Fig1F_data <- read.table("output_remove.txt",sep = "\t",header = T,check.names = F,row.names = 1)
Fig1F_data <- Fig1F_data[,c(5,3,4,6,2,9,11,1,10,12,7,8)]
heatmap_data<-matrix(data = NA,nrow = ncol(Fig1F_data),
ncol = ncol(Fig1F_data),
dimnames = list(colnames(Fig1F_data),colnames(Fig1F_data)))
two_inter <- list()
for (i in colnames(Fig1F_data)) {
for (x in colnames(Fig1F_data)) {
two_inter[[i]][[x]] <- rownames(na.omit(Fig1F_data[,c(i,x)]))
heatmap_data[i,x] <- nrow(na.omit(Fig1F_data[,c(i,x)]))
}
}
heatmap_data[upper.tri(heatmap_data)]<-NA
fraction <- c()
for (n in names(two_inter)) {
inter_names <- names(two_inter)[names(two_inter)!=n]
rname <- c()
for (n2 in inter_names) {
rname <- unique(c(rname,two_inter[[n]][[n2]]))
}
fraction <- c(fraction,length(rname)/length(Fig1F_data[,n][!is.na(Fig1F_data[,n])]))
}
heatmap_data_gg<-reshape2::melt(heatmap_data)
# heatmap_plot<-
ggplot(heatmap_data_gg,aes(x=Var1,y=Var2,fill=log2(value)))+
geom_tile(color = "white",na.rm = T,size=1.5)+
scale_fill_gradient2(low = "#800080", high = "#faed5d", mid = "#65aeb0",
midpoint = 6, space = "Lab",na.value = "white",guide = "colourbar") +
labs(x="",y="",title = "Re-analysis")+
theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
panel.border = element_rect(fill = NA,color = "black",size = 2),
axis.ticks = element_line(size = 1,linetype = 1),
legend.text = element_blank(),
plot.title = element_text(size=18,face = "bold",colour = "black",vjust = 2,hjust = 0.5),
axis.text = element_text(size=15,face = "bold",colour = "black"),
legend.title = element_blank(),legend.key.width = unit(0.7,"cm")
)+
coord_fixed(1)+
geom_text(aes(label=value),colour= "white")
fraction_data <- as.data.frame(cbind(,length(fraction)),fraction,name_true = colnames(Fig1F_data)))
fraction_data$fraction <- factor(fraction_data$fraction,levels=fraction)
inter_ratio <- ggplot(data = fraction_data,aes(x=name,y=fraction,fill=as.numeric(as.character(fraction))))+
geom_tile(color = "white",size = 1.5)+
theme_bw()+
scale_fill_gradient2(low = "#800080", high = "#faed5d", mid = "#65aeb0",
midpoint = 0.5, space = "Lab",na.value = "white")+
coord_fixed(ratio=1)+
theme(axis.text = element_blank(),
axis.title = element_blank(),
legend.position="none",
panel.border = element_rect(fill = NA,color = "black",size = 2),
axis.ticks.x=element_blank(),
axis.ticks.y = element_line(size = 2),
)+
geom_text(aes(label=signif(as.numeric(as.character(fraction)), 2),colour="white"))
# scale_y_discrete(breaks=fraction,labels=fraction_data$name_true,position = "right")
all_plot <- ggarrange(heatmap_plot, inter_ratio,
ncol = 2, nrow = 1,
widths = c(12, 1), heights = 9,
common.legend = TRUE,hjust = 0)
ggsave(all_plot, filename = "supple Fig1F.PDF", device = cairo_pdf,
width = 8, height = 7, units = "in")
# Fig 1E
rm(list = ls())
Fig1E_data <- read.table("output_remove.txt",sep = "\t",header = T,check.names = F,row.names = 1)
# function
length_compute <- function(x){
length(x[!is.na(x)])
}
GMC_count_row <- function(one_col,rt){
count_all <- rt[,"count"]
one_col <- count_all[!is.na(one_col)]
count_mean <- mean(one_col)
}
GMC_count_all <- function(data_set,Fig1E_data, GMC_count_row){
rt <- Fig1E_data[,data_set]
rt <- cbind(rt, count=Fig1E_data$count)
if ((ncol(rt)>=3)) {
row_gene_mean<-mean(apply(rt[,1:(ncol(rt)-1)],2,FUN = GMC_count_row,rt=rt))
}else{
row_gene_mean<-apply(as.matrix(rt[,1]),2,FUN = GMC_count_row,rt=rt)
}
}
CV_all <- function(data_set,Fig1E_data){
rt <- as.matrix(Fig1E_data[,data_set])
delete_row <- which(apply(rt,1,function(x) all(is.na(x))))
if (length(delete_row)>0) {
rt<-rt[-delete_row,]
}
gene_EXP <- as.vector(t(rt))
gene_EXP <- gene_EXP[!is.na(gene_EXP)]
rt_CV <- sd(gene_EXP)/mean(gene_EXP)
}
genematrix_count <- apply(Fig1E_data,1,length_compute)
Fig1E_data <- cbind(Fig1E_data,count=genematrix_count)
Fig1E_data <- Fig1E_data[Fig1E_data$count>1,]
Fig1E_CV_data <- Fig1E_data
Fig1E_CV_data[is.na(Fig1E_CV_data)]<-0
result_all <- list()
for (i in 1:11) {
data_set <- t(combn(colnames(Fig1E_data)[-13],m = i))
CV_result <- apply(data_set,1,FUN = CV_all,Fig1E_data=Fig1E_data)
GMC_count_result <- apply(data_set,MARGIN = 1,FUN = GMC_count_all,Fig1E_data=Fig1E_data,GMC_count_row=GMC_count_row)
plot_data_single <- cbind(CV=CV_result,GMC=GMC_count_result,type=i)
result_all[[i]] <- plot_data_single
}
plot_data <- as.data.frame(do.call(what = rbind,args = result_all))
plot_data[plot_data<0]<-abs(plot_data[plot_data<0])
plot_data$GMC2<-(plot_data$GMC-min(plot_data$GMC))/(max(plot_data$GMC)-min(plot_data$GMC))*12
plot_data_article <- matrix(nrow = 11,ncol = 3,dimnames = list(NULL,c("GMC","CV","type")))
plot_data_article <- as.data.frame(plot_data_article)
for (x in 1:11) {
plot_data_article[x,"CV"]<-sd(plot_data[plot_data$type==x,"CV"])/mean(plot_data[plot_data$type==x,"CV"])
plot_data_article[x,"GMC"]<-mean(plot_data[plot_data$type==x,"GMC"])
plot_data_article[x,"type"]<-x
plot_data[plot_data$type==x,"CV"] <- rep(plot_data_article$CV[x],length(plot_data[plot_data$type==x,"CV"]))
}
#plot_article
ggplot(data = plot_data_article, aes(x=GMC,y=CV)) +
ylim(0,5)+
geom_point(aes(x=GMC,y=CV,colour=factor(type)),shape=1,size=3,stroke=2)+
geom_smooth(data = plot_data,mapping = aes(x=GMC,y=CV),
method = "loess",se = F,span=1,formula = y~I(x),size=1.5)+
scale_color_manual(values = c("1"="#2c3c97",
"2"="#0f75b6",
"3"="#4faed3",
"4"="#95d8e7",
"5"="#d6f1f7",
"6"="#faf5be",
"7"="#ffdf89",
"8"="#ffad5b",
"9"="#ff6336",
"10"="#fa2b1d",
"11"="#bf1625"))
#plot_chr
ggplot(data = plot_data, aes(x=GMC2,y=CV)) +
ylim(0,5)+
geom_point(aes(x=GMC2,y=CV,colour=factor(type)),shape=1,size=3,stroke=2)+
geom_smooth(data = plot_data,mapping = aes(x=GMC2,y=CV,colour = factor(type)),
method = "loess",se = F,span=1,formula = y~I(x),size=1.5)+
scale_color_manual(values = c("1"="#2c3c97",
"2"="#0f75b6",
"3"="#4faed3",
"4"="#95d8e7",
"5"="#d6f1f7",
"6"="#faf5be",
"7"="#ffdf89",
"8"="#ffad5b",
"9"="#ff6336",
"10"="#fa2b1d",
"11"="#bf1625"))
后台回复blood1c领取原始数据,整个代码和文件将以project形式发送,如果不清楚如何使用可以查看刚上研一,复现blood。
感谢观看到最后,敬请批评指正

世上的事,认真不对,不认真更不对,执着不对,一切视作空也不对,平平常常,自自然然,如果上山拜佛,见了佛像就磕头,磕了头,佛像还是佛像,你还是你——生活之累就该少下来了。—贾平凹
图文:陈浩然
本文编辑:莫状