Robust method for identification of prognostic gene signa...


Characteristics of the IPP score

In a conventional log-rank test, patients are classified into two or more groups based on their gene expression levels, and the relationship between prognosis and the expression level of an individual gene is analyzed. Both average and median levels of gene expression are used as threshold values to classify patients into groups. To evaluate the prognostic significance of a gene, we designed IPP to consider all possible thresholds for classifying patients into two groups based on gene expression levels (Fig. 1A). The z-scores calculated for all thresholds were displayed as a matrix (IPP matrix), and the average of all z-scores was taken as the IPP score. The sign of the IPP score represents the relationship between prognosis and gene expression level: a negative IPP score indicates an adverse gene, for which a high level of expression is associated with a poor prognosis, and a positive IPP score indicates a favorable gene, for which a high level of expression is associated with a good prognosis. A detailed explanation of the IPP score calculation is provided in the “Overview of the IPP calculation” section within the Materials and Methods (see Eq. (1)).

Figure 1


Calculation of iterative patient partitioning scores. (A) Schematic overview of the iterative patient partitioning (IPP) score calculation. To calculate the IPP score of a gene, patients within the individual datasets are first sorted based on gene expression level (Fig. 1A. i). Next, the patients are iteratively stratified into high and low gene expression level groups while varying the gene expression cutoff thresholds (Fig. 1A. ii), and a z-score for the survival difference between the two groups is calculated for every case, as in the log-rank test (Fig. 1A. iii). The IPP matrix is constructed using the z-scores for individual genes. In each case, the sign of the z-score is negative if patients with high gene expression levels had a higher risk of event occurrence than patients with low gene expression levels; otherwise, the z-score is positive (Fig. 1A. iv). Finally, the IPP score is the average of all z-scores (Fig. 1A. v). In Fig. 1A, the z-score calculation for seven high expression patients and four low expression patients is provided as an example. (B) An example IPP matrix for the luminal subtype: results from the E-TABM-158 dataset with IPP scores of 1.57927 for SERPINE2, 0.0029739 for CD7, and −1.27916 for UBE2Q1.
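The calculation in Fig. 1A can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `logrank_z` and `ipp_score`, the `min_group` parameter, and the choice to skip partitions whose variance is zero are all hypothetical. The z-score is the standard two-group log-rank statistic, with its sign flipped so that excess events in the high-expression group give a negative value. Patients with tied expression values are split in arbitrary rank order here.

```python
import math

def logrank_z(times, events, in_high):
    """Signed two-group log-rank z-score.

    Negative when the high-expression group has more events than expected,
    positive otherwise. Returns None if the variance is zero.
    """
    observed_minus_expected = 0.0
    variance = 0.0
    # Walk through every distinct time at which an event occurred.
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_high = sum(1 for i in at_risk if in_high[i])
        d = sum(1 for i in at_risk if times[i] == t and events[i])
        d_high = sum(1 for i in at_risk
                     if times[i] == t and events[i] and in_high[i])
        observed_minus_expected += d_high - d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return None
    return -observed_minus_expected / math.sqrt(variance)

def ipp_score(expression, times, events, min_group=1):
    """Average signed z-score over all high/low partitions of the patients."""
    # Step i: sort patients by expression level.
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    z_scores = []
    # Step ii: move the cutoff through every rank position.
    for cut in range(min_group, len(order) - min_group + 1):
        high = set(order[cut:])
        in_high = [i in high for i in range(len(order))]
        # Steps iii and iv: signed z-score for this partition.
        z = logrank_z(times, events, in_high)
        if z is not None:
            z_scores.append(z)
    # Step v: the IPP score is the mean of the collected z-scores.
    return sum(z_scores) / len(z_scores) if z_scores else 0.0
```

With this sign convention, a gene whose expression rises as survival time falls receives a negative score, matching the definition of an adverse gene in the text.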

An important characteristic of the IPP score is its ability to identify the prognostic significance of genes that are missed with the conventional log-rank test, which uses average or median values as gene expression thresholds. Based on IPP score differences among genes, we identified prognostic genes that the log-rank test, using average or median threshold values, found to be statistically insignificant (Fig. 1B). Detailed examples illustrating the differences between IPP and the conventional log-rank test are provided in Supplementary Figure 1. These examples show that IPP obtains superior results to the conventional log-rank test in terms of identifying prognostic genes. We calculated IPP scores for 21 breast cancer datasets that included a total of 2,735 patients (Table S1). To investigate IPP score characteristics, we compared the IPP score distributions among the independent breast cancer datasets (Supplementary Fig. 2 and 5A; Table S2). Although the IPP score distributions differed among datasets, the maximum frequency exceeded 1,500 genes in every dataset, suggesting that more than half of the total genes were unrelated to prognosis.

Robustness and consistency of IPP

Robustness and consistency in prognostic gene selection refer to the degree of similarity among the prognostic genes identified from independent datasets. To verify the improvement in robustness and consistency, we compared prognostic genes identified by IPP with those obtained from the conventional log-rank test, which uses the median and average values as gene expression level thresholds. First, we focused on the robustness of individual scores within each dataset by investigating the degree of dependence of each score on the number of patient samples. We performed bootstrapping and subsampling on two datasets, E-MTAB-748 and GSE17907, and compared the scores from all samples to those from randomly selected subsamples. IPP showed smaller differences between scores from all samples and scores from random subsamples than the log-rank test did. In addition, variation among scores was less pronounced with IPP than with the log-rank test (Fig. 2C and Supplementary Fig. 6C,E). We also analyzed the effect of sample size by calculating Pearson correlation coefficients (r) between the scores from all samples and those from random subsamples. IPP showed a higher correlation than the conventional log-rank test in all cases of subsampling (Supplementary Fig. 7A). These results indicate the superiority of IPP in terms of robustness compared with the conventional log-rank test.
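The subsampling comparison described above can be sketched as follows. This is a minimal illustration, assuming subsampling without replacement and a caller-supplied scoring function. The names `pearson_r`, `subsample_correlation`, and `score_genes` are hypothetical, and `score_genes` stands in for whichever per-gene score (IPP score or log-rank z-score) is being evaluated.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def subsample_correlation(score_genes, patients, fraction, n_iter=100, seed=0):
    """Mean Pearson r between full-cohort scores and random-subsample scores.

    score_genes(patients) must return one score per gene, in a fixed gene order.
    """
    rng = random.Random(seed)
    full_scores = score_genes(patients)
    k = max(2, round(fraction * len(patients)))
    correlations = []
    for _ in range(n_iter):
        subset = rng.sample(patients, k)  # draw k patients without replacement
        correlations.append(pearson_r(full_scores, score_genes(subset)))
    return sum(correlations) / len(correlations)
```

A scoring method that depends less on sample size yields a mean correlation closer to 1 at small values of `fraction`.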

Figure 2


IPP identified prognostic genes more robustly than the conventional log-rank test. (A) The number of prognostic genes shared among independent datasets, as identified by IPP and the log-rank test with the average value used as the gene expression level threshold. A gene is considered to be prognostic if its absolute score (IPP score in IPP or z-score in the log-rank test) is within the top 5% of genes, with the same number of genes used for IPP and the log-rank test to ensure a fair comparison of the genes shared among datasets. Prognostic genes shared among five or more datasets are represented by bar graphs. (B) Venn diagram (left) showing the number of genes detected only by IPP, only by the log-rank test, and by both methods. No functional groups were found for genes identified only by the log-rank test in Reactome pathway enrichment analysis (right). (C) IPP scores from subsamples agreed with the scores from all samples more reliably than log-rank test scores did. For each subsample on the x-axis, 1,000 repeated bootstrapping iterations were performed and the averaged values were plotted. In the case of the log-rank test, the z-score was used instead of a p-value. Error bars, mean ± SD. (D) Outcome relation (adverse or favorable) of the genes was determined more consistently by IPP than by the log-rank test. The numbers shown on the bar graphs correspond to the numbers of genes showing identical outcome relation (adverse or favorable) among 17 or more datasets. Cyan and purple colors indicate the results of IPP and the log-rank test, respectively, using the average value as the gene expression threshold. “Log-rank test: Avg” indicates the results obtained using the average value as the threshold in the log-rank test.

Next, we investigated the consistency of IPP and the conventional log-rank test by comparing prognostic genes that were identified in multiple datasets. For both IPP and the log-rank test, we considered the top 5% of genes as prognostic genes, and selected the same number of genes from each dataset. The number of prognostic genes shared among several independent datasets represented the consistency. Prognostic genes shared among at least five datasets were counted (Fig. 2A and Supplementary Fig. 6A). IPP (n = 178) found significantly more shared prognostic genes than the conventional log-rank test using average (n = 125) and median (n = 135) values as gene expression level thresholds. Among the prognostic genes obtained by IPP and the conventional log-rank test, some (overlap between IPP and average: n = 57, median: n = 56) were shared between both methods, but the rest were not (only IPP: n = 121 vs only average: n = 68; only IPP: n = 122 vs only median: n = 79) (Fig. 2B and Supplementary Fig. 6B). We suspected that the 68 and 79 genes identified using the average and median value thresholds, respectively, were likely to be false positives and, therefore, unlikely to represent functional groups. We then performed pathway enrichment analysis using the Reactome Pathway Database and found no functional groups for the genes identified only by the log-rank test (Fig. 2B and Supplementary Fig. 6B). On the other hand, cell-cycle related functional groups, such as “mitotic prometaphase” and “cell cycle”, were major features of prognostic genes identified only by IPP (Fig. 2B). These results suggest that IPP identifies prognostic genes more reliably than the conventional log-rank test.
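The selection and counting steps in this comparison can be sketched as follows. This is a minimal illustration in which each dataset is assumed to be a dictionary mapping gene names to scores. The function names are hypothetical, and rounding the top-5% count up is an assumption (it happens to give 557 for 11,123 genes, the figures reported for the subtype analysis).

```python
import math
from collections import Counter

def top_genes(scores, fraction=0.05):
    """Genes with the largest absolute scores; the count is rounded up."""
    n_top = math.ceil(fraction * len(scores))
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return set(ranked[:n_top])

def shared_prognostic_genes(datasets, min_datasets=5, fraction=0.05):
    """Genes that fall in the top fraction of at least `min_datasets` datasets."""
    counts = Counter()
    for scores in datasets:
        counts.update(top_genes(scores, fraction))
    return {gene for gene, n in counts.items() if n >= min_datasets}
```

Running `shared_prognostic_genes` once on IPP scores and once on log-rank z-scores, with the same `fraction`, gives two gene sets whose sizes and overlap correspond to the quantities compared in Fig. 2A and 2B.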

Lastly, we examined the consistency of the relationship between the expression levels of the identified genes and prognosis. A gene is considered adverse if high-level expression is associated with a poor prognosis, such as a short survival or relapse time. When high expression of a gene is associated with a good prognosis, the gene is considered favorable. We refer to these relationships as outcome relation. Since the outcome relation is a significant feature of prognostic genes, we analyzed the consistency of outcome relation among datasets (Fig. 2D, Supplementary Fig. 6D and Supplementary Fig. 7B). To compare the consistency of outcome relation between IPP and the conventional log-rank test, each gene was assigned two numbers: the number of datasets in which the gene is adverse and the number in which it is favorable (Supplementary Fig. 7B). We then summed the numbers of genes that were adverse or favorable in at least 17 datasets. IPP had a greater number of genes in the adverse (n = 485) and favorable (n = 178) classes than the log-rank tests using the average (adverse, n = 336; favorable, n = 126) and median (adverse, n = 337; favorable, n = 139) values as expression level thresholds (Fig. 2D and Supplementary Fig. 6D). These results show that IPP has higher robustness and consistency than the conventional log-rank test.
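The outcome-relation count can be sketched as follows. This is a minimal illustration that assumes each dataset is a dictionary mapping gene names to signed scores, with negative scores read as adverse and positive scores as favorable. The function names are hypothetical.

```python
def outcome_counts(scores_per_dataset):
    """Map each gene to (number of adverse datasets, number of favorable datasets)."""
    counts = {}
    for scores in scores_per_dataset:
        for gene, score in scores.items():
            adverse, favorable = counts.get(gene, (0, 0))
            if score < 0:
                adverse += 1
            elif score > 0:
                favorable += 1
            counts[gene] = (adverse, favorable)
    return counts

def consistent_genes(scores_per_dataset, min_datasets=17):
    """Genes that are adverse, or favorable, in at least `min_datasets` datasets."""
    counts = outcome_counts(scores_per_dataset)
    adverse = {g for g, (a, _) in counts.items() if a >= min_datasets}
    favorable = {g for g, (_, f) in counts.items() if f >= min_datasets}
    return adverse, favorable
```

The sizes of the two returned sets correspond to the adverse and favorable counts compared between methods in Fig. 2D.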

Molecular subtype-specific breast cancer prognostic genes

It is well known that different subtypes of breast cancer have distinct characteristics. To investigate subtype-specific prognostic genes, we calculated IPP scores for a total of 16 breast cancer datasets by reference to three distinct molecular subtypes: luminal (ER-positive/PR-positive; Supplementary Fig. 9A), HER2-enriched (ER-negative/PR-negative/HER2-positive; Supplementary Fig. 9C), and triple-negative (ER-negative/PR-negative/HER2-negative; Supplementary Fig. 9B) based on ER, PR, and HER2/ERBB2 immunohistochemistry information. Datasets with fewer than 20 patient samples for the three molecular subtypes were excluded from the analysis. To account for the number of patients in each dataset, we calculated a representative IPP score for each gene using Liptak’s weighted method (Supplementary Fig. 9D). Subtype-specific prognostic genes represented the top 5% of all genes (n = 557 genes among a total of 11,123 genes) based on the absolute values of the representative, integrated IPP scores. Next, we performed Reactome pathway enrichment to investigate the characteristics of subtype-specific prognostic genes.
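Liptak’s weighted method combines per-dataset scores as the weighted sum of the scores divided by the square root of the sum of squared weights. The sketch below assumes the weight of each dataset is the square root of its patient count, a common choice that the text does not confirm, and it treats the per-dataset IPP scores as z-like inputs. The function name is hypothetical.

```python
import math

def liptak_combined_score(scores, n_patients):
    """Weighted combination: sum(w_i * z_i) / sqrt(sum(w_i ** 2)).

    Assumption: w_i = sqrt(number of patients in dataset i).
    """
    weights = [math.sqrt(n) for n in n_patients]
    numerator = sum(w * s for w, s in zip(weights, scores))
    return numerator / math.sqrt(sum(w * w for w in weights))
```

Under this weighting, a dataset with four times as many patients contributes twice the weight, so larger cohorts dominate the representative score without smaller ones being ignored.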

In luminal (ER-positive/PR-positive) breast cancer, the proportions of adverse and favorable genes were 83.7% (n = 466) and 16.3% (n = 91), respectively, out of a total of 557 prognostic genes (Fig. 3a, upper). Based on enrichment analysis, only 54.6% (n = 304) of adverse genes and 11.5% (n = 64) of favorable genes were included within the functional groups. Enrichment analysis showed that adverse genes were composed of cell cycle-related genes, such as those involved in “mitotic prometaphase”, “mitotic metaphase and anaphase” and “cell cycle checkpoints” (Fig. 3a, lower right). In addition, 50.2% of the adverse prognostic genes (n = 234) had at least one PPI with each other, and thus serve to organize PPI networks (Supplementary Fig. 11, middle). Among the favorable genes, we observed several functional groups, such as “phenylalanine metabolism” and “p53 signaling pathway” (Fig. 3a, lower left panel).
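Pathway enrichment of a gene list is commonly assessed with an over-representation test. The sketch below uses a one-sided hypergeometric test as a generic example. It is not necessarily the statistic applied by the Reactome tool the authors used, and the function and parameter names are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement.

    n_universe: total number of genes considered
    n_pathway:  genes in the universe annotated to the pathway
    n_selected: size of the prognostic gene list
    n_overlap:  prognostic genes annotated to the pathway
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

In practice the resulting p-values are corrected for multiple testing across pathways, which is why Fig. 3 reports a false discovery rate.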

Figure 3


Subtype-specific functional groups of prognostic genes. (a) Luminal (ER-positive/PR-positive) specific prognostic genes and functional groups. In total, 83.7% and 16.3% of prognostic genes (n = 557) were adverse and favorable, respectively; 54.6% and 11.5% of the adverse and favorable prognostic genes, respectively, were assigned to functional groups through Reactome pathway enrichment analysis (upper panel). (b) Triple-negative (ER-negative/PR-negative/HER2-negative)-specific prognostic genes and functional groups: 65.2% and 34.8% of the prognostic genes (n = 557) were adverse and favorable, respectively. In total, 44.2% and 24.4% of the adverse and favorable prognostic genes, respectively, were assigned to Reactome functional groups (upper panel). Representative functional groups of adverse (red) and favorable (blue) prognostic genes (lower panel). The colors of the bars in the figure indicate the outcome relation (adverse or favorable), and the enriched and non-enriched portions of the prognostic genes. X-axis: common logarithm of the false discovery rate (FDR).

The results for HER2-enriched (ER-negative/PR-negative/HER2-positive) and triple-negative (ER-negative/PR-negative/HER2-negative) breast cancers differed from those for luminal (ER-positive/PR-positive) breast cancer. In HER2-enriched breast cancer, 37.7% (n = 210) of the 62.7% (n = 349) of prognostic genes that were adverse were related to extracellular matrix functions, such as “Extracellular matrix organization” and “Integrin signaling pathway” (Supplementary Fig. 10A, lower right). Fewer adverse prognostic genes (30.1%, n = 105) organized PPI networks in HER2-enriched cancer than in luminal breast cancer. In contrast, a greater number of favorable prognostic genes (21.2%, n = 44) organized PPI networks in HER2-enriched cancer than in luminal breast cancer (Supplementary Fig. 12). Despite lower significance, the 25.6% (n = 143) of prognostic genes that were favorable for HER2-enriched breast cancers were mainly involved in “Keratinization”, “Wnt signaling pathway” and “Vitamin D metabolism” (Supplementary Fig. 10A, lower left panel).

In triple-negative breast cancer, 65.2% (n = 363) and 34.8% (n = 194) of prognostic genes were adverse and favorable, respectively (Fig. 3b and Supplementary Fig. 13). Among these, 44.2% (n = 246) and 24.4% (n = 136) were included in the Reactome pathway enrichment functional groups (Fig. 3b, upper panel). Unlike in luminal and HER2-enriched breast cancers, pathways related to hypoxia-inducible factor (HIF), such as “HIF-1 alpha transcription factor network” and “HIF-2 alpha transcription factor network”, were major functional features of adverse genes in triple-negative breast cancer. “VEGF & VEGFR signaling network”, “Focal adhesion”, and “Nicotine addiction” were also important functional features of adverse prognostic genes for triple-negative breast cancer (Fig. 3b, lower right panel). On the other hand, favorable prognostic genes for triple-negative breast cancer had immune-related functional features, such as “Interferon alpha/beta signaling”, “Immunoregulatory interactions between a lymphoid & a non-lymphoid cell”, and “Th1 & Th2 cell differentiation” (Fig. 3b, lower left panel). Fewer than 20 adverse and fewer than 10 favorable prognostic genes were shared among luminal, HER2-enriched, and triple-negative breast cancer (data not shown), suggesting remarkably distinct functional characteristics of adverse and favorable prognostic genes among the three molecular subtypes of breast cancer.

Novel prognostic genes identified by IPP

The IPP method led to the identification of prognostic genes that had unknown functions and were therefore not assigned during enrichment analysis. We hypothesized that these genes were novel prognostic genes; therefore, we performed detailed analyses of copy number alteration (CNA) and mutations of these genes using The Cancer Genome Atlas (TCGA) data provided by cBioPortal.

Among all prognostic genes (n = 557), about 30% were not assigned to any functional group by Reactome pathway enrichment analysis of luminal (33.9%: adverse, 29.1%; favorable, 4.8%, Fig. 4A, upper panel), HER2-enriched (36.7%: adverse, 25.0%; favorable, 11.7%, Supplementary Fig. 10B, upper panel), and triple-negative (31.4%: adverse, 21.0%; favorable, 10.4%, Fig. 4B, upper panel) breast cancers. Among the unassigned genes, the CNA states of METTL17 showed a marked association with prognosis (Fig. 4A, lower right panel). Because METTL17 was an adverse prognostic gene according to IPP, higher expression of METTL17 indicates a poorer prognosis. Therefore, it is possible that patient groups with amplification and deletion of this gene have lower and higher survival probability, respectively, than the neutral group. The other three representative genes, CADPS2 (Fig. 4A, lower left panel), PARP8, and TULP2 (Fig. 4B, lower panel), showed results consistent with those obtained by IPP. Despite few patient samples showing mutations, and a lack of information on breast cancer subtypes, we found that prognostic genes identified by IPP, such as DONSON, MKI67, FAM171A1, TENM4, C16orf45, and RABEP2, showed statistically significant differences in survival probability between mutation and non-mutation groups (Supplementary Fig. 14). These data demonstrate that IPP has the power to identify novel prognostic genes.

Figure 4


Prognostic significance of copy number alteration (CNA) in functionally unknown prognostic genes identified by IPP. (A) Among all prognostic genes (n = 557) for luminal breast cancer, 29.1% and 4.8% of the adverse and favorable prognostic genes, respectively, were not assigned to any functional group by Reactome pathway enrichment analysis (upper panel). Kaplan-Meier curves show the difference in disease-free survival (DFS; days) according to the CNA states (amplification, neutral, or deletion) of CADPS2 and METTL17, favorable and adverse prognostic genes, respectively, which were not assigned to any functional group by Reactome pathway enrichment analysis (lower panel). (B) Among all prognostic genes (n = 557) for triple-negative breast cancer, 21.0% and 10.4% of adverse and favorable prognostic genes, respectively, were not assigned to any functional group by Reactome pathway enrichment analysis (upper panel). Kaplan-Meier curves show the differences in overall survival (OS; days) according to the CNA state (amplification, neutral, or deletion) of PARP8 and TULP2, favorable and adverse prognostic genes, respectively, which were not assigned to any functional group by Reactome pathway enrichment analysis (lower panel). The colors of the Kaplan-Meier curves indicate the CNA states (purple: amplification, black: neutral, green: deletion). Each n on the Kaplan-Meier curves indicates the number of patients included in that CNA state.
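Kaplan-Meier curves such as those in Fig. 4 estimate survival probability separately for each CNA state. The sketch below implements the product-limit estimator in plain Python. It is a minimal illustration with hypothetical function names, and it assumes that `events` marks observed events with 1 (or True) and censored patients with 0 (or False).

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    curve = []
    survival = 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        # Product-limit step: multiply by the fraction surviving time t.
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

def km_by_group(times, events, groups):
    """Kaplan-Meier curve for each group label, e.g. a CNA state."""
    curves = {}
    for label in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == label]
        curves[label] = kaplan_meier([times[i] for i in idx],
                                     [events[i] for i in idx])
    return curves
```

Passing labels such as "amplification", "neutral", and "deletion" as `groups` gives one curve per CNA state, which can then be compared with a log-rank test.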





