Robust method for identification of prognostic gene signa... (十八像朵花's blog, 程序员宅基地)


Characteristics of the IPP score

In a conventional log-rank test, patients are classified into two or more groups based on their gene expression levels, and the relationship between prognosis and the expression level of an individual gene is analyzed. Average and median levels of gene expression are both used as threshold values to classify patients into groups. To evaluate the prognostic significance of a gene, we designed IPP to consider all of the possible thresholds for classifying patients into two groups based on gene expression levels (Fig. 1A). The calculated z-scores of all thresholds were displayed as a matrix (IPP matrix), and the average of all z-scores was the IPP score. The sign of the IPP score represents the relationship between prognosis and gene expression level. Negative and positive IPP scores indicate adverse and favorable genes, respectively, meaning that a high level of expression is associated with a poor and good prognosis, respectively. A detailed explanation of the IPP score calculation is provided in the “Overview of the IPP calculation” section within the Materials and Methods (see Eq. (1)).

Figure 1


Calculation of iterative patient partitioning scores. (A) Schematic overview of the iterative patient partitioning (IPP) score calculation. To calculate the IPP score of a gene, patients within the individual datasets are first sorted based on gene expression level (Fig. 1A. i). Next, the patients are iteratively stratified into high and low gene expression level groups while varying the gene expression cutoff thresholds (Fig. 1A. ii), and a z-score for the survival difference between the two groups is calculated for every case, as in the log-rank test (Fig. 1A. iii). The IPP matrix is constructed using the z-scores for individual genes. In each case, the sign of the z-score is negative if patients with high gene expression levels had a higher risk of event occurrence than patients with low gene expression levels; otherwise, the z-score is positive (Fig. 1A. iv). Finally, the IPP score is the average of all z-scores (Fig. 1A. v). In Fig. 1A, the z-score calculation for seven high expression patients and four low expression patients is provided as an example. (B) An example IPP matrix for the luminal subtype: results from the E-TABM-158 dataset with IPP scores of 1.57927 for SERPINE2, 0.0029739 for CD7, and −1.27916 for UBE2Q1.
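The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `logrank_z` and `ipp_score` are hypothetical, every split with at least one patient on each side is used, splits with zero variance are skipped, and ties in expression are broken by sort order. The z-score is the standard two-group log-rank statistic, with the sign chosen so that it is negative when the high-expression group has more events than expected.

```python
import math

def logrank_z(times, events, in_high):
    """Signed log-rank z-score for a high/low split.

    Negative when the high-expression group has more events than expected.
    Returns None when the variance is zero (the split is uninformative).
    """
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = [i for i, ti in enumerate(times) if ti >= t]
        n = len(at_risk)
        n_high = sum(1 for i in at_risk if in_high[i])
        d = sum(1 for i in at_risk if times[i] == t and events[i])
        d_high = sum(1 for i in at_risk
                     if times[i] == t and events[i] and in_high[i])
        observed += d_high
        expected += d * n_high / n
        if n > 1:
            variance += d * (n_high / n) * (1 - n_high / n) * (n - d) / (n - 1)
    if variance == 0:
        return None
    return (expected - observed) / math.sqrt(variance)

def ipp_score(expression, times, events):
    """Average the signed z-scores over every possible expression threshold."""
    order = sorted(range(len(expression)), key=lambda i: expression[i])
    z_scores = []
    for k in range(1, len(order)):          # k patients fall in the low group
        high = set(order[k:])
        in_high = [i in high for i in range(len(expression))]
        z = logrank_z(times, events, in_high)
        if z is not None:
            z_scores.append(z)
    return sum(z_scores) / len(z_scores) if z_scores else 0.0

# Toy data (made up): higher expression goes with shorter survival.
expression = [1, 2, 3, 4, 5, 6]
times = [60, 50, 40, 30, 20, 10]
events = [1, 1, 1, 1, 1, 1]                 # 1 = event observed, 0 = censored
print(ipp_score(expression, times, events))  # negative, i.e. an adverse gene
```

The list of z-scores collected inside `ipp_score` corresponds to one row of the IPP matrix for a single gene and dataset.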

An important characteristic of the IPP score is its ability to identify the prognostic significance of genes that are missed by the conventional log-rank test, which uses average or median values as gene expression thresholds. Based on IPP score differences among genes, we identified prognostic genes that the log-rank test, using average or median threshold values, had found to be statistically insignificant (Fig. 1B). Detailed examples illustrating the differences between IPP and the conventional log-rank test are provided in Supplementary Figure 1. These examples show that IPP obtains superior results to the conventional log-rank test in terms of identifying prognostic genes. We calculated IPP scores for 21 breast cancer datasets that included a total of 2,735 patients (Table S1). To investigate IPP score characteristics, we compared the IPP score distributions among the independent breast cancer datasets (Supplementary Fig. 2 and 5A; Table S2). Although the distributions of IPP scores differed among datasets, all datasets had a maximum frequency of over 1,500 genes, suggesting that more than half of the total genes were unrelated to prognosis.

Robustness and consistency of IPP

Robustness and consistency in prognostic gene selection refer to the degree of similarity among the prognostic genes identified from independent datasets. To verify the improvement in robustness and consistency, we compared prognostic genes identified by IPP with those obtained from the conventional log-rank test, which uses the median and average values as gene expression level thresholds. First, we focused on the robustness of individual scores within each dataset by investigating the degree of dependence of each score on the number of patient samples. We performed bootstrapping and subsampling of two datasets, E-MTAB-748 and GSE17907, and compared the scores from all samples to those from randomly selected subsamples. IPP showed smaller differences between scores from all samples and scores from random subsamples than the log-rank test did. In addition, variation among scores was less pronounced with IPP than with the log-rank test (Fig. 2C and Supplementary Fig. 6C,E). We also analyzed the effect of sample size by calculating Pearson correlation coefficients (r) between the scores from all samples and those from random subsamples. IPP showed a higher correlation than the conventional log-rank test in all cases of subsampling (Supplementary Fig. 7A). These results indicate the superiority of IPP in terms of robustness compared with the conventional log-rank test.
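The subsampling comparison can be illustrated with a short Python sketch. It is only an outline under stated assumptions: `score_fn` stands for any per-gene scoring function (an IPP score or a log-rank z-score), patients are drawn without replacement, and the names `pearson_r` and `subsample_correlation`, the iteration count, and the seed are hypothetical choices rather than details taken from the paper.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def subsample_correlation(score_fn, genes, n_patients, fraction,
                          n_iter=100, seed=0):
    """Mean Pearson r between full-sample scores and subsample scores.

    score_fn(gene, patient_indices) must return one score for that gene
    computed on the given patients only.
    """
    rng = random.Random(seed)
    all_idx = list(range(n_patients))
    full = [score_fn(g, all_idx) for g in genes]
    size = max(2, round(fraction * n_patients))
    correlations = []
    for _ in range(n_iter):
        idx = rng.sample(all_idx, size)      # subsample without replacement
        sub = [score_fn(g, idx) for g in genes]
        correlations.append(pearson_r(full, sub))
    return sum(correlations) / len(correlations)
```

Running the same routine once with an IPP scoring function and once with a log-rank scoring function, over several values of `fraction`, gives the kind of comparison summarized in the text.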

Figure 2


IPP identified prognostic genes more robustly than the conventional log-rank test. (A) The number of prognostic genes shared among independent datasets, as identified by IPP and the log-rank test with the average value used as the gene expression level threshold. A gene is considered to be prognostic if its absolute score (IPP score in IPP or z-score in the log-rank test) is within the top 5% of genes, with the same number of genes used for IPP and the log-rank test to ensure a fair comparison of the genes shared among datasets. Prognostic genes shared among five or more datasets are represented by bar graphs. (B) Venn diagram (left) showing the number of genes detected only by IPP, only by the log-rank test, and by both methods. No functional groups were found for genes identified only by the log-rank test in Reactome pathway enrichment analysis (right). (C) IPP scores were more reliable than log-rank test scores when the scores from all samples were compared with those from subsamples. For each subsample on the x-axis, 1,000 repeated bootstrapping iterations were performed and the averaged values were plotted. In the case of the log-rank test, the z-score was used instead of a p-value. Error bars, mean ± SD. (D) Outcome relation (adverse or favorable) of the genes was determined more consistently by IPP than by the log-rank test. The numbers shown on the bar graphs correspond to the numbers of genes showing identical outcome relation (adverse or favorable) among 17 or more datasets. Cyan and purple colors indicate the results of IPP and the log-rank test, respectively, using the average value as the gene expression threshold. “Log-rank test: Avg” indicates the results obtained using the average value as the threshold in the log-rank test.

Next, we investigated the consistency of IPP and the conventional log-rank test by comparing prognostic genes that were identified in multiple datasets. For both IPP and the log-rank test, we considered the top 5% of genes as prognostic genes, and selected the same number of genes from each dataset. Consistency was measured as the number of prognostic genes shared among several independent datasets. Prognostic genes shared among at least five datasets were counted (Fig. 2A and Supplementary Fig. 6A). IPP (n = 178) found significantly more shared prognostic genes than the conventional log-rank test using average (n = 125) and median (n = 135) values as gene expression level thresholds. Among the prognostic genes obtained by IPP and the conventional log-rank test, some (overlap between IPP and average: n = 57, median: n = 56) were shared between both methods, but the rest were not (only IPP: n = 121 vs only average: n = 68, only IPP: n = 122 vs only median: n = 79) (Fig. 2B and Supplementary Fig. 6B). We suspected that the 68 and 79 genes identified only with the average and median value thresholds, respectively, were likely to be false positives and, therefore, unlikely to represent functional groups. We then performed pathway enrichment analysis using the Reactome Pathway Database. No functional groups were found for the genes identified only by the log-rank test (Fig. 2B and Supplementary Fig. 6B). On the other hand, cell-cycle related functional groups, such as “mitotic prometaphase” and “cell cycle”, were major features of prognostic genes identified only by IPP (Fig. 2B). These results suggest that IPP identifies prognostic genes more reliably than the conventional log-rank test.
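The selection and counting steps described above can be sketched as follows. This is an illustrative outline, not the authors' code: the function names are hypothetical, each dataset is assumed to be a dictionary mapping gene names to scores, and rounding the top-5% count up is an assumption (it is at least consistent with the 557 of 11,123 genes reported later in the text).

```python
import math
from collections import Counter

def top_fraction(scores, fraction=0.05):
    """Genes whose absolute score ranks within the top `fraction` of a dataset."""
    n_top = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
    return set(ranked[:n_top])

def shared_prognostic_genes(datasets, fraction=0.05, min_datasets=5):
    """Genes that are prognostic (top fraction) in at least `min_datasets` datasets."""
    counts = Counter()
    for scores in datasets:
        counts.update(top_fraction(scores, fraction))
    return {gene for gene, c in counts.items() if c >= min_datasets}
```

Calling `shared_prognostic_genes` once with IPP scores and once with log-rank z-scores, then taking set intersections and differences of the two results, yields the kind of overlap counts shown in the Venn diagram.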

Lastly, we examined the consistency of the relationship between the expression levels of the identified genes and prognosis. A gene is considered adverse if high-level expression is associated with a poor prognosis, such as a short survival or relapse time. When high expression of a gene is associated with a good prognosis, the gene is considered favorable. We refer to these relationships as outcome relation. Since the outcome relation is a significant feature of prognostic genes, we analyzed the consistency of outcome relation among datasets (Fig. 2D, Supplementary Fig. 6D and Supplementary Fig. 7B). To compare the consistency of outcome relation between IPP and the conventional log-rank test, each gene was assigned two numbers: the number of datasets in which the gene is adverse and the number in which it is favorable (Supplementary Fig. 7B). We then counted the genes that were adverse or favorable in at least 17 datasets. IPP had a greater number of genes in the adverse (n = 485) and favorable (n = 178) classes than the log-rank tests using the average (adverse, n = 336; favorable, n = 126) and median (adverse, n = 337; favorable, n = 139) values as expression level thresholds (Fig. 2D and Supplementary Fig. 6D). These results show that IPP is more robust and consistent than the conventional log-rank test.
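As a rough illustration of this counting, the sketch below tallies, for each gene, the datasets in which its score is negative (adverse) or positive (favorable) and keeps the genes that reach the threshold. The function name is hypothetical, and treating a missing or zero score as neither adverse nor favorable is an assumption.

```python
def consistent_outcome_genes(datasets, min_datasets=17):
    """Return (adverse, favorable) gene sets with a consistent outcome relation.

    Each dataset is a dict mapping gene -> signed score; a negative score
    means adverse and a positive score means favorable.
    """
    adverse, favorable = set(), set()
    genes = set().union(*datasets)
    for gene in genes:
        n_adverse = sum(1 for scores in datasets if scores.get(gene, 0) < 0)
        n_favorable = sum(1 for scores in datasets if scores.get(gene, 0) > 0)
        if n_adverse >= min_datasets:
            adverse.add(gene)
        if n_favorable >= min_datasets:
            favorable.add(gene)
    return adverse, favorable
```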

Molecular subtype-specific breast cancer prognostic genes

It is well known that different subtypes of breast cancer have distinct characteristics. To investigate subtype-specific prognostic genes, we calculated IPP scores for a total of 16 breast cancer datasets by reference to three distinct molecular subtypes: luminal (ER-positive/PR-positive; Supplementary Fig. 9A), HER2-enriched (ER-negative/PR-negative/HER2-positive; Supplementary Fig. 9C), and triple-negative (ER-negative/PR-negative/HER2-negative; Supplementary Fig. 9B), based on ER, PR, and HER2/ERBB2 immunohistochemistry information. Datasets with fewer than 20 patient samples for the three molecular subtypes were excluded from the analysis. To ensure that the numbers of patients in all datasets were taken into account, we calculated a representative IPP score for each gene using Liptak’s weighted method (Supplementary Fig. 9D). Subtype-specific prognostic genes represented the top 5% of all genes (n = 557 genes among a total of 11,123 genes) based on the absolute values of the representative, integrated IPP scores. Next, we performed Reactome pathway enrichment to investigate the characteristics of subtype-specific prognostic genes.
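For readers unfamiliar with Liptak's weighted method, the sketch below shows the usual form of the combination, Z = Σ wᵢzᵢ / √(Σ wᵢ²). The choice of weights here (the square root of each dataset's patient count) is a common convention and an assumption on our part; the text does not state which weights were used, and the function name is hypothetical.

```python
import math

def liptak_weighted_z(scores, sample_sizes):
    """Combine per-dataset scores for one gene with Liptak's weighted Z-method.

    Assumes weights w_i = sqrt(n_i), where n_i is the number of patients
    in dataset i.
    """
    weights = [math.sqrt(n) for n in sample_sizes]
    numerator = sum(w * z for w, z in zip(weights, scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator

# Toy example (made-up numbers): one gene scored in three datasets.
representative = liptak_weighted_z([-1.2, -0.8, -1.5], [120, 45, 300])
```

With this weighting, larger datasets pull the representative score toward their own value, while the sign of the combined score still indicates whether the gene is adverse or favorable overall.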

In luminal (ER-positive/PR-positive) breast cancer, the proportion of adverse and favorable genes was 83.7% (n = 466) and 16.3% (n = 91), respectively, out of a total of 557 prognostic genes (Fig. 3a, upper panel). Based on enrichment analysis, only 54.6% (n = 304) of adverse genes and 11.5% (n = 64) of favorable genes were included within the functional groups. Enrichment analysis showed that adverse genes were composed of cell cycle-related genes, such as those involved in “mitotic prometaphase”, “mitotic metaphase and anaphase” and “cell cycle checkpoints” (Fig. 3a, lower right panel). In addition, 50.2% of the adverse prognostic genes (n = 234) had at least one protein-protein interaction (PPI) with another adverse prognostic gene, and thus serve to organize PPI networks (Supplementary Fig. 11, middle). Among the favorable genes, we observed several functional groups, such as “phenylalanine metabolism” and “p53 signaling pathway” (Fig. 3a, lower left panel).

Figure 3


Subtype-specific functional groups of prognostic genes. (a) Luminal (ER-positive/PR-positive) specific prognostic genes and functional groups. In total, 83.7% and 16.3% of prognostic genes (n = 557) were adverse and favorable, respectively; 54.6% and 11.5% of the adverse and favorable prognostic genes, respectively, were assigned to functional groups through Reactome pathway enrichment analysis (upper panel). (b) Triple-negative (ER-negative/PR-negative/HER2-negative)-specific prognostic genes and functional groups: 65.2% and 34.8% of the prognostic genes (n = 557) were adverse and favorable, respectively. In total, 44.2% and 24.4% of the adverse and favorable prognostic genes, respectively, were assigned to Reactome functional groups (upper panel). Representative functional groups of adverse (red) and favorable (blue) prognostic genes (lower panel). The colors of the bars in the figure indicate the outcome relation (adverse or favorable), and the enriched and non-enriched portions of the prognostic genes. X-axis: common logarithm of the false discovery rate (FDR).

The results for HER2-enriched (ER-negative/PR-negative/HER2-positive) and triple-negative (ER-negative/PR-negative/HER2-negative) breast cancers differed from those for luminal (ER-positive/PR-positive) breast cancer. In HER2-enriched breast cancer, 37.7% (n = 210) of the 62.7% (n = 349) of prognostic genes that were adverse were related to extracellular matrix functions, such as “Extracellular matrix organization” and “Integrin signaling pathway” (Supplementary Fig. 10A, lower right panel). Fewer adverse prognostic genes (30.1%, n = 105) organized PPI networks in HER2-enriched cancer than in luminal breast cancer. In contrast, a greater number of favorable prognostic genes (21.2%, n = 44) organized PPI networks in HER2-enriched cancer than in luminal breast cancer (Supplementary Fig. 12). Despite lower significance, the 25.6% (n = 143) of prognostic genes that were favorable for HER2-enriched breast cancers were mainly involved in “Keratinization”, “Wnt signaling pathway” and “Vitamin D metabolism” (Supplementary Fig. 10A, lower left panel).

In triple-negative breast cancer, 65.2% (n = 363) and 34.8% (n = 194) of prognostic genes were adverse and favorable, respectively (Fig. 3b and Supplementary Fig. 13). Among these, 44.2% (n = 246) and 24.4% (n = 136) were included in the Reactome pathway enrichment functional groups (Fig. 3b, upper panel). Unlike luminal and HER2-enriched breast cancers, pathways related to hypoxia-inducible factor (HIF), such as “HIF-1 alpha transcription factor network” and “HIF-2 alpha transcription factor network”, were major functional features of adverse genes in triple-negative breast cancer. “VEGF & VEGFR signaling network”, “Focal adhesion”, and “Nicotine addiction” were also important functional features of adverse prognostic genes for triple-negative breast cancer (Fig. 3b, lower right panel). On the other hand, favorable prognostic genes for triple-negative breast cancer had immune-related functional features, such as “Interferon alpha/beta signaling”, “Immunoregulatory interactions between a lymphoid & a non-lymphoid cell”, and “Th1 & Th2 cell differentiation” (Fig. 3b, lower left panel). Fewer than 20 and 10 genes were shared in the adverse and favorable prognostic gene classes, respectively, among luminal, HER2-enriched, and triple-negative breast cancer (data not shown), suggesting remarkably distinct functional characteristics of adverse and favorable prognostic genes among the three molecular subtypes of breast cancer.

Novel prognostic genes identified by IPP

The IPP method led to the identification of prognostic genes that had unknown functions and were therefore not assigned during enrichment analysis. We hypothesized that these genes were novel prognostic genes; therefore, we performed detailed analyses of copy number alteration (CNA) and mutations of these genes using The Cancer Genome Atlas (TCGA) data provided by cBioPortal.

Among all prognostic genes (n = 557), about 30% were not assigned to any functional group by Reactome pathway enrichment analysis of luminal (33.9%: adverse, 29.1%; favorable, 4.8%, Fig. 4A, upper panel), HER2-enriched (36.7%: adverse, 25.0%; favorable, 11.7%, Supplementary Fig. 10B, upper panel), and triple-negative (31.4%: adverse, 21.0%; favorable, 10.4%, Fig. 4B, upper panel) breast cancers. Among the unassigned genes, the CNA states of METTL17 showed a marked association with prognosis (Fig. 4A, lower right panel). Because METTL17 was an adverse prognostic gene according to IPP, higher gene expression of METTL17 indicates a poorer prognosis. Therefore, it is possible that patient groups with amplification and deletion of this gene have lower and higher survival probability than the neutral group, respectively. The other three representative genes, CADPS2 (Fig. 4A, lower left panel), PARP8, and TULP2 (Fig. 4B, lower panel), showed results consistent with those obtained by IPP. Despite few patient samples showing mutations, and a lack of information on breast cancer subtypes, we found that prognostic genes identified by IPP, such as DONSON, MKI67, FAM171A1, TENM4, C16orf45, and RABEP2, showed statistically significant differences in survival probability between mutation and non-mutation groups (Supplementary Fig. 14). These data demonstrate that IPP has the power to identify novel prognostic genes.

Figure 4


Prognostic significance of copy number alteration (CNA) in functionally unknown prognostic genes identified by IPP. (A) Among all prognostic genes (n = 557) for luminal breast cancer, 29.1% and 4.8% of the adverse and favorable prognostic genes, respectively, were not assigned to any functional group by Reactome pathway enrichment analysis (upper panel). Kaplan-Meier curves show the difference in disease-free survival (DFS; days) according to the CNA states (amplification, neutral, or deletion) of CADPS2 and METTL17, favorable and adverse prognostic genes, respectively, which were not assigned to any functional group by Reactome pathway enrichment analysis (lower panel). (B) Among all prognostic genes (n = 557) for triple-negative breast cancer, 21.0% and 10.4% of adverse and favorable prognostic genes, respectively, were not assigned to any functional group by Reactome pathway enrichment analysis (upper panel). Kaplan-Meier curves show the differences in overall survival (OS; days) according to the CNA state (amplification, neutral, or deletion) of PARP8 and TULP2, favorable and adverse prognostic genes, respectively, which were not assigned to any functional group by Reactome pathway enrichment analysis (lower panel). The colors of the Kaplan-Meier curves indicate the CNA states (purple: amplification, black: neutral, green: deletion). Each n on the Kaplan-Meier curves indicates the number of patients included in that CNA state.

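Copyright notice: This is an original article by the blog author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.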

