NIA Array Analysis Tool



Glossary

ANOVA
is ANalysis Of VAriance, a statistical technique for testing the statistical significance of differences between group means. The major advantage of ANOVA over a simple t-test is that variances are averaged over all factor levels, so the statistics become more stable. In ANOVA we calculate the F-statistic, which is then used to estimate the P-value and determine whether the variation between means is significant. Testing multiple hypotheses with ANOVA (as in the case of microarray data) requires some modifications: variance averaging and FDR.
Array type
is a file that lists the probes (or clones) in the microarray together with their annotations. The file is a tab-delimited text file with headers in the first row. The following three columns are required. The first column is the probe ID (oligo ID or cDNA clone ID), which should match the gene ID in the data file that you analyze. The gene ID can be either a number or text. If possible, select a gene ID that can be referenced on the web (e.g., GenBank accession#). The second column is the gene symbol. If several genes match the probe, put a comma and a space between gene symbols. If there is no space after the comma, then table columns in the output will be too wide. The third column is the gene annotation. The file may have additional columns if necessary (e.g., GenBank accession number, Unigene, LocusLink, MGI, etc.). These columns should have headers so that they are displayed in all tables. You can use HTML hyperlinks in additional columns (recommended). Do not use hyperlinks in the first column!
Bayesian error model
was proposed by Baldi and Long (2001. Bioinformatics 17: 509-519). The mean posterior estimate of error variance was shown to be the weighted average of the actual and averaged error variance: w1*var+w2*aver_var, where the weights depend on the desirable degrees of freedom (ddf): w1=df/ddf, w2=1-df/ddf, and df is the actual number of degrees of freedom. The greater the ddf, the closer the mean posterior estimate is to the averaged error variance. See also error models.
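The weighting described above can be written out as a minimal sketch. The function name bayesian_error_variance and the check that ddf is at least df are assumptions made for this illustration; they are not taken from the software itself.

```python
def bayesian_error_variance(var, aver_var, df, ddf):
    """Weighted average of the actual and averaged error variance.

    var      -- actual error variance of the gene
    aver_var -- averaged error variance for genes with similar intensity
    df       -- actual number of degrees of freedom
    ddf      -- desirable degrees of freedom (assumed here to be >= df)
    """
    if df <= 0 or ddf < df:
        raise ValueError("this sketch assumes 0 < df <= ddf")
    w1 = df / ddf      # weight of the actual error variance
    w2 = 1 - w1        # weight of the averaged error variance
    return w1 * var + w2 * aver_var


# With df=4 and ddf=10, the actual variance gets weight 0.4
# and the averaged variance gets weight 0.6.
print(bayesian_error_variance(var=0.09, aver_var=0.04, df=4, ddf=10))
```

Increasing ddf while keeping df fixed moves the result toward aver_var, which matches the statement about large ddf.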
Biplot
was proposed by Gabriel (1971. Biometrika 58: 453-467). This is a method for plotting together rows and columns of the data matrix, which can be used for examining associations between genes (rows) and tissues/experiments (columns). The technique is based on the Singular Value Decomposition (SVD) method.
Web references:
SVD and PCA for microarrays
Biplot and SVD
Clustering
Numerous methods exist for clustering genes based on their expression patterns. TIGR MEV software includes a variety of methods (hierarchical clustering, k-means clustering, SOM, etc.). Our software generates a Stanford-formatted output file that can be used with MEV software. In this software we implemented 2 methods of gene clustering that are not available in MEV. The first is finding genes that are specific for a particular cluster of tissues, and the second is clustering based on PCA.
Method 1: First we do hierarchical clustering of tissues; then we select genes that were significantly more expressed in all tissues in this cluster compared to all tissues outside of the cluster.
Method 2: First we do PCA; then we estimate the regression of eigenvectors versus the log-expression of each gene; then we estimate the logratio of gene change based on the regression line (see figure).

Then two clusters of genes are identified with each principal component (PC): those that are positively correlated with the PC and have a logratio above a given threshold (e.g., log(5)), and those that are negatively correlated with the PC and have a logratio below the negative threshold (e.g., -log(5)).
Cross-channel correction
is an option for pre-processing 2-color arrays with a common reference. Theoretically, red and green intensities should be independent; however, we often observed that the intensity in the red channel (reference sample) increased with increasing intensity in the green channel. Adjustment is done on a gene-by-gene basis if the average log-intensity of the reference is lower than the average log-intensity in the other channel by at least log(5) (i.e., at least a 5-fold difference), and if the cross-channel correlation is >0.7. In this case, the reference intensity is set to its average value for this gene plus the adjustment to the average reference intensity for all genes in each array.
Cutoffs
Cutoffs are used for data filtering and adjustment. If a data value is less than the minimum cutoff, then it is replaced by the minimum cutoff value. This adjustment may artificially lower the error variance for low-expressed genes. To avoid this effect, the software adjusts the averaged error variance for genes with average intensity within 2*SD from the minimum cutoff, by not letting it decrease as the average intensity decreases. The maximum cutoff simply ignores genes with the average intensity exceeding the cutoff value.
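The two cutoffs can be illustrated with a short sketch. The function names apply_min_cutoff and passes_max_cutoff are hypothetical, and the adjustment of the averaged error variance near the minimum cutoff is not shown.

```python
def apply_min_cutoff(values, min_cutoff):
    """Replace every value below the minimum cutoff with the cutoff itself."""
    return [max(v, min_cutoff) for v in values]


def passes_max_cutoff(values, max_cutoff):
    """Return False if the gene's average intensity exceeds the maximum cutoff."""
    return sum(values) / len(values) <= max_cutoff


gene = [1.2, 2.5, 1.8, 3.1]                     # log-intensities (made-up numbers)
print(apply_min_cutoff(gene, min_cutoff=2.0))   # values below 2.0 are raised to 2.0
print(passes_max_cutoff(gene, max_cutoff=4.5))  # the gene is kept if this is True
```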
Dye swap
is repeating hybridization on two-color microarrays with the same samples but swapped fluorescent labels. For example, sample A is labeled with Cy3 (green) and sample B with Cy5 (red) in the first array, but sample A is labeled with Cy5 and sample B with Cy3 in the second array. Dye swap is used to remove technical color bias in some genes. Dye swap is a technical replication (=subreplication).
Error function
is a plot of standard deviation, SD (=square root of the error variance), versus expression level. Error variance is averaged for genes with similar expression level.
Error model
is the model of error variance used in ANOVA for determining the statistical significance of differential gene expression. The error model attempts to get a better estimate of the true error variance than the error variance estimated from data (we call it 'actual error variance'). In this software we use 5 error models: (1) actual error variance - this means that each gene is processed in ANOVA independently from other genes; (2) averaged error variance - this model helps to stabilize the F-statistic in ANOVA; (3) Bayesian model - the error variance is a weighted average of the actual and averaged error variance; it is an intermediate model between #1 and #2; (4) maximum of averaged and actual error variance - this method is the most conservative; it reduces the number of false positives; (5) maximum of averaged and Bayesian error variances - it is an intermediate model between #2 and #4. However, none of these models is perfect, so we should use caution if the error variance is too high. In this software, we tag genes that have high error variance. Some of these genes may appear "significant", but it is better to examine the data visually before deciding that they are really significant.
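The five models differ only in how three per-gene estimates are combined, which a small sketch can make explicit. The function name error_variance_by_model and the numbering of its model argument follow the list above but are otherwise assumptions of this example; the Bayesian estimate is taken as a precomputed input.

```python
def error_variance_by_model(model, actual, averaged, bayesian):
    """Return the error variance to use in ANOVA under error model 1..5.

    actual   -- error variance estimated from this gene's data
    averaged -- averaged error variance for genes with similar intensity
    bayesian -- weighted average of actual and averaged (computed elsewhere)
    """
    if model == 1:
        return actual
    if model == 2:
        return averaged
    if model == 3:
        return bayesian
    if model == 4:
        return max(averaged, actual)      # most conservative model
    if model == 5:
        return max(averaged, bayesian)
    raise ValueError("model must be an integer from 1 to 5")


# A gene whose actual variance is well above the averaged one (made-up numbers).
for m in range(1, 6):
    print(m, error_variance_by_model(m, actual=0.09, averaged=0.04, bayesian=0.06))
```

Because the Bayesian estimate always lies between the actual and averaged variance, model 5 can never return more than model 4.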
Error variance
is the variance of replications within groups. It is estimated as the sum of squared differences between data and the corresponding group means, divided by the error degrees of freedom (the number of data values minus the number of groups). Error variance can be used directly in ANOVA or indirectly via error models and variance averaging.
FDR (false discovery rate)
is the proportion of false positives among all genes that we consider significant. FDR can be viewed as an equivalent of a P-value in experiments with multiple hypotheses testing. In microarray experiments we simultaneously test null-hypotheses for all genes. If there are 20000 genes on a chip, then by using P-value=0.05 we will consider 5% of genes significant even if the null-hypotheses are true for all genes (i.e., no differential expression). It means that we will get 1000 false positives! This example shows that an unadjusted P-value threshold is not meaningful for multiple hypotheses testing. A possible solution of the problem is to use the Bonferroni correction by multiplying the P-value by the total number of genes. This method ensures no false positives with a probability of 95%; however, it is too stringent because we can tolerate some small proportion of false positives. FDR is an intermediate method between the P-value and the Bonferroni correction. The equation is
FDR_r = min_{i>=r} (p_i*N/i),
where r is the rank of a gene ordered by increasing p-values, p_i is the p-value for the gene with rank i, and N is the total number of genes tested (Benjamini, Y. & Hochberg, Y., 1995. J Roy Stat Soc B 57: 289-300). The FDR value increases monotonically with increasing p-value (or with decreasing t-statistic or F-statistic).
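A minimal sketch of the Benjamini-Hochberg adjustment follows, assuming that the FDR for the gene of rank r is the minimum of p_i*N/i over all ranks i >= r. The function name fdr_values is hypothetical.

```python
def fdr_values(p_values):
    """Benjamini-Hochberg FDR for each p-value, returned in the input order."""
    n = len(p_values)
    # Gene indices ordered by increasing p-value; rank 1 is the smallest.
    order = sorted(range(n), key=lambda k: p_values[k])
    fdr = [0.0] * n
    running_min = 1.0
    # Walk from the largest rank down, so that running_min always holds
    # the minimum of p_i*N/i over all ranks i >= current rank.
    for rank in range(n, 0, -1):
        gene = order[rank - 1]
        running_min = min(running_min, p_values[gene] * n / rank)
        fdr[gene] = running_min
    return fdr


print(fdr_values([0.01, 0.04, 0.03, 0.5]))
```

Taking the running minimum from the highest rank downward is what keeps the FDR from decreasing as the p-value increases.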
F-statistics
is the ratio of the factor variance to the error variance in ANOVA. The F-statistic is then used to estimate the P-value according to either the theoretical F-distribution or an empirical F-distribution obtained from permutation analysis. The P-value is then used for determining whether the variation between means is significant. If multiple hypotheses are tested, then FDR is estimated from the P-values.
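For a single gene in a one-way design, the ratio can be computed as in the sketch below. It uses the actual error variance only (no error model), and the function name f_statistic is an assumption of this example. Converting F to a P-value needs the F-distribution and is not shown.

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic for one gene.

    groups -- list of lists; each inner list holds the replicate values
              (e.g., log-intensities) of one factor level.
    """
    n_total = sum(len(g) for g in groups)
    k = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Factor (between-group) and error (within-group) sums of squares.
    ss_factor = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    factor_variance = ss_factor / (k - 1)
    error_variance = ss_error / (n_total - k)
    return factor_variance / error_variance


# Three tissues with three replicates each (made-up numbers).
print(f_statistic([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```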
Gene expression
is the intensity of transcription (mRNA synthesis from DNA template) in a cell.
Highly variable gene
is a gene with error variance >3 times higher than average. In this software, we tag these genes so that they can be examined visually.
Microarray
is a slide or membrane with numerous probes that represent various genes of some biological species. Probes are either oligonucleotides that range in length from 25 to 60 bases, or cDNA clones with lengths from a hundred to several thousand bases. Microarrays are hybridized with labeled cDNA synthesized from an mRNA sample of some tissue. The intensity of the label (radioactive or fluorescent) at each spot on a microarray indicates the expression of the corresponding gene. One-color arrays (usually with a radioactive label) show the absolute expression level of each gene. Two-color arrays (fluorescent label only) can indicate the relative expression level of the same gene in two samples that are labeled with different colors and mixed before hybridization. One of these samples can be a universal reference, which helps to compare samples that were hybridized on different arrays.
Outliers
are data that are suspiciously different from other data from the same experiment. Outliers can be detected using the z-value: z=|x-Mean|/SD, where x is the tested value, Mean is the mean value for the same experiment, and SD is the standard deviation from the mean. In ANOVA, SD is calculated as the square root of the mean square error (MSE). Values with high z-values can be outliers. How do you determine which z-value to select for outlier removal? The answer depends on the volume of data. If you analyze 22000 genes with 12 1-color arrays, then you have 264000 numbers. Assuming no real outliers, the highest z-value is expected to be 4.6. To be sure that you remove only real outliers, you need to select a z-value somewhat higher than 4.6, for example z=6 or z=8. If you think the data have problems, you may want to remove more outliers by reducing the z-value. If you don't want to remove any outliers, select z=10000. Removing outliers means replacing them with missing values.
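One way to arrive at a figure like 4.6 is to ask which |z| a normally distributed value exceeds about once in 264000 draws. The sketch below does this with the Python standard library; the normality assumption and the function names expected_max_z and mask_outliers are assumptions of this example, not a description of how the software computes it.

```python
from statistics import NormalDist


def expected_max_z(n_values):
    """|z| that is exceeded on average once among n_values normal values."""
    # Two-sided tail: P(|Z| > z) = 1/n_values, so each tail holds 1/(2*n_values).
    return NormalDist().inv_cdf(1 - 1 / (2 * n_values))


def mask_outliers(values, mean, sd, z_threshold):
    """Replace values whose z=|x-Mean|/SD exceeds the threshold with None."""
    return [None if abs(x - mean) / sd > z_threshold else x for x in values]


print(round(expected_max_z(22000 * 12), 1))   # close to the 4.6 quoted above
print(mask_outliers([1.0, 2.0, 50.0], mean=2.0, sd=1.0, z_threshold=6))
```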
PCA
Principal Component Analysis (PCA) is a multivariate analysis technique that finds the major patterns in data variability. In mathematical terms, it finds eigenvalues and the corresponding eigenvectors (=principal components, PC). Most important are the first few principal components, which explain most of the observed variance; the rest are mostly random fluctuations. Thus, by plotting data versus the first 2 or 3 PCs we can reduce the dimensionality of the data without much loss of information. Singular Value Decomposition (SVD) is a more generic method than PCA; it identifies eigenvectors both for the rows (=genes) and the columns (=tissues) of the data matrix. In fact, both gene-points and tissue-points can be plotted on the same graph using a technique called "biplot", which is implemented in our software.
Web references:
SVD and PCA for microarrays
Oklahoma State Univ., botany
UK, oncology
Biplot and SVD
Permutation
is a method for building an empirical F-distribution in ANOVA. The order of columns in a data file is changed randomly, and F-values are determined using ANOVA. After repeating this permutation several hundred times, we can build an empirical F-distribution (using data for multiple genes with similar average intensity).
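The procedure can be sketched as follows. Shuffling the group labels of the columns is used here as an equivalent of reordering the columns, and all genes passed in are assumed to have similar average intensity. The function names f_stat, empirical_f_distribution and empirical_p_value are hypothetical, and the sketch assumes that no permuted grouping has zero within-group variance.

```python
import random


def f_stat(values, labels):
    """One-way ANOVA F-statistic for one gene, given one label per column."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_factor = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)
    return (ss_factor / (k - 1)) / (ss_error / (n - k))


def empirical_f_distribution(genes, labels, n_perm=200, seed=0):
    """Pool F-values of all genes over n_perm random column permutations."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)              # same permutation for every gene
        pooled.extend(f_stat(row, shuffled) for row in genes)
    return sorted(pooled)


def empirical_p_value(f_observed, pooled):
    """Fraction of permuted F-values that are at least as large as observed."""
    return sum(1 for f in pooled if f >= f_observed) / len(pooled)


labels = ["a", "a", "a", "b", "b", "b"]
genes = [[1.0, 2.1, 3.3, 4.2, 5.9, 6.4], [2.0, 1.5, 3.1, 2.7, 4.4, 3.9]]
pooled = empirical_f_distribution(genes, labels)
print(empirical_p_value(f_stat(genes[0], labels), pooled))
```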
Replication
is an independent repeat of an experiment. In practice it is impossible to achieve absolute independence of replicates. For example, the same researcher often does all the replicates, but the results may differ in the hands of another person. Nevertheless, it is very important to reduce dependency between replicates to a minimum. For example, it is much better to take replicate samples from different animals (these are called biological replicates) than from the same animal (these would be technical replicates), unless you are interested in a particular animal. If sample preparation requires multiple steps, it is best if samples are separated from the very beginning, rather than from some intermediate step. Each replication may have several subreplications (=technical replications).
Statistical significance
means rejection of a null-hypothesis, H0, that two samples have the same probability distribution. H0 is tested using some statistic (e.g., t or F); if its value appears in the tail of the theoretical probability distribution for this statistic, so that the probability of observing such a value under H0 drops below some threshold (usually P=0.05), then we consider the difference between the 2 samples significant. This does not guarantee that H0 was indeed false. A case where H0 is true but we consider the difference between means statistically significant is called a "false positive". If we did not detect significant differences but H0 was false, then it is called a "false negative". When multiple hypotheses are tested, the meaning of statistical significance becomes more complicated (see FDR).
Subreplication
is a partially independent repeat of an experiment. We always consider technical replications as subreplications. Although in many cases multiple subreplications are not needed, they may be important if the measurement may produce a technical bias. For example, 2-color arrays often generate a color bias for some genes (usually only for low-intensity genes). In this case, a dye swap would remove the bias, but it would be a subreplication.
Universal reference
is a mixture of cDNA that represents (almost) all genes of a species, with standardized relative abundance. A universal reference is synthesized from the mRNA of various tissues. It can be used as the second sample for hybridization on 2-color microarrays. All other samples then become comparable via the universal reference.
Variance averaging
is averaging the error variance for genes with a similar average expression level (=intensity). Variance averaging is a method for stabilizing t- or F-statistics in microarray experiments with a small number of replications. Error variance often depends on the average intensity of genes (usually it increases as intensity decreases). Thus, variance should be averaged only for genes with similar intensity. First, genes are sorted according to their average intensity, and then the average error variance is estimated in a sliding window of 500 or 1000 genes. We do not recommend reducing the size of the sliding window below 500. Some genes may have unusually high error variance because of outlier values. To avoid the effect of these genes on the averaged error variance, it is better to remove the top 1% or 5% of error variance values before averaging. The averaged error variance can be used in ANOVA instead of the actual error variance, or it can be combined with the actual error variance according to various error models.
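The sorting and sliding-window steps can be sketched as below. Two details are assumptions of this example rather than statements about the software: the window is centered on each gene where possible, and the top error variances are trimmed within each window (they could equally be trimmed once over all genes). The function name averaged_error_variance is hypothetical.

```python
def averaged_error_variance(mean_intensity, error_var, window=500, trim_fraction=0.01):
    """Average the error variance over genes with similar average intensity.

    mean_intensity -- average intensity of each gene
    error_var      -- actual error variance of each gene (same order)
    window         -- number of genes in the sliding window
    trim_fraction  -- share of the highest variances dropped in each window
    """
    n = len(mean_intensity)
    order = sorted(range(n), key=lambda g: mean_intensity[g])
    half = window // 2
    averaged = [0.0] * n
    for pos, gene in enumerate(order):
        # Keep the window at full size near both ends of the sorted list.
        lo = max(0, pos - half)
        hi = min(n, lo + window)
        lo = max(0, hi - window)
        in_window = sorted(error_var[order[j]] for j in range(lo, hi))
        n_keep = max(1, len(in_window) - int(len(in_window) * trim_fraction))
        kept = in_window[:n_keep]          # drop the highest variances
        averaged[gene] = sum(kept) / len(kept)
    return averaged


# Tiny made-up example: one gene has an unusually high error variance.
intensity = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
variance = [1.0, 1.0, 100.0, 1.0, 1.0, 1.0]
print(averaged_error_variance(intensity, variance, window=4, trim_fraction=0.25))
```

The window of 4 genes is used only to keep the example small; it is far below the recommended minimum of 500.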
VRML
stands for Virtual Reality Modeling Language (originally Virtual Reality Markup Language). It is an object-oriented language for describing 3D objects. To view the image you need a VRML viewer (e.g., Cortona or Cosmo).
Web resources:
Floppy's Web 3D
Web 3D Consortium