Glossary
- ANOVA
- is ANalysis Of VAriance, a statistical technique for detecting
statistical significance. The major advantage
of ANOVA over a simple t-test is that
variances are averaged over all factor levels,
so the statistics become more stable. In ANOVA we calculate the F-statistic,
which is then used to estimate the P-value and determine whether the
variation between means is significant. Testing multiple
hypotheses with ANOVA (as in the case of microarray data) requires
some modifications of ANOVA: variance averaging and FDR.
- Array type
- is a file listing the probes (or clones) on the microarray, with annotations.
The file is a tab-delimited text file with headers in the first row.
The following three columns are required:
The first column is the probe ID (oligo ID or cDNA clone ID), which should match the
gene ID in the data file that you analyze. The gene ID can be either a number or text.
If possible, select a gene ID that can be referenced on the web (e.g., GenBank accession#).
The second column is the gene symbol. If several genes match the probe, put a comma
and a space between gene symbols. If there is no space after the comma, table
columns in the output will be too wide. The third column is the gene annotation.
The file may have additional columns if necessary (e.g., GenBank
accession number, UniGene, LocusLink, MGI, etc.). These columns should have
headers so that they are displayed in all tables. You can use HTML hyperlinks in additional columns
(recommended). Do not use hyperlinks in the first column!
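The file layout described above can be checked with a short script. This is only a sketch: the function name read_array_type, the column headers, and the probe and gene names in the example are hypothetical, and it simply normalizes the separator between gene symbols to a comma followed by a space.

```python
import csv
import io

def read_array_type(text):
    """Parse a tab-delimited array-type table into {probe_id: row_dict}."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    headers = next(reader)  # the first row holds the column headers
    if len(headers) < 3:
        raise ValueError("need at least 3 columns: probe ID, symbol, annotation")
    probes = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # skip empty lines and rows without a probe ID
        row = row + [""] * (len(headers) - len(row))  # pad short rows
        # Separate multiple gene symbols with a comma and a space.
        symbols = ", ".join(s.strip() for s in row[1].split(",") if s.strip())
        record = dict(zip(headers, row))
        record[headers[1]] = symbols
        probes[row[0].strip()] = record
    return probes

# Hypothetical two-line file: header row plus one probe.
example = (
    "ProbeID\tSymbol\tAnnotation\tGenBank\n"
    "P0001\tAbc1,Abc2\thypothetical annotation text\tX00000\n"
)
table = read_array_type(example)
print(table["P0001"]["Symbol"])
```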
- Bayesian error model
- was proposed by Baldi and Long (2001. Bioinformatics 17: 509-519). The mean posterior
estimate of error variance was shown to be the weighted average of the actual and averaged
error variance: w1*var+w2*aver_var, where weights depend on the desirable degrees of
freedom (ddf): w1=df/ddf, w2=1-df/ddf, where df is the actual number of degrees of freedom.
The greater the ddf, the closer the mean posterior estimate is to the averaged
error variance. See also error models.
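A small numeric sketch of the weighted average above. The function name and the example numbers are made up for illustration; capping w1 at 1 when ddf is smaller than df is our assumption, since the text does not cover that case.

```python
def bayesian_error_variance(var, aver_var, df, ddf):
    """Weighted average w1*var + w2*aver_var with w1 = df/ddf, w2 = 1 - df/ddf."""
    w1 = min(df / ddf, 1.0)  # assumption: never weight the actual variance above 1
    w2 = 1.0 - w1
    return w1 * var + w2 * aver_var

# Hypothetical gene: actual variance 0.04 with df = 2, averaged variance 0.01, ddf = 10.
# Weights are w1 = 0.2 and w2 = 0.8.
estimate = bayesian_error_variance(var=0.04, aver_var=0.01, df=2, ddf=10)
print(estimate)
```

With a larger ddf the estimate moves toward the averaged error variance, and with ddf equal to df it equals the actual error variance.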
- Biplot
- was proposed by Gabriel (1971. Biometrika 58: 453-467). This is a method for
plotting together rows and columns of the data matrix, which can be used for
examining associations between genes (rows) and tissues/experiments (columns). The
technique is based on the Singular Value Decomposition (SVD) method.
Web references:
SVD and PCA for microarrays
Biplot and SVD
- Clustering
- Numerous methods exist for clustering genes based on their expression patterns.
TIGR MEV software includes a variety
of methods (hierarchical clustering, k-mean clustering, SOM, etc.). Our software generates
a Stanford-formatted output file that can be used with MEV software.
In this software we implemented 2 methods of gene clustering that are not available
in MEV. The first finds genes that are specific for a particular cluster of tissues,
and the second is clustering based on PCA.
Method 1: First we do hierarchical clustering of tissues; then select genes that were
significantly more expressed in all tissues in this cluster compared to all tissues
outside of the cluster.
Method 2: First we do PCA; then estimate the regression of eigenvectors versus the log-expression
of each gene; then estimate the logratio of gene change based on the regression line
(see figure).
Then two clusters of genes are identified with each principal component (PC): those that are positively
correlated with the PC and have a logratio above a given threshold (e.g., log(5)), and
those that are negatively correlated with the PC and have a logratio below the negative threshold
(e.g., -log(5)).
- Cross-channel correction
- is an option for pre-processing 2-color arrays with a common
reference. Theoretically, red and green intensities should be independent; however, we often observed
that the intensity in the red channel (reference sample) increased with increasing intensity in the
green channel. Adjustment is done on a gene-by-gene basis if the average
log-intensity of the reference is lower than the average log-intensity
in the other channel by at least log(5) (i.e., at least a 5-fold difference), and if the
cross-channel correlation is >0.7. In this case, the reference intensity
is set to its average value for this gene plus the adjustment to the average
reference intensity for all genes in each array.
- Cutoffs
- Cutoffs are used for data filtering and adjustment. If a data value is less
than the minimum cutoff, then it is replaced by the minimum cutoff value. This
adjustment may artificially lower the error variance for low-expressed genes. To
avoid this effect, the software adjusts the averaged error variance for genes with
average intensity within 2*SD from the minimum cutoff, by not letting it decrease
as the average intensity decreases. With the maximum cutoff, genes whose
average intensity exceeds the cutoff value are simply ignored.
- Dye swap
- is repeating hybridization on two-color microarrays with the same samples
but swapped fluorescent labels. For example, sample A is labeled with Cy3
(green) and sample B with Cy5 (red) in the first array, but sample A is
labeled with Cy5 and sample B with Cy3 in the second array. Dye swap is
used to remove technical color bias in some genes. Dye swap is a technical
replication (=subreplication).
- Error function
- is a plot of standard deviation, SD (=square root of the error variance), versus
expression level. Error variance is averaged for genes with similar expression level.
- Error model
- is the model of error variance used in ANOVA for determining statistical
significance of differential gene expression. The error model attempts to
get a better estimate for the true error variance than the error variance
estimated from data (we call it 'actual error variance'). In this software we
use 5 error models: (1) actual error variance - this means that each gene
is processed in ANOVA independently from other genes; (2) averaged error
variance - this model helps to stabilize F-statistics in ANOVA;
(3) Bayesian model - the error variance is a weighted average of the actual and averaged
error variance; it is an intermediate model between #1 and #2;
(4) maximum of averaged and actual error variance - this method is the most
conservative; it reduces the number of false positives; (5) maximum of
averaged and Bayesian error variances - it is an intermediate model between
#2 and #4. However, none of these models is perfect, so we should use
caution if the error variance is too high. In this software, we tag genes that
have high error variance. Some of these genes
may appear "significant", but it is better to examine the data visually before
deciding that they are really significant.
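The five models can be summarized as a simple selection rule. This is a sketch only: the function name and the example numbers are hypothetical, and the Bayesian estimate is passed in as an already computed value.

```python
def error_variance_for_model(model, actual, averaged, bayesian):
    """Return the error variance used in ANOVA under error models 1-5."""
    if model == 1:
        return actual                   # each gene on its own
    if model == 2:
        return averaged                 # averaged over genes with similar intensity
    if model == 3:
        return bayesian                 # weighted average of actual and averaged
    if model == 4:
        return max(averaged, actual)    # most conservative
    if model == 5:
        return max(averaged, bayesian)  # between models 2 and 4
    raise ValueError("model must be an integer from 1 to 5")

# Hypothetical gene whose actual variance is well above the averaged one.
variances = [error_variance_for_model(m, actual=0.04, averaged=0.01, bayesian=0.016)
             for m in range(1, 6)]
print(variances)
```

For a gene like this one, models 4 and 5 never use a variance below the averaged value, which is why they yield fewer false positives than model 2.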
- Error variance
- is the variance of replications within groups. It is estimated as the
sum of squared differences between data and corresponding group means,
divided by the error degrees of freedom.
Error variance can be used directly in ANOVA or indirectly via
error models and variance averaging.
- FDR (false discovery rate)
- is the proportion of false positives among all genes that we consider
significant. FDR can be viewed as an equivalent of a P-value in experiments
with multiple hypotheses testing.
In microarray experiments we simultaneously test null-hypotheses for all genes.
If there are 20000 genes on a chip, then by using a P-value threshold of 0.05 we will consider
5% of genes significant even if the null-hypotheses are true for all genes (i.e., no
differential expression). It means that we will get 1000 false positives!
This example shows that an unadjusted P-value threshold is misleading in multiple hypothesis testing.
A possible solution to the problem is to use the Bonferroni correction, multiplying
the P-value by the total number of genes. This method ensures no false positives
with a probability of 95%; however, it is too stringent because we can tolerate
some small proportion of false positives. FDR is an intermediate method between the
P-value and the Bonferroni correction; it is equal to the proportion of false positives
among all genes that we consider significant. The equation is
FDR(r) = min over i>=r of (pi*N/i),
where r is the rank of a gene ordered by increasing p-values, pi is the
p-value for the gene with rank i, and N is the total number of genes tested
(Benjamini, Y. & Hochberg, Y., 1995. J Roy Stat Soc B 57: 289-300).
The FDR value increases monotonically with increasing p-value
(or decreasing t-statistics or F-statistics).
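The FDR calculation can be sketched as follows, using the usual form of the Benjamini-Hochberg adjustment: for a gene with rank r, take the minimum of pi*N/i over all ranks i >= r. The function name and the example p-values are hypothetical, and capping the result at 1 is our assumption.

```python
def fdr_values(p_values):
    """Benjamini-Hochberg FDR for each p-value, returned in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])  # indices by increasing p
    fdr = [0.0] * n
    running_min = 1.0  # assumption: FDR is capped at 1
    # Walk from the largest rank down so that the running minimum covers i >= r.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        fdr[idx] = running_min
    return fdr

# Hypothetical p-values for 4 genes.
fdr = fdr_values([0.01, 0.04, 0.03, 0.5])
print(fdr)
```

In this example the gene with p=0.03 (rank 2) gets 0.04*4/3 rather than 0.03*4/2, because the minimum is taken over all higher ranks; this is what keeps the FDR monotonic in the p-value.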
- F-statistics
- is the ratio of the factor variance to the error variance in ANOVA. The F-statistic
is then used to estimate the P-value according to either the theoretical
F-distribution or an empirical F-distribution obtained from permutation analysis.
The P-value is then used for determining if the variation between means is
significant. If multiple hypotheses are tested, then FDR is estimated from
P-values.
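A minimal sketch of the F-statistic for one gene in a one-way ANOVA, using the actual error variance only (no variance averaging or error model). The function name and the example values are hypothetical.

```python
def f_statistic(groups):
    """One-way ANOVA F: factor variance divided by error variance."""
    k = len(groups)                   # number of factor levels (groups)
    n = sum(len(g) for g in groups)   # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Factor variance: spread of group means around the grand mean.
    ss_factor = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Error variance: spread of replications around their own group mean.
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    factor_var = ss_factor / (k - 1)
    error_var = ss_error / (n - k)
    return factor_var / error_var

# Hypothetical log-intensities of one gene in 3 tissues, 3 replications each.
F = f_statistic([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(F)
```

Here the factor variance is 26/2 = 13 and the error variance is 6/6 = 1, so F = 13.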
- Gene expression
- is the intensity of transcription (mRNA synthesis from DNA template) in a cell.
- Highly variable gene
- is a gene with error variance >3 times higher than average. In this software,
we tag these genes so that they can be examined visually.
- Microarray
- is a slide or membrane with numerous probes that represent various genes of
some biological species. Probes are either oligonucleotides that range in
length from 25 to 60 bases, or cDNA clones with length from a hundred to
several thousand bases. Microarrays are hybridized with labeled cDNA synthesized
from an mRNA sample of some tissue. The intensity of the label (radioactive or
fluorescent) in each spot on a microarray indicates the expression of the
corresponding gene. One-color arrays (usually with radioactive label) show the absolute expression
level of each gene. Two-color arrays (fluorescent label only) can indicate relative
expression level of the same gene in two samples that are labeled with different
colors and mixed before hybridization. One of these samples can be a universal
reference which helps to compare samples that were hybridized on different arrays.
- Outliers
- are data that are suspiciously different from other data from the same experiment.
Outliers can be detected using the z-value: z=|x-Mean|/SD, where x is the tested value,
Mean is the mean value for the same experiment, and SD is the standard deviation from
the mean. In ANOVA, SD is calculated as the square root of the mean square error (MSE). Values
with high z-values can be outliers. How to determine what z-value to select for outlier
removal? The answer depends on the volume of data. If you analyze 22000 genes with
12 1-color arrays, then you have 264000 numbers. Assuming no real outliers, the highest
z-value is expected to be 4.6. To be sure that you remove real outliers you need to
select the value z somewhat higher than 4.6, for example z=6 or z=8. If you think the
data have problems you may want to remove more outliers by reducing the z-value. If you
don't want to remove any outliers, select z=10000. Removing outliers means replacing
them with missing values.
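The value of 4.6 can be reproduced approximately with a short calculation. This sketch assumes normally distributed errors and takes the "expected highest z-value" to be the z at which, on average, one of the n values exceeds it in either tail; the function names and the example numbers are hypothetical.

```python
from statistics import NormalDist

def expected_max_z(n_values):
    """z at which about one of n normal values is expected to exceed |z|."""
    # Two-sided tail: P(|Z| > z) = 1/n, so each tail holds 1/(2n).
    return NormalDist().inv_cdf(1.0 - 1.0 / (2.0 * n_values))

def remove_outliers(values, mean, sd, z_cutoff):
    """Replace values with z = |x - mean|/sd above the cutoff by None (missing)."""
    return [None if abs(x - mean) / sd > z_cutoff else x for x in values]

z = expected_max_z(22000 * 12)  # 22000 genes on 12 one-color arrays
print(round(z, 2))              # close to 4.6

# Hypothetical replicates of one gene with mean 8.0 and SD 0.5.
cleaned = remove_outliers([8.0, 8.2, 7.9, 15.0], mean=8.0, sd=0.5, z_cutoff=6)
print(cleaned)
```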
- PCA
- Principal Component Analysis (PCA) is a multivariate analysis technique which finds
major patterns in data variability. In mathematical terms, it is finding eigenvalues and
corresponding eigenvectors (=principal components, PC). The most important are the first few principal
components, which explain most of the observed variance; the rest are mostly random
fluctuations. Thus, by plotting data versus the first 2 or 3 PCs we can reduce the dimensionality
of the data without much loss of information. Singular Value Decomposition (SVD) is a more
generic method than PCA: it identifies eigenvectors both for the rows (=genes) and columns (=tissues) of the
data matrix. In fact, both gene-points and tissue-points can be plotted on the same graph
using a technique called "biplot", which is implemented in our software.
Web references:
SVD and PCA for microarrays
Oklahoma State Univ., botany
UK, oncology
Biplot and SVD
- Permutation
- is a method for building an empirical F-distribution in ANOVA. The order
of columns in a data file is changed randomly, and F-values are determined
using ANOVA. After repeating this permutation several hundred times we can
build an empirical F-distribution (using data for multiple genes with similar
average intensity).
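A simplified sketch of the permutation idea for a single gene: the assignment of columns to groups is shuffled, and F is recalculated each time. Unlike the procedure described above, it does not pool genes with similar average intensity. All names and data values are hypothetical.

```python
import random

def f_statistic(values, labels):
    """One-way ANOVA F for one gene; labels assign each column to a group."""
    groups = {}
    for x, lab in zip(values, labels):
        groups.setdefault(lab, []).append(x)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_factor = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    ss_error = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)
    return (ss_factor / (k - 1)) / (ss_error / (n - k))

def permutation_f_values(values, labels, n_perm, seed=0):
    """F-values obtained after randomly reassigning columns to groups."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(labels)
    result = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        result.append(f_statistic(values, shuffled))
    return result

# Hypothetical gene: 3 tissues (A, B, C) with 3 replications each.
values = [1.0, 2.1, 2.9, 2.2, 3.0, 4.1, 5.0, 6.2, 6.9]
labels = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
observed = f_statistic(values, labels)
null_f = permutation_f_values(values, labels, n_perm=500)
# Empirical P-value: share of permuted F-values at least as large as the observed F.
p_empirical = sum(f >= observed for f in null_f) / len(null_f)
print(observed, p_empirical)
```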
- Replication
- is an independent repeat of an experiment. In practice it is impossible to
achieve absolute independence of replicates. For example, the same researcher
often does all the replicates, but the results may differ in the hands of
another person. But it is very important to reduce dependency between
replicates to a minimum. For example, it is much better to take replicate
samples from different animals (these are called biological replicates) than
from the same animal (these would be technical replicates), unless you are
interested in a particular animal. If sample preparation requires multiple
steps, it is best if samples are separated from the very beginning, rather
than from some intermediate step. Each replication may have several
subreplications
(=technical replications).
- Statistical significance
- means rejection of a null-hypothesis, H0, that two samples
have the same probability distribution. H0 is tested using some
statistic (e.g., t or F); if its value falls in the tail of the theoretical
probability distribution for this statistic, so that the probability of obtaining such an
extreme value under H0 (the P-value) drops below some threshold (usually P=0.05), then we consider the difference
between the 2 samples significant. This does not guarantee that H0 was indeed
false. A case where H0 is true but we consider the difference between means
statistically significant is called a "false positive". If we did not detect
significant differences but H0 was false, then it is called a "false negative".
When multiple hypotheses are tested, the meaning of statistical significance
becomes more complicated (see FDR).
- Subreplication
- is a partially-independent repeat of an experiment. We always consider
technical replications as subreplications. Although in many cases
multiple subreplications are not needed, they may be important if
measurement may produce a technical bias. For example, 2-color arrays
often generate color bias for some gene samples (usually only for low-
intensity genes). In this case, a dye swap would remove the bias, but
it would be a subreplication.
- Universal reference
- is a mixture of cDNA that represent (almost) all genes of a species, and
their relative abundance is standardized. Universal reference is synthesized
from mRNA of various tissues. Universal reference can be used as a second
sample for hybridization on 2-color microarrays. Then all
other samples become comparable via the universal reference.
- Variance averaging
- is averaging the error variance for genes with similar average expression
level (=intensity). Variance averaging is a method for stabilizing t- or F-statistics
in microarray experiments with a small number of replications. Error variance often
depends on the average intensity of genes (usually it increases as intensity
decreases). Thus, variance should be averaged only for genes with similar intensity.
First, genes are sorted according to their average intensity, and then the average error
variance is estimated in a sliding window of 500 or 1000 genes. We do not recommend
reducing the size of the sliding window below 500. Some genes may have
unusually high error variance because of outlier values. To avoid the effect of these
genes on the averaged error variance, it is better to remove the top 1% or 5% of
error variance values before averaging. Average error variance can be used in
ANOVA instead of the actual error variance, or it can be combined with the actual
error variance according to various error models.
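The sliding-window procedure can be sketched as below. Two details are our assumptions, since the text does not specify them: the window is centered on each gene and shrinks at both ends of the intensity range, and the top values are trimmed within each window rather than over all genes. The function name and the toy data are hypothetical.

```python
def averaged_error_variance(intensity, variance, window=500, trim=0.01):
    """Average the error variance over genes with similar average intensity."""
    n = len(intensity)
    order = sorted(range(n), key=lambda i: intensity[i])  # genes by intensity
    half = window // 2
    averaged = [0.0] * n
    for pos, gene in enumerate(order):
        lo = max(0, pos - half)
        hi = min(n, pos + half + 1)  # window is truncated at the ends
        vals = sorted(variance[order[j]] for j in range(lo, hi))
        # Drop the top `trim` share of variances (assumed to be done per window).
        keep = max(len(vals) - int(len(vals) * trim), 1)
        vals = vals[:keep]
        averaged[gene] = sum(vals) / len(vals)
    return averaged

# Toy data: 5 genes, a window of 3 genes, and one very high variance (10.0).
intensity = [3.0, 1.0, 2.0, 5.0, 4.0]
variance = [0.3, 0.1, 0.2, 10.0, 0.4]
averaged = averaged_error_variance(intensity, variance, window=3, trim=0.34)
print(averaged)
```

In the toy data the high variance of 10.0 is trimmed from the full 3-gene window, so it does not inflate the averaged variance of the neighboring gene. With real data the window would be 500 or 1000 genes, as recommended above.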
- VRML
- stands for Virtual Reality Modeling Language. It is an object-oriented language for
describing 3D objects. To view the image you need a VRML viewer (e.g.,
Cortona or
Cosmo).
Web resources:
Floppy's Web 3D
Web 3D Consortium