NIA Mouse Gene Index Ver. 4

1. Three scales for gene location in the chromosome: whole chromosome, 3-Mb window, and a 300-Kb window.
You can navigate in all 3 windows by clicking on any gene or location.
Versions 3 and 4 of the NIA Mouse Gene Index differ from versions 1 and 2 in three respects:
The NIA Mouse Gene Index ver. 4 was assembled from sequences aligned to the genome
(May 2004 release) using our new All-Alignment-Assembly (AAA) algorithm.
Start and end sites of each intron were examined for splicing consensus.
We used canonical (GT-AG) as well as 2 major non-canonical (GC-AG and AT-AC) splicing consensuses,
which were well validated (Burset et al. 2000).
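The splice-site test can be illustrated with a minimal Python sketch. The function name, the 0-based half-open coordinates, and the assumption that the sequence is given on the transcribed strand are illustrative choices and are not taken from the Gene Index software.

```python
# Accepted (donor, acceptor) dinucleotide pairs: the canonical GT-AG
# plus the two major non-canonical pairs GC-AG and AT-AC.
SPLICE_CONSENSUS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def has_splice_consensus(genome: str, intron_start: int, intron_end: int) -> bool:
    """Return True if the intron [intron_start, intron_end) starts and ends
    with one of the accepted dinucleotide pairs.

    Assumption: `genome` is the sequence of the transcribed strand and the
    coordinates are 0-based, half-open.
    """
    if intron_end - intron_start < 4:
        return False  # too short to hold both dinucleotides
    donor = genome[intron_start:intron_start + 2].upper()
    acceptor = genome[intron_end - 2:intron_end].upper()
    return (donor, acceptor) in SPLICE_CONSENSUS


# Toy sequence: exon "AAA", intron "GTCCCCAG", exon "TTT".
EXAMPLE_SEQ = "AAAGTCCCCAGTTT"
print(has_splice_consensus(EXAMPLE_SEQ, 3, 11))
```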
Protein-coding genes have ORF ≥100 aa or a known function (excluding gene copies, repeats and models, see below)
Gene copies were identified as assemblies in which <30% of members had the best alignment.
Gene models with no EST/mRNA support were based on Ensembl and/or RefSeq-XM sequences.
Genes, gene candidates, and non-genes taken together make a set of U-clusters,
which are the basic units of the gene index. Each U-cluster has a unique location
in the genome.
Gene Index web pages belong to the following levels or "views":
Genome view
Transcript view
Sequence view
Results of comparison can be downloaded here.
The set of multi-exon protein-coding genes is well covered by TIGR, DoTS, and NIA indexes
(differences are within 200 genes range). Unigene and ESTgenes have 2,745 and 3,931
missing multi-exon protein-coding genes, respectively.
If multiple sequences in other databases were mapped to the same NIA transcript,
they were considered redundant. Redundancy was very limited in the Unigene and ESTGenes
databases, but very high in TIGR and DoTS (368,557 and 282,488 redundant transcripts,
respectively). For example, a transcript of Hbb1-b1
had 2445 entries in TIGR and 333 entries in DoTS.
In contrast to the good coverage of the gene set, the coverage of individual transcripts
in all public databases was substantially incomplete.
The NIA Mouse Gene Index had 22,456 additional transcripts of protein-coding genes
that consisted of combinations of exons not found in Unigene, TIGR, DoTS, or ESTGenes
databases (see table).
In general, transcript assemblies in the existing gene indexes had smaller numbers of
exons and introns with correct splice sites, and shorter ORFs than the NIA Mouse
Gene Index (Figure below). ESTGenes appeared to be the most deficient in the number of
exons and ORF length in comparison with other gene indexes.
We counted the number of alternative splicing ATS units and the number of supporting
mRNA/EST sequences for each intron of the main transcript in all protein-coding genes.
Each alternative splicing ATS unit was counted only once at the first intron.
Supporting mRNA/EST sequences were counted if they covered the specific intron either
in the main or alternative form. Introns were then grouped into classes according to
the number of supporting mRNA/EST sequences, and the average number of ATS units per
intron was estimated in each class. The number of detected alternative splicing ATS
units increased with increasing number of supporting mRNA/EST sequences and then had
a tendency to level off (Figure below). The average number of alternative splicing ATS units
per intron did not exceed 0.25 even for introns with >1000 supporting EST sequences
(Fig. 8). Main transcripts of protein-coding genes had a total of 176,705 introns. Assuming
an average of 0.25 ATS units per intron, the number of alternative splicing units was
estimated at 44,176. Therefore, known alternative splicing ATS units (N=20,442)
accounted for only 46% of the estimated total number.
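The arithmetic behind this estimate can be checked with a few lines of Python. The variable names are illustrative; the numbers are those quoted in the text.

```python
# Figures quoted in the text.
introns_in_main_transcripts = 176_705
ats_units_per_intron = 0.25      # plateau value of ATS units per intron
known_ats_units = 20_442

# Estimated total number of alternative splicing ATS units (truncated to an integer).
estimated_ats_units = int(introns_in_main_transcripts * ats_units_per_intron)

# Share of the estimated total that is already known.
known_fraction = known_ats_units / estimated_ats_units

print(estimated_ats_units)            # 44176
print(round(100 * known_fraction))    # 46
```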
Because the BLAT algorithm attempted to find a genomic match for as many nucleotides
as possible, it created artificial small exons within an intron to avoid mismatches
that are close to exon boundaries (Volfovsky et al. 2003). Small exons (<6 bp) without
splicing consensus were considered artifacts and were either merged with neighboring
exons (if the number of mismatches was <50%) or removed from the alignment. Most real
micro-exons (80%) were well detected with BLAT (Volfovsky et al. 2003); thus we did not
use any additional correction procedures for micro-exon detection. Short initial and
final blocks (<40 bp) in the alignment that were separated by an intron without splice
consensus were removed because most of them were random matches. Short alignments
(<70 bp) were removed if the genome span was >500,000 bp, or there were introns without
splice sites, or PID was <95%. Because sequence quality was usually lower in ESTs than
in full mRNAs, we applied more stringent criteria for the filtering of ESTs. The best
EST alignment was deleted if its genome span was >200000 bp and there were no introns
with splicing consensus, or there were >1 intron without splicing consensus, or PID
was <90%. Other EST alignments were deleted if the genome span was >100000 bp and
there were <2 introns with splice consensus, or there were >1 intron without splice
consensus, or PID was <90%. These criteria were determined iteratively by examining
the results of gene assembly and identifying sequences that caused problems.
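The EST filtering rules above can be written as a single predicate. The sketch below is one possible reading: it assumes that "and" binds tighter than "or" in the sentences above, so the genome-span condition is combined only with the condition on introns with splicing consensus. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EstAlignment:
    genome_span: int                # bp between alignment start and end on the genome
    introns_with_consensus: int     # introns with GT-AG, GC-AG or AT-AC
    introns_without_consensus: int  # introns lacking splicing consensus
    pid: float                      # percent identity, 0-100
    is_best: bool                   # True for the best alignment of the EST


def should_delete_est_alignment(a: EstAlignment) -> bool:
    """Apply the EST deletion criteria (assumed operator grouping, see lead-in)."""
    if a.is_best:
        long_unsupported = a.genome_span > 200_000 and a.introns_with_consensus == 0
    else:
        long_unsupported = a.genome_span > 100_000 and a.introns_with_consensus < 2
    return long_unsupported or a.introns_without_consensus > 1 or a.pid < 90.0
```

For example, a best alignment spanning 250,000 bp with no consensus introns would be deleted, while the same alignment with one consensus intron would be kept under this reading.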
Sequence orientation was validated in three steps. First, we identified multi-exon
alignments in which orientation was unambiguously determined by the direction of splicing
consensus. If the orientation determined from splicing consensus did not match with
the original sequence orientation, then the sequence was labeled as "wrong-strand".
At the second step, we assembled alignments with validated orientation and checked
their overlap with other alignments. If an alignment with unclear orientation overlapped
by >20% length with some assembly with known orientation, and the overlap length with
this best matching assembly was at least twice as large as the overlap with any assembly
in the opposite strand, then the orientation of the tested alignment was considered
valid. This procedure might have re-oriented some naturally occurring antisense RNAs
if they are unspliced. The orientation of spliced antisense RNAs was determined based
on splice sites and was not changed. We did not intend to assemble unspliced antisense
transcripts, because they could not be effectively distinguished from the genomic
contamination, and their biological function was unclear. At the third step, we
assembled all sequences with unclear orientation and determined the orientation of
each consensus based on the majority rule.
Additional filtering was done in the groups of partially overlapping alignments. If
a sequence had multiple partially overlapping alignments in the same group, then
only the best one was retained. Alignments of gene models that had no support for
any of the introns from alignments of expressed sequences in the same group were
removed. If an intron joined two distinct sets of alignments or contained several
multi-exon alignments and had insufficient evidence (i.e., supported by one mRNA/EST
sequence or by only gene models), then the alignments were truncated at that intron.
In addition, we truncated alignments at introns with insufficient evidence that
either had length>30 kb and no splicing consensus, or included a promoter or a start
or end of a RefSeq sequence alignment.
Alignments in each chromosome and each strand were grouped into non-overlapping
clusters, and then each cluster was processed sequentially by the AAA algorithm.
The proposed AAA algorithm consisted of four steps: (1) find all non-redundant left
(towards 5'-end of gene) extensions for each alignment; (2) identify all right-end
alignments that cannot be extended to the right (towards 3'-end of gene); (3)
assemble transcripts starting from right to left by branching the extension of each
alignment to the left; (4) remove redundant and low-quality transcripts.
The algorithm started with sorting all alignments by their starting position, and
sequence extensions were determined from left to right. An alignment B extended
alignment A to the left if it partially overlapped with A, was compatible with A,
and strictly left from A. From the set of all left extensions we removed redundant
left extensions using Algorithm 1. Extension B of alignment A was defined non-redundant
if for any other extension C of alignment A either (1) C was non-compatible with B,
or (2) B was longer, or (3) B was shorter, but it had left extensions (direct or
chained) that were non-compatible with C. For example, in the figure below, extension B of
alignment A was compatible with C and shorter than C. However it was non-redundant
because it had a left extension D, which was incompatible with C.
To eliminate redundant left extensions for each alignment A, the set S of all left extensions
was sorted by increasing left boundary position. Then for each subsequent extension
s we checked if it was compatible with any longer non-redundant extension n. If it
was not compatible with any, then s was added to the list of non-redundant extensions
of A. If s was compatible with a longer left extension n, we checked if any left
extension of s was compatible with n. A stack was initialized with alignment s and
then it accumulated assemblies that started from s and extended transitively to the
left. When assembly was extracted from the stack, it was extended to the left with
all non-redundant left extensions determined for its left-most element. If extension
Q was compatible with n, and its left boundary was to the right of the left end of
n, then it was combined with the assembly and added back to the stack. If Q was
compatible with n but its left boundary was equal to or left of the left end of n, then
the next left extension Q of the assembly was tried. If Q was not compatible with n
then s was not redundant compared to n; in this case we went to the next non-redundant
extension n. If all non-redundant extensions were tested and s was non-redundant to
all of them then s was added to the set of non-redundant left extensions of A, and
the algorithm was repeated for the next s.
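The filtering procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the original implementation: alignments are represented by arbitrary labels, `left_end` and `compatible` are caller-supplied functions, `left_ext` is assumed to already hold the non-redundant left extensions of every extension, and only the left-most element of each partial assembly is kept on the stack (the rest of the assembly is not needed for the test).

```python
from typing import Callable, Dict, Hashable, Iterable, List


def chain_breaks_compatibility(s, n, left_end, compatible, left_ext) -> bool:
    """True if s, or a chain of left extensions of s, reaches an extension
    that is incompatible with n before passing the left end of n."""
    stack = [s]                       # left-most elements of partial assemblies
    while stack:
        last = stack.pop()
        for q in left_ext.get(last, []):
            if not compatible(q, n):
                return True           # s is non-redundant compared to n
            if left_end(q) > left_end(n):
                stack.append(q)       # still right of n's left end: keep extending
    return False


def nonredundant_left_extensions(
    S: Iterable[Hashable],
    left_end: Callable[[Hashable], int],
    compatible: Callable[[Hashable, Hashable], bool],
    left_ext: Dict[Hashable, List[Hashable]],
) -> List[Hashable]:
    N: List[Hashable] = []
    for s in sorted(S, key=left_end):  # longest (left-most) extensions first
        redundant = any(
            compatible(s, n)
            and not chain_breaks_compatibility(s, n, left_end, compatible, left_ext)
            for n in N
        )
        if not redundant:
            N.append(s)
    return N


# Toy data: B and C both extend A; B is shorter than C and compatible with it,
# but B has a left extension D that is incompatible with C.
EXAMPLE_LEFT_END = {"C": 0, "D": 30, "B": 50}
EXAMPLE_COMPATIBLE_PAIRS = {frozenset(("B", "C"))}


def example_compatible(a, b) -> bool:
    return frozenset((a, b)) in EXAMPLE_COMPATIBLE_PAIRS


print(nonredundant_left_extensions(
    ["B", "C"], EXAMPLE_LEFT_END.__getitem__, example_compatible, {"B": ["D"]}))
```

With the extension D present, B is retained alongside C; without it, B is dropped as redundant.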
Algorithm 1. Filtering non-redundant left extensions for alignment A
1 Initialize empty set N of non-redundant extensions of A.
Transcripts were assembled (step 3) starting from the rightmost alignments, which were
then combined with all possible non-redundant left extensions. Because the assembly
could branch, we used a stack to store incomplete transcripts.
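A stack-based enumeration of this kind might look like the sketch below. It assumes that the non-redundant left extensions are already available in a dictionary (`left_ext`, a hypothetical name) and that extensions always lie strictly to the left, so the search cannot cycle. Redundancy and quality filtering (step 4) are not shown.

```python
from typing import Dict, Hashable, Iterable, List


def assemble_transcripts(
    right_ends: Iterable[Hashable],
    left_ext: Dict[Hashable, List[Hashable]],
) -> List[List[Hashable]]:
    """Enumerate transcripts as lists of alignments ordered from right to left."""
    transcripts: List[List[Hashable]] = []
    stack: List[List[Hashable]] = [[r] for r in right_ends]  # incomplete transcripts
    while stack:
        partial = stack.pop()
        extensions = left_ext.get(partial[-1], [])
        if not extensions:
            transcripts.append(partial)          # cannot be extended further left
        else:
            for e in extensions:                 # branch: one copy per extension
                stack.append(partial + [e])
    return transcripts


# Toy example: A is extended by B or C; B is further extended by D.
EXAMPLE_LEFT_EXT = {"A": ["B", "C"], "B": ["D"]}
print(assemble_transcripts(["A"], EXAMPLE_LEFT_EXT))
```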
It can be proven that all possible full transcripts are generated by the algorithm. A
full transcript is one that cannot be extended further either to the right or to
the left. We define the frame of a transcript assembly as the set of member alignments
that were not included in any other alignment. In a frame, all alignments are
linearly ordered by the strictly left relation. If alignment B in the frame is a
redundant left extension of the previous alignment A, then it can be removed without
breaking the transcript frame. According to the definition of redundancy, there is
another longer non-redundant left extension C that extends A beyond B and is
compatible with all elements in the frame. If alignment C is not in the frame
itself, then it is included into another alignment D in the frame. If B is removed
from the frame, the transcript will remain joined either by C or D. After removal of
all redundant left extensions, the frame of the transcript should be constructed
via our algorithm starting from the rightmost alignment.
At step 4, we removed redundant transcripts with the same composition of exons, or
shorter ones if they had fewer introns with a splicing consensus. Transcripts
with unspliced alternative first exon were removed if there was no promoter within
1 kb of transcription start. Transcripts with unspliced alternative last exon
were removed if the last exon had no polyA signal.
The AAA algorithm was a part of a gene and transcript assembly system that included
pre-processing and post-processing of data. The first step in pre-processing was a
temporary removal of redundant alignments that were exact copies or slightly shorter
copies (by 15 bp) of other alignments. To increase computation speed, all alignments
that were included into other ones were considered redundant if the total number of
sequences was >150. Small gaps (<15 bp) in alignments were removed and intron
boundaries were adjusted to neighboring splice sites within 15 bp distance. Unspliced
alternative first and last exons in ESTs were truncated unless they matched to
promoters or polyA signals.
The last pre-processing step was the grouping of alignments into U-clusters (=potential
genes) based on alignment overlap and clone-linking. We distinguished gross overlap
on the level of whole alignments, and fine overlap, on the exon level. All alignments
on the same strand of a chromosome were sorted by starting position and then subdivided
into gross-overlapping groups. The starting position of alignments was adjusted
according to the clone-linking information. If a clone was sequenced from the 3' and
5' ends, then two resulting EST sequences were assumed to represent the same transcript.
After these ESTs were aligned to the genome, we considered alignments clone-linked if
they were on the same strand of the same chromosome, 3' alignment was on the 3' side
relative to the 5' alignment, and the distance between alignments was <800,000 bp. The
starting position of the EST located farther from the chromosome start was set to the
starting position of another EST with which it was clone-linked. Thus, clone-linked ESTs
always appeared in the same gross-overlapping group together with all other alignments
between them.
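The clone-linking test can be expressed as a small predicate. In this sketch the `Alignment` class is hypothetical, coordinates are genome coordinates with start < end, and the "distance between alignments" is taken to be the gap between the end of one alignment and the start of the other, which is an assumption.

```python
from dataclasses import dataclass

MAX_CLONE_LINK_DISTANCE = 800_000  # bp, threshold quoted in the text


@dataclass
class Alignment:
    chrom: str
    strand: str  # "+" or "-"
    start: int   # genome coordinate, start < end
    end: int


def clone_linked(five_prime: Alignment, three_prime: Alignment) -> bool:
    """True if the 5' and 3' EST alignments of one clone can be linked."""
    if five_prime.chrom != three_prime.chrom or five_prime.strand != three_prime.strand:
        return False
    # On the "+" strand the 5' read has the lower genome coordinate;
    # on the "-" strand the order in genome coordinates is reversed.
    if five_prime.strand == "+":
        left, right = five_prime, three_prime
    else:
        left, right = three_prime, five_prime
    if right.start < left.start:
        return False  # the 3' alignment is not on the 3' side of the 5' alignment
    gap = max(0, right.start - left.end)
    return gap < MAX_CLONE_LINK_DISTANCE
```

For example, `clone_linked(Alignment("chr1", "+", 1000, 1500), Alignment("chr1", "+", 50000, 50500))` evaluates to `True`, while swapping the two arguments gives `False`.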
Each gross-overlapping group was then subdivided into U-clusters assuming that alignments
in different U-clusters had fine overlap <5% of alignment length. If 2 clone-linked EST
pairs appeared in different U-clusters within the same gross-overlapping group and each
alignment of one U-cluster was compatible with all alignments in the second U-cluster,
then these U-clusters were merged. U-clusters containing copies of the same sequence
were not clone-linked to avoid merging gene tandems. A U-cluster located entirely
within an intron of another U-cluster was considered intronic. Many intronic U-clusters
did not seem to be real genes but rather cloning artifacts. However, some of them were
real single-exon genes (e.g., Rpl12 was within Acadl, and Cks2 was within Sntg1). It is
very unlikely for a multi-exon gene to be located within an intron of another gene
because the splicing mechanism of an outer gene would not work properly. Although we
found several instances of intronic multi-exon U-clusters, we believe that most of
them were artifacts resulting from genome or alignment errors. All alignments within
the same U-cluster were submitted to the AAA algorithm to generate transcripts.
Post-processing of U-clusters and transcripts included clone-linking of transcripts,
generating genome alignments of transcripts, mending genomic gaps based on alignments
of expressed sequences, generating alignments of transcript members to transcripts,
and compiling a graph of exons. Transcripts of the same U-cluster were merged if they
contained clone-linked EST pairs and all member alignments were mutually compatible.
Gaps in the genome sequence were identified if transcripts from 2 independent sources
indicated the same gap. These gaps were mended using expressed sequence information.
Alignments of transcript members to transcripts were generated as a composition of
two alignments: the alignment of a member sequence to the genome, and a reverse
alignment of the transcript to the genome.
The exon graph has become a standard representation of possible transcript alteration
(Xing et al., 2004). Some exons are represented by multiple exon forms which differ
in their starting and ending coordinates. We constructed exon graphs for all U-clusters
using preferentially introns with splicing consensus. Introns without consensus
appeared in the graph only if no better intron was known. A retained intron was a
special case of alternative splicing that was difficult to distinguish from a splicing
error (Zhou et al. 2003). Thus, retained introns were included into the exon graph
only if their length was ≤500 bp.
Inheritance of U-cluster and transcript names from previous program runs was important
for the consistency of results. First, we matched U-clusters in the nearest
neighborhood (±4 Mbp) if they shared at least some members and the number of exons
in the new assembly was close to the number of exons in the old assembly. At the second
step we identified key sequences that matched to only one old U-cluster. Then U-clusters
were matched if they shared any of key members and the number of exons was close.
Finally we matched U-clusters that shared key sequences without considering the number
of exons. Non-matched old U-clusters were deleted, and non-matched new U-clusters
were created. Then we found matching transcripts within matching U-clusters using
key members that were found in only one old transcript.
S = L·(1 + 0.25·N/Nmax), if N ≥ 10
where L is ORF length, N is the average number of supporting mRNA/EST sequences for
each intron (RefSeq sequences were weighted as 10), and Nmax is the maximum value
for N among all transcripts of the gene.
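The score used to pick the main transcript (S = L if N < 10, otherwise S = L·(1 + 0.25·N/Nmax)) can be sketched in Python. The function and parameter names are illustrative, and choosing the transcript with the highest score as the main one is an assumption of this sketch.

```python
from typing import Dict, Tuple


def transcript_score(orf_length: float, n_support: float, n_max: float) -> float:
    """S = L if N < 10, else L * (1 + 0.25 * N / Nmax)."""
    if n_support < 10:
        return float(orf_length)
    return orf_length * (1 + 0.25 * n_support / n_max)


def main_transcript(transcripts: Dict[str, Tuple[float, float]]) -> str:
    """transcripts maps a name to (ORF length, average intron support N).
    Returns the name with the highest score (assumed selection rule)."""
    n_max = max(n for _, n in transcripts.values())
    return max(
        transcripts,
        key=lambda name: transcript_score(*transcripts[name], n_max),
    )


# Hypothetical gene with three transcripts: (ORF length in aa, N).
EXAMPLE_GENE = {"T1": (300, 5), "T2": (280, 40), "T3": (290, 20)}
print(main_transcript(EXAMPLE_GENE))
```

In this toy example T2 scores 280 × 1.25 = 350, T3 scores 290 × 1.125 = 326.25, and T1 keeps its ORF length of 300, so T2 is selected even though its ORF is the shortest.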
A U-cluster was considered a copy of another U-cluster if <30% of its members were
best matches. Cross-links were established between primary genes and their copies
based on member copies. U-clusters had a suspicious orientation if they fine-overlapped
by >50% with a better supported U-cluster in the opposite strand.
Annotations for transcripts were generated from annotations of member sequences. The
preference was given to member sequences from RefSeq and GenBank, and to sequences with
a valid symbol.
Haas, B.J., A.L. Delcher, S.M. Mount, J.R. Wortman, R.K. Smith, Jr., L.I. Hannick, R.
Maiti, C.M. Ronning, D.B. Rusch, C.D. Town, S.L. Salzberg, and O. White. 2003.
Improving the Arabidopsis genome annotation using maximal transcript alignment
assemblies. Nucleic Acids Res 31: 5654-5666.
Thierry-Mieg, D., J. Thierry-Mieg, M. Potdevin, and M. Sienkiewicz. 2004. Identification
and functional annotation of cDNA-supported genes in higher organisms using AceView.
Unpublished. http://www.aceview.org/
Volfovsky, N., B.J. Haas, and S.L. Salzberg. 2003. Computational discovery of
internal micro-exons. Genome Res 13: 1216-1221.
Xing, Y., A. Resch, and C. Lee. 2004. The multiassembly problem: reconstructing
multiple transcript isoforms from EST fragment mixtures. Genome Res 14: 426-441.
Zhou, Y., C. Zhou, L. Ye, J. Dong, H. Xu, L. Cai, L. Zhang, and L. Wei. 2003. Database
and analyses of known alternatively spliced genes in plants. Genomics 82: 584-595.
2. Magenta triangle shows the location of the gene in each scale.
3. The upper line in each pair of lines represents the positive strand; the lower line represents the negative strand.
4. Red-bordered bar is a gene or gene candidate. Click on it to see the gene structure.
5. Green-bordered bar is a non-gene (ORF<100 aa and single exon).
6. Blue '+' indicates TSS identified using the FirstEF software.
The TSS is strand-specific.
7. Gray area indicates a CpG island identified using
8. Oligos used in NIA mouse microarrays (60-mers manufactured by Agilent). Click on the oligo to get more information.
9. Names of assembled transcripts. The first part (e.g., U000006) indicates a U-cluster (gene or transcribed non-gene),
and the second part (after dash) is the transcript number. Click on the graph of a transcript to see aligned sequences.
10. Additional information on transcripts. The first column shows ORF length (aa). The second column shows the first amino acid
(M for methionine, ATG, and L for leucine, CTG) and the Kozak consensus, shown by the thickness of the green bar. The third column
indicates the source of sequences: N=NIA, E=Ensembl, Rf=RefSeq, Gb=GenBank, Est=dbEST.
11. U-cluster start.
12. U-cluster end.
13. U-cluster end.
14. Exons. Blue color = ORF; magenta = untranslated regions (UTR).
15. Introns. Black color = correct splice sites (canonical, GT-AG, as well as two major non-canonical, GC-AG and AT-AC).
Gray color = at least one splice site is incorrect
16. Transcripts that belong to other U-clusters in the same strand.
17. Transcription start.
18. Transcription end.
19. Exon number.
20. Intron length, bp (not in-scale with exons).
2. General Information
The initial objective of the NIA Mouse Gene Index was to annotate several hundred thousand
ESTs obtained from early embryos and stem cells (NIA Mouse cDNA project).
Version 1 and version 2 of the Gene Index
were assembled from the NIA set of ESTs plus additional databases:
RefSeq,
Ensembl,
Riken, and
GenBank (in ver. 2).
Assembly of Gene Index version 1 is described in Sharov et al. 2003 (PLoS Biology 1: 410-419).
2.1. Input Sequences
| Database | Downloaded | Before filtering | After filtering |
|---|---|---|---|
| RefSeq | 08/24/2004 | 26,600 | 23,647 |
| Ensembl | 09/14/2004 | 35,247 | 30,892 |
| GenBank | 08/24/2004 | 129,820 | 121,977 |
| dbEST | 08/24/2004 | 4,243,544 | 730,886 |
| NIA | 08/24/2004 | 390,030 | 378,845 |
2.2. Flow Chart of the Gene Index Development

2.3. Composition of the NIA Gene Index 4.0

Gene candidates = multi-exon or ORF ≥100 aa but not protein-coding genes; include:
(a) gene copies (=pseudogenes and possibly duplicated functional genes)
(b) repeats = sequences with >90% repeat
(c) gene models = have no EST/mRNA support
(d) non-coding genes = multiple exons but not protein-coding (see above)
Non-genes = single-exon genes with ORF <100 aa.
2.4. Navigation in the Gene Index
You can either browse genes based on their genome location,
search by annotation term or sequence name (type in
a search term and click on "Search"), or BLAT your
sequence against the genome (click on "BLAT").
Major data sets can be downloaded from here.
Genome view provides information on the genome location and exon-intron structure of
U-clusters. Three upper bars present the location of the gene at 3 scales:
whole chromosome, 3Mbp window, and 300Kbp window. Positive and negative strands
are shown separately. A user can click on a chromosome location or gene box to
get to a different gene. The interface is designed for viewing a single
U-cluster (gene). Links to genome browsers that have a zoom-in/zoom-out option
(NCBI, UCSC, Ensembl) are
provided to view the same region of the chromosome. Click on any transcript sequence
to get to the transcript view.
Transcript view provides information on a particular transcript. A link to the U-cluster
returns to the genome view. At the top there is a plot of the transcript
and its open reading frame (ORF). The character "M" or "L" at the start of the ORF indicates
that the first amino acid is methionine or leucine, respectively. A green bar next to "M" indicates
the presence of a Kozak consensus. Below the transcript there are members
of the transcript plotted on a white background: RefSeq, Ensembl, Riken,
and NIA clusters. Individual ESTs are plotted on a
gray background (NIA library name is indicated on the right). At the
bottom of the page there are lists of protein domains and a list of GO-terms
associated with the gene symbol. Click on the transcript sequence to get to
the sequence view.
Sequence view provides information on the nucleotide and protein sequence of a transcript.
In addition, it lists protein domains, GO-terms, and repeat regions.
There are links to several sequence analysis tools: BLAST, BLAT, ORF finder.
3. Comparison with other databases
U-clusters and transcripts generated here were compared with four other whole-genome mouse
gene indexes: TIGR (downloaded on 10/27/2004), Unigene (downloaded on 10/21/2004), DoTS
(downloaded on 10/21/2004), and ESTGenes (downloaded on 11/09/2004).
Because Unigene did not have assembled sequences, we
selected the best representative from each cluster. Transcripts of each gene index
were aligned to the genome using BLAT and then the overlap with the NIA transcripts
was determined by combining genome alignments. Transcripts were considered matching
if at least 30% of their length matched within genome boundaries and at least 5% length
matched to exons. This low threshold for exon overlap was set because some transcripts
in TIGR and DoTS were assembled from unspliced ESTs with retained intron length as
long as 95% of the sequence.
4. Estimating the total number of alternative splicing ATS units
Because many genes were represented by a limited number of EST/mRNA sequences that were
extracted from a limited number of tissues, our inventory of ATS units in mouse was
incomplete. Nonetheless, the number of ATS units can be estimated from the relationship
between the frequency of known ATS units and the number of supporting mRNA/EST
sequences. We limited our analysis to alternative splicing ATS units (Fig. 7), only
because the number of alternative transcription starts and terminations was less
certain and dependent on the accuracy of the predicted locations of promoters and
termination signals.
5. Methods
5.1. Filtering
Genome alignments were selected if at least 30% length matched to the genome but
not less than 40 bp, and the ratio of the total alignment length to the best
alignment was at least 0.9. Sequences that had >100 alignments and satisfied the
above listed conditions were considered repeats and removed. None of them had an
ORF with known protein domains according to searches using RPS-BLAST and CDD database
ver 2.02 (Marchler-Bauer and Bryant 2004). For other sequences we considered not more
than 50 alignments. At the next filtering step we checked the quality of the alignment
using the percent identity (PID). Short non-intronic gaps as well as short inserts
(<30 bp) in the sequence were treated as mismatches for estimating PID. The threshold
of PID = 70% was used for filtering best alignments, and PID = 85% for additional
alignments.
5.2. Assembly
The proposed All Alignment Assembly (AAA) algorithm assembled the set of all longest
transcripts from EST/mRNA sequences aligned to the genome. Each transcript consisted
of partially overlapping compatible alignments. Two alignments were considered
compatible if each sequence had no elements mapped to an intron of another sequence.
For practical purposes, we relaxed this condition so that two alignments were
considered compatible if all non-compatible fragments were shorter than 15 bp. This
made the assembly less sensitive to sequencing and alignment errors. The compatibility
relationship among alignments was non-transitive. This means that if alignments A and
B were compatible, and B and C were compatible, then A and C were not always compatible.
However, in a set of sequences that extended each other from right to left (or from
left to right), the compatibility relation became transitive and could be chained to
produce longer transcripts (Haas et al. 2003, Eyras et al. 2004).
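The relaxed compatibility test, and the fact that it is not transitive, can be illustrated with a short Python sketch. It assumes that an alignment is represented as a sorted list of exon intervals in 0-based, half-open genome coordinates; this representation and the function names are illustrative only.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based, half-open


def introns(exons: List[Interval]) -> List[Interval]:
    """Gaps between consecutive exons of one alignment."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def compatible(a: List[Interval], b: List[Interval], tolerance: int = 15) -> bool:
    """Two alignments are compatible if no exon of one reaches `tolerance`
    or more bp into an intron of the other."""
    for x, y in ((a, b), (b, a)):
        for exon in x:
            for intron in introns(y):
                if overlap(exon, intron) >= tolerance:
                    return False
    return True


# A and C differ by an extra exon of C inside the intron of A;
# B overlaps only the last exon of both and is compatible with each.
ALIGN_A = [(0, 100), (200, 300)]
ALIGN_B = [(250, 300), (400, 500)]
ALIGN_C = [(0, 100), (120, 180), (200, 300)]

print(compatible(ALIGN_A, ALIGN_B), compatible(ALIGN_B, ALIGN_C), compatible(ALIGN_A, ALIGN_C))
```

Here A is compatible with B and B with C, but A and C are not compatible, because the 60-bp exon of C lies inside the intron of A.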
2 Sort the set S of all left extensions for A by increasing left boundary
3 For each extension s in S{
4 For each non-redundant extension n in N{
5 If s is compatible with n{
6 Initialize stack T with alignment s
7 While T is non-empty{
8 Extract last assembly [q0,q1,q2, ... ,qm] from stack T.
9 For each non-redundant extension Q of the last element (qm){
10 If Q is compatible with n{
11 If left end of Q is equal or left to the left end of n{
12 Next Q (line 9)
13 }
14 else{
15 Push assembly [q0,q1,q2, ..., qm,Q] into stack T.
16 }
17 }
18 else{
19 s is non-redundant comparing to n; try next n (line 4)
20 }
21 }
22 }
23 s is redundant; go to next s (line 3)
24 }
25 }
26 s is non-redundant; push s into set N; go to next s (line 3)
27 }
28 Return N
5.3. Analysis of transcripts
Analysis of transcripts included identification of (1) the longest open reading frame
(ORF), (2) repeat regions, (3) main transcript for each U-cluster, (4) duplicated
U-clusters, (5) U-clusters with suspicious orientation, and (6) generating annotations
for transcripts and U-clusters.
ORF was detected using the ORF Finder software (Wheeler, 2004) with both standard
and alternative genetic code options. Because generated transcripts might have
contained ORF shifts resulting from single nucleotide insertions/deletions, we
analyzed not just individual ORFs but also composite ORFs consisting of a pair of
overlapping ORFs if each portion was longer than 100 aa. The threshold of 100 aa
was selected because ORFs of this length are highly unlikely (P<0.01) to appear in
random sequences. If the difference in length between a single ORF and composite
ORF was <100 aa, then a single ORF was selected. As a result only ca. 5% of transcripts
appeared to have composite ORFs. Genomic repeat sequences were already masked in
the mouse genome database (mm4), thus we simply projected them onto transcript
sequences. Main transcripts for each U-cluster were identified based on the score
S = L, if N < 10,
References
Eyras, E., M. Caccamo, V. Curwen, and M. Clamp. 2004. ESTGenes: alternative splicing
from ESTs in Ensembl. Genome Res 14: 976-987.