NIA Mouse Gene Index Ver. 1. Sharov et al. 2003. PLoS Biology 1: 410-419

Flow Chart of the Gene Index Development
Using TIGR gene indices clustering tools (Pertea et al. 2003), 249,200 ESTs (170,059 cDNA clones) were clustered, generating 58,713 consensuses and singletons. NIA consensuses and singletons were further clustered with Ensembl transcripts, RIKEN transcripts, and RefSeq transcripts and transcript predictions. Alignments of these sequences to the mouse genome (UCSC February 2002 freeze data) using BLAT helped to avoid false clustering of similar sequences at nonmatching genome locations. Erroneous clusters were reassembled based on the analysis of the genome alignment. A total of 94,039 putative transcripts (called NAP) were thus generated and then grouped into 39,678 putative genes (called U-clusters) based on their overlap in the genome on the same chromosome strand and on clone-linking information. Using the criteria of an ORF greater than 100 amino acids or of multiple exons (excluding sequences that are potentially located on the wrong strand), 29,810 mouse genes were identified. Finally, 977 genes unique to the NIA database were identified.
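The grouping step and the gene criteria described above can be illustrated with a minimal sketch. It is not the pipeline used to build the Gene Index: the `Transcript` record and the functions `group_by_overlap` and `is_gene` are hypothetical names, overlap is tested on whole genomic spans rather than on exons, clone-linking information and the wrong-strand exclusion are ignored, and the gene criteria are assumed to apply if any member of a cluster satisfies them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Transcript:
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    chrom: str
    strand: str
    start: int    # genomic start, inclusive
    end: int      # genomic end, exclusive
    orf_aa: int   # ORF length in amino acids
    n_exons: int  # number of exons in the genome alignment


def group_by_overlap(transcripts):
    """Group transcripts whose genomic spans overlap on the same chromosome strand."""
    by_strand = defaultdict(list)
    for t in transcripts:
        by_strand[(t.chrom, t.strand)].append(t)

    clusters = []
    for key in sorted(by_strand):
        ordered = sorted(by_strand[key], key=lambda t: t.start)
        current = [ordered[0]]
        current_end = ordered[0].end
        for t in ordered[1:]:
            if t.start < current_end:
                # Overlaps the running span: same cluster.
                current.append(t)
                current_end = max(current_end, t.end)
            else:
                clusters.append(current)
                current = [t]
                current_end = t.end
        clusters.append(current)
    return clusters


def is_gene(cluster):
    """Assumed reading of the criteria: ORF > 100 aa or more than one exon."""
    return any(t.orf_aa > 100 or t.n_exons > 1 for t in cluster)


example = [
    Transcript("t1", "chr1", "+", 100, 500, 150, 3),
    Transcript("t2", "chr1", "+", 400, 900, 80, 2),
    Transcript("t3", "chr1", "-", 450, 800, 50, 1),
    Transcript("t4", "chr1", "+", 2000, 2600, 90, 1),
]
clusters = group_by_overlap(example)
cluster_names = [[t.name for t in c] for c in clusters]
gene_flags = [is_gene(c) for c in clusters]
```

In this toy input, t1 and t2 overlap on the plus strand and fall into one cluster, while t3 lies on the opposite strand and t4 does not overlap anything, so each of those forms its own cluster.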
The major data sets of the Gene Index can be downloaded: NAP sequences in FASTA format, U-cluster members, and the list of U-clusters that are genes.
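Once the NAP sequence file has been downloaded, it can be read with any FASTA parser. The following is a minimal sketch; the function name `read_fasta` and the record names in the demo are hypothetical, and the actual header layout of the NAP file is not described here, so the code simply takes the first whitespace-delimited word of each header as the sequence name.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Demo on an in-memory file with made-up records.
demo = io.StringIO(">seq1 hypothetical record\nACGT\nTTGA\n>seq2\nGGCC\n")
records = list(read_fasta(demo))
```

For a real file, pass an open text handle instead of the `io.StringIO` object.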
The Gene Index summary table lists the number of U-clusters with specific properties (ORF length, number of exons) that have members from various databases. The databases are listed on the left side: Rik=RIKEN, Ens=Ensembl, Ref=RefSeq, NM=RefSeq NM series. If a line has "+ - - - -", the U-clusters counted in it have sequences that are present only in the NIA database. Column headers show the number of exons: 1 for single-exon and N for multi-exon U-clusters. "M & Kozak" means that the first amino acid in the ORF is methionine and the Kozak consensus is adequate. ORF length is considered short if it is <100 aa, medium if it is >=100 aa and <200 aa, and long if it is >=200 aa.
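The ORF length classes used in the summary table can be written as a small function. This is only an illustration of the thresholds stated above; the function name `orf_length_class` is hypothetical.

```python
def orf_length_class(orf_aa: int) -> str:
    """Classify an ORF by its length in amino acids."""
    if orf_aa < 100:
        return "short"
    if orf_aa < 200:
        return "medium"
    return "long"


classes = [orf_length_class(n) for n in (99, 100, 199, 200)]
```

The boundary values show that 100 aa falls in the medium class and 200 aa in the long class.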
At the bottom of the web page you can find lists of NAP sequences that were extended at the 3' or 5' end by NIA sequences, and a list of fully sequenced clones.
Transcript view
provides information on a particular transcript. A link to the U-cluster
returns to the genomic view. At the top there is a plot of the transcript
and its open reading frame (ORF). The character "M" at the start of the ORF
indicates that the first amino acid is methionine. A green bar next to "M"
indicates the presence of a Kozak consensus. Below the transcript, the members
of the transcript are plotted on a white background: RefSeq, Ensembl, RIKEN,
and NIA clusters. Individual ESTs from NIA libraries are plotted on a
gray background (a short library name is indicated on the right). At the
bottom of the page there are lists of protein domains and a list of GO-terms
associated with the gene symbol. Click on the transcript sequence to get to
the sequence view.
Sequence view
provides information on the nucleotide and protein sequence of a transcript.
In addition, it lists protein domains, GO-terms, and repeat regions.
There are links to several sequence analysis tools: BLAST, BLAT, ORF finder.