NIA Mouse Gene Index Ver. 1
Sharov et al. 2003. PLoS Biology 1: 410-419

Gene Index Help

Objectives for Gene Index development

  • Link information on 175,059 cDNA clones from mouse early development and stem cells developed at NIA with public databases
  • Find new genes specific to mouse early development and stem cells
  • Improve information on already known genes: extend the 5' end and identify new first exons and new splice variants

    Flow Chart of the Gene Index Development

    Using the TIGR Gene Indices clustering tools (Pertea et al. 2003), 249,200 ESTs (170,059 cDNA clones) were clustered, generating 58,713 consensuses and singletons. The NIA consensuses and singletons were then clustered further with Ensembl transcripts, RIKEN transcripts, and RefSeq transcripts and transcript predictions. Aligning these sequences to the mouse genome (UCSC February 2002 freeze) with BLAT helped to avoid false clustering of similar sequences that map to different genome locations. Erroneous clusters were reassembled based on analysis of the genome alignments. A total of 94,039 putative transcripts (called NAP) were thus generated and then grouped into 39,678 putative genes (called U-clusters) based on their overlap in the genome on the same chromosome strand and on clone-linking information. Using the criteria of an ORF longer than 100 amino acids or multiple exons (excluding sequences potentially located on the wrong strand), 29,810 mouse genes were identified. Finally, 977 genes unique to the NIA database were identified.
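
    The grouping and gene-calling steps described above can be illustrated with a minimal sketch. It is not the pipeline used to build the Gene Index: it assumes that "overlap" means overlapping genome spans, it applies the ORF and exon criteria to any member of a cluster, and it leaves out clone-linking information and the wrong-strand exclusion. The Transcript record and all names and coordinates are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Transcript(NamedTuple):
    """Hypothetical record for one putative transcript aligned to the genome."""
    name: str
    chrom: str
    strand: str   # "+" or "-"
    start: int    # genome start, inclusive
    end: int      # genome end, inclusive
    n_exons: int
    orf_aa: int   # ORF length in amino acids


def group_into_clusters(transcripts):
    """Group transcripts whose genome spans overlap on the same chromosome strand."""
    by_strand = defaultdict(list)
    for t in transcripts:
        by_strand[(t.chrom, t.strand)].append(t)

    clusters = []
    for key in sorted(by_strand):
        current, current_end = [], None
        # Sweep left to right; a transcript joins the open cluster if it
        # starts before the cluster's rightmost end.
        for t in sorted(by_strand[key], key=lambda t: t.start):
            if current and t.start <= current_end:
                current.append(t)
                current_end = max(current_end, t.end)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [t], t.end
        if current:
            clusters.append(current)
    return clusters


def is_gene(cluster, min_orf_aa=100):
    """Assumed reading of the criterion: ORF > 100 aa or more than one exon."""
    return any(t.orf_aa > min_orf_aa or t.n_exons > 1 for t in cluster)


# Made-up example data.
transcripts = [
    Transcript("NAP1", "chr1", "+", 100, 500, 3, 80),
    Transcript("NAP2", "chr1", "+", 450, 900, 1, 120),
    Transcript("NAP3", "chr1", "-", 460, 800, 1, 50),
    Transcript("NAP4", "chr2", "+", 100, 300, 1, 150),
]
clusters = group_into_clusters(transcripts)
cluster_names = [[t.name for t in c] for c in clusters]
gene_flags = [is_gene(c) for c in clusters]
```

    In this example NAP1 and NAP2 fall into one cluster because their spans overlap on the same strand, while NAP3 stays separate because it lies on the opposite strand.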

    Navigation in the Gene Index

    A user can browse genes by genome location (select a chromosome and click on "Browse U-clusters"), search by annotation term or sequence name (type in a search term and click on "Search"), or BLAST a sequence against all transcripts (NAP sequences) and follow the best matches (click on "BLAST").

    It is possible to download the major data sets of the Gene Index: NAP sequences in FASTA format, U-cluster members, and the list of U-clusters that are genes.
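
    A downloaded FASTA file can be read with a few lines of code. The sketch below is a generic FASTA reader, not a tool provided by the Gene Index; the sequence identifiers in the example are hypothetical, and the layout of the real header lines is not specified here.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# In-memory stand-in for a downloaded file; identifiers are made up.
example = io.StringIO(">NAP000001 example\nACGTAC\nGTA\n>NAP000002\nTTGACA\n")
records = list(read_fasta(example))
```

    For a real download, the same function can be called on a file opened with open(path), where path is wherever the file was saved.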

    The Gene Index summary table lists the number of U-clusters with specific properties (ORF length, number of exons) that have members from various databases. The databases are listed on the left side: Rik=RIKEN, Ens=Ensembl, Ref=RefSeq, NM=RefSeq NM series. A line marked "+ - - - -" means that the U-clusters contain sequences present only in the NIA database. Column headers show the number of exons: 1 for single-exon and N for multi-exon U-clusters. "M & Kozak" means that the first amino acid in the ORF is methionine and the Kozak consensus is adequate. ORF length is considered short if it is <100 aa, medium if it is >=100 and <200 aa, and long if it is >=200 aa.
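
    The table conventions above can be expressed as a short sketch. The ORF length thresholds and the 1/N exon labels come from the text; the column order of the "+ -" pattern (NIA first, then Rik, Ens, Ref, NM) is an assumption inferred from the "+ - - - -" example, and the function names are hypothetical.

```python
# Assumed order of the five +/- flags; only the NIA-first position is
# implied by the "+ - - - -" example, the rest follows the listing order.
DATABASES = ("NIA", "Rik", "Ens", "Ref", "NM")


def orf_class(length_aa):
    """Classify ORF length: short (<100 aa), medium (100-199 aa), long (>=200 aa)."""
    if length_aa < 100:
        return "short"
    if length_aa < 200:
        return "medium"
    return "long"


def exon_column(n_exons):
    """Return the column label: "1" for single-exon, "N" for multi-exon."""
    return "1" if n_exons == 1 else "N"


def membership(pattern):
    """Decode a pattern such as "+ - - - -" into the databases marked "+"."""
    flags = pattern.split()
    if len(flags) != len(DATABASES):
        raise ValueError("expected %d flags, got %d" % (len(DATABASES), len(flags)))
    return [db for db, flag in zip(DATABASES, flags) if flag == "+"]


nia_only = membership("+ - - - -")
```

    Under these assumptions, a U-cluster with the pattern "+ - - - -" is decoded as having members from NIA only.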

    At the bottom of the web page you can find lists of NAP sequences that were extended at the 3' or 5' end by NIA sequences, and a list of fully sequenced clones.

    Major Levels of the Gene Index

    Genome view
    provides information on the genome location and exon-intron structure of U-clusters. The three upper bars show the location of the gene at three scales: whole chromosome, 3 Mbp window, and 300 Kbp window. Positive and negative strands are shown separately. A user can click on a chromosome location or gene box to move to a different gene. The interface is designed for viewing a single U-cluster (gene). Links to genome browsers (NCBI, UCSC, Ensembl) are provided to view the same region of the chromosome; these browsers have zoom-in/zoom-out options. Click on any NAP sequence to get to the transcript view.

    Transcript view
    provides information on a particular transcript. A link to the U-cluster returns to the genome view. At the top there is a plot of the transcript and its open reading frame (ORF). The character "M" at the start of the ORF indicates that the first amino acid is methionine. A green bar next to "M" indicates the presence of a Kozak consensus. Below the transcript, its members are plotted on a white background: RefSeq, Ensembl, RIKEN, and NIA clusters. Individual ESTs from NIA libraries are plotted on a gray background (a short library name is indicated on the right). At the bottom of the page there are lists of protein domains and a list of GO terms associated with the gene symbol. Click on the transcript sequence to get to the sequence view.
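
    The "M" and Kozak markers can be illustrated with a small check on a cDNA sequence. The exact rule the Gene Index uses to call a Kozak consensus is not given here; the sketch below assumes a common rule of thumb (a purine at position -3 or a G at position +4, counting the A of ATG as +1), and the function names and sequences are hypothetical.

```python
def starts_with_methionine(cdna, orf_start):
    """True if the ORF beginning at orf_start (0-based) starts with ATG."""
    return cdna[orf_start:orf_start + 3].upper() == "ATG"


def kozak_is_adequate(cdna, orf_start):
    """Assumed rule of thumb: purine at -3 or G at +4 around the start codon."""
    seq = cdna.upper()
    if seq[orf_start:orf_start + 3] != "ATG":
        return False
    # Position -3 is three bases upstream of the A; +4 directly follows the G.
    minus3 = seq[orf_start - 3] if orf_start >= 3 else ""
    plus4 = seq[orf_start + 3] if orf_start + 3 < len(seq) else ""
    return minus3 in ("A", "G") or plus4 == "G"


# Made-up sequences with the start codon at 0-based position 6.
good_context = kozak_is_adequate("GCCACCATGGCG", 6)
poor_context = kozak_is_adequate("TTTTTTATGCCC", 6)
```

    The first example has A at -3 and G at +4 and passes the check; the second has T at -3 and C at +4 and does not.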

    Sequence view
    provides information on the nucleotide and protein sequence of a transcript. In addition, it lists protein domains, GO terms, and repeat regions. There are links to several sequence analysis tools: BLAST, BLAT, and ORF finder.