NIA Mouse Gene Index Ver. 3

Gene Index Help

General Information

The initial objective of the NIA Mouse Gene Index was to annotate several hundred thousand ESTs obtained from early embryos and stem cells (NIA Mouse cDNA project).
Versions 1 and 2 of the Gene Index were assembled from the NIA set of ESTs plus additional databases: RefSeq, Ensembl, Riken, and GenBank (in ver. 2). Assembly of Gene Index version 1 is described in Sharov et al. 2003 (PLoS Biology 1: 410-419).

Version 3 of the NIA Mouse Gene Index differs from earlier versions in 3 respects:

  • It uses the dbEST database (in addition to the databases mentioned above) to construct all gene transcripts supported by known ESTs.
  • Gene index assembly was fully automated.
  • It includes 73,873 novel EST sequences generated at NIA after August 2003.

    Flow Chart of the Gene Index Development

    The NIA Mouse Gene Index ver. 3 was assembled from sequences aligned to the genome (October 2003 release) using our new All-Alignment-Assembly (AAA) algorithm. The start and end sites of each intron were examined for a splicing consensus. We accepted the canonical consensus (GT-AG) as well as the 2 major non-canonical consensuses (GC-AG and AT-AC), all of which are well validated (Burset et al. 2000).
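The intron check described above can be illustrated with a minimal sketch. This is not the AAA algorithm itself; the function name, the 0-based end-exclusive coordinates, and the toy sequence are assumptions made for illustration, and only plus-strand introns are handled.

```python
# Intron boundary dinucleotide pairs named in the text:
# canonical GT-AG plus the non-canonical GC-AG and AT-AC.
ACCEPTED_SPLICE_PAIRS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def has_splice_consensus(genome_seq, intron_start, intron_end):
    """Return True if the plus-strand intron genome_seq[intron_start:intron_end]
    starts and ends with one of the accepted dinucleotide pairs.

    Coordinates are 0-based and end-exclusive (an assumption of this sketch).
    """
    intron = genome_seq[intron_start:intron_end].upper()
    if len(intron) < 4:
        # Too short to hold separate donor and acceptor dinucleotides.
        return False
    return (intron[:2], intron[-2:]) in ACCEPTED_SPLICE_PAIRS


# Toy example: a 13-bp "intron" flanked by two short exons.
toy_genome = "AAACCC" + "GTAAGTTTTTCAG" + "GGGTTT"
print(has_splice_consensus(toy_genome, 6, 19))
```

For an alignment on the minus strand, the intron sequence would need to be reverse-complemented before the same test is applied.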

    The Gene Index has 145,083 U-clusters (transcription loci) and 218,812 transcripts. A U-cluster was classified as a gene if it had an ORF >= 100 aa, or multiple exons separated by an intron with a splice site consensus, or a gene symbol annotated by RefSeq, GenBank, or Ensembl for some member alignment. Among the 43,069 genes, 27,316 were protein-coding (ORF >= 100 aa or known function), 6,717 were non-coding genes or gene fragments with ORF < 100 aa, 959 had high repeat content (>90%), 1,842 were gene models from Ensembl and RefSeq-XM with no EST or mRNA support in our assembly, and 6,235 were gene duplications and/or pseudogenes.
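The three gene criteria can be written as a single predicate. The sketch below is illustrative only: the `UCluster` record and its field names are hypothetical and do not reflect the actual Gene Index data model.

```python
from dataclasses import dataclass
from typing import Optional

MIN_ORF_AA = 100  # ORF length threshold stated in the text


@dataclass
class UCluster:
    """Hypothetical summary of a U-cluster (field names are assumptions)."""
    longest_orf_aa: int                # longest ORF among its transcripts, in aa
    consensus_introns: int             # introns with an accepted splice consensus
    gene_symbol: Optional[str] = None  # symbol from RefSeq, GenBank, or Ensembl


def is_gene(cluster):
    """Apply the three criteria; meeting any one of them is sufficient."""
    return (
        cluster.longest_orf_aa >= MIN_ORF_AA
        or cluster.consensus_introns >= 1
        or cluster.gene_symbol is not None
    )


# The category counts quoted in the text add up to the total number of genes.
GENE_CATEGORY_COUNTS = {
    "protein-coding": 27316,
    "non-coding or fragment (ORF < 100 aa)": 6717,
    "high repeat content": 959,
    "gene model without EST/mRNA support": 1842,
    "duplication and/or pseudogene": 6235,
}
print(sum(GENE_CATEGORY_COUNTS.values()))
```

A single-exon cluster with a short ORF and no gene symbol fails all three tests and remains an unclassified transcription locus.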

    Navigation in the Gene Index

    You can browse genes by their genome location, search by annotation term or sequence name (type in a search term and click on "Search"), or
    BLAT your sequence against the genome (click on "BLAT").

    Major data sets can be downloaded from here.

    Major Levels of the Gene Index

    Genome view
    provides information on the genome location and exon-intron structure of U-clusters. The three upper bars show the location of the gene at 3 scales: whole chromosome, 3 Mbp window, and 300 Kbp window. Positive and negative strands are shown separately. Click on a chromosome location or a gene box to move to a different gene. The interface is designed for viewing a single U-cluster (gene). Links to genome browsers that have a zoom-in/zoom-out option (NCBI, UCSC, Ensembl) are provided for viewing the same region of the chromosome. Click on any transcript sequence to get to the transcript view.

    Transcript view
    provides information on a particular transcript. A link to the U-cluster returns to the genome view. At the top there is a plot of the transcript and its open reading frame (ORF). The character "M" or "L" at the start of the ORF indicates that the first amino acid is methionine or leucine, respectively. A green bar next to "M" indicates the presence of a Kozak consensus. Below the transcript, its member sequences (RefSeq, Ensembl, Riken, and NIA clusters) are plotted on a white background. Individual ESTs are plotted on a gray background (the NIA library name is indicated on the right). At the bottom of the page there are a list of protein domains and a list of GO-terms associated with the gene symbol. Click on the transcript sequence to get to the sequence view.
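As a small illustration of the start-of-ORF label, the sketch below translates the first codon of an ORF with the standard genetic code, reading "L" as the standard one-letter code for leucine. The function name is hypothetical, and how the Gene Index itself selects ORF start positions is not described here, so the sketch only labels a start codon it is given.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def first_residue(orf_nt):
    """Return the one-letter amino acid code for the first codon of an ORF.

    Returns "X" if the first codon is incomplete or contains ambiguous bases.
    """
    codon = orf_nt[:3].upper().replace("U", "T")
    return CODON_TABLE.get(codon, "X")


print(first_residue("ATGGCCAAA"))  # ATG encodes methionine
print(first_residue("CTGGCCAAA"))  # CTG encodes leucine
```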

    Sequence view
    provides information on the nucleotide and protein sequence of a transcript. In addition, it lists protein domains, GO-terms, and repeat regions. There are links to several sequence analysis tools: BLAST, BLAT, and ORF finder.

    Terminology disclaimer

    Our use of the terms "gene", "pseudogene", and "protein-coding gene" is based on formal criteria that do not always agree with existing data on the function of specific genes. For example, the non-coding gene H19 is shown as "protein-coding" because the software detected an ORF longer than 100 aa. We do not treat pseudogenes and genes as opposites, but rather consider pseudogenes to be redundant genes. It was not our objective to determine which copy of a gene is actually functional.