CisView: Mouse (mm9, Jul 2007)
Terminology
1. Seven scales for gene location in the chromosome: whole chromosome, 3-Mb, 300-Kb, 60-Kb, 4-Kb, 500-bp, and 80-bp windows.
You can navigate in all windows by clicking on a gene or location. Navigation in the whole chromosome, 3-Mb, and 300-Kb windows
will lead to a new 60-Kb region that is centered at a TSS of another gene or an alternative TSS
of the same gene. Navigation in all other windows will only change the zoom area (yellow band).
At the 500-bp and 80-bp scales, click on the sequence to change the position.
The third subset (N = 27) of medium-quality TSS was taken from RefSeq sequences if they
matched FirstEF software predictions. Finally, low-quality TSS (N = 12960) were
taken from the NIA Mouse Gene Index if they did not match other data sources.
Recent experimental data with CAGE tags showed that many promoters had a cluster of
transcription starts rather than a single TSS [20]. However, in the current version of CisView
we use only one TSS per promoter as identified by DBTSS, NIA Mouse Gene Index, or FirstEF
unless the TSS have opposite orientations or are separated by >500 bp. Considering all possible
transcription starts within a promoter is not currently feasible because it would make
the analysis too slow for interactive web-based software. Most functions of CisView
(e.g., finding binding sites within 1 kb upstream of a TSS) are not critically affected
by an uncertainty of up to 100 bp in the TSS position.
Tentative promoter boundaries for high- and medium-quality TSS were set to the bounds of a
CpG island if it was present at TSS, otherwise they were assumed to span from -200 to +100
bp. The promoter boundaries were then adjusted by excluding transposon-related repeats and
CDS, followed by merging with potential CRMs (see below). Promoters for low-quality TSS
were considered only if they coincided with a potential CRM.
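The first step of the boundary rule above (CpG island bounds if an island is present at the TSS, otherwise -200 to +100 bp) can be sketched as follows. This is a minimal illustration, not the CisView code: the function name, argument names, and the handling of the negative strand are assumptions, and the later adjustments (excluding repeats and CDS, merging with CRMs) are not shown.

```python
def tentative_promoter(tss, strand, cpg_island=None, upstream=200, downstream=100):
    """Return (start, end) genomic coordinates of a tentative promoter.

    tss        -- genomic coordinate of the transcription start site
    strand     -- '+' or '-'
    cpg_island -- optional (start, end) interval; used only if it contains the TSS
    """
    # Rule 1: a CpG island present at the TSS defines the promoter bounds.
    if cpg_island is not None and cpg_island[0] <= tss <= cpg_island[1]:
        return tuple(cpg_island)
    # Rule 2: otherwise span -200..+100 bp relative to the TSS.
    # On the negative strand "upstream" means larger coordinates (assumption).
    if strand == '+':
        return (tss - upstream, tss + downstream)
    return (tss - downstream, tss + upstream)
```

For example, a TSS at position 1000 on the positive strand with no CpG island gives the interval (800, 1100).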
Search for patterns allowed no mismatches, although patterns themselves were degenerate
(i.e., contained symbols R, Y, N, etc.). Patterns with >=18 bit information content (N = 19)
had too few hits, thus we treated them as matrices and allowed mismatches. Matrix-based search
was implemented in 2 steps: (1) search for the exact match of the core pattern, and (2) estimate
the similarity measure using the matrix. The core pattern consisted of 3 or 4 elements
characterized by the 2 most dramatic changes in nucleotide frequency between positions, measured by cj,
where cj is the degree of change from position j to position j+1, pij is the frequency of
nucleotide i at position j, and Ij is the information measure at position j.
For example, the core pattern for the SP1 binding site was GCG. Core patterns were allowed to
be degenerate and included nucleotides that occurred at frequencies greater than 50% of the
maximum frequency at that position. Some core patterns had 2 pairs of nucleotides separated
by some distance. For example, the TF_CDP binding site had the core ATNNAT. An exact match of
the core ensured the proper position of the matrix and reduced the number of false positives.
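The similarity score is equal to the sum of character heights in a sequence logo divided by
the sum of maximum heights at all positions:
$$S = \frac{\sum_j p_{n(j),j}\, I_j}{\sum_j \max_i (p_{ij})\, I_j}$$
where n(j) is the nucleotide in the sequence at position j, and the height of nucleotide i at position j in the logo is pij*Ij.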
It is equivalent to the score used in
MatInspector (Quandt 1995). The minimum allowed similarity threshold was 0.8
(i.e., 20% mismatch); however, for abundant binding sites we used higher similarity thresholds adjusted
so that the frequency of matches in CpG-rich and CpG-poor semi-random sequences did not
exceed 1 per 500 bp and 1 per 2000 bp, respectively. We used different thresholds for CpG-rich and
CpG-poor sequences because CpG-rich sequences are usually richer in functional TFBSs.
Semi-random sequences were generated using 3rd-order Markov models with transition
probabilities estimated from CpG-rich and CpG-poor mouse promoters.
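The similarity score described above (sum of logo heights of the observed nucleotides divided by the sum of maximum heights) can be sketched as follows. This is an illustration under stated assumptions rather than the CisView implementation: the information measure is assumed to be the usual Shannon measure for DNA, 2 + sum of p*log2(p) bits, and the matrix is assumed to be a list of per-position frequency dictionaries. The function names are hypothetical.

```python
import math

def information(column):
    """Information content (bits) of one matrix column of base frequencies.
    Assumes the standard measure 2 + sum(p * log2 p) for a 4-letter alphabet."""
    return 2.0 + sum(p * math.log2(p) for p in column.values() if p > 0)

def similarity(matrix, site):
    """Sum of logo heights of the observed bases divided by the sum of the
    maximum logo heights over all positions.
    Raises ZeroDivisionError if no column carries any information."""
    if len(matrix) != len(site):
        raise ValueError("site and matrix must have the same length")
    observed = 0.0
    best = 0.0
    for column, base in zip(matrix, site):
        info = information(column)
        observed += column[base] * info          # height of the observed base
        best += max(column.values()) * info      # tallest letter at this position
    return observed / best

# Toy 3-position matrix (made-up frequencies, for illustration only).
toy_matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
]
```

With this toy matrix, the site AAG scores 1.0, while TAG loses the fully conserved first position and falls below the 0.8 threshold.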
The presence of high-quality TFBS, as well as of multiple TFBS of the same kind, in a CRM is
considered an indicator of its function as a transcription regulator (Blanchette et al. 2006).
Thus, we evaluated the
regulatory potential of a CRM by a score, RPS, which was the sum of scores for individual
TFBS and scores for multiple TFBS of the same kind. Our method of estimating RPS
differs from that of Elnitski et al. (2003). We used only one genome (mouse), an evolutionary
conservation score, and matches of known TFBS patterns, whereas Elnitski et al. (2003) used multiple
genomes without considering known TFBS patterns. The probability of accidental TFBS
occurrence within a CRM of length L was estimated as p = D(s)*L, where s is the similarity
score of the binding site and D(s) is the density of binding sites with similarity score s
in a semi-random sequence generated using a 3rd-order Markov process. Depending on whether a
TFBS was in a CpG-rich or CpG-poor region, we used a semi-random sequence generated with
transition probabilities estimated from CpG-rich or CpG-poor regions in the mouse genome,
respectively. The regulatory score for a TFBS was estimated as -log10(p) - 2 if p < 0.01 or
set to 0 otherwise. The probability of accidental occurrence of multiple binding sites of
the same kind, pm, was estimated as the product of probabilities of their individual
occurrences, p. The regulatory score for multiple TFBS was estimated as -log10(pm) - 2
if pm < 0.01 or set to 0 otherwise. The regulatory potential score, RPS, which is a sum of
scores for individual TFBS and multiple TFBS, was then estimated for all CRMs in the mouse
genome. The probability distribution of RPS within CRMs of each size class (from 50 to 150;
from 150 to 250; from 250 to 350; ...; >1950 bp) was then compared with the probability
distribution of RPS estimated for semi-random sequences of size 100, 200, ..., 1900, >1900 bp
(the last class included sequence sizes from 2000 to 3000 bp) generated using 3rd-order
Markov process with transition probabilities from CpG-rich or CpG-poor regions. Probability
distributions of RPS were very similar for CpG-rich and CpG-poor semi-random sequences
(see Figure below), thus we averaged them and used the average to estimate p-values and the false discovery
rate (FDR) of RPS in CRMs of the same size class. After sorting all CRMs by increasing
p-value, we estimated the false discovery rate for the i-th CRM as FDRi = pi*N/i, where pi
is the p-value for the i-th CRM and N is the total number of CRMs. We considered that a
CRM had a significantly higher RPS than semi-random sequences if its FDR was ≤0.1.
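The scoring rules above can be summarized in a short sketch. It is only an illustration of the formulas as stated in the text (p = D(s)*L, score = -log10(p) - 2 if p < 0.01, product of probabilities for multiple sites of the same kind, FDRi = pi*N/i). The function names and the `density` callback, which stands in for D(s), are hypothetical, and the background densities themselves would have to come from semi-random sequences.

```python
import math

def score_from_probability(p):
    """-log10(p) - 2 if p < 0.01, otherwise 0."""
    return -math.log10(p) - 2 if p < 0.01 else 0.0

def regulatory_potential_score(sites, crm_length, density):
    """sites: list of (tfbs_name, similarity_score) found in one CRM.
    density(name, s) stands in for D(s), the background density of such sites."""
    total = 0.0
    by_kind = {}
    for name, s in sites:
        p = density(name, s) * crm_length        # p = D(s) * L
        total += score_from_probability(p)       # score of the individual site
        by_kind.setdefault(name, []).append(p)
    for probs in by_kind.values():
        if len(probs) > 1:                       # multiple sites of the same kind
            total += score_from_probability(math.prod(probs))
    return total

def fdr_by_rank(p_values):
    """FDR_i = p_i * N / i after sorting by increasing p-value.
    Results are returned in the original input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    out = [0.0] * n
    for rank, k in enumerate(order, start=1):
        out[k] = p_values[k] * n / rank
    return out
```

For instance, with a constant background density of 1e-6 per bp and a 100-bp CRM, each site has p = 1e-4 and scores 2; two sites of the same kind add a further score of 6 for the pair, giving an RPS of 10.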
Figure: Examples of the cumulative distribution of regulatory potential score (RPS) in
cis-regulatory modules (CRMs) in the mouse genome and in semi-random sequences of the
same size generated with a 3rd-order Markov process with transition probabilities specific
to CpG-rich and CpG-poor genome regions. (A) CRM size from 50 to 150 bp. (B) CRM size
from 1450 to 1550 bp.
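Semi-random sequences of the kind used for these background distributions can be generated with a 3rd-order Markov process, in which each base is drawn conditionally on the three preceding bases. The sketch below is a generic illustration, not the CisView code: the function names, the way the starting context is chosen, and the handling of unseen contexts are assumptions.

```python
import random
from collections import defaultdict

def train_markov3(sequences):
    """Count how often each base follows each 3-base context."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for k in range(len(seq) - 3):
            counts[seq[k:k + 3]][seq[k + 3]] += 1
    return counts

def generate(counts, length, rng):
    """Generate a semi-random sequence of the given length.
    Requires at least one trained context (training sequences of >= 4 bases)."""
    contexts = sorted(counts)
    seq = rng.choice(contexts)                 # start from a context seen in training
    while len(seq) < length:
        following = counts.get(seq[-3:])
        if not following:                      # unseen context: restart from a known one
            seq += rng.choice(contexts)
            continue
        bases = sorted(following)
        weights = [following[b] for b in bases]
        seq += rng.choices(bases, weights=weights)[0]
    return seq[:length]
```

Training one model on CpG-rich regions and another on CpG-poor regions, then calling `generate(counts, 200, random.Random(1))`, would give background sequences for the two region types.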
3. Legend for the browser

2. Magenta triangle shows the location of the gene/position.
3. The upper line in each pair of lines represents the positive strand; the lower line represents the negative strand.
Magenta boxes in the 3-Mb window are individual genes. In other window scales, genes are shown together
with their exon-intron structure. Coding region of genes is shown by blue boxes, non-coding by magenta.
Projected transcription start sites (TSS) are shown by small circles colored red, light green, or
light blue depending on their quality (high, medium and low, respectively). In addition,
TSS identified using FirstEF software are shown as small
black vertical lines.
4. Transcription start positions from DBTSS database.
5. Cis regulatory modules (CRMs). Promoters are colored dark yellow, distal CRMs are colored red,
and 3'UTR CRMs are colored light blue. In high-resolution windows, the name of the CRM is shown
below. Click on the CRM to get the sequence and a list of transcription factor binding sites.
6. Selected transcription factor binding sites (TFBS). In the 60-Kb and 4-Kb windows, selected TFBS are
shifted up if they match the positive DNA strand, and down if they match the
negative DNA strand. In the 500-bp window, the strand of selected TFBS is indicated by an arrow.
In the 80-bp window, selected TFBS have a color border, and the strand is shown as (+) or (-).
To select a particular TFBS
or a class of TFBS, use the form at the bottom of the screen. This form also gives
an option to hide a group of TFBS or change conservation and mismatch thresholds.
7. Yellow area indicates the region that is zoomed-in below.
8. Names of assembled transcripts. The first part (e.g., U000006) indicates a U-cluster (gene or transcribed non-gene),
and the second part (after dash) is the transcript number. To see details of assembly go to the gene index by clicking
on the U-cluster name in the header of the page.
9. Conservation scores compared with other mammals (from UCSC).
10. Abundance of specific sequence patterns/motifs: CpG pairs (CG), G-stretches, AT/TA, and
A-stretches.
11. Transcription factor binding sites (TFBS). Selected TFBS are shown in color, non-selected
TFBS are black. Color bars are shifted up if TFBS matched to the positive DNA strand,
and down if they matched to the negative DNA strand.
To select another TFBS (to be shown in a different color) or change other viewing
options use the form at the bottom of the screen.
12. DNA sequence in the 500 bp and 80 bp windows is color-coded: A-magenta, T-blue, C-yellow, G-green.
CpG pairs are shown by vertical black lines.
13. Transcription factor binding sites (TFBS). Selected TFBS are shown in color.
Their strand/orientation is shown by arrow. Non-selected TFBS are black.
To select another TFBS (to be shown in a different color) or change other viewing
options use the form at the bottom of the screen.
14. Transcription factor binding sites (TFBS) are shown as boxes with a color-coded
position-weight matrix (or pattern). Click on it to get information on this particular type
of TFBS. Below the box are the name of the TFBS, its orientation in parentheses, and the mismatch
score (e.g., D=0.056). If there are no mismatches, the mismatch score (D=0) is not shown.
Selected TFBS have a thick color border (e.g., TF_OCT has a magenta border in the picture).
15. Transcription start site (TSS) is shown by a small circle colored red, light green, or
light blue depending on the quality of TSS (high, medium and low, respectively). In addition,
TSS identified using FirstEF software are
shown as small black vertical lines.
16. Repeats identified using the RepeatMasker program (results downloaded from UCSC).
The color of repeats is gray for SINE, dark yellow for LINE, green-blue for LTR, light
green-blue for DNA, and light-red for simple repeats.
4. Methods: TSS and promoters
Analysis of regulatory regions is based on the mouse genome sequence assembled in July 2007 (mm9). Transcription start sites (TSS) were compiled from several databases in an attempt to cover main
and alternative transcripts of protein-coding genes in the mouse genome. TSS from the DBTSS database
ver. 5.2 (N = 18,503) were considered high-quality because they were identified using a large
set of full-length cDNA. Because the DBTSS database was based on an older version of the
mouse genome (mm5), we used BLAT to remap TSS coordinates to genome mm9. Medium-quality TSS
were identified as matches between independent data sources which were >500 bp away from
high-quality TSS. The first subset (N = 4712) of medium-quality TSS was taken from
protein-coding transcripts (ORF >= 100 aa, or known function) in the NIA Mouse Gene Index,
ver. mm9, if they matched FirstEF software predictions within 300 bp. We used a
300 bp distance threshold as a matching criterion because it corresponds to a false
discovery rate (FDR) of ca. 1% according to the following estimation. If 52,503 TSS predicted
by FirstEF were randomly distributed in the entire genome (3 Gb), then 387 of them on
average would appear within 300 bp of 36,829 TSS identified by aligning mRNA and EST
sequences to the genome. Thus, the FDR = 387/36,829 = 1%. The FirstEF software uses
discriminant functions to identify potential donor splice sites and TSS based on frequency
distributions of short motifs in the DNA sequence. The second subset (N = 4219) of
medium-quality TSS was taken from protein-coding transcripts in the NIA Mouse Gene Index
if they started within a CpG island but did not match with FirstEF predictions. CpG islands
were detected as regions with a minimum of 8 CpG pairs within 250 bp. This threshold was
selected based on the frequency distribution of CpG pairs in promoters.
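The FDR estimate for the 300 bp matching threshold can be checked with a few lines of arithmetic. The sketch assumes, as the text does, that FirstEF predictions are spread uniformly over a 3 Gb genome; it additionally assumes that the 600 bp windows around the reference TSS do not overlap. Variable names are illustrative.

```python
genome_size = 3_000_000_000   # 3 Gb
n_firstef = 52_503            # TSS predicted by FirstEF
n_reference = 36_829          # TSS from mRNA/EST alignments
window = 2 * 300              # within 300 bp on either side of a reference TSS

# Expected number of FirstEF predictions falling near a reference TSS by chance
expected_random = n_reference * window * n_firstef / genome_size

# Fraction of reference TSS expected to have a chance match
fdr = expected_random / n_reference

print(round(expected_random), f"{100 * fdr:.2f}%")   # about 387 and about 1%
```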
5. Methods: TFBS
Transcription factor binding sites (TFBS) were identified in the entire mouse genome
using either patterns or position-weight matrices that were compiled from various sources
including the TRANSFAC database, public version 7.0 (Matys et al. 2003). Because the TRANSFAC database
has many redundant entries, we combined 291 vertebrate matrices into 115 groups. We also trimmed
regions with low information or with inconsistencies between various versions of the same TFBS,
as documented on the web site
(e.g., TF_OCT).
Then one matrix was built for each group. The
second major source of TFBS was the set of 174 patterns over-represented in conserved regions
of mammalian promoters (Xie et al. 2005). Of these, 69 patterns corresponded to known TFBS.
References to additional TFBSs are available on the web site. In total, 134 matrices and
219 patterns were used for identification of TFBS.
6. Methods: cis-regulatory modules
A potential cis-regulatory module (referred to below simply as CRM) was defined as a
genomic region with at
least 4 conserved TFBSs within each 200 bp of its length that did not overlap with
transposable repeats and/or CDS. Evolutionary conservation is a reliable indicator of
functionality of TFBSs (Zhang and Gerstein 2003). If a CRM overlapped with a promoter,
it was merged with the promoter; if it overlapped with the 3'UTR of a gene, we
considered it a 3'UTR-associated CRM; all other CRMs were considered distal CRMs (DCRMs).
3'UTR-associated CRMs most likely regulate post-transcriptional processes (mRNA stability,
translation, etc.) (Xie et al. 2005), thus we distinguished them from DCRMs which are
mostly involved in the regulation of transcription. Genome conservation scores and
repeat coordinates were downloaded from the UCSC database (Siepel et al. 2005). Conservation
score 0.5 was used as a threshold for considering a TFBS conserved. DCRMs were considered
high quality if they contained at least one 150 bp region with 6 conserved TFBS.
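One way to read the CRM definition (at least 4 conserved TFBSs within each 200 bp) is as a sliding-window rule over the sorted positions of conserved binding sites. The sketch below implements that reading only; it is an interpretation, not the CisView algorithm. It assumes the input already contains only TFBSs that passed the conservation threshold, and it does not exclude repeats or CDS. The function name is hypothetical.

```python
def candidate_crms(positions, min_sites=4, window=200):
    """Return merged (start, end) regions in which every run of `min_sites`
    consecutive conserved TFBS positions spans at most `window` bp."""
    positions = sorted(positions)
    regions = []
    for k in range(len(positions) - min_sites + 1):
        start = positions[k]
        end = positions[k + min_sites - 1]
        if end - start <= window:
            if regions and start <= regions[-1][1]:
                # Overlaps the previous qualifying window: extend that region
                regions[-1][1] = max(regions[-1][1], end)
            else:
                regions.append([start, end])
    return [tuple(r) for r in regions]
```

With conserved sites at 100, 150, 180, 220, 300, 1000, 1500, 1520, 1560, and 1600, this yields two candidate regions, (100, 300) and (1500, 1600); the isolated site at 1000 belongs to neither.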
7. Methods: browser
The browser for the Mouse Regulatory uses cgi scripts (Perl) for generating pictures and
web pages. To accelerate data processing we created data files, which include all information
on genes, sequence, and TFBS, for each 60 Kb region. Query tools include search for
specific TFBSs or their combinations in promoters or in DCRMs; search for specific genes
based on symbols, annotations, gene ontology (GO) terms, or protein domains; and search for
promoters of different quality and/or containing a TATA box. Any list of promoters that
resulted from a query or was uploaded by a user can be further analyzed for over-represented
TFBSs (both singles and pairs), GO terms, and protein domains that are preferentially
associated with the list. Over-representation of promoters with specific TFBSs or genes
with specific GO annotations was evaluated statistically using z scores estimated from
the hypergeometric distribution, with FDR < 0.05.
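A z score based on the hypergeometric distribution can be computed from its mean and variance, as sketched below. The exact formula used by CisView is not given in the text, so this is the textbook form and should be read as an assumption. The function and argument names are illustrative.

```python
import math

def hypergeom_z(k, n, K, N):
    """z score for observing k hits in a list of n promoters, when K of the
    N promoters in the whole set carry the feature (e.g., a given TFBS).
    Undefined (division by zero) when the variance is 0, e.g. K == 0 or n == N."""
    mean = n * K / N
    variance = n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
    return (k - mean) / math.sqrt(variance)
```

For example, if 100 of 1000 promoters contain a TFBS and a list of 50 promoters contains it 15 times, the expected count is 5 and the z score is about 4.8.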
8. How to use CisView
>sequence_name
TTCCCTTAATCTCTAGAACTCCCAGCAGTGTTGGCTACT
Sequence may occupy multiple lines.
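The input layout shown above (a ">" line with the sequence name, followed by one or more sequence lines) resembles the FASTA format. A minimal reader for such input might look like the sketch below; it is a generic illustration, not part of CisView, and the function name is hypothetical.

```python
def read_fasta(text):
    """Parse FASTA-style text into a dict {sequence_name: sequence}.
    Sequence lines that follow a '>' header are concatenated."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()       # header line: start a new record
            records[name] = []
        elif name is not None:
            records[name].append(line)    # sequence line for the current record
    return {n: "".join(parts) for n, parts in records.items()}
```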
Frequently asked questions (FAQ)
If you have questions, please send a note to the webmaster.