#*******************************************************
# 2004, The National Institute on Aging (NIA/NIH).
#*******************************************************

This software is provided "AS IS". NIA makes no warranties, express or
implied, including no representation or warranty with respect to the
performance of the software and derivatives or their safety,
effectiveness, or commercial viability. NIA does not warrant the
merchantability or fitness of the software and derivatives for any
particular purpose, or that they may be exploited without infringing the
copyrights, patent rights or property rights of others. NIA shall not be
liable for any claim, demand or action for any loss, harm, illness or
other damage or injury arising from access to or use of the software or
associated information, including without limitation any direct,
indirect, incidental, exemplary, special or consequential damages.

This software program may not be sold, leased, transferred, exported or
otherwise disclaimed to anyone, in whole or in part, without the prior
written consent of NIA.

Programmer: Alexei Sharov (sharoval@grc.nia.nih.gov)
National Institute on Aging, Genetics Lab,
All rights reserved.

The software has not been sufficiently tested, so problems may arise in
the case of misconfiguration or missing components. If you are familiar
with Perl you may try to fix the problem yourself or contact Alexei
Sharov at sharoval@grc.nia.nih.gov. Please include the error/warning
message in your e-mail.

1. GENERAL DESCRIPTION

The software for gene index assembly has a modular structure. Each
module can be executed independently of other components. All code is
written in the Perl language, except the "togif" module that is used to
generate images for the web.

MAIN PROGRAM: geneindex.pl
CONFIGURATION FILE: geneindex.cfg (it is parsed by geneindex.pl)

To run the program you need to create the following directory tree:

Geneindex (holds all Perl programs and the togif program)
|-data
|-output
|-archive
|-update (not used currently)
|-CpG
|-October2003 (genome version)
|-Genome
|-Ensembl
|-RefSeq
|-NIA
|-Riken
|-dbEST
|-GenBank
|-FastaFiles
|-Annotations

The output is written to the web server, so you need to create a
directory for the web home page. In the geneindex.cfg file it is named
"www/geneindex3". In this directory, create the following tree of
subdirectories:

geneindex3 (home directory)
|-bin (for cgi scripts, togif program)
|-U
|-T
|-truncated
|-exons
|-images
|-download
|-lists

2. ADDITIONAL SOFTWARE NEEDED

BLAT (http://www.genomeblat.com/genomeblat/index.asp)
CpGproD (http://pbil.univ-lyon1.fr/software/cpgprod.html)
First Exon Finder (http://rulai.cshl.org/tools/FirstEF/)
ORFind (http://www.ncbi.nlm.nih.gov/gorf/gorf.html)
togif.exe (included)

BLAT, CpGproD, and First Exon Finder are executed before assembly, and
ORFind is called from within the program. ORFind should be installed at
/usr/local/bin/orfind. If you use a different location, modify the path
in the file "parse_orf.pl".

The program togif.exe draws images for the web interface. It should be
present in 2 places: together with the Perl scripts and in the "bin"
directory for the web page. 3 compiled versions are provided, for
Windows, UNIX, and LINUX. If you have a UNIX OS, rename the file
"togif.UNIX.exe" to "togif.exe" in both locations.

3. INPUT DATA

To run the whole package you need to do the following:

1) Download databases of expressed sequences in fasta format. Put the
   files into the "FastaFiles" directory. Specify the file names in the
   "geneindex.cfg" file.
2) Remove redundancy (if necessary).
3) Modify sequence names to add prefixes which indicate the source (e.g.
   use "dbEST_" prefix for dbEST sequences; "REF_" for RefSeq).
4) Run BLAT (http://www.genomeblat.com/genomeblat/index.asp) for all
   sequences versus the genome. The output should be generated by
   chromosome. The output file name is [chr_name]-[prefix].output
   (e.g. chr10-Ensembl.output). All output files for one database should
   be in one directory. Directory names and the corresponding prefixes
   should be listed in the "geneindex.cfg" file.
5) Download sequence annotations and format them into a tab-separated
   file with 3 columns: sequence name, annotation, gene symbol. Place
   the files into the "Annotations" directory. Specify the file names in
   the "geneindex.cfg" file.
6) Generate the file "clone-link.txt" and place it into the "data"
   directory. This tab-delimited file has 3 columns: 5'EST name, 3'EST
   name, clone name. If some information is missing, leave the field
   blank. This file is used to determine which EST is 5' or 3', as well
   as to establish clone-links.
7) Generate information on CpG island locations for each chromosome
   using the CpGproD software
   (http://pbil.univ-lyon1.fr/software/cpgprod.html). Name the files
   CpG[chr_name].txt (e.g., CpGchr11.txt) and put them into the CpG
   directory.
8) Generate first-exon information using the First Exon Finder program
   (http://rulai.cshl.org/tools/FirstEF/) and put the output file, named
   "promoters.txt", into the "data" directory.

After transcripts are generated you can add the following data:

9) If you want to plot oligo locations (in some microarray), you need
   to BLAT the oligo sequences versus the transcripts (the T-fasta.fa
   file that is generated in the "output" directory). Put the resulting
   file, named "blat-oligo.txt", into the "data" directory.
10) Generate files with protein domains and GO-annotations. These files
    have the following format:
    transcript&domain_ID#domain_description@domain_ID#domain_description ...
    These files are named "T-domain.txt" and "T-ontology.txt",
    respectively, and are placed into the "data" directory.

4. PROGRAM COMPONENTS

cat.pl                    substitute for the 'cat' tool in UNIX
filter_blat1.pl           first filtering of BLAT output, compiles output
                          into 1 file
est_redundant.pl          finds redundant EST sequences
extract_lines.pl          extracts lines from BLAT output file according
                          to a list of genes or other criteria
extract_fasta.pl          extracts sequences from Fasta file according to
                          a list of genes or other criteria
filter_blat.pl            second filtering of BLAT output, optional
                          numbering of alignments
splice_evidence.pl        makes a file with all introns and their evidence
validate_splice_sites.pl  validates splice sites, finds polyA signals
xm_delete.pl              filtering program for gene models (RefSeq_XM)
strandCorrection.pl       strand correction, clone-linking
filter_groups.pl          additional filtering to remove conflicting
                          alignments
transcripts.pl            major assembly module, assembles U-clusters and
                          transcripts
nameClusters.pl           matches gene/transcript names to already
                          existing ones
extract_from_genome.pl    generates Fasta sequences for transcripts
parse_orf.pl              finds ORF
extract_exons.pl          generates web page for
repeat_detection.pl       gets repeat coordinates
generate_annotations.pl   generates annotations
gene_strand.pl            detects U-clusters with wrong strand
gene_evaluation.pl        identifies genes, protein-coding genes, major
                          transcripts, etc.
oligo_location.pl         determines oligo location in all transcripts
Uplot.pl                  plots U-clusters
plot_transcripts.pl       plots transcripts
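
EXAMPLE: READING clone-link.txt

The "clone-link.txt" file described in the input data section has 3
tab-delimited columns (5'EST name, 3'EST name, clone name), and any
field may be blank. The sketch below is not part of the package and is
written in Python rather than Perl; it only illustrates one way such a
file could be read to decide which end each EST comes from. The function
name read_clone_links and the sample EST and clone names are
hypothetical.

```python
import io


def read_clone_links(handle):
    """Map each EST name to a (read end, clone name) pair.

    Hypothetical helper, not part of the geneindex package. Assumes
    the 3 tab-delimited columns described for clone-link.txt:
    5'EST name, 3'EST name, clone name, with blanks allowed.
    """
    est_info = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        # Pad short rows so missing trailing fields read as blank.
        fields += [""] * (3 - len(fields))
        five_prime, three_prime, clone = (f.strip() for f in fields[:3])
        if five_prime:
            est_info[five_prime] = ("5'", clone)
        if three_prime:
            est_info[three_prime] = ("3'", clone)
    return est_info


# Made-up sample rows: a full pair, a 5' EST only, a 3' EST with no clone.
sample = (
    "dbEST_A5\tdbEST_A3\tcloneA\n"
    "dbEST_B5\t\tcloneB\n"
    "\tdbEST_C3\t\n"
)
links = read_clone_links(io.StringIO(sample))
for name in sorted(links):
    end, clone = links[name]
    print(name, end, clone or "(no clone)")
```

To read a real file, pass an open file handle instead of the sample
string, e.g. open("data/clone-link.txt").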