Your name: Your email address:
Comments on take home exam and the use of answers from previous years!
Reminder: The midterm will take place next week on Wednesday. Please come to class on Monday with any questions you want to have answered before the mif term.
Go over perl-scripts
Use of <tab> and command history on the unix command-line.
Questions you should answer today are given in red.
Instructions on how to connect to a computer using sftp and ssh are given below
A) obtaining the genome sequences
Download four genomes of your choice. The first two should be from distantly related organisms (e.g. one Archaeon, one Bacterium, if in doubt of what to choose, go for one of the Thermotoga and one of the Pyrococcus genomes. For the second exercise two or more genomes from closely related bacteria or archaea are needed. Two of four Aeromonas genomes would be interesting):
Method: Go into the bacteria sub-folder using the ftp://ftp.ncbi.nih.gov link. Click on the appropriately named folder (e.g., Pyrococcus abyssi, Pyrococcus is a member of the archaeal domain).
Among the files in these folders we are interested in is the .ffn file which contains all nucleotide sequences for individual predicted ORF's, the .faa file which contains the predicted ORF's amino acids sequences, and the .fna file, which contains the whole genome. These files are in a fasta format. Another file of interest are the ptt files that contain information on the individual annotated open reading frames. This file is useful, if you want to map ORFs by position in the genome. For this exercise we will only use the .faa files and the .ptt. If a folder contains more than one .faa file, either take only the file containing the data from the main chromosome (for the first exercise you could take all of the faa files and copy them into a single file*. The same approach is recommended, if you want to use more than one genome as target.) On your computer rename the files into something you recognize. (e.g., p_abyssi.faa and t_maritima.faa). If you want to map the location of genes in two related genomes, you only wnat to download the file the contains the main chromosome. If you are interested in all genes present in an organism, you want to combine the .faa files from the chromosomes and the plasmids. Explore the other file types present in the folder (gbk, fna, frn and rnt).
B) Setting up a directory on the cluster
Establish an ssh connection to the cluster (e.g.: ssh your_account_name@bbcsrv3.biotech.uconn.edu (enter your password when prompted))
Use your ssh or sftp connection to create a directory in your home directrory on the cluster called blasttesst. The easiest will be to right clock in your home directory window in your sftp gui. Or you could type in your terminal window :
cd $HOME mkdir blasttest
On the cluster (remote machine), you should NOT (meaning never-ever) run long processes on the master node. Type qlogin to be transferred to a compute node that is not busy, or submit your command to the queue using qsub. (qrsh also seems to work) On the cluster using the ssh connection established change to the directory, where you stored the *.faa files from the distantly related bacteria/archaea. If the directory is named blasttest and one of the genomes is called p_abyssi.faa, the following commands will create a searchable database: cd blasttestformatdb -i p_abyssi.faa -o T -p T Where p_abyssi.faa is one of the genomes you downloaded; replace the name with the *.faa file you want to work with. Under the new blast+ program package the command would be: makeblastdb -in p_abyssi.faa -input_type fasta -parse_seqids -dbtype prot
C) Doing a Blast search
The -i is to indicate your infile.name, -p T is for a protein sequence, for a nucleotide sequence, you need to use -p F (for False); the -o T instructs the formatdb program to create indices. For more information on the formatdb program type formatdb - <enter> Remember: You need to replace p_abyssi.faa with one of the genomes you downloaded. This genome will be converted into a searchable database.
To do a blast search of every sequence in T_maritima.faa against the P. abyssi genome, we use blastall. To get information on the program type blastall <enter> [For info on the programs from the blast + package type blastp -help] For more details check one of the many help pages here and here. We will use the following options (you need to change the command to reflect the names of the genomes you want to compare; -d gives the name of the databank that you created using formatdb):
blastall -i t_maritima.faa -d p_abyssi.faa -o blast.out -p blastp -e 10 -m 8 -a 2 (see below for the command using the newer blast plus package) -i the query sequence(s) -d the database -o the output file -p the blast program to use -e the e-value cut off -m the format of the output. option 8 gives a tabular output, that can be easily read into excel spreadsheet or into a database program. option 9 is similar but includes comment lines. the individual columns in the output denote the following # Query id # Subject id # % identity # alignment length # number of mismatches # number of gap openings # position of query start # position of query end # position of subject start # position of subject end # e-value of a hit # bit score of a hit
[To do the same search in the blast+ package - no need to do both, should get the same results, but blast + should be faster): blastp -db p_abyssi.faa -query t_maritima.faa -out blast.out -evalue 10 -outfmt 6 (to get more help type blastp -help The nicest thing about the new version is that you can specify what goes into the output file)]
Which option turns off the the low complexity filter? Which option, and which setting, sets the wordsize to 2? Which option allows to use two processors?
Which option turns off the the low complexity filter?
Which option, and which setting, sets the wordsize to 2?
Which option allows to use two processors?
Inspect the output typing more blast.out on the commandline on the cluster
Often one is only interested in the top scoring hit for each query. A simple perl script (here) goes through the blastall output in m8 or m9 format and retains only the top scoring hit, and it adds a header line that will be correctly interpreted by EXCEL. Save this script into your blasttest folder on the cluster (the easiest is to ctrl click on here and select save link as, then select the blasttest folder in the finder). Glance through the script using a text editor (textwrangler, or just click on the here link in your browser).
To run the blast.out file through the perl script type perl extract_lines.pl blast.out This will create a file in your directory called blast.out.top that only contains the top scoring hits, and from which the comment lines (in case you used the m9 option in blastall) are removed. [extra credit: how would you modify the script, to print out not only the first reported match for a query, but all hits that have equally good E-values to the first one? Note: in Perl the operator for "logical and" is && and "not equal" for a number is != and "not equal" for a string is ne]
(Aside: A modified version of this program (here) adds another column to the listing that for each alignment gives the bitscore per residue in the alignment.)
Import this file blast.top.out into an excel spreadsheet.
Download the file to your local machine (filezilla or sftp) start EXCEL on you local machine, open an empty spreadsheet, select Data -> Get external data; select your file (you'll need to tell EXCEL to show you all files, not only files with a recognized extension)
Select sort in the data pull-down menu (the one on top, not in the application window) and sort the data on the different columns. Which gene was found to be most similar between the two genomes.
Which were the genomes you selected? Which gene pair had the highest % identity? Which gene pair had the smallest E-value?
Which were the genomes you selected? Which gene pair had the highest % identity?
Which gene pair had the smallest E-value?
To retrieve the information on a target sequence, you can use the command fastacmd. If the GI number of the target is 14521962, and the database you searched is p_all.faa the command is as follows: fastacmd -s 14521962 -d p_all.faa By default the databank is nr as installed on the cluster. Thus fastacmd -s 14521962 also works (and you get more annotation), provided the sequences in your databank are also in nr.
Next, we will calculate a histogram for the % identity values for all the significant matches and one for all matches (including ones with an e-value above 0.0001).
The newer version of Excel for Macs no longer allows to calculate histograms via the add-ons. The simplest solution is to use use a histogram tool. Download this file and open it in excel. Allow it run macros. Then in the tools menu select better histograms, select the data you want to plot, and enter the start (e.g.: 0), step (e.g.: 10) and stop values (100). Then click ok.
Calculate % identy histograms for all matches and for matches that are significant and highly significant; e.g. all % identity values, and only the % identity values for matches with E-values < 10^-10. If you don't have time to do this yourself, go here - this compares T.maritima top scoring hits to three Pyrococcus genomes, check the %identity sheet.
===========================================
Aside: If EXCEL on your computer does not cooperate, we can calcualte the histograms in 'R' (a package for statistical analysis and graph plotting****). Download the script from here (control click and 'save link as...' -- this is an rchive with the script and with a file containing instructions, the script itself is here) and place the file histogramScript_pdf.R in the blasttest folder on the cluster via filezilla or sftp.
Copy the data you want to plot as a histogram into a text document (one number per line, using copy paste from Excel to a texteditor works). Add a first line with information on the data (these will become a label on the histogram. The label is not allowed to contain any spaces! (use underlines instead). Save the textfile as indata in the directory that contains the script histogramScript_pdf.R (indata should not have a file extension)
In your ssh connection, using the cd and cd.. commands, move to the directory that contains the file indata.txt and histogramScript_pdf.R
At the commandline type > R CMD BATCH histogramScript_pdf.R (don’t type the > :) ) The script should generated a files called plot1.pdf and plot2.pdf. Using the file transfer connection you made earlierlocate this PDF file in Finder and open it (double clicking should open it in preview).
End(aside)
Do you observe a smooth distribution of % identiy values? Do any genes constitute a second peak of more similar sequences, compared to the bulk of the comparisons? What can explain the difference in the histograms for significant and insignificant hits?
Download the *.faa, and*.ptt from two genomes (main chromosome only) from the ncbi's ftp site ftp://ftp.ncbi.nih.gov. As we want to plot syntenic relationships between the two genomes, we want to use the chromosomes from two close relatives.
We want to plot a genes genome position in one genome against the genome position of a homolog in another genome. One way to do this is to replace the GI numbers in the genome with the location in the genome. A program that does this is here. Save the two genomes (as faa files, the two ptt files, the two files, and the perlscript into a new directory on the cluster (name in gp (for genome plot)). We also will use the extract_lines.pl script (so put it into the same diertory).
Open a terminal connection to the cluster, login, then qlogin to a compute node cd gp, run the addnucnum script. For example, if the genome and ptt files are ACN.faa, ACN.ptt, CCI.faa, CCI.ppt then the commands would be perl addnucnum.pl ACN and perl addnucnum.pl CCI Check the addnucnum.pl program, the ptt files, and the output in a a texteditor (Note, at present the script uses the middle of the ORF to identify the location, if you want the plot the beginning of each ORF, modify the script).
Pick one of the numbered genomes and turn it into a "blastable" databank. e.g. formatdb -i ACN.num.faa -p T -o T Check the result of the formatdb command by typing more *.log
then run a blastall search of the other genome (or rather the proteins encoded in the other genome) against the databank blastall -i CCI.num.faa -d ACN.num.faa -o blast.out -p blastp -e 10e-4 -m 8 -a2 NOTE: You want to select a reasonable significance cut-off, because we will plot all significant matches. NOTE: You need to use the numbered genomes in both cases.
Run blast.out through extract_lines.pl perl extract_lines.pl blast.out
(Note: If you do more than one comparison, rename blast.out into something meaningful!)
Download the files blast.out and blast.out.top into excel and create scatterplots for the first two columns (alternatively, you can use the gnuplot script described linked in class11).
What if any is the difference between the two plots? Why do some proteins have more than one match? Repeat the scatterplot exercise after you removed all blast hits that had an E-value worse than 10^-20.
Plot the level of sequence conservation along a genome. An easy way to do this is to sort the spreadsheet on the ORF position, and then plot the bitscores as a bargraph, or using a scatterplot (bitscore versus position, or -log E-values versus position, or % identity versus position ... ). For this last exercise, if you want to identify the genes (see the fastacmd above), it might help to use one genome with numbers giving the position in the genome, and the other one using the GI numbers. (Even better would be do the blast search with the normal genomes and write a Perl script to add another coulms to the spreadsheet that contains the genome location of the query and the target genome. If you do, send me a copy of the script.)
Which region(s) of the genome is least conserved?
=========================================================
You will need two connections to the server, (A) a terminal window that allows you execute commands from the command line, and (B) a program that allows you to move files back and forth.
On a windows machine download and install the ssh client from ftp://ftp.uconn.edu/restricted/ssh/ . Install the program, it should include both a terminal program and a file transfer program. Open the terminal program and connect to the server. First edit a profile with your information, then use the profie to connect
On a Mac, open the terminal program (terminal.app -- the program usually resides in the application folder inside the utilities subfolder). In the terminal window type: ssh your_user_name@bbcsrv3.biotech.uconn.edu
If you want to change your password type passwd Then follow the instructions to change your password.
Load the two .faa files from the distantly related organisms into a folder in your home directory on bbcsrv3.biotech.uconn.edu. As usual there are many ways to do this.
If you have the ssh blient installed under windows, log in and open the file transfer window.
Alternative #1 (skip this if you use windows - the ssh client does the same as fielzilla): The easiest is to use a program like filezilla. Go to their web page, download and install FileZilla client (use the correct version according to your operating system and processor). Unpack the program. Double click on the program. For Mac with Intel processor, the application is also available here, and for old Macs with RISC processors, the application is also available here. Connect FileZilla to the server (top left icon, then click on the new site button, then select protocol sftp, and logon type normal. The two panes in the FileZilla window represent your computer (left) and the server (right).
Find the files you want to transfer on the left window pane and drag them into the directory on the server where you want them to go. Right click (or control click) on the right pane opens a menu that allows you create new subdirectories and change file permissions.
Alternative #2: Using sftp on your ssh connection "always" works, but it is rather pedestrian: Save the you want to transfer files onto your computer. Then open a terminal window on your mac (in the utilities folder inside the application folder) and type the commands in green ssh your_account_name@bbcsrv3.biotech.uconn.edu (enter your password when prompted) Open another terminal window on your mac and type sftp your_account_name@bbcsrv3.biotech.uconn.edu (enter your password when prompted. This establishes a secure ftp connection to the cluster) cd blasttest changes to the directory blasttest. lcd changes the directory on your local machine, lpwd reports your local directory name mput *.faa transfers all files that end with .faa in your local directory into the directory on the server.
Whatever works for you!
More preparation: Often it is nice to have a text editor that uses context dependent coloring, especially, for creating scripts. For Macs a very nice program is textwrangler. The latest version is here, older versions for older operating systems are here.
* You can use a text editor for this (textwrangler), but the files might be pretty large. An alternative is to use unix:
open a terminal window (the application is in the utility folder inside the application folder). cd to the directory where you downloaded the two files (e.g., cd Desktop) cat name1.faa name2.faa > new_name_for_both_faa_files.faa where name1.faa name2.faa are the names of the files you want to copy. cat list the content of one ore more files and > directs the output from the default (the screen) to a file
**** If you end up doing research in Science, Medicine, or Engineering you'll need to do statistical analyses. R is free, open source, and probably the most powerful software package available. You can also do publication quality plotting. There is a learning curve at the beginning, but it is a worthwhile investment of time! See here and here for more info.
Type logout to release the compute node form the queue. Check the queue for abandoned sessions using qstat. If there are abandoned sessions under your account, kill them by deleting them from the queue by typing qdel job-ID, e.g. "qdel 40000" would delete Job # 40000
Check the appropriate radio button below before pressing the submit button:
Send email to your instructor (and yourself) upon submit Send email to yourself only upon submit (as a backup) Show summary upon submit but do not send email to anyone.