#!perl -w use strict; # perl nassinov.pl rna1.data =head1 NAME nassinov.pl =head1 DESCRIPTION The Nussinov method is a simple dynamic programming algorithm to align the biological sequences and secondary structure determination. It searches for a secondary structure scheme that maximizes the number of base pairs and considers other factors, such as base stacking and strength of the base pairs What patterns can we use in RNA bioinformatics? o Base pairing o within the same RNA o sense/anti-sense pairing between two RNAs o conserved by evolution (sequence may diverge) o Base frequency bias o Example: G-C base pairs are most stable For the purpose of sequence alignment and structure determination, DNA can simply be described as a flexible single-stranded biopolymer. The challenge in the alignment and determination of structure is the proper implemented algorithmic framework. The secondary structure is obtained from the primary structure. =head1 SYNOPSIS nassinov.pl [sequence] where sequence is a filename representing an RNA sequence =head1 NOTES For implementation, a Bio-Perl program is written that does the following: 1. It takes (as input) exactly one command-line argument, which is a siRNA sequence (in upper- or lower-case, or a mixture of both). 2. It validates the sequence to ensure it contains no characters that are not A, C, G or U (if any other characters are present, it should print an error and die. 3. It prints (on standard output) the maximum possible number of Watson-Crick base pairs in the "optimal structure" for the sequence (where "optimal structure" means the secondary structure that contains the most Watson-Crick base pairs). =head1 BACKGROUND I would strongly recommend to take a read of the following references which I found very informative. References: 1. International Journal of Computer Applications (0975 8887) Volume 13 No.7, January 2011 An Algorithmic Framework for the Study of Behavior of siRNA Sequences http://www.ijcaonline.org/volume13/number7/pxc3872480.pdf 2. Original Perl Code http://biowiki.org/view/Teaching/NussinovAlgorithmCode 3. Article: BioInfo3D: a suite of tools for structural bioinformatics http://icanhasinter.net/proxy/browse.php/Oi8vbmFy/Lm94Zm9y/ZGpvdXJu/YWxzLm9y/Zy9jb250/ZW50LzMy/L3N1cHBs/XzIvVzUw/My5mdWxs/b13/ 4. Great Set of software tools for Protein Structural Analysis http://icanhasinter.net/proxy/browse.php/Oi8vYmlv/aW5mbzNk/LmNzLnRh/dS5hYy5p/bC8_3D/b13/fframe 5. PowerPoint Presentation: Computational analysis of non-coding RNA http://users.soe.ucsc.edu/~auzilov/BME110/Fall2010/2010-11-16_BME110_Uzilov_CompRna_bgWhite.pdf =head1 COMMENTS Another excellent example from a previous student follows ... Code notes -- it seems straight forward algorithmically. I used strict and -w to clean it up and changed the input to be a file. I put reference notes into POD and suggest creating a printout like the article which produced the dots to the screen or to a printer. Comments were modified to be in-line comments; cleaning up variables insures the algorithm doesn't pick on some default Perl value. I'd recommend that a few small examples be taken to test the program from the original article. And that a DEBUG version do a print out to show the looping structure a bit. This is a partial implementation of the above notes. I tested it enough to show me that it works. =head1 EXAMPLE perl nassinov.pl rna1.data =head1 BUGS It allows for bases immediately adjacent to one another to form a base pair in the counting being done. In the lab this never happens, even in RNAs with weird structures, the tightest hairpin I've ever seen was 1 base unpaired in the loop, but typically it is more than 1. For microarrays we were generating in my last lab, we were forming stable hairpins with 3 bases in the loop. I think steric constraints would prevent bases next to one another from pairing. I realize that would complicate the nested loops in the code, but perhaps the insertion of a few if statements would prevent counting those instances when i and j are adjacent from being counted. =head1 AUTHOR The following is the Perl code (tested and slightly modified) that requires an RNA file (attached). The RNA is supplied as a sequence parameter to the program as in perl Nussinov "content of pccx1.rna file as a sequence string". =cut # DESC: Script that takes an RNA sequence as input and calculates the maximum # number of Watson-Crick basepairs using the Nussinov algorithm. # USAGE: nussinov [sequence] # read in file as one string or exit with a usage message my $filename = shift || usage(); open(FILE, "< $filename") or die("Unable to open sequence file: $filename:$!\n"); my $seq = ""; { local $\ = undef; $seq = <FILE>; $seq =~ s/\s//g; } # scoring matrix for base pair combinations my %score = ( 'AA' => 0, 'AC' => 0, 'AG' => 0, 'AU' => 1, 'CA' => 0, 'CC' => 0, 'CG' => 1, 'CU' => 0, 'GA' => 0, 'GC' => 1, 'GG' => 0, 'GU' => 0, 'UA' => 1, 'UC' => 0, 'UG' => 0, 'UU' => 0 ); my $tempMax = 0; # check $seq only contains A C G U if ( $seq ) { if ( $seq =~ /[^acguACGU]+/ ) { die "Unknown character in sequence $seq!"; } $seq = uc($seq); my $len = length($seq); # set diagonal and lower diagonal equal to zero my @s = (); $s[0][0] = 0; for ( my $i = 1; $i <= $len-1; $i++ ) { $s[$i][$i] = 0; $s[$i][$i-1] = 0; } # loop across each diagonals for ( my $h = 1; $h <= $len-1; $h++ ) { my $i = 0; # loop across each element of diagonal for ( my $j = $h; $j <= $len-1; $j++ ) { # CASE: 'i' is unpaired $s[$i][$j] = $s[$i+1][$j]; # CASE: 'j' is unpaired $tempMax = $s[$i][$j-1]; if ( $tempMax > $s[$i][$j] ) { $s[$i][$j] = $tempMax; } # select bp's to pair my $pair = substr($seq,$i,1) . substr($seq,$j,1); # CASE: 'i' and 'j' are paired $tempMax = $s[$i+1][$j-1] + $score{$pair}; # score of bp is returned from hash if ( $tempMax > $s[$i][$j] ) { $s[$i][$j] = $tempMax; } # CASE: bifurcation for ( my $k = $i+1; $k < $j; $k++ ) { $tempMax = $s[$i][$k] + $s[$k+1][$j]; if ( $tempMax > $s[$i][$j] ) { $s[$i][$j] = $tempMax; } } # increment i to traverse diagonal $i++; } } print "Maximum number of Watson-Crick basepairs: $s[0][$len-1]\n"; } else { die "No sequence provided!\n"; } exit(0); sub usage { print "$0 <filename>\nPlease specify a filename on the command line\n"; exit(1); }