NAME

Bio::PreSeq - bioperl sequence object


SYNOPSIS


Object Creation

 $seq = Bio::PreSeq->new;
 
 $seq = Bio::PreSeq->new($filename);
 
 $seq = Bio::PreSeq->new(-seq=>'ACTGTGGCGTCAACTG');
 
 $seq = Bio::PreSeq->new(-seq=>$sequence_string);
 
 $seq = Bio::PreSeq->new(-seq=>@character_list);
 
 $seq = Bio::PreSeq->new(-file=>'seqfile.aa',
                      -desc=>'Sample Bio::PreSeq sequence',
                      -numbering=>'1',
                      -type=>'Amino',
                      -ffmt=>'Fasta');
 
 $seq = Bio::PreSeq->new($file,$seq,$id,$desc,$names,
                     $numbering,$type,$ffmt,$descffmt);


Object Manipulation

 $seq->[METHOD];

 $result = $seq->[METHOD];
 
 
 Accessors
 --------------------------------------------------------
 There are a wide variety of methods designed to give easy
 and flexible access to the contents of sequence objects.
 
 The following accessors can be invoked upon a sequence object:

 ary()        - access sequence (or slice of sequence) as an array
 str()        - access sequence (or slice of sequence) as a string
 getseq()     - access sequence (or slice) as string or array
 seq_len()    - access sequence length
 id()         - access/change object id 
 desc()       - access/change object description
 names()      - access/change object names
 numbering()  - access/change sequence numbering offset
 origin()     - access/change sequence origin
 type()       - access/change sequence type
 ffmt()       - access/change default output format
 descffmt()   - access/change description format
 setseq()     - change sequence
 
 Methods
 --------------------------------------------------------
 The following methods can be invoked upon a sequence object:

 copy()        - returns an exact copy of an object
 alphabet_ok() - check sequence against genetic alphabet  
 layout()      - sequence formatter for output
 revcom()      - reverse complement of sequence
 complement()  - complement of sequence  
 reverse()     - reverse of sequence
 Dna_to_Rna()  - convert Dna seq to Rna (transcription)
 Rna_to_Dna()  - convert Rna seq to Dna (reverse transcription)
 translate()   - protein translation of Dna/Rna sequence


DESCRIPTION

This module is the generic sequence object which lies at the core of the bioperl project. It stores Dna, Rna, or Protein sequence information and annotation. It has associated methods to perform various manipulations of sequences.

Bio::PreSeq is the precursor to what will eventually become Bio::Seq when things are fully stable.


Sequence Types

 Currently the following sequence types are recognized:

 Dna
 Rna
 Amino


Alphabets

 This module uses the standard extended single-letter genetic
 alphabets to represent nucleotide and amino acid sequences.
 
 In addition to the standard alphabet, the following symbols
 are also acceptable in a biosequence:
 
 ?  (a missing nucleotide or amino acid)
 -  (gap in sequence)


Extended Dna / Rna alphabet

 (includes symbols for nucleotide ambiguity)
 ------------------------------------------
 Symbol       Meaning      Nucleic Acid
 ------------------------------------------
  A            A           Adenine
  C            C           Cytosine
  G            G           Guanine
  T            T           Thymine
  U            U           Uracil
  M          A or C  
  R          A or G   
  W          A or T    
  S          C or G     
  Y          C or T     
  K          G or T     
  V        A or C or G  
  H        A or C or T  
  D        A or G or T  
  B        C or G or T   
  X      G or A or T or C 
  N      G or A or T or C 
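
The ambiguity table above can be written down as a lookup hash. The following is a minimal sketch only; %AMBIGUITY and expand_symbol() are hypothetical names used for illustration and are not part of PreSeq.pm.

```perl
use strict;
use warnings;

# Symbols from the table above, mapped to the bases they may stand for.
# %AMBIGUITY and expand_symbol() are illustrative names, not PreSeq.pm API.
my %AMBIGUITY = (
    A => 'A',    C => 'C',    G => 'G',    T => 'T',    U => 'U',
    M => 'AC',   R => 'AG',   W => 'AT',   S => 'CG',   Y => 'CT',   K => 'GT',
    V => 'ACG',  H => 'ACT',  D => 'AGT',  B => 'CGT',
    X => 'ACGT', N => 'ACGT',
);

# Return the list of bases a symbol may represent (empty list if unknown).
sub expand_symbol {
    my ($symbol) = @_;
    my $bases = $AMBIGUITY{ uc $symbol };
    return defined $bases ? split(//, $bases) : ();
}

print join(',', expand_symbol('r')), "\n";   # prints "A,G"
```

The lookup is case-insensitive because the symbol is upper-cased before the hash access.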


Amino Acid alphabet

 ------------------------------------------
 Symbol           Meaning   
 ------------------------------------------
 A        Alanine
 B        Aspartic Acid, Asparagine
 C        Cysteine
 D        Aspartic Acid
 E        Glutamic Acid
 F        Phenylalanine
 G        Glycine
 H        Histidine
 I        Isoleucine
 K        Lysine
 L        Leucine
 M        Methionine
 N        Asparagine
 P        Proline
 Q        Glutamine
 R        Arginine
 S        Serine
 T        Threonine
 V        Valine
 W        Tryptophan
 X        Unknown
 Y        Tyrosine
 Z        Glutamic Acid, Glutamine
 *        Terminator


Output Formats

The following output formats are currently supported: Raw, Fasta, GCG, GenBank, PIR


Input Formats

In addition to ``raw'' sequence files, PreSeq.pm is currently only able to read in Fasta and GCG formatted single sequence files. Support for additional formats is forthcoming.

PreSeq.pm has the ability to make use of D.G. Gilbert's ReadSeq program when reading in sequence files. ReadSeq has the ability to read and interconvert between many different biological sequence formats.

When readseq is present and PreSeq.pm has been properly configured to use it, ReadSeq will be invoked when internal parsing code fails to recognize the sequence.

Formats which readseq currently understands:

  - IG/Stanford
  - GenBank/GB
  - NBRF
  - EMBL
  - GCG
  - DnaStrider
  - Fitch format
  - Pearson/Fasta
  - Zuker format
  - Olsen format
  - Phylip3.2
  - Phylip
  - Plain/Raw
  * MSF
  * PAUP's multiple sequence (NEXUS) format
  * PIR/CODATA format used by PIR
  * ASN.1 format used by NCBI

  Note: Formats indicated with a '*' allow for multiple
        sequences to be contained within one file. At this
        time, the behaviour of PreSeq.pm with regard to these
        multiple-sequence files has not been specified.

Readseq is freely distributed and is available in shell archive (.shar) form via FTP from ftp.bio.indiana.edu (129.79.224.25) in the molbio/readseq directory. (URL) ftp://ftp.bio.indiana.edu/molbio/readseq/

If ReadSeq is not available or PreSeq.pm is not configured to use it, internal parsing mechanisms will be used.

Currently supported filetypes for input: Raw, Fasta


USAGE


Installation


The easy way

Perl 5.002 or higher is required.

PreSeq.pm is one part of the larger Bio::Perl project. Bio::Perl will eventually encompass a range of molecular-biology related perl modules and object-oriented classes.

This distribution should be able to be installed just like any other perl module:

 `perl Makefile.PL`   # makes a system-specific makefile
 `make`               # makes the distribution
 `make test`          # runs the test code
 `make install`       # [may need root access for system install]

Makefile.PL will ask if you want the modules to be configured so that they may use the ReadSeq sequence conversion program. If you do not have ReadSeq installed or do not wish it to be used, simply answer 'no' to the question. If you do want ReadSeq support enabled, you will have to provide a fully qualified pathname at this time. Makefile.PL will then auto-configure the modules using a series of in-place edits.


The hard way

To use PreSeq.pm by itself, simply copy it into the directory on your system where the other site-specific perl *.pm files are kept. You can find this directory by invoking the command 'perl -V' and checking the contents of the @INC array; perl checks all the directories listed in @INC when looking for modules. All of the perl modules that are part of the standard distribution can be found in /usr/local/lib/perl5/ [your system paths may vary slightly]. There should also be a directory such as ``/usr/local/lib/perl5/site_perl/''; this is where PreSeq.pm belongs. User-installed perl modules that are not part of the standard perl distribution should be kept in the site_perl/ directory. This separation protects site-specific modules from being inadvertently altered when installing new patches or versions of perl. Once in this location, PreSeq.pm can be accessed by invoking ``use PreSeq;'' in your perl script.

If PreSeq.pm is part of a larger Bio::Perl distribution, the individual modules making up the distribution should be placed within their own ``Bio/'' subdirectory off of the main perl5/site_perl/ location. PreSeq.pm in this case would be found in the path Bio/PreSeq.pm. To use PreSeq.pm in your perl script, invoke ``use Bio::PreSeq;''.

If you lack permission or are unable to access the perl distribution directories, ask your system administrator to place the files there for you, or keep PreSeq.pm in the same location as the perl script you are writing. As a last resort when looking for a module, perl will always check the current directory.

You can also explicitly tell perl where to look for PreSeq.pm by including the following code in your script (set the value of $INSTALL_PATH to whatever is appropriate on your local system):

   BEGIN { use vars qw($INSTALL_PATH);
           $INSTALL_PATH = "/usr2/users/dag/bioperl/dist/Perl"; }

   use lib "$INSTALL_PATH/Bio/PreSeq";
   use PreSeq;
 


Why modules and object-oriented code?

Perl5 is nice in that it allows users to use OO-style programming only in the situations where they feel like doing so.

 o From the perspective of novice or occasional perl users, objects are useful
because they can offer direct and simple ways to do things that in reality
may be somewhat complex or arcane. Users interact with and manipulate
objects via specific, documented methods and never have to worry about what
is going on "behind the scenes." Many perl programmers have devoted
significant amounts of time and effort to creating easy-to-use "wrappers"
around complex or abstract tasks. Visit the CPAN Module list at 
(URL) http://www.perl.com/perl/CPAN/CPAN.html to see the fruits of their labor.
 
 o From the perspective of a perl power-user, object-oriented programming
allows programmers to write code that is easily scalable and reusable. This
allows powerful applications to be built rapidly and with a minimum of
waste or repeated effort.
 


Using Bio::PreSeq in your perl programs

PreSeq.pm is invoked via the perl 'use' command:

   use PreSeq;


Creating a biosequence object

The ``constructor'' method in PreSeq.pm is the new() function.

The proper syntax for accessing the new() function in PreSeq.pm is as follows:

   $myseq = Bio::PreSeq->new;

Of course, objects are only useful if they have something in them, so you would probably want to pass along some additional information or arguments to the constructor. The foundation of any biosequence object is of course the sequence itself.

You can address new() with a sequence directly:

   $myseq = Bio::PreSeq->new(-seq=>'AACTGGCGTTCGTG');

Or you can pass in a string or a list:

   $myseq = Bio::PreSeq->new(-seq=>$sequence_string);
   $myseq = Bio::PreSeq->new(-seq=>@sequence_list);

It is also possible to create a new sequence object based on a sequence contained in a file. You can tell the constructor where to find the sequence file by passing in the 'file' parameter:

   $myseq  = Bio::PreSeq->new(-file=>'seqfile.gcg');

Because there are so many different conventions or formats for storing sequence information in files, it would be polite (although not absolutely necessary) to tell the constructor what format the sequence file is in. We can provide that information via the file-format or 'ffmt' field. To create a sequence object based upon a GCG-formatted sequence file:

   $myseq  = Bio::PreSeq->new(-file=>'seqfile.gcg',-ffmt=>'GCG');
 
We've already introduced three different object attributes or arguments
that can be passed to the new() object constructor ('seq','file' and
'ffmt') so now would be a good time to introduce them all:

BioSeq Constructor Arguments

file: The ``file'' argument should be a string value containing path and filename information for a sequence file that is to be read into an object.

seq: The ``seq'' argument is for passing in sequence directly instead of reading in a sequence file. The sequence should consist of RAW info (no whitespace, newlines or formatting) and can be passed in as either an array/list or string.

id: The ``id'' argument should be a ONE-WORD string value giving a short name for the sequence.

desc: The ``desc'' argument should be a string containing a description of the sequence. This field is not limited to one word.

names: The ``names'' argument should be a hash or reference to a hash that contains any number of user generated key-value pairs. Various bits of identifying information can be stored here including name(s), database locations, accession numbers, URL's, etc.

type: The ``type'' argument should be a string value describing the sequence type, e.g. ``Dna'', ``Rna'' or ``Amino''.

origin: The ``origin'' argument should be a string value describing the sequence origin.

numbering: The ``numbering'' argument should be an integer value containing the sequence numbering offset value. By default all sequences are numbered starting with 1.

ffmt: The ``ffmt'' argument should be a string describing sequence file-format. If a sequence is being read from a file via the ``file'' argument, ``ffmt'' is used to invoke the proper parsing code. ``ffmt'' is also the default format for sequence output when the layout method is called. See elsewhere in this documentation for info regarding recognized sequence file-formats.

If most of these arguments were used at once to create a sequence object, it would look something like this:

   #Set up the name hash
   %names = (
   'CloneID','DB1',
   'Isolate','5',
   'Tissue','Xenopus',
   'Location','/usr2/users/dag/bioperl/sample.tfa'
   );

   $name_ref = \%names;

   #Create the object
   $myseq = new Bio::PreSeq(-file=>'sample.tfa',
                         -names=>$name_ref,
                         -type=>'Dna',
                         -origin=>'Xenopus mesoderm',
                         -numbering=>'1',
                         -desc=>'Sample Bio::PreSeq sequence',
                         -ffmt=>'Fasta');
 


Methods

Once an object has been created, there are defined ways to go about accessing the information. Users are encouraged to poke around ``under the hood'' of PreSeq.pm to see what is going on, but it is considered bad form to bypass the defined accessor methods and mess around with the internal code. Bypassing the defined methods ``voids the warranty'' of the module and can lead to problems down the road. The implied agreement between module creators and users is that the creators will strive to keep the interface standard and backwards-compatible while the users will avoid becoming dependent on bits of internal code that may change or disappear in future revisions. Detailed information about each method described here can be found in the Appendix.


Accessing information

    
For each defined way to access information from a biosequence object, there
is a corresponding "method" that is invoked. What follows is a brief
description of each accessor method. For more detailed information see the
individual annotations for each method near the end of this document. 
 
Sequence
The sequence can be accessed in several ways via the getseq() method.
Depending on how it is invoked, it can return either a string or a list
value.
 
Both examples are appropriate:
 
   @sequence_list   = $myseq->getseq;
   $sequence_string = $myseq->getseq;
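
The context-sensitive return described above is the kind of behaviour perl's built-in wantarray makes possible. The sketch below is a stand-alone illustration under that assumption; seq_in_context() is a hypothetical function, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical stand-alone accessor: returns a list of characters in list
# context and a plain string in scalar context.
sub seq_in_context {
    my ($sequence) = @_;
    return wantarray ? split(//, $sequence) : $sequence;
}

my @sequence_list   = seq_in_context('ACTG');   # list context
my $sequence_string = seq_in_context('ACTG');   # scalar context

print scalar(@sequence_list), " residues; string is $sequence_string\n";
```

The caller never states which form it wants; the left-hand side of the assignment decides.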
 
Sequence "slices" can be accessed by passing start and stop integer
position arguments to getseq():
 
   @slice = $myseq->getseq($start,$stop);
   @slice = $myseq->getseq(1,50);
   @slice = $myseq->getseq(100);
 
If no stop value is passed in, getseq() will return a slice from the start
position to the end of the sequence. Slices are returned in the context of
the object "numbering" attribute, not absolute position, so be aware of the
object's numbering scheme.
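
To make the effect of the numbering attribute concrete, here is a small sketch of how numbered positions might map onto string offsets. It assumes that "numbering" is the number given to the first residue and that the stop position is inclusive; slice_by_numbering() is a hypothetical helper, not a PreSeq.pm method.

```perl
use strict;
use warnings;

# Hypothetical helper. Assumes $numbering is the number of the first residue
# and that $stop is inclusive; neither is confirmed by the module source.
sub slice_by_numbering {
    my ($sequence, $numbering, $start, $stop) = @_;
    my $last = $numbering + length($sequence) - 1;   # number of the final residue
    $stop = $last unless defined $stop;              # no stop: run to the end
    die "slice $start..$stop is outside $numbering..$last\n"
        if $start < $numbering || $stop > $last || $start > $stop;
    return substr($sequence, $start - $numbering, $stop - $start + 1);
}

print slice_by_numbering('ACGTACGT', 1, 3, 5), "\n";    # prints "GTA"
print slice_by_numbering('ACGTACGT', 101, 103), "\n";   # prints "GTACGT"
```

With a numbering of 101, position 103 is the third residue, which is why the second call starts at the same letter as the first.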
 
Sequences can also be accessed with the ary() and str() methods. The
ary() method will always return a list value and str() will always return a
string. Otherwise they are functionally identical to the getseq() method.
 
   $sequence = $myseq->str;
   @sequence = $myseq->ary;
 
   @slice = $myseq->ary($start,$stop);
   $slice = $myseq->str($start,$stop);
 
Sequence length
The sequence length can be accessed by
   $len = $myseq->seq_len;
 
Sequence ID
The ID field can be accessed by
   $ID = $myseq->id;
  
Description
The object description field can be accessed by
   $description = $myseq->desc;
  
Names
The associative array (hash) that contains flexible information regarding
alternative sequence names, database locations, accession numbers, etc. can
be accessed by
   %name_hash = $myseq->names;
 
Sequence numbering
The default numbering offset for the sequence can be accessed by
   $numbering = $myseq->numbering;
 
Sequence Origin
The object origin field can be accessed by
  $seq_origin = $myseq->origin;
 
File input format / default output format
The object format field can be accessed by
   $format = $myseq->ffmt;
 


Changing Information in Sequence Objects

 
In the previous section it was shown how object attributes and values could
be retrieved from a sequence object by calling upon various methods. Many
of the above methods will also allow the user to CHANGE object attributes
by passing in additional arguments. Detailed information on each method can
be found in the Appendix.
 
Changing the sequence
The sequence information for an object can be changed by passing a string
or list value to the setseq() method. Here are some ways that sequence
information can be changed:
 
   $myseq->setseq($new_sequence_string);
   $myseq->setseq(@new_sequence_list);
   $myseq->setseq("aaccttgcctgc");
 
The setseq() method checks sequence elements and warns if it finds
non-standard characters. Because of this, arbitrary sequence compositions
are not supported at this time. This method is considered slightly
'insecure' because the 'id','desc' and 'type' fields are not updated
along with the sequence. If necessary, the user must make the appropriate
changes to these fields whenever sequence information is updated or changed.
 
Changing the sequence ID
The ID field can be changed by passing in a new ID argument
   $myseq->id($new_id);
 
Changing the object description
The object description field can be changed by passing in a new argument
   $myseq->desc($new_desc);
 
Changing the object names hash
The associative array (hash) that contains flexible information regarding
alternative sequence names, database locations, accession numbers, etc. can
be changed by passing in a reference to a new hash.
 
   $hash_ref = \%name_hash;
   $myseq->names($hash_ref);
 
Changing the sequence numbering offset
The default numbering offset for the sequence can be changed by passing in
a new value
   $myseq->numbering(1);
   $myseq->numbering($new_value);
 
Sequence Origin
The object origin field can be changed by passing in a new string value
  $myseq->origin("mitochondrial");
  $myseq->origin($origin_string);
 
File input format / default output format
The object format field can be changed by passing in a new value
   $myseq->ffmt("GCG"); 


Manipulating sequences

     
Creating, accessing and changing biosequence objects and fields is all well
and good, but eventually you are going to want to actually do some work.

Included with PreSeq.pm are some commonly used utility methods for manipulating sequence data. So far PreSeq.pm contains methods for:

 Copying a biosequence object
    $new_obj = $myseq->copy;

 Reversing a sequence
    $reversed_seq = $myseq->reverse;

 Complementing a sequence
 The 2nd strand, or "complement" of a biosequence can be obtained by calling
 upon the complement method.
    $comp_seq = $myseq->complement;

 Reverse complementing a sequence
    $rev_comp = $myseq->revcom;
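
For readers who want to see what these three operations amount to on a plain string, the following is a rough sketch using tr/// and reverse. It is not the module's implementation; complement_dna(), reverse_seq() and revcom_dna() are hypothetical names, and the handling of ambiguity symbols is an assumption based on the extended alphabet table.

```perl
use strict;
use warnings;

# Complement a Dna string, including the ambiguity symbols
# (M<->K, R<->Y, V<->B, H<->D; W, S and N are their own complements).
sub complement_dna {
    my ($dna) = @_;
    (my $comp = $dna) =~
        tr/ACGTMRWSYKVHDBNacgtmrwsykvhdbn/TGCAKYWSRMBDHVNtgcakywsrmbdhvn/;
    return $comp;   # gaps and other symbols pass through unchanged
}

# Reverse the order of residues.
sub reverse_seq {
    my ($sequence) = @_;
    return scalar reverse $sequence;
}

# Reverse complement: complement, then reverse.
sub revcom_dna {
    my ($dna) = @_;
    return reverse_seq(complement_dna($dna));
}

print revcom_dna('AACG'), "\n";   # prints "CGTT"
```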

 Converting Dna to Rna
    $rna_seq = $myseq->Dna_to_Rna;

 Converting Rna to Dna
    $dna_seq = $myseq->Rna_to_Dna;
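
On a plain string, the Dna/Rna conversions can be pictured as a simple exchange of T and U. The sketch below assumes exactly that and nothing more (no strand change); dna_to_rna() and rna_to_dna() are hypothetical stand-alone functions, not the module's methods.

```perl
use strict;
use warnings;

# Assumed behaviour: swap T for U, preserving case and all other symbols.
sub dna_to_rna {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;
    return $rna;
}

# The inverse substitution.
sub rna_to_dna {
    my ($rna) = @_;
    (my $dna = $rna) =~ tr/Uu/Tt/;
    return $dna;
}

print dna_to_rna('ATGCtt'), "\n";   # prints "AUGCuu"
```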

 Translating Dna or Rna to protein
    $peptide_seq = $myseq->translate;
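
As an illustration of what protein translation involves, here is a compact sketch that translates the first reading frame with the standard genetic code. The choice of frame, the use of 'X' for codons it cannot resolve, and the name translate_frame1() are all assumptions made for this example; the module's translate() may behave differently.

```perl
use strict;
use warnings;

# Standard genetic code, laid out with bases in T, C, A, G order.
my @BASES = qw(T C A G);
my $AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

my %CODON;
my $index = 0;
for my $first (@BASES) {
    for my $second (@BASES) {
        for my $third (@BASES) {
            $CODON{"$first$second$third"} = substr($AMINO, $index++, 1);
        }
    }
}

# Translate frame 1 of a Dna or Rna string; incomplete trailing codons are
# dropped and unresolvable codons become 'X'.
sub translate_frame1 {
    my ($nucleotides) = @_;
    (my $dna = uc $nucleotides) =~ tr/U/T/;
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length $dna; $pos += 3) {
        my $codon = substr($dna, $pos, 3);
        $protein .= exists $CODON{$codon} ? $CODON{$codon} : 'X';
    }
    return $protein;
}

print translate_frame1('ATGGCCTAA'), "\n";   # prints "MA*"
```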

 Checking the sequence alphabet
 To check if any nonstandard characters are present in a biosequence, an
 alphabet_ok() method is provided. The method returns "1" if everything is
 OK, otherwise it returns a "0".

   if($myseq->alphabet_ok) { print "OK!!\n"; }
    else { print "Not OK! \n"; }
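
A check of this kind can be sketched with one character class per sequence type. The version below is an assumption-laden illustration: it treats T as Dna-only and U as Rna-only, accepts '?' and '-' for every type, and ignores case. %ALPHABET and alphabet_check() are hypothetical names.

```perl
use strict;
use warnings;

# Allowed symbols per sequence type, taken from the alphabet tables.
# Splitting T and U between Dna and Rna is an assumption of this sketch.
my %ALPHABET = (
    Dna   => 'ACGTMRWSYKVHDBXN',
    Rna   => 'ACGUMRWSYKVHDBXN',
    Amino => 'ABCDEFGHIKLMNPQRSTVWXYZ*',
);

# Return 1 if every character is in the alphabet for $type (plus '?' and '-'),
# otherwise 0. Unknown types also return 0.
sub alphabet_check {
    my ($sequence, $type) = @_;
    my $allowed = $ALPHABET{$type};
    return 0 unless defined $allowed;
    return $sequence =~ /\A[\Q$allowed\E?-]*\z/i ? 1 : 0;
}

print alphabet_check('ACGTN-?', 'Dna'), "\n";   # prints "1"
print alphabet_check('ACGU',    'Dna'), "\n";   # prints "0"
```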


Sequence Output

     
There are several methods for outputting formatted sequences. For your
convenience, a "meta-output" method called layout() also exists.
 
If layout() is called without any arguments, it calls upon the output
methods as defined by the "ffmt" field.
   print $myseq->layout;
 
The "ffmt" field is mainly used to describe the format of a sequence
being read in from a file. It is also used as the default format for
all sequence output. If these differ (i.e., the format in which the
sequence was read in is not desired as the default output style) then
"ffmt" should be set manually via the ffmt() accessor method. Of course,
after reading the sequence in you are free to change "ffmt" at will.
 
layout() can also be called with specific formats:
   $gcg_formatted_seq = $myseq->layout("GCG");
   $fasta_seq = $myseq->layout("Fasta");
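
One way to picture a "meta-output" method is as a dispatch table keyed by format name, falling back to the record's "ffmt" value when no format is given. The sketch below works on a plain hash rather than a Bio::PreSeq object; %FORMATTER, layout_record() and the 60-column Fasta wrapping are assumptions for illustration only.

```perl
use strict;
use warnings;

# Format name => formatting routine. Only two formats are sketched here.
my %FORMATTER = (
    Raw   => sub { my ($rec) = @_; return $rec->{seq} . "\n" },
    Fasta => sub {
        my ($rec) = @_;
        my $header = ">$rec->{id} $rec->{desc}\n";
        # Wrap the sequence at 60 residues per line (an assumed width).
        (my $body = $rec->{seq}) =~ s/(.{1,60})/$1\n/g;
        return $header . $body;
    },
);

# Use the requested format, or the record's own "ffmt" when none is given.
sub layout_record {
    my ($rec, $format) = @_;
    $format = $rec->{ffmt} unless defined $format;
    my $formatter = $FORMATTER{$format}
        or die "unknown output format '$format'\n";
    return $formatter->($rec);
}

my %record = (id => 's1', desc => 'demo', seq => 'ACGT', ffmt => 'Fasta');
print layout_record(\%record);          # Fasta, from the ffmt field
print layout_record(\%record, 'Raw');   # explicit format
```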
 
Calling output methods directly
 
Many output methods accept unique named parameters/arguments that allow a
greater degree of control over output format and style. To take advantage
of these abilities, the formatting methods must be called directly. See the
appendix notes describing each output format for detailed information.
 
  print $myseq->out_GCG(-date=>"10 May 1996",
                        -caps=>"up");
 
Most output methods will return either a string or list value depending
on how they are invoked. Check the detailed method documentation in 
the Appendix to be sure. 
 
   @formatted_seqlist = $myseq->out_genbank(-id=>'New ID',
                                            -def=>'User defined definition',
                                            -acc=>'User defined accession');
 
   $formatted_seqstring = $myseq->out_genbank(-id=>'New ID',
                                              -def=>'User defined definition',
                                              -acc=>'User defined accession');
 


DIY - Using Bio::PreSeq as a base for your own work

 
[to be completed]


We want *you* / getting involved

 
[to be completed]


Bugs

[to be completed]


APPENDIX

The following documentation describes the various functions contained in this module. Some functions are for internal use and are not meant to be called by the user; they are preceded by an underscore (``_'').


new()

   
 Title     : new
 Usage     : $mySeq = Bio::PreSeq->new($file,$seq,$id,$desc,$names,
                         $numbering,$type,$ffmt,$descffmt);
           :                - or -
           : $mySeq = Bio::PreSeq->new(-file=>$file,
                                   -seq=>$seq,
                                   -id=>$id,
                                   -desc=>$desc,
                                   -names=>$names,
                                   -numbering=>$numbering,
                                   -type=>$type,
                                   -origin=>$origin,
                                   -ffmt=>$ffmt,
                                   -descffmt=>$descffmt);
 Function  : The constructor for this class, returns a new object.
 Example   : See usage
 Returns   : Bio::PreSeq object
 Argument  : $file: file from which the sequence data can be read; all
               the other arguments will overwrite the data read in.
               "_nofile" is recommended if no file is given.
             $seq: String or array of characters
             $id: String describing the ID the user wishes to assign.
             $desc: String giving a description of the sequence
             $names: A reference to a hash which stores {loc,name}
                     pairs of other database locations and corresponding names
                     where the sequence is located.
             $numbering: The offset of the sequence, as an integer
             $type: The type of the sequence, see type()
             $origin: The sequence origin
             $ffmt: Sequence format, see ffmt()
             $descffmt: format of $desc, see descffmt()


_initialize()

    
 Title     : _initialize
 Usage     : n/a (internal function)
 Function  : Assigns initial parameters to a blessed object.
 Example   : 
 Returns   : 
 Argument  : As Bio::PreSeq->new, allows for named or listed parameters.
             See ->new for the legal types of these values.


_rearrange()

 
 Title     : _rearrange
 Usage     : n/a (internal function)
 Function  : Rearranges named parameters to requested order.
 Example   : $self->_rearrange([SEQUENCE,ID,DESC],@p);
 Returns   : @params - an array of parameters in the requested order.
 Argument  : $order : a reference to an array which describes the desired
                      order of the named parameters.
             @param : an array of parameters, either as a list (in
                      which case the function simply returns the list),
                      or as an associative array (in which case the
                      function sorts the values according to @{$order}
                      and returns that new array).
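
The behaviour described for _rearrange() can be approximated in a few lines. This is a sketch written from the description above, not the module's code: rearrange_params() is a hypothetical name, and treating "first argument starts with a dash" as the test for named parameters is an assumption (a positional sequence that begins with a gap character would be misread).

```perl
use strict;
use warnings;

# Put named parameters (-key => value) into the order given by $order.
# A positional list is returned unchanged.
sub rearrange_params {
    my ($order, @params) = @_;
    return @params
        unless @params && defined $params[0] && $params[0] =~ /^-/;

    my %named;
    while (my ($key, $value) = splice(@params, 0, 2)) {
        $key =~ s/^-//;                 # drop the leading dash
        $named{ uc $key } = $value;     # names are matched case-insensitively
    }
    return map { $named{ uc $_ } } @$order;
}

my ($seq, $id, $desc) = rearrange_params(
    [qw(SEQ ID DESC)],
    -id => 'clone5', -seq => 'ACGT',
);
print "$id: $seq\n";   # prints "clone5: ACGT"; $desc is undefined
```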


_seq()

 
 Title     : _seq()
 Usage     : n/a, internal function
 Function  : called by new() to set sequence field. Checks
           : alphabet before setting.
           :
 Returns   : n/a
 Argument  : sequence string


_monomer()

 
 Title     : _monomer()
 Usage     : n/a, internal function
 Function  : Returns the internal monomer that represents
           : sequence type.
           :
           : Sequence type is treated internally as a monomer
           : defined by the %SeqAlph hash. The type field
           : is a list of format [monomer,origin]. For any
           : output outside the module, the monomer is resolved
           : back into string form via the %TypeSeq hash.
           :
 Returns   : original type setting [as monomer]
 Argument  : none


_file_read()

 
 Title     : _file_read()
 Usage     : n/a (Internal Function)
 Function  : _file_read is called whenever the constructor is called 
           : with the name of a sequence to be read from disk.
           :
 Example   : n/a, only called upon by _initialize()
 Returns   : 
 Argument  : 


## Accessors ##


str()

 
 Title     : str
 Usage     : str([$start,[$end]])
 Function  : Returns the sequence of the object as a string, or a slice
             of the sequence if $start/$end are defined. If $start is
             defined and $end isn't, the slice is from $start to the
             end of the sequence.
 Example   : $slice = $myObject->str(3,9);
 Returns   : string scalar
 Argument  : $start,$end (both integers). They are interpreted w.r.t. the
             specific numeration of the sequence!! ($self->{numbering})


getseq()

 
 Title     : getseq
 Usage     : getseq([$start,[$end]])
 Function  : Returns the sequence of the object as an array or a char
             string, depending on the value of wantarray. Will return a slice
             of the sequence if $start/$end are defined. If $start is
             defined and $end isn't, the slice is from $start to the
             end of the sequence.
 Example   : @slice = $myObject->getseq(3,9);
 Returns   : regular array of characters, or a scalar string
 Argument  : $start,$end (both integers). They are interpreted w.r.t. the
             specific numeration of the sequence!! ($self->{numbering})


id()

 
 Title     : id()
 Usage     : $seq_id = $myseq->id; 
           : $myseq->id($id_string);
           :
 Function  : Sets field if an ID argument string is
           : passed in. If no arguments, returns ID value for
           : object.
           :
 Returns   : original ID value
 Argument  : ID string
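
The "set if an argument is passed, return the original value" pattern used by id() and the accessors that follow can be sketched as below. Demo::Record is a hypothetical class for illustration, and reading "original value" as "the value held before the call" is an assumption.

```perl
use strict;
use warnings;

package Demo::Record;   # hypothetical class, for illustration only

sub new {
    my ($class, %args) = @_;
    return bless { id => $args{id} }, $class;
}

# Combined getter/setter: always returns the value held before the call.
sub id {
    my ($self, $new_id) = @_;
    my $original = $self->{id};
    $self->{id} = $new_id if defined $new_id;
    return $original;
}

package main;

my $rec = Demo::Record->new(id => 'seq1');
my $old = $rec->id('seq2');
print "was $old, now ", $rec->id, "\n";   # prints "was seq1, now seq2"
```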


desc()

 
 Title     : desc()
 Usage     : $description = $myseq->desc; 
           : $myseq->desc($desc_string);
           :
 Function  : Sets field if an argument string is
           : passed in. If no arguments, returns original value for
           : object description field.
           :
 Returns   : original value for description
 Argument  : description string


names()

 
 Title     : names()
 Usage     : %names = $myseq->names; 
           : $myseq->names($hash_ref);
           :
 Function  : Sets field if a name hash reference is
           : passed in. If no arguments, returns original 
           : names hash.
           :
 Returns   : hash reference (associative array)
 Argument  : reference to a hash (associative array)


numbering()

   
 Title     : numbering()
 Usage     : $num_start = $myseq->numbering; 
           : $myseq->numbering($value);
           :
 Function  : Sets field if an argument is
           : passed in. If no arguments, returns original value.
           :
 Returns   : original value 
 Argument  : new value


origin()

   
 Title     : origin()
 Usage     : $myseq->origin($value) 
 Function  : Sets the origin field which is actually the second
           : field of the Type list. The {type} field is a 2 value list
           : with a format of ["Monomer","Origin"]
           :
 Returns   : Original value
 Argument  : string


type()

 
 Title     : type()
 Usage     : $myseq->type($value) 
 Function  : Sets the type field which is the first
           : field of the Type list. The {type} field is a 2 value list
           : with a format of ["Monomer","Origin"]
           :
 Returns   : Original value
 Argument  : string containing a valid sequence type


ffmt()

    
 Title     : ffmt()
 Usage     : $format = $myseq->ffmt;
           : $myseq->ffmt("Fasta");
           : 
 Function  : The file format field is used by the internal
           : sequence parsing code when trying to read 
           : in a sequence file. It is also what is used
           : as a default output format if the layout
           : method is called without an argument.
           :
           : If a sequence object is created without
           : reading in a file, or if the file is read
           : in with the use of the ReadSeq package then
           : the ffmt field can be set to indicate any default
           : output-format preference.
           :
           : If a sequence is read from a file and parsed
           : by internal code (ReadSeq not used) then the ffmt
           : field should describe the format of the sequence
           : file. The ffmt field is used to send the sequence
           : to the correct internal parsing code.
           :
 Returns   : original ffmt value
 Argument  : recognized ffmt string value (see list of recognized 
           : formats)


descffmt()

 
 Title     : descffmt()
 Usage     : $desc = $myseq->descffmt;
           : $myseq->descffmt($new_value); 
 Function  : 
           :
 Returns   : original value


setseq()

 
 Title     : setseq()
 Usage     : $self->setseq($new_sequence);
 Function  : Changes the sequence inside a bioseq object
           :
 Returns   :
 Argument  : sequence string


parse()

 
 Title     : parse
 Usage     : parse($ent,[$ffmt]);
 Function  : Invokes the proper parsing code depending on
           : the value of the object 'ffmt' field.
 Example   : $self->parse;
 Returns   : n/a
 Argument  : the prospective sequence to be parsed, 
           : and optionally its format so that it doesn't need to
           : be estimated


parse_unknown()

 
 Title     : parse_unknown
 Usage     : parse_unknown($ent);
 Function  : tries to figure out the format of $ent and then
           : calls the appropriate function to parse it into $self->{seq}.
 Example   : $self->parse_unknown;
 Returns   : n/a
 Argument  : $ent : the rough multi-line string to be parsed


parse_bad()

 
 Title     : parse_bad
 Usage     : parse_bad;
 Function  : complains of un-parsable sequence, last-ditch attempt via
           : Parse.pm if sequence is being read from a file.
           :
 Example   : $self->parse_bad;
 Returns   : n/a
 Argument  : n/a


parse_raw()

 
 Title     : parse_raw
 Usage     : parse_raw;
 Function  : parses $ent into the $self->{seq} field, using Raw
           : file format.
 Example   : $self->parse_raw;
 Returns   : n/a
 Argument  : n/a


parse_fasta()

 
 Title     : parse_fasta
 Usage     : parse_fasta;
 Function  : parses $ent into the "seq" field, using Fasta
           : file format.
           :
 To-do     : use benchmark module to find best/fastest parse
           : method
           :
 Example   : $self->parse_fasta;
 Returns   : n/a
 Argument  : n/a
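
 As an illustration of Fasta parsing in general, the sketch below
 splits one entry into an id, a description and a sequence string.
 parse_fasta_entry() and the returned hash layout are hypothetical and
 are not taken from the Bio::PreSeq source.

```perl
use strict;
use warnings;

# Hypothetical helper: split one Fasta entry into id, desc and seq.
sub parse_fasta_entry {
    my ($ent) = @_;
    my @lines  = split /\r?\n/, $ent;
    my $header = shift @lines;
    # The header must start with ">"; the first word is the id.
    return unless defined $header && $header =~ /^>\s*(\S*)\s*(.*)$/;
    my ($id, $desc) = ($1, $2);
    my $seq = join '', @lines;
    $seq =~ s/\s+//g;                 # drop any embedded whitespace
    return { id => $id, desc => $desc, seq => $seq };
}

my $rec = parse_fasta_entry(">sample Sample sequence\nACTGTG\nGCGTCA\n");
print $rec->{id}, ': ', $rec->{seq}, "\n";   # sample: ACTGTGGCGTCA
```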


parse_gcg()

 
 Title    : parse_gcg
 Usage    : used by internal code
 Function : Parses the sequence out of a gcg-format string and
          : sets the object sequence field accordingly. This is
          : a simple, inefficient method for grabbing JUST the
          : sequence.
          :
 To-do    : - parse out more info than just sequence 
          : - implement alphabet checking
          : - better regular expressions/efficiency
          : - carp on unexpected / wrong-format situations
          :
 Version  : .01 / 16 Jan 1997 
 Returns  : 1
 Argument : gcg-formatted sequence string
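
 One simple way to grab just the sequence from a GCG-format string is
 to take everything after the first line that ends in ".." and strip
 the position numbers and whitespace. The sketch below does that;
 gcg_sequence() is a hypothetical name and this is not necessarily how
 parse_gcg() itself works.

```perl
use strict;
use warnings;

# Hypothetical helper: return only the residues of a GCG-format entry.
sub gcg_sequence {
    my ($ent) = @_;
    # Split once at the end of the first line that ends in "..".
    my (undef, $body) = split /\.\.[ \t]*\r?\n/, $ent, 2;
    return unless defined $body;      # no ".." divider line found
    $body =~ s/[\s\d]+//g;            # remove numbering and whitespace
    return $body;
}

my $entry = "Sample header\n sample  Length: 8  ..\n\n       1  acgt acgt\n";
print gcg_sequence($entry), "\n";     # acgtacgt
```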


## Methods for file format and output ##

layout()

 
 Title     : layout()
 Usage     : layout([$format]);
 Function  : Returns the sequence in whichever format the user specifies,
           : or in the format named by the "ffmt" field if none is given.
 Example   : $fastaFormattedSeq = $myObj->layout("Fasta");
 Returns   : varies
 Argument  : $format (one of the formats as defined in $SeqForm).
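
 The behaviour described for layout() (use the requested format, fall
 back to the "ffmt" field, fail on an unknown format) can be sketched
 with a dispatch table. The names layout_sketch() and %formatter are
 hypothetical, and the two formatters are deliberately minimal.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical dispatch table: lowercase format name => formatter sub.
my %formatter = (
    raw   => sub { my ($obj) = @_; return $obj->{seq} },
    fasta => sub { my ($obj) = @_; return '>' . $obj->{id} . "\n" . $obj->{seq} . "\n" },
);

# Use the requested format, else the object's "ffmt" field, and croak
# when no formatter is registered for it.
sub layout_sketch {
    my ($obj, $format) = @_;
    $format = $obj->{ffmt} unless defined $format;
    my $code = $formatter{ lc $format }
        or croak "Unknown output format: $format";
    return $code->($obj);
}

my $obj = { id => 'sample', seq => 'ACTG', ffmt => 'Raw' };
print layout_sketch($obj), "\n";       # falls back to the ffmt field
print layout_sketch($obj, 'Fasta');    # explicit format
```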


out_bad()

 
 Title     : out_bad()
 Usage     : out_bad;
 Function  : Croaks if we don't know the output format.
 Example   : $self->out_bad;
 Returns   : n/a
 Argument  : n/a


out_raw()

 
 Title     : out_raw
 Usage     : out_raw;
 Function  : Returns the sequence in Raw format.
 Example   : $self->out_raw;
 Returns   : string sequence, in raw format
 Argument  : n/a


out_fasta()

 
 Title     : out_fasta
 Usage     : out_fasta;
 Function  : Returns the sequence as a string in FASTA format.
 Example   : $self->out_fasta;
           :
 To-do     : benchmark code / find fastest method
           :
 Returns   : string sequence in Fasta format
 Argument  : n/a
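
 A minimal sketch of Fasta output follows. format_fasta() is a
 hypothetical helper, and the default line width of 60 is an
 assumption rather than a value taken from Bio::PreSeq.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Fasta entry from id, description and
# sequence, wrapping the sequence at $width characters per line.
sub format_fasta {
    my ($id, $desc, $seq, $width) = @_;
    $width ||= 60;
    my $out = ">$id";
    $out .= " $desc" if defined $desc && length $desc;
    $out .= "\n";
    for (my $pos = 0; $pos < length($seq); $pos += $width) {
        $out .= substr($seq, $pos, $width) . "\n";
    }
    return $out;
}

print format_fasta('sample', 'Sample sequence', 'ACTGTGGCGTCAACTG', 10);
```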


dump()

 
 Title     : dump
 Usage     : @results = $mySeq->dump; -or- 
           : $results = $mySeq->dump;
           :
 Function  : Returns a formatted array or string (depending on how it
           : is invoked) containing the contents of a 
           : Bio::PreSeq object. Useful for debugging
           :
           : ***This is used by Chris Dagdigian for debugging ***
           : ***Probably should be removed before distribution***
           :
 Example   :  @results = $mySeq->dump;
           :  foreach(@results){print;}
           :     -or-
           :  print $myseq->dump;
           :
 Returns   : Array or string depending on value of wantarray
 Argument  : n/a
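
 Returning an array or a string depending on the calling context is
 done in Perl with wantarray. The sketch below shows the idiom on a
 plain hash reference; dump_fields() is a hypothetical helper and its
 output layout is not the layout produced by dump().

```perl
use strict;
use warnings;

# Hypothetical helper: one "field = value" line per hash key, returned
# as a list in list context and as one joined string otherwise.
sub dump_fields {
    my ($obj) = @_;
    my @lines = map {
        my $value = defined $obj->{$_} ? $obj->{$_} : '(undef)';
        "$_ = $value\n";
    } sort keys %$obj;
    return wantarray ? @lines : join('', @lines);
}

my $obj       = { id => 'sample', seq => 'ACTG' };
my @as_list   = dump_fields($obj);    # list context: two elements
my $as_string = dump_fields($obj);    # scalar context: one string
print $as_string;
```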


out_primer()

 
 Title     : out_primer()
 Usage     : $formatted_seq = $myseq->out_primer;
           : @formatted_seq = $myseq->out_primer;
           :
           : print $myseq->out_primer(-id=>'New ID',
           :                          -header=>'This is my header');
           :
 Function  : outputs a sequence in primer format
           :
 Note      : Not a supported output type (can't be invoked via layout()).
           : Use at your own risk :)
           : 
 Example   : see usage
           :
 Revision  : 0.01 / 20 Dec 1996
 Returns   : string or list, depending on how it is invoked
 Argument  : named list parameters for "id" and "header" are allowed


out_pir()

 
 Title     : out_pir()
 Usage     : $formatted_seq = $myseq->layout("PIR");
           : $formatted_seq = $myseq->out_pir;
           : @formatted_seq = $myseq->out_pir;
           :
           : print $myseq->out_pir(-title=>'New TITLE',
           :                       -entry=>'New ENTRY',
           :                       -acc=>'User defined accession',
           :                       -date=>'User defined date',
           :                       -reference=>'User defined ref info');
           :
 Function  : Returns a string or an array depending on how it
           : is invoked. Can be easily accessed via the layout()
           : method, or if more output control is desired it can
           : be called directly with the following named parameters:
           :
           :  -entry      PIR entry
           :  -title      PIR title
           :  -acc        user defined accession number
           :  -reference  user defined reference
           :  -date       user defined date/time info
           :
           : All named parameters will take precedence over any
           : default behavior. When there are no user arguments,
           : the default output is as follows:
           :
           : PIR 'ENTRY'     = sequence object "id" field
           : PIR 'TITLE'     = sequence object "desc" field
           : PIR 'DATE'      = current date/time
           : PIR 'ACC'       = not used in default output
           : PIR 'REFERENCE' = not used in default output
           :
 Note      : Not tested stringently.
           :
 WARNING   : Does not deal with numbering issue
           :
 To-do     : - Allow user to pass in hash of additional fields/values
           : - Deal with numbering issue
           :
 Example   : see usage
           :
 Revision  : 0.02 / 12 Jan 1997
 Returns   : string or list, depending on how it is invoked
 Argument  : named list parameters are allowed, see above


out_genbank()

 
 Title     : out_genbank()
 Usage     : $formatted_seq = $myseq->out_genbank;
           : @formatted_seq = $myseq->out_genbank;
           : print $myseq->out_genbank(-id=>'New ID',
           :                           -def=>'User defined definition',
           :                           -acc=>'User defined accession',
           :                           -origin=>'User defined origin info',
           :                           -spacing=>'single',
           :                           -caps=>'up',
           :                           -date=>'DATE GOES HERE',
           :                           -type=>'mRna');
           :   
 Function  : Returns a GenBank formatted sequence array or string
           : depending on the value of wantarray when invoked via layout(). 
           : If more control is desired over output format, out_genbank() 
           : can be addressed directly with the following named parameters:
           :
           : def          - Sequence definition information
           : acc          - Sequence accession number
           : origin       - Sequence origin information
           : id           - short name 
           : date         - new date info
           : type         - sequence type (Dna, mRna, Amino, etc.)
           : spacing      - "single" or "double" sequence line spacing
           : caps         - "up" or "down" sequence capitalization
           :
           : When invoked via layout() or called directly with no 
           : arguments, the following default behaviours apply:
           :  DATE = Current date and time
           :  DEFINITION = object's description field
           :  ID = object's ID field
           :  SPACING = single
           :
           : All named parameters must be strings. Passed in parameters will
           : always take precedence over any fields with default settings.
           :
 Note      : Format not stringently tested for accuracy. Sequence is numbered
           : according to the integer specified in the object 'numbering' field
           : but the implementation has not been robustly tested.
           :
 To-do     : - allow user hash reference for additional format fields
           :
 Example   : see usage
           :
 Revision  : 0.02 / 12 Jan 1997
 Returns   : string or list, depending on how it is invoked
 Argument  : named list parameters are allowed, see above
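
 The rule that passed-in parameters take precedence over default
 settings can be sketched as a hash of defaults overwritten by the
 caller's named arguments. header_fields() is a hypothetical helper;
 the default keys mirror the defaults listed above, and nothing here
 is taken from the out_genbank() source.

```perl
use strict;
use warnings;

# Hypothetical helper: merge "-name => value" arguments over defaults
# derived from the object.
sub header_fields {
    my ($obj, %args) = @_;
    my %field = (
        id      => $obj->{id},
        def     => $obj->{desc},
        date    => scalar(localtime),
        spacing => 'single',
    );
    for my $key (keys %args) {
        (my $name = $key) =~ s/^-//;   # "-id" becomes "id"
        $field{$name} = $args{$key};   # caller-supplied value wins
    }
    return \%field;
}

my $obj    = { id => 'sample', desc => 'Sample sequence' };
my $fields = header_fields($obj, -id => 'New ID', -caps => 'up');
print $fields->{id}, ' / ', $fields->{def}, ' / ', $fields->{spacing}, "\n";
```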


out_GCG()

 
 Title    : out_GCG
 Usage    : $formatted_seq = $mySeq->layout("GCG"); 
          : @formatted_seq = $mySeq->layout("GCG");
          : 
          : print $myseq->out_GCG(-id=>'New ID',
          :                      -spacing=>'single',
          :                      -caps=>'up',
          :                      -date=>'DATE GOES HERE',
          :                      -header=>'This is a user submitted header',
          :                      -type=>'n');
          :   
 Function : Returns a GCG formatted sequence array or string
          : depending on the value of wantarray when invoked via layout(). 
          : If more control is desired over output format, out_GCG() 
          : can be addressed directly with the following named parameters:
          :
          : header       - first line(s) of formatted sequence
          : id           - short name that appears before 'Length:' field
          : date         - overwrite default date info
          : type         - can be "N" or "P", for nucleotide/protein
          : spacing      - "single" or "double" sequence line spacing
          : caps         - "up" or "down" sequence capitalization
          :
          : When invoked via layout() or called directly with no 
          : arguments, the following default behaviours apply:
          :  DATE = Current date and time
          :  DEFINITION = object's description field
          :  ID = object's ID field
          :  SPACING = single
          :         
          : All named parameters must be strings. Passed in parameters will
          : always take precedence over any fields with default settings.
          :
 Example  :  
 Output   :
          :Sample Bio::PreSeq sequence
          : sample Length: 240  Wed Nov 27 13:24:28 EST 1996  Type: N Check: 5371  ..
          :
          :       1  aaaacctatg gggtgggctc tcaagctgag accctgtgtg cacagccctc
          :      51  tggctggtgg cagtggagac gggatnnnat gacaagcctg ggggacatga
          :     101  ccccagagaa ggaacgggaa caggatgagt gagaggaggt tctaaattat
          :     151  ccattagcac aggctgccag tggtccttgc ataaatgtat agagcacaca
          :     201  ggtgggggga aagggagaga gagaagaagc cagggtataa
          :
          :
 Note     : GCG formatted sequences contain a "Type:" field.
          : If Type cannot be internally determined and no
          : Type name-parameter is passed in then the Type: 
          : field is not printed.
          :
 Warning  : Unconventional numbering offsets may not
          : be robustly handled
          :
 Revision : 0.06 / 12 Jan 1997
 Source   : Found guts of this code on bionet.gcg, unknown author
 Returns  : Array or String
 Argument : named list parameters are allowed, see above
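
 The numbered sequence lines in the sample output above (blocks of 10
 residues, 50 residues per line, each line prefixed with the position
 of its first residue) can be produced along the following lines.
 gcg_body_lines() is a hypothetical helper, and the column widths are
 read off the sample rather than taken from the out_GCG() source.

```perl
use strict;
use warnings;

# Hypothetical helper: format a sequence as numbered lines of five
# 10-residue blocks. $start is the number of the first residue.
sub gcg_body_lines {
    my ($seq, $start) = @_;
    $start = 1 unless defined $start;
    my @lines;
    for (my $pos = 0; $pos < length($seq); $pos += 50) {
        my $chunk  = substr($seq, $pos, 50);
        my @blocks = $chunk =~ /(.{1,10})/g;   # split into blocks of up to 10
        push @lines, sprintf("%8d  %s\n", $start + $pos, join(' ', @blocks));
    }
    return @lines;
}

print gcg_body_lines('aaaacctatggggtgggctctcaagctgagaccctgtgtgcacagccctctggctggtgg');
```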


out_nbrf()

 
 Title     : out_nbrf()
 Usage     : $self->layout("NBRF") or $self->out_nbrf
           :
 Function  : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
           :
           : If the ReadSeq wrapper Parse.pm appears
           : to be configured properly it is used
           : to generate the output. 
           :
           : If Parse.pm cannot be used then this code
           : carps out with an error message.
           :
 To-do     : write internal output code
           :
 Version   : 1.0 /  16 MAR 1997
 Example   : see Usage
 Returns   : FORMATTED STRING (wantarray is not used here!)
 Argument  : 


out_ig()

 
 Title     : out_ig()
 Usage     : $self->layout("IG") or $self->out_ig
           :
 Function  : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
           :
           : If the ReadSeq wrapper Parse.pm appears
           : to be configured properly it is used
           : to generate the output. 
           :
           : If Parse.pm cannot be used then this code
           : carps out with an error message.
           :
 To-do     : write internal output code
           :
 Version   : 1.0 /  16 MAR 1997
 Example   : see Usage
 Returns   : FORMATTED STRING (wantarray is not used here!)
 Argument  : 


out_strider()

 
 Title     : out_strider()
 Usage     : $self->layout("Strider") or $self->out_strider
           :
 Function  : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
           :
           : If the ReadSeq wrapper Parse.pm appears
           : to be configured properly it is used
           : to generate the output. 
           :
           : If Parse.pm cannot be used then this code
           : carps out with an error message.
           :
 To-do     : write internal output code
           :
 Version   : 1.0 /  16 MAR 1997
 Example   : see Usage
 Returns   : FORMATTED STRING (wantarray is not used here!)
 Argument  : 


out_zuker()

   
 Title     : out_zuker()
 Usage     : $self->layout("Zuker") or $self->out_zuker
           :
 Function  : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
           :
           : If the ReadSeq wrapper Parse.pm appears
           : to be configured properly it is used
           : to generate the output. 
           :
           : If Parse.pm cannot be used then this code
           : carps out with an error message.
           :
 To-do     : write internal output code
           :
 Version   : 1.0 /  16 MAR 1997
 Example   : see Usage
 Returns   : FORMATTED STRING (wantarray is not used here!)
 Argument  : 


out_msf()

  
 Title     : out_msf()
 Usage     : $self->layout("MSF") or $self->out_msf
           :
 Function  : FORMAT NOT INTERNALLY IMPLEMENTED YET!!!
           :
           : If the ReadSeq wrapper Parse.pm appears
           : to be configured properly it is used
           : to generate the output. 
           :
           : If Parse.pm cannot be used then this code
           : carps out with an error message.
           :
 To-do     : write internal output code
           :
 Version   : 1.0 /  16 MAR 1997
 Example   : see Usage
 Returns   : FORMATTED STRING (wantarray is not used here!)
 Argument  : 


## Methods for sequence manipulation ##



copy()

  
 Title     : copy
 Usage     : $copyOfObj = $mySeq->copy;
 Function  : Returns an identical copy of the object.
 Example   : see Usage
 Returns   : Bio::PreSeq object ref.
 Argument  : n/a
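
 For an object that is a reference to a hash holding nested data (such
 as a names hash), an identical and independent copy needs a deep
 copy. The sketch below uses dclone() from the core Storable module;
 copy_sketch() is a hypothetical helper, and whether copy() is
 implemented this way is not stated here.

```perl
use strict;
use warnings;
use Storable qw(dclone);

# Hypothetical helper: deep-copy a (possibly blessed) data structure so
# the copy shares no references with the original.
sub copy_sketch {
    my ($obj) = @_;
    return dclone($obj);
}

my $orig = { id => 'sample', seq => 'ACTG', names => { lab_id => 'clone-7' } };
my $copy = copy_sketch($orig);
$copy->{names}{lab_id} = 'changed';
print $orig->{names}{lab_id}, "\n";   # clone-7: the original is untouched
```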


complement()

 
 Title       : complement
 Usage       : $complemented_seq = $mySeq->complement;
 Function    : Returns a char string containing 
             : the complementary sequence (i.e., the other strand)
             : of the original sequence. The translation method
             : is identical to revcom() but the nucleotide order
             : is not reversed.
             :
 Example     :  $complemented_seq = $mySeq->complement;
             :
 Source      : Guts from Jong's <jong@mrc-lmb.cam.ac.uk>
             : library of molbio perl routines
 Note        :
             : The letter codes and complement translations
             : are those proposed by IUB (Nomenclature Committee,
             : 1985, Eur. J. Biochem. 150; 1-5) and are also
             : used by the GCG package. The IUB/GCG letter codes
             : for nucleotide ambiguity are compatible with
             : EMBL, GenBank and PIR database formats but are
             : *NOT* compatible with Staden/Sanger ambiguity
             : symbols. Staden/Sanger use different symbols to
             : represent uncertainty and frame ambiguity.
             :
             : Currently Staden/Sanger are not recognized
             : sequence types.
             :
             : GCG Documentation on sequence symbols:
 URL         : http://www.neb.com/gcgdoc/GCGdoc/Appendices/appendix_iii.html
             : 
             :
 Translation :
             : GCG/IUB    Meaning        Complement
             : ------------------------------------
             :  A            A                T
             :  C            C                G
             :  G            G                C
             :  T            T                A
             :  U            U                A
             :  M          A or C             K
             :  R          A or G             Y
             :  W          A or T             W
             :  S          C or G             S
             :  Y          C or T             R
             :  K          G or T             M
             :  V        A or C or G          B
             :  H        A or C or T          D
             :  D        A or G or T          H
             :  B        C or G or T          V
             :  X      G or A or T or C       X
             :  N      G or A or T or C       N
             :--------------------------------------
             :
 Revision    : 0.01 / 6 Dec 1996
 Returns     : char string
 Argument    : n/a
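
 The translation table above maps directly onto Perl's tr operator.
 The following sketch applies it to a plain string. The names
 complement_sketch() and revcom_sketch() are hypothetical, and the
 handling of lowercase input is an assumption.

```perl
use strict;
use warnings;

# Complement a nucleotide string with the IUB/GCG codes tabulated above.
sub complement_sketch {
    my ($seq) = @_;
    (my $comp = $seq) =~
        tr/ACGTUMRWSYKVHDBXNacgtumrwsykvhdbxn/TGCAAKYWSRMBDHVXNtgcaakywsrmbdhvxn/;
    return $comp;
}

# Reverse complement: same translation, then reverse the string.
sub revcom_sketch {
    my ($seq) = @_;
    return scalar reverse complement_sketch($seq);
}

print complement_sketch('ACGTRYN'), "\n";   # TGCAYRN
print revcom_sketch('AACG'), "\n";          # CGTT
```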


reverse()

 
 Title     : reverse
 Usage     : $reversed_seq = $mySeq->reverse;
 Function  : Returns a char string containing the
           : reverse of the object sequence
           : 
 Example   :  $reversed_seq = $mySeq->reverse;
           :
 Revision  : 0.01 / 6 Dec 1996
 Returns   : char string
 Argument  : n/a
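
 When reversing a sequence string in Perl, note that the built-in
 reverse only reverses characters in scalar context; in list context
 it reverses the order of its argument list instead. A minimal sketch
 (reverse_sketch() is a hypothetical helper):

```perl
use strict;
use warnings;

# Force scalar context so the characters are reversed even when the
# caller is in list context.
sub reverse_sketch {
    my ($seq) = @_;
    return scalar reverse $seq;
}

my @result = (reverse_sketch('ACTG'));   # called in list context
print $result[0], "\n";                  # GTCA
```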


Dna_to_Rna()

 
 Title     : Dna_to_Rna
 Usage     : $translated_seq = $mySeq->Dna_to_Rna;
 Function  : Returns a char string containing the
           : Rna translation of the Dna nucleotide sequence
           : (Replaces T with U)
           : 
 Example   : $translated_seq = $mySeq->Dna_to_Rna;
           :
 Source    : modified from Jong's <jong@mrc-lmb.cam.ac.uk>
           : library of molbio perl routines
           :
 Revision  : 0.01 / 6 Dec 1996
 Returns   : char string
 Argument  : n/a


Rna_to_Dna()

 
 Title     : Rna_to_Dna
 Usage     : $translated_seq = $mySeq->Rna_to_Dna;
 Function  : Returns a char string containing the
           : Dna translation of the Rna nucleotide sequence
           : (Replaces U with T)
           : 
 Example   : $translated_seq = $mySeq->Rna_to_Dna;
           :
 Revision  : 0.01 / 16 MAR 1997
 Returns   : char string
 Argument  : n/a
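
 Both conversions amount to a single character substitution, T to U
 and back. A sketch on plain strings follows; the helper names are
 hypothetical, and converting lowercase letters too is an assumption.

```perl
use strict;
use warnings;

# Each helper returns a new string and leaves its argument unchanged.
sub dna_to_rna_sketch {
    my ($seq) = @_;
    (my $rna = $seq) =~ tr/Tt/Uu/;
    return $rna;
}

sub rna_to_dna_sketch {
    my ($seq) = @_;
    (my $dna = $seq) =~ tr/Uu/Tt/;
    return $dna;
}

print dna_to_rna_sketch('ATGTTC'), "\n";   # AUGUUC
print rna_to_dna_sketch('AUGUUC'), "\n";   # ATGTTC
```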


translate()

 
 Title     : translate
 Usage     : $translation = $mySeq->translate;
 Function  : Returns a char string containing the single-letter
           : protein translation of a Dna/Rna sequence
           :
           : "*" is the default symbol for a stop codon
           : "X" is the default symbol for an unknown codon
           :
 Example   : $translation = $mySeq->translate;
           :   -or- with user defined stop/unknown codon symbols:
           : $translation = $mySeq->translate($stop_symbol,$unknown_symbol);
           : 
 Source    : modified from Jong's <jong@mrc-lmb.cam.ac.uk>
           : library of molbio perl routines
           :
 To-do     : - allow named parameters (just like new and out_GCG )
           : - allow "frame" parameter to pick translation frame
           :
 Revision  : 0.01 / 6 Dec 1996
 Returns   : char string
 Argument  : optional stop-codon symbol and unknown-codon symbol
           : (see Example)
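
 A minimal sketch of codon-by-codon translation with user-definable
 stop and unknown symbols follows. It uses the standard genetic code,
 reads frame 1 only and ignores trailing bases that do not fill a
 codon; these choices, and the name translate_sketch(), are
 assumptions and not a description of the translate() source.

```perl
use strict;
use warnings;

# Build the standard genetic code from the usual TCAG-ordered string.
my @bases = qw(T C A G);
my @amino = split //,
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon_table;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            $codon_table{"$first$second$third"} = shift @amino;
        }
    }
}

# Translate frame 1 of a DNA or RNA string into one-letter amino acids.
sub translate_sketch {
    my ($seq, $stop_symbol, $unknown_symbol) = @_;
    $stop_symbol    = '*' unless defined $stop_symbol;
    $unknown_symbol = 'X' unless defined $unknown_symbol;
    (my $dna = uc $seq) =~ tr/U/T/;          # accept RNA input as well
    my $protein = '';
    for (my $pos = 0; $pos + 3 <= length($dna); $pos += 3) {
        my $aa = $codon_table{ substr($dna, $pos, 3) };
        if    (!defined $aa) { $aa = $unknown_symbol }   # e.g. codon with N
        elsif ($aa eq '*')   { $aa = $stop_symbol }
        $protein .= $aa;
    }
    return $protein;
}

print translate_sketch('ATGGCCTAA'), "\n";             # MA*
print translate_sketch('AUGNNNUGA', '.', '?'), "\n";   # M?.
```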


## Misc. methods ##


version()

 
 Title     : version()
 Usage     : $myseq->version;
 Function  : prints Bio::PreSeq current version number


## End of Method docs ##


Bio::PreSeq Guts


Sequence Object

  
 The sequence object is merely a reference to a hash containing
 all or some of the following fields...

 Field         Value
 --------------------------------------------------------------
 seq           the sequence
 
 id            a short identifier for the sequence
 
 desc          a description of the sequence, in descffmt file-format
 
 names         a hash of identifiers that relate to the sequence.
               These could be database IDs, accession numbers, URLs,
               pathnames, etc. Currently there is no set format
               for the names hash and no formal definition of databases 
               or names
 
 numbering     numeration scheme, currently is the starting numeration 
               or offset for the sequence 
 
 type          the sequence type. Is actually a 2 value list of format
               ["monomer","origin"] where monomer is one of the
               recognized sequence types and origin is a string
               description of the sequence's origin (mitochondrial, etc)
 
 ffmt          file-format for the sequence
 
 descffmt      file-format of the description string
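
 To make the field table concrete, the sketch below builds a plain
 hash reference with the same keys. Every value is made up for
 illustration (including the "lab_id" key in the names hash), and
 example_seq_hash() is a hypothetical helper, not a constructor
 provided by Bio::PreSeq.

```perl
use strict;
use warnings;

# Hypothetical helper: a hash reference laid out like the table above.
sub example_seq_hash {
    return {
        seq       => 'ACTGTGGCGTCAACTG',
        id        => 'sample',
        desc      => 'Sample Bio::PreSeq sequence',
        names     => { lab_id => 'clone-7' },        # free-form identifiers
        numbering => 1,                               # number of first residue
        type      => [ 'Dna', 'mitochondrial' ],      # [monomer, origin]
        ffmt      => 'Fasta',
        descffmt  => 'Raw',
    };
}

my $obj = example_seq_hash();
my ($monomer, $origin) = @{ $obj->{type} };
my $last = $obj->{numbering} + length($obj->{seq}) - 1;
printf "%s: %s (%s), residues %d to %d\n",
    $obj->{id}, $monomer, $origin, $obj->{numbering}, $last;
```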


ACKNOWLEDGEMENTS


SEE ALSO

 UnivAln.pm - The biosequence alignment object
 Parse.pm   - The perl interface to ReadSeq


REFERENCES

 BioPerl Project Page
 (URL) http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/


COPYRIGHT

 Copyright (c) 1996 Georg Fuellen, Richard Resnick, Steven E. Brenner,
 Chris Dagdigian and others. All Rights Reserved.
 This module is free software; you can redistribute it and/or modify 
 it under the same terms as Perl itself.