Revision 1.8, Fri Aug 24 19:58:35 2012 UTC, by salazar

<h1>Getting What You Need from the CS Using Command-Line Scripts</h1>
<p><strong>Purpose: </strong>To learn the CDM <a href="http://pubseed.theseed.org/ErdbDocWidget.cgi?xmlFileName=/home/parrello/CdmiData/Published/KSaplingDBD.xml">Entity-Relationship Model</a> and how to run command-line tools to expose data.</p>
<p><strong>Required Prerequisite Activities: </strong><a href="/developer-zone/tutorials/getting-started/getting-started-with-the-kbase/">Getting Started with KBase</a></p>
<p>Suggested Prerequisite Activities: <a href="/developer-zone/tutorials/getting-started/some-basic-exercises-using-the-doe-kbase/">Some Basic Exercises Using KBase</a>.</p>
<p>Related Tutorials: <a href="/developer-zone/tutorials/command-line-scripts/accessing-central-store-data/extending-the-cs-commands-with-operators/">Extending the CS Commands with Operators</a></p>
The Central Store (<b>CS</b>) is the KBase integration of the data needed to
support the creation and validation of metabolic and regulatory models.  It will
certainly be used for many other purposes, but its creation is being
driven by the needs of the modelling community.
We have already described many details of how to access the contents of the CS via command-line
tools in <a href="basic_exercises.html">Some Basic Exercises Using KBase</a>.
This document is designed to complement that tutorial.  We hope to give an
overview of the approach of piping KBase tools together on the command line, as well
as some very minimal notes on a subset of the Unix tools.
To understand the contents of either tutorial you really do need to be able to
bring up the entity-relationship model describing the contents of the CS and understand
the contents of that model, which is often called the Central Data Model (<b>CDM</b>).
See the start of the companion tutorial to get an overview of the CDM and how to get started.

We realize that most users will ultimately use user interfaces that obviate the need
to do anything at the command-line.  We look forward to that day.  Until then, there
is a great deal of use that can be made of the CS in its present form, using
the rather primitive environment supported by the command-line.

<h2>The Basic Philosophy of the CS Command-line Tools</h2>

We think of most of the KBase command-line tools as taking in a file containing
a tab-separated table and outputting a modified table.  The most common modification
is the addition of one or more columns.  We create "pipelines" of these tools to implement
fairly complex transformations leading to the final table containing the desired output.
For example, consider the following little pipeline:
  all_entities_Genome -f scientific_name | grep "Streptococcus pneumoniae"
The <i>all_entities_Genome</i> command is thought of as producing a table in which the
first column (by default) is the genome ID, and any extra columns come from the arguments
of the command.  In this case, we get a 2-column table of genome IDs and scientific names.
This 2-column table is fed into a Unix command called <i>grep</i>, which keeps lines that "match"
its argument.  In this case, <i>grep</i> extracts the rows of the table that "match" the string
"Streptococcus pneumoniae".  Thus, the result is a 2-column table in which each row contains
"Streptococcus pneumoniae".  Upon running the command, we got
<pre>  kb|g.1340	Streptococcus pneumoniae SP19-BS75
  kb|g.1880	Streptococcus pneumoniae BS457
  kb|g.3485	Streptococcus pneumoniae SPN7465
  kb|g.9772	Streptococcus pneumoniae SP18-BS74
  kb|g.3478	Streptococcus pneumoniae SPN034183
  kb|g.1784	Streptococcus pneumoniae JJA
  kb|g.9944	Streptococcus pneumoniae CDC1873-00
  kb|g.3474	Streptococcus pneumoniae OXC141
  kb|g.3484	Streptococcus pneumoniae SPN033038
  kb|g.1881	Streptococcus pneumoniae BS458
  kb|g.110	Streptococcus pneumoniae OXC141
  kb|g.1334	Streptococcus pneumoniae SP3-BS71
</pre>
Transforming tables by extracting rows that contain a given
string (or, more generally, a pattern),  extracting columns from a table, or sorting 
the rows in a table is the basic style we advocate.  For example, suppose that you wanted
to find all features of <i>Streptococcus pneumoniae</i> that had been assigned a specific
function (say, <i>triosephosphate isomerase</i>).  You might try using
  all_entities_Genome -f scientific_name |
  grep "Streptococcus pneumoniae" |
  genomes_to_fids CDS -c 1 | 
  fids_to_functions | 
  grep -i "triosephosphate isomerase"
This produces output like
<pre>  kb|g.1340  Streptococcus pneumoniae SP19-BS75  kb|g.1340.peg.783   Triosephosphate isomerase (EC
  kb|g.9772  Streptococcus pneumoniae SP18-BS74  kb|g.9772.peg.1261  triosephosphate isomerase
  kb|g.9772  Streptococcus pneumoniae SP18-BS74  kb|g.9772.peg.1663  triosephosphate isomerase
  kb|g.9772  Streptococcus pneumoniae SP18-BS74  kb|g.9772.peg.2175  triosephosphate isomerase
  kb|g.9772  Streptococcus pneumoniae SP18-BS74  kb|g.9772.peg.2207  triosephosphate isomerase
  kb|g.3478  Streptococcus pneumoniae SPN034183  kb|g.3478.peg.287   Triosephosphate isomerase (EC
</pre>

Here, some extra comments are necessary.
<ul>
<li><i>genomes_to_fids</i> is a KBase command that takes a tab-separated table as input, and one
of the columns in the table must contain KBase genome IDs.  By default, that column is the last column in the table
(i.e., the rightmost).  If the column of genome IDs is not the last, use the "-c N" argument to say that
the column is the Nth.</li>
<li><i>fids_to_functions</i> is a KBase command that takes a table containing a column of feature IDs (i.e., fids),
which by default is the last column, and extends the table with one more column, the function assigned to the feature.</li>
<li> Finally, the "-i" argument to <i>grep</i> makes the match case insensitive.  <i>grep</i> is an extremely
powerful command with a number of useful options (you can select all rows that do not match, rows that match some
specified pattern, etc.).</li>
</ul>
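<p>The <i>grep</i> options mentioned above can be tried on a small table without contacting KBase.  The sketch below is only an illustration: the three rows are copied from the genome listing earlier in this tutorial, and the variable names are our own.</p>

```shell
# Three tab-separated rows copied from the genome listing shown earlier.
table=$(printf 'kb|g.1340\tStreptococcus pneumoniae SP19-BS75\nkb|g.1784\tStreptococcus pneumoniae JJA\nkb|g.3474\tStreptococcus pneumoniae OXC141\n')

# -i: case-insensitive match, so 'oxc141' finds the OXC141 row.
matched=$(printf '%s\n' "$table" | grep -i 'oxc141')

# -v: keep only the rows that do NOT match.
others=$(printf '%s\n' "$table" | grep -v 'OXC141')

# A pattern: rows whose genome ID starts with kb|g.1
# (in a basic regular expression '|' is literal and '\.' is a literal dot).
pattern=$(printf '%s\n' "$table" | grep '^kb|g\.1')

printf '%s\n' "$matched" "--" "$others" "--" "$pattern"
```

<p>Because each filter only removes rows, the column layout of the table is unchanged and the output can be piped straight into the next command.</p>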
<h2> A Short Exercise</h2>
Now, as an exercise, try to show the IDs and functions of all fids that have
precisely the same function as <i>kb|g.9772.peg.2175</i>.  You should spend a few
minutes, if necessary, and try to solve this yourself before looking at our approach.
We would likely use the following pipeline:
  echo 'kb|g.9772.peg.2175' | fids_to_proteins | proteins_to_fids | fids_to_functions
In some sense, this is probably not terribly efficient, but you do get the answer back in a few seconds
(and, we think that the answer itself poses at least one interesting question).
<h2>The Abstract View</h2>
A pipeline begins with a <i>generator</i> -- that is, a command that takes no input, but outputs a table.
<i>Generators</i> come in two flavors: one generates a table with a row for every instance of a designated entity, and
the other generates a single instance of an entity.  For example,
  all_entities_Genome
is a generator that outputs a single column containing the IDs of all genomes in KBase.
  all_entities_Genome -f 'dna_size,contigs,scientific_name'
is a generator that would output a 4-column table. (Try it.)
The <i>all_entities_ENTITY</i> commands are generators that you may depend on being present
(i.e., if you see the entity in the ER-model, the corresponding generator may be assumed to be present).
  echo 'kb|g.9772.peg.2175'
was an example of a generator that produced a single instance of an entity.
Now, it is true that you might begin a pipeline without a generator, using something like
  fids_to_proteins < file.of.saved.fid.ids

<p>In this case, you might consider this a "restarted pipeline", or you might amend our
  previous assertion that pipelines begin with generators.  In either case, if you wish to
  debate the issue, we claim to have succeeded in clarifying the concepts.
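<p>One way to end up with such a saved file is to capture an intermediate table with the standard <i>tee</i> command and feed it back in later with "&lt;".  The following is a minimal sketch under stated assumptions: no KBase commands are run, <i>wc</i> and <i>grep</i> merely stand in for the downstream tools, and the temporary file is our own choice.</p>

```shell
# Temporary file that plays the role of file.of.saved.fid.ids.
saved=$(mktemp)

# First run: tee writes a copy of the table to the file while the
# pipeline continues (wc -l stands in for the downstream commands).
first_count=$(printf 'kb|g.9772.peg.2175\nkb|g.9772.peg.2207\n' | tee "$saved" | wc -l)

# Later: restart from the saved table instead of re-running the generator
# (grep stands in for a KBase transformation command).
restarted=$(grep 'peg\.2175' < "$saved")

echo "rows saved: $first_count"
echo "restarted pipeline produced: $restarted"
rm -f "$saved"
```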
There is one more type of generator that you will occasionally find extremely useful.
Suppose that you wished to find all of the Features (fids) that have been assigned
the function
  SSU ribosomal protein S9p (S16e)
Using the tools we have described so far, this is not easy to do (and the resulting pipelines are quite inefficient).
To search for instances of an Entity that have a desired value of one of the fields, you can use commands like
  query_entity_Feature -is 'function,SSU ribosomal protein S9p (S16e)'

  query_entity_Feature -like 'function,SSU ribosomal protein%'
The first command looks up the fids for features that have been assigned exactly the given function, while
the second locates all features whose function begins with "SSU ribosomal protein".
  There are two classes of KBase <b>transformation commands</b> that you may depend on being present.
  These transformation commands take as input a table and add columns to it:
<ul>
<li> <b>get_entity_ENTITY</b> commands take as input a table that contains
a column of <b>ENTITY</b> IDs.  These commands add columns corresponding
to fields from the referenced <b>ENTITY</b>.  Thus,
  echo 'kb|g.9772.peg.2175' | get_entity_Feature -f 'function,source_id'
is an example of a <i>get_entity_Feature</i> command that adds two columns to the (admittedly limited)
input stream.</li>
<li> The <b>get_relationship_RELATIONSHIP</b> commands are used for "crossing" a relationship from
one type of entity (called the <i>from entity</i>) to another (called the <i>to entity</i>).
The input must be a table with a column containing IDs of the <i>from entity</i>.  The output
is formed by tacking on columns of data from three sources: the <i>from entity</i>, the relationship,
and the <i>to entity</i>.  For example,
  echo 'kb|g.9772.peg.2175' | get_relationship_IsOwnedBy -from source_id -to 'scientific_name,id'
takes the input IDs of Features, crosses the IsOwnedBy relationship, and adds three columns (one
from the <i>from entity</i> and two from the <i>to entity</i>).</li>
</ul>
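<p>The "read a table, append columns" convention is easy to imitate with standard tools.  The sketch below defines <i>add_length_column</i>, a hypothetical stand-in that is not a KBase command: it appends the length of the ID found in a chosen column, and its optional argument imitates the "-c N" convention (last column by default).  It is meant only to show the shape of a transformation command.</p>

```shell
# Hypothetical stand-in for a transformation command (NOT part of KBase).
# Reads a tab-separated table on stdin and appends one column: the length
# of the value in the chosen column.
add_length_column() {
    # $1: 1-based column number; 0 or omitted means "use the last column".
    awk -F '\t' -v OFS='\t' -v col="${1:-0}" '{
        c = (col == 0) ? NF : col
        print $0, length($c)
    }'
}

# The input row is copied from the genome listing earlier in this tutorial.
out=$(printf 'kb|g.1340\tStreptococcus pneumoniae SP19-BS75\n' | add_length_column 1)
printf '%s\n' "$out"
```

<p>The original columns pass through untouched and the new value is tacked on at the right, which is what lets such commands be chained.</p>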
We have supplemented the standard generators (the all_entities_ENTITY commands) and the standard
transformation commands (get_entity_ENTITY and get_relationship_RELATIONSHIP) with a set
of commands representing what we call "well-trodden paths."  When we see these recurring patterns of
use, we can sometimes reduce the effort required to extract the desired data.  For example,
  all_roles_used_in_models
is a customized generator that outputs a table containing roles that are used in building models.
If you wanted to get the number of unique protein sequences that connect to these roles, you might use
  all_roles_used_in_models | roles_to_proteins | cut -f 2 | sort -u | wc
<p>(Note that the command above may take between 10 and 20 minutes. If preferred, route the output to a file.)</p>
<p>The <i>roles_to_proteins</i> transformation command is an example of one of these
  "well-trodden path" commands that supplement the standard transformation commands.  The following Unix commands are worth noting:</p>
<ul>
<li> <b>cut -f 2</b> says "extract just the second column" (the one containing md5 values
representing protein sequences),</li>
<li> <b>sort -u</b> says "sort the input table removing duplicate lines", and</li>
<li> <b>wc</b> says "count the lines, words, and characters in the input file".</li>
</ul>
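<p>These three Unix commands can be tried on a small table.  In the sketch below the role names and md5 values are invented placeholders, not KBase data.  It also uses <i>wc -l</i>, which prints only the line count.</p>

```shell
# Made-up two-column table: role <TAB> protein md5 (placeholder values).
pairs=$(printf 'Role A\tmd5_1\nRole B\tmd5_2\nRole C\tmd5_1\nRole A\tmd5_3\n')

# cut -f 2 : keep only the second (md5) column; cut splits on tabs by
#            default, so the spaces inside role names are harmless.
# sort -u  : sort and drop duplicate lines.
# wc -l    : count the remaining lines, i.e. the distinct md5 values.
unique_count=$(printf '%s\n' "$pairs" | cut -f 2 | sort -u | wc -l)

echo "distinct protein sequences: $unique_count"
```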
Well, that is basically it.  We end by attempting to convey just one picky detail.
The sort command allows you to sort lines on specific fields using the <b>-k N</b>
argument.  The problem is that it breaks a line into fields using transitions between
whitespace characters and non-whitespace characters, and this is not the behavior you
always want.  If you are dealing with tab-delimited fields (as we are), then you
want the <i>sort</i> command to split the line into fields properly.  This can be
done by executing
  export TAB=`echo -e "\t"`
and then using something like
  sort -t "$TAB" -k 4 
to sort on the 4th field in a tab-delimited table.  Within our KBase exercises, you will usually
need to worry about this only if you have "role" or "function" columns in your tables (they have embedded
spaces).
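<p>The effect of the tab delimiter can be seen on a small table.  The following is a minimal sketch with invented rows; it sorts on the third field rather than the fourth, uses <i>printf</i> as a portable alternative to <i>echo -e</i>, and sets LC_ALL=C so that the ordering does not depend on the local collation rules.</p>

```shell
# A literal tab character.
TAB=$(printf '\t')

# Invented table: fid <TAB> function (with embedded spaces) <TAB> sort key.
table=$(printf 'fid.1\tSSU ribosomal protein S9p\tb\nfid.2\tTriosephosphate isomerase\tc\nfid.3\tEnolase\ta\n')

# Tab-delimited sort: field 3 really is the last column (a, b, c).
by_tab=$(printf '%s\n' "$table" | LC_ALL=C sort -t "$TAB" -k 3 | cut -f 1)

# Default splitting: the words inside the function column count as
# separate fields, so "field 3" is not the column we meant.
by_blank=$(printf '%s\n' "$table" | LC_ALL=C sort -k 3 | cut -f 1)

printf 'tab-delimited sort:\n%s\n' "$by_tab"
printf 'default field splitting:\n%s\n' "$by_blank"
```

<p>With the tab delimiter the rows come out ordered by the last column; with default splitting the order depends on where the spaces happen to fall in the function names.</p>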
