
Log of /FigKernelScripts/embl2gff.pl

Revision 1.20
Mon Dec 5 18:56:37 2005 UTC (13 years, 11 months ago) by olson
Branch: MAIN
CVS Tags: HEAD, caBIG-05Apr06-00, caBIG-13Feb06-00, mgrast_dev_02212011, mgrast_dev_02222011, mgrast_dev_03252011, mgrast_dev_03312011, mgrast_dev_04012011, mgrast_dev_04052011, mgrast_dev_04082011, mgrast_dev_04132011, mgrast_dev_05262011, mgrast_dev_06072011, mgrast_dev_08022011, mgrast_dev_08112011, mgrast_dev_10262011, mgrast_dev_12152011, mgrast_rel_2008_0625, mgrast_rel_2008_0806, mgrast_rel_2008_0917, mgrast_rel_2008_0919, mgrast_rel_2008_0923, mgrast_rel_2008_0924, mgrast_rel_2008_1110, mgrast_rel_2008_1110_v2, mgrast_release_3_0, mgrast_release_3_0_1, mgrast_release_3_0_2, mgrast_release_3_0_3, mgrast_release_3_0_4, mgrast_release_3_1_0, mgrast_release_3_1_1, mgrast_release_3_1_2, mgrast_version_3_2, myrast_33, myrast_rel40, rast_2008_0924, rast_rel_2008_04_23, rast_rel_2008_06_16, rast_rel_2008_06_18, rast_rel_2008_07_21, rast_rel_2008_08_07, rast_rel_2008_09_29, rast_rel_2008_09_30, rast_rel_2008_10_09, rast_rel_2008_10_29, rast_rel_2008_11_24, rast_rel_2008_12_18, rast_rel_2009_02_05, rast_rel_2009_03_26, rast_rel_2009_05_18, rast_rel_2009_07_09, rast_rel_2009_0925, rast_rel_2010_0118, rast_rel_2010_0526, rast_rel_2010_0827, rast_rel_2010_0928, rast_rel_2010_1206, rast_rel_2011_0119, rast_rel_2011_0928, rast_rel_2014_0729, rast_rel_2014_0912, rast_release_2008_09_29
Changes since 1.19: +17 -0 lines
Add license words.

Revision 1.19
Tue Nov 1 04:01:49 2005 UTC (14 years ago) by mkubal
Branch: MAIN
CVS Tags: caBIG-00-00-00
Changes since 1.18: +6 -2 lines
prevent duplicating last line of seq

Revision 1.18
Wed Oct 12 19:04:06 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.17: +18 -2 lines
gff2seed:	use description, Note, and nci_annotation in that order to define
		annotation and assignment.  ncbi is source for the final case.
		otherwise, set to gff2seed

embl2gff:	include unigene, EMBL: and protein_id. include assignment from
		parse_mart (below) into Note attribute in gff

parse_mart:	pull in assignment from transcript feature file.
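
The gff2seed precedence described in this message can be sketched as follows, in Python rather than the script's own Perl. The function name and the dict representation of GFF attributes are assumptions made for illustration. The ordering (description, Note, nci_annotation) comes from the message, and the source labels follow one reading of it: ncbi when nci_annotation is used, gff2seed otherwise.

```python
def pick_assignment(attrs):
    """Return (assignment, source) chosen from a dict of GFF attributes.

    Hypothetical helper: tries description, then Note, then nci_annotation.
    """
    for key in ("description", "Note", "nci_annotation"):
        value = attrs.get(key)
        if value:
            # Assumed reading of the log: only the last case is credited to ncbi.
            source = "ncbi" if key == "nci_annotation" else "gff2seed"
            return value, source
    # Nothing usable was found; keep the default source label.
    return None, "gff2seed"
```

For example, a feature carrying both Note and nci_annotation would take its assignment from Note, since Note comes earlier in the order.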

Revision 1.17
Wed Oct 5 21:38:21 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
CVS Tags: caBIG-dataload-0
Changes since 1.16: +17 -5 lines
embl2gff- add ensembl protein id
gff2seed- sort the features before writing to tbl file.
parse_mart- rename ChromBand to just Band

Revision 1.16
Wed Oct 5 03:51:56 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.15: +63 -26 lines
embl2gff:

sigh.

embl files are a set of entries.  an entry is a set of genes.  it is also
a set of clones.  but those two are incommensurate- a gene can be split across
clones.  so can transcripts.  so, transcripts (which are supposed to be
uniquely named by their transcript_id) can occur multiple times, despite
being "unique."  EMBL puts cloneName:location into the loc info to refer
to a hunk of the transcript living in a different entry (then puts the
transcript in both).  At first, I was just rejecting anything with a :
in it as a way to keep from double counting.  But every so often, there's
a short guy such that the list of intervals describing the location is
short.  short enough to fit on one line.  Then *both* parts have a
cloneName: in the first line and you lose *all* copies of the transcript.

it gets worse.

apparently human has alternative assemblies in the embl file.  also, both
X and Y are there (thus duplicating many transcripts).  This is a second
way to double count genes.

I put a hash in.  First come, first served.  When you hit a transcript id,
see if it's in the hash.  if so, skip it, else keep it and update the
hash.

but it's tricky because the CDS line during the parse is what kicks us
into a "New Transcript" state and you have to kick out of that state
when you get your hands on the transcript Id (which comes several lines
*after* you hit the CDS line).  So this update also unwinds that
state.
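
The first-come, first-served hash and the state unwinding can be sketched roughly as below. This is Python, not the script's Perl, and the input layout is an assumption: a line starting with "FT   CDS" opens a transcript, and some later line contains "transcript_id=<id>". The function name and the buffering strategy are hypothetical, not taken from embl2gff.pl.

```python
import re

def unique_transcripts(lines):
    """Yield (transcript_id, record_lines) for the first copy of each id."""
    seen = set()        # plays the role of the hash in the log message
    pending = None      # lines buffered since the last CDS line, or None
    current_id = None   # id of the buffered transcript once it is known
    for line in lines:
        if line.startswith("FT   CDS"):
            # A new CDS line closes the previous transcript, if it was kept.
            if pending is not None and current_id is not None:
                yield current_id, pending
            pending, current_id = [line], None   # enter "new transcript" state
            continue
        if pending is None:
            continue    # outside any transcript, or inside a skipped duplicate
        match = re.search(r"transcript_id=(\w+)", line)
        if match and current_id is None:
            tid = match.group(1)
            if tid in seen:
                pending = None   # unwind: drop what was buffered since the CDS line
                continue
            seen.add(tid)
            current_id = tid
        pending.append(line)
    if pending is not None and current_id is not None:
        yield current_id, pending
```

Buffering the lines after the CDS line is one way to make the unwind cheap: nothing is emitted until the id has been checked. A buffered record that never yields an id is silently dropped in this sketch.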

Revision 1.15
Sat Oct 1 03:09:13 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.14: +13 -7 lines
take in a string to put in the PROJECT file, e.g. Ensembl-31

change a regexp since '*' can be in some translations but \w does not
match it.  this fixes a 1 in 10000 error that was causing genes to be
lost.

Revision 1.14
Thu Sep 29 21:59:34 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.13: +1 -2 lines
fix duplicating the figID twice

Revision 1.13
Thu Sep 29 18:57:26 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.12: +3 -3 lines
remove debug prints

Revision 1.12
Thu Sep 29 18:56:40 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.11: +16 -12 lines
parse_mart:  print UniProt not Uniprot
embl2gff- fix broken lookup into extra stuff by removing Foo: from gene and transcript

Revision 1.11
Thu Sep 29 00:20:03 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.10: +38 -12 lines
gff2seed:  fix typo in arg processing that made -append be ignored.

embl2gff: pass in org version number, extras file name

Revision 1.10
Tue Sep 27 20:02:30 2005 UTC (14 years, 1 month ago) by efrank
Branch: MAIN
Changes since 1.9: +28 -12 lines
fix parsing of translation to catch case where FT line has ONLY a "
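
A minimal Python sketch of the case this fix refers to: a multi-line /translation qualifier whose closing quote sits alone on the final FT line. The function name and the exact line layout are assumptions; the real parser is Perl and is not shown in the log.

```python
def read_translation(ft_lines):
    """Join a multi-line /translation qualifier into one residue string."""
    opener = '/translation="'
    parts = []
    inside = False
    for line in ft_lines:
        # Drop the 'FT' line code and the surrounding padding.
        text = line[2:].strip() if line.startswith("FT") else line.strip()
        if not inside:
            if not text.startswith(opener):
                continue
            text = text[len(opener):]
            inside = True
        if text.endswith('"'):
            # Works even when the line holds nothing but the closing quote:
            # the piece appended is then the empty string.
            parts.append(text[:-1])
            return "".join(parts)
        parts.append(text)
    return None  # qualifier absent or never closed
```

Treating the closing quote as "the last character of whatever is left" avoids a separate branch for the quote-only line.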

Revision 1.9
Sat Sep 24 03:15:20 2005 UTC (14 years, 2 months ago) by efrank
Branch: MAIN
Changes since 1.8: +294 -225 lines
break things into subroutines,  indent sanely, etc.
now that work is in subroutine, add top  level driver that iterates over files
and watches how much has been written, starting/stopping output files as
needed to stay below high water  mark (hard wired to 800 MB).

lots of gross coupling through global vars and parser state.

need to add arg parsing

need to set strict but too scared to try.
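
The high-water-mark idea in this message can be sketched in Python as below. The 800 MB figure is from the log; the function name, the file naming scheme, and the choice to iterate over text records rather than input files are assumptions for illustration.

```python
import os

HIGH_WATER_MARK = 800 * 1024 * 1024  # 800 MB, the hard-wired figure in the log

def write_in_chunks(records, out_dir, prefix="part", limit=HIGH_WATER_MARK):
    """Write text records to numbered files, opening a new file before the
    current one would grow past `limit` bytes. Returns the paths written."""
    paths = []
    out = None
    written = 0
    try:
        for record in records:
            data = record.encode("utf-8")
            # Rotate when the next record would cross the limit; a record
            # larger than the limit still gets a file of its own.
            if out is None or (written > 0 and written + len(data) > limit):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"{prefix}_{len(paths):03d}.gff")
                out = open(path, "wb")
                paths.append(path)
                written = 0
            out.write(data)
            written += len(data)
    finally:
        if out is not None:
            out.close()
    return paths
```

Passing the limit as a parameter, instead of hard-wiring it, is a small step toward the argument parsing the message says is still needed.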

Revision 1.8
Sat Sep 24 00:10:43 2005 UTC (14 years, 2 months ago) by efrank
Branch: MAIN
Changes since 1.7: +112 -147 lines
replace a big gob of if/else code with a hash.
reorganize/clean as a result.
extend capabilities by making a big fat hash.

now handles all interesting db-xref's in Ensembl for 17 orgs in
release 31.

oh, the hash lets you remap the db-xref from ensembl to
whatever you want, typically to a standard as spec'd by
GO
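
A hash-driven remapping of this kind might look like the following Python sketch. The table entries and the function name are hypothetical; the log does not list the actual database names or the names they were remapped to.

```python
# Hypothetical table: keys are database names as they might appear in an
# Ensembl db_xref qualifier, values are the names to emit instead.
DB_XREF_MAP = {
    "Uniprot/SWISSPROT": "UniProtKB",
    "EntrezGene": "NCBI_Gene",
    "PDB": "PDB",
}

def remap_db_xref(xref, mapping=DB_XREF_MAP):
    """Turn 'Db:accession' into 'Mapped:accession'.

    Returns None when the database is not in the table, i.e. not of interest.
    """
    db, sep, accession = xref.partition(":")
    if not sep or db not in mapping:
        return None
    return f"{mapping[db]}:{accession}"
```

Supporting another database then means adding one table entry rather than another if/else branch, which is the gain the message describes.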

Revision 1.7
Fri Sep 23 19:05:35 2005 UTC (14 years, 2 months ago) by mkubal
Branch: MAIN
Changes since 1.6: +18 -11 lines
added PDB,IPI,Entrez

Revision 1.6
Fri Sep 23 16:19:41 2005 UTC (14 years, 2 months ago) by efrank
Branch: MAIN
Changes since 1.5: +39 -23 lines
embl - generalizations and dispersions.
parse_mart- do uniprot id xrefs too

Revision 1.5
Wed Sep 21 23:44:57 2005 UTC (14 years, 2 months ago) by efrank
Branch: MAIN
Changes since 1.4: +76 -18 lines
embl2gff- update to read a file of additional information to collate into the alias
and attribute information.  actually, it shoves it all into the alias right now.

parse_mart- grabs cytogenetic location out of ensembl mart and dumps into a file
to be read in as extra info by embl2gff.  trivial stuff except for finding
the !@#$# file and understanding what the fields mean.

Revision 1.4
Sun Sep 18 20:46:58 2005 UTC (14 years, 2 months ago) by mkubal
Branch: MAIN
Changes since 1.3: +18 -2 lines
includes cyto_genetic location - must have cyto_data.txt to work

Revision 1.3
Sun Sep 18 20:15:09 2005 UTC (14 years, 2 months ago) by mkubal
Branch: MAIN
Changes since 1.2: +8 -6 lines
strand now correct

Revision 1.2
Fri Sep 16 23:10:57 2005 UTC (14 years, 2 months ago) by efrank
Branch: MAIN
Changes since 1.1: +18 -10 lines
add uri_encode and handle commas in alias strings correctly.
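
Why commas need special handling: in a GFF3-style attribute column, a comma separates multiple values of one tag, so a comma inside an alias has to be percent-encoded. The Python sketch below assumes GFF3 conventions and uses the standard library's `urllib.parse.quote`; the function name is hypothetical and this is not the script's own uri_encode.

```python
from urllib.parse import quote

def gff_alias_attribute(aliases):
    """Build an 'Alias=' attribute from a list of alias strings."""
    # Keep ':', '/' and '|' readable; ',', ';', '=', '&', '%' and spaces
    # are percent-encoded so they cannot be mistaken for separators.
    encoded = [quote(alias, safe=":/|") for alias in aliases]
    return "Alias=" + ",".join(encoded)
```

For example, the alias `1,2-dioxygenase` becomes `1%2C2-dioxygenase`, so a reader splitting the attribute on commas still sees one alias rather than two.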

Revision 1.1
Fri Sep 16 16:14:30 2005 UTC (14 years, 2 months ago) by efrank
Branch: MAIN
A converter from EMBL flat file format to GFF being written by Mike Kubal.
See http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html for
description of EMBL file format

Not yet working.
