All input files, such as FASTA, Generic Feature Format Version 3 (GFF3) and tab-separated values (TSV) files, should be placed inside the data directory, which is inside the geniesys folder. GFF3 and FASTA files are required to set up the database; the other files are recommended but not mandatory.
ls geniesys/data
gene.gff3  genome.fa  cds.fa  transcript.fa  protein.fa
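Before setting up the database, it can help to confirm that the expected files are actually present. The following is a minimal sketch, not part of GenIE-Sys itself: the function name `check_data_dir` is hypothetical, the file names are taken from the listing above, and it assumes that `genome.fa` is the FASTA file meant by "required".

```python
from pathlib import Path

# Assumption: file names follow the example listing; gene.gff3 and genome.fa
# are treated as the required pair, the other FASTA files as optional.
REQUIRED = ["gene.gff3", "genome.fa"]
OPTIONAL = ["cds.fa", "transcript.fa", "protein.fa"]


def check_data_dir(data_dir):
    """Return (missing_required, missing_optional) file names for data_dir."""
    base = Path(data_dir)
    missing_required = [name for name in REQUIRED if not (base / name).is_file()]
    missing_optional = [name for name in OPTIONAL if not (base / name).is_file()]
    return missing_required, missing_optional


if __name__ == "__main__":
    required, optional = check_data_dir("geniesys/data")
    if required:
        print("Missing required files:", ", ".join(required))
    if optional:
        print("Missing optional files:", ", ".join(optional))
```

Run from the directory that contains the geniesys folder, the script prints nothing when all five files are in place.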
Here are some guidelines for preparing the initial input files.
1.) GFF3 files should follow the standard GFF3 specification. Here is an example of a GFF3 file that can be placed inside the data directory.
head gene.gff3
##gff-version 3
Chr1	phytozome9_0	gene	3631	5899	.	+	.	ID=AT1G01010;Name=AT1G01010
Chr1	phytozome9_0	mRNA	3631	5899	.	+	.	ID=PAC:19656964;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010
Chr1	phytozome9_0	exon	3631	3913	.	+	.	ID=PAC:19656964.exon.1;Parent=PAC:19656964;pacid=19656964
Chr1	phytozome9_0	five_prime_UTR	3631	3759	.	+	.	ID=PAC:19656964.five_prime_UTR.1;Parent=PAC:19656964;pacid=19656964
Chr1	phytozome9_0	CDS	3760	3913	.	+	0	ID=PAC:19656964.CDS.1;Parent=PAC:19656964;pacid=19656964
Chr1	phytozome9_0	exon	3996	4276	.	+	.	ID=PAC:19656964.exon.2;Parent=PAC:19656964;pacid=19656964
Chr1	phytozome9_0	CDS	3996	4276	.	+	2	ID=PAC:19656964.CDS.2;Parent=PAC:19656964;pacid=19656964
Chr1	phytozome9_0	exon	4486	4605	.	+	.	ID=PAC:19656964.exon.3;Parent=PAC:19656964;pacid=19656964
Chr1	phytozome9_0	CDS	4486	4605	.	+	0	ID=PAC:19656964.CDS.3;Parent=PAC:19656964;pacid=19656964
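A quick structural check of each feature line can catch common problems such as spaces used instead of tabs. The sketch below is only an illustration of the nine-column GFF3 layout and is not a full validator; `parse_gff3_line` is a hypothetical helper, not a GenIE-Sys function.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    start, end = int(start), int(end)  # raises ValueError if not integers
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    if strand not in {"+", "-", ".", "?"}:
        raise ValueError(f"unexpected strand value: {strand!r}")
    # Column 9 holds semicolon-separated key=value pairs such as ID and Parent.
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attributes[key] = value
    return {
        "seqid": seqid, "source": source, "type": ftype,
        "start": start, "end": end, "score": score,
        "strand": strand, "phase": phase, "attributes": attributes,
    }
```

Applied to the mRNA line of the example, the returned dict has `type` equal to `mRNA` and an `attributes["Parent"]` value of `AT1G01010`, which is how transcripts are linked back to their gene.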
2.) FASTA files should follow the standard FASTA format, preferably with a clear sequence ID that contains no special characters. Here is an example of a FASTA file.
head genome.fa
>Chr1
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTTATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCTTGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTACATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTATGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATTTAGTTGTAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAGCATTTATTCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAGCCAAAAGGATTGGTGGTTTGAAGACACATCATATCAAAAAAGCTATCGCCTCGACGATGCTCTATTTCTATCCTTGTAGCACACATTTTGGCACTCAAAAAAGTATTTTTAGATGT
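Sequence IDs can be screened for special characters before loading. The guideline does not define which characters count as special, so the sketch below makes an explicit assumption: only letters, digits, underscore, dot and hyphen are treated as safe. The names `fasta_ids` and `unsafe_ids` are hypothetical.

```python
import re

# Assumption: "no special characters" is read as letters, digits, '_', '.' and '-' only.
SAFE_ID = re.compile(r"[A-Za-z0-9_.-]+")


def fasta_ids(lines):
    """Yield the sequence ID (first word after '>') of every FASTA header line."""
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            yield header.split()[0] if header else ""


def unsafe_ids(lines):
    """Return the IDs that contain characters outside the assumed safe set."""
    return [seq_id for seq_id in fasta_ids(lines) if not SAFE_ID.fullmatch(seq_id)]
```

For a real file, pass an open file handle, for example `unsafe_ids(open("geniesys/data/genome.fa"))`; an empty list means every ID passed this (assumed) rule.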
3.) The gene description and transcript description files should be tab-delimited. The first column is the gene ID in gene_description.tsv or the transcript ID in transcript_description.tsv, and the second column is the description of the gene or transcript, respectively. The files should be named gene_description.tsv and transcript_description.tsv.
The following are examples of description files.
head gene_description.tsv
AT3G11260	WUSCHEL related homeobox 5
AT3G09140	hypothetical protein (DUF674)
AT5G01070	RING/FYVE/PHD zinc finger superfamily protein
AT3G18110	Pentatricopeptide repeat (PPR) superfamily protein
ATMG00560	Nucleic acid-binding, OB-fold-like protein
AT3G04800	translocase inner membrane subunit 23-3
AT5G63380	AMP-dependent synthetase and ligase family protein
AT1G04105	hypothetical protein
AT1G09815	polymerase delta 4
AT4G01690	Flavin containing amine oxidoreductase family
AT3G23490	cyanase
head transcript_description.tsv
AT3G11260.1	WUSCHEL related homeobox 5
AT3G09140.2	hypothetical protein (DUF674)
AT5G01070.1	RING/FYVE/PHD zinc finger superfamily protein
AT3G18110.1	Pentatricopeptide repeat (PPR) superfamily protein
ATMG00560.1	Nucleic acid-binding, OB-fold-like protein
AT3G04800.1	translocase inner membrane subunit 23-3
AT5G63380.1	AMP-dependent synthetase and ligase family protein
AT1G04105.1	hypothetical protein
AT1G09815.1	polymerase delta 4
AT4G01690.1	Flavin containing amine oxidoreductase family
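Because descriptions contain spaces, the ID and the description must be separated by a real tab character. A minimal sketch of a check for this, assuming one record per line and no tab inside the description; `read_description_tsv` is a hypothetical helper name.

```python
def read_description_tsv(lines):
    """Return ({id: description}, [line numbers that lack a tab or an ID])."""
    descriptions = {}
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        # Split on the first tab only: column 1 is the ID, the rest is the description.
        identifier, sep, description = line.partition("\t")
        if not sep or not identifier.strip():
            problems.append(number)
            continue
        descriptions[identifier] = description
    return descriptions, problems
```

A non-empty second return value points to the lines that were probably typed with spaces instead of a tab.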
4.) GO, KEGG and Pfam annotations can be loaded into the GenIE-Sys website. Prepare them as tab-delimited files. The first column should be the gene ID and the second column should be the GO, KEGG or Pfam ID and its description, separated by a hyphen (-). If several annotations are associated with one gene ID, use a semicolon (;) to separate the ID and description pairs (ID1-Description1;ID2-Description2). These files should be named gene_kegg.tsv, gene_go.tsv and gene_pfam.tsv. Here are some examples of annotation files.
head gene_go.tsv
AT1G06190	GO:0006353-transcription termination, DNA-dependent
AT1G06620	GO:0055114-oxidation-reduction process;GO:0016491-oxidoreductase activity
AT2G46660	GO:0055114-oxidation-reduction process;GO:0020037-heme binding;GO:0016705-oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen;GO:0005506-iron ion binding
AT5G02290	GO:0006468-protein phosphorylation;GO:0004672-protein kinase activity
AT3G21700	GO:0007264-small GTPase mediated signal transduction;GO:0005525-GTP binding
AT2G16040	GO:0046983-protein dimerization activity
AT3G58440	GO:0005515-protein binding
AT5G44790	GO:0046872-metal ion binding;GO:0030001-metal ion transport;GO:0000166-nucleotide binding
AT2G34630	GO:0008299-isoprenoid biosynthetic process
head gene_kegg.tsv
AT1G06620	-
AT3G27910	-
AT4G22890	-
AT5G54067	-
AT2G34630	2.5.1.85-All-trans-nonaprenyl-diphosphate synthase (geranylgeranyl-diphosphate specific).;2.5.1.1-Dimethylallyltranstransferase.
AT2G03270	3.6.4.13-RNA helicase.;3.6.4.12-DNA helicase.
AT2G25590	-
AT1G43171	-
AT5G25130	-
AT2G32280	-
AT3G15020	1.1.1.37-Malate dehydrogenase.
head gene_pfam.tsv
AT1G06620	PF14226-non-haem dioxygenase in morphine synthesis N-terminal;PF03171-2OG-Fe(II) oxygenase superfamily
AT3G27910	PF01344-Kelch motif
AT2G34630	PF00348-Polyprenyl synthetase
AT2G03270	PF13245-Part of AAA domain;PF13087-AAA domain
AT2G25590	PF05641-Agenet domain
AT5G25130	PF00067-Cytochrome P450
AT2G32280	PF06749-Protein of unknown function (DUF1218)
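The ID-Description;ID-Description layout of the second column can be illustrated with a short parser. This is a sketch under two assumptions that are inferred from the examples rather than stated by GenIE-Sys: a lone hyphen means "no annotation", and only the first hyphen in each item separates the ID from the description (descriptions such as "2OG-Fe(II) oxygenase superfamily" contain hyphens themselves). `parse_annotations` is a hypothetical name.

```python
def parse_annotations(cell):
    """Split 'ID1-Description1;ID2-Description2' into a list of (ID, description) tuples."""
    cell = cell.strip()
    # Assumption: an empty cell or a lone '-' marks a gene without annotation.
    if cell in ("", "-"):
        return []
    pairs = []
    for item in cell.split(";"):
        if not item:
            continue
        # Split on the first hyphen only, so hyphens inside descriptions survive.
        identifier, _, description = item.partition("-")
        pairs.append((identifier, description))
    return pairs
```

One consequence of this layout is that a description must not itself contain a semicolon, since it would be read as the start of a new annotation.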
5.) Best BLAST hit IDs can also be loaded into the GenIE-Sys website. For example, if you have best BLAST hits from model plant species, they can be loaded into the database. These files should be named gene_arabidopsis.tsv, gene_spruce.tsv or gene_populus.tsv.
Here are some examples of best BLAST annotation files.
head gene_populus.tsv
AT4G38320	Potri.014G006000;Potra003982g23967
AT4G25700	Potri.017G145700;Potra000924g07477
AT4G11300	Potri.003G132100;Potra000613g04660
AT5G61090	Potri.014G040100;Potra003452g21711
AT3G26570	Potri.008G186601;Potra002618g19588
AT3G13950	Potri.006G223800;Potra001531g12715
AT3G12170	Potri.006G056400;Potra002594g19498
AT5G43175	Potri.002G119200;Potra002863g20178
head gene_spruce.tsv
AT1G01010	MA_10426365g0010
AT1G01030	MA_18923g0010
AT1G01040	MA_10437243g0020
AT1G01050	MA_93206g0010
AT1G01060	MA_11267g0020
AT1G01070	MA_13078g0020
AT1G01080	MA_482994g0010
AT1G01090	MA_10426096g0010
AT1G01100	MA_10430200g0010
Example input files used to generate GenIE-Sys websites for the PlantGenIE core species (Populus tremula, Populus trichocarpa, Picea abies, Eucalyptus grandis and Arabidopsis thaliana) are available here:
ftp://plantgenie.org/Data/GenIESys/input_files/