Input files
All the input files such as FASTA
, Generic Feature Format Version 3 (GFF3)
and Tab-separated values (TSV)
files should be placed inside the data directory which is inside the geniesys folder. GFF3 and FASTA files are required to setup the database and other files are recommended but not mandatory.
ls geniesys/data
gene.gff3
genome.fa
cds.fa
transcript.fa
protein.fa
Here are some of the guidelines for prepare initial input files.
1.) GFF3
files should follow the standard GFF3
specifications listed here. Here is the example of a GFF3
file that can be included inside the data file.
head gene.gff3
##gff-version 3
Chr1 phytozome9_0 gene 3631 5899 . + . ID=AT1G01010;Name=AT1G01010
Chr1 phytozome9_0 mRNA 3631 5899 . + . ID=PAC:19656964;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010
Chr1 phytozome9_0 exon 3631 3913 . + . ID=PAC:19656964.exon.1;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 five_prime_UTR 3631 3759 . + . ID=PAC:19656964.five_prime_UTR.1;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 CDS 3760 3913 . + 0 ID=PAC:19656964.CDS.1;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 exon 3996 4276 . + . ID=PAC:19656964.exon.2;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 CDS 3996 4276 . + 2 ID=PAC:19656964.CDS.2;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 exon 4486 4605 . + . ID=PAC:19656964.exon.3;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 CDS 4486 4605 . + 0 ID=PAC:19656964.CDS.3;Parent=PAC:19656964;pacid=19656964
2.) FASTA files should follow the standard FASTA format preferably a clear sequence ID without special characters. Here is an example of a FASTA file.
head genome.fa
>Chr1
CCCTAAACCCTAAACCCTAAACCCTAAACCTCTGAATCCTTAATCCCTAAATCCCTAAATCTTTAAATCCTACATCCAT
GAATCCCTAAATACCTAATTCCCTAAACCCGAAACCGGTTTCTCTGGTTGAAAATCATTGTGTATATAATGATAATTTT
ATCGTTTTTATGTAATTGCTTATTGTTGTGTGTAGATTTTTTAAAAATATCATTTGAGGTCAATACAAATCCTATTTCT
TGTGGTTTTCTTTCCTTCACTTAGCTATGGATGGTTTATCTTCATTTGTTATATTGGATACAAGCTTTGCTACGATCTA
CATTTGGGAATGTGAGTCTCTTATTGTAACCTTAGGGTTGGTTTATCTCAAGAATCTTATTAATTGTTTGGACTGTTTA
TGTTTGGACATTTATTGTCATTCTTACTCCTTTGTGGAAATGTTTGTTCTATCAATTTATCTTTTGTGGGAAAATTATT
TAGTTGTAGGGATGAAGTCTTTCTTCGTTGTTGTTACGCTTGTCATCTCATCTCTCAATGATATGGGATGGTCCTTTAG
CATTTATTCTGAAGTTCTTCTGCTTGATGATTTTATCCTTAGCCAAAAGGATTGGTGGTTTGAAGACACATCATATCAA
AAAAGCTATCGCCTCGACGATGCTCTATTTCTATCCTTGTAGCACACATTTTGGCACTCAAAAAAGTATTTTTAGATGT
3.) gene description or transcript description should be tab delimited file. First column is the gene ID for the gene_description.tsv
file and the transcript ID for the transcript_description.tsv
file and the second column should be the description for gene or transcript respectively. The file should be named as gene_description.tsv
ortranscript_description.tsv.
Following you can find an example of a description file.
head gene_description.tsv
AT3G11260 WUSCHEL related homeobox 5
AT3G09140 hypothetical protein (DUF674)
AT5G01070 RING/FYVE/PHD zinc finger superfamily protein
AT3G18110 Pentatricopeptide repeat (PPR) superfamily protein
ATMG00560 Nucleic acid-binding, OB-fold-like protein
AT3G04800 translocase inner membrane subunit 23-3
AT5G63380 AMP-dependent synthetase and ligase family protein
AT1G04105 hypothetical protein
AT1G09815 polymerase delta 4
AT4G01690 Flavin containing amine oxidoreductase family
AT3G23490 cyanase
head transcript_description.tsv
AT3G11260.1 WUSCHEL related homeobox 5
AT3G09140.2 hypothetical protein (DUF674)
AT5G01070.1 RING/FYVE/PHD zinc finger superfamily protein
AT3G18110.1 Pentatricopeptide repeat (PPR) superfamily protein
ATMG00560.1 Nucleic acid-binding, OB-fold-like protein
AT3G04800.1 translocase inner membrane subunit 23-3
AT5G63380.1 AMP-dependent synthetase and ligase family protein
AT1G04105.1 hypothetical protein
AT1G09815.1 polymerase delta 4
AT4G01690.1 Flavin containing amine oxidoreductase family
4.) GO
, Kegg
and Pfam
annotation can be loaded into the GenIE-Sys website. You can make them as tab delimited files. The first column should be the gene ID and the second column should be the GO
, Kegg
or Pfam
. ID and description separated by the hypen (-). If there are several descriptions associated with one gene id, you can use the semicolon (;) to separate the corresponding annotation ID and Description (ID1-Description1;ID2-Description2
). These files should be named as gene_kegg.tsv
, gene_go.tsv
and gene_pfam.tsv
. Here are some of the examples of annotation files.
head gene_go.tsv
AT1G06190 GO:0006353-transcription termination, DNA-dependent
AT1G06620 GO:0055114-oxidation-reduction process;GO:0016491-oxidoreductase activity
AT2G46660 GO:0055114-oxidation-reduction process;GO:0020037-heme binding;GO:0016705-oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen;GO:0005506-iron ion binding
AT5G02290 GO:0006468-protein phosphorylation;GO:0004672-protein kinase activity
AT3G21700 GO:0007264-small GTPase mediated signal transduction;GO:0005525-GTP binding
AT2G16040 GO:0046983-protein dimerization activity
AT3G58440 GO:0005515-protein binding
AT5G44790 GO:0046872-metal ion binding;GO:0030001-metal ion transport;GO:0000166-nucleotide binding
AT2G34630 GO:0008299-isoprenoid biosynthetic process
head gene_kegg.tsv
AT1G06620 -
AT3G27910 -
AT4G22890 -
AT5G54067 -
AT2G34630 2.5.1.85-All-trans-nonaprenyl-diphosphate synthase (geranylgeranyl-diphosphate specific).;2.5.1.1-Dimethylallyltranstransferase.
AT2G03270 3.6.4.13-RNA helicase.;3.6.4.12-DNA helicase.
AT2G25590 -
AT1G43171 -
AT5G25130 -
AT2G32280 -
AT3G15020 1.1.1.37-Malate dehydrogenase.
head gene_pfam.tsv
AT1G06620 PF14226-non-haem dioxygenase in morphine synthesis N-terminal;PF03171-2OG-Fe(II) oxygenase superfamily
AT3G27910 PF01344-Kelch motif
AT2G34630 PF00348-Polyprenyl synthetase
AT2G03270 PF13245-Part of AAA domain;PF13087-AAA domain
AT2G25590 PF05641-Agenet domain
AT5G25130 PF00067-Cytochrome P450
AT2G32280 PF06749-Protein of unknown function (DUF1218)
5.) There is a space for loading best blast IDs into GenIE-Sys website. As an example if you have best BLAST hits from model plant species that can be loaded into the database. These files should be named as gene_arabidopsis.tsv, gene_spruce.tsv
or gene_populus_tsv.
Here are some examples of the best blast annotation files.
head gene_populus.tsv
AT4G38320 Potri.014G006000;Potra003982g23967
AT4G25700 Potri.017G145700;Potra000924g07477
AT4G11300 Potri.003G132100;Potra000613g04660
AT5G61090 Potri.014G040100;Potra003452g21711
AT3G26570 Potri.008G186601;Potra002618g19588
AT3G13950 Potri.006G223800;Potra001531g12715
AT3G12170 Potri.006G056400;Potra002594g19498
AT5G43175 Potri.002G119200;Potra002863g20178
head gene_spruce.tsv
AT1G01010 MA_10426365g0010
AT1G01030 MA_18923g0010
AT1G01040 MA_10437243g0020
AT1G01050 MA_93206g0010
AT1G01060 MA_11267g0020
AT1G01070 MA_13078g0020
AT1G01080 MA_482994g0010
AT1G01090 MA_10426096g0010
AT1G01100 MA_10430200g0010
Here you can find some examples of input files that we use to generate GenIE-Sys website for PLantGenIE core species (Populus tremula, Populus trichocarpa, Picea abies, Eucalyptus grandis and Arabidopsis thaliana).
ftp://plantgenie.org/Data/GenIESys/input_files/
Input files for expression and network tables
Following table shows the example of expression file format that we use is the geniesys. expression
value can be TPM or CPM values.
gene_id sample_name dataset sample_description expression
Potra2c499s35830 Female mature leaf (Genotype 202) sex Female mature leaf (Genotype 202) 11.485859
Potra2c499s35830 Male mature leaf (Genotype 207) sex Male mature leaf (Genotype 207) 38.15796
Potra2c499s35830 Female mature leaf (Genotype 213.1) sex Female mature leaf (Genotype 213.1) 5.70359
Potra2c499s35830 Male mature leaf (Genotype 221) sex Male mature leaf (Genotype 221) 27.727529
Potra2c499s35830 Female mature leaf (Genotype 226.1) sex Female mature leaf (Genotype 226.1) 4.554956
Potra2c499s35830 Male mature leaf (Genotype 229.1) sex Male mature leaf (Genotype 229.1) 8.86191
Potra2c499s35830 Male mature leaf (Genotype 229) sex Male mature leaf (Genotype 229) 9.736108
Potra2c499s35830 Male mature leaf (Genotype 235) sex Male mature leaf (Genotype 235) 9.915396
Potra2c499s35830 Female mature leaf (Genotype 236) sex Female mature leaf (Genotype 236) 0
Potra2c499s35830 Female mature leaf (Genotype 239) sex Female mature leaf (Genotype 239) 11.865896
Following table shows the example of network file format that we use is the geniesys. irp_score,nc_score,nc_sdev
derived from the Seidr output.
dataset source target type irp_score nc_score nc_sdev
aspleaf Potra2c499s35830 Potra2n10c20315 Directed 0.286377 0.638776 0.431892
aspleaf Potra2c499s35830 Potra2n10c20477 Directed 0.267369 0.619623 0.467934
aspleaf Potra2c499s35830 Potra2n14c27073 Directed 0.261255 0.646634 0.44341
aspleaf Potra2c499s35830 Potra2n15c28503 Directed 0.235742 0.628397 0.488165
aspleaf Potra2c499s35830 Potra2n15c28913 Directed 0.246521 0.681161 0.416157
aspleaf Potra2c499s35830 Potra2n16c29690 Directed 0.200386 0.647241 0.50554
aspleaf Potra2c499s35830 Potra2n18c32618 Directed 0.261308 0.66005 0.428262
aspleaf Potra2c499s35830 Potra2n19c33381 Directed 0.453916 0.798063 0.200826
aspleaf Potra2c499s35830 Potra2n1c1338 Directed 0.275956 0.674358 0.400909
aspleaf Potra2c499s35830 Potra2n1c1722 Directed 0.239664 0.645548 0.464229
Last updated
Was this helpful?