All input files, such as FASTA, Generic Feature Format Version 3 (GFF3) and tab-separated values (TSV) files, should be placed in the data directory inside the geniesys folder. GFF3 and FASTA files are required to set up the database; the other files are recommended but not mandatory.
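As a quick pre-flight check, the required and optional files can be inventoried before the database is set up. The following is a minimal sketch, not part of GenIE-Sys itself. The function name `check_data_dir` is hypothetical, and the file suffixes used to recognise GFF3 and FASTA files (`.gff3`, `.gff`, `.fa`, `.fasta`) are assumptions, since this guide only shows `genome.fa`. The optional file names are the ones given in the guidelines in this section.

```python
import tempfile
from pathlib import Path

# Assumption: required files are recognised by these suffixes.
REQUIRED_SUFFIXES = {"GFF3": (".gff3", ".gff"), "FASTA": (".fa", ".fasta")}

# Optional file names taken from the guidelines in this section.
OPTIONAL_FILES = [
    "gene_description.tsv",
    "transcript_description.tsv",
    "gene_go.tsv",
    "gene_kegg.tsv",
    "gene_pfam.tsv",
]


def check_data_dir(data_dir):
    """Return (missing_required, missing_optional) for a data directory."""
    names = [p.name for p in Path(data_dir).iterdir() if p.is_file()]
    # A required kind is missing when no file name ends with one of its suffixes.
    missing_required = [
        kind
        for kind, suffixes in REQUIRED_SUFFIXES.items()
        if not any(name.lower().endswith(suffixes) for name in names)
    ]
    missing_optional = [name for name in OPTIONAL_FILES if name not in names]
    return missing_required, missing_optional


# Demo on a throwaway directory that contains only a FASTA and a GO file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("genome.fa", "gene_go.tsv"):
        (Path(tmp) / name).touch()
    missing_required, missing_optional = check_data_dir(tmp)

print("missing required:", missing_required)
print("missing optional:", missing_optional)
```

In practice the function would be pointed at the data directory inside the geniesys folder instead of a temporary directory.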
Here are some guidelines for preparing the initial input files.
1.) GFF3 files should follow the standard GFF3 specification. Here is an example of a GFF3 file that can be placed inside the data directory.
2.) FASTA files should follow the standard FASTA format, preferably with clear sequence IDs that contain no special characters. Here is an example of a FASTA file.
head genome.fa
3.) Gene and transcript description files should be tab-delimited. The first column is the gene ID in gene_description.tsv or the transcript ID in transcript_description.tsv, and the second column is the description of the gene or transcript, respectively. The files should be named gene_description.tsv or transcript_description.tsv. An example of each description file follows.
head gene_description.tsv
AT3G11260 WUSCHEL related homeobox 5
AT3G09140 hypothetical protein (DUF674)
AT5G01070 RING/FYVE/PHD zinc finger superfamily protein
AT3G18110 Pentatricopeptide repeat (PPR) superfamily protein
ATMG00560 Nucleic acid-binding, OB-fold-like protein
AT3G04800 translocase inner membrane subunit 23-3
AT5G63380 AMP-dependent synthetase and ligase family protein
AT1G04105 hypothetical protein
AT1G09815 polymerase delta 4
AT4G01690 Flavin containing amine oxidoreductase family
AT3G23490 cyanase
head transcript_description.tsv
AT3G11260.1 WUSCHEL related homeobox 5
AT3G09140.2 hypothetical protein (DUF674)
AT5G01070.1 RING/FYVE/PHD zinc finger superfamily protein
AT3G18110.1 Pentatricopeptide repeat (PPR) superfamily protein
ATMG00560.1 Nucleic acid-binding, OB-fold-like protein
AT3G04800.1 translocase inner membrane subunit 23-3
AT5G63380.1 AMP-dependent synthetase and ligase family protein
AT1G04105.1 hypothetical protein
AT1G09815.1 polymerase delta 4
AT4G01690.1 Flavin containing amine oxidoreductase family
4.) GO, KEGG and Pfam annotations can be loaded into the GenIE-Sys website. Prepare them as tab-delimited files: the first column is the gene ID and the second column is the GO, KEGG or Pfam ID and its description, separated by a hyphen (-). If several annotations are associated with one gene ID, separate the ID-Description pairs with semicolons (ID1-Description1;ID2-Description2). These files should be named gene_kegg.tsv, gene_go.tsv and gene_pfam.tsv. Here are some examples of annotation files.
head gene_go.tsv
AT1G06190 GO:0006353-transcription termination, DNA-dependent
AT1G06620 GO:0055114-oxidation-reduction process;GO:0016491-oxidoreductase activity
AT2G46660 GO:0055114-oxidation-reduction process;GO:0020037-heme binding;GO:0016705-oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen;GO:0005506-iron ion binding
AT5G02290 GO:0006468-protein phosphorylation;GO:0004672-protein kinase activity
AT3G21700 GO:0007264-small GTPase mediated signal transduction;GO:0005525-GTP binding
AT2G16040 GO:0046983-protein dimerization activity
AT3G58440 GO:0005515-protein binding
AT5G44790 GO:0046872-metal ion binding;GO:0030001-metal ion transport;GO:0000166-nucleotide binding
AT2G34630 GO:0008299-isoprenoid biosynthetic process
head gene_pfam.tsv
AT1G06620 PF14226-non-haem dioxygenase in morphine synthesis N-terminal;PF03171-2OG-Fe(II) oxygenase superfamily
AT3G27910 PF01344-Kelch motif
AT2G34630 PF00348-Polyprenyl synthetase
AT2G03270 PF13245-Part of AAA domain;PF13087-AAA domain
AT2G25590 PF05641-Agenet domain
AT5G25130 PF00067-Cytochrome P450
AT2G32280 PF06749-Protein of unknown function (DUF1218)
5.) Best BLAST hit IDs can also be loaded into the GenIE-Sys website. For example, if you have best BLAST hits from model plant species, they can be loaded into the database. These files should be named gene_arabidopsis.tsv, gene_spruce.tsv or gene_populus.tsv. Here are some examples of best BLAST annotation files.
Here you can find some examples of the input files that we use to generate the GenIE-Sys websites for the PlantGenIE core species (Populus tremula, Populus trichocarpa, Picea abies, Eucalyptus grandis and Arabidopsis thaliana).
ftp://plantgenie.org/Data/GenIESys/input_files/
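The annotation format in guideline 4 has one subtlety: descriptions such as "oxidation-reduction process" contain hyphens themselves, so each ID-Description pair has to be split on the first hyphen only. The following is a minimal sketch of how such a line could be parsed or checked before loading. It is not the GenIE-Sys loader; `parse_annotation_line` is a hypothetical helper, and it assumes that annotation IDs contain no hyphen and that descriptions contain no semicolon.

```python
from typing import List, Tuple


def parse_annotation_line(line: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split one line of gene_go.tsv, gene_kegg.tsv or gene_pfam.tsv.

    Returns the gene ID and a list of (annotation_id, description) pairs.
    """
    # Column 1 is the gene ID; column 2 holds all annotations (tab-delimited).
    gene_id, _, annotations = line.rstrip("\n").partition("\t")
    if not gene_id or not annotations:
        raise ValueError(f"expected two tab-separated columns: {line!r}")

    pairs = []
    for entry in annotations.split(";"):
        # Split on the FIRST hyphen only, because descriptions may
        # contain hyphens (assumption: the annotation ID does not).
        ann_id, sep, description = entry.strip().partition("-")
        if not sep or not ann_id or not description:
            raise ValueError(f"expected ID-Description, got: {entry!r}")
        pairs.append((ann_id, description))
    return gene_id, pairs


# Example line taken from the gene_go.tsv listing above.
example = (
    "AT1G06620\tGO:0055114-oxidation-reduction process;"
    "GO:0016491-oxidoreductase activity"
)
gene, terms = parse_annotation_line(example)
print(gene, terms)
```

Running this over every line of an annotation file before loading is a cheap way to catch rows that use spaces instead of a tab or that lack the hyphen separator.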
Input files for expression and network tables
The following table shows an example of the expression file format used in GenIE-Sys. Expression values can be TPM or CPM.
| gene_id | sample_name | dataset | sample_description | expression |
| --- | --- | --- | --- | --- |
| Potra2c499s35830 | Female mature leaf (Genotype 202) | sex | Female mature leaf (Genotype 202) | 11.485859 |
| Potra2c499s35830 | Male mature leaf (Genotype 207) | sex | Male mature leaf (Genotype 207) | 38.15796 |
| Potra2c499s35830 | Female mature leaf (Genotype 213.1) | sex | Female mature leaf (Genotype 213.1) | 5.70359 |
| Potra2c499s35830 | Male mature leaf (Genotype 221) | sex | Male mature leaf (Genotype 221) | 27.727529 |
| Potra2c499s35830 | Female mature leaf (Genotype 226.1) | sex | Female mature leaf (Genotype 226.1) | 4.554956 |
| Potra2c499s35830 | Male mature leaf (Genotype 229.1) | sex | Male mature leaf (Genotype 229.1) | 8.86191 |
| Potra2c499s35830 | Male mature leaf (Genotype 229) | sex | Male mature leaf (Genotype 229) | 9.736108 |
| Potra2c499s35830 | Male mature leaf (Genotype 235) | sex | Male mature leaf (Genotype 235) | 9.915396 |
| Potra2c499s35830 | Female mature leaf (Genotype 236) | sex | Female mature leaf (Genotype 236) | 0 |
| Potra2c499s35830 | Female mature leaf (Genotype 239) | sex | Female mature leaf (Genotype 239) | 11.865896 |
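The expression table is in long format, with one row per gene and sample. The following is a minimal sketch of how such a file could be read and sanity-checked before loading. It assumes the file is tab-separated with the header row shown above, which this guide does not state explicitly for the expression file, and `read_expression` is a hypothetical helper, not a GenIE-Sys function.

```python
import csv
import io
from collections import defaultdict

EXPECTED_COLUMNS = [
    "gene_id", "sample_name", "dataset", "sample_description", "expression",
]


def read_expression(handle):
    """Return {gene_id: {sample_name: expression}} from a long-format table."""
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    table = defaultdict(dict)
    for row in reader:
        value = float(row["expression"])  # TPM or CPM value
        if value < 0:
            raise ValueError(f"negative expression on line {reader.line_num}")
        table[row["gene_id"]][row["sample_name"]] = value
    return dict(table)


# Two rows from the example table, written as tab-separated text.
sample = io.StringIO(
    "gene_id\tsample_name\tdataset\tsample_description\texpression\n"
    "Potra2c499s35830\tFemale mature leaf (Genotype 202)\tsex\t"
    "Female mature leaf (Genotype 202)\t11.485859\n"
    "Potra2c499s35830\tFemale mature leaf (Genotype 236)\tsex\t"
    "Female mature leaf (Genotype 236)\t0\n"
)
expression = read_expression(sample)
print(expression["Potra2c499s35830"])
```

For a real file, an open file handle would replace the `io.StringIO` object; opening it with `newline=""` is what the `csv` module documentation recommends.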
The following table shows an example of the network file format used in GenIE-Sys. The irp_score, nc_score and nc_sdev columns are derived from the Seidr output.