Input files
All input files, such as FASTA, Generic Feature Format Version 3 (GFF3) and tab-separated values (TSV) files, should be placed in the data directory inside the geniesys folder. GFF3 and FASTA files are required to set up the database; the other files are recommended but not mandatory.
ls geniesys/data
gene.gff3
genome.fa
cds.fa
transcript.fa
protein.fa
Here are some guidelines for preparing the initial input files.
1.) GFF3 files should follow the standard GFF3 specification listed here. Here is an example of a GFF3 file that can be placed in the data directory.
head gene.gff3
##gff-version 3
Chr1 phytozome9_0 gene 3631 5899 . + . ID=AT1G01010;Name=AT1G01010
Chr1 phytozome9_0 mRNA 3631 5899 . + . ID=PAC:19656964;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010
Chr1 phytozome9_0 exon 3631 3913 . + . ID=PAC:19656964.exon.1;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 five_prime_UTR 3631 3759 . + . ID=PAC:19656964.five_prime_UTR.1;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 CDS 3760 3913 . + 0 ID=PAC:19656964.CDS.1;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 exon 3996 4276 . + . ID=PAC:19656964.exon.2;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 CDS 3996 4276 . + 2 ID=PAC:19656964.CDS.2;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 exon 4486 4605 . + . ID=PAC:19656964.exon.3;Parent=PAC:19656964;pacid=19656964
Chr1 phytozome9_0 CDS 4486 4605 . + 0 ID=PAC:19656964.CDS.3;Parent=PAC:19656964;pacid=19656964
2.) FASTA files should follow the standard FASTA format, preferably with a clear sequence ID without special characters. Here is an example of a FASTA file.
head genome.fa
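The sequence ID guideline above can be spot-checked with a short script. The following is a minimal sketch, not part of GenIE-Sys: the function name `unsafe_fasta_ids` is hypothetical, and it assumes that "special characters" means anything other than letters, digits, underscore, dot and hyphen. Adjust the pattern if your IDs follow other rules.

```python
import re

# Assumption: an ID is "clear" if it contains only letters, digits,
# underscore, dot and hyphen. GenIE-Sys itself may accept more.
SAFE_ID = re.compile(r"[A-Za-z0-9_.-]+")


def unsafe_fasta_ids(lines):
    """Return the sequence IDs (first word of each header) that fail SAFE_ID."""
    bad = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            if not SAFE_ID.fullmatch(seq_id):
                bad.append(seq_id)
    return bad


# Made-up records for illustration only.
example = [">Chr1", "ACGT", ">scaffold|12 some description", "GGCC"]
print(unsafe_fasta_ids(example))  # ['scaffold|12']
```

To check a real file, pass an open file handle, for example `unsafe_fasta_ids(open("geniesys/data/genome.fa"))`.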
3.) Gene and transcript descriptions should be tab-delimited files. The first column is the gene ID in gene_description.tsv or the transcript ID in transcript_description.tsv, and the second column is the description of the gene or transcript, respectively. The files should be named gene_description.tsv or transcript_description.tsv. Below is an example of a description file.
head gene_description.tsv
head transcript_description.tsv
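A description file can be checked against the two-column layout described above before loading. This is a minimal sketch under that assumption only; `check_description_rows` is a hypothetical helper, not a GenIE-Sys function, and the description text in the example is a placeholder.

```python
def check_description_rows(lines):
    """Return (line_number, reason) for rows that are not 'ID<TAB>description'."""
    problems = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row:
            continue  # ignore blank lines
        fields = row.split("\t")
        if len(fields) != 2:
            problems.append((number, f"expected 2 columns, found {len(fields)}"))
        elif not fields[0].strip():
            problems.append((number, "empty ID"))
    return problems


# Placeholder rows for illustration only.
example = [
    "AT1G01010\tplaceholder description",
    "AT1G01010 placeholder separated by spaces",
]
print(check_description_rows(example))  # [(2, 'expected 2 columns, found 1')]
```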
4.) GO, KEGG and Pfam annotations can be loaded into the GenIE-Sys website as tab-delimited files. The first column should be the gene ID and the second column should be the GO, KEGG or Pfam ID and description, separated by a hyphen (-). If several annotations are associated with one gene ID, use a semicolon (;) to separate the ID-description pairs (ID1-Description1;ID2-Description2). These files should be named gene_kegg.tsv, gene_go.tsv and gene_pfam.tsv. Here are some examples of annotation files.
head gene_go.tsv
head gene_kegg.tsv
head gene_pfam.tsv
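To make the annotation layout concrete, here is a minimal sketch of how one row in the format described above could be split into its parts. It illustrates the file format only and is not how GenIE-Sys itself reads the files; `parse_annotation_row` is a hypothetical name. It assumes that annotation IDs contain no hyphen (the first hyphen in each entry ends the ID) and that descriptions contain no semicolon.

```python
def parse_annotation_row(line):
    """Split 'gene<TAB>ID1-Desc1;ID2-Desc2' into (gene_id, [(id, description), ...])."""
    gene_id, annotations = line.rstrip("\n").split("\t", 1)
    pairs = []
    for entry in annotations.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing semicolon
        # Split on the first hyphen only, so hyphens in the description survive.
        ann_id, _, description = entry.partition("-")
        pairs.append((ann_id, description))
    return gene_id, pairs


# Placeholder annotation values taken from the pattern in the text.
row = "AT1G01010\tID1-Description1;ID2-Description2"
print(parse_annotation_row(row))
# ('AT1G01010', [('ID1', 'Description1'), ('ID2', 'Description2')])
```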
5.) Best BLAST hit IDs can also be loaded into the GenIE-Sys website. For example, if you have best BLAST hits from model plant species, they can be loaded into the database. These files should be named gene_arabidopsis.tsv, gene_spruce.tsv or gene_populus.tsv. Here are some examples of best BLAST annotation files.
head gene_populus.tsv
head gene_spruce.tsv
Here you can find some examples of the input files that we use to generate the GenIE-Sys websites for the PlantGenIE core species (Populus tremula, Populus trichocarpa, Picea abies, Eucalyptus grandis and Arabidopsis thaliana).
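Because the GFF3 file is required to set up the database, it can be worth running a quick sanity check on it first. The sketch below is a hypothetical helper (`check_gff3`), not part of GenIE-Sys and not a full GFF3 validator: it only checks that each feature line has the nine tab-separated columns the GFF3 specification requires and that every Parent value refers to an ID defined somewhere in the file.

```python
def check_gff3(lines):
    """Return (line_number, reason) for bad column counts or unknown Parent IDs."""
    ids, parents, problems = set(), [], []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row or row.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = row.split("\t")
        if len(columns) != 9:
            problems.append((number, f"expected 9 columns, found {len(columns)}"))
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = dict(
            item.split("=", 1) for item in columns[8].split(";") if "=" in item
        )
        if "ID" in attributes:
            ids.add(attributes["ID"])
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                parents.append((number, parent))
    # Resolve parents after the whole file is read, so order does not matter.
    problems += [(n, f"unknown Parent {p}") for n, p in parents if p not in ids]
    return problems


# The exon's parent mRNA line is deliberately left out of this example.
example = [
    "##gff-version 3",
    "Chr1\tphytozome9_0\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010;Name=AT1G01010",
    "Chr1\tphytozome9_0\texon\t3631\t3913\t.\t+\t.\t"
    "ID=PAC:19656964.exon.1;Parent=PAC:19656964;pacid=19656964",
]
print(check_gff3(example))  # [(3, 'unknown Parent PAC:19656964')]
```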
Input files for expression and network tables
The following table shows an example of the expression file format that we use in GenIE-Sys. Expression values can be TPM or CPM.
The following table shows an example of the network file format that we use in GenIE-Sys. The irp_score, nc_score and nc_sdev columns are derived from the Seidr output.