Loading genome files

We can install a genome to GenIE-Sys using either Graphical User Interface (GUI) or Command Line Interface (CLI).

Using Graphical User Interface (GUI)
Using Command Line Interface (CLI)
Using Graphical User Interface (GUI)

Once you have placed all the required files into the data folder according to the previous section (Input files). You will be able to see the annotation tab similar to the following screenshot.

Annotation tab in the admin panel of the GenIE-Sys website

If you miss some of the files described in the previous section, you will be able to see the error message similar to the following screenshot.

Please make sure to place the correct files into the data folder. Once you have placed all required files, now it's time to parse them into the suitable formats right before loading into the database. Please follow the order in the Annotation tab, first load data into the database and then create BLAST indices using Generate FASTA indices link.

If you have expression data please go to the next tab otherwise please log out from the site and see all your changes are displayed on the GeneList, BLAST and gene information pages.

Using Command Line Interface (CLI)

Creating a new database using CMD

Due to the increasing number of species in PlantGenIE we use a standard naming convention to easily identify and maintain the databases. For example: [website name]_[species name]_[version number]

https://github.com/irusri/geniesys/blob/master/docs/images/genie_databases.png?raw=true

Log into the MySQL server and create a database.

CREATE DATABASE new_database;

You can download the empty database here. Then load the database into the newly created database using the following commands.

git show HEAD~1:scripts/dump.sql > dump.sql
mysql -u newuser -p newpassword new_database < dump.sql

Log into the MySQL server to create user and grant permissions.

Create MySQL user

CREATE USER [email protected]'localhost' IDENTIFIED BY 'newpassword';

User permissions

GRANT SELECT ON new_database.* TO [email protected]'localhost';
GRANT INSERT,UPDATE,DELETE ON new_database.genebaskets TO [email protected]'localhost';
GRANT INSERT,UPDATE,DELETE ON new_database.defaultgenebaskets TO [email protected]'localhost';

newuser, newpassword and new_database should be included in the plugins/settings.php similar to following example.

//Define the databasename names
$db_species_array=array("new_database"=>"new genome",...
//Define the databasename and background colours
$db_species_color_array=array("new_database"=>"#86c0a6",....
//Define the username, password and host here
$db_url= array ('genelist'=>'mysqli://newuser:[email protected]/'.$selected_database);
//Define the base url with trailing slash
$GLOBALS["base_url"]='http://localhost:3000/';

Download

Let’s assume we need to integrate Populus tremula v2.0 genome into GenIE-System. First, we need to download the required files. The latest version of the GFF3 and FASTA files are available on PlantGenIE FTP.

curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/gff/Potra02_genes.gff.gz
curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/fasta/Potra02_genome.fasta.gz
gzip -d Potra02_genes.gff.gz
gzip -d Potra02_genome.fasta.gz
awk '!/##/' Potra02_genes.gff |head
chr1 maker gene 8865 11259 . - . ID=Potra2n1c1;Name=Potra2n1c1
chr1 maker mRNA 8865 10802 . - . ID=Potra2n1c1.3;Parent=Potra2n1c1;Name=Potra2n1c1.3;_AED=0.39;_eAED=0.37;_QI=192|0.66|0.75|1|0|0|4|0|115
chr1 maker CDS 8865 9054 . - 1 ID=Potra2n1c1.3:cds;Parent=Potra2n1c1.3
chr1 maker CDS 9487 9559 . - 2 ID=Potra2n1c1.3:cds;Parent=Potra2n1c1.3
chr1 maker CDS 9669 9753 . - 0 ID=Potra2n1c1.3:cds;Parent=Potra2n1c1.3
chr1 maker exon 9669 9764 . - . ID=Potra2n1c1.3:exon:300;Parent=Potra2n1c1.3
chr1 maker five_prime_UTR 9754 9764 . - . ID=Potra2n1c1.3:five_prime_utr;Parent=Potra2n1c1.3
chr1 maker exon 10622 10802 . - . ID=Potra2n1c1.3:exon:299;Parent=Potra2n1c1.3
chr1 maker five_prime_UTR 10622 10802 . - . ID=Potra2n1c1.3:five_prime_utr;Parent=Potra2n1c1.3
chr1 maker mRNA 8865 11259 . - . ID=Potra2n1c1.1;Parent=Potra2n1c1;Name=Potra2n1c1.1;_AED=0.22;_eAED=0.21;_QI=896|0.66|0.75|1|0|0|4|0|115
head Potra02_genome.fasta
>chr1
AGAGAGCTCTGTGGGTCATTACTGTCACAACTCCTAGCCAGCTTGAATAT
TCCATATAGCACATATCCTGGATGGGAAAGTTTGGTTAATGTGTGCTATT
CTTGCTCGCCTTCAACACGATTATTTCGTTCATACCACAAGAAATAAACA
GTAGTGGATAGTAGAAGGCGAGCTAGCATGTGATCACTGTTATTCTTCTT
CGTGTAGTGAGTGACTGACCAATGAAGCAATTGTGTCCACGGTTTGCATG
GCCAATAATGGTTGGCTCTGCGACAAATGGACTTCCAAACCAAGCTGGTG
TAACTGCATTCAAAAAAGAGGTGTTCATATGTTTCCATGTAAATTCCATA
TAGTATGCAAGTTGTATCTGTGACTCCTCCATGCAATCTATCCATCGTTC
TTAGTCTACCAAGGCTGGCTAACCAGAGTATAAATGAGTGACGAGGGATA

Parse genome

Now we need to parse GFF3 and FASTA files into required formats. There are two primary tables(transcript_info and gene_info) in the database.

## generate file for gene_info table
awk '/gene/{split($9,a,"ID=");split(a[2],b,";");print b[1],$1,$4,$5,$7}' FS='\t' OFS='\t' Potra02_genes.gff > gene_info.txt
$ head gene_info.txt
Potra2n1c1 chr1 8865 11259 -
Potra2n1c2 chr1 21121 21603 +
Potra2n1c3 chr1 22295 24697 -
Potra2n1c4 chr1 30731 32811 +
Potra2n1c5 chr1 33508 33833 +
Potra2n1c6 chr1 50823 54726 -
Potra2n1c7 chr1 50901 51116 +
Potra2n1c8 chr1 54928 62450 -
Potra2n1c9 chr1 69471 73884 -
Potra2n1c10 chr1 74717 75583 +
## create file for transcript_info table
awk 'BEGIN{ OFS = "\t"; }$3~/gene/{g=$4"\t"$5}$3~/RNA$/{split($9,a,/[;=]/);for(i=1;i in a;i+=2)k[a[i]]=a[i+1]; print k["Name"], k["Parent"], "desc", $1, $7, g, "PAC", "PEP", $4,$5}' Potra02_genes.gff > transcript_info.txt
$ head transcript_info.txt
Potra2n1c1.3 Potra2n1c1 desc chr1 - 8865 11259 PAC PEP 8865 10802
Potra2n1c1.1 Potra2n1c1 desc chr1 - 8865 11259 PAC PEP 8865 11259
Potra2n1c1.2 Potra2n1c1 desc chr1 - 8865 11259 PAC PEP 8865 11259
Potra2n1c2.1 Potra2n1c2 desc chr1 + 21121 21603 PAC PEP 21121 21603
Potra2n1c3.1 Potra2n1c3 desc chr1 - 22295 24697 PAC PEP 22295 24697
Potra2n1c4.1 Potra2n1c4 desc chr1 + 30731 32811 PAC PEP 30731 32811
Potra2n1c5.1 Potra2n1c5 desc chr1 + 33508 33833 PAC PEP 33508 33833
Potra2n1c6.1 Potra2n1c6 desc chr1 - 50823 54726 PAC PEP 50823 54726
Potra2n1c7.1 Potra2n1c7 desc chr1 + 50901 51116 PAC PEP 50901 51116
Potra2n1c8.3 Potra2n1c8 desc chr1 - 54928 62450 PAC PEP 54928 61609

Create a database

Now we need to create a database. To do this, you need a MySQL username and password. If you use the MAMP installation default username and password would be root.

## Download all required scripts and dump database
$ git clone https://github.com/irusri/scripts.git
## Create database for default root user and root password
$ mysql -u root -proot
mysql> create database my_genie_sys_database;
Query OK, 1 row affected (0.01 sec)
mysql> use my_genie_sys_database;
Database changed
mysql> source scripts/dump.sql;

Loading primary tables

Now we need to load above two files(gene_info.txt and transcript_info.txt) into the newly created database. There is a script(load_data.sh) to do this. We can download the script and enter the correct username, password and database information to DB_USER, DB_PASS and DB parameters respectively.

$ nano scripts/load_data.sh
#load_data.sh script
#!/bin/bash
#load_data.sh
#USAGE: sh load_data.sh [table_name] [filename]
#sh load_data.sh transcript_info_x /tmp/transcript_info.tsv
DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'
mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
TRUNCATE TABLE $1;
ALTER TABLE $1 AUTO_INCREMENT = 1;
load data local infile '$2' ignore INTO TABLE $1 CHARACTER SET UTF8 fields terminated by '\t' LINES TERMINATED BY '\n' ignore 0 lines;
EOFMYSQL

Run following commands to load gene_info.txt and transcript_info.txt into respective tables.

#Load above generated source file into gene_info table
sh scripts/load_data.sh gene_info gene_info.txt
#Load previously generated source file into transcript_info table
sh scripts/load_data.sh transcript_info transcript_info.txt

Now we need to update the gene_i parameter in transcript_info table. There is a script(update_gene_i.sh) in the scripts directory, we just need to enter the correct username, password and database information to DB_USER, DB_PASS and DB parameters respectively as we did in previous step.

$ nano scripts/update_gene_i.sh
#update_gene_i.sh script
#!/bin/bash
#update_gene_i.sh
DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'
#USAGE: sh update_gene_i.sh
mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
create temporary table add_gene_i(gene_i MEDIUMINT NOT NULL AUTO_INCREMENT PRIMARY KEY, genename VARCHAR(40));
ALTER TABLE add_gene_i AUTO_INCREMENT = 1;
INSERT INTO add_gene_i(genename) select DISTINCT(gene_id) from transcript_info;
UPDATE transcript_info INNER join add_gene_i ON add_gene_i.genename = transcript_info.gene_id SET transcript_info.gene_i = add_gene_i.gene_i;
drop temporary table add_gene_i;
EOFMYSQL

Let’s run the following command to update gene_i in transcript_info table.

#Finally update the gene_i in transcript_info table using update_gene_i.sh.
sh scripts/update_gene_i.sh

If above script takes time please try following command on MySQL. This will update the gene_i column in transcript_info table.

$ nano scripts/update_gene_i.sh
#update_gene_i_dev.sh script
#!/bin/bash
#update_gene_i_dev.sh
DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'
#USAGE: sh update_gene_i_dev.sh
mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
update transcript_info,gene_info set transcript_info.gene_i=gene_info.gene_i where gene_info.gene_id=transcript_info.gene_id;
EOFMYSQL

Run following command to execute the above script(update_gene_i_dev.sh)

sh scripts/update_gene_i_dev.sh

Great! We have loaded transcript and gene infortmation properly into the database. Now can we load additional information. For example; description to the transcript_info table.

$ curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/annotation/blast2go/Potra22_blast2go_description.txt
$ head Potra22_blast2go_description.txt
Potra2n765s36715.1 UniRef90_B9GWJ3F-box domain-containing protein n=2 Tax=Populus TaxID=3689 RepID=B9GWJ3_POPTR
Potra2n765s36713.1 Populus trichocarpa uncharacterized LOC112326797 (LOC112326797), ncRNA
Potra2n765s36713.2 Populus trichocarpa uncharacterized LOC112326797 (LOC112326797), ncRNA
Potra2n765s36714.1 UniRef90_A0A2K2BA33FAD-binding PCMH-type domain-containing protein n=40 Tax=Populus TaxID=3689 RepID=A0A2K2BA33_POPTR
Potra2n1433s37070.1 UniRef90_U7E173Protein kinase domain-containing protein (Fragment) n=1 Tax=Populus trichocarpa TaxID=3694 RepID=U7E173_POPTR
Potra2n581s36023.1 UniRef90_UPI00057ABC08probable LRR receptor-like serine/threonine-protein kinase At4g08850 isoform X1 n=1 Tax=Populus euphratica TaxID=75702 RepID=UPI00057ABC08
Potra2n581s36025.1 UniRef90_UPI00057B3C83probable LRR receptor-like serine/threonine-protein kinase At4g08850 n=1 Tax=Populus euphratica TaxID=75702 RepID=UPI00057B3C83
Potra2n581s36024.1 UniRef90_U5GE99Zeta-carotene desaturase n=10 Tax=fabids TaxID=91835 RepID=U5GE99_POPTR
Potra2n707s36547.1 UniRef90_A0A2K1X8T3AMPKBI domain-containing protein n=5 Tax=Populus TaxID=3689 RepID=A0A2K1X8T3_POPTR
Potra2n409s35556.1 UniRef90_UPI000B5D6D9FE3 ubiquitin-protein ligase SHPRH isoform X3 n=1 Tax=Manihot esculenta TaxID=3983 RepID=UPI000B5D6D9F

Here is the script to load description into transcript_info column.

#!/bin/bash
#update_description.sh
DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'
# if less than two arguments supplied, display error message
if [ $# -le 0 ]
then
start='\033[0;33m'
start_0='\033[0;33m'
start_2='\033[0;31m'
end='\033[0m'
echo "\nUsage:\n$0 ${start}[gene_info/transcript_info] [file_name]${end}\nEx: ${start_2}sh update_description.sh transcript_info/gene_info potra_description.tsv${end}\n\nWhat it does?\n${start_0}This script will create a two columns(ids, description) temporary table and load the [file_name] into it.\nThen it will match ids column in temporary table with transcript_ids/gene_ids and update the gene/transcript description.\nFinally delete the temporary table.\n${end}"
exit 1
fi
table_name=$(echo $1 | awk '{split($0,a,"_");print a[1]}');
tmp_field_name=$table_name"_id"
/usr/bin/mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB<<EOFMYSQL
CREATE TEMPORARY TABLE tmp_tb(gene_name VARCHAR(60),annotation VARCHAR(1000));
load data local infile '$2' replace INTO TABLE tmp_tb fields terminated by '\t' LINES TERMINATED BY '\n' ignore 0 lines;
UPDATE $1 INNER JOIN tmp_tb on tmp_tb.gene_name = $1.$tmp_field_name SET $1.description = tmp_tb.annotation;
DROP TEMPORARY TABLE tmp_tb;
EOFMYSQL

We just need to run the script to load description into transcript_info table.

sh scripts/update_description.sh transcript_info Potra22_blast2go_description.txt

Loading secondary tables

Following are the tables available with GenIE-Sys default database. However, it is easy to add more tables depending on the user demands. Secondary table conatins annotation related to the primary tables.

$ curl -O ftp://plantgenie.org/Data/PopGenIE/Populus_tremula/v2.2/annotation/blast2go/Potra22_blast2go_GO.txt
$ head Potra22_blast2go_GO.txt
Sequence Name Annotation GO ID-Annotation GO Term
Potra2n765s36715.1 GO:0005515-protein binding
Potra2n765s36714.1 GO:0009690-cytokinin metabolic process
Potra2n765s36714.1 GO:0016021-integral component of membrane
Potra2n765s36714.1 GO:0019139-cytokinin dehydrogenase activity
Potra2n765s36714.1 GO:0055114-oxidation-reduction process
Potra2n765s36714.1 GO:0071949-FAD binding
Potra2n1433s37070.1 GO:0004674-protein serine/threonine kinase activity
Potra2n1433s37070.1 GO:0005509-calcium ion binding
Potra2n1433s37070.1 GO:0005524-ATP binding
$ awk 'BEGIN{FS="\t";OFS="\t"}{a[$1]=a[$1]?a[$1]";"$2:$2;}END{for (i in a)print i"\t"a[i];}' Potra22_blast2go_GO.txt > Potrav22_go_desc.txt
$ head Potrav22_go_desc.txt
Potra2n5c11384.4 GO:0019904-protein domain specific binding
Potra2n1c2900.1 GO:0006118-obsolete electron transport;GO:0009055-electron transfer activity;GO:0016021-integral component of membrane;GO:0022900-electron transport chain
Potra2n12c24161.1 GO:0005789-endoplasmic reticulum membrane;GO:0016021-integral component of membrane
Potra2n6c13118.1 GO:0003677-DNA binding;GO:0004724-magnesium-dependent protein serine/threonine phosphatase activity;GO:0005963-magnesium-dependent protein serine/threonine phosphatase complex;GO:0006470-protein dephosphorylation;GO:0046872-metal ion binding
Potra2n6c13118.2 GO:0003677-DNA binding;GO:0004724-magnesium-dependent protein serine/threonine phosphatase activity;GO:0005963-magnesium-dependent protein serine/threonine phosphatase complex;GO:0006470-protein dephosphorylation;GO:0046872-metal ion binding
Potra2n6c13118.3 GO:0003677-DNA binding;GO:0004724-magnesium-dependent protein serine/threonine phosphatase activity;GO:0005963-magnesium-dependent protein serine/threonine phosphatase complex;GO:0006470-protein dephosphorylation;GO:0046872-metal ion binding
Potra2n9c19679.1 GO:0016021-integral component of membrane;GO:0016117-carotenoid biosynthetic process;GO:0016166-phytoene dehydrogenase activity;GO:0016757-transferase activity, transferring glycosyl groups;GO:0055114-oxidation-reduction process
Potra2n14c27340.1 GO:0046872-metal ion binding
Potra2n9c19679.2 GO:0016021-integral component of membrane;GO:0016117-carotenoid biosynthetic process;GO:0016166-phytoene dehydrogenase activity;GO:0016757-transferase activity, transferring glycosyl groups;GO:0055114-oxidation-reduction process
Potra2n14c27340.2 GO:0046872-metal ion binding

As you see the annotation are based on transcript IDs. Therefore, Following script can be used to load secondary table into transcript_go table. Then update transcript_i column using another script as described below.

sh scripts/load_data.sh transcript_go Potrav22_go_desc.txt

Then update the gene_i or transcript_i depending on the primary usint of the annotation dataset using following script.

#!/bin/bash
#update_annotation_gene.sh
DB_USER='root' #'your_db_username'
DB_PASS='root' #'your_password'
DB='my_genie_sys_database' #'database_name'
#USAGE sh update_annotation_gene_i.sh transcript_go
display_usage() {
echo "\nUsage:\n$0 [table_name] \n"
}
# if less than one arguments supplied, display usage
if [ $# -le 0 ]
then
display_usage
exit 1
fi
count=$(mysql --host=localhost --user=$DB_USER --password=$DB_PASS --database=$DB -sse "SHOW COLUMNS FROM $1 LIKE 'transcript_id';")
if [ ${#count} -gt 0 ]
then
mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
UPDATE $1 INNER JOIN transcript_info on transcript_info.transcript_id = $1.transcript_id SET $1.transcript_i = transcript_info.transcript_i;
EOFMYSQL
else
mysql --host=localhost --user=$DB_USER --password=$DB_PASS --local_infile=1 --database=$DB <<EOFMYSQL
UPDATE $1 INNER JOIN transcript_info on transcript_info.gene_id = $1.gene_id SET $1.gene_i = transcript_info.gene_i;
EOFMYSQL
fi

Let’s run following command to fill the transcript_i or gene_i column.

sh scripts/update_annotation_gene_i.sh transcript_go

Similalrly when we have annotation based on gene IDs, we have to fill gene_annotation tables.

You may also load additional annotation as secondary tables to the GenIE-Sys database. If there is a transcript-based annotation, please use the following script to create a corresponding table (please replace annotation with the name of the annotation).

-- ----------------------------
-- Table structure for `transcript_annotation`
-- ----------------------------
DROP TABLE IF EXISTS `transcript_annotation`;
CREATE TABLE `transcript_annotation` (
`transcript_id` varchar(255) NOT NULL,
`annotation_description` varchar(1000) DEFAULT '' NOT NULL,
`transcript_i` mediumint(20) unsigned DEFAULT 0 NOT NULL,
PRIMARY KEY (`transcript_i`,`transcript_id`),
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT;

If there is a gene-based annotation, please use the following script to create a corresponding table (please replace annotation with the name of the annotation).

-- ----------------------------
-- Table structure for `gene_annotation`
-- ----------------------------
DROP TABLE IF EXISTS `gene_annotation`;
CREATE TABLE `gene_annotation` (
`gene_id` varchar(255) NOT NULL,
`annotation_description` varchar(1000) DEFAULT '' NOT NULL,
`gene_i` mediumint(20) unsigned DEFAULT 0 NOT NULL,
PRIMARY KEY (`gene_i`,`gene_id`),
) ENGINE=MyISAM DEFAULT CHARSET=latin1 ROW_FORMAT=COMPACT;

Finally you may need to add the new annotation into /plugins/genelist/genelist/service/config.php to make it searchable in the GeneSearch tool.