ApiDB/EuPathDB Workshop

Solutions to Data retrieval, SRT, download, Orthologs Exercises

Brian Brunk
Monday, June 9th - 3:30 pm

Exercise 1: Id list and downloads

  • In PlasmoDB, retrieve the genes for the following list of identifiers, which you determined via a microarray experiment to be downregulated >= 2-fold compared to 3D7 at 48 hrs in response to the knockout of the pfRh2b gene.
    • MAL13P1.273, PFA0010c, MAL8P1.163, PFI0025c, PF10_0293, PF10_0306, PF10_0255, PFF1360w, PFL2080c, PFL2520w, PF10_0282, PFL0010c, PFA0260c, PFL1700c, PF10_0289, PFI0945w, PF10_0343, PFL0590c, PFL0030c, PFE1100w, PFF1365c, PFI1005w, PFE1285w, PFB0100c, PFB0687c, PF10_0295, PFI0100c, PFB0105c, PFB0680w, PF11_0461, PF10_0347, PF10_0337, PF10_0022, PFL1465c, MAL13P1.22, PFI1370c, PFI1180w, PFF1180w, PFA0020w, MAL8P1.109, PFC0735w, PFL0585w, PFE1350c, PFE0625w, PF14_0607, PFE1050w, PFC0920w, PF13_0301, MAL7P1.125, PFF0995c, PFF0610c, PF10_0283, PFD0665c, PF10_0344, PF11_0075, PF14_0018, MAL13P1.2, PFA0630c, PF14_0373, PF14_0443, PF14_0500, MAL13P1.264, PFC0371w, PF13_0063, PF10_0203, MAL13P1.260, MAL8P1.72
    • Use the query by list of IDs and paste in the list above. You should get 67 genes returned.
  • Can you identify commonalities between members of this list that help elucidate the biology of this knockout?
    • You can explore this in a variety of ways. The simplest is to add columns such as GO function, EC number, and expression column(s), then look for patterns or apparent over-representation of specific values. You could also run queries for these properties and intersect them with your gene list. The most powerful approach is to download the results with many of the columns and import them into Excel or a statistical analysis package.
  • Using the tab-delimited format, download a report containing attributes that may help with #2.
    • Click download report and choose tab-delimited. Select whichever columns you want and click download (you can download to the browser or to a file).
  • Using the detailed report, download additional attributes that may help (such as metabolic pathways). What advantages and disadvantages does the detailed report have compared to the simple report?
    • Select some tables such as EC numbers, GO functions, and metabolic pathways. The disadvantage is that the output is more difficult to parse, although you can still open it in Excel. The advantage is that much more data is available, including more detailed information for things such as GO functions and pathways.
  • Download a FASTA report containing the putative promoter regions of the genes (-1000 to +0 relative to the start).
    • Choose FASTA as the type from the download page. You can then select genomic and choose the coordinates you want to download relative to the genes. Set the region to begin at -1000 from the gene start and to end at +0 from the gene start.
  • You want to include comparative genomics in your transfactor analysis. Download a FASTA report of the promoter regions of all the vivax and berghei orthologs of these genes.
    • Use the ortholog link in the query history to transform your list of 67 falciparum genes to a list of berghei and vivax genes (113 total). Click download and proceed as for #5.
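
Once a tab-delimited report is downloaded, looking for over-represented values in a column can be done outside the browser. The following is a minimal Python sketch of that idea, not part of the PlasmoDB tools. The column names ("Gene ID", "GO Function"), the gene IDs, and the annotation values are made up for illustration; a real report will have whichever columns you selected on the download page.

```python
import csv
import io
from collections import Counter


def tally_column(tsv_text, column):
    """Count how often each value occurs in one column of a tab-delimited report."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        value = (row.get(column) or "").strip()
        # Keep empty cells as an explicit category instead of dropping them.
        counts[value or "(none)"] += 1
    return counts


# Made-up rows standing in for a downloaded report (hypothetical IDs and values).
example_report = (
    "Gene ID\tGO Function\n"
    "GENE_A\ttransferase activity\n"
    "GENE_B\t\n"
    "GENE_C\ttransferase activity\n"
    "GENE_D\tprotein binding\n"
)

counts = tally_column(example_report, "GO Function")
for value, n in counts.most_common():
    print(f"{n}\t{value}")
```

To use this on a real download, read the file contents into a string and pass the header text of the column you want to tally. Raw counts only suggest over-representation; comparing against the same tally for all genes in the genome is needed before drawing conclusions.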

Exercise 2: Identify as many P. falciparum genes containing signal peptides as possible.

  • How many genes in falciparum are annotated with signal peptides (inclusion score 3)?
    • Signal peptide query for organism falciparum. (866)
  • How many P. vivax genes are annotated with signal peptides (inclusion score 3)?
    • Revise above, organism vivax. (677)
  • How many genes on these two lists are in common (hint, use the ortholog query to transform between organisms)?
    • Transform the results from the vivax query to falciparum (and vice versa) and intersect the results of the ortholog transform with the direct query. 408 falciparum genes and 397 vivax genes.
  • How many falciparum orthologs of vivax genes with signal peptides do not themselves contain signal peptides? Why might this be the case? Look at a couple of these using the synteny viewer to generate some hypotheses.
    • Subtract the falciparum signal peptide list from the falciparum list generated from the ortholog transform from vivax. This results in 187 genes. This could be due to annotation errors (model too short or missing an exon in falciparum, or model too long in vivax) or to divergence of the two proteins (less likely).
  • Generate the most comprehensive list of falciparum genes using PlasmoDB that may contain signal peptides (inclusion score 3). How many did you find?
    • Query for signal peptides in all Plasmodium species except falciparum. Then transform this result to falciparum using orthology and union it with the signal peptide query in falciparum. (1309)
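
The transform, intersect, subtract, and union steps used in this exercise are ordinary set operations. The sketch below illustrates them in Python with toy data. The gene IDs and the ortholog mapping are hypothetical, and only one other species is shown, whereas the most comprehensive list above uses all Plasmodium species except falciparum.

```python
def transform_by_orthology(gene_ids, ortholog_map):
    """Map genes to their orthologs in another species.

    Genes with no entry in ortholog_map simply drop out of the result.
    """
    result = set()
    for gene in gene_ids:
        result.update(ortholog_map.get(gene, ()))
    return result


# Hypothetical toy data, not real PlasmoDB results.
pf_signal = {"pf1", "pf2", "pf3"}   # falciparum genes with signal peptides
pv_signal = {"pv1", "pv2", "pv4"}   # vivax genes with signal peptides
pv_to_pf = {                        # vivax gene -> falciparum ortholog(s)
    "pv1": {"pf1"},
    "pv2": {"pf5"},
    "pv4": {"pf2", "pf6"},
}

# Transform the vivax result to falciparum.
pf_from_pv = transform_by_orthology(pv_signal, pv_to_pf)

in_common = pf_signal & pf_from_pv      # intersect: predicted in both species
ortholog_only = pf_from_pv - pf_signal  # subtract: ortholog has a peptide, gene does not
comprehensive = pf_signal | pf_from_pv  # union: most inclusive falciparum list

print("in common:", sorted(in_common))
print("ortholog only:", sorted(ortholog_only))
print("comprehensive:", sorted(comprehensive))
```

Because one gene can have several orthologs (or none), the transformed list need not be the same size as the input list, which is why the falciparum and vivax counts in the answers above differ.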

Exercise 3: Apicoplast-targeted genes in T. gondii.

  • Identify putative nuclear genes in T. gondii that are targeted to the apicoplast.
    • In EuPathDB, query for falciparum apicoplast-targeted genes, then transform to Toxoplasma using the ortholog query. (213)
  • How many of these have signal peptides?
    • Signal peptide query of Toxo and intersect with above result. (41)
  • Is the percentage in #2 above higher than the percentage of all Toxoplasma genes with signal peptides?
    • Query for all protein-coding genes and use the results to generate the percentages. Yes: nearly 20% vs. 12%.
  • How does the percentage compare to the percentage of falciparum apicoplast-targeted genes containing signal peptides?
    • Run the falciparum protein-coding gene and falciparum signal peptide queries and use the results to generate percentages. The falciparum percentage is much higher: 77% vs. 20%.
  • Do you think these results indicate a valid approach to identifying putative Toxoplasma apicoplast targeted genes?
    • There seems to be an enrichment of genes containing a signal peptide, but it is not nearly as high as for the falciparum genes. The list likely contains some real apicoplast-targeted genes but could also contain a lot of noise.
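
The percentages in this exercise are simple ratios of result counts. As a worked example, the Python sketch below recomputes the Toxoplasma figure from the counts given above (41 of 213) and compares it with the rounded 12% and 77% values quoted in the answers. The genome-wide gene counts behind those two values are not listed in this handout, so they are used here as quoted rather than recomputed.

```python
def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


# Counts reported in the answers above.
toxo_candidates = 213         # Toxoplasma orthologs of falciparum apicoplast-targeted genes
toxo_candidates_with_sp = 41  # of those, genes with a predicted signal peptide

candidate_pct = percent(toxo_candidates_with_sp, toxo_candidates)

# Rounded percentages as quoted in the answers (not recomputed here).
toxo_background_pct = 12.0        # all Toxoplasma genes with signal peptides
falciparum_apicoplast_pct = 77.0  # falciparum apicoplast-targeted genes with signal peptides

fold_over_background = candidate_pct / toxo_background_pct
falciparum_vs_toxo = falciparum_apicoplast_pct / candidate_pct

print(f"Toxoplasma candidates with signal peptide: {candidate_pct:.1f}%")
print(f"Fold enrichment over Toxoplasma background: {fold_over_background:.1f}x")
print(f"Falciparum set relative to Toxoplasma candidates: {falciparum_vs_toxo:.1f}x")
```

The candidate percentage comes out at about 19.2%, which matches the "nearly 20%" in the answer. A fold enrichment alone does not say whether the difference is statistically meaningful; a proper test would need the underlying genome-wide counts.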

Exercise 4: Orthology profile

  • Identify P. falciparum genes that are conserved among all apicomplexans but not present in mammals. How many are there?
    • Use the Orthology Profile query, click on apicomplexans so they have green checks and mammals so they have red Xs. (212)
  • How does this compare to the number of falciparum genes conserved among all Plasmodium species but not present in mammals?
    • Revise the orthology profile query above. 2064, so requiring conservation in all apicomplexans is much more specific.
  • Now extend #2 to also include conservation with T. gondii but not mammals.
    • Revise the orthology profile query above. 863 results.
  • Why might these sorts of queries be very useful when analyzing a eukaryotic pathogen?
    • They make it possible to identify genes that may make good drug or vaccine targets, since these genes are conserved among a variety of pathogens but not present in mammals (and are thus likely to be pathogen-specific).
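
The logic behind an orthology profile (green checks for taxa that must be present, red Xs for taxa that must be absent, everything else ignored) can be sketched in a few lines of Python. This is an illustration of the idea only, not the implementation used by the site. The gene names, taxon codes, and group memberships below are hypothetical.

```python
def matches_profile(taxa_in_group, required, excluded):
    """True if the ortholog group contains every required taxon and no excluded taxon.

    Taxa that are in neither set are "don't care" and do not affect the result.
    """
    return required <= taxa_in_group and not (excluded & taxa_in_group)


def genes_matching(groups, required, excluded):
    """Return the sorted gene names whose ortholog group fits the profile."""
    return sorted(
        gene for gene, taxa in groups.items()
        if matches_profile(taxa, required, excluded)
    )


# Hypothetical falciparum genes mapped to the taxa present in their ortholog group.
groups = {
    "gene_1": {"pfal", "pviv", "tgon", "cpar"},          # all apicomplexans, no mammals
    "gene_2": {"pfal", "pviv", "tgon", "cpar", "hsap"},  # also present in a mammal
    "gene_3": {"pfal", "pviv"},                          # Plasmodium only
}

plasmodium = {"pfal", "pviv"}
apicomplexans = plasmodium | {"tgon", "cpar"}
mammals = {"hsap", "mmus"}

all_apicomplexans = genes_matching(groups, apicomplexans, mammals)
plasmodium_only_required = genes_matching(groups, plasmodium, mammals)

print("conserved in all apicomplexans, absent from mammals:", all_apicomplexans)
print("conserved in Plasmodium, absent from mammals:", plasmodium_only_required)
```

Requiring more taxa can only shrink the result, which is the same pattern as in the answers above: the all-apicomplexan profile returns fewer genes than the Plasmodium-only profile.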