ApiDB/EuPathDB Workshop

Solutions to Population Data Exercises

Brian Brunk
Tuesday, June 10th - 10:45 am

Exercise 1: Isolates


  1. How many isolate records are there?
    • 756
  2. How did you figure it out?
    • Queried isolates by product name, selecting all product names.
  3. What, in your opinion, is the most interesting isolation source?
    • Personal opinion; there is no incorrect answer.

Exercise 2: SNPs


  1. Find all SNPs in the falciparum DHFR gene that would make good genetic markers (i.e., have at least two strains represented for each allele).
    • Do a keyword query to find the gene identifier for DHFR. Then go to the SNPs by Gene id query and enter the id for DHFR (PFD0830w). Keep 3D7 as the reference and select all strains except P. reichenowi as comparators. This returns 7 SNPs in the DHFR gene. The columns list the strains that match the reference and the strains that show a SNP. Select the SNPs that have at least two strains in each category.
  2. How would you find good genetic markers distributed, say, every megabase along the falciparum genome?
    • Use the SNPs by Allele Frequency query. Set the major and minor allele frequencies both >= 3 (let's be even more conservative since we have lots of SNPs). This returns 4,279 SNPs. These can then be sorted by position, and SNPs chosen that are spaced along the genome. It might be simplest to download a table of these results as an Excel file (click download and choose the columns you want) and then manipulate the results in Excel.
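If you would rather script the selection than do it by hand in a spreadsheet, the two steps (keep SNPs with at least two strains per allele, then take one SNP per megabase) could be sketched roughly as below. This is only an illustration: the record fields, the SNP identifiers and the strain counts are all made up, and the columns in a real PlasmoDB download will be named differently.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    # Hypothetical record layout; not the actual PlasmoDB column names.
    snp_id: str
    chromosome: str
    position: int
    major_count: int  # number of strains carrying the major allele
    minor_count: int  # number of strains carrying the minor allele


def good_markers(snps, min_strains=2):
    """Keep SNPs with at least `min_strains` strains for each allele."""
    return [
        s for s in snps
        if s.major_count >= min_strains and s.minor_count >= min_strains
    ]


def one_per_bin(snps, bin_size=1_000_000):
    """Return the first SNP (by position) in each bin of each chromosome."""
    chosen = {}
    for s in sorted(snps, key=lambda s: (s.chromosome, s.position)):
        key = (s.chromosome, s.position // bin_size)
        chosen.setdefault(key, s)  # keeps only the first SNP seen per bin
    return list(chosen.values())


# Made-up example data, for illustration only.
example_snps = [
    Snp("snp_a", "chr1", 150_000, 5, 3),
    Snp("snp_b", "chr1", 900_000, 6, 1),   # only one strain has the minor allele
    Snp("snp_c", "chr1", 950_000, 4, 4),
    Snp("snp_d", "chr1", 1_200_000, 3, 3),
    Snp("snp_e", "chr2", 50_000, 2, 2),
]

markers = one_per_bin(good_markers(example_snps))
print([s.snp_id for s in markers])
```

Taking the first SNP in each bin is the simplest choice; you might instead prefer the SNP closest to the middle of each bin, or the one with the most balanced allele counts.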

Exercise 3: Diversifying selection


  1. Find the falciparum genes that are the most diverse among sequenced strains and thus appear to be under diversifying selection.
  2. What strikes you about the known genes in this result set?
    • We address this question using the Genes by SNPs query. Be sure to open the advanced parameters. We would expect genes under diversifying selection to have many SNPs and, more specifically, a high ratio of non-synonymous to synonymous SNPs. This query is based on specific strains, so you may want to run it several times with different strains (use the reference and comparators above the line, as they have been sequenced more deeply). Choose 20 for the minimum number of SNPs and 3 for the minimum ratio of non-synonymous to synonymous SNPs.
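The two thresholds used above (at least 20 SNPs, non-synonymous to synonymous ratio of at least 3) amount to a simple filter, which could be sketched as follows if you wanted to apply it to a downloaded table. The field names and gene identifiers are invented for the example, and the handling of genes with zero synonymous SNPs is an assumption; the website may treat that case differently.

```python
from typing import NamedTuple


class GeneSnpCounts(NamedTuple):
    # Hypothetical fields; a real download will use other column names.
    gene_id: str
    total_snps: int
    nonsyn: int  # non-synonymous SNPs
    syn: int     # synonymous SNPs


def diverse_genes(genes, min_snps=20, min_ratio=3.0):
    """Keep genes with enough SNPs and a high nonsyn/syn ratio.

    The ratio test is written as nonsyn >= min_ratio * syn to avoid
    dividing by zero. Assumption: a gene with no synonymous SNPs is
    treated as passing the ratio test.
    """
    return [
        g for g in genes
        if g.total_snps >= min_snps and g.nonsyn >= min_ratio * g.syn
    ]


# Made-up example data, for illustration only.
example_genes = [
    GeneSnpCounts("gene_a", 25, 20, 5),   # ratio 4, passes
    GeneSnpCounts("gene_b", 30, 15, 15),  # ratio 1, fails
    GeneSnpCounts("gene_c", 10, 9, 1),    # too few SNPs, fails
    GeneSnpCounts("gene_d", 22, 22, 0),   # no synonymous SNPs, passes
]

print([g.gene_id for g in diverse_genes(example_genes)])
```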

Exercise 4: Drug Target Identification


  1. Try to identify your own set of the ten most promising drug targets! (Hint: there are many ways to do this depending on your own assumptions and experience, but one way would be to find genes that appear likely to be enzymes, are conserved among Plasmodium species but not in humans, are expressed in merozoites, and are under purifying selection (i.e., are not changing rapidly).)
    • We've made movie tutorials for this exercise demonstrating one way in which this could be done. Note that there are many criteria that one might use to identify drug candidates, so this represents only one way to get an answer. The tutorials are "Drug Target PlasmoDB 5.3" and, with the SNP query included, "Drug Target With SNP Data PlasmoDB 5.3".
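The approach in the hint boils down to combining result sets: intersect the gene lists you want and subtract the one you don't. A rough sketch of that logic is below, assuming each query result has been reduced to a set of gene identifiers. The function name, set names and gene IDs are all made up for the example and do not correspond to anything on the website.

```python
def candidate_targets(enzymes, conserved_in_plasmodium, has_human_ortholog,
                      expressed_in_merozoites, under_purifying_selection):
    """Genes meeting every positive criterion and lacking a human ortholog."""
    wanted = (enzymes
              & conserved_in_plasmodium
              & expressed_in_merozoites
              & under_purifying_selection)
    return wanted - has_human_ortholog


# Made-up gene identifiers, for illustration only.
example_targets = candidate_targets(
    enzymes={"g1", "g2", "g3", "g4"},
    conserved_in_plasmodium={"g1", "g2", "g3", "g5"},
    has_human_ortholog={"g2"},
    expressed_in_merozoites={"g1", "g2", "g3"},
    under_purifying_selection={"g1", "g2", "g6"},
)
print(sorted(example_targets))
```

How to rank the surviving genes down to ten is left open here, since that depends on which criteria you weigh most heavily.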