ApiDB/EuPathDB Workshop

Sequence Data Exercises

Steve Sullivan and David Roos
Monday, June 9th - 1:30 pm

Exercise 1: Genomic Sequences

  1. Go to Trichdb.org.
  2. How many T. vaginalis genomic sequences are there in TrichDB?
  3. What are their minimum and maximum lengths?
  4. What is the length of the largest contig (assume that a contig is a genomic sequence containing no 'N's)?
  5. What might the smallest (<500 bp) contigs in TrichDB represent? (Hint: remember how shotgun library inserts are usually sequenced.)
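
The questions above can also be checked offline against a downloaded FASTA file. The following is a minimal Python sketch, not part of the TrichDB site: the helper names (read_fasta, summarize) and the toy sequences are invented for illustration, and it assumes, as in question 4, that a contig is a sequence containing no 'N' characters.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(records):
    """Count sequences and report min/max length and the largest N-free sequence."""
    records = list(records)
    lengths = [len(seq) for _, seq in records]
    # Assumption from the exercise: a contig is a sequence with no 'N's.
    n_free = [len(seq) for _, seq in records if "N" not in seq.upper()]
    return {
        "count": len(records),
        "min_length": min(lengths),
        "max_length": max(lengths),
        "largest_contig": max(n_free) if n_free else None,
    }


# Toy data, invented for illustration only (not real TrichDB sequences).
EXAMPLE = """>seq1
ACGTACGTAC
>seq2
ACGTNNNNACGTACGT
>seq3
ACG
"""

summary = summarize(read_fasta(StringIO(EXAMPLE)))
print(summary)
```

To run this on real data, replace StringIO(EXAMPLE) with an open file handle for the downloaded genomic sequence file.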

Exercise 2: GBrowse

  1. Go to the gene page for T. gondii dihydrofolate reductase-thymidylate synthase (50.m00016), e.g. by using a text or ID search.
    • What genes lie immediately upstream and downstream of DHFR-TS?
    • What kinds of evidence support the expression of these three genes? (Click on genes to open the relevant gene pages.)
    • What do you notice about the SNPs (single nucleotide polymorphisms) associated with these genes ... and what does this suggest about parasite population biology?
  2. Now open the Genome Browser and display the same genes (DHFR-TS and its flanking genes). (Click 'View in Genome Browser' and zoom out to encompass these genes.)
    • Expand the browser window to identify the next genes up- and down-stream.
    • How many genes are annotated within 30 kb up- and down-stream of DHFR-TS?
    • What additional expression information is available to extend your previous analysis of DHFR-TS and its immediate neighbors?
    • How far do you have to look to find an abundance of SNPs of a different type, indicating a different history in the three archetypal T. gondii strains? (Note that GBrowse displays can be slow to load, because vast amounts of information are associated with large spans; turn off any unnecessary tracks, and display no more than 1 Mb.)
  3. Now open the Toxoplasma 'Ancillary GBrowse' site and display the same region. (Cut the coordinates from the current window, e.g. XII:4273973..4373972, and paste them into the ancillary site.)
    • What additional information is available that might facilitate your assessment of expression?
    • Moving track positions may make cross-comparisons easier. Using the 'Configure Tracks' option, move SAGE tags to display just below Annotated Genes. Which of the SAGE tags in this region are most likely to be believable? (Note that mouse-over provides abundance information.)
    • The ancillary GBrowse site includes many other data types, such as information on cosmid and BAC clones. How many cosmids completely span the DHFR-TS gene?
    • At present, ChIP-chip data on chromatin marks is available only for a small segment of Toxoplasma chromosome Ib ... but more will be coming soon (and for other organisms as well). Jump to Ib:1500000..1600000 and display all chromatin mark information. How well do chromatin marks correspond to gene location and expression?
  4. Locate the P. falciparum DHFR-TS gene in the genome browser, and configure to display SNPs. (Note that mousing-over SNPs displays strain and coding information.)
    • What is the relative density of coding and non-coding SNPs for DHFR-TS and the adjacent genes? Do you find these numbers surprising?
    • As we gain more information about multiple species, comparative genomics is increasingly interesting. Genes that are located in the same position in different species are said to be "syntenic". Turn on all tracks for syntenic contigs (DNA sequences) and syntenic genes. Is the region surrounding DHFR-TS syntenic across available Plasmodium species? (Note: it may be necessary to reconfigure tracks so as to intersperse contigs and genes, grouped by species. Please ask for help if you are having trouble with this.)
    • How similar are gene structures for DHFR-TS in these various species?
    • Are contiguous DNA sequences properly assembled across this region in all available Plasmodium species?
    • Turn off all synteny information except for P. vivax, and zoom out to examine a 500 kb span. How syntenic are P. falciparum and P. vivax, and what can you say about non-syntenic regions? (Note that mousing over genes provides their identifiers.)
    • Chromosome centromeres are often unusually rich in A and T nucleotides. Turn on the "GC content" track to display nucleotide composition. Can you identify a candidate centromere? (You may want to zoom in, or manually reconfigure the search span, to review GC content at higher resolution.)
    • Click on the syntenic P. vivax contig. Are the (putative) centromeres syntenic between P. falciparum and P. vivax?
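
The idea behind a GC content display, and behind looking for an AT-rich candidate centromere, can be sketched in a few lines of Python. This is an illustrative sliding-window calculation only, not how GBrowse itself computes the track; the function names, window size, step, and toy sequence are all assumptions made for the example.

```python
def gc_fraction(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window, step):
    """Yield (start, gc_fraction) for each full window along seq.

    Start positions are 0-based offsets into the string; genome browser
    coordinates are conventionally 1-based.
    """
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_fraction(seq[start:start + window])


def lowest_gc_window(seq, window, step):
    """Return the (start, gc_fraction) pair of the most AT-rich window."""
    return min(gc_windows(seq, window, step), key=lambda pair: pair[1])


# Toy sequence, invented for illustration: an AT-only block flanked by GC-only blocks.
TOY = "GCGCGCGCGC" + "ATATATATAT" + "GCGCGCGCGC"

start, gc = lowest_gc_window(TOY, window=10, step=5)
print(start, gc)
```

On a real chromosome the window would typically be far larger than in this toy case, and a low-GC window is only a candidate: it still needs to be checked against the synteny and annotation tracks as described in the exercise.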