ApiDB/EuPathDB Workshop

Solutions to Sequence Data Exercises

Steve Sullivan, David Roos
Monday, June 9th - 1:30 pm

Exercise 1: Genomic Sequences

  1. Go to TrichDB.org.
    • Tool: Genomic Sequences by Species or by Sequence ID
      Query: default if by Species, or DS* if by Seq ID
  2. How many T. vaginalis genomic sequences are there in TrichDB?
    • 64,764 sequence IDs are returned.
  3. What are their minimum and maximum lengths?
    • To get the range of lengths, add a "length" column to the results from the pulldown menu, and use the arrows to sort the column in ascending and descending order. (Alternatively, since the results are automatically sorted from high to low when you first add the column, just note the first value on the first page, then go to the last page and note the last value.) Minimum is 202 bp; maximum is 495,035 bp.
  4. What is the length of the largest contig (assume a contig = genomic sequence with no 'Ns')?
    • To determine the largest contig, add the 'Other Counts' column, then sort in ascending "Other Count" order and note the corresponding length. Answer is 135,235 bp.
  5. What might the smallest (<500bp) contigs in TrichDB represent (hint: remember how shotgun library inserts are usually sequenced)?
    • Single reads (i.e., a read of one end of a 2 kb insert); for some reason, the read of the other end failed.
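
The same questions can also be answered offline from a downloaded FASTA file. The following is a minimal Python sketch, not the way TrichDB computes its columns: the function names (parse_fasta, length_range, largest_contig) and the toy sequences are illustrative assumptions, and a "contig" is taken, as in step 4, to be any sequence containing no 'N'.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence_id: sequence}."""
    records = {}
    seq_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first whitespace-delimited token after '>'.
            seq_id = line[1:].split()[0]
            records[seq_id] = []
        elif seq_id is not None:
            records[seq_id].append(line.upper())
    return {sid: "".join(parts) for sid, parts in records.items()}


def length_range(records):
    """Return (minimum, maximum) sequence length."""
    lengths = [len(seq) for seq in records.values()]
    return min(lengths), max(lengths)


def largest_contig(records):
    """Return (id, length) of the longest sequence with no 'N', or None."""
    gap_free = {sid: len(seq) for sid, seq in records.items() if "N" not in seq}
    if not gap_free:
        return None
    best = max(gap_free, key=gap_free.get)
    return best, gap_free[best]


# Toy data standing in for a real genomic-sequence download (hypothetical).
EXAMPLE = """>seq1
ACGTACGTAC
>seq2
ACGTNNNNACGTACGT
>seq3
ACG
"""

records = parse_fasta(EXAMPLE)
print(length_range(records))
print(largest_contig(records))
```

In the toy data, seq2 is the longest sequence overall but contains Ns, so seq1 is reported as the largest contig; this mirrors the difference between the answers to steps 3 and 4.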

Exercise 2: GBrowse

  1. Go to the gene page for T. gondii dihydrofolate reductase-thymidylate synthase (50.m00016), e.g. by using a text or ID search.
    • What genes lie immediately upstream and downstream of DHFR-TS?
      • 50.m03282 = hypothetical protein / 50.m03280 = ras family protein
    • What kinds of evidence support the expression of these three genes? (Click on genes to open relevant gene pages)
      • 50.m00016 is definitely expressed: transcribed (~80%ile on microarrays for three strains), translated (many hits in proteomics datasets)
      • 50.m03282 is probably expressed: abundant transcripts (~75%ile), no proteomics hits
      • 50.m03280 may be expressed: transcripts not abundant (~35%ile), some proteomics hits
    • What do you notice about the SNPs (single nucleotide polymorphisms) associated with these genes ... and what does this suggest about parasite population biology?
      • All are 'type I' SNPs (red), i.e. GT1 is the outlier; suggests that sex is infrequent and/or haplotype blocks are large
  2. Now open the Genome Browser and display the same genes (DHFR-TS and flanking genes). (Click 'View in Genome Browser' and zoom out to encompass these genes.)
    • Expand the browser window to identify the next genes up- and down-stream.
      • 50.m03283 = copper transporter / 50.m03279 = PAN-domain protein
    • How many genes are annotated within 30 kb up- and down-stream of DHFR-TS?
      • 5 upstream, 2 downstream
    • What additional expression information is available to extend your previous analysis of DHFR-TS and its immediate neighbors?
      • Many ESTs for DHFR-TS (14 in 7 DoTs clusters), 2 ESTs for 282, none for 280.
      • SAGE tag data available … but very noisy & hard to interpret; tags for 282 may be real.
    • How far do you have to look to find an abundance of SNPs of a different type, indicating different history in the three archetypal T. gondii strains? (note that GBrowse displays can be slow to load, as vast amounts of information are associated with large spans; turn off any unnecessary tracks, and display no more than 1 Mb)
      • Move right to ~6.015 Mb, or left to ~1.58 Mb
  3. Now open the Toxoplasma 'Ancillary GBrowse' site and display the same region (Cut from the current window, e.g. XII:4273973..4373972, and paste into the ancillary site).
    • What additional information is available that might facilitate your assessment of expression?
      • SAGE tag data ... as in the main site, but more readily interpretable due to abundance and library-specific information in mouse-over.
      • Transcriptional activity summary (includes information on stage-specificity: blue=bradyzoite, green=sporozoite, red=tachyzoite, black=no stage specificity).
      • Affymetrix array data (Prugniaud 3' kit).
      • Summary: DHFR-TS is expressed; others questionable.
    • Moving track position may make cross-comparisons easier. Using the 'Configure Tracks' option, move SAGE tags to display just below Annotated Genes. Which of the SAGE tags in this region are most likely to be believable? (Note that mouse-over provides abundance information.)
    • The ancillary GBrowse site includes many other data types, such as information on cosmid and BAC clones. How many cosmids completely span the DHFR-TS gene?
      • 3 cosmids: TOXOL85, TOXPG94, and TOXP065, no BACs
    • At present, ChIP-chip data on chromatin marks is available only for a small segment of Toxoplasma chromosome Ib ... but more will be coming soon (and for other organisms as well). Jump to Ib:1500000..1600000 and display all chromatin mark information. How well do chromatin marks correspond to gene location and expression?
      • Reasonably well, but not perfectly. For example, the two closely-spaced marks near position 1500k appear to correspond well with genes 25.m02895 and 25.m01896. The relative position of H3K4me3 and H4ac marks even appears to correlate with transcriptional orientation (red genes are transcribed from right to left, and blue from left to right). The same is true for marks near 1573k. Other marks are also promising, e.g. marks near 1565k appear to correspond to probes that are positive in Affymetrix microarray studies, even though no genes are annotated here. The lack of marks associated with 25.m02953 makes sense, since this is a sporozoite-specific gene (green). Other marks (e.g. those near 1523k and 1538k) do not appear to correlate with genes, however.
  4. Locate the P. falciparum DHFR-TS gene in the genome browser, and configure to display SNPs. (Note that mousing-over SNPs displays strain and coding information.)
    • What is the relative density of coding and non-coding SNPs for DHFR-TS and the adjacent genes? Do you find these numbers surprising?
      • All of these genes show a surprising number of nonsynonymous polymorphisms ... as do all P. falciparum genes, as a consequence of high AT content. DHFR-TS shows an even higher ratio of NS/S than the adjacent genes, however, suggesting positive selection, which might initially seem surprising for a highly-conserved, essential housekeeping gene. Mousing over shows that nonsynonymous coding sequence polymorphisms in DHFR-TS correspond to known drug-resistance mutations, including the highly unusual tri-allelic polymorphism at amino acid position 108.
    • As we gain more information about multiple species, comparative genomics is increasingly interesting. Genes that are located in the same position in different species are said to be "syntenic". Turn on all tracks for syntenic contigs (DNA sequences) and syntenic genes. Is the region surrounding DHFR-TS syntenic across available Plasmodium species? (Note: it may be necessary to reconfigure tracks so as to intersperse contigs and genes, grouped by species. Please ask for help if you are having trouble with this.)
    • How similar are gene structures for DHFR-TS in these various species?
    • Are contiguous DNA sequences properly assembled across this region in all available Plasmodium species?
    • Turn off all synteny information except for P. vivax, and zoom out to examine a 500 kb span. How syntenic are P. falciparum and P. vivax, and what can you say about non-syntenic regions? (Note that mousing over genes provides their identifiers.)
    • Chromosome centromeres are often unusually rich in A and T nucleotides. Turning on the "GC content" track to display nucleotide composition, can you identify a candidate centromere? (You may want to zoom in, or manually reconfigure the search span, to review GC content at higher resolution.)
    • Clicking on the syntenic P. vivax contig, are (putative) centromeres syntenic between P. falciparum and P. vivax?
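
The idea behind scanning the "GC content" track for a candidate centromere can be illustrated with a sliding-window calculation. This is a minimal Python sketch under stated assumptions, not the algorithm GBrowse uses to draw its track: the function names (gc_fraction, gc_windows, lowest_gc_window), the window and step sizes, and the toy sequence are all hypothetical. A real analysis would use much larger windows, and a low-GC window is only a candidate, not proof of a centromere.

```python
def gc_fraction(seq):
    """Return the G+C fraction of seq, ignoring non-ACGT characters such as N."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return None  # window consists entirely of ambiguous bases
    return (seq.count("G") + seq.count("C")) / counted


def gc_windows(seq, window, step):
    """Yield (start, gc_fraction) for each full-length window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_fraction(seq[start:start + window])


def lowest_gc_window(seq, window, step):
    """Return (start, gc_fraction) of the most AT-rich window, or None."""
    scored = [(gc, start) for start, gc in gc_windows(seq, window, step)
              if gc is not None]
    if not scored:
        return None
    gc, start = min(scored)
    return start, gc


# Toy sequence (hypothetical): a GC-rich stretch, an AT-rich core, then GC-rich again.
toy = "GCGCGCGCGC" + "ATATATATAT" + "GCGCGCGCGC"

for start, gc in gc_windows(toy, window=10, step=5):
    print(start, gc)
print(lowest_gc_window(toy, window=10, step=5))
```

Shrinking the window here plays the same role as zooming in within the browser: it gives higher resolution at the cost of noisier values.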