Sequencing the C. sativa PK genome and transcriptome
We obtained DNA and RNA samples from plants of PK, a clonally propagated marijuana
strain that may have been bred in California and is reportedly derived from an "indica"
genetic background
[
24]. Genomic DNA was isolated from PK leaves and used to create six 2 ×100-bp Illumina
paired-end libraries with median insert sizes of approximately 200, 300, 350, 580
and 660 bp. Sequencing each of these libraries produced > 92 gigabase (Gb) of data
after filtering of low-quality reads (see below), which is equivalent to approximately
110× coverage of the estimated ~820 Mb genome. To improve repeat resolution and scaffolding,
we supplemented these data with four 2 × 44-bp Illumina mate-pair libraries with a
median insert size of approximately 1.8 kb and two 2 × 44-bp libraries with a median
insert size of approximately 4.6 kb, adding 16.3 Gb of sequencing data in 185 million
unique mated reads. We also included eleven 454 mate-pair libraries with insert sizes
ranging from 8 to 40 kb, obtaining > 1.9 Gb of raw sequence data (~2.3 × coverage
of 820 Mb) and 2 M unique mated reads.
To characterize the cannabis transcriptome, we sequenced polyA+ RNA from a panel of
six PK tissues (roots, stems, vegetative shoots, pre-flowers (i.e. primordia) and
flowers (in early- and mid-stages of development)) obtaining > 18.8 Gb of sequence.
To increase coverage of rare transcripts, we also sequenced a normalized cDNA library
made from a mixture of the six RNA samples, obtaining an additional 33.9 Gb. The sequencing
data obtained for the genomic and RNA-Seq libraries are summarized in Table
1.
Table 1. Purple Kush sequencing library statistics
Assembling the C. sativa PK genome and transcriptome
We used different approaches for the
de novo assembly of the PK genome (SOAPdenovo
[
25]) and transcriptome (ABySS
[
26] and Inchworm
[
27]). To gauge the success of the outputs, and to refine the assemblies, we used both
traditional measures (coverage, bases in assembly, N50, maximum contig size and contig
count) as well as comparisons between the assembled versions of the genome and transcriptome.
For the transcriptome, we used two different assemblers, ABySS and Inchworm, to obtain
the best possible coverage. Both assemblers were run on the individual tissue datasets
and normalized cDNA libraries, as well as the full set of RNA-Seq data (summarized
in Table
2). We used predicted splice junctions and the presence of apparent coding regions
to orient the assembled transcripts and to perform quality control (QC). In general,
Inchworm produced assemblies with a larger N50 than ABySS (Table
2); however, we also observed many cases in which adjacent transcripts (e.g. head-to-head
transcripts that overlap in their termini) appeared to be merged. Therefore, we considered
only Inchworm transcripts with a single blastx hit covering at least 70% of their
length when merging assemblies. The filtered individual ABySS and Inchworm assemblies
were combined by first selecting the largest transcript among sets of near-identical
sequences from each assembly, followed by a second stage where transcripts with blunt
overlaps were joined. This second step resulted in a significant improvement of transcript
N50 from 1.65 to 1.80 kb (Table
2).
Table 2. Overview of transcriptome assembly stages
The final merged assembly contains 40,224 transcripts falling into 30,074 clusters
of isoforms (Table
3). We selected the transcript with the largest open reading frame (ORF) as the representative
for each cluster, resulting in a pruned assembly with an N50 of 1.91 kb. Most representative
transcripts (83%) have a blastx hit in other plants, and the distribution of transcript
classes, according to Panther
[
28], is nearly identical between PK and
Arabidopsis (Figure
1), as is the total number of transcripts and the N50 (33,602 and 1.93 kb in
Arabidopsis, respectively
[
29]). The total number of bases in representative
Arabidopsis transcripts is, however, somewhat larger (50 Mb,
[
29]) which may indicate that some of the PK transcripts are partial or that genes are
represented by more than one non-contiguous fragments. We noted a 3' end bias in the
normalized cDNA library, presumably due to the polyA priming step (data not shown).
Moreover, by combining near-identical transcripts during assembly merging and isoform
clustering, we likely collapsed transcripts of large multi-copy gene families. Indeed,
applying our isoform clustering algorithm to the
Arabidopsis assembly reduces the total number of bases to 44 Mb, which is mostly due to the loss
of transposable element genes. Overall, our assembled PK transcriptome is therefore
very similar to the deeply characterized
Arabidopsis transcriptome, both in size and composition.
Table 3. Genome and transcriptome assembly statistics
Figure 1. Transcript classes in Cannabis sativa and Arabidopsis thaliana. Panther
[
28] was used to determine the distribution of transcripts in
(a) C. sativa (PK) (30,074 representative transcripts) and
(b) A. thaliana (31,684 transcripts). The high degree of similarity between both species indicates
that all major functional classes are proportionally represented in the PK transcriptome
assembly.
Our genome assembly procedure first involved a series of filtering steps to remove
low-quality reads, bacterial sequences (about 2% of all reads) and sequencing adapters.
Mate-pair libraries (454 and Illumina) were further processed to remove duplicate
pairs and unmated reads. We then assembled a small fraction of the Illumina data (1%)
together with the 454 data, to reconstruct the mitochondrial (approximately 450 kb)
and plastid (approximately 150 kb) genomes, and subsequently removed their highly
abundant DNA sequences. The remaining reads were assembled with SOAPdenovo, resulting
in a draft assembly that spans > 786 Mb of the cannabis genome and includes 534 million
bp (Table
3). The Illumina mate-pair libraries had a significant impact on the assembly, increasing
the N50 from 2 kb to 12 kb. Addition of the large-insert 454 data increased this to
16 kb (24.9 kb for scaffolds containing genes). Between 73% and 87% of the reads in
each library could be mapped back to the draft genome (Table
1), indicating that our assembly accounts for most of the bases sequenced. As an additional
measure of completeness, we also examined the proportion of the transcriptome represented
in the genome assembly. Over 94% of assembled transcripts map to the draft genome
over at least half of their length, and 83.9% of them are fully represented; that
is, all bases of the transcript can be mapped to genomic contigs. Overall, 37.6 Mb
(92.5%) of the complete transcriptome is accounted for in the genome assembly (Figure
2), and over 68.9% of transcripts are fully encompassed by a single scaffold. Thus,
our draft genome assembly appears to represent a large majority of the genic, non-repetitive
C. sativa genome.
Figure 2. Proportion of transcriptome mapping to genome assembly. (
a) A histogram showing the number of bases in the transcript assembly that could be
mapped to the genome at 98% sequence identity, as a function of transcript length
in 300 nt bins. (
b) The proportion of transcriptome bases that could be mapped to the genome for the
same bins as in (a). The black dashed line indicates the proportion of the transcriptome
that is accounted for in the genome assembly.
The assembled
C. sativa PK genome and transcriptome (canSat3) can be downloaded and browsed at a dedicated
website
[
30]. The Cannabis Genome Browser combines the genome assembly with the transcriptome
annotations, and has tracks for RNA-Seq data, single nucleotide variants (SNVs) and
repeats.
Expression of the cannabinoid pathway in C. sativa PK tissues
Our first step in the functional analysis of the
C. sativa genome was to examine the relative expression of each of the 30,074 representative
transcripts in the six PK tissues, from which the RNA-Seq data were derived (Figure
3a). In metazoans (e.g. humans), different organs and tissues have different physiological
functions, and consequently have unique gene expression profiles
[
31]. Therefore, many genes have a highly restricted expression pattern. By contrast,
in plants, different photosynthetic tissues are often composed of a similar set of
cell types. Moreover, photosynthetic processes and primary metabolic pathways have
widespread expression, and only a minor proportion of transcripts appear to be uniquely
expressed in a given cell type
[
32]. Consistent with these observations, we found all of the cannabis photosynthetic
tissues to have similar expression profiles (Figure
3a).
Figure 3. Analysis of gene expression in PK tissues. (
a) RNA-Seq read counts for 30,074 representative transcripts (rows), expressed as log2
RPKM, were subjected to hierarchical agglomerative clustering based on their expression
pattern across tissues (columns). (
b) Schematic illustration of THCA and CBDA cannabinoid biosynthesis, including the
production of fatty acid and isoprenoid precursors via the hexanoate, MEP and GPP
pathways. Hexanoate could arise through fatty acid degradation, involving desaturase,
lipoxygenase (LOX) and hydroperoxide lyase (HPL) steps. Activation of hexanoate by
an acyl-activating enzyme (AAE) yields hexanoyl-CoA, which is the substrate for the
polyketide synthase enzyme (OLS) that forms olivetolic acid. The prenyl side-chain
originates in the MEP pathway, which provides substrates for GPP synthesis, and is
added by an aromatic prenyltransferase (PT)
[
36]. The final steps are catalyzed by the oxidocyclases THCAS and CBDAS. Pathway enzymes
and metabolic intermediates are indicated in black and blue, respectively. (
c) Same data as (a), showing the expression levels for genes in the cannabinoid pathway
and precursor pathways (rows) across the six assayed tissues (columns). The majority
of the genes encoding cannabinoid and precursor pathway enzymes are most highly expressed
in the flowering stages. Gene and pathway names correspond to those used in panel
B.
Nonetheless, flowers show a pattern of gene expression consistent with the biosynthesis
of cannabinoids and terpenoids in these organs. Cannabinoids are prenylated polyketides
that are synthesized from the short-chain fatty acid hexanoate and geranyl diphosphate
(GPP)
[
23,
33]. The latter precursor, which is the substrate for an aromatic prenyltransferase enzyme,
is derived from the 2-
C-methyl-D-erythritol 4-phosphate (MEP) pathway
[
34-
36] (see Figure
3b for details). We found that the genes encoding cannabinoid pathway enzymes and also
most of those encoding proteins (e.g. hexanoate, MEP and GPP) involved in putative
precursor pathways were most highly expressed in the three stages of flower development
(pre-flowers, and flowers in early and mid-stage of development) (Figure
3c). This finding is consistent with cannabinoids being synthesized in glandular trichomes,
the highest density of which is found on female flowers
[
37]. The production of THCA in marijuana strains (such as PK) and CBDA in hemp, is due
to the presence or absence of THCAS and CBDA synthase (CBDAS) in these two chemotypes.
Indeed, THCAS is highly expressed in PK flowers of all stages, whereas CBDAS is absent
(Figure
3c).
It is worth noting that of the 19 'pathway genes' we analyzed, 18 were complete in
the transcriptome assembly, underscoring its quality. The transcript of the
MDS gene (which encodes a protein involved in the MEP pathway) was assembled in two fragments
with a blunt overlap of 48 nt, narrowly missing the merging threshold of 50 nt. This
sequence was resolved by merging the fragments manually. All 'pathway genes' are fully
represented in the draft genome and an overview of their genomic locations is provided
on the Cannabis Genome Browser website
[
30].
Comparison of the expression of cannabinoid pathway genes between marijuana (PK) and
hemp ('Finola')
Although there are differences in the morphology of marijuana and hemp strains, the
THC content of PK and other strains selected and bred for use as marijuana is remarkably
high. We investigated whether the high THC production in PK was associated with increased
gene expression levels of cannabinoid pathway enzymes, relative to those in hemp.
We performed RNA-Seq analysis on Finola flowers at the mid-stage of development, generating
a total of 18.2 M reads. 'Finola' is a short, dioecious, autoflowering cultivar developed
in Finland for oil seed production. It was created by crossing early maturing hemp
varieties from the Vavilov Research Institute (St. Petersburg, Russia), 'Finola' might
be derived from a "
ruderalis" genetic background
[
38]. It contains moderate amounts of CBDA in female flowers but very low amounts (< 0.3%
by dry weight) of THCA. Figure
4a shows that the overall mid-flower transcript profiles, expressed as RPKM (reads per
kb per million reads), are similar between PK and 'Finola'; however, the entire cannabinoid
pathway is expressed at higher levels in PK than in 'Finola', with later steps increased
as much as 15-fold (Figure
4a).
Figure 4. Comparison of gene expression in female cannabis flowers, and gene copy number, between
marijuana (PK) and hemp ('Finola'). (
a) A scatter plot of RNA-Seq read counts for all representative transcripts in marijuana
and hemp, expressed as log2 RPKM. Specific subsets of transcripts are shown in color,
as indicated in the key. The dashed line represents the relative enrichment of trichomes
in the marijuana strain, inferred from the ratio in expression of trichome-specific
genes, as defined in the text. Gene symbols/abbreviations: CAN - known and putative
cannabinoid pathway genes; HEX - putative hexanoate pathway genes; GPP - GPP pathway
genes; MEP - MEP pathway genes; TF - putative transcription factors according to PFAM,
with at least a 4-fold change in expression in PK relative to 'Finola'; MYB - Myb-domain
transcription factors previously suggested as trichome regulators. (
b) A scatter plot of the log2 median read depth (MRD) of genomic DNA-Seq reads that
aligned uniquely to the PK transcriptome. Genomic reads were trimmed to a length of
32 bases prior to alignment with Bowtie, to allow for mapping close to exon junctions.
The lack of outliers in the scatter plot indicates that there have been relatively
few changes in gene copy number between marijuana and hemp. (
c) The relative RNA-Seq expression of individual genes in the cannabinoid pathway and
precursor pathways (is shown on the left), adjusted for enrichment of trichome-specific
genes (i.e. relative to the dashed line in panel a). The median genomic DNA read depth
for the same genes is shown on the right. The box plots reflect the variation in the
depth of coverage of uniquely aligned genomic DNA reads across each transcript, with
the median coverage distribution across all transcripts shown as reference (All).
Reads that are likely derived from pseudogenes are marked by the symbol [P]. While
there is increased expression of most cannabinoid genes in the HEX and CAN pathways
(left) in PK, this does not appear to be due to an increased representation of these
genes in the PK genome relative to the 'Finola' genome (right).
The difference in gene expression is not due to morphological variability between
the strains, such as in the size or proportion of trichomes. We examined the global
expression levels of trichome genes to account for possible differences in trichome
density between PK and 'Finola' flowers, by analyzing an RNA-Seq dataset obtained
for 'Finola' glandular trichomes (from a separate study, data not shown). From a set
of the1000 most abundant transcripts, we selected 100 that had the greatest difference
in expression between the mid-flower stage and the maximum expression level found
in PK root, shoot or stem in the current study. This subset of genes should be highly
enriched for genes predominantly expressed in glandular trichomes (and indeed contained
all the cannabinoid and hexanoate 'pathway genes' expressed in 'Finola'). The median
difference in expression level after excluding the cannabinoid genes is shown as a
dotted line in Figure
4a, and was used to adjust the expression differences shown in Figure
4c. Even after accounting for global trichome differences, cannabinoid pathway enzymes
remain among several hundred obvious outliers. Outliers also include several dozen
transcription factors, including two Myb-domain proteins that have been previously
suggested to play a role in regulating processes in cannabis trichomes
[
23] (Figure
4a). These data suggest that the increased production of cannabinoids in PK may be due
in part to an increase in expression of the biosynthetic genes.
Resequencing of the C. sativa 'Finola' genome reveals copy number changes in a PK
cannabinoid pathway related enzyme
To begin our search for the underlying causes of the differences between marijuana
and hemp, we sequenced the genome of 'Finola' (e.g. Illumina 100 nt paired-end reads,
200-500 bp inserts, at approximately 50× coverage of the estimated 820 Mb genome).
Plant genomes often contain many duplicated genes, and gene amplification represents
a well-documented mechanism for increasing expression levels
[
39]. Therefore, we first asked whether there were apparent differences in copy number
for the enzyme-encoding gene set, using the median read depth (MRD) of genomic DNA-Seq
reads that could be uniquely mapped to transcripts as a proxy. Figure
4b illustrates that, overall, there appear to be relatively few differences in gene
MRD between PK and 'Finola'. The exception to this is the much expanded coverage for
AAE3, a gene encoding an enzyme of unknown function in PK.
AAE3 is similar to an
Arabidopsis AAE [TAIR:At4g05160] that has been shown to activate medium- and long-chain fatty acids
including hexanoate
[
40]. Although
AAE1 is a more likely candidate for the hexanoyl-CoA synthetase involved in cannabinoid
biosynthesis (JMS and JEP, unpublished results), owing to its high expression in flower
tissues and increased transcript abundance in PK (Figure
3b,
4),
AAE3 might play an, as yet, unknown role in cannabinoid biosynthesis. Because we could
detect both multi- and single-exon copies of AAE3, we believe that the large expansion
of
AAE3 has occurred through the insertion of processed pseudogenes in the PK genome. In addition,
the read depth analysis uncovered reads corresponding to
CBDAS in PK and
THCAS in 'Finola'. However, on the basis of our inability to assemble these into functional
protein-coding genes, we conclude that the
THCAS reads in 'Finola' and
CBDAS reads in PK are likely to be caused by the presence of pseudogenic copies, as we discuss
below. Therefore, it appears that the differences in expression of cannabinoid pathway
enzymes between marijuana and hemp are due to subtle genetic differences that cause
changes in gene expression, either directly or indirectly.
The PK genome contains two copies of two genes involved in cannabinoid biosynthesis.
Copies of
AAE1, which encodes a protein likely to synthesize the hexanoyl-CoA precursor for cannabinoid
biosynthesis, are found on scaffold1750 [genbank:
JH227821] and scaffold29030 [genbank:
JH245535].
OLS, which encodes the putative cannabinoid pathway polyketide synthase
[
18], was found to be duplicated at scaffold15717 [genbank:
JH226441] and scaffold16618 [genbank:
JH237993].
Analysis of single nucleotide variants (SNVs) in cannabis
To further examine the genetic variation in
C. sativa, we estimated the frequency of SNVs in four taxa. In addition to PK and 'Finola',
our analysis included the Illumina reads we generated from the hemp cultivar 'USO-31',
as well as the reads from the marijuana strain Chemdawg, which were released by Medical
Genomics, LLC
[
41] while this manuscript was in preparation. 'USO-31' is a tall, monoecious fibre hemp
cultivar developed in the former Soviet Union that contains very low or undetectable
levels of cannabinoids
[
42]. Our resequencing of 'USO-31' was similar to that of 'Finola' (Illumina 100 nt paired-end
reads, 200 and 500 bp inserts, at approximately 16× coverage of the estimated 820
Mb genome). We aligned individual Illumina reads to the PK reference genome, and identified
variant bases that qualify as SNVs (see the Methods section for further details).
We also quantified the degree of heterozygosity within single genomes. Overall, PK,
Chemdawg, 'Finola' and 'USO-31' have comparable rates of heterozygosity (0.20%, 0.26%,
0.25%, and 0.18%, respectively). The lower rate of heterozygosity in 'USO-31', which
is monoecious, is presumably due to self-pollination.
The rate of occurrence of SNVs between any two strains ranged from 0.38% (PK versus
Chemdawg) to 0.64% (Chemdawg versus 'Finola'). A neighbor-joining tree constructed
using the concatenated polymorphic sequences from each of the strains is shown in
Figure
5, and supports the expected pedigree of the two hemp cultivars, which are likely to
have been bred using germplasm from North Central Asia. Although the ancestry of PK
and Chemdawg is unclear, their position on the tree supports the notion that marijuana
forms of cannabis are more related to each other than to the hemp forms, and that
these two marijuana strains share a common heritage.
Figure 5. Neighbour-joining tree for two hemp cultivars and two marijuana strains. The tree was plotted in MEGA5
[
71] using the maximum composite likelihood of SNV nucleotide substitution rates, calculated
based on the concatenated SNV sequences in each variety, as a distance metric. The
topology of the tree reveals a distinct separation between the hemp and marijuana
strains.
Genomic analysis of cannabinoid chemotypes
The molecular basis for THCA (marijuana) and CBDA (hemp) chemotypes is unclear. De
Meijer
et al [
43] crossed CBDA- and THCA-dominant plants to produce F1 progeny that are intermediate
in their ratio of THCA:CBDA; selfing gave F2 progeny that segregated 1:2:1 for THCA-dominant:codominant
mixed THCA/CBDA:CBDA-dominant chemotypes. These data suggested two explanations: a
single cannabinoid synthase locus (
B) exists with different alleles of this gene encoding THCAS or CBDAS; or THCAS and
CBDAS are encoded by two tightly linked yet genetically separate loci. In the latter
scenario, differences in transcript abundance and/or enzyme efficiencies might account
for the observed chemotypic ratios. Indeed, given that both of these enzymes compete
for CBGA, reductions in one activity might lead to a proportional increase in the
production of the other cannabinoid. Our draft sequence of the THCA-dominant PK genome
enables some preliminary insights into possible mechanisms of the inheritance of cannabinoid
profiles. Using the published
THCAS sequence [genbank:
AB057805]
[
16] to query the PK genome, a single scaffold of 12.6 kb (scaffold19603, [genbank:
JH239911]) was identified that contained the
THCAS gene as a single 1638 bp exon with 99% nucleotide identity to the published
THCAS sequence. Querying the PK transcriptome returned the same
THCAS transcript (PK29242.1, [genbank:
JP450547]) that was found to be expressed at high abundance in female flowers (Figure
3c). A
THCAS-like pseudogene (scaffold1330 [genbank:
JH227480], 91% nucleotide identity to
THCAS) was also identified. We used the
CBDAS sequence [genbank:
AB292682]
[
17] to query the PK genome and identified as many as three scaffolds that contain
CBDAS pseudogenes (scaffold39155 [genbank:
AGQN01159678], 95% nucleotide identity to
CBDAS; scaffold6274 [genbank:
JH231038] + scaffold74778 [genbank:
JH266266] combined, 94% identity; and scaffold99205 [genbank:
AGQN01254730], 94% identity), all of which contained premature stop codons and frameshift mutations.
The presence of these pseudogenes in the PK genome accounts for the identification
of
CBDAS genomic sequences in PK (Figure
4c). A 347-bp transcript fragment (PK08888.1, [genbank:
JP462955]) with 100% nucleotide identity to
CBDAS could be identified in the PK transcriptome, likely due to the nonsense-mediated decay
of transcripts derived from
CBDAS pseudogenes. Given that multiple independent loci were identified with high sequence
similarity to either
THCAS or
CBDAS in a THCA-dominant marijuana strain, the two-loci model for the genetic control of
THCA:CBDA ratios might be correct. A possible explanation is that during the development
of high-THC marijuana strains such as PK, underground breeders selected for non-functional
CBDAS that would effectively eliminate substrate competition for CBGA and thus increase
THCA production. Alternatively, the
CBDAS pseudogene in the PK genome might occur in all cannabis strains. If this is true,
the single-locus model might still be correct, and we did not find a CBDAS-encoding
allele at this locus because PK is homozygous for
THCAS.
Analysis of PK transcriptome for cannabichromenic acid synthase (CBCAS) candidates
To illustrate the potential value of the cannabis genome and transcriptome to elucidate
cannabinoid biosynthesis, we searched for genes encoding enzymes that might catalyze
the formation of cannabichromenic acid (CBCA), which is present in most cannabis plants
as a minor constituent and in certain strains as the dominant cannabinoid
[
44]. Although a protein that synthesizes CBCA has been purified from cannabis, the gene
that encodes the CBCA synthase (CBCAS) has not been identified
[
45]. We hypothesized that CBCAS is an oxidocyclase enzyme related to THCAS and CBDAS,
therefore, we queried the PK transcriptome using
THCAS and
CBDAS sequences. In total, 23 candidates were identified that had greater than 65% nucleotide
identity with these sequences. These include four genes that we designated
THCAS-like1 to
THCAS-like4, which encode proteins that are 89%, 64%, 68%, and 59% identical to THCAS at the
amino acid level, respectively. We also identified transcripts corresponding to
CBDAS2 and
CBDAS3, which are closely related to
CBDAS but do not encode enzymes with CBDAS activity
[
17]. The remaining 18 transcripts encode proteins that are similar to reticuline oxidase,
an oxidoreductase that functions in alkaloid biosynthesis
[
46]. Biochemical analysis of CBCAS candidates is currently underway.
No comments:
Post a Comment