Radical Changes in the Engineering of Synthetic Genes for Protein Expression

February 9, 2006
Joseph D. Kittle, Jr., Ph.D.

BioPharm International, Volume 2006 Supplement, Issue 1


New concepts in gene design emphasize the control of protein translation kinetics as a means to improve yield and alter protein solubility and activity. Many expression systems are now available for producing proteins in diverse organisms. Adapting genes for predictable performance in these systems requires more than just controlling transcription and translation initiation. Pauses in translation elongation, encoded by pairs of codons, play a key role in host-specific expression. The required modifications to the so-called silent nucleotides in the open reading frame can dictate wholesale redesign of the gene. This requirement for host-specific adaptation of the gene will drive drug development programs into the era of synthetic biology. The convergence of large-scale DNA synthesis capability, better technology for gene expression, and the massive computational capacity needed to design genes de novo makes the practical methodology of gene design a new necessity.

Modern drug development programs depend heavily on targets identified first as gene sequences. Although information from various "omic" sciences provides a map to treasured target genes, the drug development process really depends on obtaining sufficient quantities of each protein that is encoded in these "druggable targets." A key to enabling expression is being able to control transcription and translation of recombinant genes.

Controlling Transcription and Translation Initiation of Recombinant Genes

To better understand the problem of controlling protein expression, let us consider how recombinant proteins are currently produced. Early Escherichia coli expression systems emphasized overall yield, but high protein expression levels in E. coli often result in insoluble protein that forms inclusion bodies. Although these insoluble particles can be purified easily, and sometimes can be refolded to yield soluble protein, there is considerable interest in controlling the factors that determine protein folding as a means of obtaining soluble, active protein.

Refinements introduced in the 1990s and in the past five years have made production in E. coli more sophisticated.1,2 One technique that has been used to overcome difficult transcription problems is to put the gene under the control of a robust promoter from bacteriophage T7 and introduce the phage RNA polymerase to drive transcription. Another approach is to use tightly repressible promoters (derived from the arabinose operon) to maintain control over genes that otherwise make vector plasmids difficult to maintain in the cell. In addition to transcription initiation control, the end of the message (the 3' untranslated region) can be stabilized by sequences that promote RNase III cleavage.

New Hosts to Consider

Perhaps the most dramatic change in protein expression has been the broadened choice of host organisms that have well-controlled recombinant protein expression systems.3-6 Recombinant proteins now are produced in a variety of bacteria, yeasts, fungi, insect cells, tissue culture, and even whole plants and animals.

More complex systems such as tissue culture or recombinant animals and plants are time-consuming and expensive. Thus, these systems are appropriate mainly for producing proteins with mammalian-specific post-translational modifications or proteins that depend on ancillary proteins such as chaperonins to fold.

By contrast, producing protein in micro-organisms such as E. coli, yeast, or fungal hosts is cheaper and often faster. In addition, when these simpler systems can produce an appropriate active molecule, they can produce very large amounts of protein.

For any given choice of host, however, a choice of transcription promoters and vectors exists, and no host is universally successful in producing proteins of interest. When contemporary "omic" sciences generate a list of dozens or hundreds of target proteins that do not have a history of successful protein expression, the complexity of finding an expression system customized for each gene may be unmanageable.

Clearly, there is a collision between the opportunities and demands placed on protein expression programs by the "omic" sciences and the capabilities of current systems. It is difficult to justify either abandoning hard-won targets or performing a survey of various combinations of host and recombinant expression system available for every protein in a list.

So, what is lacking in the technology of heterologous gene expression that makes producing an arbitrary protein so unpredictable?

Transcription and Translation Elongation

Several lines of evidence suggest that uncontrolled factors in the process of transcription and translation elongation may have direct and indirect effects on protein activity and yield. First, fusion of open reading frames to any of several well-expressed genes (e.g., glutathione S-transferase, GST) does not always produce full-length protein. Second, statistical analysis of codon usage indicates that organisms differ in codon abundance. Altering the gene to eliminate rarely used codons can alter the expression of the gene in a particular host.

The underlying observation is that, in a particular host, frequently used codons predominate in abundantly expressed proteins and are therefore considered optimal for gene expression. When the goal is to direct host-cell resources to producing a recombinant protein, human codon usage, when translated in a heterologous host, may create a scarcity of the cognate tRNA iso-acceptors and virtual starvation of the ribosome. Although codon optimization yields some improvements, it is important to note that the expression changes seen are a response to so-called silent mutations that do not change the protein composition itself. Unfortunately, codon optimization alone does not predictably dictate high protein production; sometimes the expression actually gets worse.
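The rare-codon screening described above can be sketched in a few lines of Python. This is a minimal illustration, not a published tool: the function names, the toy reference set, and the 10% threshold are assumptions made for the example. In practice the reference set would be a collection of highly expressed genes from the intended host.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(orf):
    """Split an in-frame ORF (A/C/G/T only) into codons, ignoring a trailing partial codon."""
    orf = orf.upper()
    return [orf[i:i + 3] for i in range(0, len(orf) - len(orf) % 3, 3)]


def synonymous_fractions(reference_orfs):
    """For each codon seen in the reference set, return its share among
    all codons that encode the same amino acid."""
    counts = Counter(c for orf in reference_orfs for c in split_codons(orf))
    per_amino_acid = defaultdict(int)
    for codon, n in counts.items():
        per_amino_acid[GENETIC_CODE[codon]] += n
    return {codon: n / per_amino_acid[GENETIC_CODE[codon]]
            for codon, n in counts.items()}


def rare_codon_positions(orf, fractions, threshold=0.10):
    """List (codon index, codon) for codons whose synonymous share is below
    the threshold. Codons absent from the reference set count as share 0."""
    return [(i, c) for i, c in enumerate(split_codons(orf))
            if fractions.get(c, 0.0) < threshold]


# Hypothetical toy reference: leucine is encoded ten times by CTG, once by CTA.
reference = ["ATG" + "CTG" * 10 + "CTA"]
fractions = synonymous_fractions(reference)
print(rare_codon_positions("ATGCTACTG", fractions))
```

A screen like this only identifies candidate codons to replace. As the text notes, replacing them does not by itself guarantee better expression.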

Recently, another approach has been taken in an attempt to account for the variability. This approach stems from the observation that pairs of codons appear to explicitly encode signals that control the rate at which nascent proteins are elongated as the gene is translated along its full length.8 If translation elongation rates can differ for a given amino acid sequence based on the underlying mRNA sequence as translated by a given host, this might account for a large degree of the unpredictability seen in protein expression.

Codon Pairs Can Encode Translation Pause Sites

One early suggestion of the ability of simple sequences to control translation kinetics is related to the effect of codon context on nonsense codon suppression in E. coli, with certain codon pairs having much higher or lower suppression frequencies. This observation coincides with the observation of highly improbable bias in the abundance of codon pairs encoded in an organism's transcriptome (the sum of the sections of DNA in an organism's genome that are transcribed).7 The observed frequency of some codon pairs is many standard deviations higher than the expected abundance, and this over-representation is independent of the abundance of each individual codon.7 This phenomenon is specific and directional; changing the order of the codons in a pair eliminates the effect. This statistical aberration cannot be accounted for by the abundance of the codons, the amino acid pair associations, dinucleotide abundance, or other factors. This statistical anomaly is present in all organisms tested, but the actual codon pairs in the over-represented group are different for each organism.9

Careful in vivo and in vitro translation experiments reveal a counter-intuitive result: Over-represented codon pairs in a gene's open reading frame have the effect of slowing translation, and the greater the degree of over-representation, the greater the pause.8 What is the biological relevance of this slowing? One analogy is that the pauses act like "punctuation marks" — i.e., like commas in written language. There are only a few hundred statistically over-represented codon pairs in a given transcriptome (out of 3,721 possible non-terminating pair combinations) and a lower number of highly under-represented codon pairs. Moreover, the codon pairs that are significantly over-represented vary widely by organism, so that pausing signals are different in different organisms.9
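The comparison of observed and expected codon pair abundance can be illustrated with a short Python sketch. It makes a simplifying assumption: the expected count of an ordered pair is derived only from the frequencies of its two codons. The statistic used in the cited studies also accounts for amino acid pair frequencies and reports deviations in standard deviations, which this sketch does not reproduce. All function and variable names are illustrative.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOP_CODONS]
# Number of ordered pairs of non-terminating codons: 61 * 61.
NUM_SENSE_PAIRS = len(SENSE_CODONS) ** 2


def sense_codons(orf):
    """Return the in-frame codons of an ORF up to (not including) the first stop codon."""
    orf = orf.upper()
    out = []
    for i in range(0, len(orf) - len(orf) % 3, 3):
        codon = orf[i:i + 3]
        if codon in STOP_CODONS:
            break
        out.append(codon)
    return out


def codon_pair_ratios(orfs):
    """Return {(first, second): observed / expected} for every ordered,
    adjacent codon pair that occurs in the given ORFs."""
    codon_counts = Counter()
    pair_counts = Counter()
    for orf in orfs:
        cs = sense_codons(orf)
        codon_counts.update(cs)
        pair_counts.update(zip(cs, cs[1:]))  # order matters: (A, B) != (B, A)
    total_codons = sum(codon_counts.values())
    total_pairs = sum(pair_counts.values())
    ratios = {}
    for (first, second), observed in pair_counts.items():
        # Expected count if adjacent codons were chosen independently.
        expected = (total_pairs
                    * codon_counts[first] / total_codons
                    * codon_counts[second] / total_codons)
        ratios[(first, second)] = observed / expected
    return ratios


print(NUM_SENSE_PAIRS)
ratios = codon_pair_ratios(["ATGGCTGCTGCTTAA"])  # toy input, far too small for real statistics
for pair, ratio in sorted(ratios.items()):
    print(pair, round(ratio, 3))
```

Ratios well above 1 mark over-represented pairs and ratios well below 1 mark under-represented ones. A meaningful analysis needs the full set of coding sequences from the host of interest.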

Protein translation follows a series of steps. Two tRNAs are bound to the ribosome when a growing peptide chain on one tRNA is transferred to the amino acid on the next coded tRNA.10 Mechanistically, the tRNAs that bind during the translation of a biased pair appear to be somehow incompatible (perhaps because of steric hindrance), so that tRNA binding and peptide transfer occur with unfavorable kinetics (Figure 1).9 The importance of codon pair-dictated kinetics has been seen in an isolated system, in which a single silent change in a codon caused a 30-fold change in an engineered immunoglobulin's expression,11 and in model systems in bacteria.8

Figure 1. Codon pair bias-mediated translational pausing. Incompatible tRNA isoacceptors of over-represented codon pairs affect translational step times at the levels of tRNA binding, transpeptidation, and perhaps translocation.

How does the presence of pauses affect the practical task of expressing a protein? On the simplest level, the pauses are likely to down-regulate a highly translated (polysomal) mRNA, because the rate of translation initiation will soon saturate and the slowest translation step becomes rate-limiting. Second, at least in bacteria, a significant pause can result in premature transcription termination or messenger degradation. Even in eukaryotes, the export of mRNA from the nucleus is coupled to translation, so a different but still effective system for clearing untranslated mRNA exists.

Heterologous Gene Expression Creates Inappropriate Translation Pausing Signals and Inefficient Codon Usage

Taken together, organism-specific codon usage and organism-specific pause sites mean that the biologically appropriate translation of a gene is highly adapted to its original host organism. Ribosomal pausing sites that may be functional in a human cell will almost certainly not be recognized in a bacterium, and even worse, a cDNA has a random but high probability of encoding a bacterial pause site that decouples translation from transcription, leading to expression failure. It is little wonder then that most cDNA clones do not smoothly express high levels of protein in bacteria. Even differences in pause signal coding among bacteria or among vertebrates are sufficient to make cross-family gene expression unpredictable.

Enter Synthetic Biology

The simplest test of translation pausing as a general regulator of protein synthesis is to compare a series of genes that have random pauses with synthetic genes from which the pauses have been removed intentionally. Genes moved from their source organism and expressed in a heterologous host with an altered set of over-represented codon pairs have a drastically altered configuration of presumed pause sites. Experimentally, creating codon-pair optimized genes has a dramatic effect on expression: of more than 60 genes tested, expression was either seen for the first time or improved, sometimes more than 100-fold.12

Radical Overhaul: De Novo Design of Synthetic Genes

Building a novel gene sequence to express a target protein sequence can have several advantages. Because of the redundancy of the triplet code, it is possible to preserve amino acid sequence coding while varying the nucleic acid sequence. In fact, a tremendous amount of variation is possible: approximately 3^N sequence permutations for a protein of length N. Even a small protein of 100 amino acids thus has a total sequence space of approximately 10^50 possible ways to encode the peptide. Using this space, genes can be designed that are malleable and specifically tailored to a certain host and vector system. The resulting gene can:

  • eliminate translational problems caused by inappropriate ribosome pausing

  • have codon usage rationalized to avoid over-reliance on rare codons

  • be specifically designed to avoid oligos that mis-hybridize. Genes can be easily assembled from optimized oligonucleotides that by thermodynamic necessity can only pair up in a specific order.

  • be free of selected restriction sites, internal Shine-Dalgarno sequences, or other sites that may cause problems in cloning or in interactions with the host organism

  • have a controlled RNA sequence and secondary structure to avoid detrimental termination or processing sites.

The first two items on this list can re-tune the gene of interest to express in a particular host organism. Several programs exist for the purpose of identifying codon usage changes needed to suit the tRNA availability for a particular organism.13,14
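As a rough check on the size of the synonymous sequence space discussed before the list above, the number of ways to encode a given peptide can be counted exactly from the standard genetic code. The sketch below is illustrative only: the function name and the example peptide are assumptions. It also evaluates the three-codons-per-residue approximation for 100 residues, which comes to about 5 x 10^47, so the figure quoted in the text is best read as an order-of-magnitude estimate.

```python
from collections import Counter
from math import log10, prod

# Amino acids of the standard genetic code (NCBI translation table 1),
# one letter per codon; "*" marks the three stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of synonymous codons for each of the 20 amino acids.
DEGENERACY = dict(Counter(aa for aa in AMINO if aa != "*"))


def encoding_count(protein):
    """Number of distinct DNA sequences that encode the given peptide
    (one-letter amino acid codes; raises KeyError on unknown letters)."""
    return prod(DEGENERACY[aa] for aa in protein.upper())


# Hypothetical five-residue peptide: M(1) * K(2) * L(6) * S(6) * R(6) encodings.
print(encoding_count("MKLSR"))

# Order of magnitude of 3^N for N = 100, as a base-10 exponent.
print(round(100 * log10(3), 1))
```

The exact count depends on composition: peptides rich in leucine, serine, or arginine (six codons each) have far more encodings than peptides rich in methionine or tryptophan (one codon each).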

Extensive alteration of codon usage alone, however, can introduce deleterious pauses into an open reading frame. In addition, changing the host organism effectively randomizes the pauses, making expression unpredictable. Managing codon pair interactions and simultaneously optimizing the entire set of parameters requires advanced, computationally intensive design tools. Successful applications to explore and exploit the incredible diversity of potential gene sequences have been developed using a branch-and-bound algorithm.15 The design process called "translation engineering," for example, simultaneously satisfies the five parameters listed above.12
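To make the idea of a branch-and-bound search over synonymous codons concrete, here is a minimal Python sketch. It is not the multi-queue algorithm of reference 15, nor any vendor's design tool. It assumes a deliberately simplified objective: each codon has a non-negative cost (for example, a rarity penalty) and each ordered codon pair may carry a non-negative penalty (for example, a presumed pause). The cost tables, names, and values are hypothetical.

```python
def design_gene(protein, synonyms, codon_cost, pair_penalty):
    """Pick one codon per residue to minimize total codon cost plus pair penalties.

    protein      -- string of one-letter amino acid codes
    synonyms     -- {amino acid: [codons]}
    codon_cost   -- {codon: non-negative cost}
    pair_penalty -- {(previous codon, codon): non-negative penalty}; missing pairs cost 0
    Returns (best_cost, list_of_codons).
    """
    n = len(protein)
    # min_rest[i] is a lower bound on the cost of positions i..n-1:
    # the cheapest codon at each position, ignoring pair penalties (which are >= 0).
    min_rest = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        min_rest[i] = min_rest[i + 1] + min(codon_cost[c] for c in synonyms[protein[i]])

    best = [float("inf"), None]  # incumbent cost and codon list

    def search(i, chosen, cost):
        # Bound: prune if even the optimistic completion cannot beat the incumbent.
        if cost + min_rest[i] >= best[0]:
            return
        if i == n:
            best[0], best[1] = cost, list(chosen)
            return
        # Branch: try cheaper codons first so a good incumbent is found early.
        for codon in sorted(synonyms[protein[i]], key=lambda c: codon_cost[c]):
            step = codon_cost[codon]
            if chosen:
                step += pair_penalty.get((chosen[-1], codon), 0.0)
            chosen.append(codon)
            search(i + 1, chosen, cost + step)
            chosen.pop()

    search(0, [], 0.0)
    return best[0], best[1]


# Hypothetical toy tables: AAA is the cheapest lysine codon, but AAA followed
# by CTG is penalized as a presumed pause pair.
synonyms = {"M": ["ATG"], "K": ["AAA", "AAG"], "L": ["CTG", "CTA"]}
codon_cost = {"ATG": 0, "AAA": 0, "AAG": 1, "CTG": 0, "CTA": 2}
pair_penalty = {("AAA", "CTG"): 5}

best_cost, best_codons = design_gene("MKL", synonyms, codon_cost, pair_penalty)
print(best_cost, "".join(best_codons))
```

In the toy case the search rejects the individually cheapest lysine codon because of the pair penalty, which is the interaction the text describes. For this chain-structured objective a dynamic program would work equally well. Search methods become more relevant once global constraints such as restriction sites, oligonucleotide hybridization, and RNA secondary structure are added. The recursive form shown here is suitable only for short peptides.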

Continuing use of these techniques for ever-increasing numbers of genes in diverse organisms will further test their utility. To date, genes have been tailored for expression in E. coli, Gram-positive bacteria, yeasts, human cells, and plants, with frequently dramatic increases in protein yield.

The Future of Translation Engineering and the Role of Pausing in Complex Proteins

A rich series of biological events takes place during the translation of a protein. Deconvoluting the various aspects of co-translational modification critically depends on modulating the speed of the ribosome. One example is the independent domain folding of multi-domain proteins. Experimental evidence shows that proper folding of a yeast TY3 GAG gene product (which has two independent domains) expressed in E. coli can be altered by the placement of pause sites between the two domains.16 By modifying the translation kinetics of complex multi-domain proteins, it may be possible to alter the time each domain has to organize. Although refolding studies indicate that a protein may take as long as or longer than its own translation to settle into its final configuration,17 pausing may allow each domain to partially organize, committing to a particular, independent fold. Other co-translational events, such as association with membranes, secretion, proteolysis, or association with other proteins, may all depend on the kinetics of the emerging nascent peptide.

Evolution has favored the over-representation of codon pairs that encode pause sites. This overabundance presumably is renewed as the set of pause sites changes over evolutionary time. It appears that tremendous genome-wide selection pressure exists for cells to avoid creating misfolded proteins.18 Perhaps these codon pair-encoded pauses are selected as a consequence of this pressure. Would-be designers of genes should consider taking advantage of this rich, biologically used method for controlling the translation of genes and the activity of their encoded proteins.


The authors wish to thank GW Hatfield, RH Lathrop, S-P Hung, and A Fraudorf.

Joseph D. Kittle, Jr, Ph.D., Senior Vice President, Market Development, CODA Genomics, Inc., 4521 Campus Drive, Irvine, CA 92612, Office 949 348 1188 ext 103


1. Goeddel DV, Kleid DG, Bolivar F, Heyneker HL, Yansura DG, Crea R, Hirose T, Kraszewski A, Itakura K, Riggs AD. Expression in Escherichia coli of chemically synthesized genes for human insulin. Proc Natl Acad Sci U S A. 1979;76:106-10.

2. Swartz JR. Advances in Escherichia coli production of therapeutic proteins. Curr Opin Biotechnol. 2001;12:195-201.

3. Greene JJ. Host cell compatibility in protein expression. Methods Mol Biol. 2004;267:3-14.

4. Gerngross TU. Advances in the production of human therapeutic proteins in yeasts and filamentous fungi. Nat Biotechnol. 2004;22:1409-14.

5. Cregg JM, Cereghino JL, Shi J, Higgins DR. Recombinant protein expression in Pichia pastoris. Mol Biotechnol. 2000;16:23-52.

6. Andersen DC, Krummen L. Recombinant protein expression for therapeutic applications. Curr Opin Biotechnol. 2002;13:117-23.

7. Gutman GA, Hatfield GW. Nonrandom utilization of codon pairs in Escherichia coli. Proc Natl Acad Sci U S A. 1989;86:3699-3703.

8. Irwin B, Heck JD, Hatfield GW. Codon pair utilization biases influence translational elongation step times. J Biol Chem. 1995;270:22801-22806.

9. Hatfield GW, Gutman GA. Codon pair utilization bias in bacteria, yeast, and mammals. In: Hatfield DL, Lee BJ, Pirtle RM, editors. Transfer RNA in Protein Synthesis. Boca Raton, FL: CRC Press; 1993.

10. Noller HF, Yusupov MM, Yusupova GZ, Baucom A, Cate JH. Translocation of tRNA during protein synthesis. FEBS Lett. 2002;514:11-6.

11. Trinh R, Gurbaxani B, Morrison SL, Seyfzadeh M. Optimization of codon pair use within the (GGGGS)3 linker sequence results in enhanced protein expression. Mol Immunol. 2004;40:717-722.

12. Unpublished results, CODA Genomics, Inc., Irvine, CA, www.codagenomics.com

13. Rydzanicz R, Zhao XS, Johnson PE. Assembly PCR oligo maker: a tool for designing oligodeoxynucleotides for constructing long DNA molecules for RNA production. Nucleic Acids Res. 2005;33(Web Server issue):W521-5.

14. Hoover DM, Lubkowski J. DNAWorks: an automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 2002;30(10):e43.

15. Lathrop RH, Sazhin A, Sun Y, Steffin N, Irani SS. A multi-queue branch-and-bound algorithm for anytime optimal search with biological applications. Genome Inform Ser Workshop Genome Inform. 2001;12:73-82.

16. Larsen LSZ, Lathrop RH, Sandmeyer SB, Hatfield GW. Personal communication.

17. Eaton WA, Muñoz V, Hagen SJ, Jas GS, Lapidus LJ, Henry ER, Hofrichter J. Fast kinetics and mechanisms in protein folding. Annu Rev Biophys Biomol Struct. 2000;29:327-359.

18. Drummond DA, Bloom JD, Adami C, Wilke CO, Arnold FH. Why highly expressed proteins evolve slowly. Proc Natl Acad Sci U S A. 2005;102:14338-14343.