Protein folding is one of the biggest problems, if not THE biggest problem, in biochemistry. It’s become the holy grail of drug development. Some of you may even have folded proteins yourself, at least virtually, with the crowd-science app “Foldit”. But then late last year the headlines proclaimed that protein folding was “solved” by artificial intelligence. Was it really solved? And if it was solved, what does that mean? And, erm, what was the protein folding problem again? That’s what we will talk about today.
Proteins are one of the major building blocks of living tissue, for example muscles, which is why you may be familiar with “proteins” as one of the most important nutrients in meat.
But proteins come in a bewildering number of variants and functions. They are everywhere in biology, and are super-important: Proteins can be antibodies that fight against infections, proteins allow organs to communicate with each other, and proteins can repair damaged tissue. Some proteins can perform amazingly complex functions, for example pumping molecules in and out of cells, or carrying substances along using motions that look much like walking.
But what’s a protein to begin with? Proteins are basically really big molecules. Somewhat more specifically, proteins are chains of smaller molecules called amino acids. But long and loose chains of amino acids are unstable, so proteins fold and curl until they reach a stable, three-dimensional shape. What is a protein’s stable shape, or stable shapes, if there are several? This is the “protein folding problem”.
Understanding how proteins fold is important because the function of a protein depends on its shape. Some mutations can lead to a change in the amino acid sequence of a protein which causes the protein to fold the wrong way. It can then no longer fulfil its function and the result can be severe illness. There are many diseases which are caused by improperly folded proteins, for example, type two diabetes, Alzheimer’s, Parkinson’s, and also ALS, that’s the disease that Stephen Hawking had.
So, understanding how proteins fold is essential to figuring out how these diseases come about, and how to maybe cure them. But the benefit of understanding protein folding goes beyond that. If we knew how proteins fold, it would generally be much easier to synthetically produce proteins with a desired function.
But protein folding is a hideously difficult problem. What makes it so difficult is that there’s a huge number of ways proteins can fold. The amino acid chains are long and they can fold in many different directions, so the possibilities increase exponentially with the length of the chain.
Cyrus Levinthal estimated in the nineteen-sixties that a typical protein could fold in more than ten to the one-hundred-forty ways. Don’t take this number too seriously though. The number of possible foldings actually depends on the size of the protein. Small proteins may have as “few” as ten to the fifty, while some large ones can have a mind-blowing ten to the three-hundred possible foldings. That’s almost as many vacua as there are in string theory!
So, just trying out all possible foldings is clearly not feasible. We’d never figure out which one is the most stable one.
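To get a feeling for these numbers, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each amino acid in the chain can sit in about 3 distinct orientations and that we could somehow test ten to the thirteen foldings per second. Both numbers are my assumptions for this example, not measured values, and the function names are made up.

```python
import math

SECONDS_PER_YEAR = 3.156e7


def log10_conformations(n_residues, states_per_residue=3.0):
    """Return log10 of states_per_residue ** n_residues.

    Working with logarithms avoids handling astronomically large numbers.
    The value 3.0 is an illustrative assumption, not a measured quantity.
    """
    return n_residues * math.log10(states_per_residue)


def log10_years_to_enumerate(log10_count, samples_per_second=1e13):
    """Return log10 of the years needed to try every folding once.

    The sampling rate is a hypothetical figure chosen for illustration.
    """
    return (log10_count
            - math.log10(samples_per_second)
            - math.log10(SECONDS_PER_YEAR))


for n in (100, 300, 600):
    exponent = log10_conformations(n)
    years_exponent = log10_years_to_enumerate(exponent)
    print(f"{n} amino acids: roughly 10^{exponent:.0f} foldings, "
          f"roughly 10^{years_exponent:.0f} years to try them all")
```

With these assumptions, a chain of a few hundred amino acids already lands in the range of Levinthal's estimate, and the time to try everything exceeds the age of the universe by an absurd margin.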
The problem is so difficult, you may think it’s unsolvable. But not all is bad. Scientists found out in the nineteen-fifties that when proteins fold under controlled conditions, for example in a test tube, then the shape into which they fold is pretty much determined by the sequence of amino acids. And even in a natural environment, rather than a test tube, this is usually still the case.
Indeed, the Nobel Prize for Chemistry was awarded for this in 1972. Before that, one could have been worried that proteins have a large number of stable shapes, but that doesn’t seem to be the case. This is probably because natural selection preferentially made use of large molecules which reliably fold the same way.
There are some exceptions to this. For example prions, like the ones that are responsible for mad cow disease, have several stable shapes. And proteins can change shape if their environment changes, for instance when they encounter certain substances inside a cell. But mostly, the amino acid sequence determines the shape of the protein.
So, the protein folding problem comes down to the question: If you have the amino-acid sequence, can you tell me what’s the most stable shape?
How would one go about solving this problem? There are basically two ways. One is that you can try to come up with a model for why proteins fold one way and not another. You probably won’t be surprised to hear that I had quite a few physicist friends who tried their hands at this. In physics we call that a “top down” approach. The other thing you can do is what we call a “bottom up” approach. This means you observe how a large number of proteins fold and hope to extract regularities from this.
Either way, to get anywhere with protein folding you first of all need examples of what folded proteins look like. One of the most important methods for this is X-ray crystallography. For this, one fires beams of X-rays at crystallized proteins and measures how the rays scatter off. The resulting pattern depends on the position of the different atoms in the molecule, from which one can then infer the three-dimensional shape of the protein. Unfortunately, some proteins take months or even years to crystallize. But a new method has recently improved the situation a lot by using electron microscopy on deep-frozen proteins. This so-called cryo-electron microscopy gives much better resolution.
In 1994, to keep track of progress in protein folding predictions, researchers founded an initiative called the Critical Assessment of Protein Structure Prediction, CASP for short. CASP is a competition among different research teams which try to predict how proteins fold. The teams are given a set of amino acid sequences and have to submit which shape they think the protein will fold into.
This competition takes place every two years. It uses protein structures that were just experimentally measured, but have not yet been published, so the competing teams don’t know the right answer. The predictions are then compared with the real shape of the protein, and get a score depending on how well they match. The method for comparing the predicted with the actual three-dimensional shape is called a Global Distance Test, and the score it gives is a percentage. 0% is a total failure, 100% is the high score. In the end, each team gets an overall score that is the average over all their prediction scores.
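To make the scoring idea a bit more concrete, here is a much simplified sketch of a global-distance-test-style score. It assumes the predicted and the measured structure are already laid on top of each other, and it then asks which fraction of atoms ends up within 1, 2, 4, and 8 Ångström of the measured position, averaged over those four cutoffs. The real test also searches for the best way to superimpose the two structures, which I skip here. The function names and the toy coordinates are made up for illustration.

```python
import math

# Distance cutoffs in Angstrom, as commonly used for the GDT total score.
THRESHOLDS = (1.0, 2.0, 4.0, 8.0)


def gdt_score(predicted, actual, thresholds=THRESHOLDS):
    """Simplified global distance test score in percent.

    Assumes both structures are already superimposed; the real test
    additionally optimizes over superpositions.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")
    distances = [math.dist(p, a) for p, a in zip(predicted, actual)]
    # Fraction of positions within each cutoff, then averaged over cutoffs.
    fractions = [sum(d <= t for d in distances) / len(distances)
                 for t in thresholds]
    return 100.0 * sum(fractions) / len(thresholds)


def team_score(prediction_scores):
    """Average over all of a team's prediction scores."""
    return sum(prediction_scores) / len(prediction_scores)


# Toy example with four atoms; coordinates are invented, in Angstrom.
actual = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.5), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 10.0, 0.0)]

print(gdt_score(predicted, actual))   # off by 0.5, 1.5, 3 and 10 Angstrom
print(gdt_score(actual, actual))      # a perfect prediction
```

A perfect prediction gets 100, and a prediction where every atom is off by more than 8 Ångström gets 0, which matches the percentage scale described above.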
For the first 20 years, progress in the CASP competition was slow. Then, researchers began putting artificial intelligence on the task. Indeed, in last year’s competition, about half of the teams used artificial intelligence or, more specifically, deep learning. Deep learning uses neural networks. It is software that is trained on large sets of data and learns to recognize patterns which it then extrapolates from. I explained this in more detail in an earlier video.
Until some years ago, no one in the CASP competition scored more than 40%. But in the last two installments of the competition, one team has reached remarkable scores. This is DeepMind, a British company that was acquired by Google in twenty-fourteen. It’s the same company which is also behind the computer program AlphaGo, which in twenty-fifteen was the first to beat a professional Go player.
DeepMind’s program for protein folding is called AlphaFold. In twenty-eighteen, AlphaFold got a score of almost 60% in the CASP competition, and in 2020, the update AlphaFold2 reached almost 90%.
The news made big headlines some months ago. Indeed, many news outlets claimed that AlphaFold2 solved the protein folding problem. But did it?
Critics have pointed out that 90% is still a significant failure rate and that some of the most interesting cases are the ones for which AlphaFold2 did not do well, such as complexes of proteins, called oligomers, in which several amino acid chains are interacting. There is also the general problem with artificial intelligence, which is that it can only learn to extract patterns from data which it has been trained on. This means the data has to exist in the first place. If there are entirely new functions that don’t make an appearance in the data set, they may remain undiscovered.
But well. I sense a certain grumpiness here of people who are afraid they’ll be rendered obsolete by software. It’s certainly true that AlphaFold’s 2020 success won’t be the end of the story. Much needs to be done, and of course one still needs data, meaning measurements, to train artificial intelligence on.
Still, I think this is a remarkable achievement and amazing progress. It means that, in the future, protein folding predictions by artificially intelligent software may save scientists many time-consuming and expensive experiments. This could help researchers to develop proteins that have specific functions. Some that are on the wish-list, for example, are proteins to stimulate the immune system to fight cancer, a universal flu vaccine, or proteins that break down plastics.