Sabine Hossenfelder: Backreaction: Automated Discovery

Thursday, August 01, 2019

Automated Discovery

In 1986, Don Swanson from the University of Chicago discovered a discovery.

Swanson (who passed away in 2012) was an information scientist and a pioneer in literature analysis. In the 1980s, he studied the distribution of references in scientific papers and found that, on occasion, studies on two separate research topics would have few references between them, but would refer to a common, third, set of papers. He conjectured that might indicate so-far unknown links between the separate research topics.
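The pattern Swanson looked for can be sketched in a few lines. This is only a toy illustration, not Swanson's actual procedure: the paper IDs, the topic labels, and the `shared_but_unlinked` helper are all invented for the example.

```python
# Toy citation data (hypothetical): paper -> set of papers it cites.
cites = {
    "A1": {"C1", "C2"},
    "A2": {"C2", "C3"},
    "B1": {"C2", "C3"},
    "B2": {"C1", "C3", "B1"},
}
# Hypothetical topic label for each paper in the two literatures.
topic = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}


def shared_but_unlinked(cites, topic, t1, t2):
    """Return (direct citations between the two topics, shared outside references)."""
    p1 = {p for p, t in topic.items() if t == t1}
    p2 = {p for p, t in topic.items() if t == t2}
    # Direct links: a paper in one topic citing a paper in the other.
    direct = sum(len(cites[p] & p2) for p in p1) + sum(len(cites[p] & p1) for p in p2)
    # References of each topic that lie outside both topics.
    refs1 = set().union(*(cites[p] for p in p1)) - p1 - p2
    refs2 = set().union(*(cites[p] for p in p2)) - p1 - p2
    return direct, refs1 & refs2


direct, shared = shared_but_unlinked(cites, topic, "A", "B")
print(direct, sorted(shared))
```

Few direct links combined with a large shared set is the signal: the two literatures lean on the same third body of work without talking to each other.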

Indeed, Swanson found a concrete example of such a link. Already in the 1980s, scientists knew that certain types of fish oils benefit blood composition and blood vessels. So there was one body of literature linking circulatory health to fish oil. They had also found, in another line of research, that patients with Raynaud’s disease do better if their circulatory health improves. This led Swanson to conjecture that patients with Raynaud’s disease could benefit from fish oil. In 1993, a clinical trial demonstrated that this hypothesis was correct.

You may find this rather obvious. I would agree it’s not a groundbreaking insight, but this isn’t the point. The point is that the scientific community missed this obvious insight. It was right there, in front of their eyes, but no one noticed.

30 years after Swanson’s seminal paper, we have more data than ever about scientific publications. And just the other week, Nature published a new example of what you can do with it.

In the new paper, a group of researchers from California studied the materials science literature. Unlike Swanson, they did not look for relations between research studies by using citations. Instead, they did a (much more computationally intensive) word analysis of paper abstracts (not unlike the one we did in our paper). This analysis serves to identify the most relevant words associated with a manuscript, and to find relations between these words.

Previous studies have shown that words, treated as vectors in a high-dimensional space, can be added and subtracted. The most famous example is that the combination “King – Man + Woman” gives a new vector that turns out to be associated with the word “Queen”. In the new paper, the authors report finding similar examples in the materials science literature, such as “ferromagnetic − NiFe + IrMn” which adds together to “antiferromagnetic”.
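To make the vector arithmetic concrete, here is a minimal sketch. The two-dimensional vectors are hand-made for illustration (real embeddings are learned from text and have hundreds of dimensions), and the `analogy` helper is a hypothetical name, not something from the paper.

```python
import math

# Hand-made 2-D toy vectors (hypothetical values, not trained embeddings).
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
vectors = {
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.1),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three input words."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman", vectors)
print(result)
```

Excluding the input words from the candidates is a common convention in such analogy tests; without it, the nearest vector is often one of the inputs.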

Even more remarkable though, they noticed that a number of materials whose names are close to the word “thermoelectric” were never actually mentioned together with the word “thermoelectric” in any paper’s abstract. This suggests, so the authors claim, that these materials may be thermoelectric, but so-far no one has noticed.
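As a rough illustration of that idea (not the authors' code), one could rank materials by how similar their vectors are to the vector for “thermoelectric” and drop those already mentioned alongside the word. The embeddings, material names, and the `rank_candidates` helper below are all placeholders assumed for the sketch.

```python
import math

# Hypothetical toy embeddings; real ones would be learned from abstracts.
embeddings = {
    "thermoelectric": (0.9, 0.1, 0.2),
    "known_material": (0.8, 0.2, 0.1),
    "candidate_1": (0.7, 0.1, 0.3),
    "candidate_2": (0.0, 0.9, 0.1),
}
# Materials assumed to already co-occur with the keyword in some abstract.
already_co_mentioned = {"known_material"}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def rank_candidates(keyword, embeddings, known):
    """Rank materials never co-mentioned with the keyword by similarity to it."""
    scores = {
        name: cosine(vec, embeddings[keyword])
        for name, vec in embeddings.items()
        if name != keyword and name not in known
    }
    return sorted(scores, key=scores.get, reverse=True)


ranked = rank_candidates("thermoelectric", embeddings, already_co_mentioned)
print(ranked)
```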

They have tested how well this works by making back-dated predictions for the discovery of new thermoelectric materials using only papers published until one of the years between 2001 and 2018. For each of these historical datasets, they used the relations between words in the abstracts to predict the 50 thermoelectric materials most likely to be found in the future. And it worked! In the five years after the historical data-cut, the identified materials were on average eight times more likely to be studied as thermoelectrics than were randomly chosen unstudied materials. The authors have now also made real predictions for new thermoelectric materials. We will see in the coming years how those pan out.
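The “times more likely” figure is a ratio of two hit rates. A minimal sketch of that arithmetic, with made-up placeholder data (the names, counts, and the `hit_rate` helper are assumptions for illustration, not numbers from the paper):

```python
def hit_rate(materials, studied_later):
    """Fraction of the given materials that were later studied as thermoelectrics."""
    return sum(m in studied_later for m in materials) / len(materials)


# Made-up toy data: names and outcomes are placeholders, not results from the paper.
predicted = [f"pred_{i}" for i in range(10)]      # top-ranked predictions at the cut-off year
random_pick = [f"rand_{i}" for i in range(10)]    # randomly chosen unstudied materials
studied_later = {"pred_0", "pred_3", "pred_5", "pred_8", "rand_2"}

# How much more often predictions were taken up than the random baseline.
enrichment = hit_rate(predicted, studied_later) / hit_rate(random_pick, studied_later)
print(f"enrichment over random baseline: {enrichment:.1f}x")
```

In a real back-test this would be repeated for each cut-off year and averaged, and the baseline would need at least one hit to avoid dividing by zero.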

I think that analyses like this have a lot of potential. Indeed, one of the things that keeps me up at night is the possibility that we might already have all the knowledge necessary to make progress in the foundations of physics, we just haven’t connected the dots. Smart tools to help scientists decide what papers to pay attention to could greatly aid knowledge discovery.

39 comments:

Éric ANGELINI 8:22 AM, August 01, 2019
Hello Sabine,
thank you for the time you spend on this blog, I always come back to it knowing I will learn something. I was wondering — is it good or bad that the papers scanned and searched by those teams are written in "globish" vs "good" English? "Good" English might express nuances and "globish" English might go straight to the point. But this is perhaps irrelevant, the point being to find (new) correlations between words, thus between scientific domains.
(A last comment: your "Cassandra" song is a splendid earworm — be blessed for that!-)

JoshP 8:30 AM, August 01, 2019
Sleep well. It's only a matter of a few years before AI solves this problem.

Glenk 10:29 AM, August 01, 2019
I am so happy to have stumbled across your blog. There are things I know about, things I don't know about, and best of all, here I find the things I didn't know that I didn't know about!

TheLambLiesDownOnBroadway 2:04 PM, August 01, 2019
Sabine,
It's "materialS science"

Kevin S. Van Horn 2:44 PM, August 01, 2019
JoshP, what do you think AI is? These kinds of algorithms Sabine is discussing are in fact fairly typical AI algorithms.

jpvillaseca 2:56 PM, August 01, 2019
Thank you for your articles in philosophy of science. I love the subject of how discoveries "occur". Having innate knowledge surfaced by these embeddings opens up the playing field for many new contributions.

Lawrence Crowell 8:09 PM, August 01, 2019
Special relativity was similar. This was a piece of low-hanging fruit that had been there since Maxwell laid down electromagnetic field theory. However, for 40 years you had bright people, Lorentz, Mach, Poincaré, and others, who missed seeing this. It has occurred to me we may be in a similar situation.

mh 9:11 PM, August 01, 2019
Bingo! This research is brilliant and the best science I have read about in a long time! With the old scientific discourse of debate and crony funding slipping away as ineffective methods for improving knowledge, data mining takes over and shows the potential to connect the dots on new scientific discoveries. I can't wait to see what this produces next..

Gokul Gopisetti 10:34 PM, August 01, 2019
What is thinking? Let us play J Krishnamurti's game of association. I say "rose" and there is a train of thought and it happens so quickly. Firstly, I think of a rose (this is dot1 or link1) and then I think of a thorn (dot2 or link2), then a thorn prick (dot3 or link3), then oozing blood (dot4 or link4), then nurse (dot5 or link5), then bandage (dot6 or link6). (You will have your own series of dots or links; watch the whole train.) If you observe, thinking is a series of associated or related dots or links; thinking is the act of jumping from one link to another associated link or dot, thereby connecting the dots or linking the dots to form a series of associations; this jumping is what I call "hyperlinking", akin to hyperlinking in computer science. If you observe your browsing history of a single session, you will see that you jumped or hyperlinked from one link to another "similar" or "associated" link, making a history or series of associated links wherein the last link may be a very remote association to the link you started with. Thought history or thinking is similar to our browsing history of a single session. Going into "thinking" further, we see that hyperlinking is actually "recognition", i.e., placing what you see with what you know: if you see a rose, you place that image against what you know and recognize it as a rose; if you cannot place it, match it with what you know, you say you don't know. If not for hyperlinking or recognition we cannot find our way home, nor can an animal find its way through its territory. This hyperlinking or recognition mechanism in animals evolved into human thinking. That is, thinking in humans is an accentuated, heightened recognition mechanism.

Let us take Discovery. There is a jigsaw puzzle, and you have almost solved the puzzle but for a missing piece. The missing piece is not in the field of the known. Then observation begins, the flame of the question is alive, and both consciously and mostly unconsciously you are sensitive to the question. Since the mind, the brain is alive, sensitive to the question, the question generates the energy and the flame flickers relentlessly. That is, you are observing and gathering information, mostly unconsciously, about things relevant to the missing piece or the question. Then one day when, unconsciously, the information, the missing piece, is almost there, after all this relevant gathering, you see a semblance of the information or missing piece and you extrapolate by placing this vision against all that you have gathered unconsciously or consciously relevant to the question and eureka! The missing piece falls into its place as an act of recognition or extrapolated recognition. Then there is what Lyall Watson in his book Beyond Supernature quotes as "the shock of recognition". This we call discovery. Discovery is possible only because of the heightened recognition mechanism in man. Examples: Archimedes's Eureka, Kekulé's dream or Ramanujan's divine intimations; both the latter are actually intimations based on their background; if Ramanujan had been a Catholic, then Mother Mary would have given the intimations.

First Name Surname 4:58 AM, August 02, 2019
If Time is the 4th dimension ... and processing language cells as complex vectors in n-dimensional abstract topologies leads to Unveiling Nature's "brand new" facts/events ...

Is not that mechanism for "facts production" giving hints about the possibility of non-local domains linked to The Block Universe ( on Itself a Non-local Entity by compressing all its phases as a single 4th dimensional entity )

Therefore, "Truthful Knowledge" becomes The 5th Dimension ... a non-local/non-temporal dimension that permeates every physical event but can't be observed and/or registered by dynamic local Systems ... but Those Contingent Systems are linked to the Non-local/non-temporal domain ...

In short words, Platon 3.0 ... or "The Return of The Deductivist ( Jedi ) " ...

The "Information Era", extreme statistical inductivism leading the culture into a Deductive Event Horizon ... or The transition from a Pre-Truth's Civilization into a Truth's Civilization ...

... Of Course, a Truth's Civilization could be a never locally manifested condition ... like an Eternal Future becoming its own Past ...

Whatever, but It seems that The Apes can not Rip to pursue Truth as the Unique Path for Transcendence.

Philip Thrift 6:26 AM, August 02, 2019
One can also "mine" structured text (other than natural language) in source documents: mathematical TeX (e.g., theories written as mathematical equations), chemical TeX (mhchem, chemfig), programs (programming languages), etc.

Éric ANGELINI 7:26 AM, August 02, 2019
I remember this math BS generator (this was 10 years ago, or so): http://thatsmathematics.com/mathgen/

mh 10:36 AM, August 02, 2019
There are several drivers that are augmenting this effort. NLP and AI classifiers are being used for mining clinical notes to improve healthcare analytics. As these tools improve, so will the rate of article-mining discoveries.

Olav Thorsen 10:53 AM, August 02, 2019
I think a bigger problem today is the peer reviewing process. Physics is a place of many dogmas. The basis of theoretical physics and applied physics within very similar fields can be very different. We need a peer reviewing process where widely different ideas can live side by side. Today's applied plasma physics is rather incompatible with the theories of relativity.

A good example is the application of the Sagnac effect which goes against the assumption of constant light speed in relation to all reference frames, which is a requisite in the "special theory of relativity".

Michelson and Morley failed to measure the equivalent of a Sagnac effect with their interferometer. Sagnac, however, measured the effect now named after him, which is a speed of light relative to a rotating reference system in relation to the surface of the Earth. And GPS measures a Sagnac effect with the Earth being a rotating reference system in relation to the Sun.

The Sagnac effect proves that light moves in relation to a dominating reference system, an inequality between observers that is not permitted by the special theory of relativity.

CScurlock 12:41 PM, August 02, 2019
One of the blessings (and curses) of my own intellectual life is my interest and research in "everything." What has gone missing in science as in most disciplines is the training to be generalists. It is generalists who see the cross-discipline patterns and the connections between physics and biology and chemistry and consciousness, for that matter. I think you are right and we need to back up and look at what we already know with fresh, generalist eyes.

JM 1:05 PM, August 02, 2019
interesting to see this one make the rounds - here's a similar post from a Chemist: https://blogs.sciencemag.org/pipeline/archives/2019/07/15/machine-mining-the-literature

as a chemist who avoided physics because of the math, whilst liking physics for the concepts, I really enjoy your blog - you really do manage to put the math into words that I can understand :-)

Axil 2:27 PM, August 02, 2019
Connecting the dots requires a goal. In examining a generalized database, that goal is epitomized in a search criterion and an associated range. Without a goal in a search, that search will wander off and explode into an infinity of disjointed correlations. But when constrained by a goal, the search process can discover new unrelated correlations that can be stored to enhance the next search done to meet another goal.

An optimization or a minimization process is oftentimes required to meet a goal.

Suppose a search discovers that fish oils benefit blood composition and blood vessels, a correlation unrelated to the goal of that search. The correlation is nonetheless stored as a new connection between seemingly unrelated dots. In the next search, the fish oil correlation becomes meaningful in the “Raynaud’s disease” search.

An optimization strategy addresses the need to find the best treatment for “Raynaud’s disease”.
Assuming a large storage capability, a disjointed correlation process could run as a background task when the search mechanism is idle.

The key to success here is to come up with a generalized representation of a ‘correlation’ that spans multiple fields of study: physics, chemistry, medicine, astrophysics, and so on. A search such as: “how best to shield an astronaut from a solar flare under a given weight constraint” bears upon a successful application of a general approach to searching involving correlation representations and storage under maxima/minima.

antooneo 3:17 PM, August 02, 2019
Olav:

The Sagnac effect is a bit more complicated, but your conclusion is surely correct.

The Sagnac effect is proven daily; how? The so-called laser gyroscope uses this effect for navigation. Almost every plane these days uses it and every plane which reaches its destination on a long range flight can be said to have disproven Einstein.

But why more complicated? The observer in the Sagnac system (co-moving) is in a rotating system. So Einsteinians argue that special relativity is not about rotating systems. However there is a solution for this. You can increase the diameter of the Sagnac system further and further in a way that the speed on the surface of the rotating device remains constant. Here the conflict between Einstein and Sagnac does not decrease. If now the diameter is increased towards infinity, there is still this stable conflict but the motion of the system goes as close to a straight motion as one wants. So, the conflict is clear.

I have tried to discuss this consideration with professors of relativity. They did not have any objections against it but still go with Einstein’s postulation about speed of light.

Question: what does it say about our physics if discussions are blocked in this way? It’s not only in peer reviews.

Michael John Sarnowski 9:20 PM, August 02, 2019
There are plenty of discoveries! Discoveries must be marketed to be seen.

alanlit 10:35 PM, August 02, 2019
While I totally agree with the idea behind the blog (that ~= syntactical analysis -> interesting semantics) the poster child (king - man + woman = queen) is turning out to be a lot more fragile than first reported. A quick Google search will turn up a fair amount of 'yes, but' detailed analyses.

As is so far always the case with these things there seems to be a need for a human hand on the tiller.

Terry Bollinger 5:16 AM, August 03, 2019
Typo: added an subtracted
====> added and subtracted

Gokul Gopisetti 10:18 AM, August 03, 2019
...such a network of connected dots or hyperlinks is a neural network; and when one dot or hyperlink is activated in the act of recognition then a portion of the network or the entire network is called forth, i.e., it is either a partial recall or a total recall.

Gokul Gopisetti 9:48 PM, August 03, 2019
Let us take Mathew Fisher's statement that superposition is an entanglement of more than one quantum particle. Let's keep that in the background for the moment. We know that waves that impinge on the earth are of different wavelengths, and that visible light has a particular range. But the human eye, the human program, resolves these wavelengths and picks up only visible light. It is the virtue of the human program. The range of the wavelengths picked is just as much as the program allows, right? Obviously, the human program is the observer. Now coming back to Mathew Fisher's superposition. The quantum observer, the detector, in the act of observation or measurement resolves the superposition and picks up only those spins or states based on the detector program or the observer, i.e., just as much as the detector program allows.

Gokul Gopisetti 10:26 PM, August 03, 2019
...I think the catch is in the "resolution of superposition", resolving superposition... resolution is an act of measurement. Going by Mathew Fisher, if quantum particles are entangled as a superposition of states or spins, then when resolved if one is up then what is left is the other which is down by the same act of measurement; the same act of measurement is important here; if one is left then what is left is the other which is right by the same act of measurement.

tytung 12:03 AM, August 05, 2019
Interesting. This sounds more like an automated secondary discovery: discovering hidden links in the primary sources. Perhaps one could develop AI that devises formulas based on existing papers and makes predictions. Maybe quarks could have been discovered in this way if we had the right AI tools back then.

Axil 11:34 PM, August 05, 2019
How about the reliability of data in a paper to be added to the data set? Will the data set contain any data from speculative or even more extreme crackpot papers? Who will filter this information?