Thursday, August 01, 2019

Automated Discovery


In 1986, Don Swanson from the University of Chicago discovered a discovery.

Swanson (who passed away in 2012) was an information scientist and a pioneer in literature analysis. In the 1980s, he studied the distribution of references in scientific papers and found that, on occasion, studies on two separate research topics would have few references between them, but would refer to a common, third, set of papers. He conjectured that this might indicate so far unknown links between the separate research topics.

Indeed, Swanson found a concrete example of such a link. Already in the 1980s, scientists knew that certain types of fish oils benefit blood composition and blood vessels. So there was one body of literature linking circulatory health to fish oil. They had also found, in another line of research, that patients with Raynaud’s disease do better if their circulatory health improves. This led Swanson to conjecture that patients with Raynaud’s disease could benefit from fish oil. In 1993, a clinical trial demonstrated that this hypothesis was correct.

You may find this rather obvious. I would agree it’s not a groundbreaking insight, but this isn’t the point. The point is that the scientific community missed this obvious insight. It was right there, in front of their eyes, but no one noticed.
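The citation pattern Swanson looked for can be written down in a few lines. What follows is a minimal sketch, not Swanson's actual procedure: the function name `hidden_link_candidates`, its thresholds, and the toy corpus are all invented for illustration. It flags pairs of topics whose papers do not cite each other but do cite the same outside papers.

```python
from itertools import combinations


def hidden_link_candidates(papers, max_direct=0, min_shared=2):
    """Find topic pairs that rarely cite each other but share outside references.

    papers maps a paper id to {"topic": str, "cites": set of paper ids}.
    The thresholds are illustrative assumptions, not values from the literature.
    """
    topic_papers = {}  # topic -> ids of the papers on that topic
    topic_cites = {}   # topic -> ids of everything those papers cite
    for paper_id, info in papers.items():
        topic_papers.setdefault(info["topic"], set()).add(paper_id)
        topic_cites.setdefault(info["topic"], set()).update(info["cites"])

    found = []
    for a, b in combinations(sorted(topic_papers), 2):
        # Direct links: papers of one topic that are cited by the other topic.
        direct = (len(topic_cites[a] & topic_papers[b])
                  + len(topic_cites[b] & topic_papers[a]))
        # Common third set: references cited by both topics, belonging to neither.
        shared = (topic_cites[a] & topic_cites[b]) - topic_papers[a] - topic_papers[b]
        if direct <= max_direct and len(shared) >= min_shared:
            found.append((a, b, sorted(shared)))
    return found


# Invented toy corpus, for illustration only.
toy_papers = {
    "fish1": {"topic": "fish oil", "cites": {"blood1", "blood2"}},
    "fish2": {"topic": "fish oil", "cites": {"blood2", "fish1"}},
    "ray1": {"topic": "raynaud", "cites": {"blood1", "blood2"}},
    "ray2": {"topic": "raynaud", "cites": {"blood2", "ray1"}},
    "blood1": {"topic": "circulation", "cites": set()},
    "blood2": {"topic": "circulation", "cites": {"blood1"}},
}

candidates = hidden_link_candidates(toy_papers)
print(candidates)
```

In this toy corpus only the fish oil and Raynaud's topics are flagged: neither cites the other, but both cite the same two circulation papers.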

Thirty years after Swanson’s seminal paper, we have more data than ever about scientific publications. And just the other week, Nature published a new example of what you can do with it.

In the new paper, a group of researchers from California studied the materials science literature. Unlike Swanson, they did not look for relations between research studies by using citations; instead, they did a (much more computationally intensive) word-analysis of paper abstracts (not unlike the one we did in our paper). This analysis serves to identify the most relevant words associated with a manuscript, and to find relations between these words.

Previous studies have shown that words, treated as vectors in a high-dimensional space, can be added and subtracted. The most famous example is that the combination “King – Man + Woman” gives a new vector that turns out to be associated with the word “Queen”. In the new paper, the authors report finding similar examples in the materials science literature, such as “ferromagnetic − NiFe + IrMn”, which comes out closest to “antiferromagnetic”.
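To make the vector arithmetic concrete, here is a minimal sketch with hand-made two-dimensional vectors. These are not learned embeddings, and the helper names `cosine` and `analogy` are invented for this example. Real word vectors have hundreds of dimensions and are learned from text; the toy numbers below are chosen so that the arithmetic works out exactly. The input words are excluded from the candidates, which is a common convention in such analogy tests.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words from the candidates.
    others = [w for w in vectors if w not in (a, b, c)]
    return max(others, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors: first axis "royalty", second axis "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "prince": [0.8, 1.0],
    "apple": [-1.0, 0.0],
}

result = analogy(toy_vectors, "king", "man", "woman")
print(result)
```

With these toy vectors the combination lands exactly on the vector for "queen"; with learned embeddings the match is only approximate.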

Even more remarkable, though, they noticed that a number of materials whose names are close to the word “thermoelectric” were never actually mentioned together with the word “thermoelectric” in any paper’s abstract. This suggests, so the authors claim, that these materials may be thermoelectric, but so far no one has noticed.
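The selection step described here can be sketched as follows: rank materials by how close their vector is to the vector for "thermoelectric", after dropping every material that already shares an abstract with that word. This is an illustration under stated assumptions, not the authors' code. The function `rank_unstudied`, the placeholder material names, the vectors, and the abstracts are all invented.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_unstudied(embeddings, abstracts, materials, keyword="thermoelectric", top_n=50):
    """Rank materials never co-mentioned with the keyword by similarity to it.

    abstracts is a list of sets of words, one set per abstract.
    """
    # Materials that already appear in an abstract together with the keyword.
    studied = {m for words in abstracts if keyword in words
               for m in materials if m in words}
    unstudied = [m for m in materials if m not in studied]
    ranked = sorted(unstudied,
                    key=lambda m: cosine(embeddings[m], embeddings[keyword]),
                    reverse=True)
    return ranked[:top_n]


# Invented toy data; the material names are placeholders, not real compounds.
toy_embeddings = {
    "thermoelectric": [1.0, 0.0],
    "material_A": [0.9, 0.1],
    "material_B": [0.7, 0.7],
    "material_C": [0.0, 1.0],
    "material_D": [0.95, 0.05],
}
toy_abstracts = [
    {"material_A", "thermoelectric", "figure", "merit"},
    {"material_B", "band", "gap"},
    {"material_C", "magnetic", "order"},
    {"material_D", "band", "structure"},
]
toy_materials = ["material_A", "material_B", "material_C", "material_D"]

top = rank_unstudied(toy_embeddings, toy_abstracts, toy_materials, top_n=2)
print(top)
```

Here `material_A` is dropped because it already appears next to "thermoelectric" in an abstract, and the remaining materials are ordered by similarity.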

They have tested how well this works by making back-dated predictions for the discovery of new thermoelectric materials using only papers published until one of the years between 2001 and 2018. For each of these historical datasets, they used the relations between words in the abstracts to predict the 50 thermoelectric materials most likely to be found in the future. And it worked! In the five years after the historical data-cut, the identified materials were on average eight times more likely to be studied as thermoelectrics than were randomly chosen unstudied materials. The authors have now also made real predictions for new thermoelectric materials. We will see in the coming years how those pan out.
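The "times more likely" figure in such a back-test is a ratio of two rates: the fraction of predicted materials that were studied later, divided by the fraction of all then-unstudied materials that were studied later. Below is a small sketch of that arithmetic. The function name `enrichment` is made up, and the numbers are invented so that the ratio comes out to eight; they are not taken from the paper.

```python
def enrichment(predicted, pool, later_studied):
    """Ratio of the hit rate among predictions to the base rate in the whole pool."""
    predicted, pool = set(predicted), set(pool)
    hit_rate = len(predicted & later_studied) / len(predicted)
    base_rate = len(pool & later_studied) / len(pool)
    if base_rate == 0:
        raise ValueError("no material in the pool was studied later")
    return hit_rate / base_rate


# Invented numbers: 100 unstudied materials at the data cut, 10 of them predicted.
pool = [f"m{i}" for i in range(100)]
predicted = pool[:10]
# Materials that were studied in the years after the cut (again invented).
later_studied = {"m0", "m1", "m2", "m3", "m50"}

ratio = enrichment(predicted, pool, later_studied)
print(ratio)
```

In this made-up case, 4 of 10 predictions are hits against a base rate of 5 in 100, which gives a ratio of eight.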

I think that analyses like this have a lot of potential. Indeed, one of the things that keeps me up at night is the possibility that we might already have all the knowledge necessary to make progress in the foundations of physics, but just haven’t connected the dots. Smart tools to help scientists decide what papers to pay attention to could greatly aid knowledge discovery.


  1. Hello Sabine,
    thank you for the time you spend on this blog, I always come back to it knowing I will learn something. I was wondering — is it good or bad that the papers scanned and searched by those teams are written in "globish" vs "good" English? "Good" English might express nuances and "globish" English might go straight to the point. But this is perhaps irrelevant, the point being to find (new) correlations between words, thus between scientific domains.
    (A last comment: your "Cassandra" song is a splendid earworm — be blessed for that!-)

    1. Eric,

      Good point. I suspect (but do not know) that it matters that the papers use "academic" language in which words have very precisely defined (and often not colloquial) meaning. (It's actually a problem that across very different disciplines technical terms can have different meanings entirely.)

      Happy to hear you like my song :) :)

  2. Sleep well. It's only a matter of a few years before AI solves this problem.

  3. I am so happy to have stumbled across your blog. There are things I know about, things I don't know about, and best of all, here I find the things I didn't know that I didn't know about!

  4. Replies
    1. Opa,

      Thanks for pointing out, I have corrected this.

  5. JoshP, what do you think AI is? These kinds of algorithms Sabine is discussing are in fact fairly typical AI algorithms.

  6. Thank you for your articles in philosophy of science. I love the subject of how discoveries "occur". Having innate knowledge surfaced by these embeddings opens up the playing field for many new contributions.

  7. Special relativity was similar. This was a piece of low hanging fruit that had been there since Maxwell laid down electromagnetic field theory. However, for 40 years you had bright people, Lorentz, Mach, Poincare, and others, who missed seeing this. It has occurred to me we may be in a similar situation.

    1. I would hazard a guess that one possible form of low hanging fruit lies with all these scalar fields. The Higgs field is quartic, which nonabelian YM fields are as well. We have all these other fields, so far not empirically established, such as inflaton, axion and others, that in effect can change the phase of a system. We may have some sort of general physics of fields that define some "periodic table" of possible phase changes that we currently do not understand.

    2. Yes, but this example is only partly a good one. Because that situation (relativity) was not fully covered by this method.
      1) From Maxwell’s theory it followed that a field contracts in motion. This knowledge existed and was published by Oliver Heaviside. This was known by Lorentz without the existence of computer correlations; however, his conclusion for the PHYSICAL way of contraction was not accepted by the community.
      2) The second element which Lorentz needed for his version of relativity was the existence of molecules and molecule lattices. But for this no accepted literature existed. So his approach was refused by the community. (Now this is accepted knowledge.)

      That was finally the chance for Einstein.

    3. Regarding: "We may have some sort of general physics of fields that define some "periodic table" of possible phase changes that we currently do not understand."

      There may be an unrecognized amalgamation of conditions that could impact the Higgs field that may someday explain some of the unsolved mysteries of science.

    4. Regarding the general physics of fields: I was comparing the values of the isospin and the supercharge of both the electron and the Higgs boson recently. The electron looks like it can be converted into a Higgs boson if the supercharge of the electron is flipped from -1 to +1. Could this flip in the sign of supercharge from negative to positive happen when the electron becomes entangled with the photon or a phonon when it experiences spin-charge separation and possibly becomes a quasiHiggs boson as a quasiparticle if all energy requirements are met?

  8. Bingo! This research is brilliant and the best science I have read about in a long time! With the old scientific discourse of debate and crony funding slipping away as ineffective methods for improving knowledge, data mining takes over and shows the potential to connect the dots on new scientific discoveries. I can't wait to see what this produces next.

  9. What is thinking? Let us play J Krishnamurti's game of association. I say "rose" and there is a train of thought and it happens so quickly. Firstly, I think of a rose (this is dot1 or link1) and then I think of a thorn (dot2 or link2), then a thorn prick (dot3 or link3), then oozing blood (dot4 or link4), then nurse (dot5 or link5), then bandage (dot6 or link6). (You will have your own series of dots or links; watch the whole train.) If you observe, thinking is a series of associated or related dots or links; thinking is the act of jumping from one link to another associated link or dot, thereby connecting the dots or linking the dots to form a series of associations; this jumping is what I call "hyperlinking", akin to hyperlinking in computer science. If you observe your browsing history of a single session, you will see that you jumped or hyperlinked from one link to another "similar" or "associated" link, making a history or series of associated links wherein the last link may be a very remote association to the link you started with. Thought history or thinking is similar to our browsing history of a single session. Going into "thinking" further, we see that hyperlinking is actually "recognition", i.e., placing what you see with what you know: if you see a rose, you place that image against what you know and recognize it as a rose; if you cannot place it, match it with what you know, you say you don't know. If not for hyperlinking or recognition we could not find our way home, nor could an animal find its way through its territory. This hyperlinking or recognition mechanism in animals evolved into human thinking. That is, thinking in humans is an accentuated, heightened recognition mechanism.

    Let us take Discovery. There is a jigsaw puzzle, and you have almost solved the puzzle but for a missing piece. The missing piece is not in the field of the known. Then observation begins, the flame of the question is alive, and both consciously and mostly unconsciously you are sensitive to the question. Since the mind, the brain is alive, sensitive to the question, the question generates the energy and the flame flickers relentlessly. That is, you are observing and gathering information, mostly unconsciously, about things relevant to the missing piece or the question. Then one day when unconsciously, the information, the missing piece, is almost there, after all this relevant gathering, and you see a semblance of the information or missing piece and you extrapolate by placing this vision against all that you have gathered unconsciously or consciously relevant to the question and eureka! the missing piece falls in its place as an act of recognition or extrapolated recognition. Then there is what Lyall Watson in his book Beyond Supernature quotes as "the shock of recognition". This we call discovery. Discovery is possible only because of the heightened recognition mechanism in man. Example: Archimedes's Eureka, Kekulé's dream or Ramanujan's divine intimations; both the latter are actually intimations based on their background; if Ramanujan was a Catholic, then Mother Mary would have given the intimations.

  10. If Time is the 4th dimension ... and processing language cells as complex vectors in n-dimensional abstract topologies leads to Unveiling Nature's "brand new" facts/events ...

    Is not that mechanism for "facts production" giving hints about the possibility of non-local domains linked to The Block Universe ( on Itself a Non-local Entity by compressing all its phases as a single 4th dimensional entity )

    Therefore, "Truthful Knowledge" becomes The 5th Dimension ... a non-local/non-temporal dimension that permeates every physical event but can't be observed and/or registered by dynamic local Systems ... but Those Contingent Systems are linked to the Non-local/non-temporal domain ...

    In short words, Platon 3.0 ... or "The Return of The Deductivist ( Jedi ) " ...

    The "Information Era", extreme statistical inductivism leading the culture into a Deductive Event Horizon ... or The transition from a Pre-Truth's Civilization into a Truth's Civilization ...

    ... Of Course, a Truth's Civilization could be a never locally manifested condition ... like an Eternal Future becoming its own Past ...

    Whatever, but It seems that The Apes can not Rip to pursue Truth as the Unique Path for Transcendence.

  11. One can also "mine" structured text (other than natural language) in source documents: mathematical TeX (e.g., theories written as mathematical equations), chemical TeX (mhchem, chemfig), programs (programming languages), etc.

  12. I remember this math BS generator (this was 10 years ago, or so):

    1. Perl at its best. The coder (Nate Eldredge, University of Northern Colorado) got tenure last year.

  13. There are several drivers that are augmenting this effort. NLP and AI classifiers are being used for mining clinical notes to improve healthcare analytics. As these tools improve, so will the rate of article-mining discoveries.

  14. I think a bigger problem today is the peer reviewing process. Physics is a place of many dogmas. The basis of theoretical physics and applied physics within very similar fields can be very different. We need a peer reviewing process where widely different ideas can live side by side. Today's applied plasma physics is rather non-compatible with the theories of relativity.

    A good example is the application of the Sagnac effect which goes against the assumption of constant light speed in relation to all reference frames, which is a requisite in the "special theory of relativity".

    Michelson and Morley failed to measure the equivalent to a Sagnac effect with their interferometer. Sagnac, however, measured it: a speed of light relative to a reference system that rotates with respect to the surface of the Earth. And GPS measures a Sagnac effect with the Earth being a rotating reference system in relation to the Sun.

    The Sagnac effect proves that light moves in relation to a dominating reference system, an inequality between observers that is not pertained by the special theory of relativity.

  15. One of the blessings (and curses) of my own intellectual life is my interest and research in "everything." What has gone missing in science as in most disciplines is the training to be generalists. It is generalists who see the cross-discipline patterns and the connections between physics and biology and chemistry and consciousness, for that matter. I think you are right and we need to back up and look at what we already know with fresh, generalist eyes.

  16. interesting to see this one make the rounds - here's a similar post from a Chemist:

    as a chemist who avoided physics because of the math, whilst liking physics for the concepts, I really enjoy your blog - you really do manage to put the math into words that I can understand :-)

  17. Connecting the dots requires a goal. In examining a generalized database, that goal is epitomized in a search criterion and an associated range. Without a goal in a search, that search will wander off and explode into an infinity of disjointed correlations. But when constrained by a goal, the search process can discover new unrelated correlations that can be stored to enhance the next search done to meet another goal.

    An optimization or a minimization process is oftentimes required to meet a goal.

    Suppose a search discovers that fish oils benefit blood composition and blood vessels, a correlation unrelated to the goal of that search. This correlation is nonetheless stored as a new connection of seemingly unrelated dots. In the next search, the fish oil correlation becomes meaningful in the “Raynaud’s disease” search.

    An optimization strategy addresses the need to find the best treatment for “Raynaud’s disease”.
    Assuming a large storage capability, a disjointed correlation process could run as a background task when the search mechanism is idle.

    The key to success here is to come up with a generalized representation of a ‘correlation’ that spans multiple fields of study: physics, chemistry, medicine, astrophysics, and so on. A search such as: “how best to shield an astronaut from a solar flare under a given weight constraint” bears upon a successful application of a general approach to searching involving correlation representations and storage under maxima/minima.

  18. Olav:

    The Sagnac effect is a bit more complicated, but your conclusion is surely correct.

    The Sagnac effect is proven daily; how? The so called laser gyroscope uses this effect for navigation. Almost every plane these days uses it and every plane which reaches its destination on a long range flight can be said to have disproven Einstein.

    But why more complicated? The observer in the Sagnac system (co-moving) is in a rotating system. So Einsteinians argue that special relativity is not about rotating systems. However there is a solution for this. You can increase the diameter of the Sagnac system further and further in a way that the speed on the surface of the rotating device remains constant. Here the conflict between Einstein and Sagnac does not decrease. If now the diameter is increased towards infinity, there is still this stable conflict but the motion of the system goes as close to a straight motion as one wants. So, the conflict is clear.

    I have tried to discuss this consideration with professors of relativity. They did not have any objections against it but still go with Einstein’s postulation about speed of light.

    Question: what does it say about our physics if discussions are blocked in this way? It’s not only in peer review.

    1. Hi Antooneo

      Yes, the problem is solved by relativists through the definition of a global reference system which violates the duality of the Lorentz transformation.

      Michelson and Morley searched for the speed of the earth through the aether via a discrepancy in light speed. They found none even though the Earth is a rotating system, and this gave rise to special relativity. Given the relativists' explanation of the Sagnac effect, Michelson and Morley should have measured a difference in a rotating Earth reference. Because they didn't, one can deduce that light moves in relation to a dominating mass and momentum. At the surface of the Earth we measure identical speeds because the Earth's mass, rotational momentum and proximity is totally dominating the Sun and other massive bodies at the surface of the Earth. However, when moving up to our GPS satellites, the conditions from the Earth are not any longer completely dominating the Sun, and therefore the speed of light is slightly skewed towards the Sun as a global reference. And this gives light a different speed in relation to the Earth's rotation.

      And the lies they tell the populace with relativity in relation to GPS. We do not use relativity calculations in order to make GPS work, it is a pure technical calibration process where all the clocks are related to a fixed reference clock and also in relation to each other. They even call it Coordinated Universal Time (UTC), which completely breaks with the spacetime concept of different moving frames of reference existing in a unique slice of spacetime.

      I love electromagnetism, and the "right hand rule" gives a direct relationship between movement, induced force and charge. That is all I need to calculate in the Universe. After all, the conservation of Energy and momentum gives many facets of movement, with vortex-like energy movement existing in two chiralities (left and right hand screw). Positive and negative charge also represent this chirality, after all, that is why we have a "right hand rule".

    2. Hi Olav Torsen

      You cannot compare the Sagnac effect with the MM experiment. The discrepancy with SR is in the case of Sagnac only visible if the travel time of a light signal is followed along a full circumference of the Sagnac set up, not for a portion of it, and this in both directions. In the case of a portion of the circuit, the difference of the travel time in both directions cannot be independently measured because the synchronization of the necessary clocks cannot be independently defined.

      The null result of MM is caused by the fact that the apparatus contracts in motion, which in this case is the motion of the earth. But full rotation of the earth is no factor here.

      The synchronization of the clocks in the GPS satellites can be done from a central source, that is correct, so no time offset has to be taken into account here. However, the time dilation in the clocks is a factor, so to that extent relativity matters.

      But generally speaking we can see that rotation is a case generally not covered by relativity, as Einstein himself has already stated in 1916.

    3. This comment has been removed by the author.

    4. This comment has been removed by the author.

    5. For some reason there is a lot of confusion over the Sagnac effect. It really is very simple. Suppose I have a fiber optical cable along the equator of Earth. I send photons simultaneously in both directions. To get fancy this could be a parametric down shifted pair of identical photons from a single photon passed through a sapphire crystal. One does not need to consider the rotating frame. All one recognizes is that in the time the two photons circle the Earth the Earth has rotated some small angle. Then in an interference experiment these two photons will exhibit wave interference because one photon traveled around the loop with a greater distance than the other. This is analysed from a frame outside the rotating frame.

  19. There are plenty of discoveries! Discoveries must be marketed to be seen.

  20. While I totally agree with the idea behind the blog (that ~= syntactical analysis -> interesting semantics) the poster child (king - man + woman = queen) is turning out to be a lot more fragile than first reported. A quick Google search will turn up a fair amount of 'yes, but' detailed analyses.

    As is so far always the case with these things there seems to be a need for a human hand on the tiller.

  21. Typo: added an subtracted
    ====> added and subtracted

  22. ...such a network of connected dots or hyperlinks is a neural network; and when one dot or hyperlink is activated in the act of recognition then a portion of the network or the entire network is called forth i.e., it is either a partial recall or a total recall.

  23. Let us take Matthew Fisher's statement that superposition is an entanglement of more than one quantum particle. Let's keep that in the background for the moment. We know that waves that impinge on the earth are of different wavelengths, and that visible light has a particular range. But the human eye, the human program resolves these wavelengths and picks up only visible light. It is the virtue of the human program. The range of the wavelengths picked are just as much as the program allows, right? Obviously, the human program is the observer. Now coming back to Matthew Fisher's superposition. The quantum observer, the detector in the act of observation or measurement resolves the superposition and picks up only those spins or states based on the detector program or the observer i.e., just as much as the detector program allows.

  24. ......I think the catch is in the "resolution of superposition", resolving superposition...resolution is an act of measurement. Going by Matthew Fisher, if quantum particles are entangled as a superposition of states or spins, then when resolved if one is up then what is left is the other which is down by the same act of measurement; the same act of measurement is important here; if one is left then what is left is the other which is right by the same act of measurement.

  25. Interesting. This sounds more like an automated secondary discovery - discovering hidden links in the primary sources. Perhaps one could develop AI that devises formulas based on existing papers and makes predictions. Maybe quarks could have been discovered in this way if we had the right AI tools back then.

  26. I think that word embeddings are definitely interesting, and may provide more insights in natural language processing of scientific work than other analyses. And it vaguely reflects how the human brain may work:

    The brain may work in embedding spaces and be able to switch "contexts" (different embedding spaces for different situations). A friend's statement: this is why athletes practice under "game time conditions", so they learn things in the "correct context".

    I think we can make a lot of breakthroughs this way, by finding unnoticed connections. But I also think to go further, maybe such as in the foundations of physics, we may need a more complex representation of knowledge.

    Word embeddings do not build up a full knowledge graph; words and concepts do not take up a single point in some embedding space. Rather, they may take up fuzzy _regions_ of a space (these regions probably do not need to be connected), and can form a hierarchical structure.

  27. How about the reliability of data in a paper to be added to the data set? Will the data set contain any data from speculative or even more extreme crackpot papers? Who will filter this information?

