Thursday, November 07, 2013

Big data meets the eye

Remember when a 20kB image took a minute to load? Back then, when dinosaurs were roaming the earth?

Data has become big.

Today we have more data than ever before, more data in fact than we know how to analyze or even handle. Big data is a big topic. Big data changes the way we do science and the way we think about science. Big data even led Chris Anderson to declare the End of Theory:
“We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
That was 5 years ago. Theory hasn’t ended yet and it’s unlikely to end anytime soon, because there is a slight problem with Anderson’s vision: One still needs an algorithm that is able to find patterns. And for that algorithm, one needs to know what one is looking for to begin with. But pattern-finding algorithms for big data are difficult. One could say they are a science in themselves, so theory had better not end before we have found them.

Those of us working on the phenomenology of quantum gravity would be happy if we had data at all, so I can’t say the big data problem is big on my mind, but I have a story to tell. Alexander Balatsky recently took on a professorship in condensed matter physics at Nordita, and he told me about a previous work of his that illustrates the challenge of big data in physics. It comes with an interesting lesson.

Electron conducting bands in crystals are impossible to calculate analytically except for very simplified approximations. Determining the behavior of electrons in crystals to high accuracy requires three-dimensional many-body calculations of multiple bands and their interactions. It produces a lot of data. Big data.

You can find and download some of that data in the 3D Fermi Surface Database. Let me just show you a random example of Fermi surfaces, this one being for a gold-indium lattice:

The Fermi surface, roughly speaking, tells you how the electrons are packed. Pretty in a nerdy way, but what is the relevant information here?

The particular type of crystal Alexander and his collaborators, Hari Dahal and Athanasios Chantis, were interested in is the so-called non-centrosymmetric crystal, which has a relativistic spin-splitting of the conducting bands. This type of crystal symmetry exists in certain types of semiconductors and metals and plays a role in unconventional superconductivity, which is still a theoretical challenge. Understanding the behavior of electrons in these crystals may hold the key to the production of novel materials.

The many-body, many-bands numerical simulation of the crystals produces a lot of numbers. You pipe them into a file, but now what? What really is it that you are looking for? What is relevant for the superconducting properties of the material? What pattern finding algorithm do you apply?

Let’s see...

Human eyes are remarkable pattern-search algorithms. Image Source.
The human eye, and its software in the visual cortex, is remarkably good at finding patterns, so good in fact that it frequently finds patterns where none exist. And so the big data algorithm is to visualize the data and let humans scrutinize it, giving them the possibility to interact with the data while studying it. This interaction might mean selecting different parameters or different axes, rotating in several dimensions, changing colors or markers, zooming in and out. The hardware for this visualization was provided by the Los Alamos-Sandia Center for Integrated Nanotechnologies, VIZ@CINT; the software, ParaView, is open source. Here, big data meets theory again.
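To give a flavor of this workflow, here is a minimal sketch of the idea: compute a surface over a two-dimensional patch of momentum space and hand it to a 3-d viewer you can rotate and zoom. The data below is synthetic (a toy periodic band, not the actual many-body output), and I use matplotlib as a stand-in for ParaView, which is what the real analysis ran on.

```python
# Sketch: visualizing a band-structure-like surface in 3-d.
# The "energy" here is a made-up periodic toy function; a real
# workflow would load the simulation output into ParaView instead.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless; drop this line for an interactive window
import matplotlib.pyplot as plt

# Synthetic energy surface over a 2-d patch of momentum space
kx, ky = np.meshgrid(np.linspace(-1, 1, 80), np.linspace(-1, 1, 80))
energy = np.cos(np.pi * kx) * np.cos(np.pi * ky)  # toy periodic band

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(kx, ky, energy, cmap="viridis")
ax.set_xlabel("$k_x$")
ax.set_ylabel("$k_y$")
ax.set_zlabel("E")
fig.savefig("band_surface.png")
```

In an interactive session the payoff is exactly the kind of manipulation described above: dragging the view to rotate, zooming into a region, or recoloring by a different quantity, none of which a static 2-d cut gives you.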

Intrigued about how this works in practice, I talked to Hari and Athanasios the other day. Athanasios recalls:
“I was looking at the data before in conventional ways, [producing 2-dimensional cuts in the parameter space], and missed it. But in the 3-d visualization I immediately saw it. It took like 5 minutes. I looked at it and thought “Wow”. To see this in conventional ways, even if I had known what to look for, I would have had to do hundreds of plots.”
The irony being that I had no idea what he was talking about, because all I had to look at was (a crappy print of) a 2-dimensional projection. “Yes,” Athanasios says, “It’s in the nature of the problem. It cannot be translated into paper.”

So I’ll give it a try, but don’t be disappointed if you don’t see too much in the image, because that’s the raison d’être for interactive data visualization software.

3-d bandstructure of GaAs. Image credits: Athanasios Chantis.

The two horizontal axes in the figure show the momentum space of the electrons in the directions away from the high-symmetry direction of the crystal. It has a periodic symmetry, so you’re actually seeing the same patch four times, and in the atomic lattice this pattern goes on to repeat. In the vertical direction, two different functions are shown simultaneously. One is depicted in the height profile, whose color code you see on the left, and shows the energy of the electrons. The other function, shown (rescaled) in the colored bullets, is the spin-splitting of three different conduction bands; you see them in (bright) red, white and pink. Towards the middle of the front, note the white band getting close to the pink one. They don’t cross, but instead seem to repel and move apart again. This is called an anti-crossing.

The relevant feature in the data, the one that’s hard if not impossible to see in two-dimensional projections, is that the energy peaks coincide with the location of these anti-crossings. This property of the conducting bands, caused by the spin-splitting in this type of non-centrosymmetric crystals, affects how electrons travel through the crystal, and in particular it affects how electrons can form pairs. Because of this, materials with an atomic lattice of this symmetry (or rather, absence of symmetry) should be unconventional superconductors. This theoretical prediction has since been tested experimentally by two independent groups. Both groups observed signs of unconventional pairing, confirming a strong connection between noncentrosymmetry and unconventional superconductivity.
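The anti-crossing itself is easy to illustrate with the standard textbook two-level model (this is a generic toy model, not the actual many-body band calculation): two bands that would cross are pushed apart by a coupling term, here playing the role of the spin-splitting interaction, so the gap between them never closes.

```python
# Toy two-level model of an avoided crossing (anti-crossing).
# eps1 and eps2 would cross at k = 0; the coupling delta
# (an assumed value) pushes the eigenvalues apart.
import numpy as np

k = np.linspace(-1, 1, 201)
eps1 = k           # band 1, rising
eps2 = -k          # band 2, falling
delta = 0.1        # coupling strength (assumed)

# Eigenvalues of the 2x2 Hamiltonian [[eps1, delta], [delta, eps2]]
mean = 0.5 * (eps1 + eps2)
half_gap = np.sqrt((0.5 * (eps1 - eps2)) ** 2 + delta ** 2)
upper, lower = mean + half_gap, mean - half_gap

# The bands come closest where the uncoupled bands would have crossed,
# but their separation never drops below 2*delta:
i = np.argmin(upper - lower)
print(k[i], (upper - lower)[i])
```

The minimum separation sits at the would-be crossing point and equals twice the coupling, which is why the location of such anti-crossings carries physical information about the spin-splitting.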

This isn’t the only dataset that Hari studied by way of interactive visualization, and not the only case where it wasn’t merely helpful but necessary for extracting the scientific information. Another example is his analysis of a data set on the composition of the tip of a scanning tunneling microscope, as well as a few other projects he has worked on.

And so it looks to me that, at least for now, the best pattern-finding algorithm for these big data sets is the eye of a trained theoretical physicist. Reports of the death of theory, it seems, have been greatly exaggerated.


  1. I assume the scientific community saw this.
    Any application to your topic above? Does one still need to know a bit about what one is looking for before you can decide how to present the data to be visualized in 3D space?

  2. Nice article (in fact I have enjoyed several of your latest articles). Although I am an experimentalist, not a theorist, it is obvious that we need theorists, and the proposition that theory is little more than finding patterns in sets of numbers is absurd. The laws of physics as we learn them in school are the real point of physics and they tell us much more than "there is a pattern in a specific data set."

    Anyway, the fact that modern and capable visualization tools play such an important role in some theoretical research is intriguing. I don't think this is the case for particle physics, but could it be?

  3. One-Loop Calculations with BlackHat

    Just write a better algorithm for a search feature with increased data like they did above?

    Data transfer is big money, and most people have been sucked in by its metered use? If you want imaging and you like to watch movies maybe the transfer rate can be accommodated? :)


  4. Okay last post.:)

    "That his family could watch his dissertation defense over streaming media illustrated to him the usefulness of visualization: “What we are discovering about DNA is easy to grasp when you can see it,” says Freeman. “We can now fully illustrate, through computation, how DNA interacts with other entities such as proteins within a cell. We can show the public what our research looks like.” Freeman pointed out that what they discover through computation (in silico), they always confirm in the real world (in vivo) – but he says that computation should matter to the general public because it enables researchers to study a wide range of interactions between key biological molecules in an inexpensive manner. It speeds up drug discovery. Thus, funding science is extremely valuable to everyone." Using the OSG to simulate DNA-protein interaction

  5. The simple solution: an algorithm that emulates pareidolia. Sounds like a neural net job. What content does a non-random image possess?

    " non-centrosymmetric crystals which have a relativistic spin-splitting of the conducting bands." Heavy atom semiconductor tellurium crystallizes in enantiomorphic space groups P3(1,2)21. Piezoelectric crystals are non-centrosymmetric insulators (though they may have mirror planes).
    The power of seeing stuff
    The power of knowing stuff

  6. Nice Batman Curve. Sure beats the almost Superman curve you get with a weak acid titration.

  7. Dear Bee,

    You state: "One still needs the algorithm that is able to find patterns. And for that algorithm, one needs to know what one is looking for to begin with." ... Not necessarily. First there is Occam's razor, which tells us that we should look for the simplest possible model. Then there is Balasubramanian's work relating Occam's razor and statistical mechanics. Essentially, one can construct a partition function describing the dynamics of "agents" which explore some parameter space and eventually settle down when they find a minimum, which often corresponds to the simplest possible model.

    Even more interestingly, very recently Jonathan Heckmann has argued that one can derive string theory from a similar line of reasoning involving agents exploring a parameter space - which happens to correspond to the target spacetime of a string. The agents correspond to the points on the string.

    Of course, that does not do away entirely with the need for some sort of intelligent inferential systems - such as humans - to make sense of big data, but it does bolster Chris Anderson's claim.

    Fascinating article nevertheless.



  8. Harbles,

    That is pretty damn cool :) I think though the data manipulation that they worked with was less fancy. But, yeah, I guess that's the future. I mean, we've all gotten pretty used to zooming with our thumb and index finger, no? I'm always pissed off though I can't do it with the index and middle finger instead. Best,


  9. Muon,

    Well, particle physics illustrates the problem with the 'end of theory' idea. If you have no theory, you don't know what to look for in the data. I mean, you can generically look for deviations from predictions of the theory you have, but that only gets you so far. Typically you'll have to look for some signature that fulfills different requirements to get a significant result, and the different requirements necessitate that you have a theory that tells you they should occur together. To drive it to an extreme: Imagine you had all the LHC data and no theory at all. Do you think we'd be able to arrive at correlation tables equally good as the standard model? The answer is almost certainly no. Even if you could correctly extract all correlations, it would be a terribly slow and clumsy operation, and I can't for the hell of it see how it would be good enough to come up with higher order corrections that might be testable in the future.

    (All this btw is not to say that it is not possible in principle, just that it isn't possible in practice, not now and not any time soon.) Best,


  10. h-Index studies raise the h-indices of studiers. Do orbits earn Frequent Flyer miles?

    A single failed reaction is a setback, a million failed reactions are a combinatorial library. Whatever its virtues, Big Data will "mature" into fashionable snipe hunts - then to be studied. Science was once performed, not administered. 1) Collect frightening savants, dump them in back rooms, toss in raw meat. 2) Return to your desk, put your head in your hands, sweat blood. 3) Miracles occur. 4) Go in back, hose off the responsible lab apes, confiscate the wonders.

    Science is now defined as formalized mediocrity plus retaliation by codes of conduct. The future was meant to be dangerous. A doormat has a much larger surface area than an ice pick. So what?

  11. The role of reality in science: The universe's most powerful enabling tool is not knowledge or understanding but imagination because it extends the reality of one's environment.

  12. John Horgan has an interesting post at SciAm blogs about big data and science:

    "[I]n this post I’ll suggest that Big Data might be harming science, by luring smart young people away from the pursuit of scientific truth and toward the pursuit of profits."

    Are “Big Data” Sucking Scientific Talent into Big Business?

    (And I always forget that data is a plural.)

  13. Chris Anderson prefaces his Wired article with statistician George Box's pithy saying "All models are wrong, but some are useful" - which is true enough. But he should have thought about it a bit more - he might then have discovered the next level of that idea, which may be: "And some models are actually absurd", which nicely describes Anderson's claim about "The End of Theory"


