Sunday, March 30, 2008

Lost in Translation

Ever wondered what happens if you use an algorithm to translate a text back and forth to French, German, Italian, Portuguese, and Spanish? You can do this with the super-babelizer Lost in Translation. As an example, here is the translation of the first paragraph of 0801.3471 (for no specific reason).

The original:

Classical General Relativity in more than four spacetime dimensions has been the subject of increasing attention in recent years. Among the reasons why it should be interesting to study this extension of Einstein’s theory, and in particular its black hole solutions, we may mention that

  • String theory contains gravity and requires more than four dimensions. In fact, the first successful statistical counting of black hole entropy in string theory was performed for a fivedimensional black hole. This example provides the best laboratory for the microscopic string theory of black holes.

  • The AdS/CFT correspondence relates the properties of a d-dimensional black hole with those of a quantum field theory in d − 1 dimensions.

  • The production of higher-dimensional black holes in future colliders becomes a conceivable possibility in scenarios involving large extra dimensions and TeV-scale gravity.

  • As mathematical objects, black hole spacetimes are among the most important Lorentzian Ricci-flat manifolds in any dimension.


And the translation:

Traditional general of relativity in more than four masses of that the time of the space was the subject of the increase attention these the slipped years. To the relations of transformation, so that he had that to being interesting, to this extension of the theory of Einstein to study and in the detail of the relative solutions to perforate black color, that we can we mentioned this
  • The theory of the series of the characters will count the force of the gravity and it more has the necessity of the one of mass four. They executed the first guessed right statistical client of the entropy of the black color that really perforates in the theory of the series of the characters the end to perforate the black color of the fivedimensional. This better example releases the laboratory available for the microscopic theory of the series of the characters of the black color of the perforations.

  • The correspondence of AdS/CFT connects the characteristics of a D dimensional schwarzen that the sacadores with those with a theory of the zone of the section of the time in the D without mass 1.

  • The production of the perforations that the high-dimensional-black color in her the future transforms of colliders inside the great possibilities imaginable ones into the writing of the suggestion adds of the film and in the fairs of TeV the force of the gravity.

  • As matemati of the messages those we belong spacetimes of the black color that the sacadores to the tubes the greatest piece of the important Stocherkaehne I gave curly Lorentzian in each possible measurement.



I hope that clarifies everything.

Makes me wonder why there is no requirement these translation maps be invertible.

26 comments:

Anonymous said...

Because the different languages are not isomorphic.

Bee said...

Not if you translate word-wise, no. But if you consider strings of words, context, etc., shouldn't it at least come close to being invertible? I mean, the above is an almost complete loss of information within 5 iterations (except possibly for names or other words that weren't translated). I don't find this very convincing.

William said...

Interesting that "string" becomes "series of the characters". Seems like an unfortunate side-effect of the fact that software is written by computer programmers.

Bee said...

Yes. One would think it can't be so difficult to check the next word after 'string', which is 'theory' and pick the correct translation?
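A minimal sketch of that idea, assuming a hypothetical two-word phrase table that is consulted before the single-word dictionary. The table entries and the function name `translate_with_context` are made up for illustration; this is not how the babelizer or any real translator is implemented.

```python
# Toy English-to-French tables (illustrative entries only).
PHRASES = {
    ("string", "theory"): "théorie des cordes",
    ("ascii", "string"): "chaîne ASCII",
}
WORDS = {
    "string": "ficelle",
    "theory": "théorie",
    "ascii": "ASCII",
}


def translate_with_context(text):
    """Translate left to right, preferring two-word phrases over single words."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            # The neighbouring word settles which sense of "string" is meant.
            out.append(PHRASES[pair])
            i += 2
        else:
            # Unknown words pass through untranslated.
            out.append(WORDS.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)


print(translate_with_context("string theory"))
print(translate_with_context("string"))
```

Looking one word ahead is enough for this toy case; longer-range context would need a larger window or a different approach.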

Skavookie said...

The phrase "series of the characters" has a completely different meaning in mathematics (vs computer sci), and is definitely related to string theory, providing an alternative explanation, although not a very plausible one.

As for the "homology" of languages, the problem is far more complex than one might think. We are far from having an adequate model of syntax, let alone semantics, even for European languages, and even if we did, I would not expect languages to be even close to isomorphic. An interesting classic example is color: different languages classify colors differently, to the extent that native speakers of one language are unable to distinguish between colors that a native speaker of another language would. Even at a high level, languages are not even quasi-isomorphic.

And yes, I am both a mathematician and linguist.

Bee said...

See also Disambiguation

Didn't I read somewhere that the study which claimed color-distinguishing is encoded in the language was either faulty or doubtful?

Phil Warnell said...

Hi Bee,

This is exactly why in the past scholars only wrote and published their writings in Latin. The laws of gravity and motion are thus known as “Principia Mathematica”. I would contend that the natural solution lies not with algorithms, but rather with standardizing human language. However, as with most things human there is little chance of this happening. When I was in high school, Latin was still required to be studied for at least two years, although I must admit that won’t take one very far :-)

Best,

Phil

Skavookie said...

http://en.wikipedia.org/wiki/Character_theory

A recent article about color perception (yah, I know, Wired is not exactly a scholarly source - but the comments are often amusing in a sad way!): http://blog.wired.com/wiredscience/2008/03/babies-see-pure.html

Perhaps my choice of the color example was not the best, as it is slightly controversial (but only slightly). The idea that we perceive reality through the filter of language is an old idea, and the color example has been revisited many times, with the same results.

Frank said...

I remember playing around with this precise thing and having it work away at Shakespeare sonnets, etc. when online translators first became available.

Some musings:

You agree that it is obvious that a translation based on words can not be isomorphic. Translations are one to many depending heavily on context. But then what should be the unit that is isomorphically mapped? In many ways language is engineered for fault tolerance, that means that whatever "unit of meaning" you come up with that should be isomorphically mapped between languages it would by necessity be encoded redundantly in many subtle relationships within a text (and even outside the text itself).

You know this yourself I'm sure, from translating texts between English and German, the problem doesn't factorize. Sometimes you need to take the whole sentence or even the whole paragraph and basically completely re express it.

But then language is ambiguous due to the redundancy mentioned above so there will be many different texts that you would classify as equally good translations of one particular text, so even at the level of the whole text it is not clear that it can be isomorphic.

And a computer starting to build a translation from the one-to-many maps of a dictionary and trying to choose according to context will obviously lose all the coherence between the words first, because it is not really translating this structure to begin with; it is only expressed in the dictionary choices made. Because this structure is so incredibly hard to model, it seems natural that it's lost very quickly.
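To make the one-to-many point concrete, here is a minimal sketch with a made-up toy dictionary. The entries and the function names are assumptions for illustration only. A word-level map can be inverted uniquely only if no two source words share a target word, and the check below lists the targets where that fails.

```python
from collections import defaultdict

# Toy English-to-German dictionary; entries are illustrative assumptions.
EN_TO_DE = {
    "hole": "Bohrung",
    "drilling": "Bohrung",
    "dimension": "Maß",
    "measure": "Maß",
    "theory": "Theorie",
}


def preimages(forward):
    """Group source words by the target word they map to."""
    back = defaultdict(list)
    for src, dst in forward.items():
        back[dst].append(src)
    return dict(back)


def ambiguous_targets(forward):
    """Return target words with more than one source word (no unique inverse)."""
    return {dst: srcs for dst, srcs in preimages(forward).items() if len(srcs) > 1}


print(ambiguous_targets(EN_TO_DE))
```

Any target word that shows up here forces the back-translation to guess, which is where a round trip can start to drift.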

stefan said...

Dear Bee,

that's a cool toy :-)... I've played around a bit with the first paragraph of your text, and I have the impression that the quality of the translations to the different languages involved is quite uneven.

For example, the first translation to French seems very reasonable to me (apart from the "chaîne de caractères", which didn't make sense to me at all - thanks for the hint about computer science), and the backconversion to English is still quite understandable.

But the second translation, from still quite reasonable English to German, is awful: "in more than four dimensions of spacetime" becomes "in mehr als vier Maßen von spacetime", and "Among the reasons for which it should be interesting to study this extension of the theory of Einstein, and in particular its solutions of black hole" is converted to "Unter den Gründen, aus denen es interessant sein sollte, diese Extension der Theorie von Einstein zu studieren und insbesondere seinen Lösungen der schwarzen Bohrung," ... it seems that a lot of sense gets lost in the step English-German and back ;-). And the translation to Italian then introduces further confusion ("space time" → "tempo dello spazio" → "time of the space")... It would be interesting to see the performance of the to and fro translation starting with the same text for the five languages separately.

The context when a string is a "chain of characters" and a, hum, "string" may still be possible to figure out automatically, and perhaps also when a hole is a hole or a drilling - but I always think about "field theory", where the to and fro via "Feldtheorie" or "Körpertheorie" may result in funny errors, and where the right context may be much harder to establish automatically ;-)

Best, Stefan

Bee said...

Hi Phil,

Problem is I receive a lot of papers I am supposed to referee which sound like the second version of above example rather than the first. If that's the way English will be 'standardized' we will indeed end up like the tower of Babel. Latin might maybe be better in the sense that it's more precise when it comes to grammatical constructions, but it's also much more complicated to learn. Say about English what you want, at least it's easy (if one neglects the pronunciation problem). Best,

B.

Phil Warnell said...

Hi Bee,

“Problem is I receive a lot of papers I am supposed to referee which sound like the second version of above example rather than the first.”

One day your honesty is going to have the better of you :-) Now can you imagine what it’s like when a novice such as I tries to understand what they contain? Of course I’m spared what it looks like before the referees have had their effect. On the other hand many scientists wonder why they are so misunderstood by the general public. I can tell you based humbly as one who has held science as an important hobby all of my life, there is nothing to wonder about.

“Latin might maybe be better in the sense that it's more precise when it comes to grammatical constructions, but it's also much more complicated to learn. Say about English what you want, at least it's easy (if one neglects the pronunciation problem).”

First, I was not proposing we return to a dead language to solve the problem, and yet I am somewhat surprised that you have such a high regard for English. Not that I would object, for it would spare me from being further left in the dust than I currently feel I am, yet I would do my best to learn anything that had been mandated and enforced to be a standard, even Latin. Until then the scientist can only believe:

“Mathematics est verum”

Regards,

Phil

Bee said...

Well, yeah, don't misunderstand me. Not being a native speaker myself I understand that it is difficult to argue in a foreign language, but there are just limits to what one can guess together from the equations that are provided, even with the best intentions.

Phil Warnell said...

Hi Bee,

One should not be left to believe that your English serves to be in any way a handicap. Trust me, as being one outside the halls of academia, your comprehension and communication far exceeds that of more than just the average native’s ability (including my own). When it comes to understanding, I’m afraid the only form of communication currently more incomprehensible than science is that of diplomacy. I am then glad that you and Stefan not only continue to struggle to understand this for yourselves, yet also grateful you find time to aid others to perhaps sort some of it out.

“verum est verus”

Best,

Phil

William said...

Bee,

While the software you linked to doesn't seem to be able to pick out the context, here's a sample from google translate:

piece of string -> Bout de ficelle
string theory -> La théorie des cordes
ascii string -> Chaîne ASCII

I'm impressed. I wonder how they do that.

Uncle Al said...

1) Time flies like an arrow, fruit flies like a banana.

2) "out of sight, out of mind" => "invisible idiot."

3) Given Geschwindigkeitsbegrenzung, where is there room on the sign for the number? Hence natural evolution of the Autobahn with no speed limit (and presumably all the signs at the end).

Phil Warnell said...

Hi William,
“I'm impressed. I wonder how they do that.”

As I understand it, it’s basically a combination of statistical comparison and what I would call an evolutionary element, explained as follows:

“In order to make this happen, Google plans to bring some drastic reforms to the algorithms working behind scene. As is expected of a technology pioneer, they've devised a new approach to the whole issue - namely, "statistical machine translation" which differs from any of the past efforts in that it forgoes language experts who program grammatical rules and dictionaries into computers.

The new process involves feeding pre-translated parallel text in various languages into computers and then relying on them to discern patterns for future translations. At the moment the quality offered by this mechanism isn't perfect either - but there's a distinct advantage to such a statistical analysis engine. With time, "the more data we feed into the system, the better it gets" said Franz Och, the head of Google's translation effort at its Mountain View, California headquarters”
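A drastically simplified sketch of the "feed in parallel text and discern patterns" idea, assuming a tiny made-up corpus of already aligned phrase pairs. Real statistical machine translation systems are far more involved; the corpus, the function names, and the plain frequency count here are illustrative assumptions and not a description of Google's system.

```python
from collections import Counter, defaultdict

# Made-up parallel corpus: (English phrase, French phrase) pairs.
PARALLEL = [
    ("string theory", "théorie des cordes"),
    ("string theory", "théorie des cordes"),
    ("string theory", "théorie des chaînes"),
    ("piece of string", "bout de ficelle"),
]


def train(pairs):
    """Count how often each target phrase is paired with each source phrase."""
    table = defaultdict(Counter)
    for src, dst in pairs:
        table[src][dst] += 1
    return table


def most_likely_translation(table, phrase):
    """Pick the most frequently observed translation, or None if unseen."""
    if phrase not in table:
        return None
    return table[phrase].most_common(1)[0][0]


table = train(PARALLEL)
print(most_likely_translation(table, "string theory"))
```

Adding more pairs to the toy corpus shifts the counts, which is the sense in which more data can improve the choice.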

So one might ask how do I know? Simple, I of course searched Google. Oh, Sergey and Brin are clever fellows. Damn, they are also rich :-)

Best,

Phil

Phil Warnell said...

Hi William,

Sorry, that should read "Sergey and Larry" are clever fellows not "Sergey and Brin", since they are Sergey Brin and Larry Page. This also is to remind that while they are clever I am not. Simply a practiced Googler :-)

Best,

Phil

William said...

Thanks, Phil!

Arun said...

Dear Bee,

Part of the meaning of language does not reside in the language itself, it resides in our heads.

That is, there is not an absolute intrinsic meaning to a text; any reading involves interpretation. (We get nasty forms of religious fundamentalism when people forget this fact when they read ancient texts.)

This is why, even when linguists and archaeologists have been able to decipher a dead language, the meaning of the texts can still be ambiguous.

Perhaps only the driest and most pedantic of scientific texts has an intrinsic meaning independent of the observer.

So until a computer can have some or all of our mental models, it cannot do a good job of translation.

----

An old joke is "The vodka is good, but the meat is rotten" as the English-Russian-English translation of "the spirit is willing but the flesh is weak".

----

Since I'm sure I'll be challenged on this, let me find a simple way of proving my main assertion above.

Best,
-Arun

Anonymous said...

Hi, I did my own research with Babelize. I tried "I had a dream". It goes back and forth perfectly thru German, French and Portuguese. I mean after the 3rd pass it still says in English "I had a dream". So far so good. Then it goes forward to translate into Spanish and returns back to English. The final pass. And this is the result:

"I was a sleepy"

Try it yourself. I can't imagine Rev. Martin Luther King saying "I was a sleepy"

Btw, other translators do a perfect translation English-Spanish-English of the same sentence. Curious.

Arun said...

http://syrcom.cua.edu/Hugoye/Vol6No1/HV6N1PRPhenixHorn.html

(Not an easy read, I understand only the gist of it.)

An example which makes me wonder what the meaning of text is. For more than a thousand years, millions of men interpreted a text to mean nubile women would be available to them in heaven. In today's world, men are willing suicide bombers based on this belief. Now this guy comes along and tells them sorry, it was really white grapes all along.

Notice that one needs a theory of how the text was produced and how it was transmitted. I think in general there are whole hosts of assumptions we don't state when we read something. Fortunately for most part the assumptions are right and so the exercise works.

Bee said...

Well, taken together it seems to me the translation algorithms used for the babelizer aren't quite the best, maybe deliberately so. I've used Google translate various times (e.g. to check trackbacks to blogposts written in languages I don't speak), and though it isn't perfect it is usually sufficient to understand what is written about.

Anonymous said...

Indeed an interesting program:

Original English Text:
I love you

Translated to French:
Je t'aime

Translated back to English:
I love you

Translated to German:
Ich liebe Dich

Translated back to English:
I love you

Translated to Italian:
Ti amo

Translated back to English:
I love to you

Translated to Portuguese:
Eu amo-lhe

Translated back to English:
I love to it

Translated to Spanish:
Amo a él

Translated back to English:
Master to him
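The chain above can be replayed with a small sketch. The lookup tables below simply copy the outputs listed in this comment and stand in for the real translator; the names `PASSES`, `babelize`, and `first_drift` are hypothetical helpers for illustration.

```python
# Each pass is (language, forward table, backward table), copied from the
# outputs listed above; the tables stand in for the real translator.
PASSES = [
    ("French", {"I love you": "Je t'aime"}, {"Je t'aime": "I love you"}),
    ("German", {"I love you": "Ich liebe Dich"}, {"Ich liebe Dich": "I love you"}),
    ("Italian", {"I love you": "Ti amo"}, {"Ti amo": "I love to you"}),
    ("Portuguese", {"I love to you": "Eu amo-lhe"}, {"Eu amo-lhe": "I love to it"}),
    ("Spanish", {"I love to it": "Amo a él"}, {"Amo a él": "Master to him"}),
]


def babelize(text, passes):
    """Run text through each round trip; return the list of English results."""
    history = [text]
    for _language, forward, backward in passes:
        text = backward[forward[text]]
        history.append(text)
    return history


def first_drift(history):
    """Index of the first pass whose output differs from the original, or None."""
    for i, result in enumerate(history[1:], start=1):
        if result != history[0]:
            return i
    return None


history = babelize("I love you", PASSES)
print(history)
print(first_drift(history))
```

Once one pass returns something other than its input, every later pass starts from the damaged text, so the errors compound instead of cancelling.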

Arun said...

google tools:

This and that

English to German

Dies und Das

German to French

Si l'on y ajoute L'

French to English

If one adds'

-----------------

This and that

English to French

Que ce

French to German

Was

German to English

What

----

:)

Oliver said...

This reminds me of that masterpiece of the English language, `English as She is Spoke', by Pedro Carolino. He had the worthy idea of creating a Portuguese - English phrase book for the particular benefit of the Portuguese youth. Thankfully for us, he was not deterred by the fact that he didn't speak English. However, he did have to hand both a Portuguese - French phrase book and a French - English dictionary.

There's an entry about the book on Wikipedia, which also has a link to some of Carolino's `Idiotism's and Proverbs' (you couldn't make that up).