Embracing Diversity: Searching over multiple languages


Tommaso Teofili
Suneel Marthi

June 12, 2017
Berlin Buzzwords, Berlin, Germany

$WhoAreWe

Tommaso Teofili
@tteofili
  • Software Engineer, Adobe Systems
  • Member of Apache Software Foundation
  • PMC Chair, Apache Lucene
  • Committer and PMC on Apache Joshua, Apache OpenNLP, Apache Jackrabbit


Suneel Marthi
@suneelmarthi
  • Principal Software Engineer, Office of Technology, Red Hat
  • Member of Apache Software Foundation
  • Committer and PMC on Apache Mahout, Apache OpenNLP, Apache Streams

Agenda

  • What is Multi-Lingual Search?
  • Why Multi-Lingual Search?
  • What is Statistical Machine Translation?
  • Overview of Apache Joshua
  • Dataflow Pipeline
  • Demo

What is Multi-Lingual Search?

  • Searching
    • over content written in different languages
    • with users speaking different languages
    • both
  • Parallel corpora
  • Translating queries
  • Translating documents
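
As a rough illustration of the "translating queries" option, the sketch below translates a German query word by word and runs it against a tiny English inverted index. Everything here is hypothetical: the dictionary `DE_EN`, the documents in `DOCS`, and the helper names are made up for the example, and the word-by-word lookup stands in for a real machine translation step.

```python
from collections import defaultdict

# Hypothetical toy German-to-English dictionary; a real system would call an MT decoder.
DE_EN = {"gebäude": "building", "hoch": "high", "haus": "house"}

# Hypothetical English-only document collection.
DOCS = {
    1: "the building is high",
    2: "a small house in Berlin",
    3: "high tower",
}

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def translate_query(query, dictionary):
    """Translate word by word; unknown words pass through unchanged."""
    return [dictionary.get(tok, tok) for tok in query.lower().split()]

def search(query, index, dictionary):
    """Return ids of documents that contain every translated query term."""
    terms = translate_query(query, dictionary)
    postings = [index.get(t, set()) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

INDEX = build_index(DOCS)
print(search("Gebäude hoch", INDEX, DE_EN))
```

Translating documents instead would move the translation step to indexing time, so the index is built once in the users' language and queries need no translation.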

Why Multi-Lingual Search?

Embracing diversity

  • Most online tech content is in English
    • Wikipedia dumps:
      • en: 62GB
      • de: 17GB
      • it: 10GB
  • A good number of non-English-speaking users
  • A lot of search queries are composed in English
  • Preferable to retrieve search results in native language
  • … or even to consolidate all results in one language

UC1 — tech domain, native first

UC2 — native only ?

What is Machine Translation?

Generate translations from statistical models trained on bilingual corpora.

Translation follows a probability distribution p(e|f)

e = string in the target language (English)

f = string in the source language (Spanish)

e~ = argmax_e p(e|f) = argmax_e p(f|e) * p(e)   (Bayes' rule; p(f) is constant for a given f)

e~ = best translation, the one with highest probability
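
The argmax can be made concrete with a toy example. The sketch below picks the English candidate e that maximizes p(f|e) * p(e) for a single source word; all probabilities are invented for illustration and the function name `best_translation` is hypothetical.

```python
# Hypothetical toy probabilities for translating the German word "Gebäude".
# translation_model[e] stands in for p(f|e); these values need not sum to 1 across e.
translation_model = {"building": 0.6, "house": 0.3, "tower": 0.2}
# language_model[e] stands in for p(e), how likely e is as English text.
language_model = {"building": 0.3, "house": 0.5, "tower": 0.2}

def best_translation(candidates, tm, lm):
    """Return the candidate e that maximizes p(f|e) * p(e)."""
    return max(candidates, key=lambda e: tm[e] * lm[e])

candidates = list(translation_model)
e_best = best_translation(candidates, translation_model, language_model)
print(e_best)
```

With these numbers "building" wins (0.6 * 0.3 = 0.18) over "house" (0.3 * 0.5 = 0.15), even though "house" is the more frequent English word.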

Word-based Translation

How to translate a word → look it up in a dictionary
Gebäude: building, house, tower.

Multiple translations
some more frequent than others
for instance: house and building most common

Look at a parallel corpus
(German text along with English translation)

Translation of Gebäude   Count          Probability
house                    5.28 billion   0.51
building                 4.16 billion   0.402
tower                    9.28 million   0.09
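
One common way to obtain such probabilities is relative frequency: count how often each English word is aligned to the source word and divide by the total. The sketch below assumes a hypothetical list of aligned word pairs with small made-up counts; it does not reproduce the figures in the table above.

```python
from collections import Counter

# Hypothetical aligned (German, English) word pairs, as if read off a
# word-aligned parallel corpus. The counts are invented for illustration.
aligned_pairs = (
    [("Gebäude", "house")] * 5
    + [("Gebäude", "building")] * 4
    + [("Gebäude", "tower")] * 1
)

def translation_probabilities(pairs, source_word):
    """Estimate p(e|f) as count(f, e) / count(f) for one source word f."""
    counts = Counter(e for f, e in pairs if f == source_word)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}

probs = translation_probabilities(aligned_pairs, "Gebäude")
print(probs)
```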

Alignment

  • In a parallel text (or when we translate), we align the words in one language with the words in the other
    1     2          3     4
    Das   Gebäude    ist   hoch
    the   building   is    high
  • Word positions are numbered 1 to 4

Alignment Function

  • Define the Alignment with an Alignment Function
  • Mapping an English target word at position i to a German source word at position j with a function a : i → j
  • Example
a : {1 → 1, 2 → 2, 3 → 3, 4 → 4}
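
In code, the alignment function can be sketched as a plain mapping from target positions to source positions. The helper name `aligned_pairs` is hypothetical; positions are 1-based to match the slide.

```python
german = ["Das", "Gebäude", "ist", "hoch"]
english = ["the", "building", "is", "high"]

# Alignment function a: English position i -> German position j (1-based).
a = {1: 1, 2: 2, 3: 3, 4: 4}

def aligned_pairs(target, source, alignment):
    """List (target word, source word) pairs according to the alignment."""
    return [(target[i - 1], source[j - 1]) for i, j in sorted(alignment.items())]

pairs = aligned_pairs(english, german, a)
print(pairs)
```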

One-to-Many Translation

A source word could translate into multiple target words

Das    ist   ein   Hochhaus
This   is    a     high rise building
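
With an alignment function from target positions to source positions, one-to-many translation simply means several target positions map to the same source position. The sketch below assumes the alignment implied by the example (English positions 4 to 6 all point at "Hochhaus"); the helper name `targets_per_source` is hypothetical.

```python
from collections import defaultdict

german = ["Das", "ist", "ein", "Hochhaus"]
english = ["This", "is", "a", "high", "rise", "building"]

# Assumed alignment: English positions 4, 5 and 6 all map to German position 4.
a = {1: 1, 2: 2, 3: 3, 4: 4, 5: 4, 6: 4}

def targets_per_source(target, source, alignment):
    """Group target words by the source word they are aligned to."""
    grouped = defaultdict(list)
    for i, j in sorted(alignment.items()):
        grouped[source[j - 1]].append(target[i - 1])
    return dict(grouped)

grouped = targets_per_source(english, german, a)
print(grouped["Hochhaus"])
```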

Phrase-based Translation

  • Word-Based Models translate words as atomic units
  • Phrase-Based Models translate phrases as atomic units
  • Advantages:
    • many-to-many translation can handle non-compositional phrases
    • use of local context in translation
    • the more data, the longer phrases can be learned
  • “Standard Model”, used by Google Translate and others

Phrase-Based Model

Berlin | ist | ein | herausragendes | Kunst- und Kulturzentrum.
Berlin | is  | an  | outstanding    | Art and cultural center.

  • Foreign input is segmented into phrases
  • Each phrase is translated into English
  • Phrases are reordered
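
The first two steps can be sketched with a toy phrase table and greedy longest-match segmentation. This is a simplification under stated assumptions: the phrase table is hypothetical, segmentation in a real system is chosen by the decoder rather than greedily, and the reordering step is left out (the output is monotone).

```python
# Hypothetical toy phrase table (German phrase -> English phrase).
PHRASE_TABLE = {
    ("Berlin",): "Berlin",
    ("ist",): "is",
    ("ein",): "an",
    ("herausragendes",): "outstanding",
    ("Kunst-", "und", "Kulturzentrum"): "art and cultural center",
}

def segment(tokens, table, max_len=3):
    """Greedy longest-match segmentation into phrases known to the table."""
    phrases, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                phrases.append(phrase)
                i += n
                break
        else:
            # No known phrase starts here: keep the word as its own phrase.
            phrases.append((tokens[i],))
            i += 1
    return phrases

def translate(sentence, table):
    """Translate each phrase in order; unknown phrases pass through unchanged."""
    tokens = sentence.split()
    return " ".join(table.get(p, " ".join(p)) for p in segment(tokens, table))

print(translate("Berlin ist ein herausragendes Kunst- und Kulturzentrum", PHRASE_TABLE))
```

Note how "Kunst- und Kulturzentrum" is translated as one unit, which is the many-to-many behaviour that word-based models cannot express.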

Decoding

  • We have a mathematical model for translation p(e|f)
  • Task of decoding: find the translation e_best with the highest probability
    e_best = argmax_e p(e|f)
  • Two types of error
    • the most probable translation is bad → fix the model
    • search does not find the most probable translation → fix the search

Translation Process

Translate this query from German into English

Source:  er | trinkt | ja noch nichts
Picked:  er
Output:  he

Pick an input phrase, translate

Translation Process

Translate this query from German into English

Source:  er | trinkt | ja noch nichts
Picked:  er, ja noch nichts
Output:  he | does not yet

Pick an input phrase, translate

Translation Process

Translate this query from German into English

Source:  er | trinkt | ja noch nichts
Picked:  er, ja noch nichts, trinkt
Output:  he | does not yet | drink

Pick an input phrase, translate
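
The process of picking uncovered source phrases, possibly out of order, can be sketched as a small exhaustive decoder. All numbers are invented: `PHRASES` is a hypothetical phrase table and `GOOD_BIGRAMS` is a crude stand-in for a language model. The sketch also assumes every source word is covered by some phrase. Because it enumerates every hypothesis it has no search error; practical decoders prune the hypothesis space instead (for example with beam search).

```python
import math

SOURCE = "er trinkt ja noch nichts".split()

# Hypothetical phrase table: source phrase -> (English phrase, probability).
PHRASES = {
    ("er",): ("he", 0.9),
    ("trinkt",): ("drink", 0.5),
    ("ja", "noch", "nichts"): ("does not yet", 0.4),
}

# Hypothetical set of fluent bigrams, standing in for a language model p(e).
GOOD_BIGRAMS = {("he", "does"), ("does", "not"), ("not", "yet"), ("yet", "drink")}

def lm_score(words):
    """Toy language model: mild log-penalty for fluent bigrams, heavy otherwise."""
    return sum(math.log(0.5) if (a, b) in GOOD_BIGRAMS else math.log(0.1)
               for a, b in zip(words, words[1:]))

def expand(covered, output, score):
    """Yield hypotheses that translate one more uncovered source phrase."""
    for start in range(len(SOURCE)):
        for end in range(start + 1, len(SOURCE) + 1):
            span = set(range(start, end))
            phrase = tuple(SOURCE[start:end])
            if phrase in PHRASES and not (span & covered):
                english, prob = PHRASES[phrase]
                yield covered | span, output + english.split(), score + math.log(prob)

def decode():
    """Exhaustively expand hypotheses and return the best complete translation."""
    stack = [(frozenset(), [], 0.0)]
    complete = []
    while stack:
        covered, output, score = stack.pop()
        if len(covered) == len(SOURCE):
            complete.append((score + lm_score(output), output))
            continue
        stack.extend(expand(covered, output, score))
    return " ".join(max(complete)[1])

print(decode())
```

All six phrase orderings share the same translation-model score here, so the toy language model alone decides that "he does not yet drink" beats orderings such as "he drink does not yet".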

Apache Joshua

  • Statistical Machine Translation Decoder for phrase-based and hierarchical machine translation
  • Written in Java
  • Provides 64 language packs for machine translation
    • https://cwiki.apache.org/confluence/display/JOSHUA/Language+Packs
  • Project initiated by Johns Hopkins University and the University of Pennsylvania
  • Presently incubating at the Apache Software Foundation
  • Used extensively by Amazon.com and NASA JPL
  • https://cwiki.apache.org/confluence/display/JOSHUA
  • @ApacheJoshua

Flows

References

  • Apache Joshua — https://cwiki.apache.org/confluence/display/JOSHUA
  • Apache OpenNLP — https://opennlp.apache.org
  • GitHub — https://github.com/smarthi/BBuzz-multilang-search
  • Slides — https://smarthi.github.io/bbuzz17-embracing-diversity-searching-over-multiple-languages/#/

Credits


  • Joern Kottmann — PMC Chair, Apache OpenNLP
  • Matt Post — PMC Chair, Apache Joshua
  • Bruno P. Kinoshita — Committer on Apache OpenNLP, committer and PMC on Apache Commons and Apache Jena

Questions ???