


Date: 08.12.2018

eWika: Towards the Digitalization of Philippine Languages

  • Isalin
  • Translate
  • Charibeth K. Cheng (koc@dlsu.edu.ph)
  • DLSU, College of Computer Studies
  • Natural Language Processing Research Lab

MT Research in RP

  • started in 1993 at UP-Los Baños
  • Dr. Rachel Roxas and Allan Borra
    • grammar-based
  • started at DLSU in 2004

ENG-FIL MT System Project

  • 3-year project
  • started 2005
  • funded by DOST-PCASTRD
  • composition:
    • 6 faculty members of College of Computer Studies
    • 15 computer science majors
    • assisted by the Filipino Department and the Department of English and Applied Linguistics of DLSU-M

Architectural Design of the Program

  • Language Resources:
    • Lexicon (electronic dictionary)
    • Morphological Analyzer & Generator
    • Part-of-Speech tagger
    • Grammar
    • Corpus (Tagged)
  • MT: Example-based
  • MT: Rule-based
  • User Interface
  • Output Modeller
  • Source Text
  • Target Text
  • Translator Engine

Rule-Based approach

  • Apply translation rules
  • The boy ate apples.
  • Kumain ng mga mansanas ang batang lalaki.
  • Where do we get the translation rules?

Example-Based

  • Learn the rules from examples
  • The boy ate apples.
  • Kumain ng mga mansanas ang batang lalaki.
  • [Figure: the parts of each sentence are labeled A, B, C, D to show how they align]
  • Rule Learned:
  • A B C D -> C ng D A B
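The learning step can be sketched in a few lines of Python. This is only an illustration, not the project's actual learner: the function name `learn_rule` is hypothetical, and the word-to-label alignment below is an assumption read off the example sentences (the original figure did not survive extraction).

```python
def learn_rule(source_chunks, target_chunks):
    """Derive a reordering template from one hand-aligned example.

    source_chunks: list of (label, text) pairs for the English sentence.
    target_chunks: list of (label, text) pairs for the Filipino sentence;
                   a label of None marks a literal word with no English partner.
    """
    source_pattern = [label for label, _ in source_chunks]
    # Aligned chunks become slots; unaligned words are kept literally.
    target_pattern = [label if label is not None else text
                      for label, text in target_chunks]
    return source_pattern, target_pattern


# Assumed alignment for the example pair (not given explicitly in the slides).
source = [("A", "The"), ("B", "boy"), ("C", "ate"), ("D", "apples")]
target = [("C", "Kumain"), (None, "ng"), ("D", "mga mansanas"),
          ("A", "ang"), ("B", "batang lalaki")]

rule = learn_rule(source, target)
print(" ".join(rule[0]), "->", " ".join(rule[1]))
```

Under these assumptions the printed template matches the rule on the slide: the literal marker "ng" stays fixed while the labeled slots are reordered.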

Using the rule

  • A B C D -> C ng D A B
  • The mother cooked fish.
  • Nagluto ng isda ang nanay.
  • [Figure: the parts of each sentence are labeled A, B, C, D]

Using the rule

  • A B C D -> C ng D A B
  • The mother went home.
  • Umuwi ng bahay ang nanay.
  • [Figure: the parts of each sentence are labeled A, B, C, D]
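Applying the learned rule to the two sentences above might look like the following sketch. The names `RULE`, `LEXICON`, and `apply_rule` are hypothetical, the tiny word list is filled in by hand from the slide examples, and the one-word-per-slot matching is a simplifying assumption rather than a description of the real engine.

```python
# The learned template: source slots and the reordered target slots.
RULE = (["A", "B", "C", "D"], ["C", "ng", "D", "A", "B"])

# Hypothetical toy lexicon covering only the words in the slide examples.
LEXICON = {
    "the": "ang", "mother": "nanay",
    "cooked": "nagluto", "fish": "isda",
    "went": "umuwi", "home": "bahay",
}


def apply_rule(sentence, rule=RULE, lexicon=LEXICON):
    """Translate a sentence with the template, or return None if it does not fit."""
    source_pattern, target_pattern = rule
    words = sentence.rstrip(".").split()
    if len(words) != len(source_pattern):
        return None  # the template assumes exactly one word per slot
    # Bind each slot label to the translation of the word in that position.
    bindings = {label: lexicon[word.lower()]
                for label, word in zip(source_pattern, words)}
    # Slots are replaced by their bindings; literals such as "ng" pass through.
    text = " ".join(bindings.get(slot, slot) for slot in target_pattern)
    return text[0].upper() + text[1:] + "."


print(apply_rule("The mother cooked fish."))
print(apply_rule("The mother went home."))
```

With this toy lexicon the two calls reproduce the Filipino sentences shown on the slides.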

Limitation of a Rule

  • The boy ate the fish.
  • A B C D -> C ng D A B
  • [Figure: the labels A, B, C, D applied to the sentence]
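One way to see the limitation is to check whether a sentence even fits the four-slot pattern. The sketch below assumes one word per slot, which is a simplification; the slide does not say exactly how the real engine matches patterns, and `rule_matches` is a hypothetical helper.

```python
def rule_matches(source_pattern, sentence):
    """Return True if the sentence has exactly one word per pattern slot."""
    words = sentence.rstrip(".").split()
    return len(words) == len(source_pattern)


pattern = ["A", "B", "C", "D"]
fits_original = rule_matches(pattern, "The boy ate apples.")
fits_new = rule_matches(pattern, "The boy ate the fish.")
print(fits_original, fits_new)
```

The second sentence has five words, so under this assumption the learned rule does not apply and another rule (or a more general pattern) would be needed.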

Results of the MT Engine

  • Qualities of a Good Translation
    • Clarity – 3.3
    • Accuracy – 3.2
    • Naturalness – 2.8
  • scores are out of a highest possible 5
  • 100 respondents (including 5 linguists)

Challenge!

  • Language resources
    • Quality of translation is dependent on it.
    • Built from almost non-existent digital forms
    • manual vs. automatic construction
  • Dictionary
  • Sample Translations
  • Grammar

Lexicon

  • Diksyunaryo ng Wikang Filipino
  • automatic construction (AeFLEX):
    • accuracy rate - 57%
  • Currently contains about 30,000+ entries
  • Challenge: Lexical resources
    • translation documents
    • part-of-speech tagger

Morphological Analyzer and Generator

  • Dictionary is incomplete
  • Create software that:
    • analyzes – determines the root word
    • generates – generates the inflected word
    • Example: eating -> eat -> kain -> kumakain
  • Challenge: Lexical resources
    • lexicon
    • part-of-speech tagger
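The eating -> eat -> kain -> kumakain chain can be sketched as follows. This is a toy, not the project's analyzer or generator: the function names and the one-entry lexicon are hypothetical, the English analysis only strips "-ing", and the Filipino generation covers only the -um- imperfective form for roots that start with a vowel or with a single consonant followed by a vowel.

```python
VOWELS = "aeiou"

# Hypothetical one-entry bilingual lexicon of root words.
ROOT_LEXICON = {"eat": "kain"}


def analyze_english(word):
    """Toy analysis: strip the progressive suffix to recover the root."""
    if word.endswith("ing"):
        return word[:-3], "progressive"
    return word, "base"


def generate_um_imperfective(root):
    """Toy generation of the -um- imperfective (e.g. kain -> kumakain)."""
    root = root.lower()
    if root[0] in VOWELS:
        # Vowel-initial root: prefix um- and repeat the first vowel.
        return "um" + root[0] + root
    # Consonant-initial root: repeat the first consonant+vowel,
    # then insert -um- after the first consonant.
    reduplicated = root[:2] + root
    return reduplicated[0] + "um" + reduplicated[1:]


def translate_inflected(word):
    """Analyze the English word, look up the root, generate the Filipino form."""
    root, aspect = analyze_english(word)
    filipino_root = ROOT_LEXICON[root]
    if aspect == "progressive":
        return generate_um_imperfective(filipino_root)
    return filipino_root


print(translate_inflected("eating"))
```

The point of the design is that the dictionary only needs root words; inflected forms on both sides are handled by the analyzer and generator.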

Part-Of-Speech Tagger

  • automatic association of parts-of-speech to words in a document
    • Can? – kaya vs. lata
    • Baba? – chin or go down
  • Challenge: Lexical resources
    • corpora
    • lexicon
    • morphological analyzer
    • grammar
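The "can" example shows why tagging matters for translation: the tag selects the Filipino word. The sketch below uses a single hand-written context heuristic (a determiner before "can" means noun) purely for illustration; a real tagger would be trained on a tagged corpus. The names `DETERMINERS`, `TRANSLATION`, `tag_can`, and `translate_can` are hypothetical, and MD and NN are the Penn Treebank tags for modal and noun.

```python
DETERMINERS = {"a", "an", "the", "this", "that"}

# Translation chosen by (word, part-of-speech tag).
TRANSLATION = {("can", "MD"): "kaya", ("can", "NN"): "lata"}


def tag_can(tokens, index):
    """Toy heuristic: 'can' after a determiner is a noun, otherwise a modal."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    return "NN" if previous in DETERMINERS else "MD"


def translate_can(sentence):
    """Return (tag, translation) for every occurrence of 'can' in the sentence."""
    tokens = sentence.rstrip(".?!").split()
    results = []
    for i, token in enumerate(tokens):
        if token.lower() == "can":
            tag = tag_can(tokens, i)
            results.append((tag, TRANSLATION[("can", tag)]))
    return results


print(translate_can("I can swim."))
print(translate_can("Open the can."))
```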

Corpora

  • collection of translation-pair documents
  • used by the lexicon extractor, the part-of-speech tagger, and example-based MT
  • came from translation works of DLSU English majors, verified by linguists
  • consists of 207,000 words

Lexicon Resource Dependency

  • Corpus
  • POS Tagger
  • Morph AG
  • Lexicon
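The dependencies listed under each resource's "Challenge" bullets can be written down as a small graph and checked for cycles. The sketch below is an illustration built only from those bullets; the dictionary `DEPENDS_ON` and the function `find_cycle` are hypothetical names, and "translation documents" is read here as the corpus.

```python
# Each resource maps to the resources its "Challenge" bullets say it needs.
DEPENDS_ON = {
    "lexicon": ["corpus", "pos_tagger"],
    "morph_ag": ["lexicon", "pos_tagger"],
    "pos_tagger": ["corpus", "lexicon", "morph_ag", "grammar"],
    "corpus": [],
    "grammar": [],
}


def find_cycle(graph):
    """Depth-first search; return one dependency cycle as a list, or None."""
    visiting, done = set(), set()

    def visit(node, path):
        if node in done:
            return None
        if node in visiting:
            # Found a back edge: slice the path from the first visit of node.
            return path[path.index(node):] + [node]
        visiting.add(node)
        for dependency in graph.get(node, []):
            cycle = visit(dependency, path + [node])
            if cycle:
                return cycle
        visiting.discard(node)
        done.add(node)
        return None

    for node in graph:
        cycle = visit(node, [])
        if cycle:
            return cycle
    return None


cycle = find_cycle(DEPENDS_ON)
print(" -> ".join(cycle))
```

A cycle here means the resources cannot simply be built one after another; some of them have to be bootstrapped manually or improved together in rounds.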

Bringing it home …

  • 171 Philippine Languages (SIL)
  • No Philippine Corpora
  • Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc)
  • “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)

eWika: Digitalization of Philippine Languages

  • Build the Philippine Corpus
  • Build software tools to study or use the corpus
    • Across Regions
    • Across Forms and Genres
    • Across Languages

Across Regions

  • Web-based application: GLOBALIZATION
    • upload, download, tools
  • Contributors (Main players)
  • Verifiers
  • Server: DLSU-M commits to host the server for the next three years.
  • Terms of Use: Research purposes.

Across Languages

  • 171 Philippine Languages (SIL List)
  • start with 8 major languages
    • Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano
  • Filipino Sign Language

Across Forms and Genres

  • In various forms:
    • Text
    • Speech
    • Video: Filipino sign language
  • In various Genres:
    • Text – literary & creative, essays, news articles, religious, etc
    • Speech – scripted, conversations, etc
    • Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)
  • The dream of building electronic, online Philippine language resources and tools
  • Many many many major hurdles to overcome
  • NEEDED: Language Resources, Tools, & Peopleware


