eWika: Towards the Digitalization of Philippine Languages
Date: 08.12.2018
- Charibeth K. Cheng (koc@dlsu.edu.ph)
- DLSU, College of Computer Studies
- Natural Language Processing Research Lab
MT Research in RP - started in 1993 at UP-Los Baños
- Dr. Rachel Roxas and Allan Borra
- started at DLSU in 2004
ENG-FIL MT System Project - 3-year project
- started 2005
- funded by DOST-PCASTRD
- composition:
- 6 faculty members of College of Computer Studies
- 15 computer science majors
- assisted by the Filipino Department and the Department of English and Applied Linguistics of DLSU-M
Architectural Design of the Program - Language Resources:
- Lexicon (electronic dictionary)
- Morphological Analyzer & Generator
- Part-of-Speech tagger
- Grammar
- Corpus (Tagged)
Rule-Based approach - Kumain ng mga mansanas ang batang lalaki.
- Where do we get the translation rules?
Example-Based - Learn the rules from examples
- Kumain ng mga mansanas ang batang lalaki.
Using the rule - Nagluto ng isda ang nanay.
Using the rule - Umuwi ng bahay ang nanay.
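The idea of applying one translation rule to new sentences can be sketched in Python. This is a minimal illustration, not the project's actual rule format: the pattern VERB ng [mga] OBJECT ang SUBJECT, the English glosses, and the names LEXICON, PATTERN, and translate are all assumptions made for this example.

```python
import re

# Hypothetical toy lexicon with assumed English glosses
# (not the project's actual dictionary).
LEXICON = {
    "kumain": "ate",
    "nagluto": "cooked",
    "umuwi": "went home",
    "mansanas": "apples",
    "isda": "fish",
    "bahay": "house",
    "batang lalaki": "boy",
    "nanay": "mother",
}

# Assumed rule: Filipino "VERB ng [mga] OBJECT ang SUBJECT."
# maps to English "The SUBJECT VERB OBJECT."
PATTERN = re.compile(
    r"^(?P<verb>\w+) ng (?:mga )?(?P<obj>.+?) ang (?P<subj>.+?)\.?$",
    re.IGNORECASE,
)


def translate(sentence):
    """Apply the single reordering rule; return None if it does not apply."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    try:
        verb = LEXICON[match.group("verb").lower()]
        obj = LEXICON[match.group("obj").lower()]
        subj = LEXICON[match.group("subj").lower()]
    except KeyError:
        # A word missing from the lexicon cannot be translated by this sketch.
        return None
    return f"The {subj} {verb} {obj}."


for source in (
    "Kumain ng mga mansanas ang batang lalaki.",
    "Nagluto ng isda ang nanay.",
    "Umuwi ng bahay ang nanay.",
):
    print(source, "->", translate(source))
```

Under these assumptions the first two sentences come out as "The boy ate apples." and "The mother cooked fish.", while the third comes out as "The mother went home house.", which hints at how a single rule can overgeneralize.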
Limitation of a Rule - Qualities of a Good Translation
- Clarity – 3.3
- Accuracy – 3.2
- Naturalness – 2.8
- highest possible score: 5
- 100 respondents (5 linguists)
Challenge! - Language resources
- Translation quality depends on them.
- Built from almost non-existent digital forms
- manual vs. automatic construction
Lexicon - Diksyunaryo ng Wikang Filipino
- automatic construction (AeFLEX):
- Currently contains about 30,000+ entries
- Challenge: Lexical resources
- translation documents
- part-of-speech tagger
- Dictionary is incomplete
- Create software that:
- analyzes – determines the root word
- generates – produces the inflected word
- Given: eating -> eat -> kain -> kumakain
- Challenge: Lexical resources
- lexicon
- part-of-speech tagger
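The analyze-then-generate chain in the example (eating -> eat -> kain -> kumakain) can be sketched as follows. This is a toy under stated assumptions, not the project's software: the suffix-stripping analyzer, the one-entry bilingual lexicon, and the function names are hypothetical, and the generator covers only the -um- progressive pattern (reduplicate the first syllable, then add -um-).

```python
VOWELS = "aeiou"

# Hypothetical one-entry bilingual lexicon: English root -> Filipino root.
ROOT_LEXICON = {"eat": "kain"}


def analyze_english(word):
    """Toy analyzer: strip a progressive '-ing' suffix to guess the root."""
    if word.endswith("ing") and len(word) > 4:
        return word[:-3]
    return word


def generate_um_progressive(root):
    """Toy generator for the -um- progressive form of a Filipino root."""
    if root[0] in VOWELS:
        # Vowel-initial root: repeat the first vowel, then prefix "um".
        return "um" + root[0] + root
    # Consonant-initial root: reduplicate the first consonant + vowel,
    # then insert "um" after the first consonant.
    reduplicated = root[:2] + root
    return reduplicated[0] + "um" + reduplicated[1:]


def translate_progressive(english_word):
    """Chain the steps: analyze, look up the root, generate."""
    english_root = analyze_english(english_word)
    filipino_root = ROOT_LEXICON.get(english_root)
    if filipino_root is None:
        return None  # the root is missing from the toy lexicon
    return generate_um_progressive(filipino_root)


print(translate_progressive("eating"))
```

With these assumptions, "eating" is analyzed to "eat", mapped to "kain", and generated as "kumakain". A real analyzer and generator would need many more affix patterns and a far larger lexicon, which is the resource dependency the list above points to.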
Part-Of-Speech Tagger - automatic association of parts-of-speech to words in a document
- Can? – kaya vs. lata
- Baba? – chin or go down
- Challenge: Lexical resources
- corpora
- lexicon
- morphological analyzer
- grammar
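The "can" ambiguity (kaya vs. lata) shows why a tag is needed before a word can be translated. The sketch below uses one hand-written context rule, purely as an assumption for illustration: "can" after a determiner is treated as a noun, otherwise as a modal. The names DETERMINERS, TRANSLATIONS, tag_can, and gloss_can are hypothetical, and a real tagger would be trained on a tagged corpus rather than rely on a single rule.

```python
# Assumed context rule for this sketch only: "can" right after a determiner
# is a noun ("lata"); anywhere else it is a modal ("kaya").
DETERMINERS = {"a", "the", "this", "that"}
TRANSLATIONS = {"NOUN": "lata", "MODAL": "kaya"}


def tag_can(tokens, index):
    """Return a part-of-speech tag for the token "can" at the given index."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    return "NOUN" if previous in DETERMINERS else "MODAL"


def gloss_can(sentence):
    """Return (tag, Filipino gloss) for every "can" in the sentence."""
    tokens = sentence.strip(" .?!").split()
    results = []
    for index, token in enumerate(tokens):
        if token.lower() == "can":
            tag = tag_can(tokens, index)
            results.append((tag, TRANSLATIONS[tag]))
    return results


print(gloss_can("I can swim."))
print(gloss_can("Open the can."))
```

Here "I can swim." yields the modal reading (kaya) and "Open the can." yields the noun reading (lata).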
Corpora - collection of translation-pair documents
- used by the lexicon extractor and part-of-speech tagger, example-based MT
- came from translation works of DLSU English majors, verified by linguists
- consists of 207,000 words
Lexicon Resource Dependency
Bringing it home … - 171 Philippine Languages (SIL)
- No Philippine Corpora
- Unfortunately, today, the Philippines has one of the highest rates of dying languages (Solfed Foundation Inc)
- “Without our language, we have no culture, we have no identity, we are nothing.” (Thorrson)
eWika: Digitalization of Philippine Languages - Build the Philippine Corpus
- Build software tools to study or use the corpus
- Across Regions
- Across Forms and Genres
- Across Languages
Across Regions - Web-based application: GLOBALIZATION
- Contributors (Main players)
- Verifiers
- Server: DLSU-M commits to host the server for the next three years.
- Terms of Use: Research purposes.
Across Languages - 171 Philippine Languages (SIL List)
- start with 8 major languages
- Tagalog, Cebuano, Ilocano, Hiligaynon, Bikol, Waray, Kapampangan, Boholano
- Filipino Sign Language
Across Forms and Genres - In various forms:
- Text
- Speech
- Video: Filipino sign language
- In various Genres:
- Text – literary & creative, essays, news articles, religious, etc.
- Speech – scripted, conversations, etc.
- Video – common signs, regional signs, signs for specific purposes (legal, IT, etc.)
- The dream of building electronic, online Philippine language resources and tools
- Many major hurdles to overcome
- NEEDED : Language Resources, Tools, & Peopleware