A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project

  • Adam Kilgarriff
  • Lexical Computing Ltd
  • http://www.sketchengine.co.uk

English Profile

  • From 2006
  • Cambridge Univ, Univ Press, ESOL (+ others)
  • Goal
    • for each CEFR level, find characteristic lexis and grammar
      • CEFR: Common European Framework of Reference
    • Main resource: CLC
  • NTNU Nov 2011
  • Kilgarriff

Cambridge Learner Corpus (CLC)

  • Since 1993
  • Leading resource
  • CUP and Cambridge Assessment
    • For better dictionaries, ELT courses, tests
    • Material: all from exams (levels A1-C2)
  • 45m words; 22m error-tagged
  • 200,000 scripts, 138 L1s, 203 nationalities

Sketch Engine

  • Leading corpus tool
  • Word sketches
    • One-page summaries of a word’s grammatical and collocational behaviour
  • In use at OUP, CUP, Collins, Macmillan, INL …
  • 55 languages
    • 175 corpora
    • Since May including CHILDES: demo
    • Since last year including CLC
  • Macmillan English Dictionary for Advanced Learners (ed. Rundell, 2002)

Error-coded corpus

  • Challenge
    • Intuitive to search for x
      • anywhere
      • only where it is part of an error
      • only where it is part of a correction
    • where x can be a word, phrase, grammar pattern …
    • Requirement for CLC in Sketch Engine
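
The three search modes listed above can be illustrated with a minimal sketch. It assumes each token carries a zone label saying whether it is ordinary text, part of the learner's error, or part of the correction. The `Token` class, the zone names and the `search` function are hypothetical illustrations, not the CLC annotation format or the Sketch Engine query language.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    zone: str  # "plain", "error" (what the learner wrote) or "correction"


# Scope names are hypothetical; they mirror the three search modes on the slide.
SCOPES = {
    "anywhere": {"plain", "error", "correction"},
    "error": {"error"},
    "correction": {"correction"},
}


def search(tokens: List[Token], word: str, scope: str = "anywhere") -> List[int]:
    """Return the positions of `word` that fall inside the requested scope."""
    allowed = SCOPES[scope]
    return [
        i for i, tok in enumerate(tokens)
        if tok.text.lower() == word.lower() and tok.zone in allowed
    ]


# Toy learner sentence "She depends of her parents", with "of" corrected to "on".
sentence = [
    Token("She", "plain"), Token("depends", "plain"),
    Token("of", "error"), Token("on", "correction"),
    Token("her", "plain"), Token("parents", "plain"),
]

print(search(sentence, "of", "anywhere"))
print(search(sentence, "of", "correction"))
print(search(sentence, "on", "correction"))
```

Here x is a single word; a phrase or grammar pattern would need a matcher over token sequences rather than single tokens.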

Error-coded corpora in SkE

  • demo

HOO / HOO+

  • Helping Our Own
  • HOO: English-NNS NLP researchers
    • Developer = user: motivation
    • Shared task/competitive evaluation
      • Organisers define task and prepare ‘gold standard’
      • Teams participate by running their software over test data
      • Six teams (incl Tübingen), workshop end Sept

HOO+ (2012)

  • Probably
    • English: learner data from CLC
    • Other languages?
    • Tasks
      • Essay scoring
      • Determiner, preposition errors
      • ?
      • http://www.clt.mq.edu.au/research/projects/hoo/

DANTE

  • Highlights of English lexicography
  • http://webdante.com

The KELLY Project

  • EU Lifelong Learning Project
  • Word cards
    • 9 languages
      • Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
    • All 36 pairs
    • Words the learner should know (at A1 … C2)
  • Partners
      • Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd
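
The pair counts follow directly from the nine languages: 36 unordered pairs, and 72 directed pairs once translation direction matters (the figure used in the Method slide). A short check, using only the language names listed above:

```python
from itertools import combinations, permutations

LANGS = [
    "Arabic", "Chinese", "English", "Greek", "Italian",
    "Norwegian", "Polish", "Russian", "Swedish",
]

# Unordered pairs: {English, Swedish} is the same pair as {Swedish, English}.
unordered_pairs = list(combinations(LANGS, 2))

# Directed pairs: English->Swedish and Swedish->English are counted separately.
directed_pairs = list(permutations(LANGS, 2))

print(len(unordered_pairs), len(directed_pairs))
```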

Interesting question

  • How close to purely corpus-based can a pedagogic list be?

Method

  • Take a general corpus
  • Count
  • Review, add, delete using other lists and corpora
  • Translate (72 directed-lg-pairs)
  • Words not in source list which occur in translations:
    • Review source list
  • http://kelly.sketchengine.co.uk
  • Symmetrical pairs: x translates to y and y translates to x
  • Cliques:
    • For x, y, z, … all pairs are symmetrical
    • 9-language cliques (English members)
      • hospital library music sun theory
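
A minimal sketch of the symmetry and clique tests, assuming the translations are stored as a mapping from a (language, word) item to the set of items it was translated to. The data structure, the function names and the toy entries are illustrative assumptions, not the KELLY database.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Item = Tuple[str, str]  # (language code, word)


def symmetric(a: Item, b: Item, translations: Dict[Item, Set[Item]]) -> bool:
    """True if a is translated as b and b is translated back as a."""
    return b in translations.get(a, set()) and a in translations.get(b, set())


def is_clique(items: Iterable[Item], translations: Dict[Item, Set[Item]]) -> bool:
    """True if every pair of items in the group is symmetrical."""
    return all(symmetric(a, b, translations) for a, b in combinations(list(items), 2))


# Toy translation table for three languages (illustrative entries only).
toy = {
    ("en", "sun"): {("sv", "sol"), ("it", "sole")},
    ("sv", "sol"): {("en", "sun"), ("it", "sole")},
    ("it", "sole"): {("en", "sun"), ("sv", "sol")},
    ("en", "music"): {("it", "musica")},  # no back-translation recorded
}

print(is_clique([("en", "sun"), ("sv", "sol"), ("it", "sole")], toy))
print(symmetric(("en", "music"), ("it", "musica"), toy))
```

A 9-language clique is the same test applied to one item from each of the nine languages.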

Web corpora

  • Replaceable or replacable?
    • http://googlefight.com
    • http://looglefight.com

The web is

  • Very very large
  • Most languages
  • Most language types
  • Up-to-date
  • Free
  • Instant access

Web corpus types

  • Large, general corpora
  • Small, specialised corpora

Basic steps

  • Gather pages
    • CSE hits
    • Select and gather whole sites
    • General crawl
  • Filter
  • De-duplicate
  • Linguistic processing
  • Load into corpus tool
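
The steps after page gathering can be sketched as a small pipeline. This is a toy illustration under stated assumptions: the filter is a simple minimum word count, de-duplication only catches exact duplicates after case and whitespace normalisation, and "linguistic processing" is reduced to tokenisation. Real web-corpus pipelines use far more elaborate versions of each step, and all function names here are hypothetical.

```python
import hashlib
import re
from typing import List


def keep(page: str, min_words: int = 5) -> bool:
    """Filter: drop pages with too little running text (assumed threshold)."""
    return len(page.split()) >= min_words


def dedupe(pages: List[str]) -> List[str]:
    """De-duplicate: keep the first copy of pages that match after normalisation."""
    seen = set()
    unique = []
    for page in pages:
        normalised = " ".join(page.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique


def tokenise(page: str) -> List[str]:
    """Linguistic processing, reduced here to splitting words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", page)


def build_corpus(pages: List[str]) -> List[List[str]]:
    """Filter, de-duplicate, then tokenise; the result is ready to load into a tool."""
    filtered = [p for p in pages if keep(p)]
    return [tokenise(p) for p in dedupe(filtered)]


pages = [
    "Click here",
    "The cat sat on the mat today.",
    "the cat  sat on the mat today.",
    "Corpora are built from many web pages.",
]
corpus = build_corpus(pages)
print(len(corpus))
```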

WaC family corpora

  • 100m – 2b word corpora
  • 2-month project each
  • All major world languages available in Sketch Engine
    • Currently 42 languages
    • Growing monthly
      • Pioneers: Marco Baroni, Serge Sharoff
      • Corpus Factory
  • Seeds:
    • mid-frequency words from ‘core vocab’ lists and corpora
  • Google on seed words, then crawl

How good are they?

  • How to assess?
    • Hard question, open research topic
  • Good coverage
    • Newspapers: news, politics bias
    • Web corpora: also cover personal, kitchen vocab
  • Web corpus / BNC / journalism corpus
    • First two are close

Evaluating word sketches

  • 11 years
    • 1999-2011
  • Feedback
    • Good but anecdotal
  • Formal evaluation
  • Method also lets us evaluate corpora

Goal

  • Collocations dictionary
    • Model: Oxford Collocations Dictionary
    • Publication-quality
  • Ask a lexicographer
    • For 42 headwords
      • For 20 best collocates per headword
    • “should we include this collocation in a published dictionary?”

Sample of headwords

  • Nouns, verbs, adjectives; random
  • High (top 3000)
    • N: space solution opinion mass corporation leader
    • V: serve incorporate mix desire
    • Adj: high detailed open academic
  • Mid (3000-9999)
    • N: cattle repayment fundraising elder biologist sanitation
    • V: grieve classify ascertain implant
    • Adj: adjacent eldest prolific ill
  • Low (10,000-30,000)
    • N: predicament adulterer bake bombshell candy shellfish
    • V: slap outgrow plow traipse
    • Adj: neoclassical votive adulterous expandable

High recall

  • Lots of responses
  • Maybe not all good

High precision

  • Fewer hits
  • Higher confidence

Precision and recall

  • We test precision
  • Recall is harder
    • How do we find all the collocations that the system should have found?
    • Current work
      • 200 collocates per headword
      • Selected from
        • All the corpora we have
        • Various parameter settings
      • Plus just-in-time evaluation for 'new' collocates
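
The recall problem above can be sketched as pooling: candidates from several corpora and parameter settings are merged, judged, and the good ones form the reference set against which any single run is measured. The function names, run names and toy collocations are illustrative assumptions; the judgements would come from a lexicographer, not from code.

```python
from typing import Callable, Dict, Iterable, List, Set


def pooled_gold(runs: Dict[str, List[str]], is_good: Callable[[str], bool]) -> Set[str]:
    """Merge the candidates from all runs and keep those judged good."""
    pool = set().union(*runs.values())
    return {c for c in pool if is_good(c)}


def precision(run: Iterable[str], gold: Set[str]) -> float:
    """Share of the run's candidates that are good."""
    run = set(run)
    return len(run & gold) / len(run) if run else 0.0


def recall(run: Iterable[str], gold: Set[str]) -> float:
    """Share of the good collocations that the run found."""
    return len(set(run) & gold) / len(gold) if gold else 0.0


# Toy candidate lists for the headword "tea" (invented data).
runs = {
    "web_default": ["strong tea", "green tea", "the tea"],
    "web_strict": ["strong tea", "herbal tea"],
    "news": ["iced tea", "green tea"],
}
judged_good = {"strong tea", "green tea", "herbal tea", "iced tea"}

gold = pooled_gold(runs, lambda c: c in judged_good)
print(precision(runs["web_default"], gold), recall(runs["web_default"], gold))
```

Pooled recall is only relative to what some run found; a collocation missed by every run stays invisible.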

Four languages, three families

  • Dutch
    • ANW, 102m-word lexicographic corpus
  • English
    • UKWaC, 1.5b web corpus
  • Japanese
    • JpWaC, 400m web corpus
  • Slovene
    • FidaPlus, 620m lexicographic corpus

User evaluation

  • Evaluate whole system
    • Will it help with my task
      • Eg preparing a collocations dictionary
  • Contrast: developer evaluation

Components

  • Corpus
  • NLP tools
    • Segmenter, lemmatiser, POS-tagger
  • Sketch grammar
  • Statistics

Practicalities

  • Interface
    • Good, Good-but
      • Merge to good
    • Maybe, Maybe-specialised, Bad
      • Merge to bad
  • For each language
    • Two/three linguists/lexicographers
    • If they disagree
      • Don't use for computing performance
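
The scoring procedure above can be sketched as follows. The five labels and the two merges are taken from the slide; the assumption that disagreement is checked after merging (rather than on the raw labels) is ours, and the function name and toy judgements are hypothetical.

```python
from typing import Dict, List, Optional

# Label merges as given on the slide.
MERGE = {
    "Good": "good", "Good-but": "good",
    "Maybe": "bad", "Maybe-specialised": "bad", "Bad": "bad",
}


def merged_score(judgements: Dict[str, List[str]]) -> Optional[float]:
    """Share of collocations judged good, ignoring items where judges disagree."""
    kept = 0
    good = 0
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree (assumed: after merging), so skip the item
        kept += 1
        if merged == {"good"}:
            good += 1
    return good / kept if kept else None


# Toy judgements from two judges per collocation (invented data).
judgements = {
    "high standard": ["Good", "Good-but"],
    "high the": ["Bad", "Maybe"],
    "high hopes": ["Good", "Good"],
    "high altitude": ["Good", "Maybe-specialised"],
}
print(merged_score(judgements))
```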

Results

  • Dutch 66%
  • English 71%
  • Japanese 87%
  • Slovene 71%
  • Two thirds of a collocations dictionary can be gathered automatically

Thank you http://www.sketchengine.co.uk


Lexicography: finding facts about words

  • collocations
  • grammatical patterns
  • idioms
  • synonyms
  • meanings
  • translations

Four ages of corpus lexicography

  • Age 1: Pre-computer
  • Oxford English Dictionary: 5 million index cards

Age 2: KWIC Concordances

  • From 1980
  • Computerised
  • Overhauled lexicography
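
For readers unfamiliar with the format, a KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on each side. A minimal sketch, with a hypothetical `kwic` function and a toy sentence:

```python
from typing import List, Tuple


def kwic(tokens: List[str], keyword: str, width: int = 2) -> List[Tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


text = "the bank raised interest rates while we sat on the river bank at dusk"
concordance = kwic(text.split(), "bank")

for left, key, right in concordance:
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12}  {key}  {right}")
```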

Age 2: limitations

  • as corpora get bigger:
  • too much data
    • 50 lines for a word: read all
    • 500 lines: could read all, takes a long time, slow
    • 5000 lines: no

Age 3: Collocation statistics

  • Problem: too much data - how to summarise?
  • Solution: list of words occurring in neighbourhood of headword, with frequencies
  • Sorted by salience
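
A minimal sketch of the Age-3 approach: count the words that occur within a window of the headword, then sort them by a salience score. The slides do not name the statistic; a logDice-style score is used here purely as an assumed example, and the window size, function name and toy text are illustrative.

```python
from collections import Counter
from math import log2
from typing import List, Tuple


def collocates(tokens: List[str], headword: str,
               window: int = 1) -> List[Tuple[str, int, float]]:
    """Return (word, co-occurrence count, salience), most salient first."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok == headword:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    co[tokens[j]] += 1  # counted once per headword occurrence
    scored = [
        # Assumed salience: 14 + log2(2 * f_xy / (f_x + f_y)).
        (w, f, 14 + log2(2 * f / (freq[headword] + freq[w])))
        for w, f in co.items()
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)


text = "strong tea and strong coffee but weak tea with milk and strong tea"
ranked = collocates(text.split(), "tea")
for word, count, score in ranked:
    print(word, count, round(score, 2))
```

The output is a single undifferentiated list, which is exactly the limitation described in the next slide.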

Age-3 collocation statistics: limitations

  • Lists contain
    • junk
    • unsorted for type: mixes together adverbs, subjects, objects, prepositions
  • What we really want:
    • noise-free lists
    • one list for each grammatical relation

Age 4: The word sketch

  • Large well-balanced corpus
  • Parse to find
  • One list for each grammatical relation
  • Statistics to sort each list, as before
  • Working practice
  • Lexicographers mainly used sketches not concordances
    • missed less, more consistent
    • Faster
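
The word-sketch idea of one collocate list per grammatical relation can be sketched over part-of-speech-tagged text. The three adjacency patterns below stand in for a real sketch grammar and are assumptions for illustration, as are the tag names and the function; the lists are ranked by raw frequency here for brevity, where a real system would sort by a salience statistic.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)


def word_sketch(tagged: List[Tagged], headword: str) -> Dict[str, List[Tuple[str, int]]]:
    """Group collocates of a noun headword by (toy) grammatical relation."""
    sketch = defaultdict(Counter)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1 == "ADJ" and t2 == "NOUN" and w2 == headword:
            sketch["modifier"][w1] += 1      # e.g. "strong tea"
        if t1 == "VERB" and t2 == "NOUN" and w2 == headword:
            sketch["object_of"][w1] += 1     # e.g. "drink tea"
        if t1 == "NOUN" and t2 == "VERB" and w1 == headword:
            sketch["subject_of"][w2] += 1    # e.g. "tea cools"
    return {rel: counts.most_common() for rel, counts in sketch.items()}


tagged = [
    ("strong", "ADJ"), ("tea", "NOUN"), ("helps", "VERB"),
    ("drink", "VERB"), ("tea", "NOUN"),
    ("strong", "ADJ"), ("tea", "NOUN"), ("cools", "VERB"),
]
sketch = word_sketch(tagged, "tea")
for relation, items in sketch.items():
    print(relation, items)
```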

Euralex 2002

  • Can I have them for my language please

The Sketch Engine

  • Input:
  • Word sketches integrated with
  • Corpus query system
    • Supports complex searching, sorting etc
  • Credit: Pavel Rychly, Masaryk Univ

Customers

  • Dictionary publishers
    • Oxford University Press
    • Cambridge University Press
    • Collins
    • National dictionary projects in
      • Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
  • Universities
    • Teaching and research
    • Languages, linguistics, language technology
    • UK, Germany, US, Greece, Taiwan, Japan, China, …
  • Other
    • Language teaching, textbook writing
    • Information management, web search
  • Demo
    • http://sketchengine.co.uk
    • Free trial

What is there on the web?

  • Web1T
    • Present from Google
    • All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
      • 1,000,000,000,000
  • Compare with BNC
    • Take top 50,000 items of each
    • 105 Web1T words not in BNC top50k
    • 50 words with highest Web1T:BNC ratio
    • 50 words with lowest ratio
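
The comparison above can be sketched as follows: words found only in the web list are set aside, and the shared words are ranked by the ratio of their relative frequencies. Normalising by each list's total is our assumption (the slide only says "ratio"), and the function name and all counts are invented toy data.

```python
from typing import Dict, List, Tuple


def compare(web: Dict[str, int], bnc: Dict[str, int],
            n: int = 1) -> Tuple[List[str], List[str], List[str]]:
    """Return (web-only words, n highest-ratio words, n lowest-ratio words)."""
    web_total = sum(web.values())
    bnc_total = sum(bnc.values())
    shared = set(web) & set(bnc)
    # Ratio of relative frequencies, so the two corpus sizes cancel out.
    ratios = {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    web_only = sorted(set(web) - set(bnc))
    return web_only, ranked[:n], ranked[-n:]


# Invented toy counts; each list sums to 1000.
web = {"the": 800, "forum": 100, "browser": 60, "she": 30, "walked": 10}
bnc = {"the": 800, "she": 120, "walked": 60, "forum": 20}

web_only, web_high, web_low = compare(web, bnc)
print(web_only, web_high, web_low)
```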

Web-high (155 terms)

  • 61 web and computing
    • config browser spyware url www forum
  • 38 porn
  • 22 US English
  • 18 business/products common on web
    • poker viagra lingerie ringtone dvd casino rental collectible tiffany
    • NB: BNC is old
  • 4 legal

Web-low

  • Exclude British English, transcription/tokenisation anomalies
    • herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations

  • Pronouns and past tense verbs
    • Fiction
  • Masc vs fem
  • Yesterday
    • Probably daily newspapers
  • Constancy of ratios:
    • He/him/himself
    • She/her/herself

