Reference

DeepDict’s data are ultimately derived from large corpora (electronic text collections) that have been annotated with grammatical information by language-specific Constraint Grammar parsers. The most important pieces of information are so-called dependency links between words. The verb ‘eat’, for instance, “knows” who is eating (its subject) and what is being eaten (its object). Since the links hold between words, not phrases, a giant red fox eating a tiny cute rabbit becomes ‘fox’ -> ‘eat’ -> ‘rabbit’. Also, to achieve dictionary-style entries, inflected forms are normalized, so ‘eat’, ‘eats’, ‘eating’ and ‘ate’ all count as ‘eat’ (infinitive).
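As a rough illustration (with an invented lemma table and helper names, not DeepDict’s actual machinery), the normalization and word-level linking described above can be sketched as:

```python
# Hypothetical sketch: map inflected forms to a base form and reduce a
# parsed clause to a word-level dependency triple. The LEMMAS table and
# function names are invented for illustration only.

LEMMAS = {"eat": "eat", "eats": "eat", "eating": "eat", "ate": "eat",
          "fox": "fox", "foxes": "fox", "rabbit": "rabbit"}

def lemma(form: str) -> str:
    """Map an inflected form to its dictionary base form."""
    return LEMMAS.get(form.lower(), form.lower())

def triple(subject: str, verb: str, obj: str) -> tuple:
    """Reduce a parsed clause to a (subject, verb, object) lemma triple."""
    return (lemma(subject), lemma(verb), lemma(obj))

# 'a giant red fox eating a tiny cute rabbit' links head words only:
print(triple("fox", "eating", "rabbit"))  # ('fox', 'eat', 'rabbit')
```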

Certain word classes exhibit huge lexical variation without major differences in usage, and in order to prevent a combinatorial explosion in DeepDict’s statistical database, the following conventions have been adopted:

  • (cardinal) numerals are all expressed as ‘NUM’
  • relatives and interrogatives (who, that, where, how) are collapsed into ‘rel’ and ‘interr’
  • proper nouns (names) are expressed as ‘PROP’. For some languages, depending on parsing quality, subcategories are used instead:
    • hum = human
    • org = organisation
    • inst = institution
    • civ = administrative place (civitas)
    • top = natural place (topography)
    • event = event
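The collapsing conventions above amount to a token-normalization step before counting. The word lists, thresholds and subclass handling in this sketch are invented for illustration (a word like ‘where’ can of course be either relative or interrogative in real text; the parser decides that in context):

```python
# Hypothetical sketch of DeepDict's collapsing conventions: numerals
# become 'NUM', relatives/interrogatives 'rel'/'interr', and proper
# nouns 'PROP' (or a semantic subclass such as 'hum' or 'civ' when the
# parser supplies one).

REL = {"who", "that", "which", "where"}      # simplified, invented lists
INTERR = {"how", "what", "why"}

def collapse(token: str, is_proper: bool = False, subclass=None) -> str:
    if token.replace(".", "").replace(",", "").isdigit():
        return "NUM"
    if token.lower() in REL:
        return "rel"
    if token.lower() in INTERR:
        return "interr"
    if is_proper:
        return subclass or "PROP"
    return token.lower()

print(collapse("42"))                                      # 'NUM'
print(collapse("Paris", is_proper=True, subclass="civ"))   # 'civ'
print(collapse("who"))                                     # 'rel'
```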

Personal and quantifier pronouns are so frequent that exact statistical measures are of little interest. However, they may provide semantic information in a prototypical fashion, and they are therefore listed, in order of frequency, at the top of the subject and object fields. In English, for instance, personal pronouns may help classify activities as typically male (he) or female (she), or mark objects as inanimate (it) or mass nouns (much). Check out, for instance, the verbs ‘caress’ and ‘drink’.
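A frequency-ordered pronoun listing of this kind is easy to sketch; the counts below are invented, not taken from DeepDict:

```python
from collections import Counter

# Invented subject-slot pronoun counts for some verb; listing them by
# frequency gives the prototypical ordering shown at the top of a
# DeepDict subject field.
subjects = Counter({"he": 120, "she": 95, "it": 4})

top = [pronoun for pronoun, _ in subjects.most_common()]
print(top)  # most frequent pronoun first: ['he', 'she', 'it']
```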

In the parsers providing the corpus data behind DeepDict, nouns are classified according to semantic prototype class, e.g. as <Hprof> (professional human), <tool-cut> (cutting tool) or <Vair> (air vehicle), and this semantic generalisation has been made available for some DeepDict languages. For details, see the Semantic Prototype Overview (pdf).

Adverb-verb collocations come in several functional shades, ranging from (a) free temporal, locative and modal adverbs (work where/when/how) to (b) valency-bound adverbial complements (feel how, go where) and (c) verb-integrated particles (give up, fall apart). In some cases it may even be difficult to decide on one or the other category (eat out). Since DeepDict is primarily intended as a dictionary tool, such syntactic hair-splitting matters less: only the verb particles (c) are singled out, to cover phrasal verbs, while all other adverb-verb collocates are lumped together in a single (brown-shaded) field. Also, some of the most common adverbs without connotational specificity (like not, also, then) have been stripped altogether.

A similar point can be made for prepositional complements of verbs, which may be free adverbials (work after dinner), valency-bound place and direction adverbials (live in Paris) or regular objects (think of sb., believe in sth.). While the parsers behind DeepDict do make such distinctions, their dictionary value is limited, and all cases appear in the same prepositional collocation table (with prepositions in red and prepositional arguments with yellow shading). With the closest and most frequent collocates on top (rather than in alphabetical order), a dictionary-style listing of phrasal verbs is emulated.
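A closeness-and-frequency ranking like this can be sketched with a mutual-information-style collocation score. The measure, counts and verb below are assumptions for illustration; DeepDict’s actual formula is not specified here:

```python
import math

def pmi(cooc: int, f1: int, f2: int, n: int) -> float:
    """Pointwise mutual information of two words in a corpus of size n."""
    return math.log2((cooc / n) / ((f1 / n) * (f2 / n)))

corpus_size = 1_000_000
verb_freq = 5_000                  # invented frequency of, say, 'believe'
collocates = {                     # preposition: (co-occurrences, prep frequency)
    "in": (900, 30_000),
    "of": (50, 40_000),
    "after": (10, 20_000),
}

# Strongest collocate first, instead of alphabetical order:
ranked = sorted(collocates,
                key=lambda p: pmi(collocates[p][0], verb_freq,
                                  collocates[p][1], corpus_size),
                reverse=True)
print(ranked)  # ['in', 'of', 'after']
```

A high-frequency but promiscuous preposition (‘of’ above) thus ranks below a preposition that is specifically attracted to the verb (‘in’), which is what makes the listing read like a phrasal-verb dictionary entry.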

Directly beneath the head word, some general classification is provided for nouns (countable or mass) and verbs (transitivity). Note that this information, too, is corpus-derived rather than human-edited: it is based on collocation strengths involving certain determiners, numerals and objects. Typical uses will therefore overshadow theoretically possible but rare uses, and formally deprecated but common usage may be listed as well, if corroborated by corpus evidence.
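A toy version of such a corpus-derived classification, with invented determiner categories, counts and decision rule, might look like:

```python
# Hypothetical sketch: a noun that frequently follows indefinite articles
# and numerals leans countable; one that prefers mass quantifiers like
# 'much' or bare singular use leans mass. All names and counts are
# invented for illustration.

def classify_noun(det_counts: dict) -> str:
    countable = (det_counts.get("a", 0) + det_counts.get("NUM", 0)
                 + det_counts.get("many", 0))
    mass = det_counts.get("much", 0) + det_counts.get("BARE_SG", 0)
    return "countable" if countable > mass else "mass"

print(classify_noun({"a": 120, "NUM": 45, "much": 2}))      # 'countable'
print(classify_noun({"much": 80, "BARE_SG": 200, "a": 5}))  # 'mass'
```

Because the decision rests on observed collocations, a noun used both ways (‘two coffees’ vs. ‘much coffee’) is simply classified by its dominant pattern, which matches the “typical uses overshadow rare uses” behaviour described above.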
