Explanation of Fields
The distribution of entries in the WORD field by the number of letters that they contain is shown in Table 2.
Table 2. The Distribution of Word Lengths Given by NLET.
| NUMBER OF OCCURRENCES | NLET |
| 31 | 1 |
| 168 | 2 |
| 1342 | 3 |
| 4719 | 4 |
| 10199 | 5 |
| 16818 | 6 |
| 21118 | 7 |
| 22302 | 8 |
| 20426 | 9 |
| 16409 | 10 |
| 11697 | 11 |
| 7566 | 12 |
| 4451 | 13 |
| 2342 | 14 |
| 1158 | 15 |
| 479 | 16 |
| 250 | 17 |
| 81 | 18 |
| 32 | 19 |
| 14 | 20 |
| 4 | 21 |
| 1 | 22 |
| 2 | 23 |
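A tabulation of this kind can be reproduced from any list of WORD strings. The sketch below is illustrative only: the sample words are invented, and it assumes that NLET is simply the number of characters in the string, which may not match how the database counts hyphens or spaces.

```python
from collections import Counter


def length_distribution(words):
    """Return {length: number of words of that length}, sorted by length."""
    counts = Counter(len(word) for word in words)
    return dict(sorted(counts.items()))


# Hypothetical sample of WORD strings (not taken from the database).
sample_words = ["A", "AT", "CAT", "DOG", "HORSE"]
distribution = length_distribution(sample_words)

# Print in the same column order as Table 2.
print("| NUMBER OF OCCURRENCES | NLET |")
for nlet, occurrences in distribution.items():
    print(f"| {occurrences} | {nlet} |")
```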
The distribution of entries in the WORD field by the number of phonemes that they contain is shown in Table 3.
The distribution of entries in the WORD field by the number of syllables that they contain is shown in Table 4.
K-F-FREQ, K-F-NCATS, K-F-NSAMP
The first of these refers to a word's frequency of occurrence as given in the norms of Kucera and Francis (1967). The maximum frequency in the file is 69971; the minimum is 0. The meanings of K-F-NCATS and K-F-NSAMP are defined by Kucera and Francis (1967).
Table 3. The Distribution of Phoneme Counts Given by NPHON.
| NUMBER OF OCCURRENCES | NPHON |
| 109060 | 0 |
| 32 | 1 |
| 276 | 2 |
| 1442 | 3 |
| 3396 | 4 |
| 4561 | 5 |
| 4985 | 6 |
| 4691 | 7 |
| 4199 | 8 |
| 3317 | 9 |
| 2429 | 10 |
| 1536 | 11 |
| 862 | 12 |
| 450 | 13 |
| 206 | 14 |
| 110 | 15 |
| 42 | 16 |
| 9 | 17 |
| 3 | 18 |
| 3 | 19 |
Table 4. The Distribution of Syllable Counts Given by NSYL.
| NUMBER OF OCCURRENCES | NSYL |
| 58081 | 0 |
| 12485 | 1 |
| 32837 | 2 |
| 27751 | 3 |
| 14159 | 4 |
| 4530 | 5 |
| 856 | 6 |
| 134 | 7 |
| 14 | 8 |
| 1 | 9 |
This is the frequency of occurrence as given in the L count of Thorndike and Lorge (1944). If you plan to use this frequency count, you are advised to read the details about it in the Thorndike-Lorge book. For example, the frequency value of a singular word that has a regular plural includes the frequency of the plural form, and the same is true for other kinds of derivation.
This stands for the frequency of occurrence in verbal language derived from the London-Lund Corpus of English Conversation by Brown (1984). There are 14529 entries for 8985 different strings in the WORD field. The range of entries is 0 to 6833, with a mean of 35 and a standard deviation of 252.
FAM stands for 'printed familiarity'. The FAM values were derived from merging three sets of familiarity norms: Paivio (unpublished), Toglia and Battig (1978) and Gilhooly and Logie (1980). The method by which these three sets of norms were merged is described in detail in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). This method may not meet with everyone's approval. FAM values lie in the range 100 to 700, with a maximum entry of 657, a mean of 488 and a standard deviation of 99. Note that they are integer values (in the original norms the equivalent range was 1.00 to 7.00).
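Since the stored values are integers on a 100 to 700 scale, dividing by 100 recovers the 1.00 to 7.00 scale of the original norms. A minimal sketch follows; the function name is hypothetical, and treating values outside 100 to 700 as "no rating available" is an assumption of this example rather than something stated here.

```python
def to_original_scale(stored):
    """Map a stored integer rating (100-700) to the 1.00-7.00 scale.

    Returns None for values outside 100-700. Interpreting such values
    as 'no rating available' is an assumption of this sketch.
    """
    if not 100 <= stored <= 700:
        return None
    return stored / 100


print(to_original_scale(657))  # the maximum FAM entry quoted above
print(to_original_scale(488))  # the mean FAM value quoted above
```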
CONC is concreteness, and it too is derived from a merging of the Paivio, Colorado and Gilhooly-Logie norms; details of the merging are given in Appendix 2 of the MRC Psycholinguistic Database User Manual (Coltheart, 1981a). CONC values are integers in the range 100 to 700 (min 158; max 670; mean 438; s.d. 120).
This is imageability, derived from merging the three sets of norms referred to above, and having values in the range 100 to 700 (min 129; max 669; mean 450; s.d. 108).
These are the meaningfulness ratings from Toglia and Battig (1978), multiplied by 100 to produce a range from 100 to 700 (min 127; max 667; mean 415; s.d. 78).
This is the meaningfulness from the norms of Paivio (unpublished), multiplied by 100 to produce a range from 100 to 700. The two sets of meaningfulness ratings were not merged because their correlation was low (only +.529) and the mean values for a set of words common to the two sets of norms were very low (see Toglia and Battig, 1978, Table 2).
These differences are due to differences in the instructions given to subjects. The two sets of meaningfulness ratings are therefore not comparable, and so were kept separate (min 192; max 922; mean 600; s.d. 107).
This is age of acquisition from the norms of Gilhooly and Logie (1980), multiplied by 100 to produce a range from 100 to 700 (min 125; max 697; mean 405; s.d. 120).
When TQ2 has the value Q (40810 occurrences), this word is a derivational variant of another.
WTYPE is the syntactic category as represented in the SOED database assembled by Dolby, Resnikoff and MacMurray (1963). There are ten different syntactic categories, coded as shown in Table 5.
Table 5. Syntactic Category Codes for WTYPE
| SYNTACTIC CATEGORY | CODE | OCCURRENCES |
| Noun | N | 77355 |
| Adjective | J | 25547 |
| Verb | V | 30725 |
| Adverb | A | 4243 |
| Preposition | R | 230 |
| Conjunction | C | 108 |
| Pronoun | U | 134 |
| Interjection | I | 352 |
| Past Participle | P | 5939 |
| Other | O | 6136 |
When you are interested in syntactic category, WTYPE can sometimes be unsatisfactory. For example, the words FREEZE and HARASS are nouns according to WTYPE (as well as verbs); and indeed when these are looked up in SOED or Webster's, they are described as nouns. If you want to avoid such esoteric usages, PDWTYPE may be useful. It refers to the syntactic categories given in Jones' Pronouncing Dictionary (Jones, 1963), in which very unusual uses of words are not considered. However, PDWTYPE uses only four categories, not ten: noun (N, 22061 occurrences), verb (V, 6333 occurrences), adjective (J, 8817 occurrences) and other (O, 1179 occurrences). The mapping from WTYPE to PDWTYPE is shown in Table 6.
Table 6. The Mapping from WTYPE to PDWTYPE
| OCCURRENCES | WTYPE | PDWTYPE |
| 3751 | A | |
| 492 | A | O |
| 47 | C | |
| 61 | C | O |
| 261 | I | |
| 91 | I | O |
| 16730 | J | |
| 8817 | J | J |
| 55294 | N | |
| 22061 | N | N |
| 5785 | O | |
| 351 | O | O |
| 5939 | P | |
| 115 | R | |
| 115 | R | O |
| 65 | U | |
| 69 | U | O |
| 24392 | V | |
| 6333 | V | V |
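The rows of Table 6 can be checked against the counts given for WTYPE in Table 5 and for PDWTYPE in the text. The sketch below simply sums the table rows by each column; the variable and function names are hypothetical, and a blank PDWTYPE is represented by an empty string.

```python
from collections import defaultdict

# Rows of Table 6 as (occurrences, WTYPE, PDWTYPE); "" means PDWTYPE is blank.
TABLE_6 = [
    (3751, "A", ""), (492, "A", "O"),
    (47, "C", ""), (61, "C", "O"),
    (261, "I", ""), (91, "I", "O"),
    (16730, "J", ""), (8817, "J", "J"),
    (55294, "N", ""), (22061, "N", "N"),
    (5785, "O", ""), (351, "O", "O"),
    (5939, "P", ""),
    (115, "R", ""), (115, "R", "O"),
    (65, "U", ""), (69, "U", "O"),
    (24392, "V", ""), (6333, "V", "V"),
]


def totals_by(column):
    """Sum occurrences grouped by column 1 (WTYPE) or column 2 (PDWTYPE)."""
    totals = defaultdict(int)
    for row in TABLE_6:
        totals[row[column]] += row[0]
    return dict(totals)


wtype_totals = totals_by(1)
pdwtype_totals = totals_by(2)

print(wtype_totals["N"])    # 77355, the Noun count in Table 5
print(pdwtype_totals["O"])  # 1179, the count quoted for 'other'
```

Every WTYPE total obtained this way agrees with the corresponding row of Table 5, and the non-blank PDWTYPE totals agree with the four counts quoted in the text.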
If ALPHSYL = A, then the word is an abbreviation (130 occurrences); if S, the word is a suffix (282 occurrences); if P, a prefix (1374 occurrences); if H, the word is hyphenated (13716 occurrences); if T, a multi-word phrasal unit (436 occurrences). For all of these categories, NSYL = 0. For all other words ALPHSYL is blank.
The 15 possible categories of STATUS are listed in Table 7; these are as given in the Dolby database (Dolby et al., 1963) derived from the Shorter Oxford English Dictionary, and perusal of Table 7 should make the meanings of these categories sufficiently clear.
Table 7. The Possible Values of STATUS
| STATUS OF WORD | CODE | OCCURRENCES |
| Dialect | D | 2780 |
| Alien | F | 6003 |
| Archaic | A | 959 |
| Colloquial | Q | 405 |
| Capital | C | 2 |
| Erroneous | N | 0 |
| Nonsense | E | 62 |
| Nonce Word | W | 33 |
| Obsolete | O | 10549 |
| Poetical | P | 183 |
| Rare | R | 2756 |
| Rhetorical | H | 22 |
| Specialised | $ | 7731 |
| Standard | S | 58065 |
| Substandard | Z | 0 |
VAR refers to words which have the same spelling but different pronunciations and syntactic classes. When the pronunciations differ only in respect of stress (e.g. object, insult), VAR = O (212 occurrences). When the pronunciations differ phonemically (e.g. moderate, abuse), VAR = B (1233 occurrences).
If this = C, then the word is normally written with an initial capital letter. This can be used as an indicator of proper nouns such as the names of people, towns, states and countries.
IRREG refers to the plurality of words. Where IRREG = Z, the word is plural (17441 occurrences); this can be used in conjunction with TQ2 to select irregular forms. Where IRREG = Y, the word is a singular form (1024 occurrences); where IRREG = B, the word is both the singular and the plural form (151 occurrences); where IRREG = N, the word has no plural form (4407 occurrences); where IRREG = P, the word is plural but acts as a singular (88 occurrences).
The dictionary is ordered by the ASCII sequence of the strings in the WORD field. Although there are 150837 entries in the dictionary, there are only 115331 different strings. The distribution of homographs is as follows:
| NUMBER OF ENTRIES | NUMBER OF WORDS |
| 1 | 94225 |
| 2 | 22132 |
| 3 | 2967 |
| 4 | 703 |
| 5 | 96 |
| 6 | 20 |
| 7 | 5 |
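A distribution of this shape can be derived by counting how many entries share each string and then counting how many strings have each entry count. The following is a minimal sketch with invented sample data; it assumes one WORD value per dictionary entry and is not the procedure used to produce the figures above.

```python
from collections import Counter


def homograph_distribution(words):
    """Return {entries per string: number of distinct strings}."""
    entries_per_string = Counter(words)
    strings_per_count = Counter(entries_per_string.values())
    return dict(sorted(strings_per_count.items()))


# Hypothetical WORD values, one per dictionary entry.
entries = ["BEAR", "BEAR", "LEAD", "LEAD", "LEAD", "CAT"]
distribution = homograph_distribution(entries)
print(distribution)
```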
The 12th edition of Daniel Jones's Pronouncing Dictionary (Jones, 1963) was transferred to magnetic tape by Professor L. Guierre (Guierre, 1966). This tape is the basis of the phonetic transcriptions in the PHON field. The phonetic symbols used on this tape were adjusted following suggestions from Roger Mitton (see Mitton, 1986) to conform to the U.K. Alvey standard for machine-readable phonetic transcription (Wells, 1986). The changes in phonetic symbols from Coltheart (1981a) made by Quinlan (1986) include: devoiced consonants have been folded into their voiced equivalents; Coltheart (1981a) refers to the symbol 3, which has been dropped as no occurrence could be found; I( and U( have been mapped into I and U respectively. The symbols currently used in the PHON field are a '/' character to denote syllable boundaries and those presented in Table 8 with, where printable, the International Phonetic Alphabet equivalents. The DPHON field uses these symbols without the syllable distinguisher, but with the inclusion of the TQ2 symbols following the phonetic transcription. DPHON also includes the following three characters: - + R. The hyphen is used to represent the hyphen in hyphenated spellings. The 'R' character is used to represent a final R in the first part of a hyphenated word, which is only pronounced if the second part of the hyphenated word begins with a vowel. The '+' sign is used to indicate the division between the two parts of a compound noun written without a space (indicated by ALPHSYL = T) or hyphenation (indicated by ALPHSYL = H).
Table 8. Phonetic Symbols used in the Dictionary
| CONSONANTS | | | VOWELS | | |
| IPA PHONETIC SYMBOL | EXAMPLE | DATABASE PHONETIC SYMBOL | IPA PHONETIC SYMBOL | EXAMPLE | DATABASE PHONETIC SYMBOL |
| p | put | p | i: | bean | i |
| b | but | b | ɑ: | barn | A |
| t | ten | t | ɔ: | born | O (oh) |
| d | den | d | u: | boon | u |
| k | can | k | ɜ: | burn | 3 |
| m | man | m | ɪ | pit | I |
| n | not | n | ɛ | pet | e |
| l | like | l | æ | pat | & |
| r | run | r | ʌ | putt | V |
| f | full | f | ɒ | pot | 0 (zero) |
| v | very | v | ʊ | good | U |
| s | some | s | ə | about | @ |
| z | zeal | z | eɪ | bay | eI |
| h | hat | h | aɪ | buy | aI |
| w | went | w | ɔɪ | boy | oI (oh) |
| g | game | g | əʊ | no | @U |
| tʃ | chain | tS | aʊ | now | aU |
| dʒ | Jane | dZ | ɪə | peer | I@ |
| ŋ | long | 9 | ɛə | pair | e@ |
| θ | thin | T | ʊə | poor | u@ |
| ð | then | D | | | |
| ʃ | ship | S | | | |
| ʒ | measure | Z | | | |
| j | yes | j | | | |
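A PHON string can be split into syllables at each '/' and then into individual database symbols. The sketch below is one possible approach, not the database's own procedure: it tries two-character symbols (such as tS or eI) before single characters, the example transcriptions are hypothetical strings assembled from the Table 8 symbols, and any character not listed in Table 8 (for instance a stress mark, if the field contains one) raises an error.

```python
# Database symbols from Table 8. Two-character symbols are tried first.
TWO_CHAR = {"tS", "dZ", "eI", "aI", "oI", "@U", "aU", "I@", "e@", "u@"}
ONE_CHAR = set("pbtdkmnlrfvszhwg9TDSZj" "iAOu3Ie&V0U@")


def tokenize_syllable(syllable):
    """Split one syllable into database symbols by greedy longest match."""
    symbols = []
    pos = 0
    while pos < len(syllable):
        pair = syllable[pos:pos + 2]
        if pair in TWO_CHAR:
            symbols.append(pair)
            pos += 2
        elif syllable[pos] in ONE_CHAR:
            symbols.append(syllable[pos])
            pos += 1
        else:
            raise ValueError(f"unknown symbol {syllable[pos]!r} in {syllable!r}")
    return symbols


def parse_phon(phon):
    """Return a list of syllables, each a list of database symbols."""
    return [tokenize_syllable(syllable) for syllable in phon.split("/")]


# Hypothetical transcriptions built from the Table 8 symbols.
print(parse_phon("tSeIn"))
print(parse_phon("@/baUt"))
```

Greedy matching treats every "tS" as the single affricate symbol; a sequence of t followed by S inside one syllable would therefore be read as tS, which is a limitation of this sketch.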
Amsler, R.A. (1984). Machine-Readable Dictionaries. In M.E. Williams (Ed.), Annual Review of Information Science and Technology (ARIST), 19, 161-209. American Society for Information Science (ASIS); Knowledge Industry Publications, Inc.
Brown, G.D.A. (1984). A frequency count of 190,000 words in the London-Lund Corpus of English Conversation. Behavior Research Methods, Instruments, & Computers, 16 (6), 502-532.
Coltheart, M. (1981a). MRC Psycholinguistic Database User Manual: Version 1. [This is a now hard-to-find "in house" production. Mike Wilson has kindly provided an OCR transcript online.]
Coltheart, M. (1981b). The MRC Psycholinguistic Database. Quarterly Journal of Experimental Psychology, 33A, 497-505.
Dolby, J.L., Resnikoff, H.L. and MacMurray, F.L. (1963). A tape dictionary for linguistic experiments. In Proceedings of the American Federation of Information Processing Societies: Fall Joint Computer Conference, Volume 24. Baltimore, MD: Spartan Books. 419-23.
Gilhooly, K.J. and Logie, R.H. (1980). Age of acquisition, imagery, concreteness, familiarity and ambiguity measures for 1944 words. Behaviour Research Methods and Instrumentation, 12, 395-427.
Guierre, L. (1966). Un codage des mots anglais en vue de l'analyse automatique de leur structure phonétique. Études de linguistique appliquée, 4, 48-64.
Kiss, G.R., Armstrong, C., Milroy, R. and Piper, J. (1973). An associative thesaurus of English and its computer analysis. In Aitken, A.J., Bailey, R.W., and Hamilton-Smith, N. (Eds.), The Computer and Literary Studies. Edinburgh: University Press.
Kucera, H. and Francis, W.N. (1967). Computational Analysis of Present-Day American English. Providence: Brown University Press.
Mitton, R. (1986). A description of the files CUVOALD.DAT and CUV2.DAT. The machine usable form of the Oxford Advanced Learner's Dictionary. The Oxford Text Archive: Oxford, U.K.
Paivio, A., Yuille, J.C. and Madigan, S.A. (1968). Concreteness, imagery and meaningfulness values for 925 words. Journal of Experimental Psychology Monograph Supplement, 76 (3, part 2).
Quinlan, P. (1986). Description of machine-readable dictionary files. Report. Dept. of Psychology, Birkbeck College, London.
Svartvik, J. and Quirk, R. (1980). A Corpus of English Conversation. Lund: Gleerup.
Thorndike, E.L. and Lorge, I. (1944). The Teacher's Word Book of 30,000 Words. New York: Teachers College, Columbia University.
Toglia, M.P. and Battig, W.F. (1978). Handbook of Semantic Word Norms. New York: Erlbaum.
Wells, J.C. (1986). A standardised machine-readable phonetic notation. In Proceedings of the IEE conference on speech input/output: techniques and applications. London, Easter 1986.
