
Jan Svartvik, Department of English, Lund University, Sweden

Lexis in English language corpora

1. The second corpus generation

Many more years ago than I care to remember, on the occasion of my inaugural lecture at Lund University, I spoke with some enthusiasm about the bright future of corpus-based study of spoken language, what with tape-recorders getting smaller, and computers getting bigger. In 1992, at the Fifth Euralex Congress in Tampere, the future of corpus linguistics seems even brighter than on that previous occasion. Yet, while tape-recorders may indeed be a bit smaller (the stereo set, though, seems colossal compared to our gramophone), computers are actually getting smaller too: there has been a radical development from the mainframe to the micro, personal, desktop, laptop, palmtop and notebook. And computers are getting not only smaller but also faster and cheaper. This fantastic technological hardware development that we are witnessing is of course only one reason for my belief that the future of corpus linguistics is even brighter now than at the beginning of the seventies. The best part is that the hardware is also becoming well matched by software, and software development is indeed crucial if the corpus approach is going to fulfil its promise.

The meaning of "corpus" as given in most dictionaries is rather vague and gives little indication of bright prospects, for example:

- MACQUARIE DICTIONARY: "a body of data".
- COLLINS COBUILD DICTIONARY: "a large number of articles, books, magazines, etc that have been deliberately collected together for some purpose".
- LONGMAN DICTIONARY OF CONTEMPORARY ENGLISH: "a collection.. of material or information for study" (New edition, 1987).
- LONGMAN DICTIONARY OF THE ENGLISH LANGUAGE (New edition, 1991) is more explicit: "a collection of spoken and/or written language for scientific study of word formation, sentence structure, sounds, etc".

COBUILD adds the warning: "a formal, technical word" (but, like LONGMAN, also gives the helpful hint that the plural can be either corpora or corpuses). All of the definitions in these recent works fail to specify "machine-readable", which is of course the current norm and also the topic of this paper, in particular electronic corpora of spoken English.^1 Only LONGMAN gives a clear indication that there are, and should be, corpora of speech - by far the most common use of language and the variety that has too long been neglected in both grammatical and lexicographical description.

EURALEX '92 - PROCEEDINGS

It is not often that we can date the beginning of a new bud on the linguistic tree structure, but this is indeed possible with corpus linguistics, at least English corpus linguistics. It is now getting mature, just over 30 years of age. From the humble beginning engaging only a small number of linguists, corpora have become "the flavour of the decade" (Sinclair 1992: 379). The beginning of this movement was the making of the Brown Corpus of written American English, which set a pattern for the making of a host of corpora representing other varieties of English (for descriptions of English language corpora, see Aijmer & Altenberg 1991:315-318; Taylor, Leech & Fligelstone 1991). It was a typical feature of this first generation of corpora that they totalled one million words made up from 2000- or 5000-word samples intended to be representative of some of the uses of the language, and were made available on computer tape for batch processing on mainframe machines located behind glass doors and operated by systems engineers in white coats.

We are now beginning to experience the second generation of corpora. They are characterized by larger size than those of the first generation: for example, the British National Corpus is planned to include 100 million words (Quirk 1992), and the corpus used by one group working on machine translation is reported to total 365,893,263 words (Brown et al 1991). Instead of the "representative", finite size corpus of the first generation we are likely to be seeing more typological variation, such as the "monitor" corpus where "sources of language text in electronic form would be fed on a daily basis across filters which retrieve evidence as necessary" (Sinclair 1991:9). There is a movement in the direction of corpus pluralism: the index of the proceedings from a symposium on corpus linguistics, which took place in Stockholm a year ago, includes the following corpus types: core, dialect, expanded, grammatical, lexicographical, monitor, non-standard, regional, specialized, spoken, test and training corpus. Their days are by no means over, but "standard corpora" will probably serve more and more as stepping-stones to other, specific corpus types.
One obstacle to corpus use has been the lack of a standard encoding system, but this is now disappearing with the emergence of SGML (Standard Generalized Markup Language), which is likely to be in wide use. It is only to be hoped that SGML will also support a generalized system for prosodic transcription of spoken language (see Johansson 1991).

2. Why use a corpus in the first place?

Particularly in the last decade improved access to massive corpora, efficient machines and user-friendly programs has changed the working conditions of those linguists who use "real language data". Of course, not all linguists want to use corpora. In Chomsky's approach (1988:45), "externalized language" (E-language) and "internalized language" (I-language) are separate entities, and it is I-language, ie the native speaker's mental competence, that is the primary subject of linguistics. This view is, however, not shared by linguists such as Chafe, Fillmore, Halliday and Leech (all 1992), who rather emphasize the interdependence of linguistic theory building and language data analysis. Yet, while many linguists value corpus data, the terms "corpus linguistics", and even more so "corpus linguist", are considered unfortunate by Wallace Chafe: "The term 'corpus linguist' puts the emphasis on one tie to reality that has been neglected by many contemporary linguists, I believe to the great detriment of the field: a tie that must be vigorously pursued if our understanding of language and the mind is to enjoy significant progress. But there is a complementary danger in implying that that is all a linguist should do, of pitting corpus linguists against introspective linguists or experimental linguists or computational linguists. I would like to see the day when we will all be more versatile in our


occurrence. Future descriptive grammars and dictionaries are hardly likely to be produced without recourse to authentic examples.

Furthermore, corpus work will no doubt make its mark in many other areas like historical and applied linguistics. The CD-ROM versions of such historical depositories as the OED and the Helsinki Corpus of English Texts (see Kyto 1991) are likely to open up new possibilities in the field of diachronic studies (as examples of what a historical corpus can offer, see work by Matti Rissanen and his group at Helsinki, such as Nevalainen 1991 and Raumolin-Brunberg 1991). The now easily retrievable historical data can shed new light on historical developments such as the influx of Romance lexical material and the influence of French on English but also on theoretical issues, for example the relation of grammar and lexis, as stated in a recent study of suffixal derivation in Middle English: "Interesting though they were, the results of the morphological analysis, were not always significant. In the end it became fairly clear that it was semantics which was the more powerful driving force behind the shifts and reshuffles in the Middle English derivational system. Potentially, this is a finding which could feed back into our understanding and theoretical conception of word-formation and its position in a model of grammar as it seems to me to underline the role of the lexicon" (Dalton-Puffer 1991:327).

In language teaching, assuming that both teaching methods and exposure to authentic language are important for language learning, there is naturally much to be learned from "real data", as opposed to the "concocted examples" often used in linguistic studies or the "pedagogical language" as commonly encountered in language learning textbooks. We all have some experience of students coming to university with a naive attitude to usage as being either correct or incorrect. For such students, a hands-on, self-access experience of real data in the classroom could provide a valuable eye-opener to the wider linguistic issues of frequency, acceptability, collocability and style in current usage (see Tribble & Jones 1990).

3. Corpora of spoken English

All handbooks in linguistics have long stressed the importance of the spoken language, and for some time now we have witnessed novel approaches to the study of spoken discourse. Our contribution at Lund University to this field was the launching of the Survey of Spoken English in the mid-seventies. Our first undertaking was to obtain suitable data. Having been an associate of the research team on Randolph Quirk's Survey of English Usage at University College London in the sixties, it was a natural step to make use of this corpus by computerizing the spoken component of the carefully transcribed material, then stored only on paper slips in Foster Court filing cabinets. Given the technology available to us at the time, computerization of such complicated data with its detailed prosodic transcription was by no means a simple task, but the operation was nevertheless considered essential for three main reasons. We wanted, first, to have easy access to the material at our Lund base; second, to make use of the computer's superb possibilities as a tool for retrieval, storage, classification, etc.; third, to be able to share the database with fellow researchers no matter where they happened to be working.

The original version of the London-Lund Corpus of Spoken English, which was distributed on computer tape and included 87 texts, became available in 1980, when we also published a printed book including conversations in the corpus (Svartvik & Quirk 1980). The complete version, including all 100 texts (see the description in Greenbaum & Svartvik 1990), totalling half a million words, recently appeared in a CD-ROM version together with other English language corpora, and all with retrieval tools (WordCruncher and TACT) included.^2 The majority of the texts in the London-Lund Corpus are conversations. One reason for this is that informal, spontaneous, interactive discourse is by far the most common form of language use, another that it has been an underresearched area of modern English; this was conspicuously so in the late fifties when the plans were drawn up for the London Survey (see Quirk 1960).

The chief aim of the Survey of English Usage was to create a basis for studying English grammar rather than its lexis. For general lexical work, such as dictionary-making, a corpus of one million words, half of them written, half spoken, is clearly inadequate. For comparison, Cobuild, which is a project dedicated to lexical computing, has a text corpus of general English which "stands at around 20 million words in daily use, backed up by a range of more specialised texts coming to a total of about another 20 million" (Sinclair 1987: vii). Yet, while the London-Lund Corpus has been used chiefly for studies of grammar and discourse (see Greenbaum & Svartvik 1990, Appendix 2), it can indeed be used also for lexical studies, particularly if we take the view that grammar and lexis form a continuum and focus on Murray's "little words".

I will now briefly survey some areas where lexical work has been done on corpus-based spoken English: statistical vocabulary studies, adverbials and prosody, discourse items, register variation, semantic fields, and collocation. Most of these areas also hold great promise for future research.

4. Statistical vocabulary studies

The aim of the first uses of corpora, including those B.C. (before computers), was chiefly lexico-statistical. The studies on English by Thorndike (1921), Fries & Traver (1940), Thorndike & Lorge (1944), and Bongers (1947) were closely connected with language teaching and the "vocabulary control movement". In his work on vocabulary Palmer included six thousand collocations which led him to suggest that even common collocations "exceed by far the popular estimate of the number of simple words contained in our everyday vocabulary", thus "throwing a new light on the nature of vocabulary" (1933:7; for a useful survey of this field, see Kennedy 1992).

So far the most extensive dedicated pedagogical use of corpora has been to produce statistics on frequency of vocabulary items and structural patterns. One form of information derived from word frequency counts is that, in most texts, a small number of different words (ie types) account for a very large proportion of all word tokens: in most written texts 5,000 words will account for up to 95% of the tokens, and 1,000 words will account for 85%; in speech, 50 function words account for up to 60% of the tokens (cf Kennedy 1992:339; for LOB analyses, see Johansson & Hofland 1989). Recent approaches, such as the lexical syllabus (Sinclair & Renouf 1987), highlight the common uses of common words, stressing the importance of the good company of words rather than the large number of words. Hence the foremost task for language learners is not to learn as many words as possible but the highly frequent words in their customary environment (cf Sinclair 1987:159):

Table 1. The 50 most frequent words in SEC and comparisons with LOB, Brown, and LLC

Words in SEC rank order (1-50): the, of, and, to, a, in, that, was, for, it, he, is, on, as, at, his, with, I, but, by, 's, be, this, one, you, from, they, have, we, an, are, were, all, not, which, there, had, their, been, n't, two, so, has, said, who, or, when, can, up, will.

Surviving rank values (SEC, LOB, Brown, LLC): been 39, 37, 43, 68; will 50, 48, 47, -. [Comparison ranks for the other words are missing from this copy.]

Table 2. Sums of rank differences for the 50 most common words in SEC

Comparisons: SEC vs. LOB; SEC vs. Brown; SEC vs. LLC; LOB vs. Brown; LOB vs. LLC. [Values are missing from this copy.]
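
As a minimal sketch of the two lexico-statistical measures used in this section, the following Python fragment computes the share of tokens covered by the most frequent types and the sum of rank differences between two frequency lists (the formula described in note 5). The function names, the toy sentences, the alphabetical tie-breaking and the decision to skip words missing from either list are assumptions made for illustration only; this is not Ekedahl's (1992) implementation.

```python
from collections import Counter


def ranks(tokens):
    """Map each word type to its frequency rank (1 = most frequent)."""
    counts = Counter(tokens)
    # Sort by descending frequency; ties are broken alphabetically so that
    # the ranking is deterministic (an assumption of this sketch).
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def coverage(tokens, n_types):
    """Share of all tokens accounted for by the n_types most frequent types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(n_types))
    return covered / len(tokens)


def rank_difference_sum(ranks_a, ranks_b, words):
    """Sum of |R1i - R2i| over the given words.

    Words missing from either list are skipped (an assumption of this sketch).
    """
    return sum(abs(ranks_a[w] - ranks_b[w])
               for w in words
               if w in ranks_a and w in ranks_b)


# Toy data, invented for illustration; not taken from any corpus.
corpus_a = "the cat sat on the mat and the dog sat too".split()
corpus_b = "the dog and the cat and the bird sat".split()

ranks_a = ranks(corpus_a)
ranks_b = ranks(corpus_b)
top_a = sorted(ranks_a, key=ranks_a.get)[:3]  # three most frequent types in A

print(coverage(corpus_a, 2))
print(rank_difference_sum(ranks_a, ranks_b, top_a))
```

On real data the word list passed to `rank_difference_sum` would be the 50 most frequent words of the reference corpus, and a smaller sum would indicate more similar frequency rankings.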


Adverbials occupy an intermediate position on the grammar/lexis continuum: they have specific grammatical functions but form a large, open lexical class with a wide range of meanings. Clearly they must be properly covered in the dictionary. Grammatical tagging of entries in dictionaries is now fairly commonplace, at least in learners' dictionaries, but it is of course doubtful whether this type of information is properly used. My own experience is that it is not. There are several likely reasons for this: one is that so far there have been only weak or nonexistent attempts on the part of lexicographers to establish a solid link between grammar, lexis and prosody; another that there is no universally accepted system of grammatical and prosodic categories; most importantly, once we leave the reasonably obvious lexical definition of the word and enter the nebulous realms of grammar and prosody, the level of linguistic abstraction makes definitions more complicated. The understanding of, and motivation to learn, terms like "disjunct", "falling-rising tone" and even "transitive" are bound to be limited among general dictionary users who are accustomed to look up words in a dictionary mainly to check spelling or meaning. Yet as dictionaries have become more and more specialized and geared to the needs of different user-categories, those users who are familiar with grammatical and prosodic terminology are likely to benefit from more complete information than is offered in general-purpose dictionaries. Although it carries meaning, prosody has been almost totally neglected in dictionaries.

6. Discourse items

In the word-class tagging of part of the spoken corpus that we undertook at Lund, it became clear that the set of traditional word-classes was inadequate. Hence we devised a new tagset consisting of over 200 categories. This is large in comparison with other similar sets: the tagged Brown Corpus uses 179 different wordtags, the LOB tagset comprises 132 tags, and the Leeds tagset 137 tags (for a description of the tagset, see Svartvik 1990: 94; for the implementation of probabilistic word-class tagging on LLC and the design of a model for morphological knowledge representation, see Eeg-Olofsson 1991). The types of problems we faced can be exemplified by mm, you know and sort of thing. 'Responses' transcribed as m, mm or mhm are usually not to be found in dictionaries; COBUILD seems to be an exception here: "Mm is used in writing to represent a sound that you make when someone is talking, to indicate that you are listening to them, that you agree with them, or that you are preparing to say something" (928).

The frequency list indicated that the verbs know, think, mean, see were extremely frequent in spoken as compared with written English. The reason is of course that a word-based frequency list fails to capture word combinations like you know, you see and I mean functioning as 'softeners', 'responses' such as I see, that's right, and 'hedges' such as sort of thing, which tend to find a place neither in dictionaries nor grammars. Yet in a sample of 50,000 words such 'discourse items' occupy fourth place, ahead of the well-established grammatical word-classes of prepositions, adverbs, conjunctions and adjectives. 'Discourse items' which are almost exclusively restricted to spoken discourse have been divided into groups (cf Nattinger 1988: 78-79; Stenström 1990:144; Stenström forthcom-


Table 3

you know 152
[m] [m] 128
yes yes 120
I think 106
sort of 100
you see 95
oh yes 94
isn't it 88
and then 82
which is 81
I mean 74
and he 73
and they 72
thank you 72
at all 65

Table 4

at the moment 203
for a moment 16
at this moment 12
in a moment 11
one moment 8
for the moment 6
just a moment 5
wait a moment 4
for one moment 4
a few moments 4
that moment 3
a moment ago 2
a moment please 2
any moment 2
at any given moment 2
dreadful moment 2
from the moment 2
of the moment 2
this moment 2
at this very moment 2
within a matter of moments 2
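
Frequency lists of recurrent word combinations like those in Tables 3 and 4 can be approximated by counting runs of adjacent words. The following Python fragment is only a sketch under stated assumptions: the function names, the `min_freq` threshold, the whitespace tokenization and the invented toy transcript are all hypothetical, and the procedure is not claimed to be the one used to produce the tables.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def recurrent_combinations(tokens, n, min_freq=2):
    """Return (combination, frequency) pairs occurring at least min_freq times,
    most frequent first."""
    counts = Counter(ngrams(tokens, n))
    return [(" ".join(gram), freq)
            for gram, freq in counts.most_common()
            if freq >= min_freq]


def combinations_with(tokens, node, sizes=(2, 3)):
    """Count combinations of the given sizes that contain the node word."""
    counts = Counter()
    for n in sizes:
        for gram in ngrams(tokens, n):
            if node in gram:
                counts[" ".join(gram)] += 1
    return counts


# Invented toy transcript, not taken from any corpus.
text = ("you know i think at the moment you know it is fine "
        "just a moment i think at the moment it is")
tokens = text.split()

print(recurrent_combinations(tokens, 2))
print(combinations_with(tokens, "moment").most_common(3))
```

A word-based count over the same tokens would list you, know and moment separately, which is exactly the information loss that counting combinations is meant to avoid.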

Table 5

[Tree diagram of thanking expressions. Top-level forms: thanks (19%), many thanks (1%), thank you (80%). The branches show optional intensifiers (very much, very much indeed, so much, awfully) and a following "for NP" or NP as against no continuation (Ø); the branch percentages cannot be reconstructed from this copy.]


mer/producer of the process, and the speech process is radically different from the writing process, in particular with its real-time constraint.

8. Semantic fields

What appears to be a most fruitful lexical use of corpora is the analysis of specific semantic fields and pragmatic categories. In his study of the expression of modality, Hermerén (1986) found, among other things, that verbs are used much more frequently than other word classes to express Obligation, Permission, Volition and their negated equivalents, yet "modal auxiliaries express these modalities less often than the exponents of other word classes put together", and modal nouns are generally more frequent in written than spoken English (90).

Similarly, in her study of epistemic modality as expressed in some ESL textbooks as compared with real corpus-data, Janet Holmes has shown that many textbook writers "devote an unjustifiably large amount of attention to modal verbs, neglecting alternative linguistic strategies for expressing doubt and certainty" (1988: 40). Such alternatives include lexical verbs (appear, believe, doubt, seem, suggest, etc), adverbials (apparently, certainly, doubtless, inevitably, necessarily, etc) and nouns (belief, certainty, idea, opinion, possibility, tendency, etc). The reason for the traditional emphasis on modal verbs to the exclusion of lexical verbs, adverbials and nouns can be traced to structural grammars where the morphological peculiarities of modal auxiliaries (lack of third-person -s, infinitive, and participle forms, etc) naturally place these auxiliaries high on the list of teaching items. Other semantically equivalent expressions (suggest, apparently, belief, etc) do not constitute any morphological problem and, consequently, have no place in a morphologically-biassed textbook.

Kennedy has studied the uses of certain lexical items such as between and through. While they are among the most frequent words in the English language there is neither descriptive nor pedagogical guidance about them. In addition to offering a statistical dimension to this area, Kennedy provides information about their occurrence: "like other structural words, [they] are learnt not as representatives of word classes or lexemes in isolation, but in association with other words" (1991:110).

9. Collocation

Large collections of real data offer a rich, but as yet largely uncultivated, field for studying habitual cooccurrences of lexical items, whether they be called lexical phrases, collocations, prefabs or preassembled chunks. Some such multi-word items belong to the speech-specific categories already mentioned (if you don't mind, etc), but most types do not appear to be characteristic of either the spoken or written varieties. Yet there is a reason why such prefabs may be considered particularly relevant for the student of spoken discourse. Interactive speech takes place in real time which - unlike written discourse - offers no opportunity of resorting for help to a dictionary, a friend or an embassy. In the typical information structure of speech we speak in brief chunks (ie information units, tone units) which are often made up of habitual cooccurrences.


Notes

1 I want to thank Bengt Altenberg and Anne Wichmann for comments on a draft of this paper.

2 The title of the CD-ROM (ISBN 82-72834fc4-7, December 1991) is "ICAME Collection of English Language Corpora". It includes the Brown, Helsinki, Kolhapur, LOB, and London-Lund corpora and is distributed by Norwegian Computing Centre for the Humanities, Bergen, Norway, P.O. Box 53, N-5027 Bergen, Norway.

3 The project "Public Speaking" is funded by the Swedish Council for Research in the Humanities and Social Sciences (HSFR).

4 From LLC only a list of 100 was available, hence the two missing words, their and will. The contractions 's and n't are defined as words only in SEC. "Not would have a rank of 15 in SEC if all the negations were counted together. The 's total comprises contractions of both is and has. If we add up all occurrences of is, we get the total of 619, which would have a rank of 7. Contracted forms have been counted as distinct words in the other corpora" (Ekedahl 1992).

5 The Ekedahl (1992) formula used was $\sum_{i=1}^{50} |R_{1i} - R_{2i}|$, where $R_{1i}$ is the rank of word number $i$ in the first list, and $R_{2i}$ is the rank of the same word in the second list; $i$ is the number of the word in the SEC list and varies between 1 and 50. The two '|' mean that the value between them is always to be turned into a positive number.

References

Aijmer, Karin & Bengt Altenberg (eds.). 1991. English corpus linguistics. London: Longman.
Allén, Sture. 1992. "Opening address". In Svartvik (ed.), 1-3.
Allerton, D.J. & A. Cruttenden. 1976. "The intonation of medial and final sentence adverbials in British English". Archivum Linguisticum 7: 29-59.
Altenberg, Bengt. 1990. "Spoken English and the dictionary". In Svartvik (ed.), 177-191.
Altenberg, Bengt. 1991. "The London-Lund Corpus of Spoken English: Research and applications". Proceedings of the Seventh Annual Conference of the UW Centre for the New OED and Text, 71^3. University of Waterloo, Waterloo, Ontario, Canada.
Biber, Douglas. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Bohm, Cecilia. 1992. Readability analysis by computer. An evaluation of the readability programme Corporate Voice. [Research paper.] Department of English, Lund University.
Bolinger, Dwight. 1975. Aspects of language. New York: Harcourt Brace.
Bongers, H. 1947. The history and principles of vocabulary control. Woerden: Wocopi.
Brown, Peter F., Vincent J. Della Pietra, Peter V. de Souza, Jenifer C. Lai & Robert L. Mercer. 1991. "Class-based n-gram models of natural language". [Paper for the Pisa conference on European corpus resources, 24-26 January 1992.]
Chafe, Wallace. 1992. "The importance of corpus linguistics to understanding the nature of language". In Svartvik (ed.), 79-97.
Chomsky, Noam. 1988. Generative grammar: Its basis, development and prospects. Kyoto: Kyoto University of Foreign Studies.
Collins Cobuild English language dictionary. 1987. London: Collins.
Dalton-Puffer, Christiane. 1991. Suffixal derivation in Middle English. A corpus-based study. Ph.D. dissertation, Department of English, University of Vienna.
Eeg-Olofsson, Mats. 1991. Word-class tagging. Some computational tools. [Ph.D. Diss.] Department of Computational Linguistics, University of Goteborg.
Ekedahl, Olof. 1992. Word and tag frequencies in SEC. [Research paper.] Department of English, Lund University.
Fillmore, Charles J. 1992. "'Corpus linguistics' or 'Computer-aided armchair linguistics'". In Svartvik (ed.), 35-60.
Fries, Charles C. & A. Aileen Traver. 1940. English word lists. A study of their adaptability for instruction. Washington: American Council on Education.
Greenbaum, Sidney & Jan Svartvik. 1990. "The London-Lund Corpus of Spoken English". In Svartvik (ed.) 1990, 11-45.
Halliday, M.A.K. 1992. "Language as system and language as instance: The corpus as a theoretical construct". In Svartvik (ed.), 61-78.
Hermerén, Lars. 1986. "Modalities in spoken and written English. An inventory of forms". In English in speech and writing: A symposium, edited by Gunnel Tottie & Ingegerd Backlund, 57-91. Studia Anglistica Upsaliensia 60. Stockholm: Almqvist & Wiksell.
Holmes, Janet. 1988. "Doubt and certainty in ESL textbooks". Applied Linguistics 9: 21-44.
Johansson, Stig. 1991. "Some thoughts on the encoding of spoken texts in machine-readable form" [MS].
Johansson, S. & K. Hofland. 1989. Frequency analysis of English vocabulary and grammar. Oxford: Oxford University Press.
Kennedy, Graeme. 1991. "Between and through: The company they keep and the functions they serve". In Aijmer & Altenberg (eds.), 95-110.
Kennedy, Graeme. 1992. "Preferred ways of putting things with implications for language teaching". In Svartvik (ed.), 335-373.
Kjellmer, Goran. 1991. "A mint of phrases". In Aijmer & Altenberg (eds.), 111-127.
Knowles, Gerry. 1990. "The use of spoken and written corpora in the teaching of language and linguistics". Literary and Linguistic Computing 5:45^8.
Kucera, Henry. 1992. "The odd couple: The linguist and the software engineer. The struggle for high quality computerized language aids". In Svartvik (ed.), 401-420.
Kyto, Merja. 1991. Manual to the diachronic part of the Helsinki corpus of English texts. Coding conventions and lists of source texts. Department of English, University of Helsinki.
Leech, Geoffrey. 1992. "Corpora and theories of linguistic performance". In Svartvik (ed.), 105-122.
Longman Dictionary of Contemporary English. 1987. New edition. London: Longman.
Longman Dictionary of the English Language. 1991. New edition. London: Longman.
Macquarie dictionary. 1991. Second edition. Macquarie University.
Murray, K.M. Elisabeth. 1977. Caught in the web of words: James Murray and the Oxford English Dictionary. New Haven and London: Yale University Press.
Nattinger, J. 1988. "Some current trends in vocabulary teaching". In Vocabulary and language teaching, edited by Ronald Carter & M. McCarthy, 62^2. London: Longman.
Nevalainen, Terttu. 1991. "But, only, just". Focusing adverbial change in Modern English 1500-1900. Helsinki: Société Néophilologique.
Palmer, Harold E. 1933. Second interim report on English collocations. Tokyo: Institute for Research in English Teaching.
Quirk, Randolph. 1960. "Towards a description of English usage". Transactions of the Philological Society 1960: 40-61.
Quirk, Randolph. 1992. "On corpus principles and design". In Svartvik (ed.), 457-469.
Raumolin-Brunberg, Helena. 1991. The noun phrase in early sixteenth-century English. A study based on Sir Thomas More's writings. Helsinki: Société Néophilologique.
Sinclair, John. 1987. "The nature of the evidence". In Looking up. An account of the COBUILD project in lexical computing, edited by John Sinclair, 150-166. London: Collins.
Sinclair, John. 1991. Corpus, concordance, collocation. Oxford: Oxford University Press.
Sinclair, John. 1992. "The automatic analysis of corpora". In Svartvik (ed.), 379-397.