
CORPUS LINGUISTICS

a short introduction
Tatiana I. Shutova, LUNN
What is a corpus?

A corpus (pl. corpora) is a large and structured set of
texts (nowadays usually electronically stored and
processed).

What can a corpus of texts be used for?
Types of corpora
- Specialized
- General
- Written
- Spoken
- Demographic
- Context-governed
- Multilingual
- Parallel
- Learner
- Historical
- Diachronic
- etc.
=> used for
- hypothesis testing
- checking occurrences and combinations of words
- validating linguistic rules
- providing examples of actual language use for teaching languages
- compiling dictionaries, tests and exercises
- tracing (sociolinguistic) changes in the language
- etc.
Software for Corpus Studies:
AntConc
- works only with plain-text files with the .txt extension (e.g. Hamlet.txt)
- AntConc will not read .doc, .docx or .pdf files. You will need to convert these into .txt files.
- To save a file as a .txt file, open Notepad (on Windows) or TextEdit (on Mac), paste the text there and save it as plain text.
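If the text has already been extracted from the original document, the last step can also be scripted. The following is only a minimal sketch: the helper name `save_as_txt` is made up for illustration, and it assumes you already have the text as a string (it does not convert .doc or .pdf files itself).

```python
from pathlib import Path


def save_as_txt(text: str, path: str) -> Path:
    """Write text to a UTF-8 plain-text file, forcing a .txt extension.

    Hypothetical helper for illustration; it only saves text that has
    already been extracted, it does not read .doc/.docx/.pdf files.
    """
    target = Path(path).with_suffix(".txt")  # "Hamlet.doc" -> "Hamlet.txt"
    target.write_text(text, encoding="utf-8")
    return target


if __name__ == "__main__":
    saved = save_as_txt("To be, or not to be, that is the question.", "Hamlet")
    print(saved)
```

UTF-8 is chosen here as a widely supported encoding; which encoding your AntConc installation expects is a setting worth checking.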
AntConc: Launch
AntConc: intro







- Concordance: Keyword in Context (KWIC) view.
- Concordance Plot: a very simple visualization of your KWIC search, where each instance is represented as a little black line from beginning to end of each file containing the search term.
- File View: a full file view for larger context of a result.
- Clusters: words which very frequently appear together.
- Collocates: words which are statistically likely to appear together in a corpus.
- Word List: all the words in your corpus, with their frequencies.
- Keyword List: words that are unusually frequent in your corpus compared with a reference corpus.
Loading corpus into AntConc
AntConc Basic Functions:
Keyword in Context (KWIC)

Write down definitions for the word “shot” (2 min.)

Type “shot” into the search term bar. What do you see?
What meanings of this word can you identify? Did you
guess all of them?
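To see what a KWIC view does under the hood, here is a rough sketch in Python. It is not AntConc's implementation; the function name `kwic`, the `width` parameter and the sample sentences are all invented for illustration.

```python
import re


def kwic(text, term, width=30):
    """Return (left context, hit, right context) for each whole-word match.

    Illustrative sketch only: context is measured in characters,
    and matching is case-insensitive.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


# Made-up sample text with three different senses of "shot"
sample = ("He took a shot of espresso. The shot missed the goal. "
          "She shot the photo at dawn.")

for left, hit, right in kwic(sample, "shot", width=20):
    # Right-align the left context so the search term lines up in a column
    print(f"{left:>20} | {hit} | {right}")
```

Lining the hits up in one column is what makes the different meanings easy to spot when you scan the contexts on either side.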
AntConc Basic Functions:
Keyword in Context (KWIC) – contd.
AntConc Basic Functions:
Search Operators

The * operator (wildcard)
The * operator can help, for instance, find both the
singular and the plural forms of nouns.

Task: Search for qualit*, then sort this search. What tends
to precede and follow quality & qualities? (Hint: they’re
different words, and have different contexts. Look for
patterns in usage using the KWIC!)
AntConc Basic Functions:
Search Operators – contd.

To find out the difference between * and ?, search for
th*n and th?n.

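The ? operator is more specific than the * operator: ? stands for exactly one character, while * stands for any number of characters (including none).
- wom?n – both women and woman
- m?n – man and men, but also min
- contrast with m*n: not helpful, because you’ll also get mean, melon, etc.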
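The difference between the two operators can be imitated with regular expressions. This is a sketch under the assumption that * means "zero or more word characters" and ? means "exactly one word character"; the helper `wildcard_to_regex` and the word sample are invented here and are not part of AntConc.

```python
import re


def wildcard_to_regex(query):
    """Translate a search term with * and ? into a whole-word regex.

    Assumption for this sketch: * = zero or more word characters,
    ? = exactly one word character.
    """
    parts = []
    for ch in query:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


# Made-up list of words to search in
words = "man men min mean melon moon woman women than thin thicken"

print(wildcard_to_regex("m?n").findall(words))    # only three-letter words
print(wildcard_to_regex("m*n").findall(words))    # also mean, melon, moon
print(wildcard_to_regex("wom?n").findall(words))
print(wildcard_to_regex("th*n").findall(words))
```

Running the two m-searches side by side shows why m*n is less useful for studying man and men: it lets in any word that merely starts with m and ends with n.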
AntConc Basic Functions:
Search Operators – contd.





Task: Compare these two searches: wom?n and m?n
- sort each search in a meaningful way (e.g., by search term, then 1L, then 2L)
- File > Save output to text file (and add the .txt extension to the file name)
- open the plain text file in your text editor and study the information
- do this for each of the two searches and then look at the two text files side by side. What do you notice?
AntConc: Basic Functions
Collocates and wordlists
- Collocation – a tendency / requirement for words to co-occur / make sense together
- Task: Generate collocates for m?n and wom?n. Now sort them by frequency at position 1L (one word to the left of the search term).
- This tells us about what makes a man or woman ‘movieworthy’:
– women have to be ‘beautiful’ or ‘pregnant’ or ‘sophisticated’
– men have to be somehow outside the norm – ‘holy’ or ‘black’ or ‘old’
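Counting what stands at position 1L can be sketched in a few lines of Python. This only counts raw frequencies, whereas a collocate tool normally also reports a statistical score; the function name `left_collocates`, the simple tokenizer and the sample text are assumptions made for illustration.

```python
import re
from collections import Counter


def left_collocates(text, node_pattern, position=1):
    """Count the words found `position` tokens to the left of each node word.

    Sketch only: raw counts, a naive tokenizer, and no attention to
    sentence boundaries.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    node = re.compile(node_pattern)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if i >= position and node.fullmatch(tok):
            counts[tokens[i - position]] += 1
    return counts


# Made-up sample text, not taken from any real corpus
sample = ("An old man met a beautiful woman. The old man and the holy man "
          "spoke to the pregnant woman.")

# "m.n" and "wom.n" are the regex counterparts of m?n and wom?n
print(left_collocates(sample, r"m.n").most_common())
print(left_collocates(sample, r"wom.n").most_common())
```

Sorting by the count at 1L is what brings the typical modifiers of each search term to the top of the list.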
AntConc: Basic Functions
Comparing Corpora

Corpus compared against Reference Corpus (bigger)

Keyness measures how unusually frequent a word is in the text when
compared with its frequency in a reference corpus.
Words with high keyness point to the most characteristic topics of the
text / corpus.
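One common way to put a number on keyness is the log-likelihood statistic, which compares a word's observed frequency in each corpus with the frequency expected if the word were equally common in both. The sketch below assumes that statistic; which measure your AntConc installation actually uses depends on its settings, and the function name and the example counts are invented.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of one word (illustrative sketch).

    freq_* = occurrences of the word in each corpus,
    size_* = total number of words in each corpus.
    """
    total = freq_target + freq_ref
    # Expected frequencies if the word were spread evenly across both corpora
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Invented counts: a 1,000-word text against a 10,000-word reference corpus
print(log_likelihood(10, 1000, 100, 10000))  # same relative frequency in both
print(log_likelihood(50, 1000, 100, 10000))  # five times more frequent in the text
```

The score grows as the two relative frequencies drift apart, but it does not say in which corpus the word is more frequent, so the raw frequencies still need to be checked.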
AntConc: Basic Functions
Comparing Corpora / Keywords


- Settings > Tool preferences > Keyword List
- Under ‘Reference Corpus’ make sure “Use raw files” is checked
- Add Directory > open the folder containing the files that make up the reference corpus
- Ensure you have a whole list of files!
What are our keywords?

What sorts of research questions can you come up with?
Comparing what and what? Identifying what and what?
Limitations of Corpus Linguistics




- A corpus won’t tell us if something is impossible in the language
- Generalizations based on corpora are inferences, not facts
- Corpora give us evidence but no explanation / information about the culture, society, etc.
- Corpora give us language out of context (no pictures, body language, behavior, etc.)
Questions?

Thank you!