This post was last edited by cracode on 2015-5-14 20:31
Corpus of Contemporary American English Frequency List
The following word list contains approximately 500,000 word forms with their parts of speech, each of which appears at least four times in the 410-million-word Corpus of Contemporary American English (COCA) [http://www.americancorpus.org].
The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. The corpus was created by Mark Davies of Brigham Young University, and it is used by tens of thousands of users every month (linguists, teachers, translators, and other researchers). COCA is also related to other large corpora that we have created.
The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. It includes 20 million words for each year from 1990 to 2012 and is updated regularly (the most recent texts are from Summer 2012). Because of its design, it is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language (see the 2011 article in Literary and Linguistic Computing).
The version of the corpus used for this word list is the version from January 2011, which contains texts up through June 2010.
•Only those forms that occur four times or more in the corpus are listed here, but COCA contains all word forms and PoS.
•The list is not case-sensitive, so for example [Brown] and [brown] are grouped together. Often, however, these can be distinguished by the part of speech tag [np1] for proper nouns, e.g. "brown np1" or "bush np1".
•We did not create the part-of-speech tagger (CLAWS); it was developed at Lancaster University (UK): http://ucrel.lancs.ac.uk/claws/
•For help with the part of speech codes, see the explanations at: http://ucrel.lancs.ac.uk/claws7tags.html.
•See also the explanation (at the bottom of that page) for PoS tags with two numbers (e.g. [out ii21] or [well ii32])
•The tagger often generates two or more possible PoS tags for a given word in a given context (e.g. "and *cuts in"). While COCA contains all possibilities, only the first / most probable tag is used for this frequency list.
•Lower frequency forms are obviously much less accurate than higher-frequency forms. In other words, lower frequency forms contain errors. We already know this; this is the way taggers work. There is no need to contact us or the CLAWS tagger team to let us know about these forms.
•For similar lists from the British National Corpus (100 million words, 1980s-1993) — also based on CLAWS — see:
http://ucrel.lancs.ac.uk/bncfreq/flists.html
http://www.kilgarriff.co.uk/bnc-readme.html
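The two-number PoS tags mentioned above (e.g. [out ii21] or [well ii32]) follow the CLAWS convention for multi-word units: the first digit is the number of words in the sequence and the second is this word's position in it. A minimal Python sketch of splitting such a tag is given below. The function name `parse_ditto` and the `DittoTag` type are made up for illustration, and the rule "a tag ending in two digits is a multi-word tag" is an assumption that should be checked against the CLAWS7 tag list before relying on it.

```python
import re
from typing import NamedTuple, Optional


class DittoTag(NamedTuple):
    base: str      # ordinary PoS tag, e.g. "ii" (preposition)
    length: int    # number of words in the multi-word unit
    position: int  # 1-based position of this word in the unit


def parse_ditto(tag: str) -> Optional[DittoTag]:
    """Split a two-number tag such as "ii21" into its parts.

    Returns None for ordinary tags such as "np1" or "vv0".
    Assumption: only multi-word tags end in two digits.
    """
    m = re.fullmatch(r"([a-z]+\d?)(\d)(\d)", tag.lower())
    if m is None:
        return None
    base, length, position = m.group(1), int(m.group(2)), int(m.group(3))
    # A unit needs at least two words, and the position must fall inside it.
    if length < 2 or not 1 <= position <= length:
        return None
    return DittoTag(base, length, position)


print(parse_ditto("ii21"))  # first word of a two-word preposition
print(parse_ditto("ii32"))  # second word of a three-word preposition
print(parse_ditto("np1"))   # ordinary tag, not a multi-word tag
```

With this reading, "out ii21" is the first word of a two-word preposition and "well ii32" is the second word of a three-word one.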
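The number is the word's frequency rank: the smaller the number, the more common the word; the larger the number, the rarer it is.
The different part-of-speech forms of a word are listed from most to least frequent. For example, smooth is most frequent as an adjective, followed by verb and then noun.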
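As a rough illustration of the rank and PoS ordering, here is a small Python sketch that groups entries by word form and orders each word's tags by rank. It assumes each line of the list has three whitespace-separated columns (rank, word, PoS tag); the actual layout of the downloaded file may differ. The rank values in `SAMPLE` are invented placeholders, not real COCA ranks, and `tags_by_word` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented sample rows: rank, word, CLAWS tag. The ranks are placeholders.
SAMPLE = """\
2100\tsmooth\tjj
9800\tsmooth\tvv0
31000\tsmooth\tnn1
1500\tbrown\tjj
1700\tbrown\tnp1
"""


def tags_by_word(text: str) -> dict:
    """Map each lowercased word to its PoS tags, most frequent first."""
    groups = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        # Skip blank lines, headers, and anything not in the assumed layout.
        if len(parts) != 3 or not parts[0].isdigit():
            continue
        rank, word, tag = parts
        groups[word.lower()].append((int(rank), tag))
    # A smaller rank means a more frequent form, so sort ascending.
    return {w: [tag for _, tag in sorted(rows)] for w, rows in groups.items()}


result = tags_by_word(SAMPLE)
print(result["smooth"])
print(result["brown"])
```

Because the list is not case-sensitive, a proper noun such as Brown can only be told apart from the adjective by its np1 tag, which is why the tags are kept alongside each word.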
Link: http://pan.baidu.com/s/1dDIWDET Password: pg2t