Corpora
All STL members have access to the online corpora made available by the Université de Lille and the CNRS. The corpora described below were acquired by STL and are therefore accessible only to members of our UMR:
- Corpus of Contemporary American English (COCA): COCA is the only large, genre-balanced corpus of American English. It is probably the most widely used corpus of English, and it is related to many other corpora from English-Corpora.org, which offer unparalleled insight into variation in English.
The corpus contains more than one billion words of text (25+ million words each year, 1990-2019) from eight genres: spoken, fiction, popular magazines, newspapers, academic texts, and (with the update in March 2020) TV and movie subtitles, blogs, and other web pages.
Corpus acquired as part of the ANR REM project.
- Corpus of Historical American English (COHA): COHA is the largest structured corpus of historical English. It is related to many other corpora from English-Corpora.org, which offer unparalleled insight into variation in English. If you are interested in historical corpora, you might also look at the Google Books, Hansard, and TIME corpora.
COHA contains more than 400 million words of text from the 1810s-2000s (which makes it 50-100 times as large as other comparable historical corpora of English), and the corpus is balanced by genre decade by decade. The corpus was created with a grant from the National Endowment for the Humanities (NEH), 2008-2010.
Corpus acquired as part of the ANR REM project.
- News on the Web (NOW) Corpus: The NOW corpus contains 12.0 billion words of data from web-based newspapers and magazines from 2010 to the present time (the most recent day is 2021-02-15). More importantly, the corpus grows by about 180-200 million words of data each month (from about 300,000 new articles), or about two billion words each year.
While other resources like Google Trends show you what people are searching for, the NOW Corpus is the only structured corpus that shows you what is actually happening in the language, virtually right up to the present time. For example, it can show the frequency of words since 2010, as well as new words and phrases from the last few years.
Corpus acquired as part of the I-SITE Neolog project.
- TV Corpus: The TV Corpus contains 325 million words of data in 75,000 TV episodes from the 1950s to the current time. Each of the 75,000 episodes is linked to its IMDb entry, which means that you can create Virtual Corpora using extensive metadata: year, country, series, rating, genre, plot summary, etc.
The TV Corpus (along with the Movies Corpus) is a great resource for looking at very informal language, at least as good as corpora of actual spoken English. In addition, the TV Corpus is much larger than any other corpus of informal English (apart from other corpora from English-Corpora.org). For example, it is about 33 times as large as the conversation portion of the BNC (including its 2014 update). The corpus also allows you to look at variation over time (1950s-1970s to 1990s-2010s) and variation between dialects (e.g. American and British English). In this sense, the corpus is related to many other corpora from English-Corpora.org, which offer unparalleled insight into variation in English.
Corpus acquired as part of Diana Oliveira Santos's doctoral thesis.