
Language-Related Datasets in R Packages #RAdventJP

corpus r ling

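 This is the day-8 post for R Advent Calendar 2014. Partly as a note to myself, I have compiled a list of language-related datasets that ship with R packages (updated from time to time).*1 To load data from the zipfR package, please refer to the package manual.*2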

dataset description
acq {tm} This dataset holds 50 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic acq dealing with corporate acquisitions.
alice {languageR} The text of Lewis Carroll's 'Alice's Adventures in Wonderland', with punctuation marks removed.
AssociatedPress {topicmodels} Associated Press data from the First Text Retrieval Conference (TREC-1) 1992.
author {ca} This data matrix contains the counts of the 26 letters of the alphabet (columns of matrix) for 12 different novels (rows of matrix). Each row contains letter counts in a sample of text from each work, excluding proper nouns.
auxiliaries {languageR} For 285 regular and irregular Dutch verbs, the auxiliary for the present and past perfect is listed together with the count of verbal synsets in WordNet. Regular and irregular verbs are matched in the mean for lemma frequency.
beginningReaders {languageR} Visual lexical decision latencies for beginning readers (8 year-old Dutch children).
bfm {textometry} A lexical table containing frequencies of adverbs from the BFM (Base de Français Médiéval) database in 5 different domains (literary, historical, didactic, law, religious).
BNCbiber {corpora} This data set contains a table of the relative frequencies (per 1000 words) of 65 linguistic features (Biber 1988, 1995) for each text document in the British National Corpus (Aston & Burnard 1998).
BNCcomparison {corpora} This data set compares the frequencies of 60 selected nouns in the written and spoken parts of the British National Corpus, World Edition (BNC). Nouns were chosen from three frequency bands, namely the 20 most frequent nouns in the corpus, 20 nouns with approximately 1000 occurrences, and 20 nouns with approximately 100 occurrences.
BNCInChargeOf {corpora} This data set lists collocations (in the sense of Sinclair 1991) of the phrase in charge of found in the British National Corpus, World Edition (BNC). A span size of 3 and a frequency threshold of 5 were used, i.e. all words that occur at least five times within a distance of three tokens from the key phrase in charge of are listed as collocates.
BNCmeta {corpora} This data set provides complete metadata for all 4048 texts of the British National Corpus (XML edition).
Brown {zipfR} Brown.tfl, Brown.spc and Brown.emp.vgc are zipfR objects of classes tfl, spc and vgc, respectively.
BrownSubsets {zipfR} Objects of classes spc and vgc that contain frequency data for various subsets of words from the Brown corpus (see Kucera and Francis 1967).
cora {lda} A collection of 2410 scientific documents in LDA format with links and titles from the Cora search engine.
corpora {lsa} This data set contains example corpora for essay scoring.
crude {tm} This data set holds 20 news articles with additional meta information from the Reuters-21578 data set. All documents belong to the topic crude dealing with crude oil.
danish {languageR} Auditory lexical decision latencies for Danish complex words.
dative {languageR} Data describing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection.
dativeSimplified {languageR} Data describing the realization of the dative as NP or PP in the Switchboard corpus and the Treebank Wall Street Journal collection. Simplified version of the dative data set.
Dickens {zipfR} Objects of classes spc and vgc that contain frequency data for a collection of Dickens’s works from Project Gutenberg, and for 3 novels (Oliver Twist, Great Expectations and Our Mutual Friend).
durationsGe {languageR} Durational measurements on the Dutch prefix ge- in the Spoken Dutch Corpus.
durationsOnt {languageR} Durational measurements on the Dutch prefix ont- in the Spoken Dutch Corpus.
dutchSpeakersDist {languageR} A distance matrix for the conversations of 165 speakers in the Spoken Dutch Corpus. Metadata on the speakers are available in a separate dataset, dutchSpeakersDistMeta.
dutchSpeakersDistMeta {languageR} Meta-data for the cross-entropy based between-speaker distance matrix dutchSpeakersDist
english {languageR} This data set gives mean visual lexical decision latencies and word naming latencies to 2284 monomorphemic English nouns and verbs, averaged for old and young subjects, with various predictor variables.
etymology {languageR} Estimated etymological age for regular and irregular monomorphemic Dutch verbs, together with other distributional predictors of regularity.
faz {languageR} Frequencies of references to previous years in issues of the Frankfurter Allgemeine Zeitung published in 1994.
finalDevoicing {languageR} Phonological specifications for onset, nucleus and offset for 1697 Dutch monomorphemic words with a final obstruent. These final obstruents may exhibit a voicing alternation that is traditionally described as syllable-final devoicing: underlying /d/ in /hond/ becomes a /t/ when syllable-final ([hOnt]) and remains a /d/ otherwise ([hOn-den]).
havelaar {languageR} The frequency of the determiner 'het' in the Dutch novel 'Max Havelaar' by Multatuli (Eduard Douwes Dekker), in 99 consecutive text fragments of 1000 tokens each.
heid {languageR} A simplified version of the primingHeid dataset.
imaging {languageR} Filtered fMRI signal at the most significant voxel and average priming scores for brain-damaged patients, in a study addressing the extent to which phonological and semantic processes recruit the same brain areas.
ItaPref {zipfR} ItaRi.spc and ItaRi.emp.vgc are zipfR objects of classes spc and vgc, respectively. They contain frequency data for all verbal lemmas with the prefix ri- (similar to English re-) in the Italian la Repubblica corpus.
latinsquare {languageR} Simulated lexical decision latencies with SOA as treatment, using a Latin Square design with subjects and items, as available in Raaijmakers et al. (1999).
lexdec {languageR} Lexical decision latencies elicited from 21 subjects for 79 English concrete nouns, with variables linked to subject or word.
lexicalMeasures {languageR} Lexical distributional measures for 2233 English monomorphemic words. This dataset provides a subset of the data available in the dataset english.
lexicalMeasuresClasses {languageR} A data frame labelling the lexical measures in the dataset lexicalMeasures as measures of form or meaning.
moby {languageR} The text of H. Melville's 'Moby Dick', with punctuation marks removed.
nesscg {languageR} Frequency (m) and frequency of frequency (Vm) for string types with the suffix -ness in the context-governed subcorpus of the British National Corpus sampling spoken British English.
nessdemog {languageR} Frequency (m) and frequency of frequency (Vm) for string types with the suffix -ness in the demographic subcorpus of the British National Corpus sampling spoken British English.
nessw {languageR} Frequency (m) and frequency of frequency (Vm) for string types with the suffix -ness in the subcorpus of the British National Corpus sampling written British English.
oldFrench {languageR} Frequencies of 35 morphosyntactic tag trigrams in 343 Old French texts.
oldFrenchMeta {languageR} Meta data for the oldFrench data, a matrix of frequencies for texts (rows) by tag trigrams (columns). The meta data provide information on the texts, manuscript variants, their authors, their region and approximate date of origin, their general topic, and their genre.
oz {languageR} The text of L. F. Baum's 'The Wonderful Wizard of Oz', with punctuation marks removed.
periphrasticDo {languageR} The development of periphrastic do in English: Ellegard's counts for the use of do across four sentence types in 11 consecutive time periods between 1390 and 1710.
phylogeny {languageR} Phylogenetic relations between Papuan and Oceanic languages: 127 grammatical traits (absent/present) for 31 languages.
primingHeid {languageR} Primed lexical decision latencies for Dutch neologisms ending in the suffix -heid.
primingHeidPrevRT {languageR} Primed lexical decision latencies for Dutch neologisms ending in the suffix -heid, with information on RTs to preceding trials added to the data already in primingHeid.
quasif {languageR} Simulated lexical decision latencies with SOA as treatment, traditionally requiring an analysis using quasi-F ratios, as available in Raaijmakers et al. (1999).
ratings {languageR} Subjective frequency ratings, ratings of estimated weight, and ratings of estimated size, averaged over subjects, for 81 concrete English nouns.
regularity {languageR} Regular and irregular Dutch verbs and selected lexical and distributional properties.
robespierre {textometry} A lexical table containing frequencies of 5 words from 9 different public discourses of French politician Robespierre (between November 1793 and July 1794).
selfPacedReadingHeid {languageR} Self-paced reading latencies for Dutch neologisms ending in the suffix -heid.
shrinkage {languageR} Simulated data set for illustrating shrinkage.
sizeRatings {languageR} Subjective estimates of the size of the referents of 81 English concrete nouns, collected from 38 subjects.
spam {kernlab} A data set collected at Hewlett-Packard Labs, that classifies 4601 e-mails as spam or non-spam. In addition to this class label there are 57 variables indicating the frequency of certain words and characters in the e-mail.
spanish {languageR} Relative frequencies of the 120 most frequent tag trigrams in 15 texts contributed by 3 authors.
spanishFunctionWords {languageR} Relative frequencies of the 120 most frequent function words in 15 texts contributed by 3 authors.
spanishMeta {languageR} By-text metadata for the spanish and spanishFunctionWords data sets.
splitplot {languageR} Simulated lexical decision latencies with priming as treatment and reaction time in lexical decision as dependent variable.
through {languageR} The text of Lewis Carroll's 'Through the Looking Glass', with punctuation marks removed.
Tiger {zipfR} Objects of classes tfl, spc and vgc that contain frequency data for the syntactic expansions of Noun Phrases (NP) and Prepositional Phrases (PP) in the Tiger German treebank.
twente {languageR} Frequency (m) and frequency of frequency (Vm) for string types in the Twente News Corpus.
variationLijk {languageR} This dataset documents variation in the use of the suffix -lijk, as realized in 32 words, in spoken Dutch across region (Flanders versus The Netherlands), sex (females versus males) and education (high versus mid).
ver {languageR} Semantic transparency (dichotomous) and frequency for 985 words with the Dutch prefix ver-.
verbs {languageR} A simplified version of the dative data set, used for expository purposes only.
VSS {corpora} This data set contains a small corpus (8043 tokens) of short stories from the collection Very Short Stories (VSS).
warlpiri {languageR} This data set documents the use of ergative case marking in the narratives of native speakers of Lajamanu Warlpiri (8 children, 13 adults) describing events in picture books.
weightRatings {languageR} Subjective estimates on a seven-point scale of the weight of the referents of 81 English nouns.
writtenVariationLijk {languageR} This dataset documents variation in the use of the 80 most frequent words ending in the suffix -lijk in written Dutch.
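
Each row in the list above gives a dataset name followed by the package that provides it in braces. As a minimal sketch of how such entries can be loaded, the R code below uses `utils::data()`. The helper `load_dataset()` is a hypothetical convenience function written for this illustration, not part of any of the packages listed, and the code assumes the packages have already been installed from CRAN. For some packages the names accepted by `data()` differ from the entry name in the list (the zipfR entry Brown, for example, bundles Brown.tfl, Brown.spc and Brown.emp.vgc), so check the package manual when a call returns nothing.

```r
# Hypothetical helper (not part of any package): load one dataset by name
# from an installed package, or return NULL if it cannot be found.
load_dataset <- function(name, pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message("Package '", pkg, "' is not installed; skipping '", name, "'.")
    return(invisible(NULL))
  }
  env <- new.env()
  # Load into a private environment so the global workspace is not touched.
  utils::data(list = name, package = pkg, envir = env)
  if (!exists(name, envir = env, inherits = FALSE)) {
    return(invisible(NULL))
  }
  get(name, envir = env)
}

# "crude {tm}" in the list: dataset crude from package tm.
crude <- load_dataset("crude", "tm")

# zipfR entries such as Brown bundle several objects, loaded one by one.
brown_names <- c("Brown.tfl", "Brown.spc", "Brown.emp.vgc")
brown <- lapply(brown_names, load_dataset, pkg = "zipfR")
names(brown) <- brown_names

# List every dataset a package ships (names and titles), if it is installed.
if (requireNamespace("languageR", quietly = TRUE)) {
  print(utils::data(package = "languageR")$results[, c("Item", "Title")])
}
```

Loading into a separate environment keeps the example from overwriting objects of the same name in the workspace; in interactive use, `library(zipfR); data(Brown.spc)` is the more common form.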

 If you know of any other language-related datasets, I would be very grateful if you could let me know at @langstat.

(Addendum) I am also scheduled to write the day-20 post for the R Advent Calendar 2014 on Qiita.

*1: Here, the term "language-related" is interpreted in a fairly broad sense.

*2: For example, Brown contains several datasets, such as Brown.tfl, Brown.spc, and Brown.emp.vgc, each of which must be loaded individually.