The Crosslinguistic Corpus of Hesitation Phenomena, collated beginning in 2012

================================================================
Crosslinguistic Corpus of Hesitation Phenomena (CCHP)
http://filledpause.com/chp/cchp
Last updated: 2012/09/19
================================================================

Thank you for downloading all or some of the CCHP. This
ReadMe.txt file is intended to give a technical overview of the
corpus as well as stand as a record of updates to the corpus.
Although it is possible to download only parts of the corpus,
this file should accompany all downloads.

----------------------------------------------------------------
License

This work is licensed under the Creative Commons
Attribution-NonCommercial-ShareAlike 3.0 Unported
License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-nc-sa/3.0/
(see also license.html).

----------------------------------------------------------------
Overview

The Crosslinguistic Corpus of Hesitation Phenomena (CCHP) is
designed for research into the first and second language use of
hesitation phenomena in various kinds of elicited speech.  In
particular, it is designed to allow a comparison across first
and second language speech by recording responses to parallel
elicitation tasks in both languages.  The recordings are
transcribed with special attention to the use of hesitation
phenomena but also with a view toward high transcription
accuracy and thus high usability by other researchers.  Since
the construction of the corpus is being funded by a Japanese
government research grant, it is being made publicly available
for the benefit of other researchers and learners.

----------------------------------------------------------------
Technical Description

Participants in the corpus are all university students who were
recruited through advertisements on university bulletin boards.
After signing a consent form which informed them of the public
distribution of the corpus, each participant was asked to make
three recordings of about 3-4 minutes each in each of their
first and second languages.  The elicitation tasks for the
three recordings were as follows (in the order performed).

- Reading aloud: Participants were given a printed text and were
  asked to read it aloud.  They were given no advance
  preparation time.

- Picture description: Participants were shown a picture or
  cartoon strip and asked to describe it.  This was repeated
  several times in order to fill the 3-4 minute target time.
  They were told they could take a few seconds to study each
  picture, but were asked to begin speaking as soon as possible.

- Topic narrative: Participants were given a topic to talk about
  freely (e.g., describe the sport of basketball).  They were
  asked to imagine that they were speaking to someone during
  this task.  If necessary, a second topic (e.g., table tennis)
  was given to fill the 3-4 minute target.

The participants were recorded in a sound-attenuated room using
an AKG C300 microphone channeled through an ART Dual Pre
microphone pre-amp to a Toshiba Dynabook R731 in mono 16-bit
48kHz quality.  The files were processed using the normalize
and noise reduction functions in Audacity (ver. 2.0.1;
http://audacity.sourceforge.net/).  The audio files are
provided in the CCHP archive as wav files for further analysis
and also as more portable mp3 files.

Each recording has been transcribed by two transcribers
independently.  The transcribers are native speakers of the
participants' first language and advanced speakers of the
participants' second language.  These two
transcriptions were checked by a third transcriber who focused
on resolving differences between the two transcriptions as well
as double-checking for errors.

The most detailed transcriptions are contained in the XML files.
For the most part, the annotations should be self-explanatory.
Following is an overview of most of the elements.

  <TRANSCRIPT> represents one recording.  Attributes on this
  element indicate the language the participant spoke in, the
  elicitation task the recording responds to, and some
  demographic details about the participant.
  
  <T>, which stands for "token", represents standalone words
  or partial words (shown with a hash mark '#' at the cut-off
  point) as well as filled pauses.
  
  Filled pauses (typically uh/um in English, e-/e-to in
  Japanese) were marked as <T> elements like other words but
  have a FILLED-PAUSE='yes' attribute.
  
  <UTTERANCE> marks a complete utterance.  Utterance boundaries
  were determined by intonation primarily, though occasionally
  by the presence of long pauses followed by an utterance
  clearly intended as new.
  
  <PUNC> marks punctuation.  Though unspoken, of course, these
  are provided at the end of each utterance for processing
  purposes (e.g., for creating the minimal text transcriptions
  described below).
  
  <RP> demarcates repair sequences.  The reparandum is marked
  with an <O> tag (for "Original") and the repair is marked with
  an <E> (for "rEpair").  Editing terms like filled pauses or
  interjections were placed between <O> and <E> elements.  Also,
  when speakers made multiple attempts at repairs, these were
  marked as <E> elements.  Hence, the final <E> node under a
  <RP> node represents the repaired speech.
  
  <RT> denotes a repeat sequence.  The structure is similar to
  the <RP> sequence with <O> marking the original sequence of
  words and <E> marking the repetition, with multiple <E> tags
  showing iterated repetition.  In rare cases, there is a <T>
  element between the <O> and <E> elements indicating a filled
  pause.
  
  <FS> indicates a sequence of words which constitutes a false
  start.
  
  <OH> indicates an interjection of some sort (e.g., "Oh",
  "Ah").
  
  <AHEM/> indicates throat-clearing (i.e., "ahem").
  
  <SIGH/> indicates a sigh.
  
  <ING/> indicates a sound made when sucking air in through
  closed teeth.
  
  <IA> is used to mark a sequence of words which transcribers
  found indeterminate.  In some cases, a guess has been provided
  within the <IA> element, but this was not always possible.
  
  <BREAK/> indicates the boundary between pictures or topics in
  the picture description and topic narrative elicitation tasks.
  
  <PAUSE/> indicates a silent pause.
  
  For all elements that take up some time (e.g., <T>, <PAUSE>,
  <AHEM>), attributes giving the start time and the end time
  of that element are provided.  These times are measured from
  the start of the recording.
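
As a concrete illustration, here is a minimal sketch of working
with a transcript in this format using Python's standard library.
Only the element and attribute names come from the description
above; the sample fragment and its attribute values are invented
for illustration, and the real corpus files may differ in detail
(the timing attributes, for example, are omitted here).

```python
# Sketch: counting filled pauses and recovering repaired speech
# in a CCHP-style transcript.  Element and attribute names follow
# the ReadMe description; the fragment itself is invented.
import xml.etree.ElementTree as ET

sample = """
<TRANSCRIPT LANGUAGE="English" TASK="topic narrative">
  <UTTERANCE>
    <T>the</T>
    <RP>
      <O><T>bas#</T></O>
      <T FILLED-PAUSE="yes">uh</T>
      <E><T>basketball</T></E>
    </RP>
    <T>player</T>
    <T FILLED-PAUSE="yes">um</T>
    <T>shoots</T>
    <PUNC>.</PUNC>
  </UTTERANCE>
</TRANSCRIPT>
"""

root = ET.fromstring(sample)

# Filled pauses are <T> elements with FILLED-PAUSE='yes'.
filled = [t for t in root.iter("T") if t.get("FILLED-PAUSE") == "yes"]

# In each repair sequence, the final <E> child holds the repaired speech.
repairs = [rp.findall("E")[-1] for rp in root.iter("RP")]
repaired_words = [t.text for e in repairs for t in e.iter("T")]

print(len(filled))     # 2
print(repaired_words)  # ['basketball']
```

The same approach extends to the other elements (<RT>, <FS>,
<PAUSE/>, and so on), since they are all ordinary XML elements.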
  
In addition to the detailed XML files, a plain text version of
the transcripts is also available.  This is a simply formatted
text consisting of the <T> nodes (i.e., words and filled
pauses).  This version is probably not useful for detailed
analysis, but may be useful to get a quick overview of the
speech.
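
The mapping from XML to this text version can be sketched as
follows, again with Python's standard library.  The fragment is
invented, and the exact formatting of the distributed text files
may differ.

```python
# Sketch: deriving a minimal text transcription from a CCHP-style
# XML fragment by joining <T> tokens and attaching <PUNC> marks.
# The element names follow the description above; the sample
# fragment is invented for illustration.
import xml.etree.ElementTree as ET

sample = "<UTTERANCE><T>it</T><T>is</T><T>fun</T><PUNC>.</PUNC></UTTERANCE>"

root = ET.fromstring(sample)
words = []
for el in root.iter():
    if el.tag == "T":
        words.append(el.text)
    elif el.tag == "PUNC" and words:
        words[-1] += el.text  # attach punctuation to the preceding word

text = " ".join(words)
print(text)  # it is fun.
```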

Finally, TextGrid files are provided which give the duration
details of the transcription in the TextGrid format used by
Praat (praat.org).  These files may be opened together with
the corresponding wav audio file in Praat for further analysis.

All of the text-based files are encoded in UTF-8 and should be
readable in almost any text editor.

----------------------------------------------------------------
News and Updates

2012/09/01 - This is the initial release of CCHP materials.
  This release includes audio files and transcripts for six
  participants: p102-p104, p106-p108.  The transcription
  process is still ongoing.  Thus, transcripts in this release
  do not yet contain time markings and there are no Praat
  TextGrid files yet.
  
2012/09/19 - This release adds files for five more participants
  (p109-p114).  However, the collections have not been updated
  yet since a further release is expected soon with additional
  participants.  The collections will be updated in the next
  release.

----------------------------------------------------------------
Credits

The CCHP was compiled and is maintained by Ralph Rose,
Center for English Language Education (CELESE), Waseda
University Faculty of Science and Engineering, Tokyo,
Japan.

Other Research Staff (former and current)

Hiroaki Suzuki
Junichi Inagaki
Masayuki Motoori
Yukikatsu Fukuda

----------------------------------------------------------------
Sponsorship

The CCHP was created under a research grant-in-aid (principal
investigator: Ralph Rose; Project No. 24520661) from the
Japanese Ministry of Education, Culture, Sports, Science, and
Technology (MEXT; http://kaken.nii.ac.jp/d/p/24520661.en.html).

Crosslinguistic Corpus of Hesitation Phenomena (CCHP) First Release!

The Filled Pause Research Center is pleased to announce the initial release of Crosslinguistic Corpus of Hesitation Phenomena (CCHP) materials. This release includes audio files (wav and mp3) and transcripts (annotated xml and plain text) for six participants.  The transcription process is still ongoing. Thus, transcripts in this release do not yet contain time markings and there are no Praat TextGrid files yet.

Those who wish to access the corpus are asked to create a new account at the FPRC.  After doing so, the corpus archive can be accessed on the CCHP main page. Registered users may then download the entire corpus (as released so far), download sub-collections of the corpus, or browse and download individual files in the corpus.

First Stage of Transcription of CCHP Recordings has begun

I met with the research support staff recently to go over the procedures for transcribing all the recordings. There are roughly 9 hours of recordings to be transcribed in several stages. The first stage, which is probably the most arduous, is to transcribe all the words in each recording, delimiting them minimally into utterances. In addition, the staff will transcribe all overt hesitation phenomena, including filled pauses, false starts, repair sequences, and repeats (see Taxonomy for some details).

Crosslinguistic Corpus of Hesitation Phenomena (CCHP)

The Crosslinguistic Corpus of Hesitation Phenomena (CCHP) is an ongoing project to organize a corpus of first and second language recorded speech in response to several speaking tasks. It is supported by a three-year Grant-in-Aid from Japan's Ministry of Education, Culture, Sports, Science, and Technology under the title "Hesitation Phenomena in Second Language Development" (in Japanese, 「第二言語習得における躊躇現象」).
