Filled Pause
Research Center

Investigating 'um' and 'uh' and other hesitation phenomena

The 3rd Workshop on Disfluency in Spontaneous Speech (DiSS 2003)

The third Workshop on Disfluency in Spontaneous Speech was held as an International Speech Communication Association (ISCA) research workshop.

Date: September 5-8, 2003

Location: Göteborg University; Göteborg, Sweden

Organizers: Jens Allwood, Robert Eklund, and Åsa Wengelin

Papers presented

  • Martine Adda-Decker, Benoît Habert, Claude Barras, Gilles Adda, Philippe Boula de Mareuil, and Patrick Paroubek, “A disfluency study for cleaning spontaneous speech automatic transcripts and improving speech language models,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 67-70. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_067.pdf.

    Abstract The aim of this study is to elaborate a disfluent speech model by comparing different types of audio transcripts. The study makes use of 10 hours of French radio interview archives, involving journalists and personalities from political or civil society. A first type of transcripts is press-oriented where most disfluencies are discarded. For 10% of the corpus, we produced exact audio transcripts: all audible phenomena and overlapping speech segments are transcribed manually. In these transcripts about 14% of the words correspond to disfluencies and discourse markers. The audio corpus has then been transcribed using the LIMSI speech recognizer. With 8% of the corpus the disfluency words explain 12% of the overall error rate. This shows that disfluencies have no major effect on neighboring speech segments. Restarts are the most error prone, with a 36.9% within class error rate.

    Keywords DiSS

  • Matthew P. Aylett, “Disfluency and speech recognition profile factors,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 51-54. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_051.pdf.

    Abstract This paper reports on work bringing together disfluency coding carried out by Lickley [1] and recognition work carried out as part of the ERF project (Bard, Thompson & Isard, [2]) at Edinburgh University. A set of factors are investigated which characterise the behaviour of the ASR during recognition based on an analysis of the resulting word lattice. These factors can be grouped as: Entropy Factors - the entropy of the acoustic and language model likelihoods, within the word lattice, over a 10 ms frame, and, Arc Factors - the number of non-unique and unique arcs in the word lattice in any given 10 ms time frame, together with the variance of start and end times of these arcs, and the number of arcs starting or ending in the frame. The values of all factors were used to train a simple CART model. The CART model was used to predict: recognition failure, interruption point location (the point where a disfluency begins), and whether the location was in a repair or a reparandum. The entropy of the language model values contributed most to the model's prediction of recognition failure, and whether a frame was in a repair or reparandum. In contrast, the number of unique word hypotheses contributed most to the successful prediction of a frame being close to an interruption point.

    Keywords DiSS

  • Ramona Benkenstein, and Adrian P. Simpson, “Phonetic correlates of self-repair involving word repetition in German spontaneous speech,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 81-84. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_081.pdf.

    Abstract A phonetic description of self-initiated self-repair sequences involving the repetition of words in German spontaneous speech is presented. Data are drawn from the Kiel Corpus of Spontaneous Speech. The description is primarily impressionistic auditory, but it also employs acoustic records to verify and objectify the impressionistic findings. A number of different patterns around cut-off are identified. The comparison of phonetic differences between reparandum and repair tokens is used to argue that repair sequences can also provide an interesting insight into the way in which fluent stretches of spontaneous speech are phonetically organized.

    Keywords DiSS

  • Yasuharu Den, “Some strategies in prolonging speech segments in spontaneous Japanese,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 87-90. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_087.pdf.

    Abstract In this paper, we investigate segmental prolongation in a corpus of spontaneous Japanese monologues consisting of over 700,000 words. We examine effects on the rate of prolongation of various factors including speech types, the genders of speakers, word classes, word positions in the phrase and in the inter-pausal unit, and the presence of preceding fillers. Based on the empirical findings, we state some strategies in prolonging speech segments used by Japanese speakers.

    Keywords DiSS

  • Sheena Finlayson, Victoria Forrest, Robin Lickley, and Janet Mackenzie Beck, “Effects of the restriction of hand gestures on disfluency,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 21-24. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_021.pdf.

    Abstract This paper describes an experimental pilot study of disfluency and gesture rates in spontaneous speech where speakers perform a communication task in three conditions: hands free, one arm immobilized, both arms immobilized. Previous work suggests that the restriction of the ability to gesture can have an impact on the fluency of speech. In particular, it has been found that the inability to produce iconic gestures, which depict actions and objects, results in a higher rate of disfluency. Models of speech production account for this by suggesting that gesture and speech production are part of the same integrated system. Such models differ in their interpretation of the location of the gesture planning mechanism in relation to the speech model: some authors suggest that iconic gestures relate closely to lexical access, while others suggest that the link is located around the conceptualization stage. The findings of this study tentatively confirm that there is a relationship between gesture and fluency - overall, disfluency increases as gesture is restricted. But it remains unclear whether the disfluency is more related to lexical access than to conceptualization. Proposals for a larger study are suggested. The work is of interest to psycholinguists focusing on the integration of gesture into models of speech production and to Speech and Language Therapists who need to know about the impact that an impaired ability to produce gestures may have on communication.

    Keywords DiSS

  • Kotaro Funakoshi, and Takenobu Tokunaga, “Evaluation of a robust parser for spoken Japanese,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 55-58. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_055.pdf.

    Abstract We implemented a parser designed to handle ill-formedness in Japanese speech. The parser was evaluated by utilizing newly collected speech data, which was obtained from an experiment designed to produce ill-formed data effectively. Introducing the proposed method increased the number of correctly analyzed utterances from 171 to 322, from among 532 utterances in the corpus.

    Keywords DiSS

  • Robert J. Hartsuiker, Martin Corley, Robin Lickley, and Melanie Russell, “Perception of disfluency in people who stutter and people who do not stutter: Results from magnitude estimation,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 35-37. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_035.pdf.

    Abstract Recent accounts of stuttering consider disfluencies the result of an interaction between speech planning and self-monitoring, emphasizing the continuity between errors made in everyday speech and those made by people who stutter. On Vasić & Wijnen's account, the monitor is hypervigilant for upcoming problems and interrupts and restarts the speech signal, resulting in disfluent speech. Crucially, on this account, self-monitoring is a perceptual function. Therefore, this account makes two predictions: (1) people who stutter are also hypervigilant in perceiving another person's speech; (2) the quality of disfluencies made by people who stutter and those who do not will be comparable. We tested these hypotheses using a magnitude estimation judgment task. Twenty participants who stutter and 20 controls were asked to rate the fluency of excerpted fluent and disfluent fragments from recorded dialogues, either between people who stutter or between non-stutterers. In line with the first hypothesis, people who stutter tended to rate all fragments as more disfluent than controls did. However, the second hypothesis was not confirmed: across judges, fluent and disfluent fragments excerpted from recordings of people who stutter were rated as less fluent than those excerpted from control dialogues, suggesting that there are perceptually relevant differences between the speech of PWS and PWDNS, independent of number and type of disfluencies.

    Keywords DiSS

  • Sandrine Henry, and Berthille Pallaud, “Word fragments and repeats in spontaneous spoken French,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 77-80. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_077.pdf.

    Abstract This paper presents the results of a study conducted on the interaction of two disfluencies: repeats and word fragments. It is based on 150 repeated word fragments (e.g., "on le re- re- revendique encore une fois") extracted from a one-million-word corpus of spoken French. Word fragments such as: "notre metier spé- spécifique", are, like repeats (e.g., "vous avez évalué le le montant des dégâts"), very frequent events in spoken language: on average, there is 1 word fragment every 50 seconds, 1 repeat every 17 seconds. Speakers and listeners alike are generally unaware of these phenomena as if they were not part of the communication process. They seldom trigger a metalinguistic reaction from the speaker and are even more rarely acknowledged by the listener. These phenomena have sometimes been interpreted as 'errors' in the communication process, like slips of the tongue. Word fragments and repeats encompass different categories of phenomena, and this enables us to define them as a heterogeneous group ruled by different types of constraints and mechanisms. This analysis rests on the following criteria: structural aspects of the repeat, types of word fragments, morphological and syntactic aspects. Analyses of these repeats of identical word fragments from two different angles - that of the repeats and then that of the word fragments - confirm the relevance of the distinction between these two types of disfluencies.

    Keywords DiSS

  • Peter Howell, “Is a perceptual monitor needed to explain how speech errors are repaired?,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 31-34. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_031.pdf.

    Abstract Kolk & Postma [2] proposed, following Dell & O'Seaghdha [1], that when a speaker chooses a word, phonologically-related words as well as the intended word are activated. Initially, the activations of all these words are similar, though eventually the intended word reaches a higher asymptotic value when activation is complete [1]. According to Kolk & Postma [2], if a response is made in the phase where activation is building up (rather than at full activation), there is a higher chance of the competing, rather than the intended, word being selected (i.e. an error). They propose that a speaker detects such errors when they are produced overtly using the perceptual system, and a monitor in the linguistic system responds by interrupting and initiating the correction [2]. Word repetition and hesitation (not errors in themselves) have been regarded as signifying underlying errors that are detected and interrupted before speech is output in a similar way to overt errors. An assumption in [2] is that activation for a word stops (or, if it continues, is ignored) immediately a candidate word is selected. The brain processes responsible for speech production have massive parallel capacity. Consequently, activation for all the candidates for a word slot could continue beyond the point where a word is selected in cases where a word is responded to prematurely. When the selected word reaches asymptote, the relative activations of this and the other candidate words indicate when an error has occurred (when the selected word has a lower activation than one of the competing words), and what correction is appropriate (the word with the highest activation). This provides the basis for error detection and correction without the need for a perceptual monitor. Continuing the buildup of activation after a word has been selected implies that activation of nearby words in its phrase overlaps. It is shown, with some realistic assumptions about how activation builds up and decays across different words in a phrase, that this model predicts word repetition and hesitation and also part-word disfluencies (a characteristic of stuttering), again without the need for a perceptual monitor.

    Keywords DiSS

  • Kim Kirsner, John Dunn, and Kathryn Hird, “Fluency: Time for a Paradigm Shift,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 13-16. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_013.pdf.

    Abstract Pauses in spontaneous speaking constitute a rich source of data for several disciplines. They have been used to enhance automatic segmentation of speech, classification of patients with acquired communication disorders, the design of psycholinguistic models of speaking, and the analysis of psychological disorders. Unfortunately, however, although pause analysis has been with us for more than 40 years, their interpretation has been compromised by several problems [1]. The first problem is that the pause distribution is skewed, making mean duration a poor measure of central tendency. The second problem is that there are at least two components to the pause duration distribution, a problem that has been confounded by the fact that most authors have assumed that short pauses can be ignored. The third problem is that many scholars have used an arbitrary criterion to separate the pause components thereby adopting statistics that reflect errors of commission or omission. In this paper we review recent work that resolves each of these issues and illustrates the application of the new paradigm to a variety of problems. Our research indicates that, first, there are at least two pause duration distributions, each of which may be sensitive to theoretically interesting variables; second, the distributions are log-normal, thereby opening the way to appropriate measures of central tendency and dispersion, and, third, the distributions can be reliably separated by application of signal detection theory, and the proportion of misclassifications minimised and estimated. This paper reviews recent research using the new approach to pause analysis.

    Keywords DiSS

  • Torbjörn Lager, “In dialogue with a desktop calculator: A concurrent stream processing approach to building simple conversational agents,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 59-62. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_059.pdf.

    Abstract Human spontaneous face-to-face conversations are characterized by phenomena such as turn-taking, feedback, sounds of hesitation and repairs. A simple and highly modular stream-based approach to natural language processing is proposed that attempts to deal with such things. A basic version of the model has been implemented in the Oz programming language.

    Keywords DiSS

  • Piroska Lendvai, Antal van den Bosch, and Emiel Krahmer, “Memory-based disfluency chunking,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 63-66. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_063.pdf.

    Abstract We investigate the feasibility of machine learning in automatic detection of disfluencies in a large syntactically annotated corpus of spontaneous spoken Dutch. We define disfluencies as chunks that do not fit under the syntactic tree of a sentence (including fragmented words, laughter, self-corrections, repetitions, abandoned constituents, hesitations and filled pauses). We use a memory-based learning algorithm for detecting disfluent chunks, on the basis of a relatively small set of low-level features, keeping track of the local context of the focus word and of potential overlaps between words in this context. We use attenuation to deal with sparse data and show that this leads to a slight improvement of the results and more efficient experiments. We perform a search for the optimal settings of the learning algorithm, which yields an accuracy of 97% and an F-score of 80%. This is a significant improvement of the baselines and of the results obtained with the default settings of the learner.

    Keywords DiSS

  • Krisztina Menyhárt, “Age-dependent types and frequency of disfluencies,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 45-48. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_045.pdf.

    Abstract The age-dependent changes of one's speech production from childhood up to old age are relatively well known. However, there has been less research conducted concerning the possible alterations of the disfluency phenomena in speakers' spontaneous speech determined by age. Our hypothesis is that permanent changes are going on in the operation of speech production processes from early childhood up to old age, and that those changes can be studied via observing disfluency phenomena. A series of experiments has been carried out with the participation of altogether 30 Hungarian-speaking persons, children, middle-aged adults and old subjects (ages of 77). Their spontaneous speech was recorded and analyzed concerning the articulation and speech tempi, silent and filled pauses, as well as other disfluency phenomena (like false starts, repetitions, slips, etc.). The aim of the research is to explore the invariant and variable factors of the disfluencies depending on age. The results highlight also the individual differences that seem to be independent of the age factor.

    Keywords DiSS

  • Hannele Nicholson, Ellen Gurman Bard, Robin Lickley, Anne H. Anderson, Jim Mullin, David Kenicer, and Lucy Smallwood, “The intentionality of disfluency: Findings from feedback and timing,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 17-20. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_017.pdf.

    Abstract This paper addresses the causes of disfluency. Disfluency has been described as a strategic device for intentionally signalling to an interlocutor that the speaker is committed to an utterance under construction. It is also described as an automatic effect of cognitive burdens, particularly of managing speech production during other tasks. To assess these claims, we used a version of the map task and tested 24 normal adult subjects in a baseline untimed monologue condition against conditions adding either feedback in the form of an indication of a supposed listener's gaze, or time-pressure, or both. Both feedback and time-pressure affected the nature of the speaker's performance overall. Disfluency rate increased when feedback was available, as the strategic view predicts, but only deletion disfluencies showed a significant effect of this manipulation. Both the nature of the deletion disfluencies in the current task and of the information which the speaker would need to acquire in order to use them appropriately suggest ways of refining the strategic view of disfluency.

    Keywords DiSS

  • Sieb G. Nooteboom, “Self-monitoring is the main cause of lexical bias in phonological speech errors,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 27-30. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_027.pdf.

    Abstract In this paper I present new evidence, stemming both from an experiment and from spontaneous speech, demonstrating that (a) lexical bias is caused by self-monitoring of inner speech, as proposed by Levelt et al. [1], and (b) that there is phoneme-to-word feedback in the mental programming of speech, as supposed by Dell [2] and Stemberger [3]. It is argued here that possibly phoneme-to-word feedback is an unavoidable side-effect of self-monitoring of inner speech.

    Keywords DiSS

  • Caroline L. Rieger, “Disfluencies and hesitation strategies in oral L2 tests,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 41-44. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_041.pdf.

    Abstract This paper presents an investigation of hesitation strategies of intermediate learners of German as a second or foreign language (L2) when they take part in oral L2 tests. Previous studies of L2 hesitation strategies have focused on beginning and advanced L2 learners. They found that beginners tend to leave their hesitation pauses unfilled making their speech highly disfluent [17], while advanced L2 speakers - similar to native speakers - use a variety of fillers. In oral L2 tests, intermediate learners hesitate mainly for two reasons: to search for a German word or structure, or to think about the content of their utterance. Some participants use a variety of strategies to signal to the addressee that they are hesitating. This variety is not as rich as it is for advanced L2 learners or native speakers. Other participants leave their hesitation pauses unfilled or rely on quasi-lexical fillers to hold the floor when hesitating.

    Keywords DiSS

  • Guergana Savova, and Joan Bachenko, “Prosodic features of four types of disfluencies,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 91-94. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_091.pdf.

    Abstract We present a corpus-based approach for using intonation and duration to detect disfluency sites. The questions we aim to answer are: what are the prosodic cues for each disfluency type? Can predictive models be built to describe the relationship between disfluency types and prosodic cues? Are there correlations between the reparandum onset and offset and the repair onset and offset? Is there a general prosodic strategy? Our findings support four main hypotheses: 1) The Combination Rule: A single prosodic feature does not uniquely identify disfluencies or their types. Rather, it is a combination of several features that signals each type. 2) The Compensatory Rule: If there is an overlap of one prosodic feature, then another cue neutralizes the overlap. 3) The Discourse Type Rule: Prosodic cues for disfluencies vary according to discourse type. 4) The Expanded Reset Rule: Repair onsets are dependent on reparandum onsets and reparandum offsets. The limitation of the current study is the relatively small corpus size. Further testing of our proposed hypotheses is needed.

    Keywords DiSS

  • Shu-Chuan Tseng, “Repairs and repetitions in spontaneous Mandarin,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 73-76. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_073.pdf.

    Abstract 246 overt repairs, 653 complete repetitions and 475 partial repetitions were identified in an annotated corpus of spontaneous Mandarin conversations. On the basis of the data, this paper investigates Mandarin repairs and repetitions by segmenting them into the reparandum part, the editing part and the reparans part and by tagging them using the CKIP automatic word segmentation and tagging system. Results of the use of editing term, the distribution of part of speech and syllables in the reparandum are presented. Semantic differences and similarity in the discrepancy of tagging results of the reparandum and the reparans are also discussed.

    Keywords DiSS

  • Fan Yang, Peter A. Heeman, and Susan E. Strayer, “Acoustically verifying speech repair annotations,” in Disfluency in Spontaneous Speech (DiSS '03) (Gothenburg Papers in Theoretical Linguistics), vol. 90, Göteborg, Sweden, September 2003, pp. 97-100. http://www.isca-speech.org/archive_open/archive_papers/diss_03/dis3_097.pdf.

    Abstract Identifying speech repairs is a critical part of annotating spontaneous speech. DialogueView is an annotation tool that provides visual and audio supports for directly annotating speech repairs. In this paper, we report the usability of clean play, a special feature implemented in DialogueView, which cuts out the annotated reparanda and editing terms and plays the remaining speech. We find that although clean play does not help users detect repairs, it does help them determine the extent of repairs. We also find that clean play improves users' confidence because they have another way to verify their annotations.

    Keywords DiSS