Graduiertenkolleg
Forschungsprogramm / Research Program
"Prosodic Representations in Language and Speech"

Ziele des Programms / Research Goals

My primary research interests are relevant for the research program of the Graduiertenkolleg in several respects. The common denominator for this link between my individual and the Kolleg's joint research programs may be formulated as the investigation of symbolic, acoustic, and cognitive representations of prosody in both language and speech.

One of my major goals in the Graduiertenkolleg is to establish a research paradigm that will facilitate the study of prosody by providing direct conceptual links between its manifestations and representations in speech production, speech acoustics, and speech perception. The DFG project "A computational model of target oriented production of prosody" (2001-2004, jointly coordinated by Grzegorz Dogil and myself) has aimed to provide experimental evidence for the validity of extending Guenther and Perkell's speech production model (Guenther, 1995; Guenther et al., 1998; Perkell et al., 2000; Guenther, 2003) to the domain of prosody; a certain degree of synergy from the partial overlap between this project and the present research program is expected.

My second main research goal is to investigate linguistic/symbolic representations of prosody, with the ultimate goal of building models that can be implemented in the linguistic components of natural language and speech systems. Such computational models will allow us to empirically verify our interpretations of prosodic representations. In his "blueprint of the speaker", Levelt (1999) has outlined an agenda that calls for the design and implementation of empirically viable working models of the processing components involved in the blueprint.
The Prosody Generator is among the least elaborated modules in the blueprint, despite or because of the generally recognized, important integrating function of prosody in the organization and production of speech (Levelt, 1989; Dogil, 2000). This part of my research program will overlap to some extent with Grzegorz Dogil's research, and in fact we expect to carry out some of the pertinent research jointly.

Finally, and in extension of the second goal outlined above, I am interested in the design of interface(s) between semantic, syntactic, phonological and phonetic representations of prosody. The representational design is meant to have its motivation in linguistic and phonetic theory, and its validity should be verifiable in an end-to-end generation system, such as a concept-to-speech system.

I intend to contribute to the general research program of the Graduiertenkolleg in the following ways:

Forschungsschwerpunkte / Research Topics

1. Theoretical Framework for the Study of Prosody Acquisition, Perception, and Production

I am interested in investigating Exemplar Theory as a common theoretical framework for the study of prosody acquisition, perception, and production by humans, as well as prosody generation by machine. Whereas Pierrehumbert (2001) discusses primarily segmental evidence, I will argue that the key ingredients of the theory can be generalized and extended to the prosodic domain.

2. Prosody Production and Generation

I intend to explore to what extent Exemplar Theory, besides the speech production model (Guenther, 1995; Guenther et al., 1998; Perkell et al., 2000) that the DFG project "A computational model of target oriented production of prosody" (Dogil and Möbius, 2001c; see also Dogil and Möbius, 2001a, 2001b) builds on, is also a relevant theoretical framework for the study of prosody in speech production. More concretely, the theory might help explain how the multidimensional target regions of speech events are established and maintained in an individual speaker (production) and across speakers (perception). I predict that frequency effects, entrenchment, and exemplar cloud updates are crucial ingredients of such an explanation.

Exemplar Theory also provides a motivational link between human speech production and automatic speech generation. In speech production a particular exemplar is selected to realize a given category; Pierrehumbert (2001) discusses three models, with increasing complexity, of how the selection is performed. Currently, none of these models takes into account the context in which the target event is embedded. I suggest that the segmental, prosodic and positional contexts be considered explicitly in a model of exemplar selection in speech production.
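
The proposal to make exemplar selection context-sensitive can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each exemplar carries a single acoustic value, an activation strength, and a tuple of context labels, and that the probability of selecting an exemplar is proportional to its strength times the fraction of matching context labels. The names `Exemplar`, `context_match`, and `select_exemplar`, as well as the weighting scheme itself, are illustrative assumptions and do not reproduce any of the models in Pierrehumbert (2001).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    f0_hz: float      # stored acoustic value (hypothetical single dimension)
    strength: float   # activation strength, e.g. reflecting frequency or recency
    context: tuple    # (segmental, prosodic, positional) labels; assumed encoding


def context_match(a, b):
    """Return the fraction of context labels on which a and b agree (0.0 to 1.0)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def select_exemplar(cloud, target_context, rng):
    """Draw one exemplar with probability proportional to strength * context match."""
    weights = [e.strength * context_match(e.context, target_context) for e in cloud]
    if sum(weights) == 0:
        # No exemplar shares any context label: fall back to context-free selection.
        weights = [e.strength for e in cloud]
    return rng.choices(cloud, weights=weights, k=1)[0]


cloud = [
    Exemplar(120.0, 1.0, ("a", "H*", "final")),
    Exemplar(200.0, 5.0, ("i", "L*", "initial")),
]
rng = random.Random(0)
picked = select_exemplar(cloud, ("a", "H*", "final"), rng)
```

In this toy cloud the second exemplar is stronger but shares no context label with the target, so its weight is zero and the first exemplar is chosen. A graded similarity measure over context features would be a natural refinement of the all-or-nothing label match assumed here.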

Pierrehumbert's notion of an abstract, idealized exemplar prototype in speech production (Pierrehumbert, 2001) is compatible with the notion of an ideal point in a multidimensional target region as it is currently applied in automatic speech synthesis (Möbius, 2001a). Whereas prosody certainly contributes to the multidimensionality of target specifications, it is considered as secondary information in state-of-the-art unit selection synthesis systems (e.g., Balestri et al., 1999; Beutnagel et al., 1999; Taylor and Black, 1999), and one of my research goals is to exploit more explicitly the selectional restrictions imposed by appropriate prosodic representations.

3. Internal Representations of Prosody

Phonemic settings and the internal models that they represent are learned in the process of language and speech acquisition. Postural settings, in contrast, rely on continuous auditory monitoring and tend to break down quickly if this monitoring process is inhibited during speech production. Evidence presented in the literature indicates that stable internal models are mostly associated with segmental phonemic targets (Perkell et al., 2000), whereas prosodic features often display postural characteristics. I will argue that the dichotomy of phonemic and postural settings applies not only to segmental properties of speech but to prosodic features as well. When compared to segmental characteristics of speech, which are best subserved by strong and stable internal representations (Perkell et al., 2000), prosodic properties may rely more strongly on a balanced mixture of continuous, auditory feedback-based update and learned internal models (Jones and Munhall, 2001).
Based on evidence reported in the literature (e.g., Jilka, 2000) and on theoretical considerations I intend to test two hypotheses: first, that the relative importance of acquired internal models of phonemic targets, on the one hand, and of immediate adjustments of postural settings, on the other hand, is flexible and depends on the actual communicative and situational conditions; and second, that the speaker may have access to several internal models, each representing the most appropriate balance of phonemic and postural settings for a prototypical communicative and situational context (Möbius and Dogil, 2002).

4. Acquisition of Prosody

The DIVA model (Guenther, 1995) provides a simulation of the acquisition of internal phonemic models. An exemplar-based interpretation of the target regions also needs to explain how such internal models emerge during language and speech acquisition. In the DFG project "Ein exemplartheoretisches Modell zum Erwerb der akustischen Korrelate der Betonung" ("An exemplar-theoretic model of the acquisition of the acoustic correlates of stress", 2004-2006) I have started to study the acquisition of syllabic stress. The project will investigate when children start perceiving stress contrasts, when and how they start realizing syllabic stress, and whether they tend to produce stress by actively using the same acoustic correlates of stress as their parents do, as would be predicted by Exemplar Theory. The acquisition of prosodic targets should be investigated beyond the specific topic of syllabic stress.

The parts of my research program outlined above (1.-4.) will be carried out in close collaboration with Grzegorz Dogil.

5. Concept-to-Speech Generation

Concept-to-speech (CTS) systems (Alter et al., 1997; Teich et al., 1997) provide a direct link between language generation and acoustic-prosodic components (Möbius, 2001a; Batliner and Möbius, 2002). But to fully exploit the improvement to synthesized prosody potentially available in a CTS system, the optimal granularity of information needs to be defined.
On the one hand, the acoustic-prosodic components will have to specify exactly which pieces of linguistic information are optimally required to produce natural-sounding prosody; on the other hand, this specification must be synchronized with the type of information that a language generation component can be reasonably expected to provide. The research challenge can thus be formulated in shorthand as the design of an optimal semantics/syntax-prosody interface. This part of my research program is expected to benefit from collaboration with the semantics and discourse specialists in the Kolleg, in particular Hans Kamp and Uwe Reyle.

6. Methodological Issues

My approach to addressing the research topics outlined above is generally theory-driven but may be characterized additionally by the following methodological considerations:
- Corpus-based methods. Language and speech corpora provide access to real data with realistic frequency distributions, and relevant features can be detected, learned and modeled by means of appropriate empirical and statistical methods, including machine learning (e.g., Prescher, 2002).
- Statistical methods. I recognize the necessity to apply sophisticated statistical methods (e.g., van Santen, 1993; Baayen, 2001; Evert and Lüdeling, 2001) that can handle extremely uneven frequency distributions of language and speech events and the resulting sparse data problem (Möbius, 2001b; Möbius, 2003a).
- Probabilistic models. I expect probabilistic information to be relevant at virtually all levels of prosodic description: entrenchment (acquisition), productive morphological and phonological processes (production, generation), well-formedness judgments (perception), unit selection (speech synthesis).
- Computational models. Quantitative computational models will facilitate the evaluation of hypotheses and assumptions, by way of being implemented and integrated in natural language systems.

Stand der Forschung / State of the Art

Exemplar Theory was first introduced in psychology as a model of perception and categorization. Only with some delay was it extended to speech sounds by Johnson (1997) and Lacerda (in press) (cf. related work by Hintzman, 1986, and Goldinger, 1996). Pierrehumbert (2001) demonstrates how Exemplar Theory can provide a way to formalize the detailed phonetic knowledge that native speakers have about the categories of their language. The acquisition of this knowledge can be regarded as the acquisition of a large number of memory traces of speech-based experiences. Pierrehumbert discusses primarily segmental evidence, but there are good reasons to assume that the theory can be generalized and extended to the prosodic domain. For instance, it is posited that
Pierrehumbert (2001) explains that the optimal location of a given exemplar prototype may not always be actually represented by an existing exemplar token. Optimal locations may thus represent idealized, abstract prototypes. In my own work on speech synthesis (Möbius, 2001a) I have elaborated the concept of an "ideal point" (originally introduced by van Santen and Möbius; cf. Sproat, 1998), the center of a multidimensional region of pre-defined size. The ideal point serves either as the reference target for the online selection of speech units at synthesis runtime or as a reference for the optimal location of cut and concatenation points in offline acoustic unit inventory construction. In either scenario the size of the region represents the limits of (acoustically, perceptually) acceptable deviations of unit candidates from the ideal target. Selection criteria are multidimensional too: they comprise both spectral and prosodic features and, accordingly, the unit candidates in the speech corpus are annotated with segmental and prosodic feature vectors. Note that prosody is considered as secondary information in state-of-the-art unit selection systems (e.g., Balestri et al., 1999; Beutnagel et al., 1999; Taylor and Black, 1999).

Another interesting property of the speech production model (Guenther et al., 1998) is the dichotomy of phonemic settings and postural settings. In mature speech production auditory feedback has two functions (Boutsen and Christman, 2001). First, it helps maintain phonemic settings, i.e. parameters of phonemic distinctions; second, it assures intelligibility by monitoring the acoustic environment and accommodating the baseline postural settings of the respiratory, laryngeal, and supraglottal systems appropriately. We have suggested that the dichotomy of phonemic and postural settings applies not only to segmental properties of speech but to prosodic features as well (Möbius and Dogil, 2002).
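
The use of an ideal point as the reference target in unit selection, described earlier in this section, can be sketched in a few lines of Python. The sketch assumes, purely for illustration, that the target region is a sphere in a per-dimension scaled feature space, that distance is Euclidean, and that each candidate has two spectral features and one prosodic feature. The function names, the feature values, and the `scales` and `radius` parameters are hypothetical and are not taken from the systems cited above.

```python
import math


def scaled_distance(features, ideal, scales):
    """Euclidean distance from the ideal point, with each dimension divided by its scale."""
    return math.sqrt(
        sum(((f - i) / s) ** 2 for f, i, s in zip(features, ideal, scales))
    )


def select_unit(candidates, ideal, scales, radius=1.0):
    """Return the name of the candidate closest to the ideal point.

    Candidates outside the target region (distance > radius) are rejected;
    None is returned if no candidate lies inside the region.
    """
    scored = [(scaled_distance(f, ideal, scales), name) for name, f in candidates.items()]
    in_region = [(d, name) for d, name in scored if d <= radius]
    if not in_region:
        return None
    return min(in_region)[1]


# Hypothetical candidates: (F1 in Hz, F2 in Hz, F0 in Hz)
candidates = {
    "u1": (500.0, 1500.0, 120.0),
    "u2": (520.0, 1480.0, 180.0),
    "u3": (505.0, 1510.0, 125.0),
}
ideal = (510.0, 1500.0, 122.0)
scales = (50.0, 100.0, 20.0)  # assumed tolerances per dimension

best = select_unit(candidates, ideal, scales)
```

Here `u2` is spectrally close to the ideal point but falls outside the region because of its F0 value, which shows how a prosodic dimension can act as a selectional restriction rather than as secondary information. Shrinking `radius` tightens the limits of acceptable deviation and may leave no acceptable candidate.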
Finally, one answer to the research question of an appropriate representation of prosody at the interface between categorical (symbolic) and continuous (parametric) levels of description may be found by developing computational models of prosody in the framework of natural language systems. Certain speech output generation strategies beyond the classical text-to-speech (TTS) scenario offer rather immediate interfaces between symbolic and acoustic representations of prosody (Batliner and Möbius, 2002). Concept-to-speech (CTS) systems, in particular, provide a direct link between language generation and acoustic-prosodic components. A CTS system has access to the complete linguistic structure of the sentence that is being generated; the system "knows" what to say, and how to render it. The degree of potential improvement to synthesized prosody can be illustrated by manually marking up the text or by providing access to semantic and discourse representations (Prevost and Steedman, 1994). But even in a TTS scenario it has been demonstrated that models which use rich and detailed prosodic information, for instance accent type labels in addition to accent location alone, can generate intonation contours that are perceptually more acceptable than models which use accent location alone (Syrdal et al., 1998). The problem is that computing from text such detailed prosodic features as accent type is difficult and unreliable, but they may be more readily accessible in different speech generation strategies such as concept-to-speech.
Yet, in CTS systems it is still necessary to specify the mapping from
semantic to symbolic features and from categorical symbolic features
to continuous acoustic parameters. The issue of how much, and what
kind of, information the language generation component should deliver
to optimize the two mapping steps (i.e., the definition of a
semantics/syntax-prosody interface) is an urgent research topic. Once
the two mapping steps are optimized, we may even be able to advance
one step further and get rid of the intermediate, phonological
representation of prosody (cf. Batliner and Möbius, 2002).

Eigene Vorarbeiten / Own Work

Prosody: speech production and internal representations:
- Dogil and Möbius (2001a,b,c), Möbius and Dogil (2002), Schweitzer and Möbius (2003a,b, 2004)
Speech generation and synthesis:
Prosodic modeling:
Methodological issues:

Publikationen der letzten 3 Jahre / Publications in the previous 3 years
Themen geplanter Dissertationsprojekte / Planned PhD Dissertation Projects
Verzahnung innerhalb des Kollegs / Links to Other Parts of the Graduiertenkolleg
Thematic relations within the Kolleg:
Literatur / References
Last modified: 22.12.2004 (bm)