Our research spans three major topics. First, we run usability tests on all our prototypes and applications. To make these tests possible, we had to develop methods for evaluating multimodal and mobile applications. The tests give valuable feedback on the advantages and challenges of multimodality for the user.
Our second research topic is the development of a language for the device-independent description of user interfaces. We create the means for editing and rendering the multimodal interface on devices as different as a WAP phone and a PDA.
Third, our speech-related research investigates conversational speech recognition and distributed speech synthesis. We evaluated the performance of semantic language models for conversational speech recognition. To find out which synthesis method is best for our applications, we conducted a user study.
MONA UIML
The User Interface Markup Language (UIML) is an abstract meta-language for a canonical XML representation of any user interface. UIML itself does not define any device- or toolkit-specific tags. It introduces a basic set of tags for defining the structure of a user interface. In order to use UIML, one must first define a vocabulary.
A number of such vocabularies have been defined. However, many of them are limited in that they are designed for a single target technology (e.g. Java AWT/Swing, HTML, WML, VoiceXML). This contradicts the original concept of device-independent authoring, since the author needs to know both UIML and the target language. Another limitation shared by the vocabularies we encountered is that they assume a strict one-to-one mapping of abstract user interface elements to elements of the target markup language. Practical experience within the MONA project has shown that this is insufficient. A third weak point of current UIML vocabularies (and of other generic user interface markup language approaches) is poor control over layout and graphical appearance. The need for such control conflicts with the requirement of device independence; our vocabulary’s layout model addresses this conflict.
Rendering Engine
The rendering engine converts a UIML page to a format that the device at hand (e.g. a PocketPC PDA or a P800 smartphone) can display in a sensible way. The main challenge is to maintain a consistent look and feel and the full functionality of an application across a variety of devices. Applications supply priority information for the elements of a user interface, and the rendering engine uses it to adapt the interface to small-screen devices on which not all the information can be displayed. On small-screen WAP phones, the user interface can also be split into several WAP cards.
The rendering system accounts for the set of HTML tags and JavaScript commands that each device can handle, so the application programmer needs to write only one presentation of the user interface.
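The priority-based adaptation described in this section can be sketched as follows. This is a minimal illustration and not the MONA implementation: the `Element` class, the `split_into_cards` function, the convention that a larger number means a higher priority, and the measurement of screen capacity in display lines are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """A UI element with an application-supplied priority (hypothetical model)."""
    name: str
    priority: int  # assumed convention: larger value = more important
    lines: int     # display lines the element needs on the target device


def split_into_cards(elements: List[Element], lines_per_card: int) -> List[List[Element]]:
    """Order elements by descending priority and pack them greedily into cards."""
    # sorted() is stable, so elements with equal priority keep their authoring order.
    ordered = sorted(elements, key=lambda e: -e.priority)
    cards: List[List[Element]] = []
    current: List[Element] = []
    used = 0
    for element in ordered:
        # Start a new card when the next element would overflow the current one.
        if current and used + element.lines > lines_per_card:
            cards.append(current)
            current, used = [], 0
        current.append(element)
        used += element.lines
    if current:
        cards.append(current)
    return cards


ui = [
    Element("banner", priority=1, lines=2),
    Element("login_form", priority=9, lines=3),
    Element("news", priority=3, lines=4),
    Element("menu", priority=7, lines=2),
]
cards = split_into_cards(ui, lines_per_card=5)
for number, card in enumerate(cards, start=1):
    print(number, [e.name for e in card])
```

With a capacity of five lines, the two highest-priority elements share the first card and the remaining elements follow on later cards. A real renderer would also have to decide which low-priority elements to drop altogether.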
Conversational Speech Recognition
The recognition of conversational speech is a hard problem. Semantic relatedness measures can improve speech recognition performance by exploiting contextual information. The standard n-gram approach to language modeling for speech recognition cannot cope with long-distance dependencies. It has therefore been proposed to combine n-gram language models, which are effective at predicting local dependencies, with latent semantic analysis for long-distance dependencies. WordNet-based semantic relatedness measures can also be used for word prediction based on long-distance dependencies.
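One common way to combine a local model with a long-range one is linear interpolation of their word probabilities. The sketch below assumes this scheme purely for illustration; the text does not state which combination method was used, and the function name, the weight, and the toy probabilities are all hypothetical.

```python
from typing import Dict


def interpolate(p_ngram: float, p_semantic: float, weight: float) -> float:
    """Linearly mix a local n-gram probability with a long-range semantic one."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_ngram + (1.0 - weight) * p_semantic


# Toy distributions over three candidate words (invented numbers).
ngram_probs: Dict[str, float] = {"bank": 0.3, "band": 0.5, "bang": 0.2}
semantic_probs: Dict[str, float] = {"bank": 0.7, "band": 0.2, "bang": 0.1}

combined = {
    word: interpolate(ngram_probs[word], semantic_probs[word], weight=0.5)
    for word in ngram_probs
}
best = max(combined, key=combined.get)
print(best, combined)
```

Because both inputs are proper distributions and the weights sum to one, the combined scores again sum to one. In this toy case the semantic evidence overrides the n-gram model's preference for "band".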
The performance of eight WordNet-based semantic similarity/relatedness measures for word prediction in conversational speech was evaluated. We give a ranking of the different measures which shows that the performance of the measures differs significantly for noun and verb prediction. We also varied the dialog context and used cross part-of-speech comparison.
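To illustrate how a taxonomy-based measure can rank candidate words against the dialog context, the sketch below uses a path-length measure, 1 / (1 + shortest path length), over a tiny hand-made is-a hierarchy. The hierarchy is a stand-in for WordNet and contains no real WordNet data, and the path-length measure is an assumption chosen for simplicity; the text does not list the eight measures that were evaluated.

```python
from typing import Dict, List, Optional, Tuple

# Toy is-a hierarchy (child -> parent); invented for this example.
HYPERNYM: Dict[str, str] = {
    "train": "vehicle", "bus": "vehicle", "vehicle": "artifact",
    "ticket": "document", "document": "artifact",
    "apple": "fruit", "fruit": "food",
    "artifact": "entity", "food": "entity",
}


def ancestors(word: str) -> List[str]:
    """Return the chain from a word up to the root, starting with the word itself."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain


def path_length(a: str, b: str) -> Optional[int]:
    """Number of edges between a and b via their lowest common ancestor."""
    depth_in_b = {node: steps for steps, node in enumerate(ancestors(b))}
    for steps, node in enumerate(ancestors(a)):
        if node in depth_in_b:
            return steps + depth_in_b[node]
    return None  # no common ancestor in the toy hierarchy


def relatedness(a: str, b: str) -> float:
    """Path-based score in (0, 1]; 0.0 when the words are unconnected."""
    length = path_length(a, b)
    return 0.0 if length is None else 1.0 / (1.0 + length)


def rank(context: List[str], candidates: List[str]) -> List[Tuple[str, float]]:
    """Rank candidates by their mean relatedness to the context words."""
    scored = [
        (word, sum(relatedness(word, c) for c in context) / len(context))
        for word in candidates
    ]
    return sorted(scored, key=lambda pair: -pair[1])


ranking = rank(context=["train", "bus"], candidates=["apple", "ticket"])
print(ranking)
```

Here "ticket" outranks "apple" because it shares the closer ancestor "artifact" with both context words.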
Distributed Speech Synthesis
For the user study on speech synthesis quality we used eight different sources. Each source was defined by a synthetic or natural voice played on a particular device (e.g. a PocketPC PDA or a Nokia 7650 smartphone), with or without an application-specific lexicon. The server-based unit selection voices were also used with different codecs. ITU-T Recommendation P.85 served as a guideline for the user study. The voices consisted of a natural voice, a server-based 16 kHz unit selection voice, and an embedded 8 kHz diphone-based voice with and without lexicon.
Our user study showed that the subjective quality ratings differed significantly across the sources. For objective comprehension, however, the only difference was between the synthetic voices as a group and the natural voice.
Future Work
Enhanced multimodal user interfaces: The MONA rendering engine can currently generate graphical as well as voice user interfaces. Both can be combined to form multimodal interfaces. However, a good multimodal user interface combines the two modalities so that they complement each other, instead of simply offering both in parallel. Future work will focus on improved rendering of true multimodal user interfaces for different combinations of input and output modalities.
Conversational Speech Recognition is supported by the MONA platform. We will test this type of speech recognition, based on statistical language models, with several models. Robust recognition and summarization methods will increase the accuracy and usability of Conversational Speech Recognition.
Interface migration: Moving the user interface to a different device with different capabilities poses no fundamental challenge to our server; it simply requires the user to log in on the new device. The presentation server needs to load the capabilities of the new device and forward subsequent UI descriptions to it.
Multi device interaction: We currently expect one user to have only one active device at a time, responsible for all modalities. As many people own and use several mobile devices (generally a phone and a PDA, both with screen and audio), it seems sensible to investigate how these devices can be used together.
Other modalities: While we currently do not support modalities other than voice and graphics, we expect handwriting recognition in particular to become an important input modality in the foreseeable future.
Partial UI Updates: In order to minimize user interruption, pushing complete pages should be avoided, as it tends to block the client interface for several seconds. In future versions of the MONA presentation server, an application may update single elements of the user interface using DOM mechanisms. It remains to be seen whether this simplifies application development and reduces bandwidth requirements.
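A partial update of this kind can be sketched with Python's standard `xml.dom.minidom` module. The page markup, the `id` attributes, and the helper names `find_by_id` and `update_text` are hypothetical and serve only to show how a single element is changed while the rest of the document stays untouched; the actual interface of the MONA presentation server may differ.

```python
from typing import Optional
from xml.dom import minidom

# Hypothetical UI description; not actual MONA or UIML markup.
PAGE = """<page>
  <label id="status">Connecting</label>
  <list id="results"><item>first</item></list>
</page>"""


def find_by_id(document: minidom.Document, element_id: str) -> Optional[minidom.Element]:
    """Return the first element whose id attribute matches, or None."""
    for node in document.getElementsByTagName("*"):
        if node.getAttribute("id") == element_id:
            return node
    return None


def update_text(document: minidom.Document, element_id: str, new_text: str) -> bool:
    """Replace the content of one element; report whether it was found."""
    node = find_by_id(document, element_id)
    if node is None:
        return False
    # Drop the existing children, then attach one fresh text node.
    while node.firstChild is not None:
        node.removeChild(node.firstChild)
    node.appendChild(document.createTextNode(new_text))
    return True


document = minidom.parseString(PAGE)
changed = update_text(document, "status", "Connected")
print(changed, find_by_id(document, "status").toxml())
```

Only the `status` label changes; the `results` list is left as it was, which is the property that would let a server send a small update instead of a complete page.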