Thank you, Mr. Chairman. Ladies and gentlemen, good morning. We present to you a work on the definition of a methodology for evaluating human-machine spoken language interaction. Given a spoken language system, we intend to characterize its use and its performance along several dimensions. Our target is the definition of these dimensions.
We think that an assessment method must provide some measurable items, measurement procedures, figures of merit in which the measurable items are combined in order to stress some aspects of the assessment, figure-computing procedures, and result-supporting protocols. In the literature it is possible to find some reviews of what the scientific world expects from assessment efforts. We also think that in working on assessment, in particular on the assessment of assessment methods, the availability of systems or of their models is essential for an iterative specification of the assessment method.
We are looking for an alternative to the availability of real systems, which is never easy. The simulation of well-specified models can be a solution. In particular, we like to use a well-known simulation technique, the Wizard of Oz technique. We used this technique to simulate a phone directory application whose design is completely specified from a functional point of view. All modules are designed, and the modular architecture of the system is partially implemented.
Just one module is essentially not realized: it is the spontaneous speech recognizer, of course. The task of the Wizard in our experiment is to connect to the telephone network, to receive the spoken user query by telephone, to listen to it, to transcribe it, and to input it on the keyboard of the computer system. To clarify the Wizard setting, we can say that he must quickly type at the computer prompt what he hears after the beep. No constraint is applied to the user condition capability, and no correction is required. In one particular phase of our experiment, a sub-experiment, the Wizard has another task: he must improve the naturalness of the computer response. He must just propose it in a more natural order for humans.
This is a very short description of the setting of the experimental apparatus. The Wizard receives the input from the telephone line. He transcribes the spoken user message on the keyboard into the PC. Here, a Prolog information system queries by keyword the database of phone and fax users in our company, generates a response, and sends this answer to the monitor and to the text-to-speech module. In one phase of the experiment, the system sends the synthesizer's answer directly to the telephone line and then to the user. In the second part of the experiment, the Wizard reads the computer answer from the display, gives a paraphrase of it, and sends it to the TTS module; then the path can continue.
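The data path described here (transcribed query, keyword lookup in the directory, response, optional Wizard paraphrase, then text to speech) can be illustrated with a minimal Python sketch. It is not the Prolog system used in the experiment: the directory entries, the keyword table, and the function names `answer_query` and `handle_turn` are all hypothetical and exist only to show the flow.

```python
from typing import Callable, Optional

# Hypothetical directory entries; the real company database is not shown in the talk.
DIRECTORY = [
    {"name": "rossi", "department": "first", "phone": "1234", "fax": "5678"},
    {"name": "chiari", "department": "second", "phone": "2345", "fax": "6789"},
]

# Hypothetical keyword table mapping words in the query to database fields.
FIELD_KEYWORDS = {
    "phone": "phone",
    "telephone": "phone",
    "fax": "fax",
    "department": "department",
}


def answer_query(transcribed: str) -> str:
    """Keyword lookup: find a known name and a requested field in the typed text."""
    words = [w.strip("?.,!").lower() for w in transcribed.split()]
    entry = next((e for e in DIRECTORY if e["name"] in words), None)
    field = next((FIELD_KEYWORDS[w] for w in words if w in FIELD_KEYWORDS), None)
    if entry is None or field is None:
        return "Sorry, I did not understand the request."
    return f"{entry['name'].title()} {field}: {entry[field]}"


def handle_turn(
    transcribed: str,
    paraphrase: Optional[Callable[[str], str]] = None,
) -> str:
    """Return the text that would be handed to the text-to-speech module.

    With paraphrase=None the system answer goes straight to TTS (automatic
    generation); otherwise the Wizard's paraphrase is applied first
    (natural generation).
    """
    response = answer_query(transcribed)
    if paraphrase is not None:
        response = paraphrase(response)
    return response
```

For example, `handle_turn("What is the telephone number of Mr. Rossi?")` follows the automatic path, while passing a `paraphrase` function stands in for the Wizard rewording the answer before synthesis.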
Now, the main targets of our experiment. We want to define an assessment methodology, to collect a database of sentences used by users in spoken interaction with a computer-based information system, and we would also like to study some aspects of human factors in human-machine interaction. In particular, we think it is important to evaluate some figures of merit that characterize not only the system performance but also its acceptance and usability by the users. Of course, in my talk the central theme is the assessment methodology.
We distinguish three aspects in the assessment. Global evaluation, in which we want to evaluate the performance of the complex system constituted by the system itself and the user; they must cooperate in order to solve some problem. We want to evaluate the system itself, its performance. Finally, we want to evaluate the use of the system by the user, that is, the performance of the user.
In our experiment we involved fifty-four users from an office environment, aged between twenty and forty. We distinguish four groups: two big groups of twenty-seven people, distinguished by generation modality. For the automatic generation group, the response of the software system is sent directly to the TTS and then to the user. For the natural generation group, before production by the synthesizer and sending to the user, the computer's message is paraphrased by the Wizard. Each group is once more divided into two groups.
The first group did the experiment just once; the second group repeated the experiment after ten days. Every user receives a set of seven tasks, called a scheda or scenario. Each task is represented by one table, partially filled and partially empty. For example, this task is: find the telephone number of Mr. Rossi. We give the task in this form in order not to influence the grammatical form of the sentence that the user prefers to produce to explain the task to the computer. Another example: this is a confirmation task. The whole table is filled, and the information can be true or false. The task is to recognize whether the information is true. A possible question for this task is: does Mr. Chiari work in the first department? Every user also receives a questionnaire in which he can express his feelings about the experiment and about his acceptance of the system. For example, one question is: do you think an automatic information system like this will be well accepted by the public?
The evaluation. For the global evaluation, that is, the evaluation of the complex system constituted by the user and the information system, we evaluated these items. Scenario completion, that is, how many times the user completed the seven assigned tasks. Task correctness, that is, how many correct responses are written by the user on the scenario form. Communication drop, how many times the connection between the user and the system is dropped for any reason.
Interaction time, how much time the user spends to complete all the tasks of the scenario. Here are some results. Just one difference is shown to be significant, for the interaction time: in test two, the second repetition, the time is significantly shorter than in test one.
For the system evaluation, and in particular I would like to stress this part of the experiment in my talk, we have obtained measures for three items at the moment. The performance of the keyword-based natural language system for querying the database. A detailed analysis of the errors of the subsystem is produced too. Finally, the intelligibility of the TTS synthesizer is measured.
With regard to the performance evaluation, a correctness judgment from the grammatical and pragmatic points of view was expressed for each query-answer pair by two human evaluators. The measures expressed by the two experts were not significantly different, and they were around point seven. Some interest can lie in checking the consistency of human judgment; we find results quite in accord with other cases presented in the literature on the use of human evaluation. A more analytical interpretation of the results obtained by the performance evaluation shows that errors are concentrated in specific tasks. Essentially, system errors can be subdivided into two main groups.
One group of errors occurred when the system did not identify and/or misinterpreted one or more keywords in the user question. The other group occurred when the system failed to access the database. The analysis of errors is very important for design purposes, because on the basis of this analysis it is possible to foresee some improvements of the system. In particular, for access, we found it is easy to correct the access procedure. It is interesting that, for keyword interpretation, the users seem to try some recovery strategy.
Finally, the items for the user evaluation: task comprehension, that is, whether subjects formulated pertinent questions with respect to the assigned task; the number of calls per user; the number of words per sentence and per user; how many times users spoke before the acoustic prompt; and after how many calls they interrupted the interaction without listening to the final system formalities, such as goodbye, thank you for calling, and so on.
It is our feeling that the use of simulation in assessment experiments is a very productive approach to the assessment of the assessment, because of the possibility of many qualitative observations. In the paper in the proceedings you can find several qualitative observations, each of which can become a working hypothesis for further development in assessment definition. Here is an example. If we analyze the lexical complexity, that is, simply the number of words used by users
in any session, we see no really important difference between the groups with and without repetition. On the other hand, as I said before for the global evaluation, we observed a significant difference in the analysis of the interaction length between the first and the second repetition. This result can, for the moment, just suggest that with experience users become able to have a more rapid interaction with the system, but the language used is equally rich.
Okay, the conclusion. The conclusion of the work is that some evaluation dimensions for the assessment of spoken language interaction between human and machine have been identified. In particular, the performance assessment of a spoken language system is not enough: in order to characterize its use by the user, it is important to evaluate explicitly the satisfaction of the user and the problem-solving capability in the cooperation between system and user. Another conclusion is that the use of simulation in assessment work is very productive and can be the source of further development, and it can be seen as a methodological approach, like the general use of simulation in system design. So thank you very much.
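The four global evaluation items named in the talk (scenario completion, task correctness, communication drop, interaction time) could be computed from per-session records roughly as in the sketch below. The talk only names the items; the `Session` record, its field names, and the choice to report fractions and means are assumptions made for illustration, and no figures from the experiment are reproduced.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

TASKS_PER_SCENARIO = 7  # each user receives a scenario of seven tasks


@dataclass
class Session:
    """Hypothetical log record for one user running one scenario."""
    user: str
    test: int             # 1 = first run, 2 = repetition after ten days
    tasks_completed: int  # tasks the user carried through to the end
    correct_answers: int  # correct responses written on the scenario form
    drops: int            # times the connection was dropped
    seconds: float        # total interaction time for the scenario


def summarize(sessions: List[Session]) -> Dict[str, float]:
    """Aggregate the four global evaluation items over a set of sessions."""
    if not sessions:
        raise ValueError("no sessions to summarize")
    n = len(sessions)
    return {
        # fraction of sessions in which all seven tasks were completed
        "scenario_completion": sum(
            s.tasks_completed == TASKS_PER_SCENARIO for s in sessions
        ) / n,
        # mean fraction of correct responses per scenario
        "task_correctness": mean(
            s.correct_answers / TASKS_PER_SCENARIO for s in sessions
        ),
        # mean number of dropped connections per session
        "drops_per_session": mean(s.drops for s in sessions),
        # mean time, in seconds, to work through the scenario
        "mean_interaction_time": mean(s.seconds for s in sessions),
    }
```

Calling `summarize` separately on the sessions with `test == 1` and `test == 2` would give the kind of first-run versus repetition comparison mentioned for interaction time; a significance test would still have to be chosen and applied on top of these summaries.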