<utf dtd_version="utf-1.0" audio_filename="ad43s400" language="English" version="1" version_date="01-May-01">
<bn_episode_trans program="unknown" air_date="unknown">

<section type="nontrans" startTime="0.0" endTime="1.991688">
</section>

<section type="report" startTime="1.991688" endTime="982.505688">
<turn speaker="Andrea_Di_Carlo" spkrType="male" startTime="1.991688" endTime="982.505688">
 {breath Thank you Mr. Chairman.
<time sec="3.774750">
 {breath Ladies and gentlemen, good morning.
<time sec="6.569312">
 {breath We present to you
<time sec="8.803">
 a work on the definition of a meth<fragment> a methodology for
<time sec="12.898875">
 {breath evalu<fragment> evaluating a human machine spoken language interaction.
<time sec="17.093">
 [background] {breath
<time sec="22.554750">
 Given a spoken language system {breath
<time sec="25.109500">
 {breath we intend to characterize
<time sec="27.966625">
 its use and i<fragment> its performances along several dimensions.
<time sec="32.332">
 {breath Our target
<time sec="34.108438">
 is the definition of these dimensions.
<time sec="37.153312">
 [background] {breath
<time sec="40.542250">
 [background] We think, {breath
<time sec="42.737">
 we think that {breath
<time sec="44.379562">
 {lipsmack an assessment method must provide some
<time sec="47.174125">
 measurable items {breath
<time sec="49.037">
 measurement procedures
<time sec="50.583937">
 {breath figures of merit in which the measurable l<fragment> items are combined
<time sec="56.600625">
 {breath in order to stress some aspects of the assessment.
<time sec="61.376438">
 {breath Figures {breath
<time sec="63.457">
 computing procedures
<time sec="65.599563">
 {breath result supporting protocols.
<time sec="68.046">
 {breath In the literature it is possible to find some review on expectation from assessment efforts
<time sec="74.890500">
 {breath in the scientific world.
<time sec="76.913438">
 [background] {breath
<time sec="80.729938">
 We think also that {breath
<time sec="83.493250">
 in working e<fragment> on the assessment, particular {breath
<time sec="86.821">
 on the assessment of assessment method,
<time sec="89.051125">
 {breath the availability of systems
<time sec="92.440062">
 {breath of u<fragment> or of their models is essential
<time sec="97.549500">
 {breath for an {breath
<time sec="99.371">
 iterative specification of assessment method. {lipsmack
<time sec="103.180375">
 {breath We are looking for an alternative to the
<time sec="106.788313">
 {breath availability of real systems.
<time sec="109.172">
 It's never {breath
<time sec="111.155">
 easy
<time sec="111.723">
 {lipsmack to the simulation of well specified models {breath
<time sec="116.062">
 can be of solution.
<time sec="117.429">
 In particular
<time sec="118.575">
 {breath [background] we like use a well known
<time sec="121.272188">
 simulation technique, the wizar<fragment> [background] the Wizard of Oz technique.
<time sec="125.849812">
 {lipsmack {breath We used this technique for simulating the phone directory application
<time sec="132.043813">
 for which the project is completely specified from
<time sec="135.787250">
 a func<fragment> a functional point of view.
<time sec="138.070938">
 {breath All modules are projected,
<time sec="140.573500">
 and the modular architecture of the system is {breath partially
<time sec="144.869625">
 partially implemented.
<time sec="146.923875">
 {breath Just [background] one module is essentially not realized. [background]
<time sec="151.850">
 {breath It is the spontaneous speech recognizer of course.
<time sec="155.505750">
 {breath %uh The task of the Wizard
<time sec="159.009375">
 {breath {lipsmack in our experiment is to receive,
<time sec="163.149125">
 to [background] connect to connect the telephone the telephone nam<fragment> n<fragment> network, {breath
<time sec="167.998">
 is to receive by telephone the spoken user query,
<time sec="171.803937">
 {lipsmack {breath to
<time sec="174.040">
 listen to it,
<time sec="176.381625">
 {breath to transcribe it and to input it on to {breath
<time sec="180.281563">
 keyboard of the computer system.
<time sec="182.440063">
 {breath In order to clarify
<time sec="184.181437">
 {breath the Wizard setting, we can say it must typewrite quickly
<time sec="189.363937">
 at the computer prompt {breath
<time sec="191.221">
 what he listen to {breath
<time sec="192.691">
 after the beep.
<time sec="194.223125">
 {breath Any constraint is not applied to user condition capability,
<time sec="198.853000">
 {breath and non<hyphen>correction is required.
<time sec="201.605813">
 {breath In one particular phase of our experiment, in %uh a sub experiment,
<time sec="207.549500">
 {breath the Wizard has another task.
<time sec="210.229375">
 He must improve
<time sec="211.628">
 the naturalness of the computer response. {breath
<time sec="214.984375">
 {lipsmack He must just propose
<time sec="216.631937">
 {breath it in more natural order for humans.
<time sec="221.793563">
 [background] {breath
<time sec="224.942625">
 This is a very short description {breath
<time sec="229.030250">
 {breath of the setting of the experimental apparatus.
<time sec="232.732000">
 {lipsmack {breath The Wizard
<time sec="234.984375">
 {lipsmack receive
<time sec="236.267">
 {breath the input from the telephonic line. {breath
<time sec="240.583937">
 {lipsmack {breath He transcribe {breath
<time sec="242.656">
 the spoken user message
<time sec="244.608937">
 on to the keyboard into the _P_C.
<time sec="248.435875">
 {breath Here, a ^Prolog information system {breath
<time sec="252.492188">
 query by keyword
<time sec="254.546375">
 {breath the database of phone and fax users in our company,
<time sec="259.009375">
 {breath generates a response
<time sec="261.418125">
 and send this answer
<time sec="263.263812">
 {breath to the monitor and to the text to speech module.
<time sec="267.997937">
 {breath In one phase of the experiment the system send directly
<time sec="271.970813">
 {breath the +synthesizer's answer
<time sec="275.109500">
 to the telephonic line then {breath
<time sec="277.486938">
 to the user.
<time sec="278.383">
 {lipsmack {breath In the second part of the experiment, the Wizard gives a paraphrase
<time sec="283.117">
 of the computer.
<time sec="284.452562">
 He read from the display and give
<time sec="287.518250">
 {breath give %uh gives a paraphrase of the computer answer and
<time sec="291.345125">
 {breath send it to the
<time sec="293.388937">
 t<fragment> _T_T_S module then the path can continue.
<time sec="296.721">
 [background]
<time sec="299.817">
 [background/] Of co<fragment> %uh [/background]
<time sec="303.201250">
 [background] {lipsmack {breath Now main targets of our experiment
<time sec="306.955188">
 {lipsmack {breath we want to define assessment methodology {lipsmack
<time sec="310.740375">
 to collect
<time sec="312.123">
 a database
<time sec="313.325">
 about sentences used by users
<time sec="316.089688">
 in the spoken interaction with the computer based information system
<time sec="320.271125">
 {breath and we like also study some aspect of human factors
<time sec="324.619375">
 {breath in human machine interaction.
<time sec="326.903000">
 {breath In particular {breath
<time sec="328.650">
 we think that it is important
<time sec="331.019">
 evaluate some figure of merit for characterization of
<time sec="334.640250">
 {breath not only the system performance,
<time sec="337.549500">
 {breath but the *acceptation and {breath
<time sec="340.583937">
 usability by the users too.
<time sec="342.888438">
 {lipsmack {breath Of course in my talk [background]
<time sec="344.555313">
 {breath [background] the central theme is the assessment methodology.
<time sec="347.687625">
 [background] {lipsmack {breath We
<time sec="348.893250">
 [background] {breath
<time sec="350.837250">
 distinguish {breath in the assessment, three aspects.
<time sec="354.079125">
 Global evaluation, in which we want to evaluate
<time sec="357.274875">
 the performance {breath
<time sec="358.728563">
 of the complex system constituted by
<time sec="361.647438">
 the system itself and the user.
<time sec="363.851000">
 {breath They must cooperate
<time sec="365.812313">
 in order to solve some problem.
<time sec="368.194687">
 {breath We want to evaluate %uh the system itself,
<time sec="371.586563">
 its performance
<time sec="372.942187">
<b_unclear>
 At end
<e_unclear>
 we want
<time sec="374.188188">
 {breath to evaluate the use of the system by the user,
<time sec="377.152">
 that is %uh performances of the user.
<time sec="380.348937">
 [background] {lipsmack
<time sec="382.292937">
 {breath In our experiment, [background]
<time sec="383.827313">
 we involve fifty<hyphen>four users {breath
<time sec="386.428938">
 {breath from bureaucratic environment,
<time sec="388.753625">
 aged between {breath
<time sec="390.691875">
 twenty and forty.
<time sec="392.134000">
 {breath We distinguish
<time sec="393.806875">
 four group, four groups, two big group
<time sec="397.025688">
 of twenty<hyphen>seven peoples distinguished generation modalities.
<time sec="400.711750">
 {breath For the
<time sec="402.130812">
 {breath automatic generation group
<time sec="404.392063">
 the response of the software system {breath
<time sec="406.803313">
 is directly sent to the _T_T_S and then {breath sent to the user.
<time sec="410.945125">
 {breath For the natural generation group before the production by the synthesizer and the sending to the user,
<time sec="416.557875">
 {breath the message of computer
<time sec="418.149938">
 {breath is paraphrased by the Wizard.
<time sec="420.688125">
 {lipsmack {breath Each group it's once more
<time sec="423.549312">
 {breath divided in two groups. The first group doing the experiment just one time {breath
<time sec="428.642875">
 {breath the second group repeat the experiment
<time sec="431.336750">
 after ten days.
<time sec="432.452">
 [background] {lipsmack {breath
<time sec="436.130438">
 {breath Every user
<time sec="437.509063">
 [background]
<time sec="438.893500">
 {lipsmack {breath receive a set of seven task
<time sec="441.702812">
 {breath called scheda
<time sec="443.629437">
 or scenario.
<time sec="444.887000">
 {breath Each task is represented by
<time sec="448.030813">
 one table, each task one table
<time sec="451.076625">
 {breath %uh partially filled, partially empty.
<time sec="454.537688">
 {breath %uh For example,
<time sec="456.625938">
 this task is the task, %uh
<time sec="459.042937">
 {breath find the telephone number of Mr. ^Rossi.
<time sec="462.186750">
 {lipsmack {breath And we give this form of the task
<time sec="465.146000">
 in order not to influence
<time sec="467.407250">
 {breath the grammatical form
<time sec="469.103187">
 of the sentence that the user prefers
<time sec="471.704813">
 {breath produced to explain this task to the computer.
<time sec="475.350500">
 {lipsmack Another example, this is a confirmation task
<time sec="478.805813">
 {breath example. {lipsmack All the table is full,
<time sec="482.197687">
 {lipsmack and then the information can be {breath
<time sec="484.955062">
 really true or false. The task is to recognize if the information is true. {breath
<time sec="490.198625">
 Then the possible question {breath f<fragment>
<time sec="492.898312">
 is for this task %uh
<time sec="494.415437">
 {breath %uh does Mr. ^Chiari work in the first department?
<time sec="499.341750">
 [background]
<time sec="501.614500">
 {breath %um {cough Pardon {breath
<time sec="506.396625">
 {lipsmack {breath Every user receives also questionnaire
<time sec="509.361625">
 in which i<fragment> he can explain his feeling
<time sec="512.603500">
 {breath about e<fragment> the experiment,
<time sec="514.755188">
 and about his *acceptation of the system. {breath
<time sec="517.743250">
 {breath For example,
<time sec="519.329625">
 a question is %uh
<time sec="520.933250">
 {breath do you think an automatic information system like this
<time sec="524.977000">
 {breath will be well accepted by the public?
<time sec="528.859187">
 {breath [background] {breath
<time sec="533.923937">
 [background] n<fragment> The evaluation.
<time sec="536.467875">
 {breath %uh
<time sec="537.881125">
 For the global evaluation, that is for the evaluation of the complex system constituted by the user and the information
<time sec="543.783">
 {breath system,
<time sec="544.801">
 {breath we evaluated th<fragment>
<time sec="546.555">
 these items.
<time sec="548.154875">
 The scenario completion,
<time sec="549.960375">
 {breath that is how many time the user completed assigned seven tasks.
<time sec="555.255875">
 Task correctness, {breath
<time sec="556.784">
 that is how many correct responses are +written {breath
<time sec="561.012875">
 by the user on the scenario form.
<time sec="563.741375">
 {lipsmack The communication drop,
<time sec="565.655">
 how many times the connection between the user and the system is dropped for {breath
<time sec="569.923">
 any reason. {breath
<time sec="571.298125">
 Interaction time,
<time sec="572.676750">
 how much time the user spend to comple<fragment>
<time sec="575.503375">
 to complete all task
<time sec="577.049313">
 of the scenario.
<time sec="578.664500">
 {breath Here [background] some results.
<time sec="583.169688">
 [background] {breath
<time sec="583.919625">
 %uh Just one difference
<time sec="586.169313">
 is significantly shown
<time sec="588.440">
 {breath %uh for for the interaction time.
<time sec="592.035875">
 {breath In %uh test two, in the second repetition,
<time sec="595.479687">
 {breath the time is significantly shorter
<time sec="599.436875">
 than in the test one.
<time sec="601.646250">
 [background] {breath [background]
<time sec="605.769">
 [background] For the system evaluation,
<time sec="608.216563">
 in practical I like stress this part of the experiment in my talk,
<time sec="612.289125">
 {breath we obtained at the moment measure for three items {lipsmack {breath
<time sec="616.514">
 at the moment.
<time sec="617.179">
 {breath %uh The performance of the keyword based natural language system for the interrogation the database.
<time sec="625.141375">
 {breath {lipsmack A detailed analysis of errors of the subsystem is produced too.
<time sec="630.909875">
 {breath At end, the intelligibility of the _T_T_S synthesizer is measured. [background]
<time sec="636.470750">
 [background]
<time sec="637.959000">
 {breath With regard to the {breath
<time sec="640.659">
 performance evaluation, {breath
<time sec="642.481500">
 {breath a correctness judgment from grammatical and pragmatical point of view
<time sec="646.796375">
 {breath were expressed for each
<time sec="648.999937">
 {breath query answer pair
<time sec="651.284250">
 {breath by two human evalua<fragment> evaluator.
<time sec="656.106750">
 {lipsmack A measure expressed by two experts
<time sec="659.106375">
 [background/] {breath were not significantly different [/background]
<time sec="661.921375">
 {breath and %uh
<time sec="663.530">
 they %uh
<time sec="664.824">
 were
<time sec="665.567125">
 %uh around point seven. {breath
<time sec="668.682063">
 [background]
<time sec="670.839500">
 {lipsmack {breath %uh Some interest [background]
<time sec="673.666062">
 [background] {breath
<time sec="676.481125">
 {breath some interest %uh can be in the control
<time sec="681.834313">
 of the consistency of human judgment,
<time sec="684.348">
 {breath we find result quite e<fragment> in accord with
<time sec="687.914313">
 other cases presented in {breath
<time sec="690.089">
 the literature {breath
<time sec="691.190812">
 in the use of human evaluation.
<time sec="694.790375">
 [background] {breath
<time sec="696.890125">
 More {breath
<time sec="698.067750">
 [background] more analytical
<time sec="699.662063">
 [background]
<time sec="702.233500">
 {lipsmack {breath more analytical interpretation of results
<time sec="704.818937">
 obtained by the performance -s {breath
<time sec="707.140">
 evaluation shows {breath
<time sec="708.820">
 that {breath
<time sec="709.667">
 errors
<time sec="710.700500">
 are concentrated in specific task e<fragment>
<time sec="713.070875">
 Essentially system errors can be subdivided
<time sec="715.834000">
 {breath in two main groups.
<time sec="718.653250">
 {breath A group,
<time sec="720.266188">
 a group of error
<time sec="721.911938">
 {breath %um
<time sec="723.080750">
 occurred when the system didn't identify
<time sec="726.316063">
 and or misinterpreted one or more
<time sec="728.915562">
 keywords in the user question. {breath
<time sec="731.267250">
 {breath The other group
<time sec="733.488000">
 the other group occurred
<time sec="735.147750">
 {breath when the system failed to access the database.
<time sec="739.234000">
 {breath The analysis of error is very important for the
<time sec="743.390375">
 design purposes
<time sec="744.671438">
 {breath because on the basis of these analysis
<time sec="747.471937">
 {breath it is possible to foresee
<time sec="749.178437">
 {breath some improvement of the system.
<time sec="751.469313">
 {breath In particular for access
<time sec="753.980000">
 it's %uh
<time sec="755.045938">
 the %um we found it's
<time sec="757.364937">
 easy to correct the
<time sec="759.693250">
 access procedure. {breath
<time sec="761.184625">
 m<fragment> and for e<fragment> the the
<time sec="763.354000">
 It's interesting that
<time sec="765.219438">
 for keyword interpretation
<time sec="767.141000">
 {breath %uh
<time sec="768.467">
 the user
<time sec="769.642313">
 seem {breath %uh m<fragment> the user try some recovery strategy. {lipsmack
<time sec="775.257375">
 {breath [background]
<time sec="779.661562">
 {lipsmack {breath Well, at end, task comprehension, that is if subject formulated pertinent questions
<time sec="788.497938">
 {breath with respect to assigned task,
<time sec="790.779437">
 {breath %uh number of calls per user, number of words per sentence and per user,
<time sec="796.670375">
 {breath how many times
<time sec="797.942063">
 user {breath
<time sec="799.325938">
 spoke before the acoustical prompt,
<time sec="802.449062">
 after how many calls they
<time sec="804.627750">
 {breath interrupted interaction
<time sec="806.684938">
 {breath without listening to the
<time sec="808.760750">
 {breath %uh system f<fragment> %uh the final system formalities such %uh
<time sec="813.389312">
 {breath such as a {breath
<time sec="814.590">
 %uh goodbye, %uh thank you for calling, and so on,
<time sec="817.653250">
 {breath well, items for %uh for the user evaluation.
<time sec="822.085438">
 [background] {breath
<time sec="826.100">
 [background] %um
<time sec="826.899">
 [background]
<time sec="830.931125">
 It is our feeling
<time sec="833.279">
 that the use
<time sec="834.559188">
 of the simulation in assessment experiment
<time sec="837.439188">
 {breath is a very productive approach to assess
<time sec="840.927000">
 to the assessment to the to the assessment of the assessment
<time sec="845.060000">
 {breath for the possibility of many qualitative observations.
<time sec="848.697375">
 {breath In the paper in the proced<fragment>
<time sec="850.922812">
 {breath -ding a way pres<fragment> we present we we %uh you can fi<fragment> %uh find several qualitative observation
<time sec="857.290625">
 {breath each of which can become {breath
<time sec="859.525437">
 a working
<time sec="860.928063">
 hypothesis for more development
<time sec="863.218938">
<b_unclear>
<e_unclear>
<time sec="863.482">
 development in assessment definition.
<time sec="865.930625">
 {breath Here is an example {breath
<time sec="867.911">
 %uh
<time sec="868.258938">
 {lipsmack If we analyze the lexical complexity
<time sec="870.995">
 that is simply the number of word used by users
<time sec="875.337375">
 {breath in any session
<time sec="876.805438">
 {breath %uh {lipsmack {breath
<time sec="879.255312">
 %uh we see not real not really the important difference
<time sec="885.286500">
 {breath %uh
<time sec="886.925">
 between between groups with
<time sec="890.130125">
 and without %uh repetition.
<time sec="893.215812">
 {breath %uh On the other hand,
<time sec="895.618938">
 {breath %uh as I said %uh before in %uh the global evaluation
<time sec="899.864125">
 {breath we observe it some *significative
<time sec="902.444938">
 {breath %uh difference in the analysis of the interaction length related to the first and the
<time sec="908.354500">
 second repetition. {breath
<time sec="910.580000">
 This result can suggest
<time sec="912.991">
 for the moment s<fragment> just suggest
<time sec="915.302063">
 {breath %uh that with experience
<time sec="917.948313">
 {breath users become able to have more rapid interaction
<time sec="922.586188">
 {breath with the system
<time sec="924.147750">
 but the used language
<time sec="926.597625">
 {breath is equally rich.
<time sec="929.935812">
 {breath I'll okay it's [background] conclusion
<time sec="933.320750">
 {breath [background] our conclusion of the works
<time sec="936.125937">
 {breath the work is that some evaluation dimension for the assessment
<time sec="939.782063">
 of the spoken language interaction
<time sec="941.577375">
 {breath between human and machine are individuated.
<time sec="944.064687">
 {breath In particular, the performance assessment of a spoken language system
<time sec="947.795562">
 {breath is not enough
<time sec="949.198188">
 {breath in order to characterize its use by the user it is important to evaluate
<time sec="953.957625">
 {breath explicitly the satisfaction
<time sec="956.388812">
 {breath of the user and the problem solving capability
<time sec="959.829875">
 in the cooperation between system and user.
<time sec="962.784688">
 {breath Another conclusion is the use of the simulation in assessment work
<time sec="967.347750">
 {breath is very productive and can be the source of
<time sec="970.611125">
 {breath further development of more working and
<time sec="973.687500">
 it can seen as a methodological approach like the general use of simulation in the system design.
<time sec="980.485937">
 [background] So thank you very much.
</turn>
</section>

<section type="nontrans" startTime="982.505688" endTime="983.347">
</section>
</bn_episode_trans>
</utf>
