A Multimodal Analysis of Floor Control in Meetings
Lei Chen1, Mary Harper1, Amy Franklin2, Travis R. Rose3, Irene Kimbara2, Zhongqiang Huang1, and Francis Quek3
1 School of Electrical Engineering, Purdue University, West Lafayette, IN, firstname.lastname@example.org, email@example.com
2 Department of Psychology, University of Chicago, Chicago, IL
3 CHCI, Department of Computer Science, Virginia Tech, Blacksburg, VA
Abstract. The participant in a human-to-human communication who controls the floor bears the burden of moving the communication process along. Change in control of the floor can happen through a number of mechanisms, including interruptions, delegation of the floor, and so on. This paper investigates floor control in multiparty meetings that are both audio and video taped; hence, we analyze patterns of speech (e.g., the use of discourse markers) and visual cues (e.g., eye gaze exchanges) that are often involved in floor control changes. Identifying who has control of the floor provides an important focus for information retrieval and summarization of meetings. Additionally, without understanding who has control of the floor, it is impossible to identify important events such as challenges for the floor. In this paper, we analyze multimodal cues related to floor control in two different meetings involving five participants each.
Meetings, which play an important role in daily life, tend to be guided by a clear set of principles about who should talk when. Even when multiple participants are involved, it is fairly uncommon for two people in a meeting to speak at the same time. An underlying, auto-regulatory mechanism known as ‘floor control’ enforces this tendency in human dialogs and meetings. Normally, only one participant is actively speaking; however, around floor control transitions, several participants may vie for the floor, and so overlapped speech can occur. The active speaker holds the floor, and the participants all compete for and cooperate to share the floor so that a natural and coherent conversation can be achieved.

By increasing our understanding of floor control in meetings, there is a potential to impact two active research areas: human-like conversational agent design and automatic meeting analysis. To support natural conversation between an embodied conversational agent and humans, it is important that those agents use human conversational principles related to the distribution of floor control so that they can speak with appropriate timing. Further, the same embodied cues (e.g., gesture, speech, and gaze) that are important for creating effective conversational agents are important for understanding floor control and how it contributes to revealing the topical flow and interaction patterns that emerge during meetings. In this paper, we investigate multimodal aspects of floor control in meetings.

Historically, researchers in conversational analysis have proposed several models to describe the distribution of floor control. Perhaps the most influential is the model by Sacks et al. A basic principle in this model is that a conversation is built on turn constructional units (TCUs), which are typically complete units with respect to intonation contours, syntax, and semantics. A TCU may be a complete sentence, a phrase, or just a word.
The completion of a TCU results in a transition relevance place (TRP), which raises the likelihood that another speaker can take over the floor and start speaking. Hearers are often able to predict the end of TCUs using various cues.
Most previous research on floor control coordination has focused on dialogs. A range of multimodal cues from syntax, prosody, gaze, and gesture have been shown to be relevant to turn-taking in dialogs [6, 1, 14, 7, 25, 17]. Syntactic completion and certain special expressions, like “you know”, are useful syntactic cues for turn change. Silent pauses, rises or falls in intonation, variation in speech rate, final lengthening, and other prosodic patterns are related to turn keeping or yielding [6, 14, 7, 25]. Deictic gestures can be used to yield the floor. As for gaze, during floor transitions there exist short periods of mutual gaze between two adjacent turn holders, followed by the next holder breaking this mutual gaze. These events, known as mutual gaze breaks, happen in around 42% of turn exchanges.

Recently, increasing attention has been given to multiparty meetings. For example, a simulation study on group discussions has been carried out to investigate turn-taking models of meetings [18, 19]. To support vibrant research on automatic meeting analysis, it is important that high-quality corpora are available to the research community. To this end, several audio or multimodal meeting corpora have been collected, including the ISL audio corpus from the Interactive Systems Laboratory (ISL) of CMU, the ICSI audio corpus, the NIST audio-visual corpus, and the AMI audio-visual corpus from Europe. With the availability of these data resources, researchers have begun to investigate the detection of various events in meetings using multimodal cues. During this process, some meeting events have been annotated; these annotation efforts are mostly focused on dialogue acts (DAs) [22, 10].

1.1 Our Focus
Floor control is an important aspect of human-to-human conversation, and it is likely that multimodal cues involving speech, gaze, and gesture play important roles in tracking floor control structure. However, for multiparty conversation, research on multimodal cues to floor control is still sparse. Fortunately, with the emergence of increasing amounts of multimodal meeting data, future progress is quite likely. To better support work related to floor control in meetings, we have attempted to adopt a nomenclature that is focused on only the essential aspects of this structure. Although there are tags that express the role of utterances in turn management, they do not completely cover the phenomena involved in floor control management. Hence, in this paper, we define a new floor control annotation scheme that builds on the notion of a sentence unit, and then use the annotations to identify multimodal cues of importance for predicting floor changes. We describe the audio/video meeting corpus used in our investigations in Section 2.1 and the annotations used for analysis in Section 2.2. In Section 2.3, we raise some questions related to floor control in meetings and present our quantitative results. In Section 3, preliminary conclusions based on our analysis of the data are presented.
2.1 Meeting Description

In this paper, we analyze two meetings from the VACE multimodal meeting corpus. This corpus was collected in a meeting room equipped with synchronized multichannel audio, video, and motion-tracking recording devices. In these recordings, participants (from 5 to 8 civilian, military, or mixed) engage in planning exercises. For each recorded meeting, we collected multichannel time-synchronized audio and video recordings. Using a series of audio and video processing techniques, we obtained word transcriptions and prosodic features, as well as 3D head, torso, and hand tracking traces from visual tracking and a Vicon motion capture system.
The two meetings selected for the current study were named based on their recording dates. The Jan07 meeting, recorded on January 7th, involves the exploitation of a foreign weapons device. In this meeting, 5 military officers from different departments (e.g., weapons testing, intelligence, engineering, fighter pilot) collaborated to plan the best way to test the capability of the weapon part. Each participant played the important role of representing the perspective of his/her department. The Mar18 meeting, recorded on March 18th, involves the selection of graduate fellowship recipients. In this meeting, 5 faculty members from AFIT developed criteria for selecting 5 awardees from a pool of 15 applicants and then made the selections based on those criteria. Each participant, after reviewing the qualifications of 3 applicants, gave his/her opinion about the ranking of those candidates and their suitability for selection. After an initial round of presentations, the participants developed selection criteria and ranked all of the candidates accordingly. The Jan07 and Mar18 meetings differ, in part, because the Mar18 participants needed to consult application materials during their interactions and also made use of a whiteboard to organize the selection process. Because these artifacts played a major role in Mar18, there was much less eye contact among participants.

2.2 Data Preparation and Annotation
The data annotation procedures are depicted in Figure 1. Details related to each step are provided below.
[Figure 1: the audio channel flows from manual word transcription through ASR forced alignment to sentence unit (SU) annotation and floor control annotation; the video channel flows from video signal extraction through multimodal behavior annotation in MacVissta to XML conversion; the word/SU and floor control markups are then combined.]
Fig. 1. Data flow diagram of the multimodal meeting data annotation procedure
Word/SU Annotation: The meetings in the VACE corpus were transcribed by humans according to the LDC Quick Transcription (QTR) guidelines and then time-aligned with the audio. Word-level transcriptions do not provide information that is available in textual sources, such as punctuation or paragraphs. Because sentence-level segments provide an important level of granularity for analysis, we chose to segment the words into sentences and mark the type of each sentence prior to carrying out floor annotation. The importance of structural information for human comprehension of dialogs has already been demonstrated, and methods have been developed to automatically annotate speech dialogs. Using the EARS MDE annotation specification V6.2, we annotated sentence units (SUs). An SU expresses a speaker’s complete thought or idea. There are four types: statement, question, backchannel, and incomplete. Initially, we automatically annotated SU boundaries based on a hidden-event SU language model (LM) trained on the EARS MDE RT04S training data containing about 480,000 word tokens; we then manually corrected them and labeled their types using a multimodal interface. This interface displays time-aligned word transcriptions with the automatic markups
and allows the annotator to listen to audio and view video corresponding to selected portions of the transcripts. The original EARS MDE SU annotations were created by LDC using a tool that was developed to annotate spoken dialogs. As we began this effort, there was no existing tool optimized to support the five-plus channels that must be consulted in order to accurately annotate SUs in our meetings. Furthermore, because we believed that video cues were vital for our markups, we also needed access to the video. Hence, to annotate SUs (and subsequently floor control), we considered a variety of multimodal tools. We chose Anvil for the following reasons: (1) it is an extremely reconfigurable tool developed to support multimodal annotations; (2) it supports the simultaneous display of annotations and playback of audio and video segments; (3) because it uses XML to represent the markups, the tool is able to flexibly support a variety of markups; (4) markups can be set up for color display, which is especially attractive for quickly post-editing annotations. For SU annotation, we used four different colors to display the SU types, which was quite helpful for noticing and correcting annotation mistakes. Using the SU Anvil interface designed by the first author, the second author annotated each meeting segment with SU boundaries and SU types. While carrying out this analysis, she also noticed and repaired a small number of transcription and word alignment errors.

Floor Control Annotation: There is no existing annotation standard for floor control, although LDC discussed the notion of a turn versus control of the floor in their MDE annotation guidelines. In previous research, most researchers have focused on turn-taking in dyadic conversations [23, 24, 2], but do not explicitly discuss the relationship between turn and floor control.
Following LDC’s definition, we chose to define a speaker turn as an interval of speech uttered by a single discourse participant that is bounded by his/her silence (≥ 0.5 s); such turns can essentially be obtained using an appropriately parameterized speech activity detection algorithm. When participant A is talking to participant B and B is listening without attempting to break in, then A clearly has “control of the floor”. The person controlling the floor bears the burden of moving the discourse along. Change in control of the floor can happen through a number of mechanisms, including regular turn-taking and successful interruptions. Although turns are important for meeting analysis, not all speaker turns involve floor control, and it is possible to control the floor despite the presence of fairly long pauses. A participant’s turn may or may not coincide with him/her holding the floor, and so may overlap with that of another participant who is holding the floor. Overlaps that do not cause the floor holder to concede the floor include backchannels (passive contributions to discourse, which nonetheless constitute a speaker turn), failed interruptions, helpful interjections, and side-bar interactions. We have developed an annotation scheme related to control of the floor that involves several types of events. Since it is possible for floors to split, our annotations keep track of who is participating in a particular floor control event. The following event types are marked up for floor analysis:
Control: This corresponds to the main communication stream in a meeting.
Sidebar: This event type represents sub-floors that have split off of a more encompassing floor. Again, we need to know who has control and which participants are involved.
Backchannel: This is an SU type involving utterances like “yeah” that are spoken while another participant controls the floor.
Challenge: This is an attempt to grab the floor. For example, the utterance “do I” by E below is a challenge.
    C: yeah we need to instrument it we need /
    E: do I... do I need to be concerned...
Cooperative: This is an utterance inserted into the middle of the floor controller’s utterance in a way that is much like a backchannel but with propositional content.
Other: These are other types of vocalizations, e.g., self talk, that do not contribute to any current floor thread.
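The speaker-turn definition given above (speech from a single participant bounded by at least 0.5 seconds of silence) is mechanical enough to sketch in code. The function below is an illustrative implementation rather than the tool actually used for the corpus, and the word-timing input format is a hypothetical simplification:

```python
def segment_turns(words, min_gap=0.5):
    """Group one participant's time-aligned words into speaker turns:
    a turn is a maximal stretch of speech bounded by at least `min_gap`
    seconds of silence. `words` is a list of (start, end) times in
    seconds, sorted by start time."""
    turns = []
    for start, end in words:
        if turns and start - turns[-1][1] < min_gap:
            turns[-1][1] = end          # silence too short: extend the turn
        else:
            turns.append([start, end])  # long enough gap: start a new turn
    return [tuple(t) for t in turns]
```

For example, words at (0.0, 0.4), (0.5, 1.0), and (1.8, 2.2) yield two turns, since only the 0.8 s gap meets the 0.5 s threshold.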
Using the Anvil interface designed by the first author, which displays audio, video, and time-aligned word transcriptions with SU annotations, the first author annotated each meeting segment with the above floor control related events, and these annotations were double-checked by the second author. When annotating each event type, we chose to respect the SU boundaries and event types. For example, SU backchannels have to be backchannels in our floor markups, and control events continue until the end of an SU, even if another speaker starts a new control event before the first control event is complete. The annotation process for these control events involved several passes. In the first pass, the annotator focused on tracking the “major” control thread(s) in the meeting, resulting in a sequence of floors controlled by various participants. Then, in the second pass, the annotator focused on the “finer” distinctions of floor control structure (e.g., challenge, cooperative). Anvil provides excellent playback control to take some of the tedium out of viewing the data multiple times.

Gaze and Gesture Annotation: In each VACE meeting, 10 cameras were used to record the meeting participants from different viewing angles, thus making it possible to annotate each participant’s gaze direction and gestures. Gesture and gaze coding was done in MacVissta, a general-purpose multimodal video display and annotation tool for Mac OS X. It supports the simultaneous display of multiple videos (representing different camera angles) and enables the annotator to select an appropriate view from any of the 10 videos to produce more accurate gaze/gesture coding. The annotators had access to time-aligned word transcriptions and all of the videos when producing gaze and gesture annotations. Following the McNeill lab’s gesture coding manual, five common types of gestures that are related to the content of concurrent speech, including metaphoric, iconic, emblematic, deictic, and beat, were annotated.
These exclude fidgeting movements (e.g., tapping fingers while thinking, touching clothes, etc.) as well as instrumental movements (e.g., holding a cup, arranging papers on a desk, etc.). Gaze coding was completed by marking major saccades, which are intervals that occur between fixations of the eye. Such an interval begins with the shift away from one fixation point and continues until the next fixation is held for roughly 1/10 of a second (3 frames). Inclusion of micro-saccades is not possible using the available technologies, nor is it necessary for our level of analysis. The segmentation of space into areas and objects for fixation includes other people, specific non-human entities in the environment (e.g., board, papers, thermos), personal objects (e.g., watch), and neutral space in which the eyes are not fixated on any visible object. The gesture and gaze annotations, which were stored in the Mac’s default XML structure, were converted to a custom XML format that was then loaded into Anvil for combination with the word, SU, and floor control annotations described above. Given the combination of word-level information, SU and floor event annotations, and gesture and gaze markups, we have carried out an analysis of the two meetings described previously.
Table 1. Basic floor properties of the two VACE meetings. Durations are in seconds; event counts are in parentheses.

Jan07 meeting
speaker  dur (sec)  # words  Control       Challenge   Backchannel  Sidebar-Control  Cooperative
C        337.32     1,145    299.58 (37)   5.33 (8)    12.84 (44)   14.9 (17)        4.67 (4)
D        539.13     2,027    465.54 (26)   3.4 (7)     5.31 (29)    64.88 (19)       0 (0)
E        820.51     3,145    763.31 (63)   7.67 (11)   29.82 (116)  17.02 (7)        2.61 (2)
F        579.42     2,095    523.16 (37)   4.73 (20)   11.8 (43)    32.39 (15)       7.34 (9)
G        352.92     1,459    296.31 (31)   5.66 (11)   11.78 (55)   39.16 (15)       0 (0)

Mar18 meeting
speaker  dur (sec)  # words  Control       Challenge   Backchannel  Sidebar-Control  Cooperative
C        679.39     2,095    648.73 (62)   1.89 (4)    28.02 (74)   0 (0)            0.75 (2)
D        390.46     1,285    359.75 (54)   4.23 (11)   21.78 (65)   0 (0)            4.7 (7)
E        485.21     1,380    465.03 (49)   10.24 (21)  18.41 (72)   0 (0)            0 (0)
F        486.60     1,467    481.70 (57)   1.43 (4)    3.47 (11)    0 (0)            0 (0)
G        470.72     1,320    422.49 (53)   0.87 (2)    36.76 (111)  0 (0)            2.14 (2)
2.3 Measurement Studies

We have two reasons for carrying out measurement studies on the two VACE meetings described above. First, since there has been little research on floor control in multiparty meetings, we need to gain a greater understanding of floor control in this setting; it is not clear whether findings from dialogs will hold for larger groups of participants. Second, our ultimate goal is to develop algorithms that utilize multimodal cues to automatically annotate the floor control structure of a meeting. Hence, measurement studies provide an opportunity to identify useful cues from the audio and visual domains to support our future system design. Some of the questions that we hoped to answer in this investigation relate to speech, gaze, and gesture:

- How frequently do verbal backchannels occur in meetings?
- What is the distribution of discourse markers (e.g., right, so, well) in the meeting data? How are they used in the beginning, middle, and end of a control event?
- When a holder finishes his/her turn, are there observable distributional patterns in his/her eye gaze targets? Does he/she gaze at the next floor holder more often than at other potential targets?
- When a holder takes control of the floor, are there observable distributional patterns in his/her eye gaze targets? Does he/she gaze at the previous floor holder more often than at other potential targets?
- Do we observe frequent mutual gaze breaks between two adjacent floor holders during floor changes?
- How frequently does the previous floor holder make floor yielding gestures, such as pointing to the next floor holder?
- How frequently does the next floor holder make floor grabbing gestures to gain control of the floor?

Control of the floor is not always intentionally transferred from one speaker to another. Sometimes the floor holder yields control of the floor and makes it open to all meeting participants.
When the floor is open, someone may take control without an explicit control assignment. In order to perform a more accurate calculation over all floor transitions, the first author subjectively classified all floor transitions into four categories, based on words, audio, gaze, and gesture cues in the context of each transition:

Change: there is a clear floor transition between two adjacent floor holders, with some gap between the adjacent floors.
Overlap: there is a clear floor transition between two adjacent floor holders, but the next holder begins talking before the previous holder stops speaking.
Stop: the previous floor holder clearly gives up the floor, and there is no intended next holder, so the floor is open to all participants.
Self-select: without being explicitly yielded the floor by the previous holder, a participant takes control of the floor.

For Change and Overlap floor transitions, control of the floor is explicitly transferred from the previous floor holder to the next floor holder. For Stop and Self-select, there is no explicit floor transfer between the two adjacent floor holders: the previous floor holder yields control of the floor and makes it available to all others, and the next floor holder volunteers to take control. By distinguishing these four transition types, we believe we will be able to obtain a deeper understanding of the multimodal behavior patterns. For example, when examining the previous floor holder's gaze and gesture patterns, we do not consider Self-select transitions.

Basic Statistics: First, we provide some basic statistics related to floor control in the two meetings. Table 1 shows information about the Jan07 and Mar18 meetings. The table reports the total duration and the number (in parentheses) of each floor event type. It should be noted that these intervals contain pauses, which figure into the durations. These meetings are clearly quite different based on the information provided in the table, even though each comprises five participants and lasts around forty minutes in total. Figure 2 provides some basic statistics on floor transitions by meeting. We find that there is a much larger number of Stop and Self-select floor transitions in the Mar18 meeting than in the Jan07 meeting.
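The four transition types defined above can be encoded as a small decision rule. In the paper the labels were assigned subjectively from words, audio, gaze, and gesture; the boolean inputs below are hypothetical stand-ins for those annotator judgments, so this is only an illustrative sketch:

```python
def classify_transition(prev_end, next_start, explicit_transfer, floor_stopped):
    """Classify a floor transition into one of the four categories.
    prev_end / next_start: end and start times (sec) of the adjacent floors.
    explicit_transfer: annotator judged that the previous holder yielded
        the floor to a specific next holder (hypothetical flag).
    floor_stopped: annotator judged that the previous holder gave up the
        floor with no intended next holder (hypothetical flag)."""
    if explicit_transfer:
        # Explicit transfer: distinguish by whether the next holder
        # started speaking before the previous holder finished.
        return "Overlap" if next_start < prev_end else "Change"
    # No explicit transfer: the floor was opened to all participants.
    return "Stop" if floor_stopped else "Self-select"
```
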
Fig. 2. Basic statistics on floor transitions for the two VACE meetings
Speech Events: The first speech event we consider is the verbal backchannel. Given that participants can freely produce non-verbal backchannels in meetings, such as nodding, we are interested in seeing whether verbal backchannels are common. Hence, we calculated the percentage of backchannel SUs among all SUs. Figure 3 shows that the backchannel percentage is 25.22% in the Jan07 meeting and 30.8% in the Mar18 meeting. Jurafsky et al. reported a backchannel percentage of 19% on the Switchboard (SWB) corpus, and Shriberg et al. reported a backchannel percentage of 13% on the ICSI meeting corpus. Their calculations were done at the utterance level, where an utterance is defined as a segment of speech occupying one line in the transcript by a single speaker that is prosodically and/or syntactically significant within the conversational context; ours are done at the SU level (which may contain more than one utterance). Since nods made for affirmation purposes in the meetings were annotated, we include them here for comparison.
Fig. 3. Statistics on verbal backchannel and nodding
Table 2. DM distribution in Control and Challenge events in the two VACE meetings.

Jan07 meeting
location       # w/ DM  # total  dur. (sec)  frequency (Hz)
challenge      22       57       26.79       0.82
short control  20       54       52.41       0.38
beginning      58       140      70          0.82
ending         13       140      70          0.18
middle         304      140      2155.50     0.14

Mar18 meeting
location       # w/ DM  # total  dur. (sec)  frequency (Hz)
challenge      12       42       18.65       0.64
short control  42       110      111.04      0.38
beginning      73       165      82.5        0.88
ending         13       165      82.5        0.16
middle         184      165      2092.67     0.09
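The location counts behind Table 2 can be reproduced by bucketing each DM occurrence by its position inside a control event (first 0.5 s, last 0.5 s, or the middle), treating events shorter than 2.0 s as short control events; the frequency column is then the DM count divided by the total duration of the corresponding region. The sketch below assumes a simplified input of (token, start-time) pairs and, for brevity, checks only single-word DMs:

```python
# Single-word DMs from the paper's list; the multiword DMs ("i mean",
# "you know", "let's see", "you see") are omitted in this sketch.
DMS = {"actually", "now", "anyway", "see", "basically", "so", "well", "like"}

def dm_locations(event_start, event_end, words):
    """Count DMs in the beginning / middle / ending regions of a control
    event, or over the whole span for short (< 2.0 s) events.
    `words` is a list of (token, start_time) pairs."""
    counts = {"beginning": 0, "middle": 0, "ending": 0, "short": 0}
    short = (event_end - event_start) < 2.0
    for token, t in words:
        if token.lower() not in DMS:
            continue
        if short:
            counts["short"] += 1
        elif t < event_start + 0.5:
            counts["beginning"] += 1
        elif t >= event_end - 0.5:
            counts["ending"] += 1
        else:
            counts["middle"] += 1
    return counts
```
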
Another important speech-related feature that deserves consideration is the patterning of discourse markers (DMs) in our floor control events. A DM is a word or phrase that functions as a structuring unit of spoken language; DMs are often used to signal the intention to mark a boundary in discourse (e.g., to start a new topic). Some examples of DMs include: actually, now, anyway, see, basically, so, I mean, well, let’s see, you know, like, you see. For control and challenge events, we counted the number of times that the DMs in the above list appear. For control events with durations exceeding 2.0 seconds, we count the number of discourse markers appearing in three locations: the beginning (the first 0.5 seconds of the span), the end (the last 0.5 seconds of the span), and the middle (the remainder). If a span is shorter than 2.0 seconds, we count the number of discourse markers appearing in the entire span and dub the event a short control event. Since floor challenges tend to be short, we also count the number of discourse markers over their entire spans. Table 2 shows the distribution of DMs for these events and locations. We calculated the frequency of DMs, defined as the ratio of the number of locations with a DM (# w/ DM) to the total duration of the location (dur. (sec)). DMs occur much more frequently in challenges (0.82 Hz in Jan07 and 0.64 Hz in Mar18) and at floor beginnings (0.82 Hz in Jan07 and 0.88 Hz in Mar18) than in the other event spans.

Gaze Events: Figures 4 and 5 report statistics related to the gaze targets of the previous and next floor holders during a floor transition. The possible targets include the next holder, the previous holder, the meeting manager (E in each meeting), other participants, and no person (e.g., papers, objects, whiteboard). In the Jan07 meeting, when the floor is transferred, the previous floor holder frequently gazes at the next floor holder (in 124 out of 160 transitions, or 77.5%). In addition, the next floor holder frequently gazes at the previous floor holder (in 136 of 160 transitions, or 85%) during a floor transition. In the Mar18 meeting, since participants often spend time reading information from papers and the whiteboard, we find a much lower occurrence of these gaze patterns: the previous holder gazes at the next holder in 65 of 167 transitions (38.9%), and the next holder gazes at the previous holder in 76 of 167 transitions (45.5%).
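The gaze percentages above are simple tallies over the Change and Overlap transitions. A minimal sketch, assuming a hypothetical record per transition that stores each holder's annotated gaze target during the transition:

```python
def gaze_target_rates(transitions):
    """Fraction of transitions in which the previous holder gazes at the
    next holder, and in which the next holder gazes at the previous one.
    Each transition is a dict with hypothetical keys 'prev', 'next',
    'prev_gaze', and 'next_gaze' (participant IDs / gaze targets)."""
    n = len(transitions)
    prev_to_next = sum(t["prev_gaze"] == t["next"] for t in transitions)
    next_to_prev = sum(t["next_gaze"] == t["prev"] for t in transitions)
    return prev_to_next / n, next_to_prev / n
```
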
Fig. 4. The previous floor holder’s gaze target distribution
In the Jan07 meeting, over all 160 floor transitions involving two holders (Change and Overlap), there were 70 mutual gaze breaks, giving a percentage of 43.75%, which is similar to the 42% reported by Novick. However, in the Mar18 meeting, over all 167 floor exchanges involving two holders (Change and Overlap), there were only 14 mutual gaze breaks. This suggests that in meetings that involve significant interaction with papers and other types of visual displays, there is likely to be a lower percentage of mutual gaze breaks.

Not all participants play equal roles in the meetings we analyzed; a meeting manager was assigned for each meeting. The participants labeled E in both the Jan07 and Mar18 meetings are meeting managers, responsible for organizing the meeting. Clearly, E in Jan07 plays an active role in keeping the meeting on track; this can be observed by simply viewing the meeting video, but also from the basic floor statistics: E in the Jan07 meeting speaks the greatest number of words and backchannels the most. However, E in the Mar18 meeting plays a “nominal” meeting manager
Fig. 5. The next floor holder’s gaze target distribution
role. From the basic statistics of the meeting, we observe that C speaks more words than E, and G has the most backchannels. Given the special role played by a meeting manager, we analyzed whether the meeting manager affects floor changes even when he/she is neither the previous nor the next floor holder. In the Jan07 meeting, there were 53 floor exchanges (Change and Overlap only) in which E was neither the previous nor the next floor holder. In these 53 cases, E gazes at the next floor holder 21 times. However, sometimes the manager gazes at the next holder together with other participants; if we rule out such cases, E gazes at the next floor holder 11 times (20.75%). This suggests that an active meeting manager’s gaze target plays some role in predicting the next floor holder. In the Mar18 meeting, there are 100 cases in which E is not a floor holder, and within these, E gazes at the next floor holder only 6 times; in fact, E gazes largely at his papers or the whiteboard.

Gesture Events: Gesture has been found to help coordinate floor control. During a floor transition, the previous floor holder may point to a meeting participant to assign the floor. When a person desires to gain control of the floor, he/she may use hand movements, such as lifting a hand or some object (e.g., a pen) in accompaniment to speech, to attract attention and gain the floor. Here we consider whether gestural cues would be helpful in an automatic floor control structure detection system. We calculated the number of occurrences of floor giving (fg) gestures used by the previous floor holder and floor capturing (fc) gestures used by the next floor holder during floor transitions in the two meetings. As can be seen in Figure 6, there are many more floor capturing gestures in these two VACE meetings than floor giving gestures. When a speaker begins a Self-select floor, he/she is apt to use an fc gesture.
Therefore, in our automatic floor control prediction system, we may focus on concurrent floor capturing gestures as a source of useful cues from the gesture domain.
The floor control structure of a meeting provides important information for understanding that meeting. We presented a floor control annotation specification and applied it to two different meetings from the VACE meeting corpus. From an analysis of these markups, we have identified some multimodal cues that should be helpful for predicting
Fig. 6. Gestures for grabbing or yielding the floor made by floor holders
floor control events. Discourse markers are found to occur frequently at the beginning of a floor. During floor transitions, the previous holder often gazes at the next floor holder, and vice versa. The well-known mutual gaze break pattern of dyadic conversations is also found in the Jan07 meeting. A special participant, an active meeting manager, is found to play a role in floor transitions. Gesture cues are also found to play a role, especially floor capturing gestures. Comparing the Jan07 and Mar18 meetings, we find that artifacts (e.g., papers and whiteboards) in the meeting room environment impact participant behavior. It is important to understand the factors that affect the presence of various cues through the analysis of a greater variety of meetings. In future work, we will refine our floor control annotation specification and continue to annotate more meetings in the VACE collection, as well as in other meeting resources. Using the knowledge obtained from these measurement studies, we will build an automatic floor control prediction system using multimodal cues.
We thank all of our team members for their efforts in producing the VACE meeting corpus: Dr. Yingen Xiong, Bing Fang, and Dulan Wathugala from Virginia Tech; Dr. David McNeill, Dr. Susan Duncan, Jim Goss, Fey Parrill, and Haleema Welji from the University of Chicago; and Dr. Ron Tuttle, David Bunker, Jim Walker, Kevin Pope, and Jeff Sitler from AFIT. This research has been supported by ARDA under contract number MDA90403-C-1788 and by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023. Any opinions, findings, and conclusions expressed in this paper are those of the authors and do not necessarily reflect the views of ARDA and DARPA.