Two Demonstrators are Better than One – A Social Robot that Learns to Imitate People with Different Interaction Styles (A)


Phoebe Liu, Dylan F. Glas, Takayuki Kanda, Member, IEEE, Hiroshi Ishiguro, Senior Member, IEEE

Abstract— With recent advances in social robotics, many studies have investigated techniques for learning top-level multimodal interaction logic by imitation from a corpus of human-human interaction examples. Most such studies have taken the approach of learning equally from a variety of demonstrators, with the effect of reproducing a mixture of their average behavior. However, in many scenarios it would be desirable to reproduce specific interaction styles captured from individuals. In this study, we train one deep neural network jointly on two separate corpuses collected from demonstrators with differing interaction styles. We show that training on both corpuses together improves performance in terms of generating socially appropriate behavior, even when reproducing only one of the two styles. Furthermore, the trained neural network also enables us to synthesize new interaction styles on a continuum between the two demonstrated interaction styles. We discuss plots of the hidden-layer activations from the neural network, indicating the types of semantic information that appear to be learned by the system. Further, we observe that the better performance from joint training is not merely due to the increase in sample size: even with the same number of training examples, training on half the data from each corpus provided better performance than training on all the data from a single corpus.

Index Terms— Human-robot interaction, learning by imitation, social robotics, service robots, proactive behaviors, learning interaction style.

I. INTRODUCTION

In recent years, we have seen an upsurge of social robots being used commercially, specifically in the space of providing entertainment [1, 2], educating children [3, 4], or providing customer service [5, 6]. As more social robots gain traction in public, one promising approach for generating human-robot interaction logic is to automatically learn natural human behaviors by imitation from real human-human interaction. The arrival of the big-data era has made it feasible to create natural human-robot interactions empowered by data-driven approaches [7, 8, 9, 10]. Our previous work [7, 11], in which a shopkeeper robot learns to reproduce both reactive and proactive multimodal behaviors by means of abstraction from examples of human-human interactions, illustrates the idea that repeatable interaction data can be used to automatically infer interaction strategies for generating robot behaviors. While data-driven approaches can be an efficient method for reproducing robot behaviors, one possible downside is that individual behavior differences from person to person may confuse a learning-based robot about which behaviors it should reproduce. Traditionally, most data-driven approaches require training a unique model for each task, and since each model may require thousands of examples [12, 13], this approach may not scale well to learning the range of social-behavior variations that arise from person to person.

For example, in a shop scenario, suppose a passive shopkeeper mainly lets customers browse around, while a proactive shopkeeper takes the initiative to interact with the customer – which shopkeeper should a robot learn from, and is it possible for the robot to learn jointly from both shopkeepers, even when they behave differently in the same situation? This notion of learning social interactions from two different people instead of just a single person is an attractive option for a learning-based robot. Learning from two people provides more example training data, as well as the possibility of learning the different interaction style of each person. Both of these points are important: having more training data could potentially improve the quality of the learned robot behaviors, and the ability to learn multiple interaction styles can better equip the robot to adopt different language behavior and interaction styles depending on the scenario or the situation at the moment [14].

In this work, we attempt to move from the paradigm of a robot learning from just one person to jointly learning from two people who may behave differently in the same situation (i.e. passive and proactive). Since our goal is to jointly learn from both shopkeepers and reproduce their behavior in a robot, we propose a simple modification to our previous learning system, which is to append a style feature to the input of the Multilayer Perceptron (MLP). In addition, we investigate the effect of training jointly on data from both shopkeepers, demonstrating that it can both improve the performance of the robot behavior and equip the robot with the ability to assume different interaction styles via the style feature, owing to the MLP learning shared neural representations for some semantically similar interaction patterns. Lastly, we show that the performance improvement when jointly learning from the two shopkeeper corpuses is not just a result of the increased sample size: even with the same number of training examples, training on half the data from each corpus provided better performance than training on all the data from a single corpus.

II. RELATED WORK

A. Learning from data for social robots

For social robots, frameworks focused on crowdsourcing have been developed to enable learning of overall interaction logic from data collected in simulated environments, such as the Robot Management System framework [15] and the Mars Escape online game [8, 16]. In the latter, remote users are asked to interact in order to complete several search-and-retrieval tasks in an online game; the interaction data are logged and used to generate autonomous robot behaviors for completing the same task. In Thomaz et al.'s work, a framework was developed to enable online users to administer feedback when teaching a Reinforcement Learning agent to perform tasks observed in a game [17, 18]. While our work complements these approaches by considering crowd-based data collected directly from human-human interaction using sensors in a physical environment, we are also interested in capturing individual style from humans and reproducing the differing styles in a robot.

The use of real human interaction data collected from sensors for learning interactive behaviors has been investigated in numerous works. In a study by Nagai et al., a robot was developed with an infant-like ability to learn from human parental demonstrations by using a model based on visual saliency to detect likely important locations in a scene, without employing any knowledge about the actions or the environment [19]. The robot JAMES was developed to serve drinks in a bar setting, in which a number of supervised (i.e. dialog management) and unsupervised learning techniques (i.e. clustering of social states) were applied to learn social interaction [20]. Admoni and Scassellati proposed a model that uses empirical data from annotated human-human interactions to generate nonverbal robot behaviors in a tutoring application; the model can simultaneously predict the context of a newly observed set of nonverbal behaviors and generate a set of nonverbal behaviors given a context of communication [21]. While some of these studies do support learning interactive behaviors, we are not aware of any framework designed to simultaneously learn from human demonstrators who exhibit distinctively different behavior styles, in terms of both verbal expression and nonverbal motion, given the same situation.

B. Learning from multiple sources

In robot manipulation tasks, most works have focused on learning a task-specific behavior [22, 23]. There have been some attempts to move from learning a task-specific model to jointly learning multiple robot tasks at the same time. Pinto and Gupta demonstrated how models with multi-task learning (i.e. grasp and push) tend to perform better than a task-specific model with the same amount of data [24]. They hypothesized that the performance improvement is due to the diversity of data and regularization in learning. Likewise, our study considers the merit of jointly learning interactions from multiple people as a scalable solution for the robot to improve performance while learning different interaction styles at the same time.

There have been some attempts to acquire verbal and nonverbal dialog behaviors for a robot learned from multiple demonstrators. In Leite et al.'s study, a semi-situated learning method was proposed to crowdsource from multiple authors, resulting in a dataset consisting of a set of annotated, human-authored dialog lines that are associated with the goal that generated them [25]. Their system blends together content created by multiple authors and rated by multiple judges to be used for generating robot speech. In contrast to their work, where input from multiple authors is manually created and merged together, our work learns directly from data and aims to preserve and reproduce the individual styles of the demonstrators.

In natural language processing, work with Long Short-Term Memory (LSTM) neural networks has demonstrated that translation from one source language into multiple target languages is possible [26], an approach that also showed benefits such as better training efficiency and a smaller number of models. The task of generating honorific language in English-to-German translation was made possible by training with two sources of informal and polite German speech [27]. In another work, word graphs were constructed from tweets collected in two different domains (i.e. politics and entertainment) to transform regular chatbot responses into responses that mimic the speaking styles of those specific domains [28]. Similarly, we want to jointly learn different interaction styles from two different shopkeepers, but in the problem domain of learning multimodal human-robot interactions from noisy sensor data collected in a physical environment.

III. DATA COLLECTION

A. Scenario

We chose a camera shop scenario for this study so that repeatable behaviors consistent with either a proactive or a passive interaction style could be observed. We set up a simulated camera shop environment in an 8 m x 11 m experiment space with three camera models on display, each at a different location (Fig. 1). For each interaction, one shopkeeper participant interacted with one customer participant. In this environment, our goal was to collect data corresponding to the following two shopkeeper behavior patterns:

Fig. 1. Environment setup for our study, featuring three camera displays. Sensors on the ceiling were used for tracking human position, and smartphones carried by the participants were used to capture speech.

• A proactive shopkeeper takes initiative, either by introducing new camera features or presenting a new camera to the customer, while still answering the customer’s questions.

• A passive shopkeeper is preoccupied with other tasks in the shop and mostly lets the customer browse around the shop, though the shopkeeper should still be helpful by answering the customer's questions.

We chose these two interaction styles because we consider them to be particularly meaningful in HRI. For example, Baraglia et al. [29] have also investigated the importance of controlling a robot's level of proactivity in collaborative tasks.

B. Sensors

We recorded the participants' speech and movement as they interacted with each other. We used a human position tracking system, consisting of 20 Microsoft Kinect 1 sensors arranged in rows on the ceiling, to capture the participants' positions and motion in the room. Particle filters were used to estimate the position and body orientation of each person in the room based on point cloud data [30]. Location data for the shopkeeper and the customer were recorded at a rate of 20 Hz. The speech of each participant was captured by a handheld smartphone, and the Google speech recognition API1 was used to recognize utterances and send the text to a server via Wi-Fi. To mark the start and end of speech activity, participants were required to touch the smartphone screen at the beginning and end of each utterance, and speech data were recorded for each such speech event.
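For clarity, the kinds of records produced by this setup could be sketched as follows; the field names and types are our own illustration and do not reflect the actual log format of the tracking system or the smartphone app:

```python
from dataclasses import dataclass

@dataclass
class PositionRecord:
    """One sample from the ceiling-mounted tracking system (recorded at 20 Hz)."""
    timestamp: float         # seconds
    person_id: str           # "customer" or "shopkeeper"
    x: float                 # position in the room, meters
    y: float
    body_orientation: float  # radians, estimated by the particle filter

@dataclass
class SpeechEvent:
    """One utterance captured via the smartphone and Google speech recognition."""
    person_id: str
    start_time: float        # participant taps the screen to start speaking
    end_time: float          # participant taps again to stop
    recognized_text: str     # speech recognition result sent to the server over Wi-Fi
```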

C. Participants

For the role of customers in our interaction, we recruited fluent English speakers as participants. They had varied levels of knowledge about cameras. We employed a total of 18 customer participants (13 male, 5 female, average age 32.8, s.d. 12.4).

Since our goal was to capture the natural interaction styles of the shopkeepers, we initially interviewed and recruited participants with various degrees of proactivity as shopkeepers and observed some trial interactions. After the trial interactions, we asked the customer participants to provide feedback on the shopkeepers in terms of how well they fit the descriptions of the target interaction styles. Based on the interview and feedback results, we selected one participant (male, age 54) with a naturally outgoing personality as the proactive shopkeeper. Similarly, we selected another participant (female, age 25), who had a quieter disposition, as the passive shopkeeper. They played their assigned roles in all interactions.

D. Procedure

The shopkeepers were encouraged to act according to their type (i.e. passive or proactive), as described above. Both shopkeepers were instructed to wait by the service counter at the start of the interaction. They were also instructed to be polite, and to give socially appropriate acknowledgements (i.e. greetings and farewells). To keep interactions interesting and to create variation in the interactions, customer participants were encouraged to play with the cameras and asked to role-play in different trials as advanced or novice camera users, and to ask questions that would be appropriate for their role. Some camera features were chosen to be more interesting for novice users (color, weight, etc.) and others were more advanced (high-ISO performance, sensor size, etc.), although they were not explicitly labeled as such. Customer participants were not given a specific target feature or goal for the interaction, as we were mostly interested in capturing the shopkeepers' behavior. All participants were instructed to focus their discussion on the features listed on the camera spec sheet, ranging from 8 to 10 features for each camera, to minimize the amount of "off-topic" discussion. Each shopkeeper interacted with 9 different customer participants. For each customer participant, we conducted 24 interactions (12 as an advanced user and 12 as a novice), for a total of 216 interactions. 17 interactions were removed from the proactive shopkeeper trials and 10 interactions were removed from the passive shopkeeper trials, due to technical failures of the data capture system or participants who did not follow instructions. Table 1 presents the amount of data collected. This data set is available online2.

E. Observed Behavior

Overall, the customer participants followed our suggestions, though some customers had difficulty role-playing as advanced or novice camera users – for example, participants who had little knowledge about cameras were not easily able to think of the types of questions an advanced camera user would ask. Aside from this point, we observed a variety of behaviors, such as customers who spoke about multiple topics in a single, long utterance and customers who asked only direct questions. Since we encouraged the customers to play with different cameras, we observed that at times a customer would be focused on a camera and would not speak or move for some time, thus creating a period of silence during the interaction.

The shopkeeper participants behaved according to their assigned roles. The passive shopkeeper mainly let the customer browse around the shop and only answered questions when asked; she gave short, concise answers and did not expound on them. In contrast, the proactive shopkeeper had much more variation in his responses, and he often spoke in long, descriptive utterances and volunteered extra information when answering questions. Here we describe four main differences we observed between the behaviors demonstrated by the two shopkeepers.

First, the proactive shopkeeper approached the customer when he or she entered the shop, whereas the passive shopkeeper waited by the service counter. Second, the proactive shopkeeper often explained 2 or 3 features at the same time, whereas the passive shopkeeper usually explained only one feature at a time. Third, the proactive shopkeeper often volunteered more information, either by talking about a new feature or by continuing his previous explanation, after some silence had elapsed or when the customer gave a "backchannel" utterance (e.g. "oh, ok"). In this situation, the passive shopkeeper would usually remain silent. Fourth, the proactive shopkeeper would ask the customer questions, such as "What sort of pictures do you take?", whereas the passive shopkeeper rarely asked the customer questions. Table 2 illustrates example interactions from both the passive and the proactive shopkeeper. Notice that the passive shopkeeper is quite reactive, and her responses are usually short and concise. She also moves back to the service counter or remains silent when the customer does not inquire about a camera. In contrast, the proactive shopkeeper presents additional information about the camera, both when the customer asks a question and when the customer remains silent.

IV. PROPOSED TECHNIQUE

A. Overview

In order to reproduce proactive or passive interaction styles in a robot, we used a collection of data-driven techniques that directly learn behaviors (i.e. utterances and motion) from examples of human-human interaction captured as noisy sensor data. These techniques closely follow the procedure of our previous work [7, 11], and additional details are presented in the Appendix. The key steps are listed here:

1. Abstraction of training input and typical robot actions (Sec. IV.B): Continuous streams of interaction data captured from sensors are abstracted into typical behavior patterns, and the corresponding joint state vectors and robot actions are defined.

2. Learning with a Multilayer Perceptron (MLP) Neural Network (Sec. IV.C): We applied a feed-forward MLP neural network to learn to reproduce robot behaviors. An "attention" layer is applied to the neural network to learn the relative importance of the various steps of interaction history as inputs to the respective robot output actions.

3. Adding a target "interaction style" constraint (Sec. IV.D): In order to learn different interaction styles, this work extends the previous system by appending an extra token to the input of the neural network. The token is initialized to correspond to the respective human shopkeeper in the training examples. At runtime, it can be used to specify whether the generated robot action should mimic the interaction style of the proactive or the passive shopkeeper.

The techniques for Steps 1 and 2 were presented in our previous studies [7, 11], while Step 3 constitutes the novel contribution of this work, which enables behavior generation for multiple interaction styles.

B. Abstraction of training input and target robot action

In order to learn effectively despite the large variation of natural human behaviors and noisy inputs from the sensor system, the sensor data needs to be abstracted into common behavior patterns (i.e. common spatial states and common spoken utterances), which are then used to discretize a continuous stream of captured sensor data into behavior events. Here we briefly describe our techniques:

• Abstraction: To find common, typical behavior patterns in the training data, we used unsupervised clustering and abstraction to identify typical utterances, stopping locations, motion paths, and spatial formations of both participants in the environment.

• Action Discretization: To discretize continuous sensor data, we identified an action whenever a participant: (1) speaks an utterance (end of speech), and/or (2) changes their moving target, and/or (3) yields their turn by allowing a period of time to elapse with no action. An interaction is discretized into a sequence of alternating customer and shopkeeper actions.

• Defining Input Features: For each action detected, the abstracted state of both participants at that time is represented as a joint state vector, with features consisting of their abstracted motion state and the utterance vector of the current spoken utterance.

• Incorporating History: To provide contextual information for generating robot shopkeeper actions, the n most recent joint state vectors are incorporated as interaction history. We chose n = 3, since this seemed to be a good balance for generating the observed shopkeeper behaviors (e.g. presenting new features) in our scenario. This interaction history constitutes the input to our learning mechanism.

• Defining robot action: The shopkeeper action that follows the interaction history is mapped to a robot action, consisting of a typical utterance (e.g. ID 5) and a target spatial formation (e.g. present Nikon). The set of typical utterances is obtained from hierarchical clustering, further detailed in the Appendix. When executed, this robot action would cause the robot to speak the typical utterance associated with utterance ID 5, "It's $68", and execute a motion to attain the present Nikon formation. This robot action is used as the training target for our learning mechanism. A minimal code sketch of this input and target construction is given after this list.
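The sketch below illustrates this input construction; the concrete feature fields and encodings are our own illustrative assumptions, standing in for the abstraction described in [7, 11] and the Appendix:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class JointStateVector:
    """Abstracted state of both participants at one discretized action.
    The exact feature set follows the abstraction in [7, 11]; these fields are illustrative."""
    customer_motion_state: np.ndarray    # e.g. one-hot over clustered stopping locations / targets
    shopkeeper_motion_state: np.ndarray  # e.g. one-hot over clustered spatial states
    spatial_formation: np.ndarray        # e.g. one-hot over typical formations such as "present Nikon"
    utterance_vector: np.ndarray         # vector representation of the current spoken utterance

    def to_features(self) -> np.ndarray:
        return np.concatenate([self.customer_motion_state,
                               self.shopkeeper_motion_state,
                               self.spatial_formation,
                               self.utterance_vector])

def interaction_history(events: List[JointStateVector], n: int = 3) -> np.ndarray:
    """Stack the n most recent joint state vectors (n = 3 in this work) as the learner's input."""
    return np.stack([jsv.to_features() for jsv in events[-n:]])
```

The training target paired with each such history is the robot action (typical utterance ID and target spatial formation) that the human shopkeeper actually performed next.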

C. Learning with Multilayer Perceptron Neural Network

We are interested in automatically generating robot actions using only data observed from human-human interaction. To achieve this, we applied a multilayer perceptron neural network, which can learn a representation of our training data and a mapping from that representation to an appropriate robot action.

It attempts to generalize class assignments from examples in a dataset $\mathcal{D}$. Our dataset is composed of $(X, \textit{robot action})$ interaction pattern pairs, where $X \in \mathbb{R}^{3m}$ is an interaction history consisting of the three most recent joint state vectors, $X = \{jsv_{t-3}, jsv_{t-2}, jsv_{t-1}\}$, and $\textit{robot action} \in \{0,1\}^{d}$ is a target class assignment, where $d$ is equal to the number of possible robot actions.
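Continuing the earlier sketch, a single training pair can be assembled as follows; the function and variable names are illustrative, since the paper only defines the shapes $X \in \mathbb{R}^{3m}$ and the one-hot target in $\{0,1\}^{d}$:

```python
import numpy as np

def make_training_pair(history_features: np.ndarray, action_id: int, num_actions: int):
    """history_features: (3, m) array of joint state vectors; action_id: index of the
    observed shopkeeper action among the d typical robot actions."""
    X = history_features.reshape(-1)             # X in R^{3m}
    y = np.zeros(num_actions, dtype=np.float32)  # robot action in {0,1}^d
    y[action_id] = 1.0                           # robot_action_i = 1  =>  X maps to action i
    return X, y
```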

That is, if $\textit{robot action}_i = 1$, observation $X$ maps to robot action $i$. Computation in the neural network is performed by artificial neurons, which are typically organized into layers. The activation value of neuron $j$ in layer $l$ is defined in (1) as

$a_j^{(l)} = \sigma\Big( \sum_k w_{j,k}^{(l)} \, a_k^{(l-1)} + b_j^{(l)} \Big)$   (1)

where $b_j^{(l)}, w_{j,k}^{(l)} \in \mathbb{R}$ are parameters optimized by the network using the backpropagation algorithm, $a_k^{(l-1)}$ is the activation (output) of neuron $k$ in layer $l-1$, and $\sigma$ is a nonlinear activation function.

To learn context-dependent robot actions, we also included an attention layer in our neural network, as proposed by Raffel and Ellis [31], which has the ability to "attend" to and learn which parts of the interaction history are important when predicting robot behaviors. The idea is that once we have the activation values $a_j^{(l)}$ of the neurons in layer $l$, we can query each value, asking how relevant it is to the current computation of the target class assignment. Each $a_j^{(l)}$ then receives a relevance score, which is turned into a probability distribution summing to one via the softmax activation. We can then extract a context vector that is a weighted summation of the activation values in layer $l$ according to how relevant they are to a target robot action.
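Written out, the feed-forward attention of Raffel and Ellis [31] over the $T = 3$ history steps takes the following form, where $h_t$ denotes the hidden activation associated with history step $t$ and $a(\cdot)$ is a small learned scoring function (its exact form is an assumption on our part):

```latex
e_t = a(h_t), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad
c = \sum_{t=1}^{T} \alpha_t \, h_t .
```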

Fig. 2 shows the schematic of the neural network, which is identical to the architecture of our previous system [11]. It is composed of three sets of input neurons of size $m$, one for each joint state vector in the interaction history, followed by two leaky rectified hidden layers, an attention layer, and another leaky rectified hidden layer. The output layer is a softmax with the number of neurons equal to the number of possible robot actions, representing the probability of each robot action given an interaction history input.
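As an illustration only, the following PyTorch sketch mirrors this architecture under our own assumptions about layer sizes and the attention scoring function (the paper does not specify an implementation framework): each history step is processed by shared hidden layers of the form (1), scored by the attention layer, and the attention-weighted context is passed through one more hidden layer to the softmax output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShopkeeperActionMLP(nn.Module):
    """Sketch of the described architecture; sizes and scoring function are assumptions."""

    def __init__(self, jsv_dim: int, hidden_dim: int, num_actions: int):
        super().__init__()
        # Two leaky rectified hidden layers, applied per history step (eq. (1) with leaky-ReLU sigma).
        self.step_net = nn.Sequential(
            nn.Linear(jsv_dim, hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(),
        )
        # Attention layer: one relevance score per history step, normalized by softmax.
        self.attn_score = nn.Linear(hidden_dim, 1)
        # One more leaky rectified hidden layer after the attention-weighted summation.
        self.post_net = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU())
        # Output over the d possible robot actions (softmax applied in the loss).
        self.out = nn.Linear(hidden_dim, num_actions)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, 3, jsv_dim), i.e. X in R^{3m} reshaped into its three joint state vectors.
        h = self.step_net(history)                    # (batch, 3, hidden_dim)
        alpha = F.softmax(self.attn_score(h), dim=1)  # attention weights over the 3 history steps
        context = (alpha * h).sum(dim=1)              # weighted summation -> context vector
        return self.out(self.post_net(context))       # logits; train with cross-entropy on one-hot targets
```

In this sketch, the style feature of Sec. IV.D would simply be appended to each joint state vector before it reaches the network, increasing jsv_dim by one.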

D. Adding style feature

To capture the differences in shopkeeper behavior in the human-human interactions, we propose one simple modification to our previous system: introducing an extra style feature in the joint state vectors. The idea is that this style feature specifies which shopkeeper's behavior the current interaction corresponds to, and it is provided to the neural network as an additional input feature. At training time, the correct style feature is set based on the source of the human-human interaction data; that is, the style feature is set to the value representing the proactive shopkeeper if the training data came from the proactive shopkeeper, and to the value representing the passive shopkeeper if it came from the passive shopkeeper. This attribute is then concatenated as a style feature to the joint state vector. We then trained the neural network on the interaction data combined from both shopkeepers. In online operation, we assume that the style feature will be specified by a user who selects the desired level of proactivity in the robot's actions. While one could envision alternative architectures for incorporating the style feature (e.g. directly connecting it to all hidden layers or connecting it to the output layer [32]), we consider the addition of an input feature targeting a desired interaction style to be a simple and elegant approach, as it requires no modification to the existing architecture of the neural network. Similar approaches have been applied and shown to be effective in neural translation systems – for instance, an artificial token was added to the input sentence of the source language to specify the required target language when training on multiple languages [26], and a side constraint was added to the source text to control the level of honorifics in English-to-German translation [27].
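A minimal sketch of this mechanism is shown below, with an assumed binary encoding that the paper does not specify; at runtime the value is chosen by the operator, and intermediate values of this feature are one plausible way to realize the continuum of synthesized styles mentioned in the abstract:

```python
import numpy as np

# Illustrative binary encoding of the style feature; the paper does not state the exact values.
PROACTIVE_STYLE = 1.0
PASSIVE_STYLE = 0.0

def append_style(jsv_features: np.ndarray, style: float) -> np.ndarray:
    """Concatenate the style feature onto one joint state vector's feature array."""
    return np.concatenate([jsv_features, [style]])

# Training time: the style is fixed by the corpus each example came from.
# `history` is a list of the 3 most recent joint state vector feature arrays.
def training_input(history, source_is_proactive: bool) -> np.ndarray:
    style = PROACTIVE_STYLE if source_is_proactive else PASSIVE_STYLE
    return np.stack([append_style(jsv, style) for jsv in history])

# Runtime: the operator selects the desired level of proactivity for the robot.
def runtime_input(history, desired_style: float) -> np.ndarray:
    return np.stack([append_style(jsv, desired_style) for jsv in history])
```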

In summary, we consider this proposed style feature to be a simple and elegant solution to incorporate training data representing differing interaction styles into the learning system. During training, we only need to add one additional feature to each joint state vector in the interaction history to preserve the observed interaction style, and during operation, we can control the style feature to specify a desired target shopkeeper interaction style.

(TO BE CONTINUED)

NOTES

1 https://www.google.com/intl/en/chrome/demos/speech.html

2 http://www.geminoid.jp/dataset/camerashop/dataset-camerashop.htm

This work was supported by JST ERATO Ishiguro Symbiotic Human-Robot Interaction Project, Grant Number JPMJER1401 and in part by JSPS KAKENHI Grant Number 25240042.

The authors are with the Advanced Telecommunications Research Institute International, Hiroshi Ishiguro Laboratories (e-mail: phoebeccliu@gmail.com; dylan.f.glas@gmail.com) and Intelligent Robotics and Communication Laboratories (e-mail: kanda@atr.jp), Kyoto 619-0288, Japan. Hiroshi Ishiguro is with the Intelligent Robotics Laboratory, Osaka University, Toyonaka, Japan (e-mail: ishiguro@sys.es.osaka-u.ac.jp).

SOURCE http://www.dylanglas.com
