Conversational Transfer Learning for Emotion Recognition

Devamanyu Hazarika^a, Soujanya Poria^c,*, Roger Zimmermann^a, Rada Mihalcea^b

^a School of Computing, National University of Singapore
^b Computer Science & Engineering, University of Michigan, USA
^c Information Systems Technology and Design, Singapore University of Technology and Design
Abstract

Recognizing emotions in conversations is a challenging task due to the presence of contextual dependencies governed by self- and inter-personal influences. Recent approaches have focused on modeling these dependencies primarily via supervised learning. However, purely supervised strategies demand large amounts of annotated data, which is lacking in most of the available corpora for this task. To tackle this challenge, we look at transfer learning approaches as a viable alternative. Given the large amount of available conversational data, we investigate whether generative conversational models can be leveraged to transfer affective knowledge for the target task of detecting emotions in context. We propose an approach where we first train a neural dialogue generation model and then perform parameter transfer to initialize our target emotion classifier. Apart from the traditional pre-trained sentence encoders, we also incorporate parameter transfer from the recurrent components that model inter-sentence context across the whole conversation. Based on this idea, we perform several experiments across multiple datasets and find improvements in performance and robustness against limited training data. Our models also achieve better validation performance in significantly fewer epochs. Overall, we infer that knowledge acquired from dialogue generators can indeed help recognize emotions in conversations.

Keywords: Emotion Recognition in Conversations, Transfer Learning, Generative Pre-training
1. Introduction

Emotion Recognition in Conversations (ERC) is the task of detecting emotions from utterances in a conversation. It is an important task with applications ranging from dialogue understanding to affective dialogue systems [1]. Apart from the traditional challenges of dialogue understanding, such as intent detection and contextual grounding [2], ERC presents additional challenges as it requires the ability to model emotional dynamics governed by the self- and inter-speaker influences at play [3]. Further complications arise due to the limited availability of annotated data (especially in multimodal ERC) and the variability in annotations owing to the subjectivity of annotators in interpreting emotions.
In this work, we focus on these issues by investigating a framework of sequential inductive transfer learning (TL) [5]. In particular, we attempt to transfer contextual affective information from a generative conversation-modeling task to ERC. But why should generative modeling of conversations acquire knowledge of emotional dynamics? To answer this question, we first observe the role of emotions in conversations. Several works in the literature have indicated that emotional goals and influences act as latent controllers in dialogues [6, 7]. Fig. 1 provides
* Corresponding author. Contributions: ideation (use of transfer learning in emotion recognition in conversation from generative conversation modeling) and organization of the paper.
Email addresses: hazarika@comp.nus.edu.sg (Devamanyu Hazarika), soujanya_poria@sutd.edu.sg (Soujanya Poria), rogerz@comp.nus.edu.sg (Roger Zimmermann), mihalcea@umich.edu (Rada Mihalcea)
Preprint submitted to Journal of Elsevier October 11, 2019
[Figure 1: Samples from the Cornell Movie Dialog Corpus [4] demonstrating the presence of emotional dynamics in conversations: (a) an exchange between Harald and Snorri in which Snorri stays frustrated across turns; (b) an exchange between Enid and Rebecca who mirror each other's disgust; (c) an exchange between John and Enid in which John shifts from neutral to anger.]
some examples demonstrating the existence of diverse emotional dynamics in such conversations. In the figure, conversation (a) illustrates the presence of emotional inertia [8], which occurs through self-influences on emotional states: the character Snorri maintains a frustrated emotional state by not being affected by the other speaker. Conversations (b) and (c), in contrast, demonstrate the role of inter-speaker influences in emotional transitions across turns. In (c), the character John undergoes an emotional shift triggered by his counterpart's responses, while (b) demonstrates the effect of mirroring [9], which often arises from topical agreement between speakers. All these examples demonstrate emotional dynamics that are not just inherent in conversations but also help shape them [1].
To model such conversations, a generator would require the ability to 1) interpret latent emotions from the contextual turns and 2) model the complex dynamics governing them. In addition, it would also need to interpret other factors such as the topic of the conversation, speaker personalities, and intents. Such a model would then be a perfect dialogue generator. We illustrate this in Fig. 1, where a model generating utterance utt_{t+1} would need to understand the emotions of the context arising from the utterances utt_t, utt_{t-1}, and so on. We therefore hypothesize that a trained dialogue generator would possess the ability to model implicit affective patterns across a conversation [10]. Consequently, we propose a framework that uses TL to transfer this affective knowledge into our target discriminative task, i.e., ERC.
In our approach, we first pre-train a model on the source task of conversation modeling, which, being unsupervised (or self-supervised), typically benefits from a large amount of data in the form of multi-turn chats. Next, we adapt our model to the target task (ERC) by transferring the inter-sentence context-modeling parameters from the trained source model. For sentence encoding, we choose the BERT model [11], which is pre-trained on masked language modeling and next sentence prediction objectives.
Although we acknowledge that training a perfect dialogue generator is presently challenging, we demonstrate that benefits can be observed even with a popular baseline generator. In the bigger picture, our approach can enable the co-evolution of both generative and discriminative models for the tasks mentioned above: an emotion classifier improved using a dialogue model can, in turn, be utilized to further improve dialogue models with emotional intelligence, leading to an iterative cycle of improvements for both applications. Overall, our contributions comprise the proposed TL framework, experiments across multiple datasets showing improved performance and robustness to limited training data, and analyses of the transferred affective knowledge.
[Figure 3: Proposed framework for ERC using TL parameters. Source task (conversation modeling): sentence encoders feed a context RNN whose states drive decoders generating the next utterances. Target task (Emotion Recognition in Conversations): sentence encoders feed a context RNN whose states drive per-utterance classifiers. The transferred parameters are {θ_enc^source, θ_BERT} for sentence encoding and θ_cxt^source for context encoding.]
2.2. Transfer Learning for Affect

TL for affective analysis has gained momentum in recent years, with several works adopting TL-based approaches for their respective tasks. These works leverage diverse source tasks, such as sentiment/emotion analysis in text [23, 24, 25], large-scale image classification in vision [26], and sparse auto-encoding in speech [27]. To the best of our knowledge, our work is among the first to explore TL in ERC.
2.3. Emotion Recognition in Conversations

ERC is an emerging sub-field of affective computing and is developing into an active area of research. Current works model contextual relationships amongst utterances in a supervised fashion to capture the implicit emotional dynamics. Strategies include modeling speaker-based dependencies using recurrent neural networks [28, 29], memory networks [3, 30], graph neural networks [31, 32], and quantum-inspired networks [33], amongst others. Some of these works also explore challenges such as multi-speaker modeling [34], multimodal processing [30], and knowledge infusion [35]. However, there is a dearth of work that considers the scarcity of annotated data and leverages TL to transfer affective knowledge from generative models. Our work strives to fill this gap by providing a systematic study of TL in ERC.
Our proposed framework is summarized in Fig. 3. First, we define the source generative model, trained as a dialogue generator. We then describe the target model, which performs hierarchical context encoding for the task of ERC using BERT-based sentence encoders and context weights learnt from the source model.
3.1. Source: Generative Conversation Modeling

To perform the generative task of conversation modeling, we use the Hierarchical Recurrent Encoder-Decoder (HRED) architecture [36]. HRED is a classic framework for seq2seq conversational response generation that models conversations in a hierarchical fashion using three sequential components: encoder recurrent neural networks (RNNs) for sentence encoding, context RNNs for modeling the conversational context across sentences, and decoder RNNs for generating the response sentence. For a given conversation context with sentences x_1, ..., x_t, HRED generates the response x_{t+1} as follows:
h_t^{enc} = f_{\theta_{enc}}(x_t, h_{t-1}^{enc})
h_t^{cxt} = f_{\theta_{cxt}}(h_t^{enc}, h_{t-1}^{cxt})
p_\theta(x_{t+1} \mid x_{\le t}) = f_{\theta_{dec}}(x_{t+1} \mid h_t^{cxt}) = \prod_i f_{\theta_{dec}}(x_{t+1,i} \mid h_t^{cxt}, x_{t+1,<i})
With the i-th conversation being a sequence of utterances C_i = [x_{i,1}, ..., x_{i,n_i}], HRED is trained on all the conversations in the dataset together using the maximum likelihood objective arg max_θ Σ_i log p_θ(C_i). The HRED model allows multiple complexities to be introduced, such as multi-layer RNNs and other novel encoding strategies. In this work, we experiment with the original version of the architecture with single-layer components, so that we can analyze our hypothesis without unwanted contributions from added complexities. In our source model, f_{θ_enc} can be any RNN function, which we model using the bi-directional Gated Recurrent Unit (GRU) variant [37] to encode each sentence. We denote the parameters associated with this GRU as θ_enc^source. For both the context RNN (f_{θ_cxt}) and the decoder RNN, we use uni-directional GRUs, with parameters θ_cxt^source and θ_dec^source, respectively, and complement the decoder with beam decoding for generation.^1
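To make the architecture concrete, the following is a minimal PyTorch sketch of HRED's three components and the MLE training signal. The dimensions, vocabulary size, and teacher-forced decoding are illustrative assumptions; this is not the exact implementation adapted from the repository cited in footnote 1.

```python
import torch
import torch.nn as nn

class HRED(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # f_{theta_enc}: bi-directional GRU over the tokens of one utterance
        self.encoder = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        # f_{theta_cxt}: uni-directional GRU over utterance representations
        self.context = nn.GRU(2 * hid_dim, hid_dim, batch_first=True)
        # f_{theta_dec}: uni-directional GRU that generates the response
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, utterances, response):
        # utterances: (batch, n_turns, n_tokens); response: (batch, n_tokens)
        b, t, l = utterances.shape
        tokens = self.embedding(utterances.view(b * t, l))
        _, h_enc = self.encoder(tokens)                   # (2, b*t, hid)
        h_enc = h_enc.transpose(0, 1).reshape(b, t, -1)   # concat both directions
        h_cxt, _ = self.context(h_enc)                    # h_t^{cxt} per turn
        # condition the decoder on the last context state (teacher forcing)
        h0 = h_cxt[:, -1:].transpose(0, 1).contiguous()
        dec_out, _ = self.decoder(self.embedding(response[:, :-1]), h0)
        return self.out(dec_out)                          # logits over the vocabulary

# Maximizing sum_i log p_theta(C_i) amounts to cross-entropy on the shifted response:
# loss = nn.CrossEntropyLoss()(logits.flatten(0, 1), response[:, 1:].flatten())
```

At generation time, the greedy teacher-forced decoding above would be replaced by beam decoding, as noted in the text.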
3.2. Target: Emotion Recognition in Conversations

The input for this task is also a conversation C with constituent utterances [x_1, ..., x_n]. Each x_i is associated with an emotion label y_i ∈ Y. We adopt a setup similar to the three components described for the source task, as in Poria et al. [38]. However, the decoder in this setup is replaced by a discriminative mapping to the label space instead of a generative network. Below, we describe the different initialization parameters that we consider for the first two stages of the network:
3.2.1. Sentence Encoding

To encode each utterance in the conversation, we consider the state-of-the-art universal sentence encoder BERT [11], with its parameters represented as θ_BERT. We choose BERT over the HRED sentence encoder (θ_enc^source) as it provides better performance (see Table 7). Moreover, BERT includes next sentence prediction as one of its training objectives, which aligns with the inter-sentence level of abstraction that we consider in this work. We choose the BERT-base uncased pre-trained model as our sentence encoder.^2 Although this model contains 12 transformer layers, to limit the total number of parameters in our model, we restrict it to the first 4 transformer layers. To get a sentential representation, we take the hidden vectors of the first token [CLS] across the considered transformer layers (see Devlin et al. [11]) and mean-pool them to obtain the final sentence representation.
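As a rough illustration of this pooling scheme, the sketch below extracts the [CLS] hidden vector from each of the first 4 transformer layers and mean-pools them. It uses the current Hugging Face transformers API for readability (the paper cites the older pytorch-pretrained-BERT package), and it runs the full 12-layer model rather than truncating it to 4 layers as the paper does to limit parameters.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

def encode_sentence(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    # hidden_states: tuple of 13 tensors (embedding layer + 12 transformer
    # layers), each of shape (1, seq_len, 768); keep only layers 1-4
    layers = outputs.hidden_states[1:5]
    cls_vectors = torch.stack([h[:, 0, :] for h in layers])  # [CLS] per layer
    return cls_vectors.mean(dim=0).squeeze(0)                # (768,)

sent_repr = encode_sentence("Do you want something to drink?")
```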
3.2.2. Context Encoding

We use a context encoder RNN similar to the source HRED model, with the option to transfer the learnt parameters θ_cxt^source. For the input sentence representation h_t^{enc} provided by the encoder, the context RNN transforms it as follows:
z_t = \sigma(V^z h_t^{enc} + W^z h_{t-1}^{cxt} + b^z)
r_t = \sigma(V^r h_t^{enc} + W^r h_{t-1}^{cxt} + b^r)
\tilde{h}_t = \tanh(V h_t^{enc} + W (r_t \odot h_{t-1}^{cxt}) + b)
h_t^{cxt} = (1 - z_t) \odot h_{t-1}^{cxt} + z_t \odot \tilde{h}_t
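A hedged sketch of the resulting target network and the transfer step follows: the target context RNN shares the source context GRU's architecture, so θ_cxt^source can be copied in directly, and the generative decoder is replaced by a linear mapping to the label space. How the 768-dimensional BERT representations are reconciled with the source context RNN's input size is not spelled out here, so the projection layer, the dimensions, and the class count are our assumptions; `hred` refers to a trained source model as in the Section 3.1 sketch.

```python
import torch.nn as nn

class ERCModel(nn.Module):
    def __init__(self, bert_dim=768, cxt_in_dim=512, hid_dim=256, n_classes=6):
        super().__init__()
        # assumed linear projection from BERT space to the source
        # context RNN's expected input size (2 * encoder hidden size)
        self.project = nn.Linear(bert_dim, cxt_in_dim)
        # same uni-directional GRU form as the source context RNN
        self.context = nn.GRU(cxt_in_dim, hid_dim, batch_first=True)
        # the generative decoder is replaced by a mapping to the label space
        self.classifier = nn.Linear(hid_dim, n_classes)

    def forward(self, sent_reprs):            # (batch, n_turns, bert_dim)
        h_cxt, _ = self.context(self.project(sent_reprs))
        return self.classifier(h_cxt)         # one label distribution per turn

model = ERCModel()
# transfer theta_cxt^{source} from the trained source HRED (Section 3.1)
model.context.load_state_dict(hred.context.state_dict())
```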
^1 Model implementations are adapted from https://github.com/ctr4si/
^2 https://github.com/huggingface/pytorch-pretrained-BERT
        Dataset            train       validation   test
Source
        Cornell      #D    66,477      8,310        8,…
                     #U    244,030     30,436       30,…
        Ubuntu       #D    898,142     18,920       19,…
                     #U    6,893,060   135,747      139,…
Target
        IEMOCAP      #D    120         –            …
                     #U    5,810       –            1,…
        SEMAINE      #D    58          –            …
                     #U    4,386       –            1,…
        DailyDialog  #D    11,118      1,000        1,000
                     #U    87,170      7,740        8,…

Table 1: Sizes of the datasets used in this work. #D represents the number of dialogues, whereas #U represents the total number of constituent utterances.
Table 1 provides the sizes along with the split distributions for the above-mentioned datasets. For both IEMOCAP and SEMAINE, we generate the validation sets by randomly sampling 20% of the dialogue videos from the training sets.
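A minimal sketch of this split generation, assuming train_dialogues is a list of dialogue (video) objects and that a fixed seed is used for reproducibility (an assumption, as the paper does not specify one):

```python
import random

random.seed(0)  # assumed seed, for reproducibility
held_out = set(random.sample(range(len(train_dialogues)), k=len(train_dialogues) // 5))
val_set = [d for i, d in enumerate(train_dialogues) if i in held_out]
train_set = [d for i, d in enumerate(train_dialogues) if i not in held_out]
```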
4.1.3. Metrics

We choose the pre-training weights from the source task based on the best validation perplexity score [40]. For ERC, we use the weighted F-score for the classification tasks on IEMOCAP and DailyDialog. For DailyDialog, we remove the no emotion class from the F-score calculations due to its high majority (82.6%/81.3% occupancy in the training/testing set), which hinders the evaluation of the other classes.^3 For the regression task on SEMAINE, we take the Pearson correlation coefficient (r) as the metric. We also provide the average best epoch (BE) at which the lowest validation losses, across the multiple runs, are observed and the testing evaluations are performed. A lower BE indicates the model's ability to reach optimal performance in fewer training epochs.
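For DailyDialog, the exclusion of the majority class can be expressed with scikit-learn's labels argument, which restricts the classes entering the weighted average. The label indices below are illustrative assumptions, not the dataset's canonical encoding:

```python
from sklearn.metrics import f1_score

NO_EMOTION = 0                        # assumed index of the "no emotion" class
EMOTION_LABELS = [1, 2, 3, 4, 5, 6]   # assumed: anger, disgust, fear, joy, sadness, surprise

def daily_dialog_f1(y_true, y_pred):
    # `labels=` restricts which classes enter the weighted average, so
    # "no emotion" contributes neither a per-class F1 nor a class weight
    return f1_score(y_true, y_pred, labels=EMOTION_LABELS, average="weighted")
```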
4.2. Model Size

We consider two versions of the source generative model, HRED-small and HRED-large, with 256- and 1000-dimensional hidden states, respectively. Testing both on the IEMOCAP dataset, we find that the context weights from HRED-small (trained on the Cornell dataset) provide better performance on average (58.5% F-score) than those from HRED-large (55.3% F-score). Following this observation, and to avoid over-fitting on the small target datasets due to the increased number of parameters, we choose HRED-small as the source model for our TL procedure.
4.3. Training Criteria

We train our models on each target dataset for multiple runs (10 for IEMOCAP, 5 for DailyDialog, 5 for SEMAINE). In each run, we evaluate performance on the testing set using the parameters that provide the lowest validation loss, and we use early stopping (patience 10) as the stopping criterion, as sketched below. We perform a hyper-parameter search for the different datasets and models, where we keep the model architecture constant but vary the learning rate (1e-3, 1e-4, and 1e-5), the optimizer (Adam, RMSprop [44]), the batch size (2-… videos/batch), and the dropout ({0.0, 0.5}; BERT parameters use a dropout of 0.1, as in Devlin et al. [11]). The best combination is chosen based on performance on the respective validation sets. In the case of a negligible difference between combinations, we use the Adam optimizer [45] as the default, with β = [0.9, 0.999] and learning rate 1e-4.
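A minimal sketch of the stopping criterion: keep the checkpoint with the lowest validation loss and stop after 10 epochs without improvement; the epoch of the best checkpoint feeds the BE metric. train_epoch and eval_loss are assumed helper functions, not part of the paper:

```python
import copy

def train_with_early_stopping(model, patience=10, max_epochs=100):
    best_loss, best_state, best_epoch, wait = float("inf"), None, 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(model)                   # assumed: one pass over training data
        val_loss = eval_loss(model)          # assumed: loss on the validation set
        if val_loss < best_loss:
            best_loss, best_epoch, wait = val_loss, epoch, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:             # 10 epochs with no improvement
                break
    model.load_state_dict(best_state)        # test with the best checkpoint
    return model, best_epoch                 # best_epoch feeds the BE metric
```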
     Initial Weights                         Model Description
     sentenc     cxtenc
1)   –           –                           Parameters of both the sentence and context encoders are randomly initialized.
2)   θ_BERT      –                           Sentence encoders are initialized with BERT parameters; context encoders are randomly initialized.
3)   θ_BERT      θ_cxt^{ubuntu/cornell}      Sentence encoders are initialized with BERT parameters; context encoders are initialized from generative models pre-trained on the Ubuntu/Cornell corpus.

Table 2: Variants of the model used in the experiments.
Dataset: IEMOCAP
Initial Weights                 10%              25%              50%              100%
sentenc   cxtenc                F-Score    BE    F-Score    BE    F-Score    BE    F-Score    BE
θ_BERT    θ_cxt^ubuntu          35.7±1.1   14.2  45.9±2.0   11.2  53.1±0.7†  7.8   58.8±0.5†  5.…
θ_BERT    θ_cxt^cornell         36.3±1.1†  17.0  46.0±0.5†  11.2  50.9±1.5   8.2   58.5±0.8   5.…

Table 3: IEMOCAP results. Metric: weighted F-score averaged over 10 random runs. BE = Best Epoch. Results span different amounts of available training data. Validation and testing splits are fixed across configurations. † represents a significant difference with p < 0.05 over the randomly initialized model as per a two-tailed Wilcoxon rank-sum hypothesis test [46].
4.4. Model Variants

We experiment with different variants based on their parameter initialization. A summary of these variants is provided in Table 2.
Tables 3 and 4 report the ERC results on the classification datasets IEMOCAP and DailyDialog, respectively. In both tables, we observe clear and statistically significant improvements for the models that use pre-trained weights over the randomly initialized variant. We see further improvements when the context-modeling parameters from the source task (θ_cxt^source) are transferred, indicating the benefit of applying TL at this context-level hierarchy. Similar trends are observed in the regression task on the SEMAINE corpus (see Table 5). For the valence, arousal, and power dimensions, the improvement is significant. For expectation, the performance is only marginally better, but at a much lower BE, indicating faster generalization.
In the following sections, we take a closer look at various aspects of our approach, including robustness in limited-data scenarios, generalization time, and design choices. We also provide additional analyses that probe the existence of data-split bias, domain influence, and the effect of fine-tuning strategies.
5.1. Target Data Size

Present approaches in ERC primarily adopt supervised learning strategies that demand a large amount of annotated data. However, the publicly available datasets in this field fall in the small-to-medium range of the spectrum of dataset sizes in NLP. This constraint inhibits the true potential of systems trained on these datasets.
^3 Evaluation strategy adapted from the SemEval-2019 ERC task: www.humanizing-ai.com/emocontext.html
[Figure 4: Validation loss across epochs for different weight-initialization settings ({θ_BERT} vs. {θ_BERT + θ_cxt^cornell}) on the IEMOCAP dataset. Part a) shows results when trained on 100% of the training data, part b) on a 10% training split. For fair comparison, the optimizer learning rates are fixed at 1e-4.]
[Table 6: Investigation of whether split randomness incurs bias in the results. Comparisons are held between two limited-training-data scenarios comprising 10% and 50% of the available training data; for both cases, 4 independently sampled splits (split*1 to split 4) are compared, with sentenc = θ_BERT. Metric: weighted F-score averaged over 10 random runs.]
demonstrate that BERT provides better representations, which lead to better performance. Moreover, the positive effects of the context parameters are observed when coupled with the BERT encoders. This behavior indicates that the performance boost provided by the context encoders is contingent on the quality of the sentence encoders. Based on this empirical evidence, we choose BERT-based sentence encoders in our final network.
5.4. Impact of Source Domain

We investigate whether the choice of source dataset incurs any significant change in the results. First, we define an emotional profile for the source datasets and check whether any correlation exists between their emotive content and the performance boost achieved by pre-training on them. To set up an emotional profile, we look at the respective vocabularies of both corpora. For each token, we check its association with any emotion using the emotion lexicon provided by Mohammad and Turney [47]. The NRC Emotion Lexicon contains 6,423 words belonging to the emotion categories fear, trust, anger, sadness, anticipation, joy, surprise, and disgust. It also assigns two broad categories, positive and negative, to describe the type of connotation evoked by the words. We count the frequency of each emotion category amongst the tokens of each source dataset's vocabulary. To compose the vocabularies of the two source datasets, we set a minimum frequency threshold of 5, which yields 13,518 and 18,473 unique tokens
[Table 7: HRED encoder vs. BERT. Metric: weighted F-score averaged over 10 random runs. BE = Best Epoch (average).]
[Figure 5: a) Frequency of emotive words (per NRC category: negative, positive, fear, trust, anger, sadness, anticipation, disgust, joy, surprise) in the source datasets Cornell and Ubuntu. b) Randomly sampled Cornell words associated with selected emotions, e.g., Negative: reckless, trap, plight, timid, profanity; Positive: sweetheart, jackpot, exciting, harmony; Anger: accusation, storm, brazen, depraved, revolt; Joy: praise, heavenly, satisfied, confidence, lucky.]
for Cornell and Ubuntu, respectively. Each unique token is then lemmatized^4 and cross-referenced with the lexicon, which yields 3,099 (Cornell) and 2,003 (Ubuntu) tokens with associated emotions.
Fig. 5 presents the emotional profiles, which indicate that the Cornell dataset has a higher number of emotive tokens in its vocabulary. However, the results illustrated in Tables 3, 4, and 5 do not show any significant difference between the two sources. A possible reason for this behavior is that such an emotional profile relies on surface emotions derived from the vocabularies, whereas, as per our hypothesis, response generation includes emotional understanding as a latent process. This leads us to believe that surface emotions need not correlate with performance increments; rather, the quality of generation would include such properties intrinsically.
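A sketch of this profiling procedure is given below. It assumes the standard tab-separated NRC lexicon file format and a vocabulary given as token-to-frequency counts; both are assumptions about details the paper does not specify.

```python
from collections import Counter
from nltk.stem import WordNetLemmatizer

def load_nrc(path):
    # assumed file format: "word<TAB>category<TAB>0/1" per line
    lexicon = {}
    with open(path) as f:
        for line in f:
            word, category, flag = line.strip().split("\t")
            if flag == "1":
                lexicon.setdefault(word, set()).add(category)
    return lexicon

def emotional_profile(vocab_counts, lexicon, min_freq=5):
    # count, per NRC category, how many vocabulary tokens
    # (with frequency >= min_freq) carry that association
    lemmatize = WordNetLemmatizer().lemmatize
    profile = Counter()
    for token, freq in vocab_counts.items():
        if freq >= min_freq:
            for category in lexicon.get(lemmatize(token), ()):
                profile[category] += 1
    return profile  # e.g. Counter({'negative': ..., 'joy': ...})
```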
In this section, we list the different challenges that we observed while experimenting with the proposed idea. These challenges provide roadmaps for further research on this topic towards building better and more robust systems.
6.1. Adaptation Strategies

We try the two primary adaptation techniques used in inductive TL: frozen and fine-tuned. In the former setting, the borrowed weights are used purely for feature extraction, while in the latter, we train the weights along with the rest of the network, as sketched below.
^4 https://www.nltk.org/_modules/nltk/stem/wordnet.html
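The two strategies differ only in whether gradients flow into the transferred weights. A minimal sketch, assuming model is the ERC network from Section 3.2 with the transferred context RNN:

```python
import torch

def set_adaptation(model, strategy="fine-tuned"):
    # "frozen": the borrowed context weights act as a fixed feature extractor;
    # "fine-tuned": they keep training along with the rest of the network
    for param in model.context.parameters():
        param.requires_grad = (strategy == "fine-tuned")

set_adaptation(model, "frozen")
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```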
Initial Weights     IEMOCAP F-Score   DailyDialog F-Score
HRED                58.5              48.…
VHRED               58.6              48.…

Table 9: Average performance on ERC with pre-trained weights {θ_BERT + θ_cxt^cornell}; for VHRED, θ_cxt^cornell contains additional parameters modeling the latent prior state.
Generative Training   IEMOCAP F-Score   DailyDialog F-Score
Source                58.5              48.…
Source + Target       58.0              47.…

Table 10: Average performance on ERC with pre-trained weights {θ_BERT + θ_cxt^cornell}.
6.4. In-domain Generative Fine-tuning

We try in-domain tuning of the generative HRED model by performing conversation modeling on the ERC resources, and then transfer these re-tuned weights to the discriminative ERC task. However, we do not find this procedure to be helpful (Table 10). TL between generative tasks, especially with small-scale target resources, is challenging. As a result, we observe sub-optimal generation on the ERC datasets, and transferring these weights to the classification task does not provide any improvement.
In this paper, we presented a novel transfer learning framework for ERC that uses pre-trained affective information from dialogue generators. We presented experiments in different scenarios to investigate the effect of this procedure and found that using such pre-trained weights helps the overall task and also provides the added benefit of fewer training epochs for good generalization. We primarily experimented on dyadic conversations in both the source and the target tasks. In the future, we aim to investigate the more general setting of multi-party conversations. This setting will increase the complexity of the task, as pre-training will require multi-party data and special training schemes to capture complex influence dynamics.
Acknowledgement
This research is supported by the Singapore Ministry of Education Academic Research Fund Tier 1 under MOE's official grant number T1 251RES1820. We also gratefully acknowledge the support of NVIDIA Corporation through the donation of a Titan Xp GPU used for this research.
References
[1] S. Poria, N. Majumder, R. Mihalcea, E. H. Hovy, Emotion recognition in conversation: Research challenges, datasets, and recent advances, IEEE Access 7 (2019) 100943–100953. doi:10.1109/ACCESS.2019.2929050.
[2] H. Chen, X. Liu, D. Yin, J. Tang, A survey on dialogue systems: Recent advances and new frontiers, SIGKDD Explorations 19 (2017) 25–35. doi:10.1145/3166054.3166058.
[3] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L. Morency, R. Zimmermann, Conversational memory network for emotion recognition in dyadic dialogue videos, in: Proceedings of NAACL-HLT 2018, New Orleans, Louisiana, USA, Volume 1 (Long Papers), 2018, pp. 2122–2132. URL: https://www.aclweb.org/anthology/N18-1193/.
[4] C. Danescu-Niculescu-Mizil, L. Lee, Chameleons in imagined conversations: A new approach to understanding coordination of linguistic style in dialogs, in: Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics, Association for Computational Linguistics, 2011, pp. 76–87.
[5] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (2010) 1345–1359. doi:10.1109/TKDE.2009.191.
[6] E. Weigand, Emotions in dialogue, Dialoganalyse VI/1: Referate der 6. Arbeitstagung, Prag 1996 16 (2017) 35.
[7] J. Sidnell, T. Stivers, The Handbook of Conversation Analysis, volume 121, John Wiley & Sons, 2012.
[8] P. Koval, P. Kuppens, Changing emotion dynamics: individual differences in the effect of anticipatory social stress on emotional inertia, Emotion 12 (2012) 256.
[9] C. Navarretta, Mirroring facial expressions and emotions in dyadic conversations, in: Proceedings of the Tenth International Conference on Language Resources and Evaluation, LREC 2016, Portorož, Slovenia, 2016. URL: http://www.lrec-conf.org/proceedings/lrec2016/summaries/258.html.
[10] T. Shimizu, N. Shimizu, H. Kobayashi, Pretraining sentiment classifiers with unlabeled dialog data, in: Proceedings of ACL 2018, Melbourne, Australia, Volume 2 (Short Papers), 2018, pp. 764–770. doi:10.18653/v1/P18-2121.
[11] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: [51], 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423/.
[12] R. K. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, J. Mach. Learn. Res. 6 (2005) 1817–1853. URL: http://jmlr.org/papers/v6/ando05a.html.
[13] S. Ruder, M. E. Peters, S. Swayamdipta, T. Wolf, Transfer learning in natural language processing, in: Proceedings of NAACL-HLT 2019: Tutorial Abstracts, Minneapolis, MN, USA, 2019, pp. 15–18. URL: https://www.aclweb.org/anthology/N19-5004/.
[14] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: Advances in Neural Information Processing Systems 26, Lake Tahoe, Nevada, USA, 2013, pp. 3111–3119.
[15] B. McCann, J. Bradbury, C. Xiong, R. Socher, Learned in translation: Contextualized word vectors, in: Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, 2017, pp. 6294–6305. URL: http://papers.nips.cc/paper/7209-learned-in-translation-contextualized-word-vectors.
[16] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, in: Proceedings of NAACL-HLT 2018, New Orleans, Louisiana, USA, Volume 1 (Long Papers), 2018, pp. 2227–2237. URL: https://www.aclweb.org/anthology/N18-1202/.
[17] A. M. Dai, Q. V. Le, Semi-supervised sequence learning, in: Advances in Neural Information Processing Systems 28, Montreal, Quebec, Canada, 2015, pp. 3079–3087. URL: http://papers.nips.cc/paper/5949-semi-supervised-sequence-learning.
[18] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, CoRR abs/1906.08237 (2019). URL: http://arxiv.org/abs/1906.08237.
[19] J. Howard, S. Ruder, Universal language model fine-tuning for text classification, in: Proceedings of ACL 2018, Melbourne, Australia, Volume 1 (Long Papers), 2018, pp. 328–339. doi:10.18653/v1/P18-1031.
[20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692.
[21] L. Chen, A. Moschitti, Transfer learning for sequence labeling using source model and target data, arXiv preprint arXiv:1902.05309 (2019).
[22] M. Qiu, L. Yang, F. Ji, W. Zhou, J. Huang, H. Chen, B. Croft, W. Lin, Transfer learning for context-aware question matching in information-seeking conversations in e-commerce, in: Proceedings of ACL 2018, Volume 2 (Short Papers), 2018, pp. 208–213.
[23] J. Yu, L. Marujo, J. Jiang, P. Karuturi, W. Brendel, Improving multi-label emotion classification via sentiment classification with dual attention transfer network, in: [52], 2018, pp. 1097–1102. URL: https://www.aclweb.org/anthology/D18-1137/.
[24] G. Daval-Frerot, A. Bouchekif, A. Moreau, EPITA at SemEval-2018 Task 1: Sentiment analysis using transfer learning approach, in: Proceedings of the 12th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2018, New Orleans, Louisiana, USA, 2018, pp. 151–155. URL: https://www.aclweb.org/anthology/S18-1021/.
[25] A. Bouchekif, P. Joshi, L. Bouchekif, H. Afli, EPITA-ADAPT at SemEval-2019 Task 3: Detecting emotions in textual conversations using deep learning models combination, in: Proceedings of the 13th International Workshop on Semantic Evaluation, 2019, pp. 215–219.
[26] H. Ng, V. D. Nguyen, V. Vonikakis, S. Winkler, Deep learning for emotion recognition on small datasets using transfer learning, in: Proceedings of the 2015 ACM International Conference on Multimodal Interaction, Seattle, WA, USA, ACM, 2015, pp. 443–449. doi:10.1145/2818346.2830593.
[27] J. Deng, Z. Zhang, E. Marchi, B. W. Schuller, Sparse autoencoder-based feature transfer learning for speech emotion recognition, in: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013, Geneva, Switzerland, IEEE Computer Society, 2013, pp. 511–516. doi:10.1109/ACII.2013.90.
[28] A. V. González-Garduño, V. P. B. Hansen, J. Bingel, I. Augenstein, A. Søgaard, CoAStaL at SemEval-2019 Task 3: Affect classification in dialogue using attentive BiLSTMs, in: Proceedings of the 13th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2019, Minneapolis, MN, USA, 2019, pp. 169–174. URL: https:
[51] J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019. URL: https://www.aclweb.org/anthology/volumes/N19-1/.
[52] E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, Association for Computational Linguistics, 2018. URL: https://www.aclweb.org/anthology/volumes/D18-1/.
[53] S. Kraus (Ed.), Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, ijcai.org, 2019. doi:10.24963/ijcai.