Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
Abstract: Deep multi-modal learning, a rapidly growing field with a wide range of practical applications, aims to effectively utilize and integrate information from multiple sources, known as modalities. Despite its impressive empirical performance, the theoretical foundations of deep multi-modal learning have yet to be fully explored. In this paper, we undertake a comprehensive survey of recent developments in multi-modal learning theories, focusing on the fundamental properties that govern this field. Our goal is to provide a thorough collection of current theoretical tools for analyzing multi-modal learning, to clarify their implications for practitioners, and to suggest future directions for the establishment of a solid theoretical foundation for deep multi-modal learning.
Key words: multi-modal learning, machine learning theory, optimization, generalization
Our perception of the world is based on different modalities, e.g., sight, sound, movement, touch, and even smell (Smith et al., 2005). Inspired by the success of deep learning (Krizhevsky et al., 2017; He et al., 2016), deep multi-modal research has also flourished, covering fields such as audio-visual learning (Chen et al., 2020a; Wang et al., 2020), RGB-D semantic segmentation (Seichter et al., 2020; Hazirbas et al., 2017), and Visual Question Answering (Goyal et al., 2017; Anderson et al., 2018). Deep multi-modal learning has undoubtedly made a significant impact on a plethora of fields, resulting in remarkable performance gains. However, despite its practical success, the machine learning community still lacks a comprehensive understanding of the underlying theoretical principles. Despite the many advancements that have been made, a significant amount of work remains to be done to understand why deep multi-modal learning methods work and when they may fail. This includes further theoretical research into the various components that make up these models, such as the neural networks, the multi-modal data inputs, and the optimization techniques used to train them. Additionally, it is important to explore the different types of data that these models can handle, as well as the different types of tasks to which they can be applied. Ultimately, by gaining a deeper understanding of these models and their underlying principles, we can continue to improve their performance and make them more widely applicable to a variety of fields.
Typically, when developing a theory for deep multi-modal learning, we must address two fundamental problems that are crucial to understanding machine learning:
1) Generalization, which ensures that the difference between the training error and the test error is small. This means that the model is able to accurately predict new data, even data it has not seen before;
2) Optimization, which refers to the effectiveness of algorithms during the training process in solving various learning tasks. This involves finding the optimal solution for the problem at hand, so that the model can perform well on different types of data.
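To make the first notion concrete, a standard (generic, not specific to any surveyed work) way to formalize the generalization gap of a model $f$ trained on a sample $\{(x_i,y_i)\}_{i=1}^{n}$ drawn i.i.d. from a distribution $\mathcal{D}$ is
$$\mathrm{gen}(f)\;=\;\underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f(x),y)\big]}_{\text{population (test) risk}}\;-\;\underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big)}_{\text{empirical (training) risk}},$$
where $\ell$ is a loss function; the generalization analyses reviewed below bound this quantity, while the optimization analyses concern how well training algorithms minimize the empirical risk itself.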
Consequently, we have conducted a thorough review of the literature on multi-modal learning theory, primarily using the aforementioned taxonomy as a guide. A significant portion of multi-modal learning algorithms that are supported by theoretical guarantees are derived from a generalization perspective. Based on the types of learning tasks, we classify the existing theoretical work on the generalization properties of multi-modal learning into three primary categories: (i) supervised learning, (ii) semi-supervised learning, and (iii) self-supervised learning.
On the other hand, from an optimization perspective, in addition to traditional convergence analysis based on convex objectives, the theory of deep multi-modal learning must contend with the non-convex landscape of neural networks, a challenging task that lacks proper theoretical tools. In light of recent developments in deep learning theory, we also review the optimization theory of multi-modal learning from the perspectives of convergence guarantees and feature learning. Furthermore, we examine theoretical work on multi-modal learning that cannot be explicitly categorized within the aforementioned frameworks. The basic organization of this survey is summarized in Fig. 1.
Fig.1 The basic organization of this survey
The objective of this survey is to re-examine the theoretical advancements in multi-modal learning and to succinctly summarize the significant obstacles encountered in current analysis. Additionally, we aim to present our perspectives on potential future directions within this field. Our expectation is that this survey will furnish researchers with robust tools for the theoretical examination of the performance of multi-modal learning methods, thereby enabling them to gain a deeper understanding of this area and ultimately promote its advancement.
Several related studies have reviewed the development of multi-modal learning from different perspectives (Sun, 2013; Xu et al., 2013; Ramachandram et al., 2017; Baltrušaitis et al., 2019; Li et al., 2019; Guo et al., 2019; Gao et al., 2020; Liang et al., 2022). However, most of these surveys focus on the applications of representative multi-modal approaches or only briefly discuss the underlying theoretical foundations. Among them, the efforts closest to our survey are Sun (2013) and Liang et al. (2022). Below, we elaborate on the key differences between these two surveys and our work.
Sun (2013) provided a comprehensive review of theories on multi-view learning and classified them into four categories: CCA, effectiveness of co-training, generalization error analysis for co-training, and generalization error analysis for other multi-view learning approaches. They mainly investigated multi-view supervised/semi-supervised methods, and their analyses largely rely on conventional statistical learning and convex optimization wisdom. However, research progress in multi-modal learning has grown rapidly over the last decade, especially in the area of self-supervised learning since Sun (2013) was published. Moreover, the traditional wisdom cannot be directly applied to the deep multi-modal setting. In comparison, our survey considers the recent advances of self-supervised multi-modal methods and also takes the up-to-date theoretical techniques in deep learning into consideration.
Recently, Liang et al. (2022) conducted an in-depth review of multi-modal learning by defining two fundamental principles of modality and presenting a categorization of six crucial technical difficulties. Furthermore, they delved into some theoretical advancements by utilizing this framework as a lens. In contrast to the comprehensive taxonomy provided by Liang et al. (2022), which covers all aspects of multi-modal learning, our survey focuses specifically on the theoretical domain of multi-modal learning. The categories we formulate are derived from a technical perspective in theory, offering novel viewpoints for multi-modal theoretical research. We hope that our survey will serve as a catalyst for further related research in this field.
Supervised multi-modal learning considers integrating information from multiple modalities supervised by a common signal. Conventional statistical theory provides various approaches for tackling the generalization error, including PAC-Bayes analysis (McAllester, 1999), bounds based on different complexity measures (e.g., VC dimension (Blumer et al., 1989) and Rademacher complexity (Koltchinskii et al., 2000)), and uniform stability (Bousquet et al., 2002). Researchers have utilized such traditional statistical wisdom to explore the theoretical underpinnings of supervised multi-modal learning.
PAC-Bayes bounds (McAllester, 1999) are one of the prevailing tools for studying the generalization properties of learning algorithms by providing probably approximately correct (PAC) guarantees. The advantage of PAC-Bayes analysis is that one can obtain tight bounds, since it provides data-dependent guarantees. Sun et al. (2017) adopted a novel PAC-Bayes analysis encoding the relationship between the two views through data-distribution-dependent priors, and provided various PAC-Bayes bounds for supervised multi-view learning.
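For reference, a classical form of the PAC-Bayes bound (stated generically here; the logarithmic factor varies across versions) says that, with probability at least $1-\delta$ over a sample $S$ of size $n$, for every posterior $Q$ over hypotheses,
$$\mathbb{E}_{h\sim Q}\big[L_{\mathcal{D}}(h)\big]\;\le\;\mathbb{E}_{h\sim Q}\big[\hat{L}_{S}(h)\big]\;+\;\sqrt{\frac{\mathrm{KL}(Q\,\|\,P)+\ln(n/\delta)}{2(n-1)}},$$
where $P$ is a prior fixed before seeing the data. The multi-view bounds of Sun et al. (2017) instantiate $P$ with data-distribution-dependent priors that encode the relationship between the two views.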
Another approach for analyzing generalization is Rademacher complexity (Koltchinskii et al., 2000), a distribution-dependent complexity measure for real-valued function classes. For multi-modal learning, Farquhar et al. (2005) proposed a novel single-optimization method termed SVM-2K by combining classic kernel Canonical Correlation Analysis (KCCA) and the Support Vector Machine (SVM), and provided performance guarantees in terms of empirical Rademacher complexity. Amini et al. (2009) derived a Rademacher-complexity-based generalization bound for classification with multiple artificially generated views, which identifies a trade-off between the number of views, the size of the training set, and the quality of the view-generating functions.
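For completeness, the empirical Rademacher complexity of a function class $\mathcal{F}$ on a sample $S=(x_1,\dots,x_n)$ is defined as
$$\hat{\mathfrak{R}}_{S}(\mathcal{F})\;=\;\mathbb{E}_{\sigma}\Big[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(x_{i})\Big],$$
where the $\sigma_i$ are independent Rademacher (uniform $\pm1$) random variables. Standard results then bound the generalization gap uniformly over $\mathcal{F}$ by $O\big(\hat{\mathfrak{R}}_{S}(\mathcal{F})+\sqrt{\ln(1/\delta)/n}\big)$ with probability at least $1-\delta$, and the multi-modal bounds above control this quantity for the corresponding multi-view function classes.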
The above multi-view analysis typically assumes that each view alone is sufficient to accurately predict the target, which may not hold in the current multi-modal setting. For instance, it is difficult to build a classifier using only a weak modality with limited labeled data, e.g., the depth modality in RGB-D images for the object detection task (Gupta et al., 2016). Based on a view insufficiency assumption, Xu et al. (2015) presented a learning algorithm, Multi-view Intact Space Learning (MISL), to explore the latent intact representation of the data, and they also proposed a novel concept of multi-view stability, which is used together with the Rademacher complexity to bound the generalization error of MISL. Recently, Huang et al. (2021) proved, using Rademacher complexity analysis under a popular multi-modal fusion framework, that learning with multiple modalities achieves a smaller population risk than using only a subset of the modalities. To the best of our knowledge, this is the first theoretical treatment to explain the superiority of multi-modal learning from the generalization standpoint. Besides, there is a by-product implication in Zhang et al. (2019) that a multi-view representation can recover the same performance as using only the single-view observation, by constructing the versatility under strict assumptions on the relationship across different modalities. However, their result is derived by simply analyzing the optimal solution of a certain loss function, which belongs to neither typical generalization nor optimization analysis.
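To make the fusion setting concrete, the following is a minimal late-fusion sketch in PyTorch; the per-modality encoder sizes and the fusion-by-concatenation choice are illustrative assumptions rather than the exact construction analyzed by Huang et al. (2021).

```python
import torch
import torch.nn as nn

class LateFusionNet(nn.Module):
    """Toy late-fusion model: one encoder per modality, fused by concatenation."""
    def __init__(self, dims=(128, 64), hidden=32, num_classes=10):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in dims
        )
        self.head = nn.Linear(hidden * len(dims), num_classes)

    def forward(self, xs):
        # xs: list of per-modality inputs; using only a subset of entries
        # corresponds to learning with a subset of modalities
        feats = [enc(x) for enc, x in zip(self.encoders, xs)]
        return self.head(torch.cat(feats, dim=-1))

# toy usage with two random "modalities"
model = LateFusionNet()
logits = model([torch.randn(8, 128), torch.randn(8, 64)])
print(logits.shape)  # torch.Size([8, 10])
```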
Uniform stability (Bousquet et al., 2002) is also a classic notion used to derive high-probability generalization error bounds. Zantedeschi et al. (2019) theoretically analyzed landmark-based SVMs in a multi-view classification setting by applying the uniform stability framework. More recently, Sun et al. (2022) introduced a view-consistency regularization and utilized analysis techniques from uniform stability to deduce a tighter stability-based PAC-Bayes bound for multi-view algorithms.
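Recall that, in the sense of Bousquet et al. (2002), an algorithm $A$ is $\beta$-uniformly stable if replacing any single training example changes its loss on any point by at most $\beta$:
$$\sup_{z}\;\big|\ell\big(A_{S},z\big)-\ell\big(A_{S^{i}},z\big)\big|\;\le\;\beta\qquad\text{for all }S\text{ and all }i,$$
where $S^{i}$ denotes the sample $S$ with its $i$-th example replaced; a small $\beta$ (typically $O(1/n)$) then translates into high-probability generalization bounds for the multi-view algorithms above.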
Supervised multi-modal learning approaches, which utilize both multi-modal and label information, have achieved satisfactory performance; however, they face a significant challenge: the collection of large-scale, well-annotated multi-modal training data is both prohibitively expensive and time-consuming. To address the issue of limited labeled data in multi-modal learning, semi-supervised (Guillaumin et al., 2010; Cheng et al., 2016) and self-supervised (Lee et al., 2019; Alayrac et al., 2020) methods, which make less use of label information and rely on exploiting correlations between modalities, have been proposed. Semi-supervised multi-modal learning considers the setting in which each modality of the data may contain only a small number of labeled examples and a large number of unlabeled examples. The majority of semi-supervised multi-modal learning methods can be divided into two branches: co-training (Blum et al., 1998) style algorithms and co-regularization (Brefeld et al., 2006) style algorithms. Therefore, in this section, we primarily discuss existing theories and techniques for these two branches of semi-supervised multi-modal learning methods.
Co-training is one of the most representative semi-supervised multi-modal approaches: it utilizes an initial small set of labeled two-modality data to learn a weak predictor on each modality, and then lets the predictors label confident instances for each other for further training, as sketched below.
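A minimal sketch of this procedure, assuming scikit-learn-style base classifiers, a fixed number of rounds, and a simple top-k most-confident selection rule (all illustrative choices, not the exact algorithm of Blum et al. (1998)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, k=5):
    """Toy co-training: per-modality classifiers pseudo-label confident unlabeled points for each other."""
    clf1, clf2 = LogisticRegression(), LogisticRegression()
    for _ in range(rounds):
        if len(X1_u) == 0:
            break
        clf1.fit(X1_l, y_l)          # y_l is assumed to contain at least two classes
        clf2.fit(X2_l, y_l)
        # each classifier selects the unlabeled points it is most confident about
        conf1 = clf1.predict_proba(X1_u).max(axis=1)
        conf2 = clf2.predict_proba(X2_u).max(axis=1)
        idx1 = np.argsort(-conf1)[:k]
        idx2 = np.setdiff1d(np.argsort(-conf2)[:k], idx1)
        picked = np.concatenate([idx1, idx2])
        # each classifier provides pseudo-labels for its own confident picks
        pseudo1 = clf1.predict(X1_u[idx1])
        pseudo2 = clf2.predict(X2_u[idx2]) if len(idx2) else np.array([], dtype=pseudo1.dtype)
        # move the pseudo-labeled points (both modalities of them) into the labeled pool
        X1_l = np.vstack([X1_l, X1_u[picked]])
        X2_l = np.vstack([X2_l, X2_u[picked]])
        y_l = np.concatenate([y_l, pseudo1, pseudo2])
        keep = np.setdiff1d(np.arange(len(X1_u)), picked)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    clf1.fit(X1_l, y_l)
    clf2.fit(X2_l, y_l)
    return clf1, clf2
```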
Under the PAC-learning (Valiant, 1984) framework, Blum et al. (1998) showed that, for any initial weak predictors, co-training can boost their performance to arbitrarily high accuracy using only unlabeled data samples. Dasgupta et al. (2001) proved that the generalization error of a classifier from each modality is upper bounded by the disagreement rate of the classifiers from the two modalities. For a special case of co-training in which the hypothesis class is the class of linear separators, Balcan et al. (2005) proved that, with polynomially many unlabeled examples and a single labeled example, there exists a polynomial-time algorithm that efficiently learns a linear separator under proper assumptions.
However, the above analyses made the strong assumption that the two modalities do not correlate in making predictions, i.e., that the two modalities are conditionally independent given the class label. Abney (2002) showed that weak dependence can also guarantee the success of co-training, by generalizing the error bound in Dasgupta et al. (2001) under the weaker restriction that classifiers from different modalities are weakly dependent and non-trivial. Later, Balcan et al. (2004) relaxed the conditional independence to a much weaker expansion assumption, which is proven to be sufficient for iterative co-training to succeed given appropriately strong PAC-learning algorithms on each modality. They also showed that such an expansion assumption is necessary to some extent.
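In symbols, the conditional independence assumption referred to above requires, for modalities $X^{(1)}, X^{(2)}$ and label $Y$,
$$P\big(X^{(1)},X^{(2)}\mid Y\big)\;=\;P\big(X^{(1)}\mid Y\big)\,P\big(X^{(2)}\mid Y\big),$$
whereas the weak-dependence and expansion conditions of Abney (2002) and Balcan et al. (2004) only require much milder forms of dependence between the two modalities given the label.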
It is worth noting that all the above theoretical analyses of co-training rely on the crucial assumption that each modality alone is sufficient to correctly predict the label, which cannot be met in many real applications. Wang et al. (2013) proved that a large diversity between the two modalities can also lead to good co-training performance when neither modality is sufficient. Wang et al. (2017) further summarized the key issues for co-training and disagreement-based methods and provided theoretical analyses to tackle these issues, serving as a theoretical foundation for co-training and disagreement-based approaches.
The main idea of co-regularization style approaches is to directly minimize the disagreement between the predictions of different modalities. Rosenberg et al. (2007) provided the empirical Rademacher complexity for the function class of co-regularized least squares and then derived the corresponding generalization bound. Later, Sindhwani et al. (2008) constructed a novel Reproducing Kernel Hilbert Space (RKHS), whose reproducing kernel can be utilized to simplify the proof of the above Rademacher complexity results; furthermore, they gave more refined generalization bounds based on localized Rademacher complexity. Rosenberg et al. (2009) extended such analysis to the setting where more than two modalities are considered. Since information theory provides a natural language to describe the relationship between different modalities, Sridharan et al. (2008) attempted to theoretically understand co-regularization from the information-theoretic perspective. They showed that the excess error between the output hypothesis of co-regularization and the optimal classifier is bounded by a term governed by $\varepsilon_{\mathrm{info}}$, where $\varepsilon_{\mathrm{info}}<1$ measures the amount of distinct information provided by the two modalities.
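As a concrete illustration, the co-regularized least squares objective analyzed in this line of work takes, up to the exact choice of regularizers, the form
$$\min_{f_1\in\mathcal{H}_1,\,f_2\in\mathcal{H}_2}\;\sum_{i=1}^{\ell}\Big[\big(y_i-f_1(x_i^{(1)})\big)^2+\big(y_i-f_2(x_i^{(2)})\big)^2\Big]+\lambda_1\|f_1\|_{\mathcal{H}_1}^{2}+\lambda_2\|f_2\|_{\mathcal{H}_2}^{2}+\mu\sum_{j=1}^{u}\big(f_1(u_j^{(1)})-f_2(u_j^{(2)})\big)^2,$$
where the last sum runs over $u$ unlabeled examples and directly penalizes disagreement between the two modality-specific predictors.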
Canonical Correlation Analysis (CCA) (Hotelling, 1992) based algorithms can also be seen as co-regularization methods; they aim to compute a shared representation of both sets of variables by maximizing the correlations between the variables of the two sets. CCA and its variants, such as Sparse CCA (Witten et al., 2009), Kernel CCA (Akaho, 2006), and Cluster CCA (Chaudhuri et al., 2009), have been widely used in multi-modal learning. Therefore, existing theoretical analyses of CCA (Bach et al., 2003; Kuss et al., 2003) can provide theoretical justification for the performance of multi-modal CCA-based algorithms. Recently, a survey of multi-modal CCA (Guo et al., 2019) provided an overview of many representative CCA-based multi-modal learning approaches and also discussed their theoretical foundations.
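For reference, classical CCA seeks projection directions $w_x, w_y$ for the two modalities that maximize the correlation of the projected variables,
$$\max_{w_x,\,w_y}\;\frac{w_x^{\top}\Sigma_{xy}w_y}{\sqrt{w_x^{\top}\Sigma_{xx}w_x}\,\sqrt{w_y^{\top}\Sigma_{yy}w_y}},$$
where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the within-modality covariance matrices and $\Sigma_{xy}$ is the cross-covariance; the sparse and kernel variants above respectively add sparsity penalties on $w_x, w_y$ or replace the linear projections with functions in an RKHS.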
It is worth mentioning that all the above theoretical analyses of co-regularization are based on the assumption that different modalities can provide almost the same predictions. Unfortunately, such an assumption is hardly met in practice, since there typically exist divergences among different modalities.
Szedmak et al. (2007) characterized the generalization performance of a semi-supervised version of SVM-2K, which has been discussed in Section 2.1. Sun et al. (2010) proposed a sparse semi-supervised learning algorithm named sparse multi-view SVMs and provided a generalization error analysis. Sun et al. (2011) further considered manifold regularization and introduced multi-view Laplacian SVMs; they also derived the generalization error bound and the empirical Rademacher complexity for the proposed method. Recently, Sun et al. (2020) proposed a novel information-theoretic method, called Total Correlation Gain Maximization (TCGM), which maximizes the Total Correlation Gain (TCG) over the classifiers of all modalities. They showed that the optimal classifiers for such a TCG are equivalent to the Bayesian posterior classifiers given each modality, up to some permutation function, which theoretically guarantees the success of TCGM.
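The total correlation underlying the TCG is the standard information-theoretic quantity
$$\mathrm{TC}(Z_1,\dots,Z_M)\;=\;\sum_{m=1}^{M}H(Z_m)-H(Z_1,\dots,Z_M)\;=\;D_{\mathrm{KL}}\Big(p(z_1,\dots,z_M)\,\Big\|\,\prod_{m=1}^{M}p(z_m)\Big),$$
which TCGM maximizes over the outputs of the per-modality classifiers; the precise definition of the "gain" is given in Sun et al. (2020).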
Self-supervised learning adopts self-defined signals to learn representations from unlabeled data. A large number of prevailing self-supervised approaches naturally leverage the properties of multi-view data, where the input (e.g., the original image) and the self-supervised signal (e.g., the image after data augmentation) can be treated as two views of the same data. In practice, incorporating views from different modalities as supervision into such methods leads to remarkable successes (Zhang et al., 2020; Desai et al., 2021; Radford et al., 2021) in multi-modal self-supervised learning. Therefore, seminal works that established theoretical foundations of self-supervised learning in the general multi-view setting are conducive to our understanding of self-supervised multi-modal learning. The methods and techniques in this literature may be adapted or extended to more realistic multi-modal settings.
Arora et al. (2019) provided a theoretical guarantee for contrastive learning, which aims to minimize $L(\phi)=\mathbb{E}\big[\ell\big(\phi(X)^{\top}\big(\phi(X^{+})-\phi(X^{-})\big)\big)\big]$ for a suitable loss $\ell$ (e.g., the logistic loss), where $X^{+}$ and $X^{-}$ are the corresponding positive and negative views for data $X$. Under the class conditional independence (CI) assumption, i.e., $X^{+}\perp X^{-}$ conditioned on the class, they showed that contrastive learning yields good representations $\phi$ for downstream tasks. Lee et al. (2021) considered reconstruction-based self-supervised methods, where two views $(X_1, X_2)$ are available for each data point and the learning objective is to minimize the reconstruction error of $X_2$ based on a function of $X_1$: $L(\phi)=\mathbb{E}\,\|X_2-\phi(X_1)\|^{2}$. They established a good performance guarantee for the learned $\phi$ on downstream tasks under a relaxed approximate conditional independence (ACI) assumption of the views given the label.
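The following is a minimal sketch of these two objectives in PyTorch, instantiating the contrastive loss with a single negative and the logistic loss; the toy encoder/predictor architectures and dimensions are illustrative assumptions, not those of the original papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d1, d2, k = 128, 64, 32          # toy view dimensions and representation size
phi = nn.Linear(d1, k)           # encoder for the contrastive objective
psi = nn.Linear(d1, d2)          # predictor for the reconstruction objective

def contrastive_loss(x, x_pos, x_neg):
    # logistic contrastive loss: E[log(1 + exp(phi(x)^T phi(x^-) - phi(x)^T phi(x^+)))]
    z, zp, zn = phi(x), phi(x_pos), phi(x_neg)
    return F.softplus((z * zn).sum(-1) - (z * zp).sum(-1)).mean()

def reconstruction_loss(x1, x2):
    # reconstruction-based objective: E || X2 - psi(X1) ||^2
    return ((x2 - psi(x1)) ** 2).sum(-1).mean()

# toy usage on random "views"
x, x_pos, x_neg = torch.randn(16, d1), torch.randn(16, d1), torch.randn(16, d1)
x2 = torch.randn(16, d2)
print(contrastive_loss(x, x_pos, x_neg).item(), reconstruction_loss(x, x2).item())
```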
In work concurrent with Lee et al. (2021), Tosh et al. (2021b) showed guarantees for contrastive objectives under a multi-view redundancy assumption, which is analogous to ACI. Tosh et al. (2021a) also studied a contrastive learning problem similar to that of Tosh et al. (2021b). Specifically, they considered the topic modeling setting and showed that when the two views of a data point correspond to random partitions of a document, contrastive learning recovers information related to the underlying topics that generated the document. Arora et al. (2019) and Lee et al. (2021) both assume strong independence between the views conditioned on the downstream task. Tsai et al. (2020) and Tosh et al. (2021b) considered a weaker version of this assumption, where they only assume independence between the downstream task and one view conditioned on the other view, while Tsai et al. (2020) mainly focuses on the mutual-information perspective. Meanwhile, motivated by the information bottleneck principle (Tishby et al., 2000), Federici et al. (2020) proposed an unsupervised multi-modal method and provided a rigorous theoretical analysis of its application, but they also relied on the strong assumption that each modality provides the same task-relevant information.
Xu et al. (2015) theoretically showed that the proposed Iteratively Reweighted Residuals (IRR) technique for their multi-view intact space algorithm leads to a locally optimal solution under mild assumptions. Arora et al. (2016) developed stochastic approximation approaches for the Partial Least Squares (PLS) problem with two views in the unsupervised setting and provided iteration complexity bounds for the proposed methods. Mou et al. (2017) formulated a novel multi-view PLS framework, proposed algorithms to optimize the objective function, and provided a rough convergence analysis of the proposed method. Seminal works (Fukumizu et al., 2007; Hardoon et al., 2009; Cai et al., 2011) provided convergence analyses for kernel CCA.
In recent years, great efforts have been made to explore the learning dynamics of neural networks. Among these, a line of work studied how multi-view or multi-modal features are learned by different algorithms and architectures in deep learning. Allen-Zhu et al. (2020) developed a theory to explain how ensembling and knowledge distillation work when the data has a "multi-view" structure. Wen et al. (2021) studied how contrastive learning with two views generated by data augmentation learns feature representations, and Wen et al. (2022) showed the mechanism by which non-contrastive self-supervised learning such as SimSiam (Chen et al., 2021) can still learn competitive representations. Specific to the multi-modal setting, Huang et al. (2022) aimed to demystify the performance drop in naive multi-modal joint training, i.e., that the best uni-modal network outperforms the jointly trained multi-modal network, by understanding the feature learning process. Remarkably, Huang et al. (2022) took the training process of neural networks into consideration, which is the first theoretical treatment of the degenerating aspect of multi-modal learning in neural networks. Du et al. (2021) also studied this negative aspect of multi-modal learning, proving that with more modalities, some hard-to-learn features cannot be learned. While they attempted to analyze the feature learning process, they did not fully address the optimization challenges encountered during this process.
As per our comprehensive review above, it is evident that within the current framework of multi-modal learning theory, there exist two crucial assumptions that play a vital role in shaping the overall understanding: (i) each modality, by itself, is capable of providing sufficient information for the specific task, i.e., the target functions from all modalities, as well as from the combined modalities, maintain label consistency on every example; (ii) the modalities are conditionally independent of each other given the class label.
However, in practice, different modalities are of varying importance under specific circumstances (Ngiam et al., 2011; Liu et al., 2018; Gat et al., 2020). It is common that information from one single modality is incomplete for building a good classifier (Yang et al., 2015; Gupta et al., 2016; Liu et al., 2018). For instance, it is difficult to build a classifier using only a weak modality with limited labeled data, e.g., the depth modality in RGB-D images for the object detection task. Moreover, in real-world applications, different modalities (e.g., text, image, audio) are often closely related and contain information that can be used to infer the other modalities (Reed et al., 2016; Wu et al., 2018). Specifically, in image-text tasks, the text description can provide information about the objects and actions present in an image, while the image can provide information about the context and background of the scene.
Consequently, to continue to advance the theoretical foundations of deep multi-modal learning, it is necessary to re-examine the current assumptions underlying the modeling of relationships between modalities and tasks, as well as the capture of dependencies among different modalities. This would involve relaxing the assumptions of sufficiency and conditional independence, in order to more accurately reflect the intricacies of real-world multi-modal data.
We now discuss several important future directions that could lead to new and exciting developments in the field of multi-modal learning theory.
Despite some theoretical exploration of multi-modal learning from various perspectives, there has been relatively limited theoretical analysis of optimization-related aspects specifically. Theoretically understanding the optimization dynamics of multi-modal learning in modern deep learning can be quite challenging due to the non-convex optimization landscapes and the complexity of multi-modal data. Moreover, it is important to understand the role of the training dynamics in the generalization of deep multi-modal learning, which is beyond the scope of convex optimization and conventional statistical learning. Recent developments in analysis techniques for deep learning theory, such as those proposed by Allen-Zhu et al. (2020), Li et al. (2021), and Wen et al. (2021), have aimed to address optimization dynamics and non-convex landscapes. Building on this research, it would be worthwhile to further investigate multi-modal learning from an optimization perspective.
Among the recent advances of deep multi-modal learning, self-supervised approaches such as contrastive learning have gained significant attention and success (Radford et al., 2021; Jia et al., 2021). These methods have been shown to be effective in learning representations for multiple modalities in a unified manner by leveraging the inherent relationships between them, which may be difficult or impossible to annotate. However, the underlying mechanisms behind how they learn to exploit the relationships between different modalities remain unclear from a theoretical perspective. Future theoretical studies of multi-modal learning on this topic are in high demand and have the potential to provide practical insights for improving performance in various multi-modal tasks.
Foundation models (Bommasani et al., 2021), pre-trained on a large and diverse pool of data and able to transfer the learned knowledge to a wide range of tasks, have revolutionized the way models are built in various areas (Devlin et al., 2018; Brown et al., 2020; Chen et al., 2020b), including multi-modal learning (Radford et al., 2021; Ramesh et al., 2022). Theoretical understanding and mathematical tools for foundation models are generally lacking, and some recent attempts in this direction mostly focus on extracting features from one modality (Arora et al., 2019; HaoChen et al., 2021; Wei et al., 2021; Kumar et al., 2022). Can the existing analyses be extended beyond the single modality? How do foundation models learn useful representations from different modalities? As multi-modal learning is becoming an increasingly popular paradigm in modern machine learning, we believe the community will significantly benefit from more theoretical insights towards addressing these questions.
In this survey, we have endeavored to comprehensively review the theoretical analysis of multi-modal learning from the generalization and optimization perspectives. We organized the literature into technical categories to provide a systematic reference of mathematical tools for future theory research. Additionally, we discussed the challenges faced in existing theoretical work and highlighted several promising areas for future research. The goal of this survey is to provide a comprehensive understanding of the theoretical foundation of multi-modal learning and to inspire further exploration in this field. With the increasing popularity of multi-modal learning, it is vital for the community to continue to deepen its understanding of the theoretical foundations in order to unlock the full potential of this powerful approach. In conclusion, we believe that multi-modal learning has great potential to improve the performance of various machine learning tasks, and that future theoretical studies will be essential to fully realize this potential.