LU Jianguo, ZHENG Qingfang
(1. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518055, China; 2. ZTE Corporation, Shenzhen 518057, China)
Abstract: Video conferencing systems face a dilemma between smooth streaming and decent visual quality because traditional video compression algorithms fail to produce bitstreams low enough for bandwidth-constrained networks. An ultra-lightweight face-animation-based method that enables a better video conferencing experience is proposed in this paper. The proposed method compresses high-quality upper-body videos at ultra-low bitrates and runs efficiently on mobile devices without high-end graphics processing units (GPUs). Moreover, a visual quality evaluation algorithm is used to avoid image degradation caused by extreme face poses and/or expressions, and a full-resolution image composition algorithm is applied to reduce unnaturalness, which together guarantee the user experience. Experiments show that the proposed method is efficient and can generate high-quality videos at ultra-low bitrates.
Keywords: talking heads; face animation; video conferencing; generative adversarial network
During the COVID-19 pandemic, video conferencing systems have become indispensable tools for individuals to keep in touch with friends and for enterprises and organizations to connect with customers. Inside these systems, video compression technologies play a critical role in the efficient representation and transportation of video data. Great progress has been achieved in past years in representing high-fidelity videos with low bitrates; e.g., High Efficiency Video Coding (HEVC) [1] was designed with the goal of allowing video content to reach a data compression ratio of up to 1 000:1. However, video conferencing systems still face the dilemma between smooth streaming and decent visual quality because current video compression technologies fail to produce bitstreams low enough for bandwidth-constrained networks with a large number of concurrent users.
Recently, some novel talking-head video compression methods [2–5] based on face animation have been proposed, which can significantly cut down the bandwidth usage of video conferences. These face animation methods usually consist of two parts: an encoder and a decoder. The encoder is a motion extractor that derives a compact motion feature representation from the driving video frame, and the decoder is an image generator that synthesizes photorealistic images according to the motion features. Due to its extreme compactness, the extracted face feature can be used to reduce the bandwidth of video conferences and hence improve user experience in bandwidth-constrained networks. However, most talking-head video compression methods are too complicated to run in real time without the support of high-end graphics processing units (GPUs), let alone on mobile devices. For example, the model size of the First Order Motion Model (FOMM) [6] is 355 MB and its computation complexity is 121 G multiply-accumulate operations (MACs). Aiming at practical applications, we propose an ultra-lightweight motion extractor to obtain effective motion representations from the driving video and an animation generator to synthesize high-quality face videos accordingly.
We find that the face animation method may sometimes fail, usually because of extreme head poses and/or facial expressions. To tackle this problem, we propose an efficient visual quality evaluation method to reject synthesized images that are visually unacceptable. We also notice that displaying only the face without its surrounding context looks unnatural and weird to users. To cope with this, we composite full-resolution images by stitching face regions with other body parts and backgrounds. These two mechanisms effectively prevent user experience degradation during a conference.
Our main contributions are as follows:
• An ultra-lightweight motion extraction algorithm is proposed to derive effective facial motion features from driving videos, which is efficient enough to run on mobile devices without high-end GPUs.
• An efficient visual quality evaluation algorithm is proposed to select visually acceptable generated images, and an image composition algorithm is proposed to generate full-resolution videos, which together ensure a consistent and natural user experience during conferences.
• A practical video conferencing system is built to integrate the best parts of face-animation-based methods and traditional video-compression-based methods, which significantly reduces uplink bandwidth usage and ensures decent user experience even when the network bandwidth is constrained.
Due to space limitations, we only review previous works on face animation and deep video compression that are most related to ours.
Face animation is an image-to-image translation task that transfers the talking-head motion of a person in one image to persons in other images. The former image is called the driving image, while the latter is called the source image. Face animation has become a popular topic since the generative adversarial network (GAN) [7] was proposed by GOODFELLOW et al. Most recently published face animation methods can synthesize photo-realistic images with the help of GANs.
Some works [8–12] were proposed to solve the face animation task with the prior knowledge of the 3D Morphable Model (3DMM) [13]. However, the traditional 3D-based works [8–10] failed to render details of talking heads, such as hair, teeth and accessories. Ref. [11] allowed fine-scale manipulation of any facial input image into a new expression while preserving its identity with the help of a conditional GAN. To improve the realism of the rendering, Ref. [12] designed a novel space-time GAN to predict photorealistic video frames directly from the modified 3DMM.
In contrast to 3D-based models, 2D-based models synthesize talking heads directly without any prior knowledge of the 3DMM. They can be classified into warping-based models and warping-free models.
Warping-free models [14–19] directly synthesize images without any warping. Few-shot vid2vid [16] learned to transform landmark positions into realistic-looking personalized photographs with the help of meta-learning. Ref. [19] decomposed a person's appearance into a pose-dependent coarse image and a pose-independent texture image. LI-Net [20] decoupled the face landmark image into pose and expression features and reenacted those attributes separately to generate identity-preserving faces with accurate expressions and poses.
Warping-based methods [21–25] predicted dense motion fields to warp the feature maps extracted from the source images and inpainted the warped feature maps to generate photorealistic images. X2Face [22] used an encoder-decoder architecture to learn a latent embedding that encodes pose and expression and to recover the dense motion fields from it. Many works attempted to predict the dense motion field from sparse object keypoints. The key to those methods is how to represent motions with sparse object keypoints. Monkey-Net [23] was proposed to learn pure keypoints to describe motions in an unsupervised manner. Although it cannot describe subtle motions, Monkey-Net provided a strong baseline for further improvements. FOMM [6] represented sparse motion with keypoints along with local affine transformations. Motion representations for articulated animation (MRAA) [24] defined the motion with regions rather than keypoints, using motion estimation based on principal component analysis (PCA) to describe locations, shapes and poses. The thin-plate spline (TPS) motion model [25] estimated thin-plate spline motion to produce a more flexible optical flow. Ref. [5] extended the baseline to 3D optical flows to produce 3D deformations. The above-mentioned methods extract compact motion representations, which show great potential in lowering the bitrate of video conferencing.
For decades, researchers have made great efforts to transmit higher-quality videos at lower bitrates. Recently, several approaches based on deep learning have been explored.
For general-purpose video compression, some works [26–27] attempted to reduce bandwidth by striking a balance between the cost of transferring the region of interest (ROI) and that of the background. Compared with traditional codecs, such methods can achieve better visual quality at the same bitrate. Other works [28–29] focused on enhancing the visual quality of low-bitrate videos by image super-resolution and image enhancement.
Great progress has also been achieved in the compression of talking-head videos. In Ref. [30], the encoder detected and transmitted keypoints representing the body pose and the face mesh information, and the receiver displayed the motion in the form of puppets. However, this method failed to produce photorealistic images. Inspired by the promising results achieved by face animation models, many works demonstrated the effectiveness of video compression based on face animation. VSBNet [3] reconstructed original frames from face landmarks at a low bitrate of around 1 kB/s. Ref. [5] proposed a neural talking-head video synthesis model and set up a video conferencing system that achieves the same visual quality as the commercial H.264 standard with only one-tenth of the bandwidth. Ref. [2] introduced an adaptive intra-refresh scheme to address the problem that reconstruction quality may rapidly degrade due to the loss of temporal correlation as frames get farther away from the initial one. Ref. [4] evaluated the advantages and disadvantages of several deep generative adversarial approaches and designed a mobile-compatible architecture that can run at 19 f/s on an iPhone 8. However, those methods can hardly run in real time without the support of high-end GPUs. What is more, they could only generate near-frontal faces, which look unnatural and weird when faces are not near-frontal. In this paper, we specifically focus on improving the efficiency and visual quality of video compression based on face animation.
The overall pipeline of our video conference system is shown in Fig. 1. Each user provides an avatar image to the system and uses its animation during a conference to ensure privacy and an elegant presence. When the system starts running, videos of users are captured and the face region in each video frame is cropped out by the face detection algorithm. Face images are then encoded by the keypoint detector and represented as the keypoints described in Section 3.2. Before the encoded data are sent out, the visual quality of the face image that will be reconstructed by a decoder according to these keypoints is evaluated to prevent unnatural results. It is highlighted here that, for the sake of efficiency, the visual quality evaluation method in Section 3.3 requires no actual reconstruction of the face image but executes on the encoded data.
Upon receiving the encoded keypoint data from the sender, the conference server calls the image generator to synthesize the face image animated from the keypoints, as described in Section 3.2. The decoded face image replaces the face region in the avatar image by our method in Section 3.4 to create a full-resolution video frame, which is then encoded by H.264 or HEVC and sent to the receiver. The receiver simply decodes the video stream and displays it on the screen, which can usually take advantage of the hardware accelerator in the device's chip.
▲Figure 1. The proposed video conference system consists of three parts: the sender on mobile devices, the video generator on servers, and the receiver on mobile devices. In the encoder part, the motion encoder extracts keypoints from the driving images. The feature-based image quality evaluation filters out unnatural images. The decoder synthesizes images from the keypoints and reconstructs full-resolution images, which are encoded by H.264 or H.265 and sent to the receiver. The receiver decodes the video stream and shows it on the phone screen.
With the prevalence of mobile phones, the demand for running video conferencing on mobile devices is growing. In most commercial video conference systems, mobile devices account for a significant portion of all terminals. For better compatibility with existing commercial video conference systems, our system and algorithms are intentionally designed to make the sender/receiver modules deployable on mobile devices and to keep their computational burden to a minimum, thus reducing power consumption and extending the working time of mobile devices.
Given a source image S of the target person, a driving video can be denoted as {D1, D2, D3, …, DN}, where Di is the i-th frame in the sequence and N is the total number of frames in the video. The output images can be denoted as {O1, O2, O3, …, ON}, where Oi is the i-th frame of the output sequence. The output Oi shares the same identity as S and the same face motions as Di. We adopt a face animation model similar to FOMM, which consists of a keypoint detector K (encoder) and a generator G (decoder). First, face landmarks are estimated from S and Di separately by K, and their locations serve as the sparse motion information. Second, dense motion fields and occlusion maps are predicted by G. Finally, G warps the feature map extracted from S with the dense motion fields, and the warped feature map is masked by the occlusion maps to generate the output image Oi. Following the idea of FOMM, we extract 10 keypoints and their corresponding Jacobian matrices from the face image.
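To make the data flow concrete, the following minimal sketch shows how the encoder and decoder interact at inference time. K and G are hypothetical callables standing in for the keypoint detector and generator described above; this is an illustration of the interface implied by the text, not the authors' implementation.

```python
# Sketch of the animation pipeline: K extracts sparse motion (keypoints and
# Jacobians), and G warps and inpaints the source image accordingly.
import torch

@torch.no_grad()
def animate(source, driving_frames, K, G):
    """source: 1x3xHxW tensor; driving_frames: iterable of 1x3xHxW tensors."""
    kp_source = K(source)                  # keypoints + Jacobians of the source image
    outputs = []
    for frame in driving_frames:
        kp_driving = K(frame)              # sparse motion of the current driving frame
        # G predicts the dense motion field and occlusion maps internally,
        # warps the source feature map and inpaints the occluded regions.
        outputs.append(G(source, kp_source, kp_driving))
    return outputs
```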
We design our model to be lightweight while still able to generate images with excellent visual quality. For the decoder, we adopt the same architecture as the generator in FOMM but cut the number of channels in half. We denote the simplified generator as Gsim. For the encoder, we replace the hourglass network in FOMM, which brings a high computational cost, with a greatly simplified version of MobileNetV2 [31]. However, it is very difficult to train the proposed model from scratch because the training process often fails to converge. We therefore adopt the training strategy described below to solve this problem.
1) Step 1: model distillation. We use the original encoder Kfomm in FOMM as the teacher model and our proposed encoder Kpro as the student model. The loss function consists of the distillation loss Ldis and the equivariance loss Leq, and can be written as Eq. (1).
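A plausible form of Eq. (1), consistent with the loss descriptions below and with FOMM-style distillation and equivariance objectives (the choice of the L1 norm and of unit weights is an assumption), is

$$L(I) = \underbrace{\left\| K_{pro}(I) - K_{fomm}(I) \right\|_1}_{L_{dis}} + \underbrace{\left\| K_{pro}\big(T(I)\big) - T\big(K_{pro}(I)\big) \right\|_1}_{L_{eq}}, \tag{1}$$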
where I is the training sample and T is a thin-plate spline deformation. The distillation loss ensures that the student encoder extracts the same motion representation as the teacher encoder, and the equivariance loss ensures the consistency of the motion representation when random geometric transformations are applied to the images.
2) Step 2: iterative model pruning and distillation. Since the encoder has to extract a motion representation from every video frame, it should be as lightweight as possible to reduce computational costs. In our attempt to further simplify the encoder, we find that most of the complexity comes from the last several convolutional layers. Therefore, we drop the last convolutional layer in the encoder model and retrain it following Step 1. This step can be repeated several times until we obtain Kbest, which strikes a balance between model complexity and accuracy (see the sketch after these training steps).
3) Step 3: generator fine-tuning. Because of the simplification made to the generator, we train the simplified generator Gsim along with the keypoint detector Kfomm of the original FOMM to obtain a good initialization of Gsim.
4) Step 4: overall fine-tuning. Once Kbest and Gsim are determined, we fine-tune them jointly in an unsupervised manner. Finally, Kbest and Gsim act as the encoder and the decoder in our system, respectively.
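The iterative pruning procedure in Step 2 can be sketched as follows. drop_last_conv, distill (the Step 1 training routine) and motion_error are hypothetical callables supplied by the caller, and the stopping criterion is an assumption rather than the exact rule used in our experiments.

```python
# Iterative prune-and-distill loop (Step 2): keep removing the last conv layer
# and re-distilling until the motion-estimation error degrades too much.
def prune_and_distill(student, teacher, drop_last_conv, distill, motion_error,
                      max_rounds=3, tolerance=0.05):
    best, best_err = student, motion_error(student, teacher)
    for _ in range(max_rounds):
        candidate = drop_last_conv(best)        # remove the last convolutional layer
        distill(candidate, teacher)             # retrain with the Step 1 losses
        err = motion_error(candidate, teacher)
        if err > best_err * (1.0 + tolerance):  # accuracy drop too large: stop pruning
            break
        best, best_err = candidate, err
    return best                                 # K_best: complexity/accuracy trade-off
```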
Although video conferences based on face animation can achieve a very high video compression rate, the visual quality of a reconstructed image may sometimes degrade in the following two cases (Fig. 2). First, due to current algorithmic limitations, most face animation models may generate inaccurate expressions and visual artifacts on faces with large poses and/or extreme expressions. Second, as the frame distance increases, the temporal correlation weakens and hence the quality of the generated video deteriorates. This phenomenon becomes particularly obvious when faces are occluded. The degraded images bring an inconsistent experience to users. To alleviate the problem, Ref. [2] introduced an adaptive intra-refresh scheme using multiple source frames: before sending the features to the decoder, the sender reconstructs the image first and evaluates it to avoid degraded results. However, this scheme not only incurs large computational costs, which make it impossible to run on mobile devices, but also leads to significant time delay at the receiving end. What is more, frequent scene switching also requires the system to send source frames frequently, making the system lose its advantage of reducing video bandwidth.
We propose an adaptive degraded-frame filtering method based on an efficient image quality evaluation algorithm that operates directly on the extracted features. We find that when a large head pose and/or an extreme facial expression occurs, most of the regions in the generated image are inpainted by the generator, which degrades the image quality. The difference between the driving image and the source image can be measured by analyzing the dense motion field, which is predicted from the sparse motion field in our setting. Therefore, instead of using the decoder to synthesize the generated image, we evaluate image quality based on the relative motion. The loss L2 in the algorithm can be formulated as follows.
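A plausible formulation of this loss, consistent with the variable definitions in the next paragraph (the specific norms are an assumption; N denotes the number of keypoints, 10 in our setting), is

$$L_2 = \alpha \sum_{i=1}^{N} \left\| v_{1i} - v_{2i} \right\|_2 + \beta \sum_{i=1}^{N} \left\| J_{1i} - J_{2i} \right\|_F,$$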
where v1i is the value of the i-th keypoint in the first frame, v2i is the value of the i-th keypoint in the second frame, J1i is the Jacobian of the i-th keypoint in the first frame, J2i is the Jacobian of the i-th keypoint in the second frame, and the hyperparameters α and β control the weight of each part. In our experiments, we set the hyperparameters to 2 and 1, respectively.
▲Figure 2. Examples of face animation failure. The first row shows a result caused by a large pose; the face area becomes blurred and there are some artifacts on the woman's hair. The second row shows a degraded image caused by weak temporal correlation; the reconstructed image looks terrible and weird.
In the proposed scheme, the balance between image quality and robustness is controlled by a threshold τ. Although the identities of the people in the driving images and the source image are the same, the two images may look different. For better visual quality, we adopt a relative motion transfer method, as described in Ref. [6]. We first find a driving image that has a pose similar to the source image, which is called the initial image DI. Then, we extract keypoints from the source image S and the initial image DI, denoted as Ks and KI. The source keypoints are sent to the receiver. For every frame Dt, we estimate keypoints Kt from the frame and compare the relative motion between Kt and Ks with that between KI and Ks. If the former is smaller, we set this driving frame as the new initial image. Finally, we compare the relative motion between Kt and KI with the threshold τ. If the relative motion is smaller than τ, it is suitable for robust image generation and is sent to the server. Otherwise, a default motion is sent to avoid freezing in video streams. The default keypoints can be the motions of some natural expressions, such as blinking and smiling. In this way, degraded frames are replaced by frames of natural expressions. Compared with the method proposed in Ref. [2], our method greatly reduces the computation cost at the sender and the delay at the receiver.
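The sender-side decision logic can be sketched as follows. Here dist() stands for the relative-motion distance (the loss L2 defined above), default_kp is a stored natural-expression motion such as blinking or smiling, and tau is the threshold τ; all names are illustrative rather than the authors' implementation.

```python
# Per-frame filtering at the sender: update the initial frame when the current
# frame is closer to the source, then send either the relative motion or a
# default natural-expression motion depending on the threshold tau.
def select_payload(kp_t, kp_initial, kp_source, default_kp, tau, dist):
    # Adopt the current frame as the new initial frame if it is closer to the source.
    if dist(kp_t, kp_source) < dist(kp_initial, kp_source):
        kp_initial = kp_t
    # Send the relative motion only when it is small enough for robust synthesis;
    # otherwise fall back to a default motion to avoid freezing the video stream.
    payload = kp_t if dist(kp_t, kp_initial) < tau else default_kp
    return payload, kp_initial
```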
The face animation described above cannot be directly used in video conferences for two reasons. First, face animation cannot synthesize face images at full video resolution (at least 1 280×720) because the computational complexity grows rapidly with the image size. Second, displaying only the facial region on the screen without other body parts such as the neck and shoulders looks unnatural and weird. To make our face animation method applicable, instead of generating full-resolution images, we generate a facial region with a size of no more than 384×384 and stitch it with the other body parts and background regions in the source frame to form a full-resolution image. The problem is that there will be a sharp blocky artifact between the head region and the body region because the head region moves while the body region may remain stationary. We find that the keypoints spread over the talking-head area and each keypoint is responsible for the local transformation of its neighborhood. To reduce the artifact, we fix the keypoints related to the shoulder part. As a result, the dense motion field predicted by the generator stays stationary near the shoulder region and transitions smoothly from the head region to the shoulder region, which makes the composite image look more natural. Example images are shown in Fig. 3 for comparison.
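A simplified view of the composition step is sketched below: keypoints assigned to the shoulder region keep their source-frame values so that the predicted dense motion stays stationary there, and the generated 384×384 face crop is pasted back into the full-resolution avatar frame. The keypoint dictionary layout, the shoulder index set and the helper names are assumptions made for illustration.

```python
import numpy as np

SHOULDER_IDS = (8, 9)   # hypothetical indices of the shoulder-related keypoints

def fix_shoulder_keypoints(kp_driving, kp_source, shoulder_ids=SHOULDER_IDS):
    # kp_* are dicts such as {"value": Nx2 array, "jacobian": Nx2x2 array}.
    kp = {k: v.copy() for k, v in kp_driving.items()}
    for i in shoulder_ids:
        kp["value"][i] = kp_source["value"][i]        # pin the shoulder locations
        kp["jacobian"][i] = kp_source["jacobian"][i]  # and their local transforms
    return kp

def composite(avatar, face_crop, box):
    x, y, w, h = box                      # face bounding box inside the avatar image
    out = avatar.copy()
    out[y:y + h, x:x + w] = face_crop     # face_crop must be resized to (h, w) beforehand
    return out
```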
1) Datasets. We train and evaluate our face animation model on the VoxCeleb dataset and an in-house dataset. VoxCeleb [32] is a dataset of interview videos of different celebrities. According to the bounding boxes of faces, we crop the videos and resize them to 256×256 for a fair comparison with the original FOMM and to 384×384 for the generation of high-resolution images. The in-house dataset consists of 4 124 videos of Chinese people collected from the Internet and is used to reduce the bias towards Western people. We fine-tune our model on the in-house dataset to make it better adapted to Chinese users.
2) Evaluation metrics. We evaluate the models using the L1 error, average keypoint distance (AKD) and average Euclidean distance (AED). The L1 error is the mean absolute difference between pixel values in the reconstructed images and the ground-truth images, which measures the reconstruction accuracy. AKD and AED stand for semantic consistency. AKD is the average distance between the face landmarks extracted from the ground-truth images and from the reconstructed images by the face landmark detector [33], which measures the pose difference between the two images. AED measures identity preservation and is the L2 distance between the corresponding features extracted by a pre-trained re-identification network [34]. A brief sketch of these metrics is given after item 3) below.
3) Hardware. In our video conference system, we implement a conferencing app on a ZTE A30 Ultra mobile phone with a Snapdragon 888 system on a chip (SoC) and the conferencing server software on a computer with an Nvidia Tesla V100 GPU.
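For reference, the three metrics can be sketched as below, with landmark_fn and id_fn standing in for the face landmark detector [33] and the pre-trained re-identification network [34]; both interfaces are assumptions, not the cited implementations.

```python
import numpy as np

def l1_error(pred, gt):
    # Mean absolute pixel difference between reconstruction and ground truth.
    return np.abs(pred.astype(np.float32) - gt.astype(np.float32)).mean()

def akd(pred, gt, landmark_fn):
    # Average Euclidean distance between corresponding face landmarks.
    lm_pred, lm_gt = landmark_fn(pred), landmark_fn(gt)   # (K, 2) landmark arrays
    return np.linalg.norm(lm_pred - lm_gt, axis=1).mean()

def aed(pred, gt, id_fn):
    # L2 distance between identity embeddings of the two images.
    return float(np.linalg.norm(id_fn(pred) - id_fn(gt)))
```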
▲Figure 3. Qualitative comparisons with state-of-the-art methods. The first three rows are images from the VoxCeleb dataset and the following four rows are images from our in-house dataset. Our method produces competitive results.
1) Efficiency of the proposed face animation algorithm
First, we compare our encoder, i.e., the face motion extractor, with that of the original FOMM. We convert the encoder to a mobile neural network (MNN) [35] model and calculate the model size. As listed in Table 1, our encoder model is only 600 kB in size with a theoretical computation complexity of 14.62 M MACs, both of which are about 1% of FOMM's. Our encoder processes each frame in 3.5 ms on the Snapdragon 888, which is 16.3 times faster than FOMM.
Second, we compare our decoder, i.e., the generator that synthesizes a 384×384-resolution face image, also with FOMM. For the generator, we convert the model to a TensorRT [36] model and calculate the model size. As listed in Table 1, our decoder model is 81.77 MB in size with a theoretical computation complexity of 31.42 G MACs, which are 26.0% and 27.3% of FOMM's, respectively. Our decoder runs in 5 ms on a Tesla V100, which is 4 times faster than FOMM.
2) Effectiveness of the proposed face animation algorithm
We compare the visual quality of face images generated by our method with that of other face animation methods. For quantitative comparison, we evaluate our model against existing studies on the VoxCeleb dataset for an image generation task. For a fair comparison, we generate images with a resolution of 256×256. The first frame of each test video is set as the source image, while the subsequent frames are set as the driving images. Evaluation metrics are computed for every frame, and our result is the mean value over all frames. The results are summarized in Table 2, which clearly shows that the proposed method outperforms X2Face and Monkey-Net. Compared with FOMM, our method generates competitive results, even though our model is much lighter. For a qualitative comparison, we list some example images in Fig. 3.
▼Table 1. Efficiency comparison between our face animation method and FOMM
▼Table 2. Visual quality comparison among different face animation methods on the VoxCeleb dataset
The avatar images provided by users are usually not face-only but include other upper-body parts. When head regions in the avatar images are cropped and animated by our method, they should be stitched back into the original images to form new images with predefined resolutions, e.g., 1 280×720. Special treatment should be given to the area where the head region and the body region connect because these regions move non-rigidly and disproportionately. As shown in the top two rows of Fig. 4, simply replacing the head region in an avatar image with a newly animated head region results in visual discontinuities. As a comparison, the bottom two rows show the results of the proposed method described in Section 3.4. Our method successfully eliminates the discontinuities and makes the whole images visually natural.
As described in Section 3.1, our video conference system is comprised of server software running on a cloud server and application software, with the sender and receiver modules, running on the mobile phone. The most important difference between our sender module and those in other video conference systems is that we encode captured videos into compact keypoint motion information rather than traditional H.264 or HEVC streams, which greatly cuts down uplink bandwidth usage. For example, when encoded in H.264, 720p conference videos typically have bitrates between 1 Mbit/s and 2 Mbit/s. By comparison, each video frame is encoded by our sender module as the information of 10 keypoints, each of which includes a position (2 floating-point values) and a Jacobian matrix (4 floating-point values). We empirically determine that the half-precision floating-point format (FP16) is sufficient for data representation, which leads to a bitrate of 6×16×10×30 = 28.8 kbit/s, less than 3% of that of H.264 encoding. We note that the keypoint information can be further compressed by an entropy encoder to save even more bandwidth.
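The payload arithmetic above can be checked with a short script; the packing layout shown here is illustrative only.

```python
import numpy as np

# 10 keypoints per frame, each carrying a 2-D position and a 2x2 Jacobian
# (6 values), stored as FP16 and sent at 30 frame/s.
frame_payload = np.zeros((10, 6), dtype=np.float16)
bits_per_frame = frame_payload.size * 16              # 10 * 6 * 16 = 960 bit
bitrate_kbps = bits_per_frame * 30 / 1000             # 28.8 kbit/s before entropy coding
print(f"{bitrate_kbps} kbit/s")
```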
▲Figure 4. Results of full-resolution image generation. The first row shows images generated by simply replacing the head region in the source image with the newly animated head region. The third row shows the results of our method in Section 3.4. In the second and fourth rows, the connections between head regions and body regions are zoomed in for clearer comparison.
In our real-world user studies, reducing the uplink bitrate greatly improves the conference user experience. For one thing, since wireless bandwidth is not evenly allocated between uplink and downlink data transportation, a smaller uplink bitrate results in less congestion and faster upward transmission. For another, more aggressive schemes can be applied when forward error correction (FEC) is used to tackle data loss in transmission, leading to less data retransmission, which brings lower remote interaction latency and more real-time engagement.
The server software in our system runs on a cloud server with Nvidia GPUs because the image generator in face animation is much more computationally expensive than the keypoint extractor, as demonstrated in Section 4.1. Although our simplified image generator can be deployed on some flagship mobile phones with powerful GPUs, we choose server-side deployment to keep our application software lightweight enough to run on most mobile phones and to consume less power to extend working time, which is also critical to user experience.
In this paper, we propose a face-animation-based method to greatly reduce bandwidth usage in video conferences, compressing face video frames by using only 60 FP16 values to represent the face motion. We design an ultra-lightweight face motion extraction algorithm that runs on mobile devices, as well as an efficient visual quality evaluation algorithm and a full-resolution image composition algorithm to ensure a consistent and natural user experience. We also build a practical system that enables users to communicate using animated avatars. Experimental results demonstrate the efficiency and effectiveness of our methods and their superiority over previous studies. However, one limitation of our current work is that our method is only applicable to upper-body videos. A full-body animation method will be our next step to cover more real-world scenarios. Another improvement to our system will be saving downlink bandwidth by reconstructing videos on mobile devices, which requires further research on GAN acceleration to meet real-time constraints on mobile devices.