Ma Siwei gave a keynote speech.
Ma Siwei holds that intelligent coding mainly serves the analysis and understanding of visual content, using coding methods based on features or semantics. At present, data-driven methods and growing computing power are both driving the rapid development of intelligent coding. From object-model coding, to knowledge- and semantic-model coding, and now deep learning coding, the development of intelligent coding has always been closely tied to the development of artificial intelligence technology.
The following is a transcript of Ma Siwei’s speech; the content has been edited and slightly abridged:
I am very happy to have the opportunity to report on my work here. Image coding is an old technical problem; in fact, coding began as soon as digital images were born. This is the first image on a computer. Its resolution is 176×176 at 1 bit per pixel, so each pixel is either black or white.
The image in the middle is called the patron saint of JPEG, and everyone who works in image compression or image processing should know it. Its resolution is 512×512, which was all a scanner could manage in 1972. The last image is at the 4K and 8K ultra-high-definition resolutions that we talk about so much today. On the 8K image we can see the dog’s hair, strand by strand, in fine detail. The 8K resolution is 7680×4320 with a bit depth of 10 bits. This data volume is more than 10,000 times that of the 1957 image.
Even compared with the images of the 1970s, the amount of data has increased nearly 200-fold. The increase in resolution has brought a huge increase in the amount of data. Image acquisition aims at a more precise and accurate recording of spatial and temporal information, while image coding aims to reduce the storage and bandwidth that this data requires. Intelligent coding, on the one hand, improves compression efficiency and, on the other hand, supports more convenient and more intelligent image processing.
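As a rough check of the figures quoted above, the raw data volumes can be computed directly. This is only a back-of-the-envelope sketch: the 8 bits per sample for the 1972 image is an assumption (the talk gives only its resolution), and a single component is counted for every image so that the comparison is like for like.

```python
def raw_bits(width, height, bits_per_sample, components=1):
    """Raw (uncompressed) size of one image in bits."""
    return width * height * bits_per_sample * components

# 1957 image as quoted in the talk: 176x176 at 1 bit per pixel.
first_scan = raw_bits(176, 176, 1)
# 1972 image: 512x512; 8 bits per sample is an assumption, not from the talk.
scan_1972 = raw_bits(512, 512, 8)
# 8K frame as quoted in the talk: 7680x4320 at 10 bits per sample.
frame_8k = raw_bits(7680, 4320, 10)

ratio_vs_1957 = frame_8k / first_scan
ratio_vs_1972 = frame_8k / scan_1972
print(f"8K vs 1957 image: {ratio_vs_1957:,.0f}x")
print(f"8K vs 1972 image: {ratio_vs_1972:,.0f}x")
```

Under these assumptions the first ratio comes out above 10,000 and the second in the low hundreds, the same order of magnitude as the numbers in the talk.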
The best-known early image coding technology is JPEG, which digital cameras still use today. JPEG was launched in 1988 and became a standard in 1992, nearly 30 years ago. Later came JPEG2000, which improves performance by nearly 30% over JPEG. Because of patent issues it was never widely adopted; we have seen many technologies that succeed technically, but whether they succeed in practice depends on many factors. Since then we have heard less about image coding standards, but the intra-frame coding of video coding standards is also commonly used for image coding. H.264 was finalized in 2003 and H.265/HEVC in 2013; BPG images, and HEIF images as commonly deployed, are based on H.265 coding technology. The latest standard is H.266/VVC, finalized in 2020. In terms of image compression, compression efficiency has improved by less than a factor of two in the nearly 30 years since 1992. As I just mentioned, the amount of data has increased hundreds of times over, so compression is a hard problem and the demands on coding are very high.
What is the coding problem? This is the general approach of current coding technology. One part is prediction, which predicts each block to reduce the spatial redundancy of the data; the other is the transform, which decomposes the signal from the spatial domain into the frequency domain so that high-frequency information can be removed. At present, coding efficiency is improved by providing many transform kernels and many prediction techniques and selecting the best combination. The selection process is very complicated; it is usually based on rate-distortion optimization theory, which is used to make decisions and choose coding parameters. This kind of optimization is relatively limited. Simple linear prediction or linear transforms can hardly achieve optimal coding, because real data is too varied and too complex.
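The selection step described above can be illustrated with a toy rate-distortion decision. This is a minimal sketch, not how any particular standard works: the three prediction modes, the quantization step, the Lagrange multiplier `lam`, and above all the bit estimate inside `rd_cost` are simplified assumptions chosen for illustration. Each candidate is scored with the Lagrangian cost J = D + λR, and the cheapest one is kept.

```python
import math

def predict(mode, top, left, n):
    """Return an n x n prediction built from the neighbouring pixels."""
    if mode == "vertical":       # copy the row above downwards
        return [list(top) for _ in range(n)]
    if mode == "horizontal":     # copy the left column rightwards
        return [[left[r]] * n for r in range(n)]
    dc = round((sum(top) + sum(left)) / (2 * n))   # "dc": flat average
    return [[dc] * n for _ in range(n)]

def rd_cost(block, pred, step, lam):
    """Lagrangian cost J = D + lam * R for one candidate prediction."""
    distortion, rate = 0, 0.0
    for row_b, row_p in zip(block, pred):
        for x, p in zip(row_b, row_p):
            level = round((x - p) / step)               # quantized residual
            recon = p + level * step                    # decoder-side value
            distortion += (x - recon) ** 2              # squared error
            rate += 1 + 2 * math.log2(1 + abs(level))   # crude bit estimate
    return distortion + lam * rate

def best_mode(block, top, left, step=4, lam=10.0):
    """Try every mode and keep the one with the lowest cost."""
    n = len(block)
    costs = {m: rd_cost(block, predict(m, top, left, n), step, lam)
             for m in ("dc", "vertical", "horizontal")}
    return min(costs, key=costs.get), costs

# Made-up block whose columns continue the row above it.
top = [10, 50, 90, 130]
left = [12, 12, 12, 12]
block = [[10, 50, 90, 130] for _ in range(4)]
mode, costs = best_mode(block, top, left)
print(mode, costs)
```

For this block the vertical mode predicts every pixel exactly, so it has zero distortion and the smallest estimated rate. A real encoder runs the same kind of comparison over far more modes, block sizes, and transforms, which is why the search is so expensive.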
Therefore, more nonlinear prediction and transform coding techniques are being developed, namely the neural network coding and deep learning coding that are talked about so much now, which can reduce data redundancy through more complex nonlinear prediction and transforms. This is a new research direction. Briefly, here is how deep neural networks are used for predictive coding. In traditional coding, several pixels are usually combined in a weighted prediction. Generally, a few sets of filters are fixed in advance and one of them is selected. However, the combinations found in real signals are too complex for a few simple filters to handle. In contrast, neural networks can perform more complex, optimized prediction. In coding we usually like the number 0.5. It is very simple: the calculation is just (A+B)÷2. We also know that 0.5 is certainly not optimal, but what should it be instead: 0.1, 0.2, 0.7, 0.8? With more choices, the optimization becomes hard to solve, so neural networks are used to solve these more complex coding optimization problems; they handle low-level signal features better and improve coding efficiency. This is the mechanism behind why deep learning coding works.
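The point about the weight 0.5 can be made concrete with a small sketch. It compares the fixed average (A+B)÷2 with a weight fitted to the data by ordinary least squares. The sample values are made up, and the fit here is still a linear one: it only shows that 0.5 is generally not the best weight, whereas a neural network predictor would go further and learn a nonlinear mapping from many neighbouring pixels.

```python
def mse(target, a, b, w):
    """Mean squared error of the prediction w*a + (1-w)*b."""
    return sum((t - (w * x + (1 - w) * y)) ** 2
               for t, x, y in zip(target, a, b)) / len(target)

def best_weight(target, a, b):
    """Closed-form least-squares weight for the prediction w*a + (1-w)*b."""
    num = sum((t - y) * (x - y) for t, x, y in zip(target, a, b))
    den = sum((x - y) ** 2 for x, y in zip(a, b))
    return num / den if den else 0.5   # fall back to the average if A == B

# Made-up samples: the target lies closer to neighbour A than to neighbour B.
a = [100, 104, 98, 110, 120]
b = [60, 70, 58, 80, 90]
target = [92, 97, 90, 104, 114]

w = best_weight(target, a, b)
print(f"fixed 0.5 : mse = {mse(target, a, b, 0.5):.2f}")
print(f"w = {w:.3f}: mse = {mse(target, a, b, w):.2f}")
```

The fitted weight can never do worse than 0.5 on the data it was fitted to, but it has to be either signalled to the decoder or derived identically on both sides, which is part of why fixed filter sets were used in the first place.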
Neural-network-based coding is optimized mainly in two respects. On the one hand, encode as little information as possible, that is, fewer elements: for example, have the neural network output as few features as possible. This reduces the bit rate in an intuitive way, since encoding 8 numbers costs much more than encoding 1. On the other hand, make the information entropy of each coded element itself lower. These are the basic optimization ideas.
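Both levers can be seen in a small sketch that estimates the cost of a list of symbols as (number of symbols) × (empirical Shannon entropy per symbol). The symbol lists are made up, and the estimate is the ideal cost for an entropy coder that knows the symbol statistics, not the output of any real codec.

```python
import math
from collections import Counter

def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def estimated_bits(symbols):
    """Ideal entropy-coder cost: number of symbols x bits per symbol."""
    return len(symbols) * entropy_bits(symbols)

spread = [0, 1, 2, 3, 4, 5, 6, 7] * 4     # 32 symbols, uniformly spread
half = spread[:16]                        # lever 1: fewer elements
peaked = [0] * 28 + [1, 1, 2, 3]          # lever 2: lower entropy per element

for name, data in (("spread", spread), ("half", half), ("peaked", peaked)):
    print(f"{name:7s} {len(data):2d} symbols, "
          f"{entropy_bits(data):.2f} bits/symbol, "
          f"{estimated_bits(data):.1f} bits total")
```

Halving the number of elements halves the estimated cost, and concentrating the same number of elements on a few values lowers it even further.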
Here we can see the performance of neural-network-based image coding built on these optimization ideas. This work from 2016 outperformed JPEG2000, and by 2018 the performance exceeded H.265 coding. The last is a result from 2020, whose performance exceeds the latest standard, VVC. As mentioned earlier, coding is a very difficult problem, and coding efficiency has only doubled in 30 years. Now these neural-network-based coding methods do better than work accumulated over decades, but they have problems too. The data dependence of neural network coding and its complexity are still fairly big problems.
Earlier we saw that using deep learning to solve more complex coding optimization problems can improve coding performance. Now there is another change. Traditional coding, which aims to improve compression efficiency, mainly serves people watching movies and TV shows, including the videos on our mobile phones; all of these are watched by people.
The purpose of coding this type of video is to save storage space and bandwidth. But now more and more video is not only watched by people; increasingly, machines have to analyze and process these videos and images. Traditional coding gives this little consideration. Therefore intelligent coding has been proposed, which adopts semantic or feature-based coding methods that better serve content analysis and understanding.
Originally, images and video were recorded by machines and presented to us humans; they are a form of interaction between humans and machines. Between machines, or between humans and machines, there could be more advanced ways of communicating, such as nerve impulses, not necessarily images and video.
Traditional coding and intelligent coding differ greatly in how data is acquired, represented, and processed. First, the object of traditional coding is images and video, which are captured by array-based refresh of CCD and CMOS sensors and then processed and coded as pixel blocks. Subsequent analysis and understanding of images and video is currently based on deep learning algorithms. In some respects these are better than humans, but their efficiency is relatively low compared with the human visual system. For example, a machine may be shown a great many cats and still recognize a cat as a dog. This is a common problem. A child, after seeing two or three cats, will recognize a cat as a cat.
There has also been a lot of research on the human visual system, including much very fundamental theoretical work early on. One theory used a lot in our coding is multi-channel processing, that is, separate channels for color, contrast sensitivity, and so on. Another is nonlinearity, which can represent image and video content better. This is the mechanism behind it.
Let’s compare traditional coding and intelligent coding. Traditional coding processes pixel blocks, applying predictive and transform coding to them. During processing we do not know what a block contains; all blocks are treated the same. All we know is whether the variance of the data is a little larger or a little smaller, and these are low-level features at the signal level. We humans look at content starting from edge and structural features, then contours, then objects; that is our mode of information processing. Looking at things is actually a coding process. So there is a big difference between coding oriented toward understanding and coding that aims at signal fidelity.
So is it possible to code at more of the feature levels? That is intelligent coding. In fact, the concept of intelligent coding is not new; it has been around for more than 20 years, but its progress has not reached the intelligence we want. For example, model-based coding, much discussed in the early years, segments the image content and codes the segments. MPEG-4 proposed object-based coding, but it relies on very fine segmentation of objects in order to support object interaction. Looking back now, if what we want is analysis and understanding of the content, do we need accurate segmentation? No; the features of the object are enough. There was also coding based on knowledge and semantics, roughly from the late 1980s to the mid-1990s, and later, intelligent coding that integrates the signal and the visual system was proposed. There is in fact a good deal of other work closely related to intelligent coding, such as the visual object descriptions defined in MPEG-7, as well as CDVS and CDVA. By adding feature information to images and videos, image retrieval can be performed based on features. The original image needs to be processed again. Recently there is also coding for machine vision, called Video Coding for Machines, or VCM for short, which is likewise coding aimed at machine analysis and processing.
Let’s look more closely at the relationship between deep learning coding and feature coding. Pixel-level coding uses the DCT, which decomposes the signal into different frequency components for coding; these are very low-level features. Higher-level features are edges and contours, which can also be obtained through learning, simulating the way humans process visual information. At a still higher level, we see features that resemble human faces. So coding based on deep learning actually contains a great deal of visual feature information.
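To make the lowest level concrete, here is a minimal sketch of the textbook orthonormal 1-D DCT-II applied to one made-up row of smooth pixel values. Real codecs use 2-D integer approximations of this transform; the floating-point version below is only meant to show how a smooth signal ends up concentrated in a few low-frequency coefficients.

```python
import math

def dct_ii(x):
    """Orthonormal 1-D DCT-II of a sequence of samples."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                for i, v in enumerate(x))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

row = [52, 55, 61, 66, 70, 73, 76, 79]    # made-up smooth row of pixels
coeffs = dct_ii(row)
energy = [c * c for c in coeffs]
share = sum(energy[:2]) / sum(energy)      # energy in the two lowest bands

print([round(c, 1) for c in coeffs])
print(f"share of energy in first two coefficients: {share:.4f}")
```

Because the transform is orthonormal, the total energy of the coefficients equals that of the pixels, so the small high-frequency coefficients can be quantized coarsely or dropped at little cost in distortion.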
Here we propose a hierarchical coding framework for visual information. It has a structure layer, such as contour information; a texture layer, which carries color and similar information; a semantic layer; and finally a residual signal layer. By fusing the information at these levels, intelligent coding is realized, and the corresponding features can be used for more intelligent processing.
This is a specific implementation of the network framework. The amount of feature data is sometimes very large too, so this part is responsible for removing semantic redundancy and reducing the bit rate, and here is the decoding and reconstruction process. We tested this coding method on some large-scale image data sets.
First, in terms of coding efficiency: compared with VVC, compression efficiency can be improved almost 2 to 3 times at the same visual quality. As I just said, intelligent coding is not only about improving compression efficiency; it also has advantages for content analysis.
In this work the network is trained not only for compression but also for analysis and recognition tasks such as image segmentation and face attribute prediction. The feature information extracted by the coding network can support these analysis and recognition tasks while also achieving compression, without decoding and reconstructing the image. There is no need to go back to the pixel level.
Hierarchical coding of visual information also enables other interesting work. Here, for example, image structure and texture are separated, so the contour information of one image can be combined with the color information of another, which makes it possible to quickly alter image content and even other things. These tasks are similar to image stylization, but the main point here is to define the underlying data representation, on which more processing can then be built.
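One way to picture such a layered representation is as a record with one field per layer, so that layers from different images can be recombined without touching pixels. The sketch below is purely illustrative: the `LayeredImage` class, the `recombine` helper, and the placeholder byte strings are hypothetical and are not the network or bitstream format described in the talk.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class LayeredImage:
    """Hypothetical container for the four layers named in the talk."""
    structure: bytes   # stand-in for coded contour / edge information
    texture: bytes     # stand-in for coded color and texture features
    semantic: bytes    # stand-in for coded semantic features
    residual: bytes    # stand-in for what the other layers do not explain

    def total_bits(self):
        """Size of all layers together, in bits."""
        return 8 * (len(self.structure) + len(self.texture)
                    + len(self.semantic) + len(self.residual))

def recombine(structure_src, texture_src):
    """Keep the layers of one image but take the texture of another.

    The residual is dropped (an assumption of this sketch) because it was
    computed against the original texture and would no longer match.
    """
    return replace(structure_src, texture=texture_src.texture, residual=b"")

dog = LayeredImage(b"dog-contours", b"brown-fur", b"label:dog", b"\x01\x02")
cat = LayeredImage(b"cat-contours", b"grey-fur", b"label:cat", b"\x03")
mixed = recombine(dog, cat)
print(mixed)
print(mixed.total_bits(), "bits")
```

The design point is that editing happens on the representation itself: the originals are left untouched, and a decoder would only be needed at the end to render the recombined result.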
Another is the work we are doing now, to open up the link between intelligent coding and more intelligent processing. One part is video generation, where video can be generated from a pair of images, and the other is changing the effect between images; both are still in progress.
Finally, let me talk about the future trends of intelligent coding. At the top is the theoretical foundation of coding, including information theory, visual representation theory, and so on. I think the human visual system is a very good encoder. This part has a corresponding basis in physiological intelligence. The earliest was perceptron coding; in fact, when neural networks first appeared there was already a lot of coding work, and the process of people looking at things is also coding. Model coding followed, and then deep learning coding, including the recent conceptual compression, generative compression, and so on. From this we can see that data-driven methods and improved computing power have promoted the development of intelligent coding. Intelligent coding is not a new thing; it has been renewed by more computing resources and the development of artificial intelligence technology. It also reflects a spiral development from objects to content, with more and more semantics.
Just as traditional coding defines codec methods to achieve compression, the goal of intelligent coding is to define a more efficient data representation and to provide a high-level interface for more intelligent digital media processing. I think this is the future direction of work in intelligent media coding.