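First, the original 1927 score is by a composer named Gottfried Huppertz and is said to have been written under the influence of Wagner, Strauss, and others. The world expressed through this grand symphony strikes me as too dramatic, but that may be only natural given that the work is 100 years old. Works from this era were not included in the training data this time, so it is difficult at this point to expect this kind of match from Video2Music.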

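Interest in the film was rekindled, and when it was re-released in 1984 the soundtrack was handled by Giorgio Moroder, known as an originator of electronic dance music. In this version, with the film edited down to a short 80 minutes, the music also takes on a much more modern flavor, and as someone born in 1981 I found it an “easy to understand” match. It is also close to Candidate Song A chosen by Video2Music.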

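If I were to pick one more from the many Metropolis scores, there is the version by minimal techno artist Jeff Mills, released in 2000. Rather than being made for a release of the film, this version is a work that takes the film as inspiration and turns it into music. Mills’ work has always been characterized by an abstract, dark atmosphere, which seems to suit the gloomy style of Metropolis, but I also feel that it conversely gives a domineering impression. Among the candidate songs chosen by Video2Music, B seems to be the closest.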






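In the modeling architecture this time, we used pre-trained models obtained through Representation Learning for video and for music, and succeeded in aligning the two latent spaces by comparing them with a method called Contrastive Learning. A pre-trained representation model is a process that extracts something like what humans would call an “impression” from each target medium (here, video and music). This is expressed as a vector value in an imaginary space called the latent space, and Contrastive Learning finds what the two spaces have in common, that is, the common ground between the impression received from the images and the impression received from the sound. For Video2Music, we added further developments such as using transformers to model time-series data like music and video well. The figure above shows how Video2Music finds candidate songs.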


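Representation learning with deep learning became widely recognized with Word2Vec, published in 2013. By training on a very simple task, solving fill-in-the-blank problems on sentences, it demonstrated “word embedding,” a new way of turning words into vectors. One interesting feature of Word2Vec is that vector arithmetic holds. For example, it was shown that relationships such as (France – Paris) + London = England or (Queen – Woman) + Man = King can be computed in the Word2Vec latent space. Embedding techniques with these characteristics are being applied to document classification and recommendation, for example. They are also being applied in data analysis, as I did in a previous analysis project with Nikkei BP.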





Hello 2023! I was fortunate enough to achieve a lot last year, including Neutone, which I wrote about at the time. In this article, I’m reviewing one of our new inventions, called Video2Music.

AI is used to solve problems that humans solve and to make decisions that would otherwise be made only by experts. In business, the targets of AI prediction are problems that ultimately have a correct answer, such as demand forecasting, credit risk, and fault detection. On the other hand, there are many problems in the real world that have more than one answer, and it is an interesting and important challenge to see how AI can solve them. After all, there are probably more problems without answers in the world than problems with them.

Find music that “feels right” for a video

Towards the end of last year, my team was finally able to release Video2Music, which we had been working on since the beginning of 2022. First, take a look at what we’ve accomplished:


Everyone has some sense of whether or not a piece of music fits a video, and you have probably been moved by the music while watching a movie. However, it is tricky to explain in words why the music fits the visuals. Finding the right music for a given video is a difficult task unless you have knowledge of music and experience in selecting it. It is also a complicated problem, as there is no single right answer, and sometimes a slight deviation can produce a good effect. To solve these problems even partially with AI, we ran many experiments and succeeded in developing an AI that can suggest candidate music that “feels right” for a wide range of videos.

Although it is not included in the video above, the result of Video2Music’s selection for the old silent film Metropolis was pretty awesome. Here is a portion of the film we used as input, along with the top three suggested songs:

Candidate Song A

Candidate Song B

Candidate Song C

Metropolis is a 1927 German film that some consider “the first and best science fiction film.” I won’t go into its interesting plot here, but the film has a long and complicated history, with many editions. As for the music, a number of artists have made soundtracks for this movie (Wikipedia lists about 20).

First of all, the original score from 1927 is by a composer named Gottfried Huppertz, who is said to have been influenced by Wagner and Strauss for this soundtrack. The world view expressed in the magnificent symphony seems too dramatic to me, but that may be because the work is now 100 years old. It is difficult at this point to expect such a historical match from Video2Music, as works from this period were not included in the training data this time.

Interest in the film was rekindled, and when it was re-released in 1984, the soundtrack was created by Giorgio Moroder, the artist known as the originator of electronic dance music. The film was shortened to 80 minutes and the music was given a more modern flavor. I found it an easy-to-understand match. It is also close to Candidate Song A chosen by Video2Music.

The last of the many Metropolis scores that I will feature here is a version by the minimal techno artist Jeff Mills, released in 2000. This version was not created for a release of the film, but rather with the film as an inspiration. Mills’ work has always been abstract and dark in nature, which seems to fit the brooding style of Metropolis, but it also gives a dominating impression. It is somewhat similar to Candidate Song B selected by Video2Music.

As you can see, Video2Music’s AI model can select songs based on a wide range of interpretations for a given video. Of course, the videos and the music data used at the training time were different from those used at the inference time.

Behind the scenes of Video2Music development

Let’s take a look behind the scenes of the development of this model. First of all, we used music promotional videos on YouTube as the primary source of training data; YouTube has a wide variety of music videos of different genres and nationalities. Here are some examples:

Training data example 1
Training data example 2

Music videos are made to accompany the music, so they appeared to be perfect material for this task. However, YouTube has many unusable videos that simply show a still image of an album cover, so obtaining a large amount of quality training data required some tricks. Watching many of the videos, we also noticed that they have a lot in common: an intro at the start, scenes of musicians singing, heavily edited footage, a duration of a few minutes, and so on. We therefore had some doubts about the general applicability of the model. The result, however, was a pleasant surprise for the development team, as we were able to present suitable choices from multiple angles for a wider range of videos than we had imagined.

The modeling architecture used pre-trained representation models for video and music, and we aligned the two latent spaces using a method called Contrastive Learning. Representation models can extract something akin to a human “impression” from each of the target media. This is represented as a vector in an imaginary space called the latent space, and Contrastive Learning finds the relationships between the two spaces; in other words, it is the process of matching the impressions “felt” from the videos with the impressions “felt” from the music. We made further developments to the architecture, such as using transformers to model time-series data like music and video. The figure above shows how Video2Music finds candidate songs from the library.
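
To make the idea of aligning two latent spaces a little more concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss and a similarity-based song ranking, written in plain Python. It is not the actual Video2Music implementation: the function names, the temperature value, and the tiny three-dimensional embeddings are all assumptions made for illustration, and it assumes the video and music embeddings have already been projected into a shared space of the same dimension.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(video_embs, music_embs):
    """Cosine similarity between every video and every music embedding."""
    vids = [normalize(v) for v in video_embs]
    mus = [normalize(m) for m in music_embs]
    return [[sum(a * b for a, b in zip(v, m)) for m in mus] for v in vids]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        top = max(row)  # subtract the max for numerical stability
        log_sum = top + math.log(sum(math.exp(x - top) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(video_embs, music_embs, temperature=0.1):
    """Symmetric InfoNCE loss: pair i is the positive, all other
    pairings in the batch act as negatives. The temperature is an
    assumed value, not one taken from Video2Music."""
    sims = similarity_matrix(video_embs, music_embs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

def rank_songs(video_emb, library):
    """Return song names sorted from most to least similar to the video."""
    names = list(library)
    sims = similarity_matrix([video_emb], [library[n] for n in names])[0]
    return [name for _, name in sorted(zip(sims, names), reverse=True)]

# Hypothetical toy embeddings: row i of each list belongs to the same clip.
video_batch = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
aligned_music = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.0]]
shuffled_music = [aligned_music[1], aligned_music[2], aligned_music[0]]

# Matching pairs should give a lower loss than mismatched ones.
print(contrastive_loss(video_batch, aligned_music)
      < contrastive_loss(video_batch, shuffled_music))  # True

library = dict(zip(["song_a", "song_b", "song_c"], aligned_music))
print(rank_songs(video_batch[0], library))  # ['song_a', 'song_b', 'song_c']
```

During training, a loss of this kind pulls each video towards its own music and pushes it away from the other tracks in the batch; at inference time, only the ranking step is needed to pull candidates from a library.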

Modeling sensory knowledge

The first widely recognized deep learning Representation Learning technique was Word2Vec, published in 2013. One of the interesting features of Word2Vec was that vector operations can be performed in the “meaning space”. For example, it was shown that relationships such as (France – Paris) + London = England or (Queen – Woman) + Man = King can be computed in the Word2Vec latent space. Embedding techniques with these characteristics are being applied to document classification and recommendation, for example. It is also being applied in data analysis, as I used it in a previous analysis project with Nikkei BP.
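
The vector arithmetic can be illustrated with a toy example. The vectors below are hand-made for this sketch and are not real Word2Vec embeddings; the three dimensions (roughly "royalty", "gender", and "fruit") and the helper names are assumptions chosen so that the arithmetic is easy to follow.

```python
import math

# Hand-crafted toy vectors: [royalty, gender (+1 male / -1 female), fruit].
# These are illustrative assumptions, not trained Word2Vec embeddings.
TOY_VECTORS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, -1.0, 0.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, -1.0, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to (a - b) + c, excluding the query words."""
    target = [x - y + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# (Queen - Woman) + Man lands on King in this toy space.
print(analogy("queen", "woman", "man", TOY_VECTORS))  # king
```

With real embeddings the result is the nearest neighbour rather than an exact hit, which is why the lookup uses cosine similarity instead of testing for equality.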


The development of such embedding technology is occurring in all fields. For example, at Qosmo, we often use a basic music feature representation model to vectorize input data, and then build the actual models on top of it for tasks such as tagging the instruments used in a song or estimating BPM. This approach allows us to reach good accuracy much more efficiently than training models from scratch. Furthermore, as mentioned in my recent blog post, it has been shown that this technology can be applied to feature modeling of brain waves to predict what a subject is looking at from the measured brain waves.

The release of the CLIP representation model from OpenAI in 2021 laid a foundation for the series of image-generation AI revolutions in 2022. An approach is being established in which generic pre-trained models constitute a kind of “middleware” to underpin more specific applications. This approach, called “transfer learning,” in which generic pre-trained models learned from a wider range of data are fine-tuned with additional data, has greatly accelerated the development of AI applications in recent years.
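
As a rough sketch of this "middleware" pattern, the example below keeps a feature extractor fixed and trains only a small logistic-regression head on top of it. Everything here is hypothetical: `frozen_encoder` is a trivial stand-in for a real pre-trained model, and the "calm vs. dynamic" tag, the toy signals, and the hyperparameters are assumptions for illustration, not anything used at Qosmo. Note that it shows the simplest variant, where the pre-trained part is left untouched rather than fine-tuned.

```python
import math

def frozen_encoder(signal):
    """Hypothetical stand-in for a pre-trained representation model.
    It maps raw input to a fixed feature vector and is never updated."""
    mean = sum(signal) / len(signal)
    spread = max(signal) - min(signal)
    return [mean, spread]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(features, labels, lr=0.5, steps=2000):
    """Fit a logistic-regression head on frozen features
    using plain batch gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            logit = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(logit) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(signal, weights, bias):
    """Encode with the frozen model, then classify with the trained head."""
    x = frozen_encoder(signal)
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return int(sigmoid(logit) >= 0.5)

# Toy task (an assumption): tag a signal as calm (0) or dynamic (1).
SIGNALS = [[0.1, 0.1, 0.2], [0.5, 0.5, 0.6], [0.0, 1.0, 0.2], [0.9, 0.1, 0.5]]
LABELS = [0, 0, 1, 1]

features = [frozen_encoder(s) for s in SIGNALS]  # the encoder is only applied
weights, bias = train_head(features, LABELS)     # only the head is trained
print([predict(s, weights, bias) for s in SIGNALS])
```

Because only the small head is trained, a handful of labelled examples is enough here; with a real pre-trained encoder the same split is what lets a task-specific model be built from comparatively little data.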

As a result, we can expect to see many more cases in which AI will be applied to problems without clear-cut answers. In the video/music modeling that we worked on this time, the model was able to select effective candidate songs for a very wide range of videos, including plays, home videos, and racing videos, even though it was trained on a fairly specific set of videos. This was a surprising result even for us, and one that makes us wonder further about the possibilities of AI in general.