Competition for AI Tickets: Chinese Companies Battling for GPUs

In 2022, amid the generative AI boom, the renowned venture capital firm a16z surveyed numerous AI startups and found that they were spending a surprisingly large share of their funding on cloud computing platforms, as if paying an "AI tax".

The modeling capabilities and training services provided by cloud computing platforms have become a new market. Every sizable company needs to rent GPUs from cloud computing platforms.

This is especially true for large Chinese Internet companies, which urgently need to purchase GPUs to support their businesses. Take ByteDance: this year alone it has placed orders for over $1 billion worth of GPUs.

The Chinese tech giants' intensified GPU procurement stems mainly from two factors: earlier cost-cutting had scaled back purchases, and there are now worries about future purchasing restrictions. It also presents an enormous market opportunity for GPU suppliers such as NVIDIA.

Until the end of last year, demand for GPUs among large Chinese tech companies was still in its early stages. Within these companies, GPUs serve two main purposes: supporting internal business operations and cutting-edge artificial intelligence (AI) research, and being offered on cloud computing platforms for others to rent.

An insider at ByteDance revealed that after OpenAI released GPT-3 in June 2020, the company attempted to train a generative language model with tens of billions of parameters. However, because of the limited parameter scale, the model's generative capabilities were mediocre, and ByteDance saw no prospect of commercial application at the time.

Alibaba was also actively purchasing GPUs between 2018 and 2019, some of which went to AI research and development. However, the Chinese market's demand for AI fell short of the initial expectations of Chinese cloud computing companies like Alibaba.

However, the emergence of ChatGPT at the end of 2022 changed everyone's perspective. Large models were no longer an optional concept but a huge opportunity that had to be seized.

Every company began tracking the progress of large models. ByteDance's founder, Zhang Yiming, started reading AI papers, and Alibaba's chairman stated that all industries, applications, software, and services needed to be reshaped with large models.

Since then, ambitious enterprises have not only begun developing their own large models but also plan to offer them as cloud services. This is an enormous market that cannot be ignored.

Microsoft's cloud service Azure had not been very active in China, but now, as OpenAI's exclusive cloud provider, it has customers waiting in line. At its cloud summit in April, Alibaba announced that Model-as-a-Service (MaaS) is the future of cloud computing.

All these services depend on high-performance GPUs such as NVIDIA's A100 and H100. Tight supply has sparked a scramble for orders among Chinese tech giants.

However, even before new batches of GPUs arrive, these companies have already begun reallocating resources internally to prioritize large-model development, and some may cut businesses to free up capacity. Internal reallocation can secure a certain number of GPUs, but most of what large-model training requires still depends on a company's past stockpile and on the arrival of new hardware.

Crazy global competition for computing power

The race for NVIDIA's data-center GPUs is playing out on a global scale, but overseas giants have been bolder and quicker in buying GPUs at scale, and they have kept up continuous investment in recent years.

In 2022, Meta and Oracle invested heavily in the A100. In January of last year, Meta partnered with NVIDIA to build the RSC supercomputing cluster comprising 16,000 A100s. In November of the same year, Oracle bought tens of thousands of A100 and H100 units to build a new data center, which today has deployed over 32,700 A100s and is continuously bringing new H100s online.

Microsoft, which began investing in OpenAI in 2019, has already provided OpenAI with tens of thousands of GPUs. In March of this year, Microsoft announced it had built a new computing center for OpenAI containing tens of thousands of A100s. In May, Google launched a computing cluster called Compute Engine A3, equipped with 26,000 H100 units and aimed mainly at companies wishing to train large models themselves.

Compared with the overseas giants, large Chinese companies are acting with more urgency. Take Baidu: its GPU orders placed with NVIDIA this year run to tens of thousands of units, on par with companies like Google, even though Baidu is far smaller in scale, with revenue last year of 123.6 billion yuan, just 6% of Google's.

ByteDance, Tencent, Alibaba, and Baidu, the four companies that have invested most in AI and cloud computing in China, have each bought tens of thousands of A100s in the past few years. ByteDance holds the most: even excluding this year's new orders, its combined total of A100s and previous-generation V100s approaches 100,000 units.

SenseTime, a fast-growing company, announced this year that its "AI Megadevice" computing cluster has deployed 27,000 GPUs, including 10,000 A100s. However, most of the GPUs that large Chinese companies bought in the past were used to support existing businesses or sold as capacity on cloud computing platforms, not reserved for developing large models or serving customers' large-model demand.

GPUs for training large models are no longer plentiful. If Chinese companies want to invest in large-model research and development for the long term, and profit from "selling shovels", they will need to keep expanding their GPU resources.

OpenAI, the industry leader, is already facing this problem. In mid-May, OpenAI CEO Sam Altman told a small group of developers that, because of a GPU shortage, OpenAI's API service was not stable enough and its speed could not be increased.

If large Chinese companies want not only to train and launch a large model but also to build more products on top of large models, and even to support more customers training their own large models in the cloud, they need to stock up on GPUs in advance.

Why is everyone vying for these four graphics cards?

In training large AI models, the A100 and H100, along with the A800 and H800 variants designed specifically for China, remain irreplaceable choices. According to calculations by Khaveen Investments, NVIDIA's data-center GPU market share reached 88% last year, with the remainder shared by AMD and Intel.

The irreplaceability of NVIDIA's GPUs stems from how large models are trained, whose core steps are pre-training and fine-tuning. Pre-training builds the foundation, like completing a general education through university; fine-tuning then optimizes for specific scenarios and tasks to improve performance on the job.
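The two phases can be sketched with a toy stand-in: a tiny linear model replaces the language model, closed-form least squares replaces the expensive pre-training run, and a few gradient steps on a small task dataset replace fine-tuning. Everything here is illustrative, not a real LLM pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Pre-training": fit general-purpose weights on a large, broad dataset.
X_general = rng.normal(size=(1000, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y_general = X_general @ true_w + rng.normal(scale=0.1, size=1000)
w_pretrained, *_ = np.linalg.lstsq(X_general, y_general, rcond=None)

# "Fine-tuning": start from the pre-trained weights and take a few
# gradient steps on a small, task-specific dataset (a shifted target).
task_w = true_w + np.array([0.0, 0.0, 0.0, 1.0])
X_task = rng.normal(size=(50, 4))
y_task = X_task @ task_w
w = w_pretrained.copy()
for _ in range(500):
    grad = 2 * X_task.T @ (X_task @ w - y_task) / len(y_task)
    w -= 0.01 * grad
# w now tracks the task-specific weights instead of the general ones.
```

Starting from `w_pretrained` rather than from scratch is what makes the second phase cheap: only the task-specific shift has to be learned.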

Currently, only the A100 and H100 deliver the efficiency that pre-training requires. Although their prices seem high, they are actually the most economical choice, and in this early stage of AI in business, cost directly determines whether a service is viable at all.

Stringing together larger numbers of low-performance GPUs can no longer meet the massive computational demands of large models. In multi-GPU training, data must be transferred between chips to synchronize parameters, which leaves some GPUs idle, unable to work continuously. The lower the performance of each card, the more cards are needed, and the more computing power is wasted.
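A toy utilization model makes the point concrete. All numbers below are illustrative assumptions, not measurements, and the linear growth of synchronization cost with card count is a crude simplification of real all-reduce behavior:

```python
def utilization(n_cards, compute_s=1.0, sync_s_per_card=0.002):
    """Fraction of time a card spends computing rather than waiting.

    Assumes synchronization cost grows linearly with the card count,
    a deliberate simplification for illustration.
    """
    sync_s = sync_s_per_card * n_cards
    return compute_s / (compute_s + sync_s)

# Same total throughput either way: 8 strong cards, or 64 cards that are
# each 1/8 as fast (per-card compute time per step is then identical).
few_strong = utilization(8)    # about 0.98
many_weak = utilization(64)    # about 0.89
```

Under these assumptions, the larger cluster of weaker cards spends roughly 11% of its time waiting on synchronization, versus under 2% for the smaller cluster: exactly the waste described above.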

Beyond high single-card computing power, the A100 and H100 also offer high bandwidth to speed up data transfer between cards. Compared with competitor AMD's MI250, the H100 delivers over four times the performance, and its efficient data transfer keeps idle computing power to a minimum.
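Some rough arithmetic shows why that bandwidth matters. The NVLink figures used here are commonly cited numbers and should be treated as assumptions: roughly 600 GB/s for the A100, 400 GB/s for the bandwidth-limited A800 export variant, and 900 GB/s for the H100.

```python
PARAMS = 10e9            # a hypothetical 10-billion-parameter model
GRAD_BYTES = PARAMS * 2  # fp16 gradients: 2 bytes per parameter

def naive_sync_ms(bandwidth_gb_s):
    # Time to move one full gradient copy between two cards; ignores
    # all-reduce algorithms and compute/communication overlap.
    return GRAD_BYTES / (bandwidth_gb_s * 1e9) * 1000

for name, bw in [("A800", 400), ("A100", 600), ("H100", 900)]:
    print(f"{name}: {naive_sync_ms(bw):.1f} ms per gradient copy")
# A800: 50.0 ms, A100: 33.3 ms, H100: 22.2 ms
```

Even in this crude model, each synchronization on the bandwidth-limited card takes more than twice as long as on the H100, and that cost is paid on every training step.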

Even so, the A800 and H800 still outperform comparable products from other large companies and startups. Constrained in performance and built on more specialized architectures, the AI chips and GPUs other companies have introduced so far mainly handle narrower AI computations and struggle with pre-training large models.

Another advantage of NVIDIA is its strong software ecosystem. As early as 2006, NVIDIA introduced the CUDA computing platform, which has since become infrastructure for AI: mainstream AI frameworks, libraries, and tools are all built on it.

Because of this, the only factor that could impact NVIDIA's data center GPU sales in the short term might be production capacity issues at TSMC.

Large models are a significant breakthrough at the model and algorithm layer. Companies that want to build large models, or to provide cloud computing power for them, must secure enough advanced compute as soon as possible. Until the first wave of hype has left the first group of companies either vindicated or disappointed, the scramble for GPUs will not stop.