China’s Moore Threads adds support for 10k GPU clusters • The Register

Chinese GPU vendor Moore Threads says its datacenter-focused AI systems can now support clusters of up to 10,000 accelerators – a tenfold increase from tech it offered last year.

The shift, revealed in a statement to the South China Morning Post, marks Moore Threads’ latest effort to create products that offer Chinese buyers an alternative to tech that can’t be sold in the Middle Kingdom thanks to US export restrictions.

Moore Threads is subject to those restrictions. In 2023 the four-year-old startup was added to the US Entity List, which effectively prevented it from acquiring American technologies without a special license.

The outfit is therefore trying to become a leader in AI accelerators for the Chinese market, without tapping US tech. In December 2023, it unveiled its MTT S4000 chips, which come equipped with 8,192 vector cores and 128 tensor cores capable of up to 100 teraFLOPS of FP16/BF16 and up to 200 TOPS of INT8 performance. More importantly for AI inferencing workloads, each card comes equipped with 48GB of vRAM, good for 768GB/sec of bandwidth.

Up to eight of the cards could be interconnected via Moore Threads’ 240GB/sec MTLink interconnect in a single MCCX D800 server. At the time, Moore Threads said it could support clusters of up to a thousand units. Now it appears to have scaled out.

To achieve the claimed 10,000 GPU cluster, 1,250 MCCX D800 servers would need to be interconnected on a high-speed network. According to Moore Threads, each server in its KUAE clusters comes equipped with dual 400Gbit/sec InfiniBand. However, considering the organization's position on the US Entity List, we wouldn't be surprised if the NICs end up being swapped for a homegrown alternative once existing inventories run dry.

Assuming Moore Threads has scaled its compute clusters to encompass 10,000 accelerators, such a system should be capable of an exaflop of FP16/BF16 performance.
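The arithmetic behind those figures is straightforward. A back-of-envelope sketch using the vendor's published per-card numbers (the breakdown below is our own, not Moore Threads'):

```python
# Back-of-envelope cluster math from Moore Threads' published S4000 specs.
TFLOPS_PER_CARD = 100    # FP16/BF16 teraFLOPS per MTT S4000 (vendor figure)
CARDS_PER_SERVER = 8     # each MCCX D800 server holds eight cards on MTLink
TARGET_CARDS = 10_000    # claimed maximum cluster size

servers_needed = TARGET_CARDS // CARDS_PER_SERVER
cluster_exaflops = TARGET_CARDS * TFLOPS_PER_CARD / 1_000_000  # tera -> exa

print(servers_needed)    # 1250 servers
print(cluster_exaflops)  # 1.0 exaFLOP of FP16/BF16
```

That 1.0 exaFLOP figure assumes perfect linear scaling across the interconnect, which real clusters never quite achieve.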

While a considerable amount of compute, this still falls far behind Nvidia's flagship components. A single H200 boasts nearly three times the memory capacity and ten times the dense floating point performance of Moore Threads' kit when using the FP16 or BF16 data types common in AI training.

Despite the Chinese outfit’s technological disadvantage, it can be argued that US trade restrictions have leveled the playing field in some respects by preventing Nvidia sending its mightiest accelerators to the Middle Kingdom. Nvidia has responded by making less powerful products that comply with US sanctions and can be exported to China.

Moore Threads’ S4000 falls somewhere between Nvidia’s L2 and L20 accelerators that we looked at back in November 2023.

For larger scale deployments involving inferencing and training, however, it appears Nvidia’s made-for-China cards still outperform the S4000. The H20 just falls under export limits with 296 teraFLOPS of FP8 performance and 96GB of speedy high bandwidth memory (HBM) capable of 4TB/sec of bandwidth.

Memory bandwidth is a major consideration for AI inferencing – particularly when it comes to large language models – as it has a direct impact on data throughput. In general, the higher the bandwidth, the more tokens – words, punctuation, or phrases – that can be generated each second.
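To illustrate why that matters, here's a common back-of-envelope estimate (our own illustration, not from Moore Threads or Nvidia): in the memory-bound decode phase of a dense LLM at batch size one, each generated token requires reading roughly the model's full weight footprint, so token rate is approximated as bandwidth divided by model size in bytes:

```python
# Rough memory-bound estimate: tokens/sec ~ memory bandwidth / bytes read per token.
# For a dense LLM at batch size 1, each token reads roughly all weights once.
def est_tokens_per_sec(bandwidth_gbs: float, params_billions: float,
                       bytes_per_param: int = 2) -> float:
    model_bytes_gb = params_billions * bytes_per_param  # FP16 = 2 bytes/param
    return bandwidth_gbs / model_bytes_gb

# Hypothetical 13-billion-parameter model in FP16 (26 GB of weights):
print(round(est_tokens_per_sec(768, 13), 1))    # S4000-class bandwidth: ~29.5 tok/s
print(round(est_tokens_per_sec(4000, 13), 1))   # H20-class bandwidth: ~153.8 tok/s
```

This is an upper bound that ignores compute, KV-cache reads, and batching, but it shows how directly bandwidth caps single-stream generation speed.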

As China lacks a domestic high bandwidth memory supplier and Moore Threads is subject to sanctions, the accelerator-maker will be hard-pressed to beat the performance of Nvidia’s cards.

This could soon change, however. ChangXin Memory Technologies, aka CXMT, is reportedly deploying equipment to produce HBM stacks in China.

While it seems Moore Threads has a long way to go to catch up with US chipmakers, its progress has helped it secure another 2.5 billion yuan ($343.9 million) in financing, according to local media, and sign deals with state-run firms – including China Mobile, China Unicom, and China Energy Engineering Corp – to build a series of compute clusters.

News of Moore Threads' scaling feats will be welcome in Beijing, as China looks to boost its national compute capacity by 30 percent to 300 exaFLOPS by 2025. ®
