Cerebras reaches twenty billion parameters in a workload on a single chip
Cerebras has trained an artificial intelligence model with twenty billion parameters on a single device, without having to scale the workload across numerous accelerators. The achievement matters for machine learning because it reduces the infrastructure and software complexity that models of this size have previously required.

The Wafer Scale Engine-2 is etched into a single 7 nm wafer, the equivalent of hundreds of premium chips on the market, and features 2.6 trillion transistors. It also incorporates 850,000 cores and 40 GB of integrated cache, with a power consumption of 15 kW. Tom’s Hardware notes that “a single CS-2 system is akin to a supercomputer all on its own.”

Fitting a 20-billion-parameter NLP model on a single chip reduces the overhead of training across thousands of GPUs, along with the associated hardware and scaling requirements. It also eliminates the technical difficulty of partitioning models across those GPUs. The company states this is “one of the most painful aspects of NLP workloads, […] taking months to complete.” The partitioning is a bespoke problem, specific to each neural network being processed, the specifications of each GPU, and the network that ties the components together, and researchers must work it out before the first training run can begin. The result is also tied to one setup and cannot be carried over to other systems.

Some recent systems perform exceptionally well with fewer parameters. One is Chinchilla, which with 70 billion parameters consistently outperforms the larger GPT-3 and Gopher. Even so, Cerebras’ accomplishment is significant because researchers will be able to build and train increasingly elaborate models on the Wafer Scale Engine-2 where others cannot.
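To put the two figures quoted above side by side (20 billion parameters and 40 GB of on-chip memory), here is a rough back-of-the-envelope sketch. The numeric precisions are assumptions for illustration, since the article does not say which one Cerebras used, and the gigabyte is taken as decimal.

```python
GB = 10**9  # decimal gigabyte (assumption; the article only says "40 GB")

ON_CHIP_GB = 40          # on-chip memory figure quoted in the article
PARAMS = 20_000_000_000  # 20 billion parameters


def weights_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the parameters alone, in GB."""
    return num_params * bytes_per_param / GB


# Hypothetical precisions, chosen only to illustrate the scale.
for label, nbytes in [("16-bit", 2), ("32-bit", 4)]:
    need = weights_gb(PARAMS, nbytes)
    print(f"{label} weights: {need:.0f} GB vs {ON_CHIP_GB} GB on chip")
```

Under these assumptions the weights alone would fill or exceed the on-chip memory, before counting the gradients, optimizer state, and activations that training also needs.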
The large parameter count is made possible by the company’s Weight Streaming technology, which lets researchers “decouple compute and memory footprints, allowing for memory to be scaled towards whatever the amount is needed to store the rapidly-increasing number of parameters in AI workloads.” As a result, the time needed to set up training drops from months to minutes, and a few standard commands are enough to switch between models such as GPT-J and GPT-Neo.

News Source: Tom’s Hardware