Anyone who has taken a basic computer science class has undoubtedly spent time devising a sorting algorithm—code that will take an unordered list of items and put them in ascending or descending order. It’s an interesting challenge because there are so many ways of doing it and because people have spent a lot of time figuring out how to do this sorting as efficiently as possible.
Sorting is so basic that algorithms are built into most standard libraries for programming languages. And, in the case of the C++ library used with the LLVM compiler, the code hasn’t been touched in over a decade.
But Google’s DeepMind AI group has now developed a reinforcement learning tool that can develop extremely optimized algorithms without first being trained on human code examples. The trick was to set it up to treat programming as a game.
It’s all a game
DeepMind, among other things, is notable for having developed software that teaches itself how to play games. That approach has proven highly effective, conquering games as varied as chess, Go, and StarCraft. While the details vary depending on which game it’s tackling, the software learns by playing itself and discovers options that allow it to maximize a score.
Because it isn’t trained on games humans play, the DeepMind system can discover approaches to the games that humans haven’t thought of. Of course, since it’s always playing against itself, there are cases where it has developed blind spots that humans can exploit.
This approach is very relevant to programming. Large language models write effective code because they have seen plenty of human examples. But because of that, they’re unlikely to develop something that humans haven’t done previously. If we’re looking to optimize well-understood algorithms, like sorting functions, then basing something on existing human code is, at best, going to get you equivalent performance. But how do you get an AI to identify a truly new approach?
The people at DeepMind took the same approach as they had with chess and Go: They turned code optimization into a game. The AlphaDev system developed x86 assembly algorithms that treated the latency of the code as a score and tried to minimize that score while ensuring that the code ran to completion without errors. Through reinforcement learning, AlphaDev gradually develops the ability to write tight, highly efficient code.
Inside AlphaDev
Saying that the system optimizes for latency is very different from explaining how it operates. Like most other complex AI systems, AlphaDev consists of several distinct components. One of them is a representation function, which tracks the overall performance of the code as it’s developed. This includes the general structure of the algorithm, as well as the use of x86 registers and memory.
The system adds assembly instructions individually, chosen by a Monte Carlo tree search—again, an approach borrowed from game-playing systems. The “tree” aspect of this approach allows the system to quickly narrow in on a limited area of the large range of potential instructions, while the Monte Carlo adds a degree of randomness to the precise instruction that gets chosen from that branch. (Note that “instruction” in this context includes things like the specific registers chosen to create a valid and complete assembly.)
The system then evaluates the state of the assembly code for latency and validity and assigns it a score, comparing that to the score of the previous one. And, through reinforcement learning, it hangs on to information about how going down different branches of the tree work, given the program’s state. Over time, it “learns” how to achieve a winning game state—a completed sorting—with a maximum score, meaning a minimum latency.
The main benefit of this system is that its training doesn’t have to involve any code examples. Instead, the system generates its own code examples and then evaluates them. In the process, it hangs on to information about combinations of instructions that are effective in sorting.
Useful code
Sorting in complex programs can handle large and arbitrary collections of items. But at the level of standard libraries, it’s built from a large collection of highly specific functions that handle just one or a few situations. For example, there are separate algorithms for sorting three items, four items, and five items. And there’s another set of functions that can handle an arbitrary number of items up to a limit—meaning you can call one that sorts up to four items, but no more.
DeepMind set AlphaDev on each of these functions, but they operate very differently. For the functions that handle a specific number of items, it’s possible to write code without any branches where you execute different code based on the state of a variable. As a result, the performance of this code generally scales with a number of instructions required. AlphaDev was able to shave an instruction each off sort-3, sort-5, and sort-8, and even more off sort-6 and sort-7. There was only one (sort-4) where it didn’t find a way to improve the human code. Repeated runs of the code on actual systems showed that fewer instructions did lead to better performance.
Sorting a variable number of entries does involve branching in the code, and different processors have different amounts of hardware dedicated to handling these branches. So for these, the code was evaluated based on its performance on 100 different machines. Here again, AlphaDev found ways of squeezing out additional performance, and we’ll take a look at how it did this in one situation: a function that sorts up to four items.In the existing implementation in the C++ library, the code does a series of tests to see how many items it needs to sort and calls the dedicated sorting function for that number of items. The revised code does something much weirder. It tests if there are two items and calls out to a separate function to sort them if needed. If it’s greater than two items, the code calls out to sort the first three items. If there are three items, it returns the results of that sort.
If there are four items to sort, however, it runs specialized code that is extremely efficient at inserting a fourth item into the appropriate place within a set of three sorted items. This sounds like a weird approach, but it consistently outperformed the existing code.
In production
Since AlphaDev did produce more efficient code, the team wanted to get these incorporated back into the LLVM standard C++ library. The problem here is that the code was in assembly rather than C++. So, they had to work backward and figure out the C++ code that would produce the same assembly. Once that was done, the code was incorporated into the LLVM toolchain—the first time some of the code had been modified in over a decade.
As a result, the researchers estimate that AlphaDev’s code is now executed trillions of times a day.