Eliminating Smart Contract Bugs with TLA+ by DFINITY The Internet Computer Review Apr, 2023

Temporal Logic of Actions (TLA+), a general toolkit that can exhaustively test software designs, also works on ckBTC smart contracts. Say goodbye to reentrancy bugs, and hello to more secure BTC <> ICP transfers.

Written by Ognjen Maric

In June 2016, the excitement was running high. “The DAO”, built on the Ethereum , was about to become the world’s first fully digital and decentralized investment fund. But within weeks, it became a poster child for DeFi hacks instead, after an attacker drained $50 million worth of Ether from it.

To hack The DAO, the attacker exploited a so-called reentrancy bug, a type of concurrency bug where one method of a particular smart contract is called while another one is still executing. And The DAO was just the first in the long progression of smart contracts plagued with such bugs. Prominent successors include Uniswap/Lendf.Me with $25 million stolen in 2020, CREAM was drained of $18 million, 2021, and Fei of $80 million in 2022. The reason why these bugs persist is that they can involve unexpected interactions of code scattered throughout a smart contract, or sometimes even in different smart contracts altogether. The number of such code interactions may be huge, and thus very difficult for humans to detect and weed out the unwanted ones. Moreover, these interactions are usually difficult to test automatically and systematically.

TLA+: the Exterminator
Enter the Temporal Logic of Actions (TLA+), a language for specifying and verifying complex systems. TLA+ can not only find bugs, it can also verify their absence. How is this possible, you ask? TLA+ comes with a set of tools for lightweight formal verification in the form of so-called model checking. Through model checking, it exhaustively explores all possible concurrent interactions of a model of the code — exactly the domain that is difficult to test — and finds bugs. Just picture an exterminator who sheds light on every nook and cranny to expose and eliminate unwanted pests hidden in those dark corners often missed or overlooked. Importantly, after building the model of the code, model checking runs with virtually no further human input, making it highly cost-effective. To illustrate with some made-up numbers: if the industry standard practices (such as testing and security reviews) eliminate 80% of the bugs, and “heavyweight” formal verification eliminates 99%, with TLA+ you can eliminate 90% with a fraction of the effort of the heavyweight verification.

TLA+ and The Internet Computer
Our engineering teams use TLA+ to analyze various aspects of the Internet Computer, including the security-critical smart contracts that run on the platform. While Ethereum and the Internet Computer have different execution models, reentrancy bugs can appear in the smart contracts of both blockchains. In fact, due to its high throughput, the Internet Computer allows multiple concurrent calls to a single smart contract, requiring smart contract authors to take even more care to eliminate unwanted concurrency interactions. As security is vital to the success of the Internet Computer, TLA+ has been used on a number of smart contracts to detect bugs during the development phase. A recent TLA+ analysis was conducted on the newly released Chain-Key Bitcoin (ckBTC) canisters (Internet computer’s smart contracts), the results of which I will share below.

ckBTC Smart Contracts

To talk about the TLA+ analysis of ckBTC, let’s start with a high-level overview of ckBTC. ckBTC is an Internet Computer native token that’s securely backed 1:1 by Bitcoin (BTC). The ckBTC ledger is a canister smart contract on the Internet Computer that keeps track of how much ckBTC each end user owns. This same ledger enables end users to transfer their ckBTC to other end users faster and cheaper than they could transfer BTC on the Bitcoin network. To convert ckBTC to and from BTC, the end users interact with a different smart contract, namely the ckBTC minter.

At a high level, the conversion from BTC to ckBTC operates as illustrated below:

Steps to obtain ckBTC:

An end user first transfers some BTC to a user-specific Bitcoin deposit address. All deposit addresses are completely controlled by the code of the ckBTC minter smart canister, thanks to chain-key ECDSA signatures . This creates a new UTXO owned by the deposit address.The end user notifies the ckBTC minter smart contract of the new deposited UTXO, by asking the minter to update the user’s balance.The minter smart contract then uses the Internet Computer’s Bitcoin integration to retrieve all the UTXOs owned by the minter-controlled, but user-specific deposit address. The minter then picks out the new UTXOs from the Bitcoin network, by cross-checking them against its internal bookkeeping list of UTXOs it has already processed.Finally, the minter instructs the ledger to mint new ckBTC for all new UTXOs at a 1:1 ratio, and once this is done, it adds the newly found UTXO to the list of already processed UTXOs.

*Disclaimer: the process above omits the KYT (know your transaction) process for simplicity

ckBTC Smart Contracts Meet TLA+

Alas, the high-level picture presented above actually hides a subtle concurrency problem. The good news is that we can detect the problem using the TLA+ toolkit. But before TLA+ can detect anything, we have to provide it with two things first:

Create a model of our system in the TLA+ language.Specify the TLA+ property that we expect our system to achieve.

The model is a simplified, but still fully defined version of the system. In our case, it includes the minter smart contract, but also all of the other relevant parts of the system, namely the ledger contract and the Bitcoin network. For each component, we model its state, and also how the relevant actions that end users could take (i.e., transferring BTC or ckBTC, depositing BTC, or calling the different operations that the minter exposes) change the state. All these elements are simplified to some extent in the model; the parts the analysis focuses on (such as the minter code) are modeled in more detail, and the other parts (such as the Bitcoin network) in less detail. In fact, the TLA+ model is similar to a detailed design specification of the system.

Intuitively, the main property that we want to achieve is to ensure that our ckBTC stays fully backed. In other words, no end user should be able to “double spend” their deposited BTC to get more ckBTC than what they deposited. But to analyze such a property, we have to make this intuition very precise by expressing it formally in the TLA+ language. Our formal definition says that the total supply of ckBTC (i.e., the sum of the ckBTC balances of all end users, as stored by the ckBTC ledger) does not exceed the total amount of BTC controlled by the minter canister (i.e., the sum of the values of all UTXOs owned by some deposit address).

Detecting and Fixing the Problem

Once we have created the model and the property, the TLA+ toolkit can analyze the model. To run the analysis, we first define the setting in which the analysis runs. For example, our setting for ckBTC includes only two different end users, and a small total supply of bitcoin (e.g., 5 Satoshi). While this setting is very limited, empirical studies show that a vast majority of problems in computer systems can be found with a small number of entities. The small setting allows the analysis to systematically explore all possible sequences of actions defined by the model, and check that the stated properties hold under any such sequence. Moreover, the analysis runs fully automatically, that is, without needing any human input. If the properties don’t hold, it gives us the exact sequence of actions that violated the property.

For example, the analysis could find that we can violate our “no unbacked ckBTC” property in the scenario as illustrated in the figure below:

Here:

The end user deposits BTC, same as earlierHowever, now the end user next instructs the ckBTC minter to update the balance twice in quick succession. On the Internet Computer, these updates can concurrently interact with other smart contracts.The two concurrently running updates ask the Bitcoin network for all UTXOs deposited by the end user, and both receive the same UTXO deposited in step 1 in the response; as the UTXO is not yet on the “already processed” list, both updates deem the UTXO to be new.Both updates instruct the ledger to mint the corresponding amount of ckBTC in step 4.

Thus, at the end of step 4, we are left in a state where the ckBTC supply is double the sum of UTXOs in the deposit addresses, violating our “no unbacked ckBTC” property. Note that this bug arises only because of concurrent execution of updates to the end user balance — it’s in fact a re-entrancy bug, the same flavor of bug that took The DAO down.

One way to fix this problem is to prevent the balance updates from running in parallel for the same end user. We can do this by having the minter store the end user in a list of “locked” users when a balance update starts, and keeping them on this list for the duration of the update. If the end user tries to initiate another concurrent balance update, the update will first check and find this particular end user on the list of locked users, and abort.

Note that we simplified the attack here a bit: locking is in fact a fairly common pattern for IC smart contracts, and the ckBTC minter deployed it from the start. However, the locking was improperly executed, opening the door to the above attack. TLA+ was able to detect this. Additionally, we also found a few more edge cases violating the “no unbacked ckBTC” and other important properties. Once we had applied all the fixes, TLA+ could verify and confirm that there were no more property violations in our defined setting.

Should You Use TLA+ for Your Smart Contracts?

Absolutely! OK, OK, maybe I should provide a more nuanced answer. Our R&D team uses it for the critical smart contracts on the Internet Computer whenever there is non-trivial concurrency involved. For example, the TLA+ analysis uncovered subtle concurrency bugs in the Internet Computer governance and ledger canisters that were missed in earlier manual security reviews and could have caused significant problems. Case and point — TLA+ definitely shines in finding such tricky issues.

There is, of course, a price to pay in terms of acquiring the know-how, and then building the models. But speaking from experience, this price is reasonable, especially considering the high-stakes nature of security. And one shouldn’t be intimidated by the TLA+ language itself. Despite the scary name, it is fairly simple to access its powerful capabilities. In fact, a TLA+ cheat sheet describing nearly all of TLA+ features roughly fits on a one-pager.

What’s more, the effort it takes to build a model is comparable to a standard manual security review. For example, for the ckBTC minter, understanding the design, and creating the initial model and the properties took 1 person around 3 weeks in total. The resulting TLA+ model is much more concise than the implementation, both because it’s simplified compared to code, but also due to the high-level nature of TLA+. The model of the ckBTC minter takes ~250 lines of code, compared to thousands of lines of the Rust implementation. Perhaps surprisingly, one of the most difficult parts, in terms of effort, is specifying the desired properties, as one often finds gaps in the intuitive specification. Once the model is completed, the analysis (model checking) is done automatically. The analysis may run for a long time — for example, for ckBTC, it took over 20 hours to finish. Finally, an additional benefit of TLA is that the modeling and analysis can be done at the design stage, before the code is even implemented. This can help eliminate large re-factorings due to design errors later down the line.

Where TLA+ Is Not Helpful
On the negative side, TLA+ is less helpful for complicated sequential programs, as opposed to concurrent ones. It’s also not great at handling complex cryptographic protocols. There are other tools (such as Tamarin) that are more useful for such protocols. Finally, there is the problem of keeping the TLA+ model and the actual implementation in sync; as the implementation code evolves over time, it can happen that it no longer corresponds to the model. This can be difficult to catch without tool support. In fact, our R&D team is experimenting with addressing this — I hope we’ll be able to share some tooling with you soon!

I hope that this overview has sparked your interest in using TLA+ for your own critical smart contracts. If so, stay tuned — a follow-up post with a more in-depth tutorial on how to use TLA+ for smart contracts is coming soon, accompanied by the actual TLA+ models developed by our R&D team.

Learn more: