Dismantling Uniswap V3 data cleaning process
Written by: Zelos
Introduction
In the last issue, we calculated the net worth and rate of return of users on uniswap from the perspective of user addresses. This time, our goal remains the same. But the cash held by these addresses must be counted. Get a total net worth and rate of return.
There are two pools for this statistical object, including
-
polygon 上的 usdc-weth(fee:0.05), pool address: 0x45dda9cb7c25131df268515131f647d726f50608[1], 这也是上次分析所用的池子
-
usdc-eth(fee:0.05) on ethereum, pool address: 0x88e6A0c2dDD26FEEb64F039a2c41296FcB3f5640[2], because this pool contains native tokens, it brings some troubles to data processing
The final data obtained is hourly level data. Note: The data in each row represents the value at the last moment of the hour.
Overall process
-
Get uniswap data
-
Get user cash data
-
Calculate the price sequence, which is the price of eth.
-
Get every minute, how much handling fee is obtained on each tick
-
Get a list of all positions within the statistical period
-
Get the corresponding relationship between address and position
-
Calculate the rate of return for each position
-
Based on the corresponding relationship between position and address, calculate the return rate of each user address as LP
-
Combine the user's cash and LP, and calculate the overall rate of return
1. Get Uniswap data
In order to provide data sources for demeter, we developed the demeter-fetch tool. This tool can obtain Uniswap pool logs from different channels and parse them into different formats. Supported data sources are:
-
ethereum rpc: The standard rpc interface of the eth client. The efficiency of obtaining data is relatively low. More threads need to be opened.
-
Google BigQuery: Download data from BigQuery's data set. Although it is updated once a day, it is easy to use and cheap.
-
Trueblocks chifra: The Chifra service can scrape transactions on the chain and reorganize them. This allows you to easily export transactions, balances and other information. But this requires building your own nodes and services.
Output formats include:
-
minute: Resample uniswap swap transaction information into minute-by-minute data. Used for backtesting
-
tick: records every transaction in the Pool. Including swap and liquidity operations
This time we mainly obtain tick data, which is used to count position information, including fund amount/income per minute/life cycle/holders, etc.
These data are obtained through the event log of the pool. Such as mint, burn, collect. swap. However, the log of the pool does not contain the token id. This makes it impossible for us to locate which position the operation of the pool is for.
In fact, the equity of uniswap LP is through nft to manage, and these nft The manager of the token is proxy contract, the token id only exists in the event log of the proxy. Therefore, if you want to obtain the complete LP position information, you must obtain the event log of the proxy, and then combine it with the event log of the pool.
Taking this transaction [3] as an example, we need to pay attention to the two logs with log index 227 and 229. They are pool respectively. contractThrowing mint and proxy contractIncreaseLiquidity thrown. The amount (that is, liquidity) between them, amount0 and amount1 are the same. This can be used as the basis for correlation. By correlating these two logs, we can get the tick range, liquidity of this LP behavior , token id, and the amount corresponding to the two tokens.
For advanced users, especially some funds, they will choose to bypass the proxy and directly operate the pool contract. In this case, the position will not have a token id. In this case, we will useaddress-LowerTick-UpperTick
format, create an id for this LP position.
For burn and collect, you can also use this method to find the corresponding position id for the pool event. But there is a trouble here. Sometimes the amounts of the two events are not the same, and there will be a slight deviation. For example, this transaction
There will be a small difference between his amount0 and amount1. Although this situation is rare, it is also very common. So when we match burn and collect, we leave some room for error in the values.
The next question to be dealt with is who initiated this transaction. For liquidation, we will use the receipt in the collect event as the holder of the position. For mint, we can only get the sender( See picture with mint event).
If the user is operating the pool contract, the sender will be the LP provider, but if it is an ordinary user operating the contract through a proxy, the sender will be the address of the proxy. This is because the funds are indeed transferred from the proxy to the pool. But the good news is proxy will have nft The generation of token. And this nft token will definitely be transferred to the LP provider. Therefore, by detecting the transfer of the proxy contract (that is, the contract of nft token), you can find the LP provider corresponding to this mint.
In addition, if nft is transferred, the holder of the position will change. We have made statistics on this, and this situation is rare. To simplify, we did not consider the transfer of nft after mint.
2. Get the cash held by the address
The goal of this stage is to obtain the number of tokens held by an address at each moment during the statistical period. To achieve this goal, two aspects of data need to be obtained,
-
The balance of the address at the starting time
-
Transfer records of the address during the statistical period.
By adding and subtracting the balance using the transfer records, the balance at each moment can be inferred.
The balance at the starting time can be queried through the rpc interface. When using achievement node, you can set the height in the query parameters to get the balance at any time. The balance of native token and erc20 can be obtained in this way. .
Obtaining erc20 transfer records is relatively easy and can be obtained through any channel (big query, rpc, chifra).
The transfer records of eth need to be obtained through transactions and traces. The transactions are okay, but the query and processing of traces are very computationally intensive. Fortunately, chifra provides the function of exporting the eth balance. A record can be output when the balance changes, although It can only record quantity changes, but cannot record transfer objects, but it can still meet the requirements. This is the lowest-cost method that meets the requirements.
3. Obtain price
Uniswap is aexchange, if a token exchange occurs, a swap event will be generated, and we can get the price of the token from the sqrtPriceX96 field. Get the total liquidity at that time from the liquidity field.
Since our pool has a stable currency, it is very easy to obtain the price of u. But this price is not absolutely accurate. First of all, it is affected by the frequency of transactions. If there is no swap transaction, the price will lag. In addition, When the stablecoin is unanchored, there will also be a gap between this price and the price for u. But under normal circumstances, this price is accurate enough, and there is no problem for market research.
Finally, the token price is resampled to obtain a price list per minute.
In addition, since the liquidity field of the event also contains the total liquidity of the current pool, we also add the total liquidity in. Finally, a table is formed as follows:
4. Handling fee statistics
Handling fees are the main source of income for position. Every time a user performs a swap operation on the pool, the corresponding position can receive a certain handling fee (that is, lower and upper include the position of the current tick). The amount of income is proportional to the liquidity. Ratio, pool fee rate, and tick range are related.
In order to count the user's fee income, we can record the amount of swap that occurred in the pool on which tick every minute. Then calculate the fee income on this tick in the current minute:
Finally, it becomes a table like this
This statistical method does not take into account the situation when the current tick liquidity is exhausted during swap. However, since our statistical goal is LP, we use tick range for statistics. This error can be alleviated to a certain extent.
5. Get position list
To obtain a list of positions, you must first specify the identifier of the position.
-
For LP invested through Proxy, each position will have an nft, which means it will have a token id, which can be used as the id of the position.
-
For LPs who directly operate pool investments, we will make up an ID for them in the format of
address_LowerTick_UpperTick
. In this way, all positions have their own identifiers.
Through this identifier, we can integrate all operations of LP to form a list describing the entire life cycle of position. For example
However, it should be noted that the object of this statistics is during 2023, not from the creation of the pool. Inevitably, for some positions, we cannot obtain their operations before January 1, 2023. This requires We speculate how much liquidity this position has at the beginning of statistics. We adopt an economical way to speculate:
-
Add the liquidity of mint and burn to get a number L
-
If L>0, that is, mint>burn, it is considered that there is some liquidity before the statistics start, and a mint operation will be compensated at the moment when the statistics start (2023.1.1 0:0:0).
-
If L
This method can avoid downloading data before 2023, thereby saving costs. However, it will facesunk liquidityThe problem is that if the LP does not perform any operations during this year, the LP cannot be found, but this problem is not serious. Since the statistical period is one year, we assume that users will generally adjust LP during this period. Because In the span of a year, the price of eth will change greatly, and users have many reasons to adjust their LP. For example, if the price exceeds the tick range, invest funds in other DEFI, etc. Therefore, as an active user, They will definitely adjust their LP according to the price. For those who deposit their funds in the pool and never adjust it, we consider this user to be inactive and not included in the statistics.
Another more troublesome situation is that the position mint some liquidity before 2023, and then performed some mint/burn operations during the cycle. By the end of the statistics, all the liquidity was not burned. Therefore, we only A part of the liquidity can be counted. In this case, the sunk liquidity will have an impact on the position's fee estimate, causing abnormal returns. The specific reasons will be discussed later.
In the final statistics, polygon has a total of 73,278 positions, while ethereum has 21,210 positions. There are no more than 10 abnormal returns for each chain, proving that this assumption is credible.
6. Obtain the corresponding relationship between address and position
Since the ultimate goal of our statistics is the income of the address, we also need to obtain the corresponding relationship between the address and position. Through this association, we can get the user's specific investment behavior.
In step 1, we did some work to find the associated users of the fund operation (mint/collect). Therefore, as long as we find the sender of mint and the receiver of collect, we can find the corresponding relationship between position and address.
7. Calculate the net value and rate of return of the position
In this step, we need to calculate the net value of each position, and then calculate the rate of return based on the net value
net worth
The net value of Position consists of two parts. One is the liquidity of LP, which is equivalent to the principal of market making. After the user invests funds in Position, the amount of liquidity will not change, but the net value will fluctuate as the price changes. The other part Handling fee income, this part is independent of liquidity, is stored separately in two fields, fee0 and fee1. The net value of handling fees increases with time.
Therefore, at any minute, liquidity is combined with the price of this minute to get the net value of the principal part. The calculation of the handling fee requires the use of the handling fee table calculated in the fourth step.
First, divide the liquidity of this position by the total liquidity of the current pool as the sharing ratio. Then add the handling fees of all ticks included in the tick range of this position to get the handling fee income for this minute.
Expressed as:
Finally, add the handling fees of fee0 and fee1 to get the net handling fee. Then add it to the net value of liquidity to get the total net worth.
When calculating net worth, we divide the life cycle of the position based on mint/burn/collect transactions.
-
When mint trades occur, let liquidity increase
-
When a burn transaction occurs, the liquidity is reduced. And the value of the liquidity is converted to the fee field (the code of the pool contract also operates in this way)
-
When a collect transaction occurs, a calculation will be triggered. The calculation range is from the last collect to the current time. We will calculate the equity and fee income every minute and get a list.
Finally, summarize the net value list obtained from each collect. Then perform resample and other statistics to obtain the final result.
In addition, in order to improve accuracy, we have made two optimizations.
First, for the hour when a transaction occurred (mint/burn/collect), we performed minute-level statistics, and for the hour when no transaction occurred, we performed hour-level statistics. Finally, the results were resampled into hour-level statistics.
Secondly, in the collect event, we can get the sum of liquidity + handling fee. Therefore, we can compare the actual collect value with our theoretical calculation value to get the difference between the theoretical handling fee and the actual handling fee (actually this difference The value also includes the difference of lp principal, but the error of the difference of lp principal is very small and can basically be considered as 0). We will compensate the handling fee difference to each row. In order to improve the accuracy of handling fee estimation (also These are the fee_modify0 and fee_modify1 fields in the table above).
Notice:
-
When backfilling, the allocation of handling fees must be weighted based on the liquidity of the current hour, otherwise the handling fees for this hour will be too high.
-
Since the statistical data is for the whole year of 2023, rather than complete data, there is the phenomenon of sunk liquidity mentioned in Section 5. This will make the actual handling fees much higher than the theoretical handling fees, making the yield rate extremely high. .
Since each row is the data at the last moment of this hour, for the position that has been completely closed, the net value will be 0. In this case, the net value at the time of the position close will be lost. In order to retain this net value, a line is created at the end of the file The data at the time of 2038-1-1 00:00:00 stores the net value and other data at the time of position close. To prepare for the statistical needs of other projects.
rate of return
Normally, the rate of return is calculated by dividing the starting equity by the ending equity. But this does not apply here. The reasons are as follows:
-
The rate of return here needs to be refined to every minute,
-
Since the position will have funds transferred in and out in the middle, simply dividing the net value at the beginning and end cannot reflect the income.
For question 1, we can divide the net value per minute to get the rate of return per minute, and then multiply the rate of return per minute to get the total rate of return.
But this algorithm has a serious problem. If there is a data calculation error in the rate of return per minute, it will lead to a large deviation in the total rate of return. In this way, the statistical process becomes a tightrope walk, and no mistakes can be made. But good On the one hand, this leaves no room for statistical errors.
Regarding question 2, if there is a transfer of funds in and out this minute, directly dividing by the rate of return will still result in a very outrageous rate of return. Therefore, it is necessary to refine the rate of return algorithm per minute.
The first attempt we made was to break down the changes in net worth in detail, and then eliminate the changes in funds. We split the changes in net worth into several parts. 1 is the change in principal brought about by the price. 2 is The accumulated handling fees for this minute. 3 is the inflow and outflow of funds. Obviously, 3 must be excluded from the statistics. For this, we have formulated the following calculation method:
-
Specify that the current minute is n and the previous minute is n-1
-
Assume that all transfer operations in the current minute occur at the n:0.000th second. Then in the remaining time, the net value of LP is unchanged, that is to say, the net value at the n:0.001st second is equal to the net value at the n:59.999th second.
-
The accumulation of handling fees occurs at the end of this minute, which is the n:59.999th second.
-
The price and handling fee at the end of the previous minute (n-1:59.999) are the prices and handling charges at the beginning of this minute (n:0.000)
Based on the above assumptions, the rate of return per minute is to divide the liquidity/price/handling fee at the end by the liquidity/price/handling fee at the end, expressed as follows, where f refers to converting liquidity to Net worth algorithm.
This method looks very good. It perfectly avoids changes in liquidity. And reflects the impact of prices and fees on net worth. This is exactly what we expect. However, in practice, it will occur in some rows Big yield. After investigation we found that the problem occurs when withdrawing liquidity. Recall our rules: the time represented by each row is the end of the minute/hour. This provides uniformity for the statistics of the data scale, but it should be noted that the meaning of each column is different:
-
Xiaobai Navigation
For the net value column, it is the instantaneous value, which is the last value of the current minute/hour.
-
The handling fee column is the cumulative value. That is, the handling fees accumulated during the current minute/hour.
So for that hour of burn liquidity
-
When the LP is burned and the token is transferred away, the net value will be 0 at the end of this hour
-
As for the handling fee, since it is cumulative, at the end of this hour, the handling fee will be greater than 0.
This reduces the above formula to:
This situation will not only occur at the end of the position life cycle, but also when part of the liquidity is burned, it will also cause a change in the ratio of the increase in handling fees to the LP's net worth.
For the sake of simplicity, when the net value of LP changes, we set the rate of return to 1. This will bring errors to the calculation results of the rate of return. But for a normal continuous investment position, the hour of transaction generation is relative to The entire life cycle is still very small. Therefore, the impact is not significant.
8. Calculate the total LP income of the address
With the rate of return of each position, plus the corresponding relationship between position and address, we can get the rate of return of the user address in each position.
The algorithm here is relatively simple. The positions of this address in different periods are connected in series. There is no investment period in the middle, the net value is set to 0, and the rate of return is set to 1 (because the net value before and after is 0, there is no change, so the rate of return is 1.)
If there are multiple positions in the same period, add the net values in the overlapping parts to get the total net value. When merging the returns, we will weight the merger according to the net value of each position.
9. Combined total cash and LP returns
Finally, as long as the cash held by the user address and the LP investment are combined, the final result can be obtained.
The merging of net worth is simpler than the previous step (merging positions). As long as you find the time range on the LP net worth, then look up the cash held in the corresponding time range, and then find out the price of eth, you can get the total net worth. .
For the rate of return, we also use the algorithm of finding the rate of return per minute and then multiplying it. At the beginning, we used the error rate of return algorithm mentioned in Section 7. This requires that the fixed part of this minute (including cash) The amount of cash, liquidity in LP) and the variable part (price changes, fee accumulation, fund transfers in and out) are separated. Compared with the statistics of position, its complexity is much higher, because for the inflow and outflow of uniswap funds , just pay attention to the mint and collect events. The traceability of cash is very troublesome. We have to distinguish whether the funds are transferred to LP or transferred to the outside. If it is transferred to LP, the principal part can remain unchanged. If it is transferred to the outside, , to correct the amount of principal. This requires tracking the transfer destination address of erc20 and eth. This work is very troublesome. First of all, during mint/collect, the transfer address may be a pool or a proxy. What is more complicated is the eth one. For transfers, since eth is a native token, some transfer records can only be found through trace records. However, the amount of trace data is too large and exceeds our processing capabilities.
The last straw that broke the camel's back was when we discovered that the net value of each line is the instantaneous value of this hour, and the handling fee is the cumulative value of this hour, which cannot be directly added in a physical sense. This problem was indeed discovered very late.
Therefore, we gave up this algorithm. Instead, we used the net value of the next minute, divided by the net value of the previous minute. This method is much simpler. But there is also a problem with this method. That is when funds are transferred in and out. , the rate of return will still be unreasonable. Through the above discussion, we know that it is very difficult to separate the flow of funds. Therefore, here we sacrifice some accuracy and set the rate of return when there is a fund transfer to 1.
The remaining question is, how to identify the inflow and outflow of funds in the current hour? The algorithm I thought of at the beginning was very simple. Using the token balance of the previous hour and the current price, we can calculate the net value of this hour if we hold these tokens. What will it be? Then just subtract the estimated value from the actual value. When the difference is not equal to the actual value, it means there is a transfer of funds in and out. The formula is expressed as:
However, this algorithm ignores the complexity of uniswap LP. In LP, the number of tokens will change as the price changes, and the net value will also change accordingly. And this method does not take into account changes in handling fees. Ultimately, the estimated value will be different from the actual value. The value has an error of about 0.1%.
In order to improve the accuracy, the structure of the funds is refined, the value change of LP is calculated separately, and the handling fee is also taken into account.
In this way, the error of the estimated value can be controlled within 0.001%.
In addition, we limit the decimal of the data to avoid division by too small numbers (usually below 10^-10). These small numbers are errors accumulated from various calculations and resamples. If the direct correlation is not processed, Dividing will cause the error to be amplified, seriously distorting the rate of return.
Other questions
native token
In this statistics, the usdc-eth pool on ethereum has been added, where eth is a native token and requires some special processing.
eth cannot be used in defi , must be converted to weth. Therefore, this pool is actually the pool of usdc-weth. For users who directly operate the pool, just transfer weth in and out of this pool. This is the same as an ordinary pool.
For users who add LP through proxy, they need to bring eth into the value of the transaction and transfer it to the proxy contract. The contract then converts these eth into weth and then puts it into the pool. When collecting, usdc can be used directly Transfer it to the user, but eth cannot be transferred directly to the user. It needs to be transferred from the pool to the proxy first, then converted into eth by the proxy contract, and finally sent to the user through internal transfer. For an example, see this transaction [4].
Therefore, the only difference between the usdc-eth pool and the ordinary pool is the transfer in and out of funds. This only affects the matching position and address. In order to solve this problem, we pulled all the nft transfer data from the creation of the pool, and then Find the holder of the corresponding position through the token id.
missing position
In the statistics, some positions did not enter the final list. These positions have certain special features.
A large part of them are MEV transactions. MEV are pure arbitrage transactions and are not normal investors, so they are not included in our statistical scope. In addition, it is difficult to count them in actual statistics, which requires the use of trace level. data. Here we use a simple strategy to filter mev transactions, which isLess than a minute from start to finish, In fact, since the highest accuracy of our data is 1 minute. If a position exists for less than one minute, it cannot be counted.
Another possibility is that there is no collect transaction in this position. As can be seen from step 7, our calculation of income is triggered by collect. Without the collect operation, the previous net value and rate of return will not be calculated. Under normal circumstances Under this situation, users will choose to harvest the income or principal of LP in a timely manner. However, some special users are not excluded, that is, they must store their assets in fee0 and fee1 of uniswap pool. For such users, we also consider them to be special users and are not included in the statistical scope. .
The article comes from the Internet:Dismantling Uniswap V3 data cleaning process
Sun Yuchen said, "I don't want my role to be too big" and it is best to be decentralized to "let everyone feel at ease." Contributing author: Li Ye Sun Yuchen’s long-term exposure to the spotlight is more likely the result of his lack of concealment. He rarely hides his extreme pursuit, extreme ambition, and willingness to do whatever it takes to achieve his goals. These traits are often the key to success. The true but dark path to success, and these...