Dismantling Uniswap V3 data cleaning process

All articles1年前 (2024)发布 #beeloverE3RI6hhshHE

139 0 0

We calculated the user net worth and rate of return on Uniswap from the perspective of user addresses.

Written by: Zelos

Introduction

In the last issue, we calculated the net worth and rate of return of users on uniswap from the perspective of user addresses. This time, our goal remains the same. But the cash held by these addresses must be counted. Get a total net worth and rate of return.

There are two pools for this statistical object, including

usdc-weth(fee:0.05) on polygon, pool address: 0x45dda9cb7c25131df268515131f647d726f50608[1], this is also the pool used in the last analysis
usdc-eth(fee:0.05) on ethereum, pool address: 0x88e6A0c2dDD26FEEb64F039a2c41296FcB3f5640[2], because this pool contains native tokens, it brings some troubles to data processing

The final data obtained is hourly level data. Note: The data in each row represents the value at the last moment of the hour.

Overall process

Get uniswap data
Get user cash data
Calculate the price sequence, which is the price of eth.
Get every minute, how much handling fee is obtained on each tick
Get a list of all positions within the statistical period
Get the corresponding relationship between address and position
Calculate the rate of return for each position
Based on the corresponding relationship between position and address, calculate the return rate of each user address as LP
Combine the user's cash and LP, and calculate the overall rate of return

拆解 Uniswap V3 数据清洗流程

1. Get Uniswap data

In order to provide data sources for demeter, we developed the demeter-fetch tool. This tool can obtain Uniswap pool logs from different channels and parse them into different formats. Supported data sources are:

ethereum rpc: The standard rpc interface of the eth client. The efficiency of obtaining data is relatively low. More threads need to be opened.
Google BigQuery: Download data from BigQuery's data set. Although it is updated once a day, it is easy to use and cheap.
Trueblocks chifra: The Chifra service can scrape transactions on the chain and reorganize them. This allows you to easily export transactions, balances and other information. But this requires building your own nodes and services.

Output formats include:

minute: Resample uniswap swap transaction information into minute-by-minute data. Used for backtesting
tick: records every transaction in the Pool. Including swap and liquidity operations

This time we mainly obtain tick data, which is used to count position information, including fund amount/income per minute/life cycle/holders, etc.

These data are obtained through the event log of the pool. Such as mint, burn, collect. swap. However, the log of the pool does not contain the token id. This makes it impossible for us to locate which position the operation of the pool is for.

In fact, the equity of uniswap LP is through nft 来管理的, 而这些 nft token 的管理人是 proxy contract, the token id only exists in the event log of the proxy. Therefore, if you want to obtain the complete LP position information, you must obtain the event log of the proxy, and then combine it with the event log of the pool.

Taking this transaction [3] as an example, we need to pay attention to the two logs with log index 227 and 229. They are pool respectively. contract抛出的 mint 和 proxy 合约抛出的 IncreaseLiquidity. 他们之间的 amount( 也就是 liquidity ), amount0 和 amount1 是一样的. 这可以作为关联的依据. 通过将这两个 log 关联起来, 我们可以得到这个 LP 行为的 tick range, liquidity, token id, 以及两种 token 对应的金额。

拆解 Uniswap V3 数据清洗流程

For advanced users, especially some funds, they will choose to bypass the proxy and directly operate the pool contract. In this case, the position will not have a token id. In this case, we will useaddress-LowerTick-UpperTickformat, create an id for this LP position.

For burn and collect, you can also use this method to find the corresponding position id for the pool event. But there is a trouble here. Sometimes the amounts of the two events are not the same, and there will be a slight deviation. For example, this transaction

拆解 Uniswap V3 数据清洗流程

There will be a small difference between his amount0 and amount1. Although this situation is rare, it is also very common. So when we match burn and collect, we leave some room for error in the values.

The next question to be dealt with is who initiated this transaction. For liquidation, we will use the receipt in the collect event as the holder of the position. For mint, we can only get the sender( See picture with mint event).

如果用户是操作 pool 合约, 这个 sender 就会是 LP provider, 但如果是普通用户, 通过 proxy 操作合约, 这个 sender 会是 proxy 的地址, 这是因为资金确实是从 proxy 转到 pool 的. 但好在 proxy 会有 nft token 的产生. 而这个 nft token, 一定会转移给 LP provider. 因此, 检测 proxy 合约 ( 也就是 nft token 的合约 ) 的 transfer, 就可以查找到这个 mint 所对应的 LP provider

In addition, if nft is transferred, the holder of the position will change. We have made statistics on this, and this situation is rare. To simplify, we did not consider the transfer of nft after mint.

2. Get the cash held by the address

The goal of this stage is to obtain the number of tokens held by an address at each moment during the statistical period. To achieve this goal, two aspects of data need to be obtained,

The balance of the address at the starting time
Transfer records of the address during the statistical period.

By adding and subtracting the balance using the transfer records, the balance at each moment can be inferred.

The balance at the starting time can be queried through the rpc interface. When using achievement node, you can set the height in the query parameters to get the balance at any time. The balance of native token and erc20 can be obtained in this way. .

Obtaining erc20 transfer records is relatively easy and can be obtained through any channel (big query, rpc, chifra).

The transfer records of eth need to be obtained through transactions and traces. The transactions are okay, but the query and processing of traces are very computationally intensive. Fortunately, chifra provides the function of exporting the eth balance. A record can be output when the balance changes, although It can only record quantity changes, but cannot record transfer objects, but it can still meet the requirements. This is the lowest-cost method that meets the requirements.

3. Obtain price

Uniswap is aexchange, if a token exchange occurs, a swap event will be generated, and we can get the price of the token from the sqrtPriceX96 field. Get the total liquidity at that time from the liquidity field.

Since our pool has a stable currency, it is very easy to obtain the price of u. But this price is not absolutely accurate. First of all, it is affected by the frequency of transactions. If there is no swap transaction, the price will lag. In addition, When the stablecoin is unanchored, there will also be a gap between this price and the price for u. But under normal circumstances, this price is accurate enough, and there is no problem for market research.

Finally, the token price is resampled to obtain a price list per minute.

In addition, since the liquidity field of the event also contains the total liquidity of the current pool, we also add the total liquidity in. Finally, a table is formed as follows:

拆解 Uniswap V3 数据清洗流程

4. Handling fee statistics

Handling fees are the main source of income for position. Every time a user performs a swap operation on the pool, the corresponding position can receive a certain handling fee (that is, lower and upper include the position of the current tick). The amount of income is proportional to the liquidity. Ratio, pool fee rate, and tick range are related.

In order to count the user's fee income, we can record the amount of swap that occurred in the pool on which tick every minute. Then calculate the fee income on this tick in the current minute:

拆解 Uniswap V3 数据清洗流程

Finally, it becomes a table like this

拆解 Uniswap V3 数据清洗流程

This statistical method does not take into account the situation when the current tick liquidity is exhausted during swap. However, since our statistical goal is LP, we use tick range for statistics. This error can be alleviated to a certain extent.

5. Get position list

To obtain a list of positions, you must first specify the identifier of the position.

For LP invested through Proxy, each position will have an nft, which means it will have a token id, which can be used as the id of the position.
For LPs who directly operate pool investments, we will make up an ID for them in the format ofaddress_LowerTick_UpperTick. In this way, all positions have their own identifiers.

Through this identifier, we can integrate all operations of LP to form a list describing the entire life cycle of position. For example

拆解 Uniswap V3 数据清洗流程

However, it should be noted that the object of this statistics is during 2023, not from the creation of the pool. Inevitably, for some positions, we cannot obtain their operations before January 1, 2023. This requires We speculate how much liquidity this position has at the beginning of statistics. We adopt an economical way to speculate:

Add the liquidity of mint and burn to get a number L
If L>0, that is, mint>burn, it is considered that there is some liquidity before the statistics start, and a mint operation will be compensated at the moment when the statistics start (2023.1.1 0:0:0).
If L

This method can avoid downloading data before 2023, thereby saving costs. However, it will facesunk liquidityThe problem is that if the LP does not perform any operations during this year, the LP cannot be found, but this problem is not serious. Since the statistical period is one year, we assume that users will generally adjust LP during this period. Because In the span of a year, the price of eth will change greatly, and users have many reasons to adjust their LP. For example, if the price exceeds the tick range, invest funds in other DEFI, etc. Therefore, as an active user, They will definitely adjust their LP according to the price. For those who deposit their funds in the pool and never adjust it, we consider this user to be inactive and not included in the statistics.

Another more troublesome situation is that the position mint some liquidity before 2023, and then performed some mint/burn operations during the cycle. By the end of the statistics, all the liquidity was not burned. Therefore, we only A part of the liquidity can be counted. In this case, the sunk liquidity will have an impact on the position's fee estimate, causing abnormal returns. The specific reasons will be discussed later.

In the final statistics, polygon has a total of 73,278 positions, while ethereum has 21,210 positions. There are no more than 10 abnormal returns for each chain, proving that this assumption is credible.

6. Obtain the corresponding relationship between address and position

Since the ultimate goal of our statistics is the income of the address, we also need to obtain the corresponding relationship between the address and position. Through this association, we can get the user's specific investment behavior.

In step 1, we did some work to find the associated users of the fund operation (mint/collect). Therefore, as long as we find the sender of mint and the receiver of collect, we can find the corresponding relationship between position and address.

7. Calculate the net value and rate of return of the position

In this step, we need to calculate the net value of each position, and then calculate the rate of return based on the net value

net worth

The net value of Position consists of two parts. One is the liquidity of LP, which is equivalent to the principal of market making. After the user invests funds in Position, the amount of liquidity will not change, but the net value will fluctuate as the price changes. The other part Handling fee income, this part is independent of liquidity, is stored separately in two fields, fee0 and fee1. The net value of handling fees increases with time.

Therefore, at any minute, liquidity is combined with the price of this minute to get the net value of the principal part. The calculation of the handling fee requires the use of the handling fee table calculated in the fourth step.

First, divide the liquidity of this position by the total liquidity of the current pool as the sharing ratio. Then add the handling fees of all ticks included in the tick range of this position to get the handling fee income for this minute.

Expressed as:

拆解 Uniswap V3 数据清洗流程

Finally, add the handling fees of fee0 and fee1 to get the net handling fee. Then add it to the net value of liquidity to get the total net worth.

When calculating net worth, we divide the life cycle of the position based on mint/burn/collect transactions.

When mint trades occur, let liquidity increase
When a burn transaction occurs, the liquidity is reduced. And the value of the liquidity is converted to the fee field (the code of the pool contract also operates in this way)
When a collect transaction occurs, a calculation will be triggered. The calculation range is from the last collect to the current time. We will calculate the equity and fee income every minute and get a list.

Finally, summarize the net value list obtained from each collect. Then perform resample and other statistics to obtain the final result.

拆解 Uniswap V3 数据清洗流程

In addition, in order to improve accuracy, we have made two optimizations.

First, for the hour when a transaction occurred (mint/burn/collect), we performed minute-level statistics, and for the hour when no transaction occurred, we performed hour-level statistics. Finally, the results were resampled into hour-level statistics.

Secondly, in the collect event, we can get the sum of liquidity + handling fee. Therefore, we can compare the actual collect value with our theoretical calculation value to get the difference between the theoretical handling fee and the actual handling fee (actually this difference The value also includes the difference of lp principal, but the error of the difference of lp principal is very small and can basically be considered as 0). We will compensate the handling fee difference to each row. In order to improve the accuracy of handling fee estimation (also These are the fee_modify0 and fee_modify1 fields in the table above).

Notice:

When backfilling, the allocation of handling fees must be weighted based on the liquidity of the current hour, otherwise the handling fees for this hour will be too high.
Since the statistical data is for the whole year of 2023, rather than complete data, there is the phenomenon of sunk liquidity mentioned in Section 5. This will make the actual handling fees much higher than the theoretical handling fees, making the yield rate extremely high. .

Since each row is the data at the last moment of this hour, for the position that has been completely closed, the net value will be 0. In this case, the net value at the time of the position close will be lost. In order to retain this net value, a line is created at the end of the file The data at the time of 2038-1-1 00:00:00 stores the net value and other data at the time of position close. To prepare for the statistical needs of other projects.

rate of return

Normally, the rate of return is calculated by dividing the starting equity by the ending equity. But this does not apply here. The reasons are as follows:

The rate of return here needs to be refined to every minute,
Since the position will have funds transferred in and out in the middle, simply dividing the net value at the beginning and end cannot reflect the income.

For question 1, we can divide the net value per minute to get the rate of return per minute, and then multiply the rate of return per minute to get the total rate of return.

拆解 Uniswap V3 数据清洗流程

But this algorithm has a serious problem. If there is a data calculation error in the rate of return per minute, it will lead to a large deviation in the total rate of return. In this way, the statistical process becomes a tightrope walk, and no mistakes can be made. But good On the one hand, this leaves no room for statistical errors.

Regarding question 2, if there is a transfer of funds in and out this minute, directly dividing by the rate of return will still result in a very outrageous rate of return. Therefore, it is necessary to refine the rate of return algorithm per minute.

The first attempt we made was to break down the changes in net worth in detail, and then eliminate the changes in funds. We split the changes in net worth into several parts. 1 is the change in principal brought about by the price. 2 is The accumulated handling fees for this minute. 3 is the inflow and outflow of funds. Obviously, 3 must be excluded from the statistics. For this, we have formulated the following calculation method:

Specify that the current minute is n and the previous minute is n-1
Assume that all transfer operations in the current minute occur at the n:0.000th second. Then in the remaining time, the net value of LP is unchanged, that is to say, the net value at the n:0.001st second is equal to the net value at the n:59.999th second.
The accumulation of handling fees occurs at the end of this minute, which is the n:59.999th second.
The price and handling fee at the end of the previous minute (n-1:59.999) are the prices and handling charges at the beginning of this minute (n:0.000)

Based on the above assumptions, the rate of return per minute is to divide the liquidity/price/handling fee at the end by the liquidity/price/handling fee at the end, expressed as follows, where f refers to converting liquidity to Net worth algorithm.

拆解 Uniswap V3 数据清洗流程

This method looks very good. It perfectly avoids changes in liquidity. And reflects the impact of prices and fees on net worth. This is exactly what we expect. However, in practice, it will occur in some rows Big yield. After investigation we found that the problem occurs when withdrawing liquidity. Recall our rules: the time represented by each row is the end of the minute/hour. This provides uniformity for the statistics of the data scale, but it should be noted that the meaning of each column is different:

Xiaobai Navigation

For the net value column, it is the instantaneous value, which is the last value of the current minute/hour.
The handling fee column is the cumulative value. That is, the handling fees accumulated during the current minute/hour.

So for that hour of burn liquidity

When the LP is burned and the token is transferred away, the net value will be 0 at the end of this hour
As for the handling fee, since it is cumulative, at the end of this hour, the handling fee will be greater than 0.

This reduces the above formula to:

This situation will not only occur at the end of the position life cycle, but also when part of the liquidity is burned, it will also cause a change in the ratio of the increase in handling fees to the LP's net worth.

For the sake of simplicity, when the net value of LP changes, we set the rate of return to 1. This will bring errors to the calculation results of the rate of return. But for a normal continuous investment position, the hour of transaction generation is relative to The entire life cycle is still very small. Therefore, the impact is not significant.

8. Calculate the total LP income of the address

With the rate of return of each position, plus the corresponding relationship between position and address, we can get the rate of return of the user address in each position.

The algorithm here is relatively simple. The positions of this address in different periods are connected in series. There is no investment period in the middle, the net value is set to 0, and the rate of return is set to 1 (because the net value before and after is 0, there is no change, so the rate of return is 1.)

If there are multiple positions in the same period, add the net values in the overlapping parts to get the total net value. When merging the returns, we will weight the merger according to the net value of each position.

9. Combined total cash and LP returns

Finally, as long as the cash held by the user address and the LP investment are combined, the final result can be obtained.

The merging of net worth is simpler than the previous step (merging positions). As long as you find the time range on the LP net worth, then look up the cash held in the corresponding time range, and then find out the price of eth, you can get the total net worth. .

For the rate of return, we also use the algorithm of finding the rate of return per minute and then multiplying it. At the beginning, we used the error rate of return algorithm mentioned in Section 7. This requires that the fixed part of this minute (including cash) The amount of cash, liquidity in LP) and the variable part (price changes, fee accumulation, fund transfers in and out) are separated. Compared with the statistics of position, its complexity is much higher, because for the inflow and outflow of uniswap funds , just pay attention to the mint and collect events. The traceability of cash is very troublesome. We have to distinguish whether the funds are transferred to LP or transferred to the outside. If it is transferred to LP, the principal part can remain unchanged. If it is transferred to the outside, , to correct the amount of principal. This requires tracking the transfer destination address of erc20 and eth. This work is very troublesome. First of all, during mint/collect, the transfer address may be a pool or a proxy. What is more complicated is the eth one. For transfers, since eth is a native token, some transfer records can only be found through trace records. However, the amount of trace data is too large and exceeds our processing capabilities.

The last straw that broke the camel's back was when we discovered that the net value of each line is the instantaneous value of this hour, and the handling fee is the cumulative value of this hour, which cannot be directly added in a physical sense. This problem was indeed discovered very late.

Therefore, we gave up this algorithm. Instead, we used the net value of the next minute, divided by the net value of the previous minute. This method is much simpler. But there is also a problem with this method. That is when funds are transferred in and out. , the rate of return will still be unreasonable. Through the above discussion, we know that it is very difficult to separate the flow of funds. Therefore, here we sacrifice some accuracy and set the rate of return when there is a fund transfer to 1.

The remaining question is, how to identify the inflow and outflow of funds in the current hour? The algorithm I thought of at the beginning was very simple. Using the token balance of the previous hour and the current price, we can calculate the net value of this hour if we hold these tokens. What will it be? Then just subtract the estimated value from the actual value. When the difference is not equal to the actual value, it means there is a transfer of funds in and out. The formula is expressed as:

拆解 Uniswap V3 数据清洗流程

However, this algorithm ignores the complexity of uniswap LP. In LP, the number of tokens will change as the price changes, and the net value will also change accordingly. And this method does not take into account changes in handling fees. Ultimately, the estimated value will be different from the actual value. The value has an error of about 0.1%.

In order to improve the accuracy, the structure of the funds is refined, the value change of LP is calculated separately, and the handling fee is also taken into account.

拆解 Uniswap V3 数据清洗流程

In this way, the error of the estimated value can be controlled within 0.001%.

In addition, we limit the decimal of the data to avoid division by too small numbers (usually below 10^-10). These small numbers are errors accumulated from various calculations and resamples. If the direct correlation is not processed, Dividing will cause the error to be amplified, seriously distorting the rate of return.

Dismantling Uniswap V3 data cleaning process

Introduction

Overall process

1. Get Uniswap data

2. Get the cash held by the address

3. Obtain price

4. Handling fee statistics

5. Get position list

6. Obtain the corresponding relationship between address and position

7. Calculate the net value and rate of return of the position

net worth

rate of return

8. Calculate the total LP income of the address

9. Combined total cash and LP returns

Other questions

native token

missing position

Bitcoin spot ETF launches, SEC tames the elephant in the room

Report Interpretation Analysis of North Korean Hackers, Phishing Groups and Money Laundering Tools

相关文章

latest articles

Popular URLs