Clustering Ethereum Addresses – Towards Data Science
Categorizing addresses using patterns in transaction activity
I built a clustering algorithm based on transaction activity that divides Ethereum users into distinct behavioral subgroups. It can predict whether an address belongs to an exchange, miner, or ICO wallet.
The database was constructed using SQL, and the model was coded in Python. Source code is available on GitHub.
The Ethereum blockchain is a platform for decentralized applications called smart contracts. These contracts are often used to represent assets, which can be physical objects in the real world (like real estate titles) or purely digital objects (such as utility tokens).
The computations required to execute smart contracts are paid for in ether, the native currency of the ecosystem.
Ether is stored in cryptographically secured accounts called addresses.
Ethereum is widely used, and its flexibility results in a rich, public dataset of transactional behavior. Because Ethereum addresses are unique identifiers whose ownership does not change, their activity can be tracked, aggregated, and analyzed.
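To make "tracked, aggregated, and analyzed" concrete, here is a minimal sketch of per-address aggregation using only the Python standard library. The record layout, the addresses, and the feature names (`n_out`, `n_in`, `mean_tx_size`, `gap_std`) are hypothetical and chosen for illustration. They are not the feature set used in the actual model.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical transaction records:
# (sender, receiver, value_in_ether, unix_timestamp)
transactions = [
    ("0xaaa", "0xbbb", 5.0, 1000),
    ("0xaaa", "0xccc", 3.0, 1600),
    ("0xbbb", "0xaaa", 1.0, 2200),
    ("0xaaa", "0xbbb", 4.0, 2200),
]


def address_features(txs):
    """Aggregate raw transactions into one feature dict per address."""
    sent = defaultdict(list)      # values sent by each address
    received = defaultdict(list)  # values received by each address
    times = defaultdict(list)     # timestamps of every transaction touching an address

    for sender, receiver, value, ts in txs:
        sent[sender].append(value)
        received[receiver].append(value)
        times[sender].append(ts)
        times[receiver].append(ts)

    features = {}
    for addr, stamps in times.items():
        stamps = sorted(stamps)
        # Time between consecutive transactions; its spread measures regularity.
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        values = sent[addr] + received[addr]
        features[addr] = {
            "n_out": len(sent[addr]),
            "n_in": len(received[addr]),
            "mean_tx_size": mean(values),
            "gap_std": pstdev(gaps) if len(gaps) > 1 else 0.0,
        }
    return features


features = address_features(transactions)
for addr, row in features.items():
    print(addr, row)
```

Each address ends up as one row of numeric features, which is the shape a clustering algorithm expects.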
Here, I attempt to create user archetypes by effectively clustering the Ethereum address space. These archetypes could be used to predict the owner of an unknown address.
This opens up a wide array of applications:
- understanding network activity
- enhancing trading strategies
- improving AML activities
Participants in the Ethereum ecosystem can be separated by patterns in their transaction activity. Addresses known to belong to exchanges, miners, and ICOs provide qualitative evidence that the clustering results are accurate.
Feel free to skip to Interpreting the Results below.
Choosing the Appropriate Number of Clusters
Using silhouette analysis, I determined the optimal number of clusters to be roughly 8.
This choice minimizes the number of samples with negative silhouette scores, which indicate that a sample may be assigned to the wrong cluster.
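The selection rule described above can be sketched with scikit-learn. This is an illustration under assumptions, not the original analysis: the data is synthetic (`make_blobs` stands in for the real per-address feature matrix), k-means is assumed as the clustering algorithm, and the tie-break on mean silhouette score is my own addition.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the per-address feature matrix (assumption).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

results = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    per_sample = silhouette_samples(X, labels)  # one score in [-1, 1] per sample
    results[k] = {
        "mean_score": float(silhouette_score(X, labels)),
        # A negative score means the sample is, on average, closer to a
        # neighboring cluster than to the one it was assigned to.
        "n_negative": int((per_sample < 0).sum()),
    }

# Prefer the k with the fewest negative scores; break ties by higher mean score.
best_k = min(results, key=lambda k: (results[k]["n_negative"], -results[k]["mean_score"]))

for k, r in results.items():
    print(k, round(r["mean_score"], 3), r["n_negative"])
print("chosen k:", best_k)
```

On real data the per-sample scores are usually also plotted per cluster, since a single summary number can hide one badly formed cluster.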
But how do we know if it’s working?
By scraping data from the Etherscan.io block explorer, I gathered crowdsourced labels for 125 addresses in my dataset.
The majority of labels fell into three categories: exchanges, miners, and ICO wallets.
Clustering is an unsupervised machine learning technique, so I could not use labels to train my model. Instead, I used them to assign user archetypes to clusters, based on the highest label density for each cluster. Results can be found here.
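One way to read "highest label density" is: for each cluster, count the labeled addresses of each category, divide by the cluster size, and keep the category with the largest share. The sketch below implements that reading with hypothetical addresses and labels. The exact density definition used in the original analysis is not stated, so treat this as an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical clustering output: address -> cluster id.
cluster_of = {
    "0x01": 0, "0x02": 0,
    "0x03": 1, "0x04": 1, "0x05": 1,
    "0x06": 2,
}

# Hypothetical crowdsourced labels for a subset of addresses.
known_labels = {
    "0x01": "exchange",
    "0x02": "exchange",
    "0x03": "miner",
    "0x04": "miner",
    "0x05": "ico",
}


def archetype_per_cluster(cluster_of, known_labels):
    """Return {cluster: (label, density)} for clusters with at least one label."""
    cluster_sizes = Counter(cluster_of.values())
    label_counts = defaultdict(Counter)
    for addr, label in known_labels.items():
        label_counts[cluster_of[addr]][label] += 1

    archetypes = {}
    for cluster, counts in label_counts.items():
        label, n = counts.most_common(1)[0]
        # Density = labeled addresses of this category / all addresses in the cluster.
        archetypes[cluster] = (label, n / cluster_sizes[cluster])
    return archetypes


archetypes = archetype_per_cluster(cluster_of, known_labels)
print(archetypes)
```

Clusters that contain no labeled addresses simply get no archetype. The labels are only used after clustering, so the model itself stays unsupervised.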
Exchange and miner addresses were mixed together in the same cluster at first. To separate them, I performed a second round of clustering, using only the addresses in that cluster.
By changing the dissimilarity measure from Euclidean distance to cosine distance, I dramatically improved separation between exchanges and miners.
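Cosine distance compares the direction of two feature vectors and ignores their magnitude, which helps when two groups differ in the proportions of their features but overlap in overall scale. The sketch below shows one common way to approximate cosine-based clustering: L2-normalize each row, then run ordinary k-means. For unit vectors, squared Euclidean distance equals 2 × (1 − cosine similarity), so the two measures rank neighbors identically. The data is synthetic, and the choice of k-means is an assumption, not a description of the original re-clustering step.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)

# Two hypothetical groups with different feature proportions but widely
# varying overall magnitude (for example, small and large accounts).
scale = rng.uniform(1, 100, size=(100, 1))
group_a = scale[:50] * (np.array([1.0, 0.2]) + rng.normal(0, 0.05, size=(50, 2)))
group_b = scale[50:] * (np.array([0.2, 1.0]) + rng.normal(0, 0.05, size=(50, 2)))
X = np.vstack([group_a, group_b])
truth = np.array([0] * 50 + [1] * 50)

# Euclidean k-means on the raw features.
euc_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Cosine-style k-means: project every row onto the unit sphere first.
X_unit = normalize(X)  # L2 norm by default
cos_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)


def agreement(labels, truth):
    """Fraction of samples matching the true grouping, ignoring label order."""
    match = float((labels == truth).mean())
    return max(match, 1.0 - match)


print("euclidean agreement:", agreement(euc_labels, truth))
print("cosine agreement:   ", agreement(cos_labels, truth))
```

Comparing the two agreement values shows how much of the Euclidean result is driven by magnitude rather than by the shape of the behavior.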
By substituting results from re-clustering into the original analysis, we end up with 9 clusters.
Interpreting the Results
We can draw conclusions about user behavior based on the corresponding cluster centroids.
Exchanges
- High ether balance
- High incoming and outgoing transaction volume
- Highly irregular time between transactions
Exchanges are the banks of the crypto space. These results are intuitive.
Miners
- Low ether balance
- Small average transaction size
- More regular time between transactions
Miners secure the blockchain by expending computational power, and are rewarded with ether. Groups of miners often “pool” their resources to reduce variance in payouts, splitting the proceeds based on resources contributed.
ICO Wallets
- High ether balance
- Small number of large transactions
- Most regular time between transactions
ICOs (Initial Coin Offerings) are a common fundraising method for crypto startups. It makes sense that these startups would have large war chests, and periodically sell large amounts to cover regular business expenses.
- The Exchange and Mining clusters are highly similar, as they were created in the second round of clustering.
- Addresses in cluster 7 have a large amount of smart contract activity.
- Clusters 2 and 5 are highly distinct.