Clustering Ethereum Addresses – Towards Data Science


Introduction

Ethereum users may be anonymous, but their addresses are unique identifiers that leave a trail publicly visible on the blockchain.

I built a clustering algorithm based on transaction activity that divides Ethereum users into distinct behavioral subgroups. It can predict whether an address belongs to an exchange, miner, or ICO wallet.

The database was constructed using SQL, and the model was coded in Python. Source code is available on GitHub.

3D representation of Ethereum address feature space using T-SNE

Background

The Ethereum blockchain is a platform for smart contracts, the programs that power decentralized applications. These contracts are often used to represent assets, which can be physical objects in the real world (like real estate titles) or purely digital objects (such as utility tokens).

The computations required to execute smart contracts are paid for in ether, the native currency of the ecosystem.

Ether is held in cryptographically secured accounts, each identified by a unique address.

Motivation

Many people believe that cryptocurrencies offer digital anonymity, and there is some truth to that belief. In fact, anonymity is the core mission of Monero and Zcash.

Ethereum, however, is more widely used, and its broad flexibility results in a rich, public dataset of transactional behavior. Because Ethereum addresses are unique identifiers whose ownership does not change, their activity can be tracked, aggregated, and analyzed.

Here, I attempt to create user archetypes by effectively clustering the Ethereum address space. These archetypes could be used to predict the owner of an unknown address.

This opens up a wide array of applications:

  • understanding network activity
  • enhancing trading strategies
  • improving anti-money-laundering (AML) efforts

Results

Participants in the Ethereum ecosystem can be separated by patterns in their transaction activity. A qualitative check against addresses known to belong to exchanges, miners, and ICOs indicates that the resulting clusters are accurate.

Technical Details

Feel free to skip to Interpreting the Results below.

Feature Engineering

The Ethereum transaction dataset is hosted on Google BigQuery. Using the 40,000 addresses with the highest ether balances, I created 25 features to characterize differences in user behavior.

Features derived for each address
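To make the idea of per-address features concrete, here is a minimal sketch in plain Python. The `Tx` record, the `address_features` helper, and the five features it computes are illustrative assumptions; they are not the 25 features used in the analysis, and a real pipeline would more likely aggregate in SQL on BigQuery.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Tx:
    """Hypothetical, simplified transaction record."""
    sender: str
    receiver: str
    value_eth: float
    timestamp: int  # Unix seconds


def address_features(txs):
    """Return {address: feature dict} for a few illustrative features."""
    sent = defaultdict(list)      # address -> values it sent
    received = defaultdict(list)  # address -> values it received
    times = defaultdict(list)     # address -> timestamps of all its transactions

    for tx in txs:
        sent[tx.sender].append(tx.value_eth)
        received[tx.receiver].append(tx.value_eth)
        times[tx.sender].append(tx.timestamp)
        times[tx.receiver].append(tx.timestamp)

    features = {}
    for addr, stamps in times.items():
        stamps = sorted(stamps)
        gaps = [b - a for a, b in zip(stamps, stamps[1:])]
        features[addr] = {
            "n_out": len(sent[addr]),
            "n_in": len(received[addr]),
            "avg_out_eth": mean(sent[addr]) if sent[addr] else 0.0,
            "avg_in_eth": mean(received[addr]) if received[addr] else 0.0,
            # Spread of the time between transactions: low means regular activity.
            "gap_std_s": pstdev(gaps) if len(gaps) > 1 else 0.0,
        }
    return features


txs = [
    Tx("0xA", "0xB", 2.0, 0),
    Tx("0xA", "0xB", 4.0, 100),
    Tx("0xB", "0xA", 1.0, 200),
]
features = address_features(txs)
```

Each address ends up as one row of numbers, which is the form a clustering algorithm needs.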

Choosing the Appropriate Number of Clusters

Using silhouette analysis, I determined the optimal number of clusters to be roughly 8.

This choice minimizes the number of samples with negative silhouette scores, which indicate that a sample may be assigned to the wrong cluster.
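The selection rule described above can be sketched with scikit-learn. This is an illustration under assumptions: the post does not name the clustering algorithm, so k-means is assumed here, and the synthetic blobs stand in for the scaled address features. The helper name `negative_silhouette_counts` is made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples


def negative_silhouette_counts(X, candidate_ks, seed=0):
    """Map each candidate k to the number of samples with a negative silhouette score."""
    counts = {}
    for k in candidate_ks:
        # Assumption: k-means; any algorithm that returns labels would work here.
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores = silhouette_samples(X, labels)  # one score per sample, in [-1, 1]
        counts[k] = int((scores < 0).sum())
    return counts


# Synthetic stand-in for the real feature matrix: three tight, well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

counts = negative_silhouette_counts(X, range(2, 6))
best_k = min(counts, key=counts.get)  # ties resolve to the smallest k
```

On real data several values of k may tie or come close, which is why the result is best read as "roughly 8" rather than as a sharp optimum.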

But how do we know if it’s working?

By scraping data from the Etherscan.io block explorer, I gathered crowdsourced labels for 125 addresses in my dataset.

The majority of labels fell into three categories: exchanges, miners, and ICO wallets.

Clustering is an unsupervised machine learning technique, so I could not use labels to train my model. Instead, I used them to assign user archetypes to clusters, based on the highest label density for each cluster. Results can be found here.
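One way to read "highest label density" is: within each cluster, pick the known label that covers the largest share of the cluster's addresses. The sketch below follows that reading, which is an assumption, and uses hypothetical addresses and a hypothetical helper name.

```python
from collections import Counter, defaultdict


def assign_archetypes(cluster_of, known_labels):
    """Give each cluster the label that is densest among its members.

    cluster_of:   {address: cluster id} for every clustered address
    known_labels: {address: label} for the small labeled subset
    """
    sizes = Counter(cluster_of.values())
    label_counts = defaultdict(Counter)
    for addr, label in known_labels.items():
        if addr in cluster_of:
            label_counts[cluster_of[addr]][label] += 1

    archetypes = {}
    for cluster, size in sizes.items():
        if label_counts[cluster]:
            label, n = label_counts[cluster].most_common(1)[0]
            archetypes[cluster] = (label, n / size)  # label and its density
        else:
            archetypes[cluster] = (None, 0.0)  # no labeled address landed here
    return archetypes


cluster_of = {"a1": 0, "a2": 0, "a3": 0, "a4": 1, "a5": 1, "a6": 2}
known_labels = {"a1": "exchange", "a2": "exchange", "a3": "miner", "a4": "ico"}
archetypes = assign_archetypes(cluster_of, known_labels)
```

With only 125 labels across 40,000 addresses, most densities will be small, so clusters without any labeled member stay unnamed.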

2D visualization of initial clustering. Known addresses on the left.

Re-clustering

Exchange and miner addresses were mixed together in the same cluster at first. To separate them, I performed a second round of clustering, using only the addresses in that cluster.

By changing the dissimilarity measure from Euclidean distance to cosine distance, I dramatically improved separation between exchanges and miners.
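A small numeric example shows why the choice of measure matters. Cosine distance compares the direction of two feature vectors and ignores their magnitude, so two addresses with the same behavioral profile at different scales look identical. The vectors below are made up. The `l2_normalize` helper shows one common workaround when an algorithm only supports Euclidean distance; whether the original analysis used it is not stated.

```python
import numpy as np


def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))


def cosine_distance(u, v):
    # 1 - cos(angle between u and v); ignores vector magnitude.
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def l2_normalize(X):
    """Scale each row to unit length.

    For unit vectors, squared Euclidean distance equals 2 * cosine distance,
    so Euclidean-based clustering on normalized rows reflects cosine distance.
    """
    return X / np.linalg.norm(X, axis=1, keepdims=True)


# Hypothetical feature vectors: the same profile at two scales,
# plus a third address with a different profile.
small = np.array([1.0, 2.0, 0.5])
large = 100 * small
other = np.array([2.0, 0.5, 1.0])

print(euclidean_distance(small, large), cosine_distance(small, large))
print(euclidean_distance(small, other), cosine_distance(small, other))
```

Under Euclidean distance `small` is closer to `other` than to `large`; under cosine distance the ordering flips.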

Improved separation of exchanges and miners. Known addresses on the left.

By substituting results from re-clustering into the original analysis, we end up with 9 clusters.
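The substitution step amounts to relabeling: the mixed cluster is replaced by its sub-clusters, so 8 clusters with one split in two become 9. A minimal sketch follows; the function name, cluster ids, and labels are hypothetical.

```python
def merge_reclustering(labels, target, sub_labels):
    """Replace cluster `target` in `labels` with the sub-clusters in `sub_labels`.

    sub_labels lists, in order, the sub-cluster (0, 1, ...) of each sample whose
    original label equals `target`. Sub-cluster 0 keeps the id `target`; the
    others receive fresh ids after the current maximum.
    """
    next_id = max(labels) + 1
    sub_iter = iter(sub_labels)
    merged = []
    for label in labels:
        if label == target:
            sub = next(sub_iter)
            merged.append(target if sub == 0 else next_id + sub - 1)
        else:
            merged.append(label)
    return merged


labels = [0, 3, 3, 1, 3, 7, 2, 4, 5, 6]  # first round: 8 clusters, ids 0 to 7
sub_labels = [0, 1, 0]                    # second round, for members of cluster 3
merged = merge_reclustering(labels, target=3, sub_labels=sub_labels)
```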

2D visualization of final clustering results. Known addresses on the left.

Interpreting the Results

We can draw conclusions about user behavior based on the corresponding cluster centroids.

Radar plot — cluster centroid address features
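A cluster centroid is the mean feature vector of the cluster's members. The sketch below computes centroids directly from a feature matrix and its labels, which works for any clustering algorithm; the two-column matrix is a made-up example.

```python
import numpy as np


def cluster_centroids(X, labels):
    """Return {cluster id: mean feature vector of its members}."""
    labels = np.asarray(labels)
    return {int(c): X[labels == c].mean(axis=0) for c in np.unique(labels)}


# Hypothetical columns: [ether balance, transactions per day]
X = np.array([[1.0, 10.0], [3.0, 30.0], [100.0, 0.0]])
centroids = cluster_centroids(X, [0, 0, 1])
```

Comparing centroids feature by feature is what the radar plot does visually.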

Exchanges

  • High ether balance
  • High incoming and outgoing transaction volume
  • Highly irregular time between transactions

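Exchanges are the banks of the crypto space, so these results are intuitive: they hold large reserves and process deposits and withdrawals whenever their customers choose to trade.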

Miners

  • Low ether balance
  • Small average transaction size
  • More regular time between transactions

Miners secure the blockchain by expending computational power, and are rewarded with ether. Groups of miners often “pool” their resources to reduce variance in payouts, splitting the proceeds based on resources contributed.

ICO Wallets

  • High ether balance
  • Small number of large transactions
  • Most regular time between transactions

ICOs (Initial Coin Offerings) are a common fundraising method for crypto startups. It makes sense that these startups would have large war chests, and periodically sell large amounts to cover regular business expenses.

Other categories

  • The Exchange and Mining clusters are highly similar, as they were created in the second round of clustering.
  • Addresses in cluster 7 have a large amount of smart contract activity.
  • Clusters 2 and 5 are highly distinct.

Can you identify any of these user groups?
