How to use news articles to predict BTC price changes
Bitcoin (BTC) prices are volatile for many reasons, such as the widely differing values the public assigns to it and high-profile losses. In this article, we focus on one major factor: the influence of BTC news articles. Thanks to BTC's past momentum and its large share of the cryptocurrency market, a massive group of BTC traders already exists. However, a quantitative, analytical methodology for making decisions in this market is still lacking, so we set out to help fill this gap with machine learning algorithms.
Using several vanilla supervised machine learning algorithms and exploring semi-supervised ones, we built a model that predicts whether a given news article will cause a distinct BTC price increase or decrease, to help BTC traders make more reasoned decisions about selling or buying BTC.
Our project is roughly split into three major parts: financial data processing, news article processing, and the final prediction models. We can't wait to walk through the whole process!
Financial Data processing
We took second-by-second BTC datasets from 01/01/2018 to 04/01/2018 from Kaggle. After being converted to a day-based dataset, the data consists of the following information.
According to a stock market rule of thumb, a distinct change in a stock's price is a move of more than 2 standard deviations from the rolling average price of the past 30 days, but there is no generally acknowledged definition of a distinct BTC price change. Because the BTC price changes dramatically and constantly, it is also hard to test which number of standard deviations best separates distinct changes. Therefore, we use the same 2-standard-deviation threshold for BTC as in the stock market. The code is as follows:
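A minimal sketch of this labeling rule, not necessarily the exact implementation we ran. It assumes a plain list of daily prices, and the function name `label_price_moves` and its parameters are hypothetical. It also assumes the rolling window covers the 30 days before the current day, so the current day's price does not affect its own threshold.

```python
from statistics import mean, stdev

def label_price_moves(prices, window=30, n_std=2.0):
    """Label each day +1 (distinct increase), -1 (distinct decrease) or 0.

    A day is flagged when its price lies more than `n_std` standard
    deviations away from the mean of the previous `window` days.
    Days without a full window of history are labeled 0.
    """
    labels = []
    for i, price in enumerate(prices):
        if i < window:
            labels.append(0)  # not enough history yet
            continue
        history = prices[i - window:i]  # the previous `window` days only
        avg, sd = mean(history), stdev(history)
        if price > avg + n_std * sd:
            labels.append(1)
        elif price < avg - n_std * sd:
            labels.append(-1)
        else:
            labels.append(0)
    return labels
```

Calling `label_price_moves(daily_prices)` on the day-based dataset would return one marker per day, which can then be used to split the dates into groups.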
Based on this metric, we identified and marked the dates with distinct BTC price changes in our selected period. We then divided the dates into three groups: the first with distinct price increases, marked as positive (date: 3/19/2018); the second with distinct price decreases, marked as negative (dates: 2/5/2018, 3/15/2018, 3/17/2018, 3/18/2018); and a third, trivial group with price changes within 2 standard deviations.
Now we move on to the second part, news article processing. How these two parts connect will be explained in the third part.
The contents extracted from news articles take the following form:
The lead of a news article always contains the most important information, and many readers only skim the introductory paragraph, so for the content column we extracted only the lead of each article.
We take as our keyword list the 100 most frequent words across all the extracted introductory paragraphs (after lemmatization, stemming and stop-word removal), and we deem them the most relevant to BTC trading and price. If a news article does not mention any of these keywords in its first 300 words, we deem it irrelevant to BTC and exclude it from the articles to be explored. (Our empirical analysis suggests that the first 300 words capture the main idea of a news article; we use them instead of the introductory paragraph because the lead is sometimes shorter than 300 words and we want to cover more articles.)
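The filtering step can be sketched roughly as below. This is a simplified illustration: it skips lemmatization and stemming, uses a tiny stand-in stop-word list, and the names `top_keywords` and `is_relevant` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real run would use a full list
# (plus lemmatization and stemming, which this sketch omits).
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "for", "on", "that", "it"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(leads, k=100):
    """Return the k most frequent non-stop-word tokens across all leads."""
    counts = Counter(
        tok for lead in leads for tok in tokenize(lead) if tok not in STOP_WORDS
    )
    return {word for word, _ in counts.most_common(k)}

def is_relevant(article_text, keywords, first_n=300):
    """True if any keyword occurs in the first `first_n` words of the article."""
    head = tokenize(article_text)[:first_n]
    return any(tok in keywords for tok in head)
```

With the keyword set built once from all leads, `is_relevant` can be applied to each article to decide whether it stays in the dataset.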
The following is part of the 100 words.
Matched with the above three date groups, the news articles are also separated into three groups. Among the three, the two groups with distinct price changes are our focus.
Next we processed our features. So far we have the author and publisher of each article. To convert this textual data into numeric data, we first ranked authors and publishers by the number of articles they released among our selected articles. Then, based on the distribution of article counts, we divided the authors and publishers into five groups, numbered 1 to 5, so that each group has roughly the same population.
The reason we did not simply assign each author and publisher a category code is that only tree-based algorithms such as Random Forest can handle such codes sensibly, while non-tree-based algorithms such as SVM treat them as meaningful numeric values, which would strongly affect the model's accuracy.
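A minimal sketch of the grouping step, assuming that "same population" means the same number of authors (or publishers) per group and that group 5 holds the most prolific ones; neither detail is fixed by the description above, and `rank_groups` is a hypothetical name.

```python
from collections import Counter

def rank_groups(names, n_groups=5):
    """Map each name to a group 1..n_groups by article count.

    Names are sorted from fewest to most articles and cut into
    n_groups chunks of (nearly) equal size, so group n_groups holds
    the most prolific names. Ties are broken alphabetically.
    """
    counts = Counter(names)  # one entry in `names` per article
    ordered = sorted(counts, key=lambda name: (counts[name], name))
    total = len(ordered)
    return {
        name: position * n_groups // total + 1
        for position, name in enumerate(ordered)
    }
```

Passing the author column (one entry per article) gives a dictionary that can be used to replace each author name with its group number; the publisher column can be handled the same way.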
News articles have one particularly important property: the prevailing emotions expressed in them. Plenty of organizations have worked on sentiment analysis of news articles, with tools such as Microsoft's Azure, Google's Cloud Natural Language API and Python's NLTK package.
We explored both Google's API and Azure and compared each with our own judgments of the emotions in the articles; Google's worked better in our model. In many studies, Google's API is also reported to have the highest accuracy among these tools.
To use the Google API or Azure, you need to obtain authentication credentials. For the detailed setup process for running sentiment analysis in Python, refer to https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data
The results returned by the Google API have 4 columns, of which we use only two: score and magnitude. Score corresponds to the overall emotion (negative, neutral or positive) and ranges from -1 to 1: -1 means a completely negative mood, 1 is the positive extreme, and 0 means neutral emotion, with intermediate values falling in between. Score is normalized, which means that the number of words in a news article does not influence it. Magnitude measures the overall strength of both positive and negative emotions. Unlike score, magnitude is not normalized, so it tends to grow with the article's length, and it ranges from 0 to +inf. For example, if an author expresses both strong positive and strong negative emotions that cancel each other out, the overall emotion delivered by the article is neutral. Magnitude, however, accumulates all the emotions, so the score might be 0 while the magnitude is 3 or some other large number.
The following are some examples given by Google:
(Exhaustive documentation can be found in
To keep an article's word count from influencing the magnitude, we normalized the magnitude as well.
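One simple way to do this, shown here only as an assumption since the exact normalization is not spelled out above, is to divide the magnitude by the article's word count. The function name `normalize_magnitude` is hypothetical.

```python
def normalize_magnitude(magnitude, text):
    """Divide the sentiment magnitude by the article's word count.

    Assumption: per-word normalization. Returns 0.0 for empty text
    to avoid dividing by zero.
    """
    n_words = len(text.split())
    return magnitude / n_words if n_words else 0.0
```

With this choice, a long article and a short one with the same emotional intensity per word end up with comparable values.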
Within a group of news articles, some important words may appear many times in one article while occurring rarely in the others. The repetition intensifies the visual effect as readers go through an article, so readers are more likely to respond to such articles. The TF-IDF score helps us measure this influence. TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is the product of two parts.
The first part is Term Frequency, which measures the frequency of every word in a document (in our study, a news article) of a corpus with the equation,
n_i,j is the number of times word i appears in document j, and the denominator is the total number of words in document j.
The second part is Inverse Document Frequency, which weights rare words across all our documents (all the news articles) as follows,
N is the total number of documents, and df_t is the number of documents that contain word t. Therefore, words that appear in only a small number of documents will have a high IDF score due to a low value of df_t.
Finally, we get the TF-IDF score by multiplying the above two parts,
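Written out with the symbols defined above, the three quantities take the following standard textbook forms. The logarithm base and the absence of smoothing are assumptions here, since the exact variant is not stated; for the product, the word index t in the IDF term is simply word i.

```latex
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}},
\qquad
\mathrm{idf}_{t} = \log\frac{N}{\mathrm{df}_{t}},
\qquad
\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_{i}
```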
so if word i appears many times in one news article but seldom in the others, leading to a high TF-IDF score, we can conclude that this word is significant in that article. After deleting stop words, we deem words with a high TF-IDF score to be the important words in the news, so we averaged the 5 highest TF-IDF scores of each article to measure the visual effect. The calculated result looks like this:
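The whole TF-IDF feature can be sketched in a few lines. This is a minimal illustration that assumes each article is already tokenized with stop words removed and uses the unsmoothed natural-log IDF; `tfidf_scores` and `top_k_mean` are hypothetical names.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

def top_k_mean(word_scores, k=5):
    """Average the k highest tf-idf values of one document."""
    top = sorted(word_scores.values(), reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0
```

Applying `top_k_mean` to each dictionary returned by `tfidf_scores` gives one TF-IDF feature value per article.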
Cheers! With all the food washed and the condiments mixed, the only task left is to pour everything into the pot and boil. Let's check out the prediction models now!
We have 5 features in total in our model: publisher, author, sentiment score, sentiment magnitude and TF-IDF score. The data looks like this:
The predictions are binary. If the predicted value is 0, the news article is expected to have no influence on the BTC market and will not cause a distinct price change. Otherwise, the article is expected to give rise to a distinct price change. BTC investors can refer to the results, released instantly on our BTC app, when deciding whether to buy or sell, and reacting promptly can help them save or earn a reasonable amount of money.
We tested the quality of the predictions with two metrics. First, we used the traditional metric, accuracy. In addition, for the algorithms with a built-in probability output, we drew the ROC curve and reported its AUC score. We also explored ROC-AUC because we wanted to pay more attention to the positive result, 1. If BTC traders see 1 for most of a day's news articles, they will infer that the BTC price is highly likely to increase and buy more BTC to speculate on it. However, if the price actually decreases, the buyers bear a considerable loss. Conversely, if the price is predicted to decrease but actually increases, the sellers' loss is relatively small compared with the buyers'. In summary, the positive class deserves more attention because errors on it are costly. Together, accuracy and ROC-AUC let us identify the best-performing algorithm.
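To make the two metrics concrete, here is a small self-contained sketch. It computes AUC through its pairwise interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counting half), which is equal to the area under the ROC curve. The function names are hypothetical, and in practice a library routine would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, y_score):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Accuracy needs hard 0/1 predictions, while `roc_auc` needs the predicted probability of class 1, which is why only the algorithms with a probability output get an AUC score.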
We used Logistic Regression, Random Forest, SVM, KNN, Perceptron and XGBoost for the prediction. The accuracy results are listed below:
Random Forest is the most impressive algorithm, with the highest training and test accuracy among the 6 algorithms.
ROC-AUC led to the same conclusion, with Random Forest having the largest AUC score of all the algorithms:
In real situations, it is hard to collect as much data as we would like, so we explored semi-supervised algorithms, which use only the available labeled part of the data to generate more self-labeled data and then make the prediction.
Semi-supervised learning is already a well-established field, so I am not going to go into its details here. Basically, there are three kinds of semi-supervised learning: active learning, pure transductive learning and transductive learning, each with plenty of models. Projects should be matched with the semi-supervised models that fit their datasets' characteristics and ultimate goals. Because our model's objective function is not convex, we chose the Transductive Support Vector Machine (TSVM).
We used 100 labeled samples and 1014 unlabeled ones to predict 478 unlabeled test samples. TSVM reached 67% accuracy. Compared with the 60% of the supervised SVM, TSVM shows an improvement, which is the main benefit of semi-supervised learning when the dataset is limited.
We also looked into a graph-based model, Label Propagation. Its methodology can be summarized by the following picture:
However, one of its key inputs is labeled data provided by experts, which is meant to continually improve the labels and which we lack. In the future, we will dig deeper into this method and try to obtain useful output from it.
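For readers unfamiliar with the idea, here is a toy sketch of graph-based label propagation. It is not the model we ran; it only illustrates the mechanism on a small weighted graph, under the assumptions that labels are binary and that known labels stay fixed in every iteration. The function name `propagate_labels` is hypothetical.

```python
def propagate_labels(weights, labels, n_iter=100):
    """Spread known labels over a weighted graph.

    weights: symmetric matrix (list of lists) of non-negative edge weights.
    labels:  1 or 0 for labeled nodes, None for unlabeled nodes.
    Returns one score in [0, 1] per node; higher means closer to class 1.
    """
    scores = [0.5 if lab is None else float(lab) for lab in labels]
    for _ in range(n_iter):
        updated = []
        for i, lab in enumerate(labels):
            if lab is not None:
                updated.append(float(lab))  # keep known labels fixed
                continue
            total = sum(weights[i])
            if total == 0:
                updated.append(scores[i])  # isolated node: nothing to propagate
                continue
            # Unlabeled node takes the weighted average of its neighbours.
            updated.append(sum(w * s for w, s in zip(weights[i], scores)) / total)
        scores = updated
    return scores
```

In an article setting, the edge weights could come from a similarity measure between articles, and the final scores of the unlabeled nodes could be thresholded at 0.5 to obtain labels.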
Future exploration and reflection
Although our model has produced satisfying results, some potential problems still need to be considered. One that has already come up concerns neural networks, which we want to explore in the future.
Within the constraints of the Paradigm NLP formulation(s), Neural Networks (NN's) can extract the information contained in an article at the present time, but they cannot extract information associated with future events. For example, if I am a citizen of Israel and I read that someone-that-matters was assassinated, then, being an informed reader with a grasp not only of history but also of retaliatory capabilities, I can infer the consequences of such an act, say economic and/or military retaliation. A complete outsider, on the other hand, would read the same article and only gather that someone was assassinated and that this person mattered enough to be written about. NN's are like the latter.
So we have to figure out whether there are ways, based on ML formulations/approaches, to bridge the gap between sophisticated readers and naive readers.
As a future application, we plan to develop a chatbot to automate the display of our predictions. The prototype interface is as follows:
It has three main parts: a chat window for asking questions, a real-time BTC price chart for reference, and news articles labeled with author and publisher as well as sentiment score, magnitude and TF-IDF score.