Financial News - Noise or Information? [Part I]

 

The inaugural post of this blog, A Random Walk down the Stock Market, sets the stage for the challenging task of predicting stock price movement. An efficient market driven by millions of investors competing against each other to uncover profitable trading patterns eliminates the influence of such patterns on future prices. This conundrum underpins the weak form of the Efficient Market Hypothesis (EMH).

But all hope is not lost. Analysing and quantifying information such as growth prospects and market conditions accurately is a costly endeavour. Analysts or predictive models that could sniff out insights from a large mess of information could potentially be rewarded with higher stock market returns.

This post is the first of a three-part series that explores the feasibility of building analytical models using financial news data to predict stock price movement (up or down). We start off our series by introducing a very simple yet often effective language model to model word usage in news articles, and demonstrate how it could be used to predict stock price movement.

In this analysis, we continue to use the same set of companies listed on the US stock market for model development and prediction.

1. AAPL
2. ADBE
3. AMZN
4. COST
5. CSCO
6. GOOG
7. MSFT
8. NVDA
9. PEPS
10. TSLA

For each of the ten shortlisted stocks, a target variable was created by assigning a label value of TRUE if the adjusted close price today is higher than the adjusted close price of the previous trading day, and FALSE otherwise.
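As a rough illustration of the labelling rule above, the sketch below compares each day's adjusted close with the previous trading day's. The function name `label_up_moves` and the price series are made up for this example and are not taken from the actual analysis code.

```python
def label_up_moves(adj_close):
    """Return one label per day, starting from the second day.

    labels[i] is True when adj_close[i + 1] is strictly higher than adj_close[i].
    """
    return [today > prev for prev, today in zip(adj_close, adj_close[1:])]


# Hypothetical adjusted close prices for five consecutive trading days.
prices = [100.0, 101.5, 101.5, 99.0, 102.0]
labels = label_up_moves(prices)
print(labels)  # [True, False, False, True]
```

Note that an unchanged price is labelled FALSE, which matches the "higher than" wording of the rule.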

The news data used in this analysis was downloaded from Kaggle. It contains historical news on US equities publicly traded on NASDAQ or NYSE. To reduce duplicated content, only news from a single source, namely Reuters, was used. The content of the news articles was cleaned through a series of preprocessing steps: converting text to lowercase, removing numbers and stopwords, and word stemming.
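The cleaning steps could look something like the minimal sketch below. It is only an approximation: the stopword list is a tiny illustrative one, and `naive_stem` is a toy suffix stripper standing in for a proper stemmer (for example a Porter stemmer), so the actual analysis code may well produce different tokens.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a much fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "its", "by"}


def naive_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_text(text):
    """Lowercase, drop digits and punctuation, remove stopwords, then stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(clean_text("Apple shares jumped 5% in 2019 on rising iPhone sales"))
```

The crude stemmer turns "rising" into "ris", which is typical of stemming: the goal is to map related word forms to the same token, not to produce readable words.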

Computers and machine learning models do not understand words as humans do. It is necessary to convert the text in news articles into numbers that a computer is able to analyse. Models that translate text into a machine-understandable format are called language models (LMs). One of the simplest LMs is based on the n-gram, i.e. a sequence of n words. Specifically, the cleaned text we obtained from our preprocessing steps was tokenised into uni-, bi- and tri-grams as exemplified in Figure 1.

Figure 1. Tokenising text into n-grams for further analysis
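To make the tokenisation concrete, here is a small sketch that slides a window of n words over a list of cleaned tokens. The helper name `ngrams` and the example tokens are assumptions for illustration only.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical cleaned tokens from one headline.
tokens = ["apple", "share", "jump", "iphone", "sale"]
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A text with fewer than n tokens simply yields no n-grams of that size.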

These n-grams were subsequently used to compute a statistical measure called term frequency-inverse document frequency (TF-IDF), which was then used as the feature set for supervised machine learning models. The measure reflects how important a word or phrase is to an article in a collection of articles. TF-IDF is computed by multiplying the number of times a word appears in an article (the term frequency) by the inverse of the fraction of documents in which the word appears (the inverse document frequency, usually taken on a logarithmic scale). Based on this measure, a word that appears frequently in a document but is found in only a small fraction of documents is highly important. Prior to more recent developments in LMs such as word2vec and BERT, TF-IDF of n-grams was a popular scheme. For instance, a 2015 literature review found that TF-IDF was the most frequently used approach for text-based recommender systems [1].
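The sketch below computes one plain variant of TF-IDF: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. This exact variant is an assumption; libraries such as scikit-learn apply smoothing and normalisation by default, so their numbers will differ. The documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Three hypothetical cleaned articles.
docs = [
    ["apple", "iphone", "sale", "iphone"],
    ["apple", "earnings", "beat"],
    ["apple", "lawsuit"],
]
for weights in tf_idf(docs):
    print(weights)
```

Here "apple" appears in every document, so its weight is zero everywhere, while "iphone" appears twice in one document only and receives the largest weight.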

In this analysis, the TF-IDF of news articles relating to the ten selected stocks over a 3-day period was used as features for predicting stock price movement one day ahead. For instance, the TF-IDF of news articles about AAPL published between 1st Jan 2019 and 3rd Jan 2019 was used to predict AAPL's price movement on 4th Jan 2019.
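One way to assemble the 3-day input window is sketched below. It pools the tokens of all articles published in the three calendar days before the target day and excludes same-day news to avoid look-ahead. Whether the original analysis pooled tokens like this, or counted trading days rather than calendar days, is not stated, so treat both as assumptions.

```python
from datetime import date, timedelta


def window_tokens(articles, target_day, window=3):
    """Pool tokens of articles published in the `window` calendar days before target_day.

    articles: list of (publication_date, tokens) pairs for a single stock.
    """
    start = target_day - timedelta(days=window)
    pooled = []
    for published, tokens in articles:
        # Keep articles from start up to, but not including, the target day.
        if start <= published < target_day:
            pooled.extend(tokens)
    return pooled


# Hypothetical AAPL articles around the start of 2019.
articles = [
    (date(2018, 12, 31), ["old", "news"]),
    (date(2019, 1, 1), ["apple", "cut", "forecast"]),
    (date(2019, 1, 3), ["apple", "share", "fall"]),
    (date(2019, 1, 4), ["same", "day", "news"]),
]
print(window_tokens(articles, date(2019, 1, 4)))
```

With a target day of 4th Jan 2019, only the articles dated 1st to 3rd Jan contribute tokens.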

The computed TF-IDF was fed as input to a logistic regression, which is a type of statistical model for estimating the probability of an event happening. Since the output of a logistic regression is a probability, we assigned the predicted price movement to be UP if the estimated probability was more than 0.5. As with the setup in our inaugural post, data between 2013 and 2018 was used for model development and 2019 data was used for model assessment.
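For readers who want to see the mechanics, below is a bare-bones logistic regression fitted by batch gradient descent, followed by the 0.5 cut-off. It is a teaching sketch with made-up two-feature data, not the implementation used in the analysis (which most likely relied on a library); the learning rate and epoch count are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Estimated probability of an UP move for one feature vector."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Hypothetical TF-IDF features for two n-grams, with next-day labels (1 = up).
X = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.7], [0.0, 0.9]]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)

p = predict_proba(w, b, [0.7, 0.1])
print("UP" if p > 0.5 else "DOWN")
```

In the real setting each feature vector would have one entry per n-gram in the vocabulary, so it would be far longer and mostly zeros.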

Figure 2 summarises the modelling process from the raw input to model output.

Figure 2. Modelling process

Figure 3 shows box plots of the estimated probability by the actual price movement. A model that could discriminate between up and down price movements effectively would show the blue box plot, for actual upward movement one day ahead, higher along the probability scale, and the orange box plot, for actual downward movement, lower along the probability scale. However, as seen in the figure, the distribution of the estimated probability when the actual one-day-ahead price movement was up was almost identical to that when the movement was down. Similar to the LSTM model based on past price trends that we constructed previously, our logistic regression model built using news data could not pick up any signals for profitable trades.

Figure 3. Distribution of the estimated probability by actual price movement one day ahead

In fact, suppose a trader started trading with $1,000 on 2nd Jan 2019 and, on each trading day, split his investment evenly across the 10 selected stocks. Suppose also that he was able to buy stocks at the previous closing price and sell them at the end of each trading day at that day's closing price. By 31st Dec 2019, his portfolio would have grown to about $1,434. By contrast, an analyst who made his purchases by following our logistic model's predictions would have ended up with only $1,429 over the same period! 😨
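The comparison above can be sketched as a simple compounding loop. The benchmark splits capital evenly across all stocks each day. For the model-following trader, this sketch assumes capital is split evenly across only the stocks predicted UP and stays in cash when nothing is predicted UP; the post does not spell out that rule, so it is a guess. Transaction costs are ignored, and the returns and signals are made up.

```python
def grow(capital, daily_returns, daily_signals=None):
    """Compound capital day by day.

    daily_returns: list of days, each a list of close-to-close returns per stock.
    daily_signals: optional list of days, each a list of booleans (True = predicted UP).
                   When given, capital is split evenly across predicted-UP stocks only;
                   on days with no UP prediction the capital stays in cash.
    """
    for day, returns in enumerate(daily_returns):
        if daily_signals is None:
            picked = returns
        else:
            picked = [r for r, up in zip(returns, daily_signals[day]) if up]
        if picked:
            capital *= 1.0 + sum(picked) / len(picked)
    return capital


# Hypothetical returns for three stocks over two days.
returns = [[0.01, -0.02, 0.03], [0.00, 0.01, -0.01]]
signals = [[True, False, True], [False, False, False]]

print(round(grow(1000.0, returns), 2))           # evenly split benchmark
print(round(grow(1000.0, returns, signals), 2))  # follow the model's predictions
```

An evenly split portfolio earns the average of the individual stock returns each day, which is why the loop multiplies capital by one plus the mean return of the selected stocks.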

Once again, our predictive model was unceremoniously defeated by the efficient market. However, this does not necessarily imply that financial news contains no useful information to aid analysts in identifying profitable trades. A clear shortcoming of the n-gram model proposed in this post is that it cannot represent long-range word dependencies beyond a sequence of n words. This might have affected model performance if the concepts associated with profitable trades were separated by many words in a document.

In the next part of this series, we call on more powerful text mining techniques, and we may finally see the tide turning in our favour. 👍💪


The Python code used for this analysis is available on my GitHub here.

How would you have constructed the predictive model differently to predict stock price movement? What other finance-related analyses would you like to see? Leave a comment below to share your thoughts!

 

[1] C. Breitinger, B. Gipp, S. Langer (2015). Research-paper recommender systems: a literature survey. International Journal on Digital Libraries. 17(4): 305-338. DOI: https://doi.org/10.1007/s00799-015-0156-0
