IB Quant Blog




Stocks

Replicating Indexes in R with Style Analysis: Part I


By CapitalSpectator.com

In the quest for clarity in portfolio analytics, Professor Bill Sharpe's introduction of returns-based style analysis was a revelation. By applying statistical techniques to reverse engineer investment strategies using historical performance data, style analysis offers a powerful, practical tool for understanding the source of risk and return in portfolios. The same analytical framework can be used to replicate indexes with ETFs and other securities, providing an intriguing way to invest in strategies that may otherwise be unavailable.

Imagine that there's a hedge fund or managed futures portfolio that you'd like to own but for one reason or another is inaccessible. Perhaps the minimum investment is too high or the fund is closed. Or maybe you prefer to build your own to keep costs down or maintain a tighter control on risk. If the returns are published, even with a short lag, you can still jump on the bandwagon by statistically creating a rough approximation of the strategy's asset allocation via style analysis.

Any index, in theory, can be replicated, which opens up a world of opportunity. Even if you're not interested in investing per se, decomposing key indexes through style analysis offers valuable tactical and strategic information. As one example, deconstructing key hedge fund or CTA benchmarks published by BarclayHedge.com provides the basis for quasi-real time analysis of investment trends in the alternative investment space. In turn, the analysis can provide useful perspective on the evolution of manager preferences for asset classes in global macro or managed futures strategies.

Let's run through a simple example of how to estimate weights for an index through style analysis. To illustrate the process clearly in Part I of this two-part series, I'll start by reverse engineering an index that's already fully transparent: the S&P 500.

From a practical standpoint there's no need to decompose the S&P since its components are widely known and you can readily invest in the index through low-cost proxy ETFs and mutual funds. But let's pretend that the S&P 500 is an exotic benchmark and its design rules are a mystery. All we have to work with: the S&P's daily returns and a vague understanding that 11 equity sectors (financials, energy, etc.) drive the S&P's risk and return profile.

Fortunately, we have access to ETF proxies for those 11 sectors. Thanks to style analysis, we're also in luck because these puzzle pieces can be analyzed to create a replicated version of the S&P 500 via the 11 funds.

The basic procedure is to regress the S&P's historical returns against a set of relevant reference indexes; in effect, we search for the portfolio weights that minimize the variance of the difference between the index's returns and the weighted mix of reference returns. To keep the result long-only and unlevered, we constrain the weights to be non-negative and to sum to one.

There are several ways to crunch the numbers, including off-the-shelf software packages that do all the heavy lifting for you. If you prefer to go behind the curtain to 1) understand how the analytics work and 2) gain more control over the results, it's time to fire up R (much of what follows, by the way, is inspired and facilitated by the FactorAnalytics package).

There are a number of possibilities for estimating weights via style analysis. In this example I use the quadratic programming method via the solve.QP function. If you're curious, here's a basic setup I wrote using R code for a one-period analysis.

In terms of ETFs, the target index is represented by SPDR S&P 500 (SPY); you can find a list of the 11 sector funds here.

For this example I used daily returns from the end of 2010 through last week's close (Oct. 6) with the first asset-mix estimate following a year later. From there, I re-estimated the weights once every year (252 trading days). Here's how the replicated SPY portfolio compares with the genuine article:

[Chart: replicated SPY portfolio vs. the actual SPY. Courtesy of CapitalSpectator.com]

It's not perfect, but it's close. The correlation for the daily returns for the two indexes is 0.72 (if the match was perfect the correlation would be 1.0; if there was no correlation the reading would be 0.0). Looking back on the history for the sample period shows that the estimated weights for any one of the 11 sector funds ranged from 0 to roughly 22%.

Keep in mind that this replication example was the financial-engineering equivalent of shooting fish in a barrel. That was intentional, to illustrate the process for an outcome we generally knew in advance. In this case, it was clear from the get-go that 11 sector funds would explain the lion's share of the S&P 500's risk and return variation. Replicating other indexes, however, requires more work.

To estimate weights for, say, a hedge fund index that's opaque beyond its performance history requires subjective decisions about which set of benchmarks/funds to use for the regression. Fortunately, there's a wide range of ETFs that provides the raw material to replicate most strategies. Nonetheless, it's fair to say that this process generally requires a mix of art and science.

In the example above, most of the effort was science. In Part II of this series I'll tackle a more ambitious subject that requires more art by attempting to replicate a hedge fund index via a set of ETFs.

R code can also be downloaded from GitHub here:
https://gist.github.com/jpicerno1/8c26e3c6b16364fac3d01149b5ba401d

replicate.10oct2017

# R code re: CapitalSpectator.com post for replicating indexes in R
# "Replicating Indexes In R With Style Analysis: Part I"
# http://www.capitalspectator.com/replicating-indexes-in-r-with-style-analysis-part-i/
# 10 Oct 2017
# By James Picerno
# http://www.capitalspectator.com/
# (c) 2017 by Beta Publishing LLC

# load packages
library(quadprog)
library(PerformanceAnalytics)
library(quantmod)
require(Quandl)

# download price histories
Quandl.api_key("ABC123") # <-enter your Quandl API key here.
# Or use free price history at Tiingo.com or alphavantage.co
# to populate prices.1 file below
symbols <- c("XLF", "XLK", "XLI", "XLB", "XLY", "XLV", "XLU", "XLP", "XLE", "VOX", "VNQ", "SPY") # 11 sector ETFs plus SPY (the target)

prices <- list()
for(i in 1:length(symbols)) {
price <- Quandl(paste0("EOD/", symbols[i]), start_date="2010-12-31", type = "xts")$Adj_Close
colnames(price) <- symbols[i]
prices[[i]] <- price
}
prices.1 <- na.omit(do.call(cbind, prices))
dat1 <-ROC(prices.1,1,"discrete",na.pad=F)

# estimate weights
y.fund <-dat1[,12] # returns of target fund to replicate
x.funds <-dat1[,1:11] # returns of funds to reweight to replicate target fund
rows <-nrow(x.funds)
cols <-ncol(x.funds)
Dmat <-cov(x.funds, use="pairwise.complete.obs")
dvec <-cov(y.fund, x.funds, use="pairwise.complete.obs")
a1 <-rep(1, cols)
a2 <-matrix(0, cols, cols)
diag(a2) <- 1
w.min <-rep(0, cols)
Amat <-t(rbind(a1, a2))
b0 <-c(1, w.min)
optimal <- solve.QP(Dmat, dvec, Amat, bvec = b0, meq = 1)
weights <- as.data.frame(optimal$solution)
rownames(weights) = names(x.funds)

weights

# END

CapitalSpectator.com is a finance/investment/economics blog that's edited by James Picerno. The site's focus is macroeconomics, the business cycle and portfolio strategy (with an emphasis on asset allocation and related analytics).

Picerno is the author of Dynamic Asset Allocation: Modern Portfolio Theory Updated for the Smart Investor (Bloomberg Press, 2010) and Nowcasting The Business Cycle: A Practical Guide For Spotting Business Cycle Peaks (Beta Publishing, 2014). In addition, Picerno publishes The US Business Cycle Risk Report, a weekly newsletter that quantitatively evaluates US recession risk in real time.

This article is from CapitalSpectator.com and is being posted with CapitalSpectator.com's permission. The views expressed in this article are solely those of the author and/or CapitalSpectator.com and IB is not endorsing or recommending any investment or trading discussed in the article. This material is for information only and is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad-based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation by IB to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.






Stocks

Machine Learning Classification Strategy In Python


Ishan Shah from QuantInsti implements a machine learning classification algorithm on S&P500 using Support Vector Classifier (SVC) in Python.


In this blog, we will implement, step by step, a machine learning classification algorithm on the S&P500 using a Support Vector Classifier (SVC). SVCs are supervised learning classification models. A set of training data is provided to the classification algorithm, with each observation belonging to one of the categories; for instance, the categories can be to either buy or sell a stock. The algorithm builds a model based on the training data and then classifies the test data into one of the categories.

Now, let’s implement the machine learning classification strategy in Python.

 

Step 1: Import the libraries

In this step, we will import the necessary libraries that will be needed to create the strategy.

# Machine learning classification
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# For data manipulation
import pandas as pd
import numpy as np

# To plot
import matplotlib.pyplot as plt
import seaborn

# To fetch data
from pandas_datareader import data as pdr


Step 2: Fetch data

We will download the S&P500 data from Google Finance using pandas_datareader.

After that, we will drop the missing values from the data and plot the S&P500 close price series.

Df = pdr.get_data_google('SPY', start="2012-01-01", end="2017-10-01")
Df= Df.dropna()
Df.Close.plot(figsize=(10,5))
plt.ylabel("S&P500 Price")
plt.show()

[Figure: S&P500 (SPY) close price series, 2012 to 2017]

 

Step 3: Determine the target variable

The target variable is the variable which the machine learning classification algorithm will estimate. In this example, the target variable is whether S&P500 price will close up or close down on the next trading day.

We will first determine the actual trading signal using the following logic: if the next trading day's close price is greater than today's close price, we will buy the S&P500 index; otherwise we will sell it. We will store +1 for a buy signal and -1 for a sell signal.

y = np.where(Df['Close'].shift(-1) > Df['Close'],1,-1)

 

Step 4: Creation of variables

X is a dataset that holds the predictor variables which are used to estimate the target variable, 'y'. X consists of variables such as 'Open - Close' and 'High - Low'. These can be understood as indicators based on which the algorithm will try to estimate the next day's market move.

Df['Open-Close'] = Df.Open - Df.Close
Df['High-Low'] = Df.High - Df.Low
X=Df[['Open-Close','High-Low']]

In the later part of the code, the machine learning classification algorithm will use the predictors and target variable in the training phase to create the model and then, try to estimate the target variable in the test dataset.

 

Step 5: Test and train dataset split

In this step, we will split data into the train dataset and the test dataset.

  1. First, 80% of the data is used for training and the remaining 20% for testing
  2. X_train and y_train are train dataset
  3. X_test and y_test are test dataset

split_percentage = 0.8
split = int(split_percentage*len(Df))

# Train data set
X_train = X[:split]
y_train = y[:split]
# Test data set

X_test = X[split:]
y_test = y[split:]

 

Step 6: Create the machine learning classification model using the train dataset

We will create the machine learning classification model based on the train dataset. This model will be later used to estimate the trading signal in the test dataset.

cls = SVC().fit(X_train, y_train)

 

Step 7: The classification model accuracy

We will compute the accuracy of the classification model on the train and test dataset, by comparing the actual values of the trading signal with the predicted values of the trading signal. The function accuracy_score() will be used to calculate the accuracy.

Syntax: accuracy_score(target_actual_value,target_predicted_value)

  1. target_actual_value: correct signal values
  2. target_predicted_value: predicted signal values

accuracy_train = accuracy_score(y_train, cls.predict(X_train))
accuracy_test = accuracy_score(y_test, cls.predict(X_test))

print('\nTrain Accuracy:{: .2f}%'.format(accuracy_train*100))
print('Test Accuracy:{: .2f}%'.format(accuracy_test*100))

 

Step 8: Estimation

We will try to estimate the signal (buy or sell) for the test data set, using the cls.predict() function. Then, we will compute the strategy returns based on the signal generated by the model in the test dataset. We save it in the column ‘Strategy_Return’ and then, plot the cumulative strategy returns.

Df['Predicted_Signal'] = cls.predict(X)
# Calculate log returns
Df['Return'] = np.log(Df.Close.shift(-1) / Df.Close)*100
Df['Strategy_Return'] = Df.Return * Df.Predicted_Signal
Df.Strategy_Return.iloc[split:].cumsum().plot(figsize=(10,5))
plt.ylabel("Strategy Returns (%)")
plt.show()

 

[Figure: cumulative strategy returns on the test dataset]

 

Visit QuantInsti website for more Python coding examples and to learn about their Executive Programme in Algorithmic Trading (EPAT™)

 

 

This article is from QuantInsti and is being posted with QuantInsti’s permission. The views expressed in this article are solely those of the author and/or QuantInsti and IB is not endorsing or recommending any investment or trading discussed in the article. This material is for information only and is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad-based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation by IB to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.






Stocks

4 Big Ways Data Science is Transforming the FinTech Industry



The financial industry is no stranger to the power of technology. After all, it has developed its own legion of quants who use advanced quantitative analysis and programming techniques to exploit big opportunities in the capital markets. Despite being highly regulated, the FinTech industry is booming with the entry of smaller startups looking to disrupt the space with the latest technologies, with data science at the top of the list.

Here are the top 4 ways in which data science is being leveraged by the FinTech industry.
 

Credit Risk Scoring

A credit score is a statistically derived estimate of a person's creditworthiness based on their past credit history. This value is used to determine whether a loan should be extended to the person or not. Traditionally, banks use complex statistical methods to determine the credit score of individuals. However, the rise of data science has led to the introduction of advanced techniques, such as machine learning algorithms, that provide more accurate estimates by drawing on a large number of data points (from relevant to obscure variables).

Data science, therefore, provides a holistic view of a person’s creditworthiness, by taking all data into consideration.

Example: Alibaba’s Aliloan is a prime example of this. Aliloan is an automated online system that provides flexible microloans to entrepreneurial online vendors who would be ignored by traditional banks due to a lack of collateral. It collects data from its e-commerce and payment platforms and uses predictive models to analyze transaction records, customer ratings, shipping records and a host of other info to determine the creditworthiness of a merchant.

Fraud Detection and Prevention

Fraud costs the financial industry about $80 billion per year (Consumer Reports, June 2011). The repercussions of fraudulent transactions are felt by both institutions and individuals, making fraud detection a top priority for FinTech executives. Currently, fraud detection is based on fixed rules, such as flags triggered by the location, ATM or IP address used. However, instead of relying only on a finite set of rules, the process can be improved with machine learning methods, such as logistic regression or a naive Bayes classifier, which compute the probability of a transaction being fraudulent based on patterns in historical transaction data.

This not only improves accuracy but can also be employed on live data, thus helping FinTech companies take action more effectively.
 

Portfolio Optimization and Asset Management

Portfolio optimization and asset management are key functions performed by FinTech institutions. With the rise of Big Data, these institutions can crunch a massive amount of financial data to build asset management models based on machine learning principles (as opposed to statistical models). This has also given rise to what is called Robo-advisors where companies use software to automate asset allocation decisions which reduce risk, improve returns and provide automatic tax loss harvesting.

Sentiment Analysis is also used to analyze public data (such as worldwide Twitter feeds) to gauge market sentiment and short the market whenever any natural or manmade disaster strikes. This process can also be fully automated, which can further reduce costs for these institutions.
 

Marketing, Customer Retention, and Loyalty Programs

FinTech companies collect huge amounts of data from their users, which often remain unused unless relevant for financial analysis. But this customer information, from transaction data to personal details and social media presence, can be used to boost marketing efforts through contextual, personalized product advertisements or discount offers, which in turn can reduce customer churn. Furthermore, this information can be used to better target future customers, optimizing the company's marketing spend as well.


Final Notes

The Financial Industry is a behemoth in its own right, but by employing the advanced methods provided by Data Science, it can scale hitherto unknown heights of growth and profit.

 

 

Learn more about Byte Academy here.

 

This article is from Byte Academy and is being posted with Byte Academy’s permission. The views expressed in this article are solely those of the author and/or Byte Academy and IB is not endorsing or recommending any investment or trading discussed in the article. This material is for information only and is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad-based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation by IB to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.






Stocks

Deep Learning for Trading Part 4: Fighting Overfitting with Dropout and Regularization


In this post, I’m going to demonstrate some tools to help fight overfitting and push your models further.

Regularization

Regularization is a commonly used technique to mitigate overfitting of machine learning models, and it can also be applied to deep learning. Regularization constrains the complexity of a network by penalizing larger weights during training, that is, by adding a term to the loss function that grows as the weights increase.

Keras implements two common types of regularization:

  • L1, where the additional cost is proportional to the absolute value of the weight coefficients
  • L2, where the additional cost is proportional to the square of the weight coefficients

These are incredibly easy to implement in Keras: simply pass regularizer_l1(regularization_factor) or regularizer_l2(regularization_factor) to the kernel_regularizer argument of a Keras layer instance (details on how to do this below). Depending on the type of regularization chosen, either regularization_factor * abs(weight_coefficient) or regularization_factor * weight_coefficient^2 is added to the total loss.

Note that in Keras-speak, 'kernel' refers to the weights matrix created by a layer. Regularization can also be applied to the bias terms via the bias_regularizer argument, and to the output of a layer via activity_regularizer.
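As a quick illustration, here is a minimal sketch of where those three arguments attach on a single layer. The original series uses the R interface to Keras; this sketch uses the equivalent Python Keras API, and the regularization factors and layer width are illustrative, not values from the post.

from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

# Hypothetical layer showing the three regularizer arguments (factors are illustrative)
layer = Dense(32, activation="relu",
              kernel_regularizer=regularizers.l2(1e-4),    # penalizes the weights matrix ("kernel")
              bias_regularizer=regularizers.l2(1e-4),      # penalizes the bias terms
              activity_regularizer=regularizers.l1(1e-5))  # penalizes the layer's outputs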

Getting smarter with our learning rate

When we add regularization to a network, we might find that we need to train it for more epochs in order to reach convergence. This implies that the network might benefit from a higher learning rate during early stages of model training.1

However, we also know that a network can sometimes benefit from a smaller learning rate at later stages of the training process. Think of the model's loss as being stuck partway down the slope towards the global minimum, bouncing from one side of the loss surface to the other with each weight update. By reducing the learning rate, we make subsequent weight updates less dramatic, which lets the loss 'fall' further towards the true global minimum.

By using another Keras callback, we can automatically adjust our learning rate downwards when training reaches a plateau:

[Code screenshot: configuring the reduce-learning-rate-on-plateau callback]
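The screenshot above isn't reproduced here, but a minimal sketch of the callback in the Python Keras API looks roughly like this. The original post uses the R interface; the patience and threshold values are illustrative, and newer Keras versions call the threshold min_delta rather than epsilon.

from tensorflow.keras.callbacks import ReduceLROnPlateau

reduce_lr = ReduceLROnPlateau(monitor="val_accuracy",  # watch validation accuracy for a plateau
                              factor=0.9,              # multiply the learning rate by 0.9 when triggered
                              patience=10,             # epochs with no improvement before reducing
                              min_delta=1e-3)          # threshold for counting an improvement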

This tells Keras to reduce the learning rate by a factor of 0.9 whenever validation accuracy doesn't improve for patience epochs. Also note the epsilon parameter (renamed min_delta in more recent Keras versions), which controls the threshold for measuring a new optimum; setting it to a higher value results in fewer changes to the learning rate. This parameter should be on a scale that is relevant to the metric being tracked, validation accuracy in this case.

 

Putting it together

Here’s the code for an L2 regularized feed forward network with both  reduce_lr_on_plateau and model_checkpoint callbacks (data import and processing is the same as in the previous post):

[Code screenshot: L2-regularized feed-forward network with reduce_lr_on_plateau and model_checkpoint callbacks]
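The code itself is embedded as an image in the original post; the sketch below shows a comparable setup in the Python Keras API under stated assumptions. The layer sizes, input width, callback settings, checkpoint filename and the stand-in X_train / y_train arrays are illustrative, not taken from the post.

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras import regularizers
from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint

n_features = 20                                  # assumed width of the feature matrix
X_train = np.random.randn(1000, n_features)      # stand-ins for the data prepared in the previous post
y_train = np.random.randint(0, 2, size=1000)

model = Sequential([
    Input(shape=(n_features,)),
    Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    Dense(32, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    Dense(1, activation="sigmoid")               # binary up/down output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

callbacks = [
    ReduceLROnPlateau(monitor="val_accuracy", factor=0.9, patience=10),
    ModelCheckpoint("best_model.h5", monitor="val_accuracy", save_best_only=True)
]
history = model.fit(X_train, y_train, validation_split=0.2,
                    epochs=50, batch_size=64, callbacks=callbacks, verbose=0)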

Plotting the training curves now gives us three plots – loss, accuracy and learning rate:

[Figure: training curves (loss, accuracy and learning rate) by epoch]

This particular training process resulted in an out of sample accuracy of 53.4%, slightly better than our original unregularized model. You can experiment with more or less regularization, as well as applying regularization to the bias terms and layer outputs.

 

Dropout

Dropout is another commonly used tool to fight overfitting. Whereas regularization is used throughout the machine learning ecosystem, dropout is specific to neural networks. Dropout is the random zeroing ("dropping out") of some proportion of a layer's outputs during training. The theory is that this helps prevent pairs or groups of nodes from learning spurious relationships that just happen to reduce the network loss on the training set (that is, result in overfitting). Hinton and his colleagues, who introduced dropout, showed that it is generally superior to other forms of regularization and improves model performance on a variety of tasks. Read the original paper here.2

Dropout is implemented in Keras as its own layer, layer_dropout(), which applies dropout to its inputs (that is, to the outputs of the previous layer in the stack). We need to supply the fraction of outputs to drop, which we pass via the rate parameter. In practice, dropout rates between 0.2 and 0.5 are common, but the optimal value for a particular problem and network configuration needs to be determined through appropriate cross validation.

At the risk of getting ahead of ourselves, when applying dropout to recurrent architectures (which we’ll explore in a future post), we need to apply the same pattern of dropout at every timestep, otherwise dropout tends to hinder performance rather than enhance it.3

Here’s an example of how we build a feed forward network with dropout in Keras:

[Code screenshot: feed-forward network with dropout layers]
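Again the code is an image in the original; here is a rough Python Keras equivalent of a feed-forward network with dropout after each hidden layer. The original uses the R interface's layer_dropout(), and the 0.3 rate, layer sizes and input width here are illustrative assumptions.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

n_features = 20                      # assumed input width, as in the earlier sketch

model = Sequential([
    Input(shape=(n_features,)),
    Dense(64, activation="relu"),
    Dropout(0.3),                    # zero 30% of this layer's outputs at each training step
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(1, activation="sigmoid")
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])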

Training the model using the same procedure as we used in the L2-regularized model above, including the reduce learning rate callback, we get the following training curves:

[Figure: training curves for the dropout model]

One of the reasons dropout is so useful is that it enables the training of larger networks by reducing their propensity to overfit. Here are the training curves for a similar model, but this time eight layers deep:

[Figure: training curves for the eight-layer dropout model]

Notice that it doesn’t overfit significantly worse than the shallower model. Also notice that it didn’t really learn any new, independent relationships from the data – this is evidenced by the failure to beat the previous model’s validation accuracy. Perhaps 53% is the upper out of sample accuracy limit for this data set and this approach to modeling it.

With dropout, you can also afford to use a larger learning rate, so it is a good idea to kick off training with a higher learning rate and rely on the reduce_lr_on_plateau callback to decay it as learning stalls.

Finally, one important consideration when using dropout is constraining the size of the network weights, particularly when a large learning rate is used early in training. In the Hinton et al. paper linked above, constraining the weights was shown to improve performance in the presence of dropout.

Keras makes that easy thanks to the kernel_constraint  parameter of layer_dense() :

[Code screenshot: dropout network with a max-norm weight constraint]
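A minimal sketch of the constraint in the Python Keras API follows. The original post sets it through the kernel_constraint argument of the R interface's layer_dense(); the max-norm value of 3 and the 0.5 dropout rate are common illustrative choices, not values from the post.

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import max_norm

constrained = Dense(64, activation="relu",
                    kernel_constraint=max_norm(3.0))  # cap the norm of each unit's incoming weight vector
drop = Dropout(0.5)                                   # dropout still applied alongside the constraint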

This model provided an ever-so-slight bump in validation accuracy:

[Figure: validation accuracy for the weight-constrained model]

And quite a stunning test-set equity curve:

[Figure: test-set equity curve for the weight-constrained dropout model]

Interestingly, every experiment I performed in writing this post resulted in a positive out of sample equity curve. The results were all slightly different, even when using the same model setup, which reflects the non-deterministic nature of the training process (two identical networks trained on the same data can result in different weights, depending on the initial, pre-training weights of each network). Some equity curves were better than others, but they were all positive.

Here are some examples:

With L2-weight regularization and no dropout:

[Figure: out-of-sample equity curve with L2 weight regularization and no dropout]

With a dropout rate of 0.2 applied at each layer, no regularization, and no weight constraints:


[Figure: out-of-sample equity curve with a dropout rate of 0.2 at each layer, no regularization and no weight constraints]

Of course, as mentioned in the last post, the edge of these models disappears when we apply retail spreads and broker commissions. But the frictionless equity curves demonstrate that deep learning, even using a simple feed-forward architecture, can extract information from historical price action, at least for this particular data set, and that tools like regularization and dropout can make a difference to the quality of the model's estimations.
 

What’s next?

Before we get into advanced model architectures, in the next unit I’ll show you:

  1. One of the more cutting-edge architectures for getting the most out of a densely connected feed-forward network.
  2. How to interrogate and visualize the training process in real time.


Conclusions

This post demonstrated how to fight overfitting with regularization and dropout using Keras’ sequential model paradigm.



Notes:

  1. Learning rate is the parameter that controls the magnitude of the weight adjustments after each batch of training data, and you’ll find it implemented as a parameter in Keras’ optimizers.
  2. As an aside, note the publication date of that paper: June 2014. That’s only about 3.5 years ago, yet dropout is already almost universally adopted throughout deep learning practice. This is a testament to just how quickly this field is evolving, and how closely integrated research and application really are.
  3. This statement will become clear when we get to recurrent networks in a future post. I wanted to include it here in case any readers start experimenting with recurrent networks before I get around to writing about them.

 

Visit Robot Wealth website to Download the code and data used in this post.

 

Learn more about Robot Wealth here: https://robotwealth.com/

This article is from Robot Wealth and is being posted with Robot Wealth’s permission. The views expressed in this article are solely those of the author and/or Robot Wealth and IB is not endorsing or recommending any investment or trading discussed in the article. This material is for information only and is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad-based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation by IB to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.






Stocks

Hierarchical clustering of Exchange-Traded Funds


Author: PMERCATORIS, QuantDare


 

Clustering has already been discussed in plenty of detail, but today I would like to focus on a relatively simple but extremely modular clustering technique, hierarchical clustering, and how it could be applied to ETFs. We’ll also be able to review the Python tools available to help us with this.

Clustering suitability

First of all, ETFs are well suited to clustering, as each one tries to replicate market returns by following a market index, so we can expect to find clear clusters. The advantage of using hierarchical clustering here is that it allows us to choose the precision of our clustering (the number of clusters) after the algorithm has run. This is a clear advantage compared with other unsupervised methods, as it will ensure impartial and equidistant clusters, which is important for good portfolio diversification.

The data used here are the daily series of the past weeks’ returns. This ensures stationarity and allows for better series comparison. The prices used are all in US dollars from September 2011 to December 2017, to try and capture different market conditions while keeping a high number of ETFs (790).

How to begin

The first step is to calculate all the pairwise distances between the series. The SciPy package provides an efficient implementation for this in the pdist function, which supports many distance metrics.

Here I compared all the metrics applicable to a pair of numerical series. To compare them, I used the cophenetic correlation coefficient, which (very briefly) measures how well the pairwise distances between the series are preserved by the distances between their clusters. A value closer to 1 indicates better clustering, as the clusters preserve the original pairwise distances.

[Code screenshot: computing the cophenetic correlation for each candidate distance metric]
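The original code is embedded as an image; below is a minimal Python sketch of the same idea under stated assumptions. The random returns DataFrame is only a stand-in for the ETF return panel, and the 'average' linkage used for the comparison is an assumption; the values printed below come from the article's actual ETF data, not from this sketch.

import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, cophenet

# Stand-in for the panel of ETF return series (rows: observations, columns: ETFs)
returns = pd.DataFrame(np.random.randn(300, 50))

X = returns.T.values            # one row per ETF, one column per period

for metric in ["euclidean", "sqeuclidean", "cityblock", "cosine",
               "hamming", "chebyshev", "braycurtis", "correlation"]:
    d = pdist(X, metric=metric)          # pairwise distances between the series
    Z = linkage(d, method="average")     # hierarchy built from those distances (method assumed)
    c, _ = cophenet(Z, d)                # cophenetic correlation with the original distances
    print(metric, round(c, 3))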

euclidean 0.445105212991
sqeuclidean 0.636766347254
cityblock 0.449263737373
cosine 0.852746101706
hamming -0.148087351237
chebyshev 0.480135889737
braycurtis 0.486277543793
correlation 0.850386271327
 

Here, the cosine (as well as the correlation) distance worked best. It's then time to apply the agglomerative hierarchical clustering, which is done by the linkage function. There are a few methods for calculating between-cluster distances, and I invite you to read further about them in the linkage function's documentation. In this case, I will use the ward method, which merges the pair of clusters giving the smallest increase in total within-cluster variance (the Ward variance minimisation algorithm), and is often a good default choice.

[Code screenshot: agglomerative clustering with the linkage function]
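A sketch of the linkage step, continuing from the previous block (X and pdist as defined there). The cosine metric follows the comparison above; note that SciPy's ward method is formally defined for Euclidean distances, so this mirrors the article's choice rather than a recommendation.

from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

d = pdist(X, metric="cosine")      # X as defined in the previous sketch
Z = linkage(d, method="ward")      # Ward variance minimisation on the condensed distance matrix
print(Z[-3:])                      # inspect the last three merges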

The resulting matrix Z records each step of the agglomerative clustering: the first two columns give the indices of the clusters that were merged, the third column is the distance between those clusters, and the fourth column is the number of original samples contained in the newly merged cluster. Here are the last 3 merges:

[Output: the last three rows of the linkage matrix Z]

 

Visualising the clusters

A good way to visualise this is with a dendrogram, which shows at which inter-cluster distance each merge occurred. From there, it is possible to select a distance where clusters are clear (indicated with the horizontal black lines).

[Figure: dendrogram of the hierarchical clustering, with candidate cut distances marked by horizontal black lines]
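A sketch of how such a dendrogram can be drawn from Z; the truncation settings are illustrative, and the article's figure may have been produced differently.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

plt.figure(figsize=(12, 5))
dendrogram(Z, truncate_mode="lastp", p=50, no_labels=True)  # show only the last 50 merges
plt.ylabel("inter-cluster distance")
plt.show()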

As we can see, the natural number of clusters appears to be 2, 4 or 6 (depending on the desired level of detail).

Another, more automatic, way of selecting the number of clusters is to use the elbow method and pick the point where the drop in inter-cluster merge distance is largest, which seems to occur at 2 clusters. However, this is probably too simplistic, and we can also see elbows at 4 and at 6 clusters, as shown by the second derivative of those distances (in orange).

[Figure: merge distances by number of clusters, with their second derivative (orange)]
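A sketch of the elbow diagnostic described above, using the merge distances stored in the third column of Z; the number of merges shown and the plotting offsets are illustrative.

import numpy as np
import matplotlib.pyplot as plt

last = Z[-10:, 2]                      # distances of the final ten merges
idxs = np.arange(1, len(last) + 1)     # corresponding candidate cluster counts
plt.plot(idxs, last[::-1], label="merge distance")

accel = np.diff(last, 2)               # discrete second derivative of the merge distances
plt.plot(idxs[:-2] + 2, accel[::-1], label="second derivative")

plt.xlabel("number of clusters")
plt.legend()
plt.show()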

Plotting the results

In order to plot the results, some dimensionality reduction is needed. For this I used t-SNE, which is particularly effective for projecting data onto 2 dimensions. However, when the number of features is very high, it's a good idea to first reduce the dimensions to a reasonable number with PCA. This is certainly the case for time series, where each daily return is treated as a dimension.

[Code screenshot: PCA dimensionality reduction with scikit-learn]
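The PCA step is shown as an image in the original; a minimal scikit-learn sketch that keeps enough components for roughly 95% of the variance (the threshold mentioned below) would look like this, again using the X matrix from the earlier sketches.

from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)          # keep the fewest components explaining 95% of the variance
X_pca = pca.fit_transform(X)          # X as defined in the earlier sketches
print(X_pca.shape[1], "components retained")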

[Figure: cumulative explained variance by number of principal components]

In order to retain at least 95% of the explained variance, a minimum of 80 components is needed.

[Figure: explained variance ratio of the principal components]

With those reduced dimensions, we can now use t-SNE to project the data further down to 2 dimensions. I highly recommend this read to see how to fine-tune it (the article has some very nice interactive visualisations).

[Code screenshot: t-SNE projection to two dimensions]
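A sketch of the t-SNE projection and the cluster-coloured scatter plot, continuing from the previous blocks; the perplexity, the choice of four clusters and the plotting details are illustrative assumptions.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from scipy.cluster.hierarchy import fcluster

X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

labels = fcluster(Z, t=4, criterion="maxclust")     # cut the tree into four clusters
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10)
plt.title("ETFs projected to 2D, coloured by cluster")
plt.show()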

Finally, the clusters appear relatively cohesive when plotted in two dimensions, so the assets in each cluster can be expected to have behaved similarly across the market conditions observed since 2011. This assumption needs to be taken with a pinch of salt, of course, but it can help in building a diversified portfolio by selecting assets from each cluster.

[Figure: ETFs projected onto two dimensions by t-SNE, coloured by cluster, with cluster membership counts]

 

Author: PMERCATORIS, QuantDare https://quantdare.com/

Daring to quantify the markets

This article is from QuantDare and is being posted with QuantDare’s permission. The views expressed in this article are solely those of the author and/or QuantDare and IB is not endorsing or recommending any investment or trading discussed in the article. This material is for information only and is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broad-based economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation by IB to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.







Disclosures

We appreciate your feedback. If you have any questions or comments about IB Quant Blog please contact ibkrquant@ibkr.com.

The material (including articles and commentary) provided on IB Quant Blog is offered for informational purposes only. The posted material is NOT a recommendation by Interactive Brokers (IB) that you or your clients should contract for the services of or invest with any of the independent advisors or hedge funds or others who may post on IB Quant Blog or invest with any advisors or hedge funds. The advisors, hedge funds and other analysts who may post on IB Quant Blog are independent of IB and IB does not make any representations or warranties concerning the past or future performance of these advisors, hedge funds and others or the accuracy of the information they provide. Interactive Brokers does not conduct a "suitability review" to make sure the trading of any advisor or hedge fund or other party is suitable for you.

Securities or other financial instruments mentioned in the material posted are not suitable for all investors. The material posted does not take into account your particular investment objectives, financial situations or needs and is not intended as a recommendation to you of any particular securities, financial instruments or strategies. Before making any investment or trade, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice. Past performance is no guarantee of future results.

Any information provided by third parties has been obtained from sources believed to be reliable and accurate; however, IB does not warrant its accuracy and assumes no responsibility for any errors or omissions.

Any information posted by employees of IB or an affiliated company is based upon information that is believed to be reliable. However, neither IB nor its affiliates warrant its completeness, accuracy or adequacy. IB does not make any representations or warranties concerning the past or future performance of any financial instrument. By posting material on IB Quant Blog, IB is not representing that any particular financial instrument or trading strategy is appropriate for you.