Web Scraping Mutual Fund Data and Creating a Mutual Fund Portfolio (Hedge Portfolio)

Neelakash Chatterjee
InsiderFinance Wire
9 min read · Feb 26, 2023

Hello readers, today I’m back with a very exciting experiment using mutual funds. A mutual fund is an asset that is itself a portfolio: an investment programme funded by shareholders that trades in diversified holdings and is professionally managed by experienced portfolio managers.

So, if mutual funds are already diversified, why do we want to diversify all over again? The answer is simple: we want to create a portfolio that performs reasonably well even at the toughest of times, say COVID-19 (in 2020). The one thing I always keep in mind is that it is okay if we cannot make money out of the market, but we don’t want to lose money.

So, without further ado, let’s get our hands dirty.

Import Statements

import warnings
warnings.filterwarnings('ignore')

import glob
import json
import urllib.request
import numpy as np
import pandas as pd
import yfinance as yf
import plotly.graph_objects as go

from tqdm import tqdm
from scipy.stats import kurtosis

from pypfopt.expected_returns import mean_historical_return
from pypfopt.risk_models import risk_matrix
from pypfopt.efficient_frontier import EfficientFrontier
from numpy.linalg import inv
from pypfopt.hierarchical_portfolio import HRPOpt
from pypfopt import objective_functions

Getting the Data

I collected all available scheme codes in the Indian market, because they are needed to scrape the historical data.

# Getting Scheme Codes 

scheme_codes = list(set(pd.read_csv("List of MF Codes.csv")['Scheme Code'].to_list()))

Now, we will start scraping the data from the mfapi.in website, which has historical NAV data for all mutual funds in the Indian market, updated till recently. Let’s write a small function to do the task.

BASE = 'https://api.mfapi.in/mf/'

def mfapidata(listOfcodes: list) -> None:
    '''
    Scraping MF data from the mfapi.in website and saving one CSV per scheme.
    '''

    try:
        for code in tqdm(listOfcodes):
            try:
                # scraping the data for one scheme code
                json_data = urllib.request.urlopen(BASE + str(code))
                data = json.loads(json_data.read())

                # creating a dataframe from the NAV history
                df = pd.DataFrame(data['data'])

                df['scheme_code'] = data['meta']['scheme_code']
                df['scheme_name'] = data['meta']['scheme_name']
                df['scheme_category'] = data['meta']['scheme_category']
                df['scheme_type'] = data['meta']['scheme_type']

                # dates arrive as dd-mm-yyyy strings; convert to yyyy-mm-dd
                df['date'] = df['date'].apply(lambda x: x[-4:] + "-" + x[-7:-5] + "-" + x[-10:-8])
                df['date'] = pd.to_datetime(df['date'])
                df = df.sort_values(['date'], ascending=True)

                # renaming columns of the dataframe
                df = df.rename(columns={"date": "Date", "nav": "NAV", "scheme_code": "Scheme Code",
                                        "scheme_name": "Scheme Name", "scheme_category": "Scheme Category",
                                        "scheme_type": "Scheme Type"})
                df = df.set_index("Date")

                # save the data, one CSV per scheme
                df.to_csv(f"Data/{data['meta']['scheme_name']}.csv")

            except Exception:
                # skip schemes with missing or malformed data
                pass

    except Exception as e:
        print(e)

There are almost 8,000 unique funds available in the Indian market, but for the sake of simplicity I’ll work with just 100 of them. The code below simply takes the last 100 scheme codes, though you could just as well sample 100 at random (see the sketch after the code).

# Total Unique schemes
len(scheme_codes)
7876

# Download all available data for the 100 chosen schemes
mfapidata(scheme_codes[-100:])
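
If you would rather pick a genuinely random sample of 100 schemes instead of the last 100, a minimal sketch could look like this (the fixed seed is just an assumption for reproducibility):

# Hypothetical alternative: sample 100 scheme codes at random instead of taking the last 100
rng = np.random.default_rng(seed=42)   # fixed seed so the sample is reproducible
random_codes = rng.choice(scheme_codes, size=100, replace=False)
mfapidata(list(random_codes))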

So, I downloaded the historical data for those 100 MFs and stored it on my hard drive. Each file contains information such as NAV, Scheme Code, Scheme Category, Scheme Name, etc.

NAV — Net Asset Value — is the price of 1 unit of a mutual fund.

Scheme Code — A unique identifier for the Fund

Scheme Name — Name of the Fund

Scheme Category — The category the fund belongs to.

Loading the Downloaded Data

Now, I load the downloaded data and arrange it in a format that would help me in running the experiments.

A short note on NAVs: the NAV by itself is almost irrelevant when comparing mutual funds. Let’s see why with the help of the example below. (The example below has been taken from https://groww.in/)

Let’s consider the example of two mutual funds: mutual fund A (MF-A) and mutual fund B (MF-B). Now, to explain why NAV isn’t important, in our example, we’ll consider every characteristic (fund manager, assets held, investment style, launch date, etc) of two mutual funds to be exactly the same, except for the NAV.

In our example, MF-A has a NAV of ₹20 and MF-B has a NAV of ₹50. Both these mutual funds have 20% of their assets allocated to the shares of company XYZ. If the shares of company XYZ rise 10% in value, the NAV of both MF-A and MF-B will increase by 2% (20% × 10%). So, the NAV of MF-A will become ₹20.4. At the same time, the NAV of MF-B will become ₹51.

At this stage, you might think that MF-B increased by a greater margin. That is true. However, you must remember that if you had invested in MF-B, you would also have paid a higher price for each unit of MF-B.

If you had invested ₹100 in MF-A at the beginning, you would have paid ₹20 per unit and gotten 5 units. After the increase in NAV, if you sold these units, you would have gotten back ₹20.4 × 5 = ₹102, a profit of ₹2.

Instead, if you had invested ₹100 in MF-B, you would have paid ₹50 for 2 units of MF-B. If you sold these units after the increase in NAV, you would have gotten back ₹51 × 2 = ₹102. The same as MF-A.
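
The same arithmetic in a few lines of Python, just as a sanity check of the example above (the numbers are purely illustrative):

# Worked example: ₹100 invested in each fund
nav_a, nav_b = 20.0, 50.0
units_a = 100 / nav_a          # 5 units of MF-A
units_b = 100 / nav_b          # 2 units of MF-B

# XYZ (20% of each fund) rises 10%, so each NAV rises by 2%
nav_a_new = nav_a * 1.02       # ₹20.4
nav_b_new = nav_b * 1.02       # ₹51.0

print(units_a * nav_a_new)     # 102.0 -> ₹2 profit in MF-A
print(units_b * nav_b_new)     # 102.0 -> the same ₹2 profit in MF-B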

mf_list = []

for fund in glob.glob("Data/*.csv"):
    data = pd.read_csv(fund, index_col=0)
    mf_list.append(data)

# DataFrame.append was removed in pandas 2.0, so concatenate the list instead
mf = pd.concat(mf_list)

total_df = mf.reset_index()[['Date', 'NAV', 'Scheme Name']].set_index(['Date', 'Scheme Name']).unstack()
total_df.columns = total_df.columns.droplevel()
total_df.columns.name = None

# dropping null rows and using ffill to fill any missing data
total_df = total_df.ffill().dropna()

total_df.head()
The Above Created DataFrame

The Strategy

Now comes the most important part: choosing the funds for our portfolio. We calculate the cumulative return of every fund up to the end of 2018, pick the top 10 funds with the highest cumulative returns, and run our backtest from 2019-01-01 to 2023-02-24, rebalancing the portfolio daily.

Note: We treat data up to 2018-12-31 as our available history, and we start trading from 2019-01-01. Make sure you don’t use the entire dataset when calculating the cumulative returns used to select the mutual funds, as that would introduce lookahead bias and you would end up with a very unrealistic, overly optimal portfolio.

Below is the list of 10 funds we will be investing in daily over the 4-year horizon.

# Selecting Top 10 MF out of these 100
topmfs = total_df.loc[:"2018-12-31"]
cum_rets = (1 + topmfs.pct_change().dropna()).prod() - 1
top_10 = cum_rets.sort_values(ascending=False)[:10]

top_10
The Selected (Top 10) Funds

We have created two functions that will be needed to back-test our strategy. They calculate the amount invested on a given day and the value we get back the next day on the investment made the previous day.

# Functions for Backtesting

def backtest(cash, backtest_df, weights):
    # allocate cash across funds, buy whole units at today's NAV,
    # then value the holdings at the next day's NAV
    amount_allocation = {}
    shares, total, balance, total_invested = {}, 0, 0, 0
    prices = backtest_df.iloc[0].to_dict()
    new_prices = backtest_df.iloc[len(backtest_df) - 1].to_dict()

    for keys in backtest_df.columns:
        amount_allocation[keys] = weights[keys] * cash

    for keys in backtest_df.columns:
        shares[keys] = amount_allocation[keys] // prices[keys]

    for keys in backtest_df.columns:
        total_invested = total_invested + (shares[keys] * prices[keys])

    balance = cash - total_invested

    for keys in backtest_df.columns:
        total = total + (shares[keys] * new_prices[keys])

    return total_invested, total, balance, shares


def backtest_lastm(cash, backtest_df, weights):
    # same as backtest(), but for the last day of the backtest,
    # where there is no "next day" NAV to value the holdings at
    amount_allocation = {}
    shares, total, balance, total_invested = {}, 0, 0, 0
    prices = backtest_df.to_dict()

    for keys in backtest_df.index:
        amount_allocation[keys] = weights[keys] * cash

    for keys in backtest_df.index:
        shares[keys] = amount_allocation[keys] // prices[keys]

    for keys in backtest_df.index:
        total_invested = total_invested + (shares[keys] * prices[keys])

    balance = cash - total_invested

    return total_invested, np.nan, balance, shares


# dates in our backtest period
dates_to_trade = list(total_df.loc["2019-01-01":].index)

Below, we iterate through each day in the back-test period, re-balancing our portfolio daily. We use the max_sharpe (maximum Sharpe ratio) portfolio-optimisation method from the PyPortfolioOpt library, and we begin the back-test with an initial amount of 100,000 INR, that is, one lakh Indian rupees.

N = 10
cash = 100000
flag = 0

investment, got_back, balances, date, all_weights = [], [], [], [], []
turnover_stack = []
turnover = []
opt_method = "max_sharpe"


for dates in range(len(dates_to_trade)):

    # choose mutual funds to invest in
    temp = list(top_10.index)

    # backtest window: today's and the next day's NAVs
    if dates != len(dates_to_trade) - 1:
        new_df = total_df.loc[dates_to_trade[dates]:dates_to_trade[dates + 1], temp]
    else:
        new_df = total_df.loc[dates_to_trade[dates], temp]
        flag = 1

    returns = {}

    # annualised returns over the trailing 60 observations
    for tickers in temp:
        returns[tickers] = total_df.loc[:dates_to_trade[dates], tickers][-60:].pct_change().dropna().mean() * 252

    returns = pd.Series(returns)


    # Max Sharpe optimisation, plus EQ and Min Vol alternatives
    if opt_method == "EQ":
        weights = {s: 1 / N for s in temp}

    elif opt_method.lower() == "max_sharpe":
        cov = risk_matrix(total_df.loc[:dates_to_trade[dates], temp][-60:], method="ledoit_wolf")

        ef = EfficientFrontier(returns, cov)

        weights = ef.nonconvex_objective(
            objective_functions.sharpe_ratio,
            objective_args=(ef.expected_returns, ef.cov_matrix),
            weights_sum_to_one=True,
        )

    elif opt_method.lower() == "min_vol":
        cov = risk_matrix(total_df.loc[:dates_to_trade[dates], temp][-60:], method="ledoit_wolf")

        ef = EfficientFrontier(returns, cov, verbose=False)
        ef.add_objective(objective_functions.L2_reg, gamma=1)

        ef.min_volatility()
        weights = ef.clean_weights()


    all_weights.append(weights)

    # last-day check
    if flag == 0:
        invested, total, balance, _ = backtest(cash, new_df, weights)
    else:
        invested, total, balance, _ = backtest_lastm(cash, new_df, weights)

    date.append(dates_to_trade[dates])
    investment.append(invested)
    got_back.append(total)
    balances.append(balance)

    # recalculate cash at the next day's end
    cash = total + balance
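
The turnover_stack and turnover lists declared above are never actually filled inside the loop. If you want to see how much the daily re-balancing churns the portfolio, one possible sketch (my own addition, not part of the original strategy) compares consecutive weight vectors collected in all_weights:

# One-way turnover per rebalance: half the sum of absolute weight changes
weights_df = pd.DataFrame(all_weights, index=date).fillna(0)
daily_turnover = weights_df.diff().abs().sum(axis=1) / 2

print(daily_turnover.describe())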

Now we accumulate everything in a single DataFrame and see how we did over the 4-year backtest period. Time to evaluate our results. (Quite excited!)

df_accounts = pd.DataFrame({"date": date, "invested": investment, "value_next_day": got_back, "balance": balances})
df_accounts['total_portfolio'] = df_accounts['invested'] + df_accounts['balance']
df_accounts['returns'] = df_accounts['invested'].pct_change()

df_accounts
The Total Investment Details

Now, to generate a summary of our investment, we will create a function called financial_summary, which summarises how our portfolio of MFs did over the 4-year period.

def financial_summary(df_rets) -> pd.DataFrame:
    '''
    Must supply a dataframe with date and daily returns as columns.
    Outputs: pd.DataFrame()
    '''
    df_rets['c_ret'] = (1 + df_rets['returns']).cumprod() - 1
    days = (pd.to_datetime(df_rets.iloc[len(df_rets) - 1, 0]) - pd.to_datetime(df_rets.iloc[0, 0])) // np.timedelta64(1, 'D')
    volatility = np.std(df_rets['returns']) * np.sqrt(252)
    returns = (df_rets['c_ret'].values[-1] / days) * 252
    sharpe = returns / volatility
    sortino = returns / (np.std(df_rets[df_rets['returns'] < 0]['returns']) * np.sqrt(252))

    return pd.DataFrame(data=[df_rets.iloc[0, 0],
                              df_rets.iloc[len(df_rets) - 1, 0],
                              days,
                              np.around(returns * 100, 2),
                              np.around(volatility * 100, 2),
                              np.around(sharpe, 2),
                              # drop the NaN from the first pct_change row before computing kurtosis
                              np.around(kurtosis(df_rets['returns'].dropna(), fisher=False), 2),
                              np.around(sortino, 2)],
                        columns=['Summary'],
                        index=['Start Date', 'End Date', 'Time Period (in Days)', 'Annual Return %',
                               'Annual Volatility %', 'Sharpe Ratio', 'Kurtosis', 'Sortino Ratio'])

financial_summary(df_accounts)
Summary of Portfolio

Creating a portfolio is not enough; we need to compare it against a well-defined benchmark. Since our experiment is based on the Indian market, we will choose the NIFTY50 index as our baseline.

baseline = yf.download("^NSEI", start="2018-12-31", progress=False).loc[df_accounts.iloc[0, 0]:df_accounts.iloc[-1, 0]][['Adj Close']]
baseline.index = pd.Series(baseline.index).apply(lambda x: x.strftime("%Y-%m-%d"))
baseline["returns"] = baseline['Adj Close'].pct_change()
baseline = baseline.reset_index().rename(columns={'Date':'date'})

financial_summary(baseline)
Baseline Summary

Comparison of NIFTY50 and our Portfolio.

Cumulative Return of Baseline vs Portfolio
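
The chart above can be reproduced with the plotly import from the top of the post. A minimal sketch, assuming df_accounts and baseline as built above (and filling the first NaN return with 0 before compounding):

# Cumulative return of the MF portfolio vs the NIFTY50 baseline
port_cum = (1 + df_accounts['returns'].fillna(0)).cumprod() - 1
bench_cum = (1 + baseline['returns'].fillna(0)).cumprod() - 1

fig = go.Figure()
fig.add_trace(go.Scatter(x=df_accounts['date'], y=port_cum, name="MF Portfolio"))
fig.add_trace(go.Scatter(x=baseline['date'], y=bench_cum, name="NIFTY50"))
fig.update_layout(title="Cumulative Return: Baseline vs Portfolio",
                  xaxis_title="Date", yaxis_title="Cumulative Return")
fig.show()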

Conclusions

  • Though the benchmark index generated a higher return over the years, the Sharpe ratio of our portfolio is considerably higher (1.17; any Sharpe ratio above 1 is generally considered good).
  • The volatility (i.e. the risk) of the MF portfolio is much lower than that of the benchmark index.
  • We did great during COVID-19, when almost everything went for a toss; we actually succeeded in creating a hedge portfolio that serves its purpose. When the index fund went down by -29%, we were still up +16%.
  • It did fail to capture the upside potential, which is fine because we only intended to create a portfolio that protects us against the downside, and it did that quite remarkably.
  • One major drawback of our experiment is that we chose our funds out of 100 random funds (it is possible that we missed out on some even better funds).

There can be many such strategies; each one depends on an individual’s circumstances and their respective financial needs. We could even add some stocks alongside the MFs to capture the upside potential and thereby keep a good balance between risk and reward. Thank you for reading the article.

If you enjoyed reading the article do let me know! See ya geeks!

Data Scientist @NX Block Trades Pvt Ltd | Ex-Junior Data Scientist @Cloudcraftz Solutions Pvt Ltd | MBA (FinTech) @BITS https://www.linkedin.com/in/neelakash-c