Overview

This is the second blog post of my Finaical Analysis in Python series.

This blog series covers topics like the following:

  • how to visualize the long-term investment plan
  • backtest dollar-cost averaging strategy for long-term investment
  • backtest value averaging strategy for long-term investment
  • compare different investment strategies for short-term investment

The jupyter notebooks can be downloaded here.

Disclaimer

Investing money into anything is often involved with risk. Please do your own research before investing and be responsible for your own investment decisions.

I am just learning investment on my own and want to share some codes that I have written that might be useful for others.

The content here is only for informational purpose (instead of taking them as professional investment advice).

Introduction

In the last notebook, we already talked about how to visualize the return of a long term investment plan.

We made an assumption that the annual return rate is about 5%. How can we ensure that we can invest our savings properly so that we can have a 5% annual return?

In this notebook, we are going to use the historical data of SP500 to do some backtesting.

We will calculate the overall rate of return and the internal rate of return for two dollor cost averaging methods.

# Import libraries
import datetime
import warnings

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from pandas_datareader import data as pdr

plt.style.use('bmh')
warnings.filterwarnings("ignore")

A peak into the history

The first step here is to download data. I used a method introducted in a datacamp tutorial.

The total return index of SP500 between 1988 and 2020 is used here.

# Get historical data
startdate = datetime.datetime(1988, 1, 1)
enddate = datetime.datetime(2021, 1, 1)

# SP 500 Total return index
SP = pdr.get_data_yahoo('^SP500TR', start=startdate, end=enddate)

# If you want to use the price index of SP500, you can use the ticker of '^GSPC'
# SP = pdr.get_data_yahoo('^GSPC', start=startdate, end=enddate)
SP
Date High Low Open Close Volume Adj Close
1988-01-04 256.019989 256.019989 256.019989 256.019989 0 256.019989
1988-01-05 258.769989 258.769989 258.769989 258.769989 0 258.769989
1988-01-06 259.029999 259.029999 259.029999 259.029999 0 259.029999
1988-01-07 261.209991 261.209991 261.209991 261.209991 0 261.209991
1988-01-08 243.550003 243.550003 243.550003 243.550003 0 243.550003
... ... ... ... ... ... ...
2020-08-17 6954.669922 6937.540039 6940.890137 6943.220215 0 6943.220215
2020-08-18 6971.020020 6920.080078 6954.689941 6960.299805 0 6960.299805
2020-08-19 6981.410156 6920.370117 6967.069824 6930.799805 0 6930.799805
2020-08-20 6963.520020 6889.850098 6901.520020 6952.919922 0 6952.919922
2020-08-21 6982.970215 6941.000000 6954.390137 6977.270020 0 6977.270020
# Generate group by keys
SP['date'] = SP.index
SP['year'] = SP['date'].apply(lambda d: f'{d.year}')

# Group by year, take the adjusted close price of the first business day in each year
df = SP[['year', 'Adj Close']].sort_index().groupby('year').first()

# Generate annual return
df['return'] = df['Adj Close'].pct_change()

# Show annual return by the beginning of next year
df['return_next_y'] = df['return'].shift(-1)

df = df.drop(columns='return')
df = df.reset_index()

df
year Adj Close return_next_y
0 1988 256.019989 0.115733
1 1989 285.649994 0.351864
2 1990 386.160004 -0.058836
3 1991 363.440002 0.320273
4 1992 479.839996 0.075358
5 1993 516.000000 0.099031
6 1994 567.099976 0.014848
7 1995 575.520020 0.386954
8 1996 798.219971 0.214014
9 1997 969.049988 0.346721
10 1998 1305.040039 0.278520
11 1999 1668.520020 0.199932
12 2000 2002.109985 -0.108011
13 2001 1785.859985 -0.088109
14 2002 1628.510010 -0.199778
15 2003 1303.170044 0.241626
16 2004 1618.050049 0.103155
17 2005 1784.959961 0.075072
18 2006 1918.959961 0.138075
19 2007 2183.919922 0.040977
20 2008 2273.409912 -0.340563
21 2009 1499.170044 0.245396
22 2010 1867.060059 0.145277
23 2011 2138.300049 0.025300
24 2012 2192.399902 0.171570
25 2013 2568.550049 0.279590
26 2014 3286.689941 0.146649
27 2015 3768.679932 -0.001244
28 2016 3763.989990 0.146411
29 2017 4315.080078 0.218119
30 2018 5256.279785 -0.050553
31 2019 4990.560059 0.324358
32 2020 6609.290039 NaN
# Describe the annual return
df['return_next_y'].describe()

Output:

count    32.000000
mean      0.120554
std       0.169035
min      -0.340563
25%       0.022687
50%       0.141676
75%       0.242569
max       0.386954
Name: return_next_y, dtype: float64

When checking the annual return rate, the minimum value is -34.1% and maximum value is 38.7%. So we should not be panic if we see a drop of 30-40% in the total value in a certain year. Try to buy more in the dip, and the bull market will follow in 1-3 years.

Define functions

def calc_dollar_cost_averaging_return(df_dc: pd.DataFrame, value_per_period: float = 1.0, 
                                      increase_investment_per_period: float = 0.0,
                                      display_returns: bool = True) -> pd.DataFrame:
    """
    This function calculates the overall rate of return and the internal rate of return of dollar cost averaging method. 
    :param df_dc: Original dataframe that contains the price at the beginning of each period
    :param value_per_period: Investment value per period
    :param increase_investment_per_period: Increase the investment by x each period
    :param display_returns: Whether to display return results or not
    :return: A dataframe that contains all the relevant info
    """
    # amount to invest each period
    df_dc.loc[0, 'should_invest'] = value_per_period   
    # total amount of investment after the investment this period
    df_dc.loc[0, 'total_invest'] = value_per_period   
    # the total value by the beginning of next period
    df_dc.loc[0, 'total_value_next'] = value_per_period*(1 + df_dc.loc[0, 'return_next_y'])
    df_dc.loc[0, 'overall_return_next'] = df_dc.loc[0, 'total_value_next'] / df_dc.loc[0, 'total_invest'] - 1

    for i in range(1,len(df_dc.index)):
        df_dc.loc[i, 'should_invest'] = value_per_period * (1 + increase_investment_per_period)**i
        df_dc.loc[i, 'total_invest'] = df_dc.loc[i-1, 'total_invest'] + df_dc.loc[i, 'should_invest']
        df_dc.loc[i, 'total_value_next'] = ((df_dc.loc[i-1, 'total_value_next'] + 
                                            df_dc.loc[i, 'should_invest']) *
                                            (1 + df_dc.loc[i, 'return_next_y']) 
                                           )
        df_dc.loc[i, 'overall_return_next'] = df_dc.loc[i, 'total_value_next'] / df_dc.loc[i, 'total_invest'] - 1
        
    if display_returns:
        
        # Calculate overall return
        overall_return = round(df_dc.loc[len(df_dc.index) - 2, 'overall_return_next']*100, 2)
        print(f"The overall return rate is {overall_return}%")

        # Calculate IRR
        cf = (-df_dc['should_invest']).tolist()
        cf[-1] = df_dc.loc[len(df_dc.index) - 2, 'total_value_next']
        print("The cash flow (the negative sign stands for investment):")
        display(cf)
        irr = round(100*np.irr(cf), 2)
        print(f"The internal return rate (IRR) is {irr}% per period")
        
        # Calculate CAGR
        cagr = round(100*((df_dc.loc[len(df_dc.index) - 2, 'total_value_next']/
                           df_dc.loc[len(df_dc.index) - 2, 'total_invest'])**(1/(len(df_dc.index) - 1)) - 1), 2)
        
        print(f"The compound annual growth rate (CAGR) is {cagr}% per period")
        
    return df_dc


def plot_changes(df: pd.DataFrame,
                 y1: str = 'total_invest', y2: str = 'total_value_next',
                 xlabel: str = 'period', ylabel: str = 'value',
                 title: str = 'Visualize total investment and value each year',
                 target: int = None) -> None:
    """
    This function visualizes the total investment and the total value of the investment plan across time
    :param df: A dataframe contains total investment and total value
    :param y1: Column name
    :param y2: Column name
    :param xlabel: X label
    :param ylabel: Y label
    :param title: Title of the graph
    :return: A plot
    """
    plt.figure(figsize=(15, 5))
    plt.plot(df[y1], label=y1)
    plt.plot(df[y2], label=y2)
    if target:
        plt.axhline(y=target, c='black', label='target value')
    plt.xlim([0, len(df[y1])-1])
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.title(title)
    plt.legend()
    plt.show()

Plan 1: Yearly dollar cost averaging buying method

Imagine we were able to invest 1200 dollar per year into a SP500 index based ETF starting from 1988.

What’s the overall return after 32 years?

What’s the average annual return?

df1 = calc_dollar_cost_averaging_return(df, value_per_period=1200)

Output:

The overall return rate is 580.19%
The cash flow (the negative sign stands for investment):

[-1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 -1200.0,
 261192.11446741945]

The internal return rate (IRR) is 9.92% per period
The compound annual growth rate (CAGR) is 6.17% per period

df1:

year Adj Close return_next_y should_invest total_invest total_value_next overall_return_next
0 1988 256.019989 0.115733 1200.0 1200.0 1338.879804 0.115733
1 1989 285.649994 0.351864 1200.0 2400.0 3432.220743 0.430092
2 1990 386.160004 -0.058836 1200.0 3600.0 4359.680708 0.211022
3 1991 363.440002 0.320273 1200.0 4800.0 7340.295930 0.529228
4 1992 479.839996 0.075358 1200.0 6000.0 9183.879488 0.530647
5 1993 516.000000 0.099031 1200.0 7200.0 11412.205047 0.585028
6 1994 567.099976 0.014848 1200.0 8400.0 12799.465363 0.523746
7 1995 575.520020 0.386954 1200.0 9600.0 19416.618801 1.022564
8 1996 798.219971 0.214014 1200.0 10800.0 25028.857872 1.317487
9 1997 969.049988 0.346721 1200.0 12000.0 35322.955610 1.943580
10 1998 1305.040039 0.278520 1200.0 13200.0 46695.335609 2.537525
11 1999 1668.520020 0.199932 1200.0 14400.0 57471.129235 2.991051
12 2000 2002.109985 -0.108011 1200.0 15600.0 52333.999013 2.354744
13 2001 1785.859985 -0.088109 1200.0 16800.0 48817.182741 1.905785
14 2002 1628.510010 -0.199778 1200.0 18000.0 40024.865577 1.223604
15 2003 1303.170044 0.241626 1200.0 19200.0 51185.872534 1.665931
16 2004 1618.050049 0.103155 1200.0 20400.0 57789.735898 1.832830
17 2005 1784.959961 0.075072 1200.0 21600.0 63418.196358 1.936028
18 2006 1918.959961 0.138075 1200.0 22800.0 73540.339150 2.225453
19 2007 2183.919922 0.040977 1200.0 24000.0 77802.957039 2.241790
20 2008 2273.409912 -0.340563 1200.0 25200.0 52097.453233 1.067359
21 2009 1499.170044 0.245396 1200.0 26400.0 66376.423781 1.514258
22 2010 1867.060059 0.145277 1200.0 27600.0 77393.691544 1.804119
23 2011 2138.300049 0.025300 1200.0 28800.0 80582.143634 1.797991
24 2012 2192.399902 0.171570 1200.0 30000.0 95813.509570 2.193784
25 2013 2568.550049 0.279590 1200.0 31200.0 124137.478352 2.978765
26 2014 3286.689941 0.146649 1200.0 32400.0 143718.101728 3.435744
27 2015 3768.679932 -0.001244 1200.0 33600.0 144737.758102 3.307671
28 2016 3763.989990 0.146411 1200.0 34800.0 167304.672506 3.807606
29 2017 4315.080078 0.218119 1200.0 36000.0 205258.694569 4.701630
30 2018 5256.279785 -0.050553 1200.0 37200.0 196021.626888 4.269399
31 2019 4990.560059 0.324358 1200.0 38400.0 261192.114467 5.801878
32 2020 6609.290039 NaN 1200.0 39600.0 NaN NaN
plot_changes(df1, title='Plan 1', xlabel='year')

png

Plan 2: Increase the investment each year by 2%

In Plan 2, let’s assume the initial_investment is 1200, and then we increase the investment each year by 2%.

# Assuming initial_investment is 1200, and then increase investment each year by 2%
df2 = calc_dollar_cost_averaging_return(df, value_per_period=1200, increase_investment_per_period=0.02)

Output:

The overall return rate is 486.51%
The cash flow (the negative sign stands for investment):
   
[-1200.0,
 -1224.0,
 -1248.48,
 -1273.4496000000001,
 -1298.918592,
 -1324.8969638400001,
 -1351.3949031168002,
 -1378.422801179136,
 -1405.991257202719,
 -1434.1110823467732,
 -1462.7933039937088,
 -1492.049170073583,
 -1521.8901534750546,
 -1552.327956544556,
 -1583.374515675447,
 -1615.042005988956,
 -1647.3428461087349,
 -1680.2897030309098,
 -1713.8954970915281,
 -1748.1734070333584,
 -1783.1368751740258,
 -1818.7996126775063,
 -1855.1756049310566,
 -1892.2791170296778,
 -1930.1246993702714,
 -1968.7271933576767,
 -2008.1017372248302,
 -2048.263771969327,
 -2089.2290474087135,
 -2131.0136283568877,
 -2173.633900924026,
 -2217.106578942506,
 311273.63335198513]

The internal return rate (IRR) is 9.91% per period
The compound annual growth rate (CAGR) is 5.68% per period

df2:

year Adj Close return_next_y should_invest total_invest total_value_next overall_return_next
0 1988 256.019989 0.115733 1200.000000 1200.000000 1338.879804 0.115733
1 1989 285.649994 0.351864 1224.000000 2424.000000 3464.665484 0.429317
2 1990 386.160004 -0.058836 1248.480000 3672.480000 4435.844184 0.207861
3 1991 363.440002 0.320273 1273.449600 4945.929600 7537.826024 0.524046
4 1992 479.839996 0.075358 1298.918592 6244.848192 9502.668090 0.521681
5 1993 516.000000 0.099031 1324.896964 7569.745156 11899.829221 0.572025
6 1994 567.099976 0.014848 1351.394903 8921.140059 13447.972306 0.507428
7 1995 575.520020 0.386954 1378.422801 10299.562860 20563.532573 0.996544
8 1996 798.219971 0.214014 1405.991257 11705.554117 26671.303126 1.278517
9 1997 969.049988 0.346721 1434.111082 13139.665200 37850.153571 1.880603
10 1998 1305.040039 0.278520 1462.793304 14602.458504 50262.395731 2.442050
11 1999 1668.520020 0.199932 1492.049170 16094.507674 62101.796628 2.858571
12 2000 2002.109985 -0.108011 1521.890153 17616.397827 56751.625622 2.221523
13 2001 1785.859985 -0.088109 1552.327957 19168.725784 53166.862347 1.773625
14 2002 1628.510010 -0.199778 1583.374516 20752.100299 43812.361085 1.111225
15 2003 1303.170044 0.241626 1615.042006 22367.142305 56403.853151 1.521728
16 2004 1618.050049 0.103155 1647.342846 24014.485152 64039.465661 1.666702
17 2005 1784.959961 0.075072 1680.289703 25694.774855 70653.449907 1.749721
18 2006 1918.959961 0.138075 1713.895497 27408.670352 82359.450191 2.004869
19 2007 2183.919922 0.040977 1748.173407 29156.843759 87554.082572 2.002866
20 2008 2273.409912 -0.340563 1783.136875 30939.980634 58912.245650 0.904082
21 2009 1499.170044 0.245396 1818.799613 32758.780247 75634.187986 1.308822
22 2010 1867.060059 0.145277 1855.175605 34613.955851 88746.748765 1.563901
23 2011 2138.300049 0.025300 1892.279117 36506.234969 92932.231838 1.545654
24 2012 2192.399902 0.171570 1930.124699 38436.359668 111137.895169 1.891478
25 2013 2568.550049 0.279590 1968.727193 40405.086861 144730.058188 2.581976
26 2014 3286.689941 0.146649 2008.101737 42413.188598 168257.173136 2.967096
27 2015 3768.679932 -0.001244 2048.263772 44461.452370 170093.499961 2.825640
28 2016 3763.989990 0.146411 2089.229047 46550.681418 197392.199679 3.240372
29 2017 4315.080078 0.218119 2131.013628 48681.695046 243042.959525 3.992492
30 2018 5256.279785 -0.050553 2173.633901 50855.328947 232820.204951 3.578089
31 2019 4990.560059 0.324358 2217.106579 53072.435526 311273.633352 4.865072
32 2020 6609.290039 NaN 2261.448711 55333.884237 NaN NaN
plot_changes(df2, title='Plan 2', xlabel='year')

png

Conclusion

The compound annual growth rate of Plan 2 is lower than that of Plan 1. This is because Plan 2 invested more money at the later stage of the investment. But, in both plans, the compound annual growth rates are more than 5%.

The internal return rate of both plans are quite similiar to each other (almost 10%).

It seems that SP500 is a good enough index that can bring us long term return. However, the historical data cannot predict the future. Please do your own research before investing.

In the next notebook, we will further discuss which strategy can further increase IRR.