Creating a time series chart

By Alexis Cancino

February 11, 2022

Foreword on Reproducibility

In order to run this notebook, you need to have R installed, ideally RStudio too. Also, you would need to install the following packages:

# NOT RUN
# install.packages(c("tidyverse", "lubridate", "glue", "skimr", "highcharter", "janitor"))

1. Introduction

The task is to develop a backtest for a strategy that trades constituents of the S&P 500 index from January 1st, 2010 through January 1st, 2020. Each day the strategy selects the 10 stocks with the highest Price Rate of Change Indicator (ROC) and buys them on an equally-weighted basis.

It’s important to verify that it only selects stocks that are part of the S&P 500 index at each moment in time, as these change every quarter. A file should be generated with the values of the equity curve (Net Asset Value) over time and a file of weights or positions where you can see the stocks that were selected each day.

1.1 The S&P 500 Index

The S&P 500 index is one of the most-widely known indexes in finance. According to S&P Dow Jones Indices: ‘the S&P 500® is widely regarded as the best single gauge of large-cap U.S. equities. There is over USD 11.2 trillion indexed or benchmarked to the index, with indexed assets comprising approximately USD 4.6 trillion of this total. The index includes 500 leading companies and covers approximately 80% of available market capitalization.’

The index measures the performance of the large-cap segment of the market. Considered to be a proxy of the U.S. equity market, the index is composed of 500 constituent companies.

According to the S&P 500’s methodology, the index is weighted by float-adjusted market capitalization and is rebalanced on a quarterly basis.

1.2 The Strategy

The strategy consists of selecting the 10 stocks with the highest Price Rate of Change Indicator (ROC) and buys them on an equally-weighted basis. Each day before the market opens, we look back on the Price ROCs of the previous close and buy the 10 stocks with the highest ROC.

1.3 The Price Rate of Change (ROC)

The Price ROC for 200 days is calculated as:

\begin{equation} \text{ROC}_{200d} = \left(\dfrac{P_t}{P_{t-200}}-1\right) \times 100 \end{equation}

2. Constituents come first

We were provided with two datasets: closes.csv and constituents.csv. We will begin by exploring the latter.

2.1 Constituents Data EDA

We import the constituents.csv data.

constituents <- read_csv("data/constituents.csv", col_types = "dDcccccc") %>% 
  # Remove first column, which has the index
  select(-1)

## New names:
## * `` -> ...1

constituents

## # A tibble: 48,388 x 7
##    date       action     ticker name               contraticker contraname note 
##    <date>     <chr>      <chr>  <chr>              <chr>        <chr>      <chr>
##  1 2020-12-31 historical ZTS    Zoetis Inc         <NA>         <NA>       <NA> 
##  2 2020-12-31 current    ZTS    Zoetis Inc         <NA>         <NA>       <NA> 
##  3 2020-12-31 historical ZION   Zions Bancorporat~ <NA>         <NA>       <NA> 
##  4 2020-12-31 current    ZION   Zions Bancorporat~ <NA>         <NA>       <NA> 
##  5 2020-12-31 historical ZBRA   Zebra Technologie~ <NA>         <NA>       <NA> 
##  6 2020-12-31 current    ZBRA   Zebra Technologie~ <NA>         <NA>       <NA> 
##  7 2020-12-31 historical ZBH    Zimmer Biomet Hol~ <NA>         <NA>       <NA> 
##  8 2020-12-31 current    ZBH    Zimmer Biomet Hol~ <NA>         <NA>       <NA> 
##  9 2020-12-31 historical YUM    Yum Brands Inc     <NA>         <NA>       <NA> 
## 10 2020-12-31 current    YUM    Yum Brands Inc     <NA>         <NA>       <NA> 
## # ... with 48,378 more rows

Let’s use the skimr package to print a quick summary of missing and unique values of the data:

constituents %>% 
  skimr::skim()


Name	Piped data
Number of rows	48388
Number of columns	7
_______________________
Column type frequency:
character	6
Date	1
________________________
Group variables	None

Table 1: Data summary

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
action	0	1.00	5	10	4
ticker	0	1.00	1	5	1112
name	0	1.00	5	48	1093
contraticker	47118	0.03	1	5	927
contraname	47118	0.03	5	48	913
note	48381	0.00	6	29	2

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
date	0	1	1957-03-04	2020-12-31	2009-06-30	856

Some notes on the table above:

The dataset consists of 6 character-type variables and 1 date-type variable.
According to the table above, there are 4 unique values for the action variable. Let’s see what are those:

constituents %>% distinct(action)

## # A tibble: 4 x 1
##   action    
##   <chr>     
## 1 historical
## 2 current   
## 3 added     
## 4 removed

The data goes from 1957-03-04 through 2020-12-31 but we don’t have daily data since there are only 856 unique dates.
The columns contraticker, contraname and note are the only columns that have missing values, which is both logical and acceptable.
The contraticker and contraname columns see to be the opposite of the stocks that are added/removed to the index. Let’s check that with some steps:

Check all possible different action values when there are no missing values of either the contraticker or contraname variables.

constituents %>% 
  filter(!is.na(contraticker) | !is.na(contraname)) %>% 
  distinct(action)

## # A tibble: 2 x 1
##   action 
##   <chr>  
## 1 added  
## 2 removed

Inspect some rows where there are no missing values of either the contraticker or contraname variables.

constituents %>% 
  filter(!is.na(contraticker) | !is.na(contraname))

## # A tibble: 1,270 x 7
##    date       action  ticker name                  contraticker contraname note 
##    <date>     <chr>   <chr>  <chr>                 <chr>        <chr>      <chr>
##  1 2020-12-21 added   TSLA   Tesla Inc             AIV          Apartment~ <NA> 
##  2 2020-12-21 removed AIV    Apartment Investment~ TSLA         Tesla Inc  <NA> 
##  3 2020-10-12 added   VNT    Vontier Corp          NBL          Noble Ene~ <NA> 
##  4 2020-10-12 removed NBL    Noble Energy Inc      VNT          Vontier C~ <NA> 
##  5 2020-09-21 added   TER    Teradyne Inc          COTY         Coty Inc   <NA> 
##  6 2020-09-21 removed KSS    KOHLS Corp            CTLT         Catalent ~ <NA> 
##  7 2020-09-21 removed HRB    H&R Block Inc         ETSY         Etsy Inc   <NA> 
##  8 2020-09-21 added   ETSY   Etsy Inc              HRB          H&R Block~ <NA> 
##  9 2020-09-21 added   CTLT   Catalent Inc          KSS          KOHLS Corp <NA> 
## 10 2020-09-21 removed COTY   Coty Inc              TER          Teradyne ~ <NA> 
## # ... with 1,260 more rows

Indeed this is the case. We can see, for example, on the first row: TSLA is being added and replaced by AIV. On the next row we see the opposite operation: AIV is being removed and substituted by TSLA.

Another important check we must do is that there are no errors in the ticker-name relationship. That is, for every (ticker, name) pair there must be just one count per value-key pair.

We first get the count for each key-value pair:

constituents %>% distinct(ticker, name) %>% count(ticker, name, name = "count")

## # A tibble: 1,112 x 3
##    ticker name                        count
##    <chr>  <chr>                       <int>
##  1 A      Agilent Technologies Inc        1
##  2 AAL    American Airlines Group Inc     1
##  3 AAMRQ  Amr Corp                        1
##  4 AAP    Advance Auto Parts Inc          1
##  5 AAPL   Apple Inc                       1
##  6 ABBV   AbbVie Inc                      1
##  7 ABC    Amerisourcebergen Corp          1
##  8 ABI1   Applied Biosystems Inc          1
##  9 ABKFQ  Ambac Financial Group Inc       1
## 10 ABMD   Abiomed Inc                     1
## # ... with 1,102 more rows

Now we filter the counts that are greater than 1:

constituents %>% 
  distinct(ticker, name) %>% 
  count(ticker, name, name = "count") %>% 
  filter(count > 1)

## # A tibble: 0 x 3
## # ... with 3 variables: ticker <chr>, name <chr>, count <int>

We get no rows, which means that the data is correct.

2.1.1 Time variables and `action` in the `constituents` dataset

In this section we’ll create a time series that will contain, for each date, the constituents of the S&P 500 index.

Since we are focusing on time, we’ll add some time-related variables, such as:

year: The year of the time series date.
half: The half component of the index (i.e. to which semester does the date belong to).
quarter: The quarter component of the index (i.e. to which quarter does the date belong to).
month: The month component of the index (with base 1 - that is, January = 1 and so on).
month_label: The three-letter month label as an ordered categorical variable. It begins with Jan and ends with Dec.
day: The day component of the date.
wday: The day of the week with base 1. Monday = 1 and Sunday = 7.
wday_label: The three-letter label for day of the week as an ordered categorical variable. It begins with Mon and ends with Sun.
qday: The day of the quarter.
yday: The day of the year.
qid: The quarter’s ID.

All variables will have the date_ prefix so they are easy to identify. Since we can use this creation of variables for later, we create the add_ts_signature() function (so we can recycle the code).

add_ts_signature <- function(df, .date_col) {
  df %>% 
    # Create time series signature
    mutate(
      date_year        = year({{.date_col}}),
      date_half        = semester({{.date_col}}, with_year = FALSE),
      date_quarter     = quarter({{.date_col}}),
      date_month       = month({{.date_col}}),
      date_month_label = month({{.date_col}}, label = TRUE),
      date_day         = day({{.date_col}}),
      date_wday        = wday({{.date_col}}, week_start = 1),
      date_wday_label  = wday({{.date_col}}, label = TRUE),
      date_qday        = qday({{.date_col}}),
      date_yday        = yday({{.date_col}}),
      date_qid         = str_c(date_quarter,"Q", str_sub(date_year, 3L)) %>% as_factor()
    )
}

constituents_ts <- constituents %>% 
  add_ts_signature(.date_col = date)

How are actions distributed along the data?

constituents_ts %>% count(action, sort = T)

## # A tibble: 4 x 2
##   action         n
##   <chr>      <int>
## 1 historical 46110
## 2 added       1137
## 3 removed      636
## 4 current      505

current appears to correspond to the index’s most recent constituents. We can check if this is the case by taking the unique values for the date variable when the action variable has the current value.

constituents_ts %>% 
  filter(
    action == "current"
  ) %>% 
  distinct(date)

## # A tibble: 1 x 1
##   date      
##   <date>    
## 1 2020-12-31

Indeed, the only date where a current value appears is 2020-12-31

Let’s inspect the data for the first date 1957-03-04:

constituents_ts %>% 
  filter(
    date == ymd("1957-03-04")
  )

## # A tibble: 97 x 18
##    date       action  ticker name        contraticker contraname note  date_year
##    <date>     <chr>   <chr>  <chr>       <chr>        <chr>      <chr>     <dbl>
##  1 1957-03-04 added   WWY    Wrigley Wm~ <NA>         <NA>       <NA>       1957
##  2 1957-03-04 added   WNDXQ  Winn Dixie~ <NA>         <NA>       <NA>       1957
##  3 1957-03-04 added   WLA    Warner Lam~ <NA>         <NA>       <NA>       1957
##  4 1957-03-04 added   WGL    Wgl Holdin~ <NA>         <NA>       <NA>       1957
##  5 1957-03-04 removed WAB    Westinghou~ GT           Goodyear ~ <NA>       1957
##  6 1957-03-04 added   VO1    Seagram Co~ <NA>         <NA>       <NA>       1957
##  7 1957-03-04 added   USG    Usg Corp    <NA>         <NA>       <NA>       1957
##  8 1957-03-04 added   UNP    Union Paci~ <NA>         <NA>       <NA>       1957
##  9 1957-03-04 added   UK1    Union Carb~ <NA>         <NA>       <NA>       1957
## 10 1957-03-04 added   UCC1   Union Camp~ <NA>         <NA>       <NA>       1957
## # ... with 87 more rows, and 10 more variables: date_half <int>,
## #   date_quarter <int>, date_month <dbl>, date_month_label <ord>,
## #   date_day <int>, date_wday <dbl>, date_wday_label <ord>, date_qday <dbl>,
## #   date_yday <dbl>, date_qid <fct>

It’s clear that we don’t have the initial S&P 500 sample, but we could reverse engineer the constituents at each date. Let’s start by the final date 2020-12-31:

constituents %>% 
  filter(date == ymd("2020-12-31"))

## # A tibble: 1,010 x 7
##    date       action     ticker name               contraticker contraname note 
##    <date>     <chr>      <chr>  <chr>              <chr>        <chr>      <chr>
##  1 2020-12-31 historical ZTS    Zoetis Inc         <NA>         <NA>       <NA> 
##  2 2020-12-31 current    ZTS    Zoetis Inc         <NA>         <NA>       <NA> 
##  3 2020-12-31 historical ZION   Zions Bancorporat~ <NA>         <NA>       <NA> 
##  4 2020-12-31 current    ZION   Zions Bancorporat~ <NA>         <NA>       <NA> 
##  5 2020-12-31 historical ZBRA   Zebra Technologie~ <NA>         <NA>       <NA> 
##  6 2020-12-31 current    ZBRA   Zebra Technologie~ <NA>         <NA>       <NA> 
##  7 2020-12-31 historical ZBH    Zimmer Biomet Hol~ <NA>         <NA>       <NA> 
##  8 2020-12-31 current    ZBH    Zimmer Biomet Hol~ <NA>         <NA>       <NA> 
##  9 2020-12-31 historical YUM    Yum Brands Inc     <NA>         <NA>       <NA> 
## 10 2020-12-31 current    YUM    Yum Brands Inc     <NA>         <NA>       <NA> 
## # ... with 1,000 more rows

We see that we have sort-of-duplicated rows; we note that one action label corresponds to historical and the other one to current. Let’s check that we actually have the same number per category:

constituents %>% 
  filter(date == ymd("2020-12-31")) %>% 
  count(action)

## # A tibble: 2 x 2
##   action         n
##   <chr>      <int>
## 1 current      505
## 2 historical   505

In order to get a better sense of what’s going on with the action column, we pivot the data to a wide format so we can clearly see the difference between each action value. By doing this, we can have a single column for each of the 4 unique values in the action variable.

constituents_wide <- constituents_ts %>% 
  pivot_wider(
    names_from = action,
    values_from = name
  ) %>% 
  relocate(historical, current, added, removed, .after = ticker)

constituents_wide

## # A tibble: 47,883 x 20
##    date       ticker historical    current added removed contraticker contraname
##    <date>     <chr>  <chr>         <chr>   <chr> <chr>   <chr>        <chr>     
##  1 2020-12-31 ZTS    Zoetis Inc    Zoetis~ <NA>  <NA>    <NA>         <NA>      
##  2 2020-12-31 ZION   Zions Bancor~ Zions ~ <NA>  <NA>    <NA>         <NA>      
##  3 2020-12-31 ZBRA   Zebra Techno~ Zebra ~ <NA>  <NA>    <NA>         <NA>      
##  4 2020-12-31 ZBH    Zimmer Biome~ Zimmer~ <NA>  <NA>    <NA>         <NA>      
##  5 2020-12-31 YUM    Yum Brands I~ Yum Br~ <NA>  <NA>    <NA>         <NA>      
##  6 2020-12-31 XYL    Xylem Inc     Xylem ~ <NA>  <NA>    <NA>         <NA>      
##  7 2020-12-31 XRX    Xerox Holdin~ Xerox ~ <NA>  <NA>    <NA>         <NA>      
##  8 2020-12-31 XRAY   DENTSPLY SIR~ DENTSP~ <NA>  <NA>    <NA>         <NA>      
##  9 2020-12-31 XOM    Exxon Mobil ~ Exxon ~ <NA>  <NA>    <NA>         <NA>      
## 10 2020-12-31 XLNX   Xilinx Inc    Xilinx~ <NA>  <NA>    <NA>         <NA>      
## # ... with 47,873 more rows, and 12 more variables: note <chr>,
## #   date_year <dbl>, date_half <int>, date_quarter <int>, date_month <dbl>,
## #   date_month_label <ord>, date_day <int>, date_wday <dbl>,
## #   date_wday_label <ord>, date_qday <dbl>, date_yday <dbl>, date_qid <fct>

Inspecting the data, it appears that the historical variable gives us the constituents at each quarter. This is backed by the fact that on the latest available date, we have the current constituents which correspond to the 4Q20. However, as mentioned above, we have the exact same rows but with a historical label.

Let’s inspect the dates on which there are no missing values for the historical action. We can check the unique values for the month number:

constituents_wide %>% 
  filter(!is.na(historical)) %>%
  distinct(date_month, date_quarter)

## # A tibble: 4 x 2
##   date_quarter date_month
##          <int>      <dbl>
## 1            4         12
## 2            3          9
## 3            2          6
## 4            1          3

Indeed, the only months were historical actions are not missing are mod 3 months (i.e. the end of each quarter).

This points to the fact that probably the historical value indicates the constituents per quarter on the S&P 500. We can further check the amount of historical values per quarter. The number should be around 500, of course.

constituents_wide %>% 
  select(date_year, date_qid, historical) %>% 
  filter(date_year > 2008, !is.na(historical)) %>% 
  group_by(date_year, date_qid) %>%  
  nest() %>% 
  mutate(constituent_count = map_int(data, nrow)) %>% 
  select(-data) %>% 
  print(n = 45)

## # A tibble: 48 x 3
## # Groups:   date_year, date_qid [48]
##    date_year date_qid constituent_count
##        <dbl> <fct>                <int>
##  1      2020 4Q20                   505
##  2      2020 3Q20                   505
##  3      2020 2Q20                   505
##  4      2020 1Q20                   505
##  5      2019 4Q19                   505
##  6      2019 3Q19                   505
##  7      2019 2Q19                   505
##  8      2019 1Q19                   505
##  9      2018 4Q18                   505
## 10      2018 3Q18                   505
## 11      2018 2Q18                   505
## 12      2018 1Q18                   505
## 13      2017 4Q17                   505
## 14      2017 3Q17                   505
## 15      2017 2Q17                   505
## 16      2017 1Q17                   505
## 17      2016 4Q16                   505
## 18      2016 3Q16                   505
## 19      2016 2Q16                   505
## 20      2016 1Q16                   504
## 21      2015 4Q15                   504
## 22      2015 3Q15                   505
## 23      2015 2Q15                   502
## 24      2015 1Q15                   502
## 25      2014 4Q14                   502
## 26      2014 3Q14                   502
## 27      2014 2Q14                   500
## 28      2014 1Q14                   500
## 29      2013 4Q13                   500
## 30      2013 3Q13                   500
## 31      2013 2Q13                   500
## 32      2013 1Q13                   500
## 33      2012 4Q12                   500
## 34      2012 3Q12                   500
## 35      2012 2Q12                   500
## 36      2012 1Q12                   500
## 37      2011 4Q11                   500
## 38      2011 3Q11                   500
## 39      2011 2Q11                   500
## 40      2011 1Q11                   500
## 41      2010 4Q10                   500
## 42      2010 3Q10                   500
## 43      2010 2Q10                   500
## 44      2010 1Q10                   494
## 45      2009 4Q09                   500
## # ... with 3 more rows

Indeed the number is very close to 500 among all quarters.

One last check we can perform before we can use the historical rows as our constituents has to do with the current rows. We can reverse engineer the 3Q20 constituents by starting from the 4Q20 (the current constituents).

We can first see what the changes on constituents were between the 3Q20 and 4Q20. We can separate the adds and drops into two data frames, like so:

adds_drops <- constituents_wide %>% 
  filter(!is.na(added) | !is.na(removed) , date_qid %in% c("4Q20")) %>% 
  select(date, ticker, added, removed, date_qid)

adds_drops %>% 
  filter(!is.na(added)) %>% 
  select(-removed)

## # A tibble: 2 x 4
##   date       ticker added        date_qid
##   <date>     <chr>  <chr>        <fct>   
## 1 2020-12-21 TSLA   Tesla Inc    4Q20    
## 2 2020-10-12 VNT    Vontier Corp 4Q20

So TSLA and VNT were added in the 4Q20.

adds_drops %>% 
  filter(!is.na(removed)) %>% 
  select(-added)

## # A tibble: 2 x 4
##   date       ticker removed                              date_qid
##   <date>     <chr>  <chr>                                <fct>   
## 1 2020-12-21 AIV    Apartment Investment & Management Co 4Q20    
## 2 2020-10-12 NBL    Noble Energy Inc                     4Q20

NBL and AIV were removed in the 4Q20. Apparently, substituted by TSLA and VNT.

We can now do the reverse action for the adds and the drops, respectively. Thus, the drops from 3Q20-4Q20 will become our adds and the adds will become our drops.

We start by getting the adds needed to go from 4Q20 to 3Q20:

# Extract the adds from 4Q20 to 3Q20
adds_3Q <- adds_drops %>% 
  filter(!is.na(removed)) %>% 
  pull(removed)

adds_3Q

## [1] "Apartment Investment & Management Co"
## [2] "Noble Energy Inc"

Next, we get the drops needed to go from 4Q20 to 3Q20:

# Extract the drops from 4Q20 to 3Q20
drops_3Q <- adds_drops %>% 
  filter(!is.na(added)) %>% 
  pull(added)

drops_3Q

## [1] "Tesla Inc"    "Vontier Corp"

Finally, we get the constituents in 4Q20, include the adds and remove the drops, and compare against the constituents in 3Q20. We should get the same result.

# Get all the constituents for 4Q20
constituents_4Q20 <- constituents_wide %>% 
  filter(date_qid == "4Q20", !is.na(historical)) %>% 
  pull(historical) %>% sort()

# Get all the constituents for 3Q20
constituents_3Q20 <- constituents_wide %>% 
  filter(date_qid == "3Q20", !is.na(historical)) %>% 
  pull(historical) %>% sort()

# include the adds from 4Q20
calculated_3Q20 <- c(constituents_4Q20, adds_3Q)

# remove the drops from 4Q20
calculated_3Q20 <- calculated_3Q20[!calculated_3Q20 %in% drops_3Q] %>% sort()

# Check if both results are equal
all.equal(constituents_3Q20, calculated_3Q20)

## [1] TRUE

Voilà! We did get a TRUE value when comparing the reverse-engineered approach and the mere historical values. This confirms that we can use the historical values for our constituents.

2.1.2 The time series of S&P 500 constituents

We will now proceed to use these historical values for every day in the closes.csv dataset from 2010-01-01 through 2020-01-01.

# Get the required dates from the closes.csv file
required_dates <- read_csv(file = "data/closes.csv") %>% 
  # Focus on the date only
  select(date) %>% 
  
  # Focus on 2010 or later
  filter(year(date) > 2008)

## New names:
## * `` -> ...1

## Rows: 5796 Columns: 1086

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl  (1085): ...1, ZTS, ZION, ZBRA, ZBH, YUM, XYL, XRX, XRAY, XOM, XLNX, XEL...
## date    (1): date

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

We now nest a data frame for each date in the constituents dataset that contains the S&P 500 constituents.

constituents_nested <- constituents_wide %>% 
  # Focus on just the historicals and 2010 or later
  filter(date_year > 2008, !is.na(historical)) %>% 
  
  # Just take the date, ticker, and historical columns
  select(date, ticker, historical) %>% 
  
  # Nest the data
  nest(constituents = c(ticker, historical)) %>% 
  relocate(constituents, .after = 1) %>% 
  arrange(date) %>% 
  
  # Uncomment in case we want to get the number of constituents per date
  # mutate(constituent_count = map_int(constituents, nrow)) %>% 
  identity()

constituents_nested

## # A tibble: 48 x 2
##    date       constituents      
##    <date>     <list>            
##  1 2009-03-31 <tibble [500 x 2]>
##  2 2009-06-30 <tibble [500 x 2]>
##  3 2009-09-30 <tibble [500 x 2]>
##  4 2009-12-31 <tibble [500 x 2]>
##  5 2010-03-31 <tibble [494 x 2]>
##  6 2010-06-30 <tibble [500 x 2]>
##  7 2010-09-30 <tibble [500 x 2]>
##  8 2010-12-31 <tibble [500 x 2]>
##  9 2011-03-31 <tibble [500 x 2]>
## 10 2011-06-30 <tibble [500 x 2]>
## # ... with 38 more rows

We can now left_join the datasets:

constituents_nested <- required_dates %>% 
  left_join(constituents_nested, by = "date") %>% 
  fill(contains("constituent"), .direction = "up")

constituents_nested

## # A tibble: 3,028 x 2
##    date       constituents      
##    <date>     <list>            
##  1 2009-01-02 <tibble [500 x 2]>
##  2 2009-01-05 <tibble [500 x 2]>
##  3 2009-01-06 <tibble [500 x 2]>
##  4 2009-01-07 <tibble [500 x 2]>
##  5 2009-01-08 <tibble [500 x 2]>
##  6 2009-01-09 <tibble [500 x 2]>
##  7 2009-01-12 <tibble [500 x 2]>
##  8 2009-01-13 <tibble [500 x 2]>
##  9 2009-01-14 <tibble [500 x 2]>
## 10 2009-01-15 <tibble [500 x 2]>
## # ... with 3,018 more rows

Finally! We now have a time series with the corresponding composition of the S&P 500 for each date.

3. Implementing the ROC Strategy

We proceed in the following way:

First, import the data:

# Use read_csv() function to read the close data
close_tbl <- read_csv("data/closes.csv") %>% 
  select(-1)

## New names:
## * `` -> ...1

## Rows: 5796 Columns: 1086

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl  (1085): ...1, ZTS, ZION, ZBRA, ZBH, YUM, XYL, XRX, XRAY, XOM, XLNX, XEL...
## date    (1): date

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

close_tbl

## # A tibble: 5,796 x 1,085
##    date         ZTS  ZION  ZBRA   ZBH   YUM   XYL   XRX  XRAY   XOM  XLNX   XEL
##    <date>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 1997-12-31  31.0  45.4  13.2  29.8  7.26  24.2  148. 10.2   30.6  8.76  29.1
##  2 1998-01-02  31.0  45.5  13.2  29.8  7.28  24.2  150.  9.92  30.9  9.66  28.9
##  3 1998-01-05  31.0  46.4  13.1  29.8  7.06  24.2  152.  9.54  30.6  9.44  28.7
##  4 1998-01-06  31.0  45.6  13.1  29.8  6.76  24.2  151.  9.71  29.5  9.35  28.3
##  5 1998-01-07  31.0  44.9  13.0  29.8  6.75  24.2  145.  9.54  30.4  8.95  28  
##  6 1998-01-08  31.0  41.5  13    29.8  6.48  24.2  142.  9.48  29.8  9.03  28.1
##  7 1998-01-09  31.0  39.6  11.8  29.8  6.5   24.2  137.  9.02  29.1  8.66  27.8
##  8 1998-01-12  31.0  40.5  11.8  29.8  6.53  24.2  137.  9     29.4  8.52  28.0
##  9 1998-01-13  31.0  40.8  12.3  29.8  6.53  24.2  141.  9.17  30.0  9.31  28.0
## 10 1998-01-14  31.0  42.2  11.6  29.8  6.53  24.2  142.  9.29  30.4  9.22  27.9
## # ... with 5,786 more rows, and 1,073 more variables: WYNN <dbl>, WY <dbl>,
## #   WU <dbl>, WST <dbl>, WRK <dbl>, WRB <dbl>, WMT <dbl>, WMB <dbl>, WM <dbl>,
## #   WLTW <dbl>, WHR <dbl>, WFC <dbl>, WELL <dbl>, WEC <dbl>, WDC <dbl>,
## #   WBA <dbl>, WAT <dbl>, WAB <dbl>, VZ <dbl>, VTRS <dbl>, VTR <dbl>,
## #   VRTX <dbl>, VRSN <dbl>, VRSK <dbl>, VNT <dbl>, VNO <dbl>, VMC <dbl>,
## #   VLO <dbl>, VIAC <dbl>, VFC <dbl>, VAR <dbl>, V <dbl>, USB <dbl>, URI <dbl>,
## #   UPS <dbl>, UNP <dbl>, UNM <dbl>, UNH <dbl>, ULTA <dbl>, UHS <dbl>, ...

Apparently, we have daily closing data from 1997-12-31 through 2021-01-12 for 1,084 different securities.

3.1 Calculating the 200-day ROC and daily returns

Secondly, calculate the 200-day ROC. To achieve this, we create the add_price_roc() function, which takes an ndays argument to indicate the number of days for the ROC calculation. This would be useful if we’d like to explore different days for the ROC.

# This functions needs a data frame that has a date column and the rest of the columns must have the price of the securities.
add_price_roc <- function(df, .date_col, .ndays) {
  
  df %>% 
    mutate(
      across(
        .cols = -{{.date_col}},
        .fns = ~ ((.x / lag(.x, n = .ndays)) - 1) * 100,
        .names = "ROC_{.col}"
      )
    ) %>% 
    # keep rows where at least one of the ROC columns is not NA
    filter(
      if_any(contains("ROC"), ~ !is.na(.x))
    )
}

Now let’s use this function on our closes.csv data. This creates a column for each security that contains the n-day ROC for each security. The ROC numbers are in columns with the ROC prefix to easily identify them.

We also calculate the daily returns. The return numbers are located in columns with the return prefix to easily identify them.

roc_close <- close_tbl %>% 
  # Add 200-day ROC
  add_price_roc(.date_col = date, 
                .ndays = 200) %>%
  relocate(date, contains("ROC")) %>% 
  
  # Add the returns
  mutate(
    across(
      .cols = -c(date, contains("ROC")),
      .fns = ~ (.x / lag(.x)) - 1,
      .names = "return_{.col}"
    )
  )

roc_close

## # A tibble: 5,596 x 3,253
##    date       ROC_ZTS ROC_ZION ROC_ZBRA ROC_ZBH ROC_YUM ROC_XYL ROC_XRX ROC_XRAY
##    <date>       <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl>
##  1 1998-10-16       0     3.02   10.3         0    37.9       0    29.4    -32.0
##  2 1998-10-19       0     5.91   10.1         0    45.7       0    28.2    -27.7
##  3 1998-10-20       0     7.93    8.90        0    58.0       0    26.8    -23.6
##  4 1998-10-21       0     9.86   13.6         0    63.1       0    24.2    -23.4
##  5 1998-10-22       0    11.3     2.35        0    62.0       0    28.4    -21.0
##  6 1998-10-23       0    20.5     0.446       0    71.5       0    32.2    -19.5
##  7 1998-10-26       0    29.1     7.96        0    68.8       0    40.8    -13.9
##  8 1998-10-27       0    27.5    10.1         0    68.4       0    42.7    -10.6
##  9 1998-10-28       0    26.7     5.43        0    66.5       0    34.9    -10.9
## 10 1998-10-29       0    22.2    19.6         0    64.8       0    34.0    -10.3
## # ... with 5,586 more rows, and 3,244 more variables: ROC_XOM <dbl>,
## #   ROC_XLNX <dbl>, ROC_XEL <dbl>, ROC_WYNN <dbl>, ROC_WY <dbl>, ROC_WU <dbl>,
## #   ROC_WST <dbl>, ROC_WRK <dbl>, ROC_WRB <dbl>, ROC_WMT <dbl>, ROC_WMB <dbl>,
## #   ROC_WM <dbl>, ROC_WLTW <dbl>, ROC_WHR <dbl>, ROC_WFC <dbl>, ROC_WELL <dbl>,
## #   ROC_WEC <dbl>, ROC_WDC <dbl>, ROC_WBA <dbl>, ROC_WAT <dbl>, ROC_WAB <dbl>,
## #   ROC_VZ <dbl>, ROC_VTRS <dbl>, ROC_VTR <dbl>, ROC_VRTX <dbl>,
## #   ROC_VRSN <dbl>, ROC_VRSK <dbl>, ROC_VNT <dbl>, ROC_VNO <dbl>, ...

3.2 Filtering data from 2010 onward

Since the strategy will be implemented from 2010 onward, we can now filter out the data for previous years:

roc_close <- roc_close %>% 
  filter(year(date) > 2008)

3.3 Getting the data into a ‘long’ format

We get the data into a long format (instead of the wide format with 2,169 columns). Also, a separation is made in three data frames: one for ROCs, one for prices, and the other one for returns We call these roc_long, close_long, and `returns_long`` respectively.

# Build a dataframe with ROC values
roc_long <- roc_close %>% 
  select(date, contains("ROC")) %>% 
  pivot_longer(
    cols = contains("ROC"),
    names_to = "ticker",
    names_pattern = "_(.*)",
    values_to = "roc"
  ) %>% 
  add_column(data_type = "ROC")

roc_long

## # A tibble: 3,282,352 x 4
##    date       ticker    roc data_type
##    <date>     <chr>   <dbl> <chr>    
##  1 2009-01-02 ZTS      0    ROC      
##  2 2009-01-02 ZION   -47.2  ROC      
##  3 2009-01-02 ZBRA   -36.7  ROC      
##  4 2009-01-02 ZBH    -44.7  ROC      
##  5 2009-01-02 YUM    -11.6  ROC      
##  6 2009-01-02 XYL      0    ROC      
##  7 2009-01-02 XRX    -43.7  ROC      
##  8 2009-01-02 XRAY   -19.7  ROC      
##  9 2009-01-02 XOM     -3.30 ROC      
## 10 2009-01-02 XLNX   -22.5  ROC      
## # ... with 3,282,342 more rows

# Build a dataframe with price values
close_long <- roc_close %>% 
  select(-contains("ROC")) %>% 
  select(-contains("return")) %>% 
  pivot_longer(
    cols = -date,
    names_to = "ticker",
    values_to = "price"
  ) %>% 
  add_column(data_type = "close")

close_long

## # A tibble: 3,282,352 x 4
##    date       ticker price data_type
##    <date>     <chr>  <dbl> <chr>    
##  1 2009-01-02 ZTS     31.0 close    
##  2 2009-01-02 ZION    25   close    
##  3 2009-01-02 ZBRA    21.0 close    
##  4 2009-01-02 ZBH     41.8 close    
##  5 2009-01-02 YUM     32   close    
##  6 2009-01-02 XYL     24.2 close    
##  7 2009-01-02 XRX     33.5 close    
##  8 2009-01-02 XRAY    29.1 close    
##  9 2009-01-02 XOM     81.6 close    
## 10 2009-01-02 XLNX    18.4 close    
## # ... with 3,282,342 more rows

# Build a dataframe with ROC values
returns_long <- roc_close %>% 
  select(date, contains("return")) %>% 
  pivot_longer(
    cols = contains("return"),
    names_to = "ticker",
    names_pattern = "_(.*)",
    values_to = "return"
  ) %>% 
  add_column(data_type = "return")

returns_long

## # A tibble: 3,282,352 x 4
##    date       ticker return data_type
##    <date>     <chr>   <dbl> <chr>    
##  1 2009-01-02 ZTS    0      return   
##  2 2009-01-02 ZION   0.0200 return   
##  3 2009-01-02 ZBRA   0.0380 return   
##  4 2009-01-02 ZBH    0.0332 return   
##  5 2009-01-02 YUM    0.0159 return   
##  6 2009-01-02 XYL    0      return   
##  7 2009-01-02 XRX    0.0514 return   
##  8 2009-01-02 XRAY   0.0308 return   
##  9 2009-01-02 XOM    0.0227 return   
## 10 2009-01-02 XLNX   0.0303 return   
## # ... with 3,282,342 more rows

We now nest the three dataframes so we have a similar structure to the constituents data.

roc_nested <- roc_long %>% 
  nest(rocs = c(ticker, roc, data_type))
  
roc_nested

## # A tibble: 3,028 x 2
##    date       rocs                
##    <date>     <list>              
##  1 2009-01-02 <tibble [1,084 x 3]>
##  2 2009-01-05 <tibble [1,084 x 3]>
##  3 2009-01-06 <tibble [1,084 x 3]>
##  4 2009-01-07 <tibble [1,084 x 3]>
##  5 2009-01-08 <tibble [1,084 x 3]>
##  6 2009-01-09 <tibble [1,084 x 3]>
##  7 2009-01-12 <tibble [1,084 x 3]>
##  8 2009-01-13 <tibble [1,084 x 3]>
##  9 2009-01-14 <tibble [1,084 x 3]>
## 10 2009-01-15 <tibble [1,084 x 3]>
## # ... with 3,018 more rows

close_nested <- close_long %>% 
  nest(prices = c(ticker, price, data_type))

close_nested

## # A tibble: 3,028 x 2
##    date       prices              
##    <date>     <list>              
##  1 2009-01-02 <tibble [1,084 x 3]>
##  2 2009-01-05 <tibble [1,084 x 3]>
##  3 2009-01-06 <tibble [1,084 x 3]>
##  4 2009-01-07 <tibble [1,084 x 3]>
##  5 2009-01-08 <tibble [1,084 x 3]>
##  6 2009-01-09 <tibble [1,084 x 3]>
##  7 2009-01-12 <tibble [1,084 x 3]>
##  8 2009-01-13 <tibble [1,084 x 3]>
##  9 2009-01-14 <tibble [1,084 x 3]>
## 10 2009-01-15 <tibble [1,084 x 3]>
## # ... with 3,018 more rows

returns_nested <- returns_long %>% 
  nest(returns = c(ticker, return, data_type))

returns_nested

## # A tibble: 3,028 x 2
##    date       returns             
##    <date>     <list>              
##  1 2009-01-02 <tibble [1,084 x 3]>
##  2 2009-01-05 <tibble [1,084 x 3]>
##  3 2009-01-06 <tibble [1,084 x 3]>
##  4 2009-01-07 <tibble [1,084 x 3]>
##  5 2009-01-08 <tibble [1,084 x 3]>
##  6 2009-01-09 <tibble [1,084 x 3]>
##  7 2009-01-12 <tibble [1,084 x 3]>
##  8 2009-01-13 <tibble [1,084 x 3]>
##  9 2009-01-14 <tibble [1,084 x 3]>
## 10 2009-01-15 <tibble [1,084 x 3]>
## # ... with 3,018 more rows

3.4 Consolidated, Nested Data

Having all three nested dataframes (constituents, prices and rocs) we can consolidate everything into a single dataframe. This will be used to analyze the strategy.

consolidated_nested <- constituents_nested %>% 
  left_join(close_nested, by = "date") %>% 
  left_join(roc_nested, by = "date") %>% 
  left_join(returns_nested, by = "date") %>% 
  mutate(row_id = row_number()) %>% 
  relocate(row_id, everything())

consolidated_nested

## # A tibble: 3,028 x 6
##    row_id date       constituents       prices               rocs     returns 
##     <int> <date>     <list>             <list>               <list>   <list>  
##  1      1 2009-01-02 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  2      2 2009-01-05 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  3      3 2009-01-06 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  4      4 2009-01-07 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  5      5 2009-01-08 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  6      6 2009-01-09 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  7      7 2009-01-12 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  8      8 2009-01-13 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
##  9      9 2009-01-14 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 10     10 2009-01-15 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## # ... with 3,018 more rows

In this consolidated_nested dataframe each row represents a single date. The advantage of this dataframe is that it contains the constituents, the prices, and rocs for each date.

3.5 Stock Selection Per Day

To code the stock selection, we start by defining the starting and ending dates.

start_date <- ymd("2009-12-31")
end_date <- ymd("2020-01-02")

Then, we can filter the consolidated_nested dataframe so we have data around this range of dates.

consolidated_nested <- consolidated_nested %>% 
  filter(date %>% between(start_date - 10, end_date + 10)
  )

We then execute the following steps:

For each day, get the constituents of the S&P 500.

For this purpose, we create the extract_constituents() function:

extract_constituents <- function(df, extraction_date) {
  df %>% 
    filter(date == extraction_date) %>% 
    select(row_id, date, constituents) %>% 
    unnest(cols = constituents) %>% 
    pull(ticker)
}

We can test it for any date and show just the first 6 tickers:

extract_constituents(consolidated_nested, ymd("2019-12-31")) %>% 
  head()

## [1] "ZTS"  "ZION" "ZBRA" "ZBH"  "YUM"  "XYL"

Then, rank each of those constituents by their previous closing 200-day ROC. Choose the top 10 stocks.

For this purpose, we create two functions. We also set as an argument the number of stocks that can be picked. In that way, the strategy can be changed to pick the top ‘n’ stocks. The default is set to 10.

the show_top_roc() function, which shows us the top 10 tickers ranked by ROC, for any given date.
the extract_top_roc() function, which extracts the top 10 tickers ranked by ROC, for any given date.

Here’s the show_top_roc() function:

show_top_roc <- function(df, extraction_date, .top_n = 10L) {
  if (!is.numeric(.top_n)) {
    stop("You must provide an integer for the top 'n' constituents.")
  }
  
  # Get the ID for the previous close
  previous_row <- df %>% 
    filter(date == extraction_date) %>% 
    pull(row_id) - 1
  
  # Proceed to get the top n tickers
  df %>% 
    filter(row_id == previous_row) %>% 
    select(row_id, date, rocs) %>% 
    unnest(cols = rocs) %>% 
    
    # Filter to focus on the S&P 500 constituents. Use extract_constituents function
    filter(ticker %in% extract_constituents(df, extraction_date)) %>% 
    
    # Get the top n, as ordered by ROC
    slice_max(order_by = roc, n = .top_n)
  
}

Let’s test this function for a given date:

show_top_roc(consolidated_nested, ymd("2019-12-31"), .top_n = 10)

## # A tibble: 10 x 5
##    row_id date       ticker   roc data_type
##     <int> <date>     <chr>  <dbl> <chr>    
##  1   2767 2019-12-30 AMD     95.4 ROC      
##  2   2767 2019-12-30 TGT     68.1 ROC      
##  3   2767 2019-12-30 QRVO    66.8 ROC      
##  4   2767 2019-12-30 HWM     62.6 ROC      
##  5   2767 2019-12-30 MKTX    60.4 ROC      
##  6   2767 2019-12-30 LRCX    60.4 ROC      
##  7   2767 2019-12-30 AAPL    56.6 ROC      
##  8   2767 2019-12-30 QCOM    56.2 ROC      
##  9   2767 2019-12-30 LDOS    56.1 ROC      
## 10   2767 2019-12-30 RMD     54.0 ROC

Indeed it shows the top 10 ROC tickers for the previous close.

Now here’s the extract_top_roc() function:

extract_top_roc <- function(df, extraction_date, .top_n = 10L) {
  
  show_top_roc(df, extraction_date, .top_n) %>% 
    pull(ticker)
}

Again, we take it for a spin:

extract_top_roc(consolidated_nested, ymd("2019-12-31"), .top_n = 10)

##  [1] "AMD"  "TGT"  "QRVO" "HWM"  "MKTX" "LRCX" "AAPL" "QCOM" "LDOS" "RMD"

It works perfectly; we got the exact same tickers as with the show_top_roc() function.

Get the corresponding weights for each stock. For this step, we create the calculate_weights() function:

calculate_weights <- function(df, .date, .top_n = 10L) {
  chosen_tickers <- extract_top_roc(df, .date, .top_n)
  
  weights_tbl <- tibble(
    date = .date,
    tickers = extract_constituents(df, .date),
    weights = if_else(
      tickers %in% chosen_tickers, 
      1 / length(chosen_tickers), # equally weighted portfolio
      0                           # if ticker is not chose, assign 0 weight
    )
  ) %>% 
    # Focus on stocks that are chosen
    filter(weights > 0) %>% 
    
    arrange(tickers)
  
  
  return(weights_tbl)
}

We also implement the ‘wide’ version of this function:

calculate_weights_wide <- function(df, .date, .top_n = 10L) {
  
  weights_tbl <- calculate_weights(df, .date, .top_n = .top_n)
  
  tickers_tbl <- weights_tbl %>%
    add_column("ticker_id" = paste0("ticker_", 1:.top_n)) %>% 
    pivot_wider(
      id_cols = date,
      names_from = ticker_id,
      values_from = tickers
    )
  
  weights_tbl <- weights_tbl %>%
    add_column("weight_id" = paste0("weight_", 1:.top_n)) %>% 
    pivot_wider(
      id_cols = date,
      names_from = weight_id,
      values_from = weights
    ) 

  return(left_join(tickers_tbl, weights_tbl, by = "date"))
    
}

Again, we test the functions:

calculate_weights(consolidated_nested, ymd("2019-12-31"), .top_n = 10)

## # A tibble: 10 x 3
##    date       tickers weights
##    <date>     <chr>     <dbl>
##  1 2019-12-31 AAPL        0.1
##  2 2019-12-31 AMD         0.1
##  3 2019-12-31 HWM         0.1
##  4 2019-12-31 LDOS        0.1
##  5 2019-12-31 LRCX        0.1
##  6 2019-12-31 MKTX        0.1
##  7 2019-12-31 QCOM        0.1
##  8 2019-12-31 QRVO        0.1
##  9 2019-12-31 RMD         0.1
## 10 2019-12-31 TGT         0.1

calculate_weights_wide(consolidated_nested, ymd("2019-12-31"), .top_n = 10)

## # A tibble: 1 x 21
##   date       ticker_1 ticker_2 ticker_3 ticker_4 ticker_5 ticker_6 ticker_7
##   <date>     <chr>    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
## 1 2019-12-31 AAPL     AMD      HWM      LDOS     LRCX     MKTX     QCOM    
## # ... with 13 more variables: ticker_8 <chr>, ticker_9 <chr>, ticker_10 <chr>,
## #   weight_1 <dbl>, weight_2 <dbl>, weight_3 <dbl>, weight_4 <dbl>,
## #   weight_5 <dbl>, weight_6 <dbl>, weight_7 <dbl>, weight_8 <dbl>,
## #   weight_9 <dbl>, weight_10 <dbl>

This is the main function needed for the implementation. At each date, the function tells you what are the weights needed on each of the S&P 500 constituents, ranking those by 200-day ROC and equally weighing the top ‘n.’

3.6 Strategy Weights

We are finally ready to implement the aforementioned strategy on each day. We proceed by getting a vector of dates for which we want to implement the strategy (defined by the start_date and end_date variables).

# Extract the dates needed for the analysis
analysis_dates <- consolidated_nested %>% 
  select(date) %>% 
  filter(
    date %>% between(ymd("2009-12-31"), ymd("2020-01-02"))
    ) %>% pull(date)

analysis_dates %>% head()

## [1] "2009-12-31" "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07"
## [6] "2010-01-08"

Then, we loop loop over that vector to use it on our calculate_weights_wide() function.

# WARNING: This takes ~ 3-4 min to to run
strategy_weights <- analysis_dates %>% 
  map_dfr(.f = ~ calculate_weights_wide(consolidated_nested, .x, .top_n = 10))

strategy_weights

## # A tibble: 2,518 x 21
##    date       ticker_1 ticker_2 ticker_3 ticker_4 ticker_5 ticker_6 ticker_7
##    <date>     <chr>    <chr>    <chr>    <chr>    <chr>    <chr>    <chr>   
##  1 2009-12-31 CBRE     F        FITB     GNW      JBL      ODP      TGNA    
##  2 2010-01-04 CBRE     F        FITB     GNW      JBL      MEE      ODP     
##  3 2010-01-05 CBRE     F        FITB     GNW      JBL      ODP      TGNA    
##  4 2010-01-06 CBRE     FITB     GNW      JBL      MEE      ODP      TGNA    
##  5 2010-01-07 CBRE     F        FITB     GNW      JBL      MEE      ODP     
##  6 2010-01-08 CBRE     F        FITB     GNW      JBL      MEE      ODP     
##  7 2010-01-11 DD       F        FITB     GNW      JBL      MEE      ODP     
##  8 2010-01-12 DD       F        FITB     GNW      ISRG     MEE      ODP     
##  9 2010-01-13 DD       F        FITB     GNW      IP       MEE      ODP     
## 10 2010-01-14 F        FITB     GNW      LNC      MEE      ODP      TGNA    
## # ... with 2,508 more rows, and 13 more variables: ticker_8 <chr>,
## #   ticker_9 <chr>, ticker_10 <chr>, weight_1 <dbl>, weight_2 <dbl>,
## #   weight_3 <dbl>, weight_4 <dbl>, weight_5 <dbl>, weight_6 <dbl>,
## #   weight_7 <dbl>, weight_8 <dbl>, weight_9 <dbl>, weight_10 <dbl>

strategy_weights_long <- analysis_dates %>% 
  map_dfr(.f = ~ calculate_weights(consolidated_nested, .x, .top_n = 10))

3.7 Time series of the Net Asset Value

In order to get the time series for the NAV, we now focus on the daily returns of each selected stock.

We first join the returns_long dataframe with the strategy_weights_long dataframe.

weights_n_returns <- strategy_weights_long %>% 
  rename(ticker = tickers) %>% 
  left_join(returns_long, by = c("date", "ticker")) %>% 
  select(-data_type)

weights_n_returns

## # A tibble: 25,180 x 4
##    date       ticker weights   return
##    <date>     <chr>    <dbl>    <dbl>
##  1 2009-12-31 CBRE       0.1 -0.0335 
##  2 2009-12-31 F          0.1  0.00100
##  3 2009-12-31 FITB       0.1 -0.00409
##  4 2009-12-31 GNW        0.1 -0.0224 
##  5 2009-12-31 JBL        0.1 -0.0170 
##  6 2009-12-31 ODP        0.1 -0.0153 
##  7 2009-12-31 TGNA       0.1 -0.0113 
##  8 2009-12-31 THC        0.1  0.0365 
##  9 2009-12-31 WYND       0.1 -0.0113 
## 10 2009-12-31 XL1        0.1 -0.00435
## # ... with 25,170 more rows

We then get the portfolio’s return for each date:

strategy_returns <- weights_n_returns %>% 
  
  # calculate the weighted return
  mutate(weighted_return = weights * return) %>% 
  group_by(date) %>% 
  
  # sum the returns for each date
  summarise(strategy_return = sum(weighted_return)) %>% 
  
  # get a column with the NAV multiplier.
  mutate(multiplier = if_else(date == analysis_dates[[1]], 
                              1, 
                              strategy_return + 1),
         cumulative_multiplier = cumprod(multiplier))

strategy_returns

## # A tibble: 2,518 x 4
##    date       strategy_return multiplier cumulative_multiplier
##    <date>               <dbl>      <dbl>                 <dbl>
##  1 2009-12-31        -0.00816      1                      1   
##  2 2010-01-04         0.0265       1.03                   1.03
##  3 2010-01-05         0.0382       1.04                   1.07
##  4 2010-01-06         0.00498      1.00                   1.07
##  5 2010-01-07         0.0223       1.02                   1.09
##  6 2010-01-08         0.00465      1.00                   1.10
##  7 2010-01-11        -0.00343      0.997                  1.10
##  8 2010-01-12        -0.0218       0.978                  1.07
##  9 2010-01-13         0.0243       1.02                   1.10
## 10 2010-01-14         0.0184       1.02                   1.12
## # ... with 2,508 more rows

Now, after defining the initial capital with the starting_capital variable, we see how that amount evolves by following this strategy.

starting_capital <- 100

strategy_nav <- strategy_returns %>% 
  mutate(nav = starting_capital * cumulative_multiplier) %>% 
  select(date, strategy_return, nav)

strategy_nav

## # A tibble: 2,518 x 3
##    date       strategy_return   nav
##    <date>               <dbl> <dbl>
##  1 2009-12-31        -0.00816  100 
##  2 2010-01-04         0.0265   103.
##  3 2010-01-05         0.0382   107.
##  4 2010-01-06         0.00498  107.
##  5 2010-01-07         0.0223   109.
##  6 2010-01-08         0.00465  110.
##  7 2010-01-11        -0.00343  110.
##  8 2010-01-12        -0.0218   107.
##  9 2010-01-13         0.0243   110.
## 10 2010-01-14         0.0184   112.
## # ... with 2,508 more rows

Let’s finish off with a nice interactive plot of the strategy’s NAV:

strategy_nav %>% 
  hchart("line",
         hcaes(date, nav),
         color = "#002C54") %>% 
         # color = "#258039") %>% 
    hc_title(
    text = "200-day ROC Strategy's Net Asset Value"
    ) %>% 
  hc_add_theme(hc_theme_538())

3.8 Comparing against the S&P 500 Index

One final step would be to compare these results against holding the whole S&P 500.

We can download daily returns for the S&P 500 from January 02, 1986 through June 29, 2018. The data comes from this Kaggle webpage. We note that this dataset has a year and half less data than the provided dataset. However, it’s still useful to contrast the strategies.

# Import csv file
spx <- read_csv("data/spx.csv")

## Rows: 8192 Columns: 2

## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): date
## dbl (1): close

## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

spx_tbl <- spx %>% 
  
  mutate(
    
  # Parse the date to a "date" type of object
    date = parse_date(date, format = "%d-%b-%y"),
    
  # Calculate the returns
  returns = (close / lag(close)) - 1
  ) %>% 
  
  # Filter out the first value
  filter(!is.na(returns)) %>%
  
  # Filter for dates in the analysis
  filter(date %in% analysis_dates) %>% 
  
  # Get lower-case and no-space names
  clean_names()

spx_tbl

## # A tibble: 2,139 x 3
##    date       close   returns
##    <date>     <dbl>     <dbl>
##  1 2009-12-31 1115. -0.0100  
##  2 2010-01-04 1133.  0.0160  
##  3 2010-01-05 1137.  0.00312 
##  4 2010-01-06 1137.  0.000546
##  5 2010-01-07 1142.  0.00400 
##  6 2010-01-08 1145.  0.00288 
##  7 2010-01-11 1147.  0.00175 
##  8 2010-01-12 1136. -0.00938 
##  9 2010-01-13 1146.  0.00833 
## 10 2010-01-14 1148.  0.00243 
## # ... with 2,129 more rows

We can create the S&P 500 NAV for the duration of the strategy.

spx_multiplier <- spx_tbl %>% 
  group_by(date) %>% 
  
  # sum the returns for each date
  summarise(sp500_return = sum(returns)) %>% 
  
  # get a column with the NAV multiplier.
  mutate(multiplier = if_else(date == analysis_dates[[1]], 
                              1, 
                              sp500_return + 1),
         cumulative_multiplier = cumprod(multiplier))

spx_nav <-  spx_multiplier %>% 
  mutate(nav = starting_capital * cumulative_multiplier) %>% 
  select(date, sp500_return, nav)

spx_nav

## # A tibble: 2,139 x 3
##    date       sp500_return   nav
##    <date>            <dbl> <dbl>
##  1 2009-12-31    -0.0100    100 
##  2 2010-01-04     0.0160    102.
##  3 2010-01-05     0.00312   102.
##  4 2010-01-06     0.000546  102.
##  5 2010-01-07     0.00400   102.
##  6 2010-01-08     0.00288   103.
##  7 2010-01-11     0.00175   103.
##  8 2010-01-12    -0.00938   102.
##  9 2010-01-13     0.00833   103.
## 10 2010-01-14     0.00243   103.
## # ... with 2,129 more rows

Now let’s plot the NAV of holding the S&P 500 index:

spx_nav %>% 
  hchart("line",
         hcaes(date, nav),
         # color = "#002C54") %>% 
         color = "#258039") %>%
    hc_title(
    text = "Holding the S&P 500 Index"
    ) %>% 
  hc_add_theme(hc_theme_538())

To see both strategies side to side, we can substract the NAV from each other:

relative_nav <- spx_nav %>% 
  rename(spx_nav = nav) %>% 
  left_join(strategy_nav %>% rename(strat_nav = nav), by = "date") %>% 
  mutate(nav_diff = strat_nav - spx_nav)

In the chart below, we plot the difference between the NAV’s growth produced by the 200-day ROC strategy against holding the S&P 500 index. We can see that, for a brief window of time, holding the index was more profitable. However, after 10 years, the 200-day ROC strategy clearly beats the index.

relative_nav %>% 
  hchart("line",
         hcaes(date, nav_diff),
         # color = "#002C54") %>% 
         color = "#258039") %>%
    hc_title(
    text = "NAV difference between strategies"
    ) %>% 
  hc_add_theme(hc_theme_538())

We could also visualize both NAVs. Below we provide a chart where we do just that. The blue line corresponds to the strategy’s NAV, while the red line corresponds to the S&P 500 index. The x-axis denotes the amount of days after the strategy is implemented.

highchart(type = "chart") %>% 
  hc_add_series(relative_nav$spx_nav) %>% 
  hc_add_series(relative_nav$strat_nav) %>% 
      hc_title(
    text = "Contrast of NAVs between S&P500 index and ROC strategy"
    ) %>% 
  hc_add_theme(hc_theme_538())

4. Conclusions

This strategy clearly beat the S&P 500 during the analysis period. However, the execution of this strategy, as presented on this document, relies upon some unrealistic assumptions. Namely:

No trading friction (i.e. there are no commissions and taxes for buying/selling stocks).
The closing price of the previous day is equal to the opening price of the current day.
The investor can buy fractions of shares.

The code can be further improved to ‘relax’ all of the previous assumptions and reflect a ‘more realistic’ implementation.

Posted on:: February 11, 2022

Length:: 38 minute read, 7913 words

See Also: