Creating a time series chart
By Alexis Cancino
February 11, 2022
Foreword on Reproducibility
In order to run this notebook, you need to have R installed, ideally RStudio too. Also, you would need to install the following packages:
# NOT RUN
# install.packages(c("tidyverse", "lubridate", "glue", "skimr", "highcharter", "janitor"))
1. Introduction
The task is to develop a backtest for a strategy that trades constituents of the S&P 500 index from January 1st, 2010 through January 1st, 2020. Each day the strategy selects the 10 stocks with the highest Price Rate of Change Indicator (ROC) and buys them on an equally-weighted basis.
It’s important to verify that it only selects stocks that are part of the S&P 500 index at each moment in time, as these change every quarter. A file should be generated with the values of the equity curve (Net Asset Value) over time and a file of weights or positions where you can see the stocks that were selected each day.
1.1 The S&P 500 Index
The S&P 500 index is one of the most-widely known indexes in finance. According to S&P Dow Jones Indices: ‘the S&P 500® is widely regarded as the best single gauge of large-cap U.S. equities. There is over USD 11.2 trillion indexed or benchmarked to the index, with indexed assets comprising approximately USD 4.6 trillion of this total. The index includes 500 leading companies and covers approximately 80% of available market capitalization.’
The index measures the performance of the large-cap segment of the market. Considered to be a proxy of the U.S. equity market, the index is composed of 500 constituent companies.
According to the S&P 500’s methodology, the index is weighted by float-adjusted market capitalization and is rebalanced on a quarterly basis.
1.2 The Strategy
The strategy consists of selecting the 10 stocks with the highest Price Rate of Change Indicator (ROC) and buys them on an equally-weighted basis. Each day before the market opens, we look back on the Price ROCs of the previous close and buy the 10 stocks with the highest ROC.
1.3 The Price Rate of Change (ROC)
The Price ROC for 200 days is calculated as:
\begin{equation} \text{ROC}_{200d} = \left(\dfrac{P_t}{P_{t-200}}-1\right) \times 100 \end{equation}
2. Constituents come first
We were provided with two datasets: closes.csv
and constituents.csv
. We will begin by exploring the latter.
2.1 Constituents Data EDA
We import the constituents.csv
data.
constituents <- read_csv("data/constituents.csv", col_types = "dDcccccc") %>%
# Remove first column, which has the index
select(-1)
## New names:
## * `` -> ...1
constituents
## # A tibble: 48,388 x 7
## date action ticker name contraticker contraname note
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2020-12-31 historical ZTS Zoetis Inc <NA> <NA> <NA>
## 2 2020-12-31 current ZTS Zoetis Inc <NA> <NA> <NA>
## 3 2020-12-31 historical ZION Zions Bancorporat~ <NA> <NA> <NA>
## 4 2020-12-31 current ZION Zions Bancorporat~ <NA> <NA> <NA>
## 5 2020-12-31 historical ZBRA Zebra Technologie~ <NA> <NA> <NA>
## 6 2020-12-31 current ZBRA Zebra Technologie~ <NA> <NA> <NA>
## 7 2020-12-31 historical ZBH Zimmer Biomet Hol~ <NA> <NA> <NA>
## 8 2020-12-31 current ZBH Zimmer Biomet Hol~ <NA> <NA> <NA>
## 9 2020-12-31 historical YUM Yum Brands Inc <NA> <NA> <NA>
## 10 2020-12-31 current YUM Yum Brands Inc <NA> <NA> <NA>
## # ... with 48,378 more rows
Let’s use the skimr
package to print a quick summary of missing and unique values of the data:
constituents %>%
skimr::skim()
Name | Piped data |
Number of rows | 48388 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
character | 6 |
Date | 1 |
________________________ | |
Group variables | None |
Table 1: Data summary
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
action | 0 | 1.00 | 5 | 10 | 0 | 4 | 0 |
ticker | 0 | 1.00 | 1 | 5 | 0 | 1112 | 0 |
name | 0 | 1.00 | 5 | 48 | 0 | 1093 | 0 |
contraticker | 47118 | 0.03 | 1 | 5 | 0 | 927 | 0 |
contraname | 47118 | 0.03 | 5 | 48 | 0 | 913 | 0 |
note | 48381 | 0.00 | 6 | 29 | 0 | 2 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
date | 0 | 1 | 1957-03-04 | 2020-12-31 | 2009-06-30 | 856 |
Some notes on the table above:
-
The dataset consists of 6 character-type variables and 1 date-type variable.
-
According to the table above, there are 4 unique values for the
action
variable. Let’s see what are those:
constituents %>% distinct(action)
## # A tibble: 4 x 1
## action
## <chr>
## 1 historical
## 2 current
## 3 added
## 4 removed
-
The data goes from 1957-03-04 through 2020-12-31 but we don’t have daily data since there are only 856 unique dates.
-
The columns
contraticker
,contraname
andnote
are the only columns that have missing values, which is both logical and acceptable. -
The
contraticker
andcontraname
columns see to be the opposite of the stocks that are added/removed to the index. Let’s check that with some steps:
- Check all possible different
action
values when there are no missing values of either thecontraticker
orcontraname
variables.
constituents %>%
filter(!is.na(contraticker) | !is.na(contraname)) %>%
distinct(action)
## # A tibble: 2 x 1
## action
## <chr>
## 1 added
## 2 removed
- Inspect some rows where there are no missing values of either the
contraticker
orcontraname
variables.
constituents %>%
filter(!is.na(contraticker) | !is.na(contraname))
## # A tibble: 1,270 x 7
## date action ticker name contraticker contraname note
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2020-12-21 added TSLA Tesla Inc AIV Apartment~ <NA>
## 2 2020-12-21 removed AIV Apartment Investment~ TSLA Tesla Inc <NA>
## 3 2020-10-12 added VNT Vontier Corp NBL Noble Ene~ <NA>
## 4 2020-10-12 removed NBL Noble Energy Inc VNT Vontier C~ <NA>
## 5 2020-09-21 added TER Teradyne Inc COTY Coty Inc <NA>
## 6 2020-09-21 removed KSS KOHLS Corp CTLT Catalent ~ <NA>
## 7 2020-09-21 removed HRB H&R Block Inc ETSY Etsy Inc <NA>
## 8 2020-09-21 added ETSY Etsy Inc HRB H&R Block~ <NA>
## 9 2020-09-21 added CTLT Catalent Inc KSS KOHLS Corp <NA>
## 10 2020-09-21 removed COTY Coty Inc TER Teradyne ~ <NA>
## # ... with 1,260 more rows
Indeed this is the case. We can see, for example, on the first row: TSLA is being added and replaced by AIV. On the next row we see the opposite operation: AIV is being removed and substituted by TSLA.
- Another important check we must do is that there are no errors in the ticker-name relationship. That is, for every (ticker, name) pair there must be just one count per value-key pair.
We first get the count for each key-value pair:
constituents %>% distinct(ticker, name) %>% count(ticker, name, name = "count")
## # A tibble: 1,112 x 3
## ticker name count
## <chr> <chr> <int>
## 1 A Agilent Technologies Inc 1
## 2 AAL American Airlines Group Inc 1
## 3 AAMRQ Amr Corp 1
## 4 AAP Advance Auto Parts Inc 1
## 5 AAPL Apple Inc 1
## 6 ABBV AbbVie Inc 1
## 7 ABC Amerisourcebergen Corp 1
## 8 ABI1 Applied Biosystems Inc 1
## 9 ABKFQ Ambac Financial Group Inc 1
## 10 ABMD Abiomed Inc 1
## # ... with 1,102 more rows
Now we filter the counts that are greater than 1:
constituents %>%
distinct(ticker, name) %>%
count(ticker, name, name = "count") %>%
filter(count > 1)
## # A tibble: 0 x 3
## # ... with 3 variables: ticker <chr>, name <chr>, count <int>
We get no rows, which means that the data is correct.
2.1.1 Time variables and action
in the constituents
dataset
In this section we’ll create a time series that will contain, for each date, the constituents of the S&P 500 index.
Since we are focusing on time, we’ll add some time-related variables, such as:
year
: The year of the time seriesdate
.half
: The half component of the index (i.e. to which semester does the date belong to).quarter
: The quarter component of the index (i.e. to which quarter does the date belong to).month
: The month component of the index (with base 1 - that is, January = 1 and so on).month_label
: The three-letter month label as an ordered categorical variable. It begins with Jan and ends with Dec.day
: The day component of thedate
.wday
: The day of the week with base 1. Monday = 1 and Sunday = 7.wday_label
: The three-letter label for day of the week as an ordered categorical variable. It begins withMon
and ends withSun
.qday
: The day of the quarter.yday
: The day of the year.qid
: The quarter’s ID.
All variables will have the date_
prefix so they are easy to identify. Since we can use this creation of variables for later, we create the add_ts_signature()
function (so we can recycle the code).
add_ts_signature <- function(df, .date_col) {
df %>%
# Create time series signature
mutate(
date_year = year({{.date_col}}),
date_half = semester({{.date_col}}, with_year = FALSE),
date_quarter = quarter({{.date_col}}),
date_month = month({{.date_col}}),
date_month_label = month({{.date_col}}, label = TRUE),
date_day = day({{.date_col}}),
date_wday = wday({{.date_col}}, week_start = 1),
date_wday_label = wday({{.date_col}}, label = TRUE),
date_qday = qday({{.date_col}}),
date_yday = yday({{.date_col}}),
date_qid = str_c(date_quarter,"Q", str_sub(date_year, 3L)) %>% as_factor()
)
}
constituents_ts <- constituents %>%
add_ts_signature(.date_col = date)
How are actions
distributed along the data?
constituents_ts %>% count(action, sort = T)
## # A tibble: 4 x 2
## action n
## <chr> <int>
## 1 historical 46110
## 2 added 1137
## 3 removed 636
## 4 current 505
current
appears to correspond to the index’s most recent constituents. We can check if this is the case by taking the unique values for the date
variable when the action
variable has the current value.
constituents_ts %>%
filter(
action == "current"
) %>%
distinct(date)
## # A tibble: 1 x 1
## date
## <date>
## 1 2020-12-31
Indeed, the only date where a current value appears is 2020-12-31
Let’s inspect the data for the first date 1957-03-04
:
constituents_ts %>%
filter(
date == ymd("1957-03-04")
)
## # A tibble: 97 x 18
## date action ticker name contraticker contraname note date_year
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 1957-03-04 added WWY Wrigley Wm~ <NA> <NA> <NA> 1957
## 2 1957-03-04 added WNDXQ Winn Dixie~ <NA> <NA> <NA> 1957
## 3 1957-03-04 added WLA Warner Lam~ <NA> <NA> <NA> 1957
## 4 1957-03-04 added WGL Wgl Holdin~ <NA> <NA> <NA> 1957
## 5 1957-03-04 removed WAB Westinghou~ GT Goodyear ~ <NA> 1957
## 6 1957-03-04 added VO1 Seagram Co~ <NA> <NA> <NA> 1957
## 7 1957-03-04 added USG Usg Corp <NA> <NA> <NA> 1957
## 8 1957-03-04 added UNP Union Paci~ <NA> <NA> <NA> 1957
## 9 1957-03-04 added UK1 Union Carb~ <NA> <NA> <NA> 1957
## 10 1957-03-04 added UCC1 Union Camp~ <NA> <NA> <NA> 1957
## # ... with 87 more rows, and 10 more variables: date_half <int>,
## # date_quarter <int>, date_month <dbl>, date_month_label <ord>,
## # date_day <int>, date_wday <dbl>, date_wday_label <ord>, date_qday <dbl>,
## # date_yday <dbl>, date_qid <fct>
It’s clear that we don’t have the initial S&P 500 sample, but we could reverse engineer the constituents at each date. Let’s start by the final date 2020-12-31
:
constituents %>%
filter(date == ymd("2020-12-31"))
## # A tibble: 1,010 x 7
## date action ticker name contraticker contraname note
## <date> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2020-12-31 historical ZTS Zoetis Inc <NA> <NA> <NA>
## 2 2020-12-31 current ZTS Zoetis Inc <NA> <NA> <NA>
## 3 2020-12-31 historical ZION Zions Bancorporat~ <NA> <NA> <NA>
## 4 2020-12-31 current ZION Zions Bancorporat~ <NA> <NA> <NA>
## 5 2020-12-31 historical ZBRA Zebra Technologie~ <NA> <NA> <NA>
## 6 2020-12-31 current ZBRA Zebra Technologie~ <NA> <NA> <NA>
## 7 2020-12-31 historical ZBH Zimmer Biomet Hol~ <NA> <NA> <NA>
## 8 2020-12-31 current ZBH Zimmer Biomet Hol~ <NA> <NA> <NA>
## 9 2020-12-31 historical YUM Yum Brands Inc <NA> <NA> <NA>
## 10 2020-12-31 current YUM Yum Brands Inc <NA> <NA> <NA>
## # ... with 1,000 more rows
We see that we have sort-of-duplicated rows; we note that one action
label corresponds to historical
and the other one to current
. Let’s check that we actually have the same number per category:
constituents %>%
filter(date == ymd("2020-12-31")) %>%
count(action)
## # A tibble: 2 x 2
## action n
## <chr> <int>
## 1 current 505
## 2 historical 505
In order to get a better sense of what’s going on with the action
column, we pivot the data to a wide format so we can clearly see the difference between each action
value. By doing this, we can have a single column for each of the 4 unique values in the action
variable.
constituents_wide <- constituents_ts %>%
pivot_wider(
names_from = action,
values_from = name
) %>%
relocate(historical, current, added, removed, .after = ticker)
constituents_wide
## # A tibble: 47,883 x 20
## date ticker historical current added removed contraticker contraname
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2020-12-31 ZTS Zoetis Inc Zoetis~ <NA> <NA> <NA> <NA>
## 2 2020-12-31 ZION Zions Bancor~ Zions ~ <NA> <NA> <NA> <NA>
## 3 2020-12-31 ZBRA Zebra Techno~ Zebra ~ <NA> <NA> <NA> <NA>
## 4 2020-12-31 ZBH Zimmer Biome~ Zimmer~ <NA> <NA> <NA> <NA>
## 5 2020-12-31 YUM Yum Brands I~ Yum Br~ <NA> <NA> <NA> <NA>
## 6 2020-12-31 XYL Xylem Inc Xylem ~ <NA> <NA> <NA> <NA>
## 7 2020-12-31 XRX Xerox Holdin~ Xerox ~ <NA> <NA> <NA> <NA>
## 8 2020-12-31 XRAY DENTSPLY SIR~ DENTSP~ <NA> <NA> <NA> <NA>
## 9 2020-12-31 XOM Exxon Mobil ~ Exxon ~ <NA> <NA> <NA> <NA>
## 10 2020-12-31 XLNX Xilinx Inc Xilinx~ <NA> <NA> <NA> <NA>
## # ... with 47,873 more rows, and 12 more variables: note <chr>,
## # date_year <dbl>, date_half <int>, date_quarter <int>, date_month <dbl>,
## # date_month_label <ord>, date_day <int>, date_wday <dbl>,
## # date_wday_label <ord>, date_qday <dbl>, date_yday <dbl>, date_qid <fct>
Inspecting the data, it appears that the historical
variable gives us the constituents at each quarter. This is backed by the fact that on the latest available date, we have the current
constituents which correspond to the 4Q20. However, as mentioned above, we have the exact same rows but with a historical
label.
Let’s inspect the dates on which there are no missing values for the historical
action. We can check the unique values for the month number:
constituents_wide %>%
filter(!is.na(historical)) %>%
distinct(date_month, date_quarter)
## # A tibble: 4 x 2
## date_quarter date_month
## <int> <dbl>
## 1 4 12
## 2 3 9
## 3 2 6
## 4 1 3
Indeed, the only months were historical
actions are not missing are mod 3 months (i.e. the end of each quarter).
This points to the fact that probably the historical
value indicates the constituents per quarter on the S&P 500. We can further check the amount of historical
values per quarter. The number should be around 500, of course.
constituents_wide %>%
select(date_year, date_qid, historical) %>%
filter(date_year > 2008, !is.na(historical)) %>%
group_by(date_year, date_qid) %>%
nest() %>%
mutate(constituent_count = map_int(data, nrow)) %>%
select(-data) %>%
print(n = 45)
## # A tibble: 48 x 3
## # Groups: date_year, date_qid [48]
## date_year date_qid constituent_count
## <dbl> <fct> <int>
## 1 2020 4Q20 505
## 2 2020 3Q20 505
## 3 2020 2Q20 505
## 4 2020 1Q20 505
## 5 2019 4Q19 505
## 6 2019 3Q19 505
## 7 2019 2Q19 505
## 8 2019 1Q19 505
## 9 2018 4Q18 505
## 10 2018 3Q18 505
## 11 2018 2Q18 505
## 12 2018 1Q18 505
## 13 2017 4Q17 505
## 14 2017 3Q17 505
## 15 2017 2Q17 505
## 16 2017 1Q17 505
## 17 2016 4Q16 505
## 18 2016 3Q16 505
## 19 2016 2Q16 505
## 20 2016 1Q16 504
## 21 2015 4Q15 504
## 22 2015 3Q15 505
## 23 2015 2Q15 502
## 24 2015 1Q15 502
## 25 2014 4Q14 502
## 26 2014 3Q14 502
## 27 2014 2Q14 500
## 28 2014 1Q14 500
## 29 2013 4Q13 500
## 30 2013 3Q13 500
## 31 2013 2Q13 500
## 32 2013 1Q13 500
## 33 2012 4Q12 500
## 34 2012 3Q12 500
## 35 2012 2Q12 500
## 36 2012 1Q12 500
## 37 2011 4Q11 500
## 38 2011 3Q11 500
## 39 2011 2Q11 500
## 40 2011 1Q11 500
## 41 2010 4Q10 500
## 42 2010 3Q10 500
## 43 2010 2Q10 500
## 44 2010 1Q10 494
## 45 2009 4Q09 500
## # ... with 3 more rows
Indeed the number is very close to 500 among all quarters.
One last check we can perform before we can use the historical
rows as our constituents has to do with the current
rows. We can reverse engineer the 3Q20 constituents by starting from the 4Q20 (the current constituents).
We can first see what the changes on constituents were between the 3Q20 and 4Q20. We can separate the adds and drops into two data frames, like so:
adds_drops <- constituents_wide %>%
filter(!is.na(added) | !is.na(removed) , date_qid %in% c("4Q20")) %>%
select(date, ticker, added, removed, date_qid)
adds_drops %>%
filter(!is.na(added)) %>%
select(-removed)
## # A tibble: 2 x 4
## date ticker added date_qid
## <date> <chr> <chr> <fct>
## 1 2020-12-21 TSLA Tesla Inc 4Q20
## 2 2020-10-12 VNT Vontier Corp 4Q20
So TSLA and VNT were added in the 4Q20.
adds_drops %>%
filter(!is.na(removed)) %>%
select(-added)
## # A tibble: 2 x 4
## date ticker removed date_qid
## <date> <chr> <chr> <fct>
## 1 2020-12-21 AIV Apartment Investment & Management Co 4Q20
## 2 2020-10-12 NBL Noble Energy Inc 4Q20
NBL and AIV were removed in the 4Q20. Apparently, substituted by TSLA and VNT.
We can now do the reverse action for the adds and the drops, respectively. Thus, the drops from 3Q20-4Q20 will become our adds and the adds will become our drops.
We start by getting the adds needed to go from 4Q20 to 3Q20:
# Extract the adds from 4Q20 to 3Q20
adds_3Q <- adds_drops %>%
filter(!is.na(removed)) %>%
pull(removed)
adds_3Q
## [1] "Apartment Investment & Management Co"
## [2] "Noble Energy Inc"
Next, we get the drops needed to go from 4Q20 to 3Q20:
# Extract the drops from 4Q20 to 3Q20
drops_3Q <- adds_drops %>%
filter(!is.na(added)) %>%
pull(added)
drops_3Q
## [1] "Tesla Inc" "Vontier Corp"
Finally, we get the constituents in 4Q20, include the adds and remove the drops, and compare against the constituents in 3Q20. We should get the same result.
# Get all the constituents for 4Q20
constituents_4Q20 <- constituents_wide %>%
filter(date_qid == "4Q20", !is.na(historical)) %>%
pull(historical) %>% sort()
# Get all the constituents for 3Q20
constituents_3Q20 <- constituents_wide %>%
filter(date_qid == "3Q20", !is.na(historical)) %>%
pull(historical) %>% sort()
# include the adds from 4Q20
calculated_3Q20 <- c(constituents_4Q20, adds_3Q)
# remove the drops from 4Q20
calculated_3Q20 <- calculated_3Q20[!calculated_3Q20 %in% drops_3Q] %>% sort()
# Check if both results are equal
all.equal(constituents_3Q20, calculated_3Q20)
## [1] TRUE
Voilà! We did get a TRUE
value when comparing the reverse-engineered approach and the mere historical
values. This confirms that we can use the historical
values for our constituents.
2.1.2 The time series of S&P 500 constituents
We will now proceed to use these historical
values for every day in the closes.csv
dataset from 2010-01-01
through 2020-01-01
.
# Get the required dates from the closes.csv file
required_dates <- read_csv(file = "data/closes.csv") %>%
# Focus on the date only
select(date) %>%
# Focus on 2010 or later
filter(year(date) > 2008)
## New names:
## * `` -> ...1
## Rows: 5796 Columns: 1086
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (1085): ...1, ZTS, ZION, ZBRA, ZBH, YUM, XYL, XRX, XRAY, XOM, XLNX, XEL...
## date (1): date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
We now nest a data frame for each date in the constituents
dataset that contains the S&P 500 constituents.
constituents_nested <- constituents_wide %>%
# Focus on just the historicals and 2010 or later
filter(date_year > 2008, !is.na(historical)) %>%
# Just take the date, ticker, and historical columns
select(date, ticker, historical) %>%
# Nest the data
nest(constituents = c(ticker, historical)) %>%
relocate(constituents, .after = 1) %>%
arrange(date) %>%
# Uncomment in case we want to get the number of constituents per date
# mutate(constituent_count = map_int(constituents, nrow)) %>%
identity()
constituents_nested
## # A tibble: 48 x 2
## date constituents
## <date> <list>
## 1 2009-03-31 <tibble [500 x 2]>
## 2 2009-06-30 <tibble [500 x 2]>
## 3 2009-09-30 <tibble [500 x 2]>
## 4 2009-12-31 <tibble [500 x 2]>
## 5 2010-03-31 <tibble [494 x 2]>
## 6 2010-06-30 <tibble [500 x 2]>
## 7 2010-09-30 <tibble [500 x 2]>
## 8 2010-12-31 <tibble [500 x 2]>
## 9 2011-03-31 <tibble [500 x 2]>
## 10 2011-06-30 <tibble [500 x 2]>
## # ... with 38 more rows
We can now left_join the datasets:
constituents_nested <- required_dates %>%
left_join(constituents_nested, by = "date") %>%
fill(contains("constituent"), .direction = "up")
constituents_nested
## # A tibble: 3,028 x 2
## date constituents
## <date> <list>
## 1 2009-01-02 <tibble [500 x 2]>
## 2 2009-01-05 <tibble [500 x 2]>
## 3 2009-01-06 <tibble [500 x 2]>
## 4 2009-01-07 <tibble [500 x 2]>
## 5 2009-01-08 <tibble [500 x 2]>
## 6 2009-01-09 <tibble [500 x 2]>
## 7 2009-01-12 <tibble [500 x 2]>
## 8 2009-01-13 <tibble [500 x 2]>
## 9 2009-01-14 <tibble [500 x 2]>
## 10 2009-01-15 <tibble [500 x 2]>
## # ... with 3,018 more rows
Finally! We now have a time series with the corresponding composition of the S&P 500 for each date.
3. Implementing the ROC Strategy
We proceed in the following way:
First, import the data:
# Use read_csv() function to read the close data
close_tbl <- read_csv("data/closes.csv") %>%
select(-1)
## New names:
## * `` -> ...1
## Rows: 5796 Columns: 1086
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (1085): ...1, ZTS, ZION, ZBRA, ZBH, YUM, XYL, XRX, XRAY, XOM, XLNX, XEL...
## date (1): date
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
close_tbl
## # A tibble: 5,796 x 1,085
## date ZTS ZION ZBRA ZBH YUM XYL XRX XRAY XOM XLNX XEL
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1997-12-31 31.0 45.4 13.2 29.8 7.26 24.2 148. 10.2 30.6 8.76 29.1
## 2 1998-01-02 31.0 45.5 13.2 29.8 7.28 24.2 150. 9.92 30.9 9.66 28.9
## 3 1998-01-05 31.0 46.4 13.1 29.8 7.06 24.2 152. 9.54 30.6 9.44 28.7
## 4 1998-01-06 31.0 45.6 13.1 29.8 6.76 24.2 151. 9.71 29.5 9.35 28.3
## 5 1998-01-07 31.0 44.9 13.0 29.8 6.75 24.2 145. 9.54 30.4 8.95 28
## 6 1998-01-08 31.0 41.5 13 29.8 6.48 24.2 142. 9.48 29.8 9.03 28.1
## 7 1998-01-09 31.0 39.6 11.8 29.8 6.5 24.2 137. 9.02 29.1 8.66 27.8
## 8 1998-01-12 31.0 40.5 11.8 29.8 6.53 24.2 137. 9 29.4 8.52 28.0
## 9 1998-01-13 31.0 40.8 12.3 29.8 6.53 24.2 141. 9.17 30.0 9.31 28.0
## 10 1998-01-14 31.0 42.2 11.6 29.8 6.53 24.2 142. 9.29 30.4 9.22 27.9
## # ... with 5,786 more rows, and 1,073 more variables: WYNN <dbl>, WY <dbl>,
## # WU <dbl>, WST <dbl>, WRK <dbl>, WRB <dbl>, WMT <dbl>, WMB <dbl>, WM <dbl>,
## # WLTW <dbl>, WHR <dbl>, WFC <dbl>, WELL <dbl>, WEC <dbl>, WDC <dbl>,
## # WBA <dbl>, WAT <dbl>, WAB <dbl>, VZ <dbl>, VTRS <dbl>, VTR <dbl>,
## # VRTX <dbl>, VRSN <dbl>, VRSK <dbl>, VNT <dbl>, VNO <dbl>, VMC <dbl>,
## # VLO <dbl>, VIAC <dbl>, VFC <dbl>, VAR <dbl>, V <dbl>, USB <dbl>, URI <dbl>,
## # UPS <dbl>, UNP <dbl>, UNM <dbl>, UNH <dbl>, ULTA <dbl>, UHS <dbl>, ...
Apparently, we have daily closing data from 1997-12-31
through 2021-01-12
for 1,084
different securities.
3.1 Calculating the 200-day ROC and daily returns
Secondly, calculate the 200-day ROC. To achieve this, we create the add_price_roc()
function, which takes an ndays
argument to indicate the number of days for the ROC calculation. This would be useful if we’d like to explore different days for the ROC.
# This functions needs a data frame that has a date column and the rest of the columns must have the price of the securities.
add_price_roc <- function(df, .date_col, .ndays) {
df %>%
mutate(
across(
.cols = -{{.date_col}},
.fns = ~ ((.x / lag(.x, n = .ndays)) - 1) * 100,
.names = "ROC_{.col}"
)
) %>%
# keep rows where at least one of the ROC columns is not NA
filter(
if_any(contains("ROC"), ~ !is.na(.x))
)
}
Now let’s use this function on our closes.csv
data. This creates a column for each security that contains the n-day
ROC for each security. The ROC numbers are in columns with the ROC prefix to easily identify them.
We also calculate the daily returns. The return numbers are located in columns with the return prefix to easily identify them.
roc_close <- close_tbl %>%
# Add 200-day ROC
add_price_roc(.date_col = date,
.ndays = 200) %>%
relocate(date, contains("ROC")) %>%
# Add the returns
mutate(
across(
.cols = -c(date, contains("ROC")),
.fns = ~ (.x / lag(.x)) - 1,
.names = "return_{.col}"
)
)
roc_close
## # A tibble: 5,596 x 3,253
## date ROC_ZTS ROC_ZION ROC_ZBRA ROC_ZBH ROC_YUM ROC_XYL ROC_XRX ROC_XRAY
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1998-10-16 0 3.02 10.3 0 37.9 0 29.4 -32.0
## 2 1998-10-19 0 5.91 10.1 0 45.7 0 28.2 -27.7
## 3 1998-10-20 0 7.93 8.90 0 58.0 0 26.8 -23.6
## 4 1998-10-21 0 9.86 13.6 0 63.1 0 24.2 -23.4
## 5 1998-10-22 0 11.3 2.35 0 62.0 0 28.4 -21.0
## 6 1998-10-23 0 20.5 0.446 0 71.5 0 32.2 -19.5
## 7 1998-10-26 0 29.1 7.96 0 68.8 0 40.8 -13.9
## 8 1998-10-27 0 27.5 10.1 0 68.4 0 42.7 -10.6
## 9 1998-10-28 0 26.7 5.43 0 66.5 0 34.9 -10.9
## 10 1998-10-29 0 22.2 19.6 0 64.8 0 34.0 -10.3
## # ... with 5,586 more rows, and 3,244 more variables: ROC_XOM <dbl>,
## # ROC_XLNX <dbl>, ROC_XEL <dbl>, ROC_WYNN <dbl>, ROC_WY <dbl>, ROC_WU <dbl>,
## # ROC_WST <dbl>, ROC_WRK <dbl>, ROC_WRB <dbl>, ROC_WMT <dbl>, ROC_WMB <dbl>,
## # ROC_WM <dbl>, ROC_WLTW <dbl>, ROC_WHR <dbl>, ROC_WFC <dbl>, ROC_WELL <dbl>,
## # ROC_WEC <dbl>, ROC_WDC <dbl>, ROC_WBA <dbl>, ROC_WAT <dbl>, ROC_WAB <dbl>,
## # ROC_VZ <dbl>, ROC_VTRS <dbl>, ROC_VTR <dbl>, ROC_VRTX <dbl>,
## # ROC_VRSN <dbl>, ROC_VRSK <dbl>, ROC_VNT <dbl>, ROC_VNO <dbl>, ...
3.2 Filtering data from 2010 onward
Since the strategy will be implemented from 2010 onward, we can now filter out the data for previous years:
roc_close <- roc_close %>%
filter(year(date) > 2008)
3.3 Getting the data into a ‘long’ format
We get the data into a long format (instead of the wide format with 2,169
columns). Also, a separation is made in three data frames: one for ROCs, one for prices, and the other one for returns We call these roc_long
, close_long
, and `returns_long`` respectively.
# Build a dataframe with ROC values
roc_long <- roc_close %>%
select(date, contains("ROC")) %>%
pivot_longer(
cols = contains("ROC"),
names_to = "ticker",
names_pattern = "_(.*)",
values_to = "roc"
) %>%
add_column(data_type = "ROC")
roc_long
## # A tibble: 3,282,352 x 4
## date ticker roc data_type
## <date> <chr> <dbl> <chr>
## 1 2009-01-02 ZTS 0 ROC
## 2 2009-01-02 ZION -47.2 ROC
## 3 2009-01-02 ZBRA -36.7 ROC
## 4 2009-01-02 ZBH -44.7 ROC
## 5 2009-01-02 YUM -11.6 ROC
## 6 2009-01-02 XYL 0 ROC
## 7 2009-01-02 XRX -43.7 ROC
## 8 2009-01-02 XRAY -19.7 ROC
## 9 2009-01-02 XOM -3.30 ROC
## 10 2009-01-02 XLNX -22.5 ROC
## # ... with 3,282,342 more rows
# Build a dataframe with price values
close_long <- roc_close %>%
select(-contains("ROC")) %>%
select(-contains("return")) %>%
pivot_longer(
cols = -date,
names_to = "ticker",
values_to = "price"
) %>%
add_column(data_type = "close")
close_long
## # A tibble: 3,282,352 x 4
## date ticker price data_type
## <date> <chr> <dbl> <chr>
## 1 2009-01-02 ZTS 31.0 close
## 2 2009-01-02 ZION 25 close
## 3 2009-01-02 ZBRA 21.0 close
## 4 2009-01-02 ZBH 41.8 close
## 5 2009-01-02 YUM 32 close
## 6 2009-01-02 XYL 24.2 close
## 7 2009-01-02 XRX 33.5 close
## 8 2009-01-02 XRAY 29.1 close
## 9 2009-01-02 XOM 81.6 close
## 10 2009-01-02 XLNX 18.4 close
## # ... with 3,282,342 more rows
# Build a dataframe with ROC values
returns_long <- roc_close %>%
select(date, contains("return")) %>%
pivot_longer(
cols = contains("return"),
names_to = "ticker",
names_pattern = "_(.*)",
values_to = "return"
) %>%
add_column(data_type = "return")
returns_long
## # A tibble: 3,282,352 x 4
## date ticker return data_type
## <date> <chr> <dbl> <chr>
## 1 2009-01-02 ZTS 0 return
## 2 2009-01-02 ZION 0.0200 return
## 3 2009-01-02 ZBRA 0.0380 return
## 4 2009-01-02 ZBH 0.0332 return
## 5 2009-01-02 YUM 0.0159 return
## 6 2009-01-02 XYL 0 return
## 7 2009-01-02 XRX 0.0514 return
## 8 2009-01-02 XRAY 0.0308 return
## 9 2009-01-02 XOM 0.0227 return
## 10 2009-01-02 XLNX 0.0303 return
## # ... with 3,282,342 more rows
We now nest the three dataframes so we have a similar structure to the constituents
data.
roc_nested <- roc_long %>%
nest(rocs = c(ticker, roc, data_type))
roc_nested
## # A tibble: 3,028 x 2
## date rocs
## <date> <list>
## 1 2009-01-02 <tibble [1,084 x 3]>
## 2 2009-01-05 <tibble [1,084 x 3]>
## 3 2009-01-06 <tibble [1,084 x 3]>
## 4 2009-01-07 <tibble [1,084 x 3]>
## 5 2009-01-08 <tibble [1,084 x 3]>
## 6 2009-01-09 <tibble [1,084 x 3]>
## 7 2009-01-12 <tibble [1,084 x 3]>
## 8 2009-01-13 <tibble [1,084 x 3]>
## 9 2009-01-14 <tibble [1,084 x 3]>
## 10 2009-01-15 <tibble [1,084 x 3]>
## # ... with 3,018 more rows
close_nested <- close_long %>%
nest(prices = c(ticker, price, data_type))
close_nested
## # A tibble: 3,028 x 2
## date prices
## <date> <list>
## 1 2009-01-02 <tibble [1,084 x 3]>
## 2 2009-01-05 <tibble [1,084 x 3]>
## 3 2009-01-06 <tibble [1,084 x 3]>
## 4 2009-01-07 <tibble [1,084 x 3]>
## 5 2009-01-08 <tibble [1,084 x 3]>
## 6 2009-01-09 <tibble [1,084 x 3]>
## 7 2009-01-12 <tibble [1,084 x 3]>
## 8 2009-01-13 <tibble [1,084 x 3]>
## 9 2009-01-14 <tibble [1,084 x 3]>
## 10 2009-01-15 <tibble [1,084 x 3]>
## # ... with 3,018 more rows
returns_nested <- returns_long %>%
nest(returns = c(ticker, return, data_type))
returns_nested
## # A tibble: 3,028 x 2
## date returns
## <date> <list>
## 1 2009-01-02 <tibble [1,084 x 3]>
## 2 2009-01-05 <tibble [1,084 x 3]>
## 3 2009-01-06 <tibble [1,084 x 3]>
## 4 2009-01-07 <tibble [1,084 x 3]>
## 5 2009-01-08 <tibble [1,084 x 3]>
## 6 2009-01-09 <tibble [1,084 x 3]>
## 7 2009-01-12 <tibble [1,084 x 3]>
## 8 2009-01-13 <tibble [1,084 x 3]>
## 9 2009-01-14 <tibble [1,084 x 3]>
## 10 2009-01-15 <tibble [1,084 x 3]>
## # ... with 3,018 more rows
3.4 Consolidated, Nested Data
Having all three nested dataframes (constituents, prices and rocs) we can consolidate everything into a single dataframe. This will be used to analyze the strategy.
consolidated_nested <- constituents_nested %>%
left_join(close_nested, by = "date") %>%
left_join(roc_nested, by = "date") %>%
left_join(returns_nested, by = "date") %>%
mutate(row_id = row_number()) %>%
relocate(row_id, everything())
consolidated_nested
## # A tibble: 3,028 x 6
## row_id date constituents prices rocs returns
## <int> <date> <list> <list> <list> <list>
## 1 1 2009-01-02 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 2 2 2009-01-05 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 3 3 2009-01-06 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 4 4 2009-01-07 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 5 5 2009-01-08 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 6 6 2009-01-09 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 7 7 2009-01-12 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 8 8 2009-01-13 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 9 9 2009-01-14 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## 10 10 2009-01-15 <tibble [500 x 2]> <tibble [1,084 x 3]> <tibble> <tibble>
## # ... with 3,018 more rows
In this consolidated_nested
dataframe each row represents a single date. The advantage of this dataframe is that it contains the constituents, the prices, and rocs for each date.
3.5 Stock Selection Per Day
To code the stock selection, we start by defining the starting and ending dates.
start_date <- ymd("2009-12-31")
end_date <- ymd("2020-01-02")
Then, we can filter the consolidated_nested
dataframe so we have data around this range of dates.
consolidated_nested <- consolidated_nested %>%
filter(date %>% between(start_date - 10, end_date + 10)
)
We then execute the following steps:
- For each day, get the constituents of the S&P 500.
For this purpose, we create the extract_constituents()
function:
extract_constituents <- function(df, extraction_date) {
df %>%
filter(date == extraction_date) %>%
select(row_id, date, constituents) %>%
unnest(cols = constituents) %>%
pull(ticker)
}
We can test it for any date and show just the first 6 tickers:
extract_constituents(consolidated_nested, ymd("2019-12-31")) %>%
head()
## [1] "ZTS" "ZION" "ZBRA" "ZBH" "YUM" "XYL"
- Then, rank each of those constituents by their previous closing 200-day ROC. Choose the top 10 stocks.
For this purpose, we create two functions. We also set as an argument the number of stocks that can be picked. In that way, the strategy can be changed to pick the top ‘n’ stocks. The default is set to 10.
-
the
show_top_roc()
function, which shows us the top 10 tickers ranked by ROC, for any given date. -
the
extract_top_roc()
function, which extracts the top 10 tickers ranked by ROC, for any given date.
Here’s the show_top_roc()
function:
show_top_roc <- function(df, extraction_date, .top_n = 10L) {
if (!is.numeric(.top_n)) {
stop("You must provide an integer for the top 'n' constituents.")
}
# Get the ID for the previous close
previous_row <- df %>%
filter(date == extraction_date) %>%
pull(row_id) - 1
# Proceed to get the top n tickers
df %>%
filter(row_id == previous_row) %>%
select(row_id, date, rocs) %>%
unnest(cols = rocs) %>%
# Filter to focus on the S&P 500 constituents. Use extract_constituents function
filter(ticker %in% extract_constituents(df, extraction_date)) %>%
# Get the top n, as ordered by ROC
slice_max(order_by = roc, n = .top_n)
}
Let’s test this function for a given date:
show_top_roc(consolidated_nested, ymd("2019-12-31"), .top_n = 10)
## # A tibble: 10 x 5
## row_id date ticker roc data_type
## <int> <date> <chr> <dbl> <chr>
## 1 2767 2019-12-30 AMD 95.4 ROC
## 2 2767 2019-12-30 TGT 68.1 ROC
## 3 2767 2019-12-30 QRVO 66.8 ROC
## 4 2767 2019-12-30 HWM 62.6 ROC
## 5 2767 2019-12-30 MKTX 60.4 ROC
## 6 2767 2019-12-30 LRCX 60.4 ROC
## 7 2767 2019-12-30 AAPL 56.6 ROC
## 8 2767 2019-12-30 QCOM 56.2 ROC
## 9 2767 2019-12-30 LDOS 56.1 ROC
## 10 2767 2019-12-30 RMD 54.0 ROC
Indeed it shows the top 10 ROC tickers for the previous close.
Now here’s the extract_top_roc()
function:
extract_top_roc <- function(df, extraction_date, .top_n = 10L) {
show_top_roc(df, extraction_date, .top_n) %>%
pull(ticker)
}
Again, we take it for a spin:
extract_top_roc(consolidated_nested, ymd("2019-12-31"), .top_n = 10)
## [1] "AMD" "TGT" "QRVO" "HWM" "MKTX" "LRCX" "AAPL" "QCOM" "LDOS" "RMD"
It works perfectly; we got the exact same tickers as with the show_top_roc()
function.
- Get the corresponding weights for each stock. For this step, we create the
calculate_weights()
function:
calculate_weights <- function(df, .date, .top_n = 10L) {
chosen_tickers <- extract_top_roc(df, .date, .top_n)
weights_tbl <- tibble(
date = .date,
tickers = extract_constituents(df, .date),
weights = if_else(
tickers %in% chosen_tickers,
1 / length(chosen_tickers), # equally weighted portfolio
0 # if ticker is not chose, assign 0 weight
)
) %>%
# Focus on stocks that are chosen
filter(weights > 0) %>%
arrange(tickers)
return(weights_tbl)
}
We also implement the ‘wide’ version of this function:
calculate_weights_wide <- function(df, .date, .top_n = 10L) {
weights_tbl <- calculate_weights(df, .date, .top_n = .top_n)
tickers_tbl <- weights_tbl %>%
add_column("ticker_id" = paste0("ticker_", 1:.top_n)) %>%
pivot_wider(
id_cols = date,
names_from = ticker_id,
values_from = tickers
)
weights_tbl <- weights_tbl %>%
add_column("weight_id" = paste0("weight_", 1:.top_n)) %>%
pivot_wider(
id_cols = date,
names_from = weight_id,
values_from = weights
)
return(left_join(tickers_tbl, weights_tbl, by = "date"))
}
Again, we test the functions:
calculate_weights(consolidated_nested, ymd("2019-12-31"), .top_n = 10)
## # A tibble: 10 x 3
## date tickers weights
## <date> <chr> <dbl>
## 1 2019-12-31 AAPL 0.1
## 2 2019-12-31 AMD 0.1
## 3 2019-12-31 HWM 0.1
## 4 2019-12-31 LDOS 0.1
## 5 2019-12-31 LRCX 0.1
## 6 2019-12-31 MKTX 0.1
## 7 2019-12-31 QCOM 0.1
## 8 2019-12-31 QRVO 0.1
## 9 2019-12-31 RMD 0.1
## 10 2019-12-31 TGT 0.1
calculate_weights_wide(consolidated_nested, ymd("2019-12-31"), .top_n = 10)
## # A tibble: 1 x 21
## date ticker_1 ticker_2 ticker_3 ticker_4 ticker_5 ticker_6 ticker_7
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2019-12-31 AAPL AMD HWM LDOS LRCX MKTX QCOM
## # ... with 13 more variables: ticker_8 <chr>, ticker_9 <chr>, ticker_10 <chr>,
## # weight_1 <dbl>, weight_2 <dbl>, weight_3 <dbl>, weight_4 <dbl>,
## # weight_5 <dbl>, weight_6 <dbl>, weight_7 <dbl>, weight_8 <dbl>,
## # weight_9 <dbl>, weight_10 <dbl>
This is the main function needed for the implementation. At each date, the function tells you what are the weights needed on each of the S&P 500 constituents, ranking those by 200-day ROC and equally weighing the top ‘n.’
3.6 Strategy Weights
We are finally ready to implement the aforementioned strategy on each day. We proceed by getting a vector of dates for which we want to implement the strategy (defined by the start_date
and end_date
variables).
# Extract the dates needed for the analysis
analysis_dates <- consolidated_nested %>%
select(date) %>%
filter(
date %>% between(ymd("2009-12-31"), ymd("2020-01-02"))
) %>% pull(date)
analysis_dates %>% head()
## [1] "2009-12-31" "2010-01-04" "2010-01-05" "2010-01-06" "2010-01-07"
## [6] "2010-01-08"
Then, we loop loop over that vector to use it on our calculate_weights_wide()
function.
# WARNING: This takes ~ 3-4 min to to run
strategy_weights <- analysis_dates %>%
map_dfr(.f = ~ calculate_weights_wide(consolidated_nested, .x, .top_n = 10))
strategy_weights
## # A tibble: 2,518 x 21
## date ticker_1 ticker_2 ticker_3 ticker_4 ticker_5 ticker_6 ticker_7
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2009-12-31 CBRE F FITB GNW JBL ODP TGNA
## 2 2010-01-04 CBRE F FITB GNW JBL MEE ODP
## 3 2010-01-05 CBRE F FITB GNW JBL ODP TGNA
## 4 2010-01-06 CBRE FITB GNW JBL MEE ODP TGNA
## 5 2010-01-07 CBRE F FITB GNW JBL MEE ODP
## 6 2010-01-08 CBRE F FITB GNW JBL MEE ODP
## 7 2010-01-11 DD F FITB GNW JBL MEE ODP
## 8 2010-01-12 DD F FITB GNW ISRG MEE ODP
## 9 2010-01-13 DD F FITB GNW IP MEE ODP
## 10 2010-01-14 F FITB GNW LNC MEE ODP TGNA
## # ... with 2,508 more rows, and 13 more variables: ticker_8 <chr>,
## # ticker_9 <chr>, ticker_10 <chr>, weight_1 <dbl>, weight_2 <dbl>,
## # weight_3 <dbl>, weight_4 <dbl>, weight_5 <dbl>, weight_6 <dbl>,
## # weight_7 <dbl>, weight_8 <dbl>, weight_9 <dbl>, weight_10 <dbl>
strategy_weights_long <- analysis_dates %>%
map_dfr(.f = ~ calculate_weights(consolidated_nested, .x, .top_n = 10))
3.7 Time series of the Net Asset Value
In order to get the time series for the NAV, we now focus on the daily returns of each selected stock.
We first join the returns_long
dataframe with the strategy_weights_long
dataframe.
weights_n_returns <- strategy_weights_long %>%
rename(ticker = tickers) %>%
left_join(returns_long, by = c("date", "ticker")) %>%
select(-data_type)
weights_n_returns
## # A tibble: 25,180 x 4
## date ticker weights return
## <date> <chr> <dbl> <dbl>
## 1 2009-12-31 CBRE 0.1 -0.0335
## 2 2009-12-31 F 0.1 0.00100
## 3 2009-12-31 FITB 0.1 -0.00409
## 4 2009-12-31 GNW 0.1 -0.0224
## 5 2009-12-31 JBL 0.1 -0.0170
## 6 2009-12-31 ODP 0.1 -0.0153
## 7 2009-12-31 TGNA 0.1 -0.0113
## 8 2009-12-31 THC 0.1 0.0365
## 9 2009-12-31 WYND 0.1 -0.0113
## 10 2009-12-31 XL1 0.1 -0.00435
## # ... with 25,170 more rows
We then get the portfolio’s return for each date:
strategy_returns <- weights_n_returns %>%
# calculate the weighted return
mutate(weighted_return = weights * return) %>%
group_by(date) %>%
# sum the returns for each date
summarise(strategy_return = sum(weighted_return)) %>%
# get a column with the NAV multiplier.
mutate(multiplier = if_else(date == analysis_dates[[1]],
1,
strategy_return + 1),
cumulative_multiplier = cumprod(multiplier))
strategy_returns
## # A tibble: 2,518 x 4
## date strategy_return multiplier cumulative_multiplier
## <date> <dbl> <dbl> <dbl>
## 1 2009-12-31 -0.00816 1 1
## 2 2010-01-04 0.0265 1.03 1.03
## 3 2010-01-05 0.0382 1.04 1.07
## 4 2010-01-06 0.00498 1.00 1.07
## 5 2010-01-07 0.0223 1.02 1.09
## 6 2010-01-08 0.00465 1.00 1.10
## 7 2010-01-11 -0.00343 0.997 1.10
## 8 2010-01-12 -0.0218 0.978 1.07
## 9 2010-01-13 0.0243 1.02 1.10
## 10 2010-01-14 0.0184 1.02 1.12
## # ... with 2,508 more rows
Now, after defining the initial capital with the starting_capital
variable, we see how that amount evolves by following this strategy.
starting_capital <- 100
strategy_nav <- strategy_returns %>%
mutate(nav = starting_capital * cumulative_multiplier) %>%
select(date, strategy_return, nav)
strategy_nav
## # A tibble: 2,518 x 3
## date strategy_return nav
## <date> <dbl> <dbl>
## 1 2009-12-31 -0.00816 100
## 2 2010-01-04 0.0265 103.
## 3 2010-01-05 0.0382 107.
## 4 2010-01-06 0.00498 107.
## 5 2010-01-07 0.0223 109.
## 6 2010-01-08 0.00465 110.
## 7 2010-01-11 -0.00343 110.
## 8 2010-01-12 -0.0218 107.
## 9 2010-01-13 0.0243 110.
## 10 2010-01-14 0.0184 112.
## # ... with 2,508 more rows
Let’s finish off with a nice interactive plot of the strategy’s NAV:
strategy_nav %>%
hchart("line",
hcaes(date, nav),
color = "#002C54") %>%
# color = "#258039") %>%
hc_title(
text = "200-day ROC Strategy's Net Asset Value"
) %>%
hc_add_theme(hc_theme_538())
3.8 Comparing against the S&P 500 Index
One final step would be to compare these results against holding the whole S&P 500.
We can download daily returns for the S&P 500 from January 02, 1986 through June 29, 2018. The data comes from this Kaggle webpage. We note that this dataset has a year and half less data than the provided dataset. However, it’s still useful to contrast the strategies.
# Import csv file
spx <- read_csv("data/spx.csv")
## Rows: 8192 Columns: 2
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): date
## dbl (1): close
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
spx_tbl <- spx %>%
mutate(
# Parse the date to a "date" type of object
date = parse_date(date, format = "%d-%b-%y"),
# Calculate the returns
returns = (close / lag(close)) - 1
) %>%
# Filter out the first value
filter(!is.na(returns)) %>%
# Filter for dates in the analysis
filter(date %in% analysis_dates) %>%
# Get lower-case and no-space names
clean_names()
spx_tbl
## # A tibble: 2,139 x 3
## date close returns
## <date> <dbl> <dbl>
## 1 2009-12-31 1115. -0.0100
## 2 2010-01-04 1133. 0.0160
## 3 2010-01-05 1137. 0.00312
## 4 2010-01-06 1137. 0.000546
## 5 2010-01-07 1142. 0.00400
## 6 2010-01-08 1145. 0.00288
## 7 2010-01-11 1147. 0.00175
## 8 2010-01-12 1136. -0.00938
## 9 2010-01-13 1146. 0.00833
## 10 2010-01-14 1148. 0.00243
## # ... with 2,129 more rows
We can create the S&P 500 NAV for the duration of the strategy.
spx_multiplier <- spx_tbl %>%
group_by(date) %>%
# sum the returns for each date
summarise(sp500_return = sum(returns)) %>%
# get a column with the NAV multiplier.
mutate(multiplier = if_else(date == analysis_dates[[1]],
1,
sp500_return + 1),
cumulative_multiplier = cumprod(multiplier))
spx_nav <- spx_multiplier %>%
mutate(nav = starting_capital * cumulative_multiplier) %>%
select(date, sp500_return, nav)
spx_nav
## # A tibble: 2,139 x 3
## date sp500_return nav
## <date> <dbl> <dbl>
## 1 2009-12-31 -0.0100 100
## 2 2010-01-04 0.0160 102.
## 3 2010-01-05 0.00312 102.
## 4 2010-01-06 0.000546 102.
## 5 2010-01-07 0.00400 102.
## 6 2010-01-08 0.00288 103.
## 7 2010-01-11 0.00175 103.
## 8 2010-01-12 -0.00938 102.
## 9 2010-01-13 0.00833 103.
## 10 2010-01-14 0.00243 103.
## # ... with 2,129 more rows
Now let’s plot the NAV of holding the S&P 500 index:
spx_nav %>%
hchart("line",
hcaes(date, nav),
# color = "#002C54") %>%
color = "#258039") %>%
hc_title(
text = "Holding the S&P 500 Index"
) %>%
hc_add_theme(hc_theme_538())
To see both strategies side to side, we can substract the NAV from each other:
relative_nav <- spx_nav %>%
rename(spx_nav = nav) %>%
left_join(strategy_nav %>% rename(strat_nav = nav), by = "date") %>%
mutate(nav_diff = strat_nav - spx_nav)
In the chart below, we plot the difference between the NAV’s growth produced by the 200-day ROC strategy against holding the S&P 500 index. We can see that, for a brief window of time, holding the index was more profitable. However, after 10 years, the 200-day ROC strategy clearly beats the index.
relative_nav %>%
hchart("line",
hcaes(date, nav_diff),
# color = "#002C54") %>%
color = "#258039") %>%
hc_title(
text = "NAV difference between strategies"
) %>%
hc_add_theme(hc_theme_538())
We could also visualize both NAVs. Below we provide a chart where we do just that. The blue line corresponds to the strategy’s NAV, while the red line corresponds to the S&P 500 index. The x-axis denotes the amount of days after the strategy is implemented.
highchart(type = "chart") %>%
hc_add_series(relative_nav$spx_nav) %>%
hc_add_series(relative_nav$strat_nav) %>%
hc_title(
text = "Contrast of NAVs between S&P500 index and ROC strategy"
) %>%
hc_add_theme(hc_theme_538())
4. Conclusions
This strategy clearly beat the S&P 500 during the analysis period. However, the execution of this strategy, as presented on this document, relies upon some unrealistic assumptions. Namely:
- No trading friction (i.e. there are no commissions and taxes for buying/selling stocks).
- The closing price of the previous day is equal to the opening price of the current day.
- The investor can buy fractions of shares.
The code can be further improved to ‘relax’ all of the previous assumptions and reflect a ‘more realistic’ implementation.
- Posted on:
- February 11, 2022
- Length:
- 38 minute read, 7913 words
- See Also: