R과 금융데이터 (1) 및 R 연습

Roh

2020/10/27

데이터 불러오기 및 관련 function 연습

Note: 이 페이지는 Reproducible Finance with R 책을 (과하게) 참고 하였습니다. 다른 Reference 는 밑에 적어 두겠습니다.

패키지 준비하기

데이터 불러오기

아래의 티커를 이용해서 야후 파이낸스로 부터 데이터를 받아옵니다.

library(purrr)
symbols = c('SPY','EFA','IJS','EEM','AGG')
prices = 
  getSymbols(symbols,
                    src = 'yahoo',
                    from = '2012-12-31',
                    to = '2017-12-31',
                    auto.assign = TRUE,
                    warnings = FALSE) %>%
  map(~Ad((get(.)))) %>% # get adjusted price from each of our individual price series
  reduce(merge) %>% # looks for the date index shared by our objects and uses that index to align the data
  'colnames<-'(symbols) # rename the columns according to the symbols object
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
## 
## This message is shown once per session and may be disabled by setting 
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
head(prices)
##                 SPY      EFA      IJS      EEM      AGG
## 2012-12-31 122.3169 45.52038 35.92746 37.72139 91.52252
## 2013-01-02 125.4519 46.22488 36.91325 38.46137 91.41542
## 2013-01-03 125.1685 45.77657 36.86440 38.18919 91.18474
## 2013-01-04 125.7181 46.00873 37.15303 38.26574 91.28361
## 2013-01-07 125.3746 45.80859 36.93544 37.97656 91.23418
## 2013-01-08 125.0139 45.55241 36.79335 37.63634 91.31657

map function - apply a function to each element of a list or atomic vector - map() always returns a list

1:3 %>% map(function(x) rnorm(5, mean=x, sd=1))
## [[1]]
## [1] 1.4646201 0.6149433 0.2485235 1.0308619 1.3684993
## 
## [[2]]
## [1] 4.690468 2.194752 2.067244 0.798433 2.671599
## 
## [[3]]
## [1] 3.7040382 4.3450097 1.6941359 0.8323598 2.7185455
1:3 %>% map(~ .x**2)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9

Usage of Reduce() and merge() - Reduce() takes a function f of two arguments and a list or vector x which is to be ‘reduced’ using f. The function is first called on the first two components of x, then with the result of that as the first argument and the third component of x as the second argument, then again with the result of the second step as first argument and the fourth component of x as the second argument etc. The process is continued until all elements of x have been processed.

cumsum(1:5)
## [1]  1  3  6 10 15
Reduce(x=1:5, f="+")
## [1] 15
Reduce(x=1:5, f="+", accumulate = TRUE)
## [1]  1  3  6 10 15
  • Joining multiple data frames with the same ids but different other columns. While there is a ready-made function join_all() for this in the plyr package, we will see shortly how to solve this task using Reduce() using the merge() function from base R.

  • Sums of matrix powers

P = rbind(c(0.9,0.1),c(1,0))
library(expm)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
## 
##     expm
P_power = lapply(0:10, function(k) P %^% k)
head(P_power)
## [[1]]
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
## 
## [[2]]
##      [,1] [,2]
## [1,]  0.9  0.1
## [2,]  1.0  0.0
## 
## [[3]]
##      [,1] [,2]
## [1,] 0.91 0.09
## [2,] 0.90 0.10
## 
## [[4]]
##       [,1]  [,2]
## [1,] 0.909 0.091
## [2,] 0.910 0.090
## 
## [[5]]
##        [,1]   [,2]
## [1,] 0.9091 0.0909
## [2,] 0.9090 0.0910
## 
## [[6]]
##         [,1]    [,2]
## [1,] 0.90909 0.09091
## [2,] 0.90910 0.09090
Reduce(P_power, f="+")
##           [,1]      [,2]
## [1,] 10.082645 0.9173554
## [2,]  9.173554 1.8264463
  • Simulating a trajectory of a Markov Chain without a loop
P <- matrix(c(0, 0.1, 0.9, 0.2, 0.5, 0.3, 0, 0.5, 0.5), ncol = 3, byrow = T);
P
##      [,1] [,2] [,3]
## [1,]  0.0  0.1  0.9
## [2,]  0.2  0.5  0.3
## [3,]  0.0  0.5  0.5
x_0 <- 2
newstate <- function(oldstate,u) {
  which.min(u>cumsum(P[oldstate,]))
}
x_1 = newstate(x_0, runif(1)); x_1
## [1] 3
x_2 = newstate(x_1, runif(1)); x_2
## [1] 2

We can continue to do this without looping by using the function Reduce. We just have to provide a vector that contains the starting state as first component and the K uniform random numbers needed to generate the following states. We also set accumulate=TRUE, because we want the whole trajectory and not only the state at time K.

set.seed(1)
K = 500
mc_without_loop = Reduce(newstate, c(x_0,runif(K)),accumulate = TRUE)
# Distribution of states:
table(mc_without_loop)/length(mc_without_loop)
## mc_without_loop
##          1          2          3 
## 0.09580838 0.48303393 0.42115768

csv파일 불러오기

The function tk_xts(date_var = date) converts the data frame to an xts object.

prices = 
  read_csv('Reproducible Finance.csv',
           col_types = 
             cols(date = col_date(format='%Y-%m-%d'))) %>%
  tk_xts(date_var = date)
## Warning: Non-numeric columns being dropped: date

엑셀 파일 불러오기

  • mutate: Create new variables with functions of existing variables
prices = 
  read_excel('Reproducible Finance.xlsx',
             col_types = c('text','numeric','numeric','numeric','numeric','numeric')) %>%
  mutate(date = ymd(date)) %>%
  tk_xts(date_var = date)
## Warning: Non-numeric columns being dropped: date
head(prices)
##                 SPY      EFA      IJS      EEM      AGG
## 2012-12-31 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-02 131.5977 49.92501 77.12553 40.67489 98.08131
## 2013-01-03 131.3004 49.44080 77.02349 40.38705 97.83374
## 2013-01-04 131.8771 49.69155 77.62656 40.46800 97.93983
## 2013-01-07 131.5166 49.47539 77.17194 40.16218 97.88681
## 2013-01-08 131.1382 49.19869 76.87505 39.80238 97.97523

지금까지 raw 데이터를 R에 불러오는 3가지 방법을 알아보았습니다.

Daily 가격 데이터를 Monthly 데이터로 변환하기

월말 가격

prices_month_last = to.monthly(prices,
                            indexAt = 'lastof',
                            OHLC=FALSE)
head(prices_month_last)
##                 SPY      EFA      IJS      EEM      AGG
## 2012-12-31 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-31 134.8773 50.99717 79.08315 39.77539 97.58625
## 2013-02-28 136.5982 50.34004 80.37274 38.86691 98.16285
## 2013-03-31 141.7850 50.99717 83.67441 38.47113 98.25957
## 2013-04-30 144.5090 53.55654 83.77677 38.93888 99.21130
## 2013-05-31 147.9209 51.93964 87.36826 37.05894 97.22598

월초 가격

prices_month_first = to.monthly(prices,
                            indexAt = 'firstof',
                            OHLC=FALSE)
head(prices_month_first)
##                 SPY      EFA      IJS      EEM      AGG
## 2012-12-01 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-01 134.8773 50.99717 79.08315 39.77539 97.58625
## 2013-02-01 136.5982 50.34004 80.37274 38.86691 98.16285
## 2013-03-01 141.7850 50.99717 83.67441 38.47113 98.25957
## 2013-04-01 144.5090 53.55654 83.77677 38.93888 99.21130
## 2013-05-01 147.9209 51.93964 87.36826 37.05894 97.22598

월평균 가격

prices_month_avg = apply.monthly(prices, mean)
head(prices_month_avg)
##                 SPY      EFA      IJS      EEM      AGG
## 2012-12-31 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-31 133.2525 50.21898 78.04847 40.04696 97.90281
## 2013-02-28 136.3583 50.50933 80.34736 39.30245 97.69156
## 2013-03-28 140.1191 51.09358 82.83109 38.56513 97.90329
## 2013-04-30 142.0660 51.66610 82.18741 37.76994 98.83934
## 2013-05-31 148.5901 53.87725 86.78886 38.74793 98.26976

수익률 데이터 만들기

asset_returns_xts = 
  Return.calculate(prices_month_last,
                   method = 'log') %>%
  na.omit()
head(asset_returns_xts,3)
##                   SPY         EFA        IJS          EEM          AGG
## 2013-01-31 0.04992297  0.03660636 0.05213343 -0.002935495 -0.006231517
## 2013-02-28 0.01267831 -0.01296938 0.01617522 -0.023105260  0.005891222
## 2013-03-31 0.03726793  0.01296938 0.04025808 -0.010235026  0.000984796

현재까지 일일 가격데이터를 가지고 와서, adjusted price를 추출한 후, 월별 가격 데이터로 변환한 후에 마지막으로 수익률로 바꾸어 보았다.

xts 와 data frame 의 차이

  • xts 같은 경우에는 index에서 시간에 해당하는 정보를 가지고 있는 반면에, data frame 같은 경우는 date 라는 열에 시간 정보를 가지고 있다.

xts에서 data frame 으로 바꾸는 경우, 기본적으로 시작은 - data.frame(date=index(.)), 로 하는데 이는 - coerces our object into a data frame - adds a date column based on the index - 이러면 xts의 index에 해당하는 시간정보는 보존이 되며 추후에 remove_rownames를 통해 지워줘야한다.

R Tips R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules as follows:

  1. Each variable in the data set is placed in its own column
  2. Each observation is placed in its own row
  3. Each value is placed in its own cell

Data that satisfies these rules is known as tidy data.

The table 1 is tidy data.

## 
## Attaching package: 'DSR'
## The following objects are masked from 'package:tidyr':
## 
##     table1, table2, table3, table5, who
## # A tibble: 6 x 4
##   country      year  cases population
##   <fct>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

In practice, raw data is rarely tidy and is much harder to work with as a result.

Data set two intermingles the values of population and cases in the same columns. As a result, you will need to untangle the values whenever you want to work with each variable separately.

  • Notice that for the first two columns, for each key column contains the values belongs to the key. However, key column contains only keys and value column contains the values associated with those keys.
table2
## # A tibble: 12 x 4
##    country      year key             value
##    <fct>       <int> <fct>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Data set three combines the values of cases and population into the same cells. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with ‘basic’ R syntax

table3
## # A tibble: 6 x 3
##   country      year rate             
##   <fct>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Spread()

spread() turns a pair of key;value columns into a set of tidy columns. To use spread(), pass it the name of a dataframe, then the name of key column in the data frame, and then the name of the value column. Pass the column names as they are; do not use quotes.

spread(table2,key,value)
## # A tibble: 6 x 4
##   country      year  cases population
##   <fct>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

gather()

gather() does the reverse of spread(). It collects a set of column names and places thme into a single key column. gather() collects a set of column names and places them into a single key column. It also collects the cells of those columns and places them into a single value column.

For table4,

table4
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <fct>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
gather(table4, 'year', 'cases', 2:3)
## # A tibble: 6 x 3
##   country     year   cases
##   <fct>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Brazil      1999   37737
## 3 China       1999  212258
## 4 Afghanistan 2000    2666
## 5 Brazil      2000   80488
## 6 China       2000  213766

Come back to price data

asset_returns_dplyr_byhand = prices %>%
  to.monthly(indexAt = 'lastof', OHLC=FALSE) %>%
  # convert the index to a date colun
  data.frame(date=index(.))%>%
  remove_rownames()%>%
  gather(asset, prices, -date) %>%
  group_by(asset) %>%
  mutate(returns = (log(prices) - log(lag(prices)))) %>%
  select(-prices) %>%
  spread(asset, returns) %>%
  select(date, symbols)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(symbols)` instead of `symbols` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
head(asset_returns_dplyr_byhand, 3)
## # A tibble: 3 x 6
##   date           SPY     EFA     IJS      EEM      AGG
##   <date>       <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
## 1 2012-12-31 NA      NA      NA      NA       NA      
## 2 2013-01-31  0.0499  0.0366  0.0521 -0.00294 -0.00623
## 3 2013-02-28  0.0127 -0.0130  0.0162 -0.0231   0.00589