데이터 불러오기 및 관련 function 연습
Note: 이 페이지는 Reproducible Finance with R 책을 (과하게) 참고 하였습니다. 다른 Reference 는 밑에 적어 두겠습니다.
패키지 준비하기
데이터 불러오기
아래의 티커를 이용해서 야후 파이낸스로 부터 데이터를 받아옵니다.
library(purrr)
symbols = c('SPY','EFA','IJS','EEM','AGG')
prices =
getSymbols(symbols,
src = 'yahoo',
from = '2012-12-31',
to = '2017-12-31',
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad((get(.)))) %>% # get adjusted price from each of our individual price series
reduce(merge) %>% # looks for the date index shared by our objects and uses that index to align the data
'colnames<-'(symbols) # rename the columns according to the symbols object
## 'getSymbols' currently uses auto.assign=TRUE by default, but will
## use auto.assign=FALSE in 0.5-0. You will still be able to use
## 'loadSymbols' to automatically load data. getOption("getSymbols.env")
## and getOption("getSymbols.auto.assign") will still be checked for
## alternate defaults.
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for details.
head(prices)
## SPY EFA IJS EEM AGG
## 2012-12-31 122.3169 45.52038 35.92746 37.72139 91.52252
## 2013-01-02 125.4519 46.22488 36.91325 38.46137 91.41542
## 2013-01-03 125.1685 45.77657 36.86440 38.18919 91.18474
## 2013-01-04 125.7181 46.00873 37.15303 38.26574 91.28361
## 2013-01-07 125.3746 45.80859 36.93544 37.97656 91.23418
## 2013-01-08 125.0139 45.55241 36.79335 37.63634 91.31657
map function - apply a function to each element of a list or atomic vector - map() always returns a list
1:3 %>% map(function(x) rnorm(5, mean=x, sd=1))
## [[1]]
## [1] 1.4646201 0.6149433 0.2485235 1.0308619 1.3684993
##
## [[2]]
## [1] 4.690468 2.194752 2.067244 0.798433 2.671599
##
## [[3]]
## [1] 3.7040382 4.3450097 1.6941359 0.8323598 2.7185455
1:3 %>% map(~ .x**2)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
Usage of Reduce() and merge()
- Reduce()
takes a function f
of two arguments and a list or vector x
which is to be ‘reduced’ using f
. The function is first called on the first two components of x, then with the result of that as the first argument and the third component of x as the second argument, then again with the result of the second step as first argument and the fourth component of x as the second argument etc. The process is continued until all elements of x have been processed.
cumsum(1:5)
## [1] 1 3 6 10 15
Reduce(x=1:5, f="+")
## [1] 15
Reduce(x=1:5, f="+", accumulate = TRUE)
## [1] 1 3 6 10 15
Joining multiple data frames with the same ids but different other columns. While there is a ready-made function
join_all()
for this in theplyr
package, we will see shortly how to solve this task usingReduce()
using themerge()
function from baseR
.Sums of matrix powers
P = rbind(c(0.9,0.1),c(1,0))
library(expm)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Attaching package: 'expm'
## The following object is masked from 'package:Matrix':
##
## expm
P_power = lapply(0:10, function(k) P %^% k)
head(P_power)
## [[1]]
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
##
## [[2]]
## [,1] [,2]
## [1,] 0.9 0.1
## [2,] 1.0 0.0
##
## [[3]]
## [,1] [,2]
## [1,] 0.91 0.09
## [2,] 0.90 0.10
##
## [[4]]
## [,1] [,2]
## [1,] 0.909 0.091
## [2,] 0.910 0.090
##
## [[5]]
## [,1] [,2]
## [1,] 0.9091 0.0909
## [2,] 0.9090 0.0910
##
## [[6]]
## [,1] [,2]
## [1,] 0.90909 0.09091
## [2,] 0.90910 0.09090
Reduce(P_power, f="+")
## [,1] [,2]
## [1,] 10.082645 0.9173554
## [2,] 9.173554 1.8264463
- Simulating a trajectory of a Markov Chain without a loop
P <- matrix(c(0, 0.1, 0.9, 0.2, 0.5, 0.3, 0, 0.5, 0.5), ncol = 3, byrow = T);
P
## [,1] [,2] [,3]
## [1,] 0.0 0.1 0.9
## [2,] 0.2 0.5 0.3
## [3,] 0.0 0.5 0.5
x_0 <- 2
newstate <- function(oldstate,u) {
which.min(u>cumsum(P[oldstate,]))
}
x_1 = newstate(x_0, runif(1)); x_1
## [1] 3
x_2 = newstate(x_1, runif(1)); x_2
## [1] 2
We can continue to do this without looping by using the function Reduce
.
We just have to provide a vector that contains the starting state as first component and the K uniform random numbers needed to generate the following states. We also set accumulate=TRUE
, because we want the whole trajectory and not only the state at time K.
set.seed(1)
K = 500
mc_without_loop = Reduce(newstate, c(x_0,runif(K)),accumulate = TRUE)
# Distribution of states:
table(mc_without_loop)/length(mc_without_loop)
## mc_without_loop
## 1 2 3
## 0.09580838 0.48303393 0.42115768
csv파일 불러오기
The function tk_xts(date_var = date)
converts the data frame to an xts object.
prices =
read_csv('Reproducible Finance.csv',
col_types =
cols(date = col_date(format='%Y-%m-%d'))) %>%
tk_xts(date_var = date)
## Warning: Non-numeric columns being dropped: date
엑셀 파일 불러오기
mutate
: Create new variables with functions of existing variables
prices =
read_excel('Reproducible Finance.xlsx',
col_types = c('text','numeric','numeric','numeric','numeric','numeric')) %>%
mutate(date = ymd(date)) %>%
tk_xts(date_var = date)
## Warning: Non-numeric columns being dropped: date
head(prices)
## SPY EFA IJS EEM AGG
## 2012-12-31 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-02 131.5977 49.92501 77.12553 40.67489 98.08131
## 2013-01-03 131.3004 49.44080 77.02349 40.38705 97.83374
## 2013-01-04 131.8771 49.69155 77.62656 40.46800 97.93983
## 2013-01-07 131.5166 49.47539 77.17194 40.16218 97.88681
## 2013-01-08 131.1382 49.19869 76.87505 39.80238 97.97523
지금까지 raw 데이터를 R에 불러오는 3가지 방법을 알아보았습니다.
Daily 가격 데이터를 Monthly 데이터로 변환하기
- to.monthly 함수를 사용하면 간편하게 변환할 수 있다.
- 한달치 일일 가격을 월별 데이터로 변환할때, 예를들어 12월 가격 데이터인 경우,
- 12월 1일 데이터를 12월의 가격 데이터로 사용할지
- 12월 31일 데이터를 12월 가격 데이터로 사용할지
- 12월 전체 가격들의 평균을 12월 가격 데이터로 사용할지 정해야한다.
월말 가격
prices_month_last = to.monthly(prices,
indexAt = 'lastof',
OHLC=FALSE)
head(prices_month_last)
## SPY EFA IJS EEM AGG
## 2012-12-31 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-31 134.8773 50.99717 79.08315 39.77539 97.58625
## 2013-02-28 136.5982 50.34004 80.37274 38.86691 98.16285
## 2013-03-31 141.7850 50.99717 83.67441 38.47113 98.25957
## 2013-04-30 144.5090 53.55654 83.77677 38.93888 99.21130
## 2013-05-31 147.9209 51.93964 87.36826 37.05894 97.22598
월초 가격
prices_month_first = to.monthly(prices,
indexAt = 'firstof',
OHLC=FALSE)
head(prices_month_first)
## SPY EFA IJS EEM AGG
## 2012-12-01 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-01 134.8773 50.99717 79.08315 39.77539 97.58625
## 2013-02-01 136.5982 50.34004 80.37274 38.86691 98.16285
## 2013-03-01 141.7850 50.99717 83.67441 38.47113 98.25957
## 2013-04-01 144.5090 53.55654 83.77677 38.93888 99.21130
## 2013-05-01 147.9209 51.93964 87.36826 37.05894 97.22598
월평균 가격
prices_month_avg = apply.monthly(prices, mean)
head(prices_month_avg)
## SPY EFA IJS EEM AGG
## 2012-12-31 128.3092 49.16410 75.06590 39.89233 98.19626
## 2013-01-31 133.2525 50.21898 78.04847 40.04696 97.90281
## 2013-02-28 136.3583 50.50933 80.34736 39.30245 97.69156
## 2013-03-28 140.1191 51.09358 82.83109 38.56513 97.90329
## 2013-04-30 142.0660 51.66610 82.18741 37.76994 98.83934
## 2013-05-31 148.5901 53.87725 86.78886 38.74793 98.26976
수익률 데이터 만들기
asset_returns_xts =
Return.calculate(prices_month_last,
method = 'log') %>%
na.omit()
head(asset_returns_xts,3)
## SPY EFA IJS EEM AGG
## 2013-01-31 0.04992297 0.03660636 0.05213343 -0.002935495 -0.006231517
## 2013-02-28 0.01267831 -0.01296938 0.01617522 -0.023105260 0.005891222
## 2013-03-31 0.03726793 0.01296938 0.04025808 -0.010235026 0.000984796
현재까지 일일 가격데이터를 가지고 와서, adjusted price를 추출한 후, 월별 가격 데이터로 변환한 후에 마지막으로 수익률로 바꾸어 보았다.
xts 와 data frame 의 차이
- xts 같은 경우에는 index에서 시간에 해당하는 정보를 가지고 있는 반면에, data frame 같은 경우는 date 라는 열에 시간 정보를 가지고 있다.
xts에서 data frame 으로 바꾸는 경우, 기본적으로 시작은 - data.frame(date=index(.)), 로 하는데 이는 - coerces our object into a data frame - adds a date column based on the index - 이러면 xts의 index에 해당하는 시간정보는 보존이 되며 추후에 remove_rownames를 통해 지워줘야한다.
R Tips R follows a set of conventions that makes one layout of tabular data much easier to work with than others. Your data will be easier to work with in R if it follows three rules as follows:
- Each variable in the data set is placed in its own column
- Each observation is placed in its own row
- Each value is placed in its own cell
Data that satisfies these rules is known as tidy data.
The table 1 is tidy data.
##
## Attaching package: 'DSR'
## The following objects are masked from 'package:tidyr':
##
## table1, table2, table3, table5, who
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
In practice, raw data is rarely tidy and is much harder to work with as a result.
Data set two intermingles the values of population and cases in the same columns. As a result, you will need to untangle the values whenever you want to work with each variable separately.
- Notice that for the first two columns, for each key column contains the values belongs to the key. However,
key
column contains only keys andvalue
column contains the values associated with those keys.
table2
## # A tibble: 12 x 4
## country year key value
## <fct> <int> <fct> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
Data set three combines the values of cases and population into the same cells. You will need to separate the population values from the cases values if you wish to do math with them. This can be done, but not with ‘basic’ R syntax
table3
## # A tibble: 6 x 3
## country year rate
## <fct> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Spread()
spread()
turns a pair of key;value columns into a set of tidy columns. To use spread()
, pass it the name of a dataframe, then the name of key column in the data frame, and then the name of the value column. Pass the column names as they are; do not use quotes.
spread(table2,key,value)
## # A tibble: 6 x 4
## country year cases population
## <fct> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
gather()
gather()
does the reverse of spread()
. It collects a set of column names and places thme into a single key
column. gather()
collects a set of column names and places them into a single key
column. It also collects the cells of those columns and places them into a single value column.
For table4,
table4
## # A tibble: 3 x 3
## country `1999` `2000`
## <fct> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
gather(table4, 'year', 'cases', 2:3)
## # A tibble: 6 x 3
## country year cases
## <fct> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Come back to price data
asset_returns_dplyr_byhand = prices %>%
to.monthly(indexAt = 'lastof', OHLC=FALSE) %>%
# convert the index to a date colun
data.frame(date=index(.))%>%
remove_rownames()%>%
gather(asset, prices, -date) %>%
group_by(asset) %>%
mutate(returns = (log(prices) - log(lag(prices)))) %>%
select(-prices) %>%
spread(asset, returns) %>%
select(date, symbols)
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(symbols)` instead of `symbols` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
head(asset_returns_dplyr_byhand, 3)
## # A tibble: 3 x 6
## date SPY EFA IJS EEM AGG
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2012-12-31 NA NA NA NA NA
## 2 2013-01-31 0.0499 0.0366 0.0521 -0.00294 -0.00623
## 3 2013-02-28 0.0127 -0.0130 0.0162 -0.0231 0.00589