Note: This page shows the process of cleaning real-time tick data received through the Kiwoom API. For errors or other inquiries, please email metrics@kakao.com.
For questions about the data itself or about data access, please email 000.
R code blocks and Python code blocks are distinguished by color as follows. Output blocks are shown in the same ivory color in both cases.
# "This is R code."
# "This is Python code."
Package
library(dplyr)
library(tidyverse)
library(DT)
library(reticulate) # Python
#py_install(packages = "matplotlib")
#py_install(packages = "pandas")
#py_install(packages = 'dfply')
options(scipen=999)
options(max.print = 99999999)
options(digits=10)
Import Data
- Check NA
wd = "G:/공유 드라이브/Project_TBD/Stock_Data/real_time/kiwoom_stocks/2021-01-14"
wd2 = "/Volumes/GoogleDrive/공유 드라이브/Project_TBD/Stock_Data/real_time/kiwoom_stocks/2021-01-14"
fn = list.files(path = wd2,
pattern = '.*stocks_trade.*\\.csv') # fetch the last 30 minutes of tick data
path = paste(wd2,fn,sep = '/')
# data = readr::read_csv(file = path,
# col_names = c('code','trade_date','timestamp','price','open','high','low',
# 'size','cum_size','ask1','bid1'))
tbl =
list.files(path = wd2, pattern = '.*stocks_trade.*\\.csv') %>%
map_df(~readr::read_csv(paste(wd2,.,sep = '/'),
col_names = c('code','trade_date','timestamp','price','open','high','low',
'size','cum_size','ask1','bid1') ))
sum(is.na(tbl))
## [1] 0
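The same check can be repeated on the pandas side after the table is handed over; a minimal sketch using a toy frame (the column names are illustrative, not the full tick schema):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the tick table; one deliberately missing price
df = pd.DataFrame({'price': [100.0, np.nan, 101.0],
                   'size': [1, 2, 3]})

# total number of missing cells across the whole frame
na_total = int(df.isna().sum().sum())
print(na_total)  # 1
```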
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime, timedelta
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_rows', 100)
df_py = r.tbl
df_py.head()
## code trade_date timestamp ... cum_size ask1 bid1
## 0 000990 100105.0000 20210114100105.3164 ... 679839.0000 60000.0000 59800.0000
## 1 000990 100105.0000 20210114100105.3164 ... 679798.0000 59900.0000 59800.0000
## 2 000300 100105.0000 20210114100105.3203 ... 97702.0000 791.0000 790.0000
## 3 000080 100105.0000 20210114100105.3711 ... 164732.0000 35950.0000 35900.0000
## 4 001630 100105.0000 20210114100105.5117 ... 108391.0000 127500.0000 127000.0000
##
## [5 rows x 11 columns]
Contents
1. Filter for a Specific Stock
- The KOSPI code for Samsung Electronics is ‘005930’
ss = df_py[df_py.code == '005930'].reset_index(drop=True)
ss.shape
## (120366, 11)
2. Use the time object in Python
Delete rows where the “second” component does not lie between 0 and 59.
Substring the timestamp to get each time component.
ss = ss[ss['timestamp'].apply(lambda x: int(str(int(x*1000000))[12:14])) < 60 ]
ss = ss.assign( microsecond = ss['timestamp'].apply(lambda x: str(int(x*1000000))[14:20]),
second = ss['timestamp'].apply(lambda x: str(int(x*1000000))[12:14]),
minute = ss['timestamp'].apply(lambda x: str(int(x*1000000))[10:12]),
hour = ss['timestamp'].apply(lambda x: str(int(x*1000000))[8:10]),
day = ss['timestamp'].apply(lambda x: str(int(x*1000000))[6:8]),
month = ss['timestamp'].apply(lambda x: str(int(x*1000000))[4:6]),
year = ss['timestamp'].apply(lambda x: str(int(x*1000000))[0:4]),
kw_time = ss['trade_date'].apply(lambda x: str(int(x))))
ss.head()
## code trade_date timestamp price ... day month year kw_time
## 0 005930 100105.0000 20210114100105.7930 89500.0000 ... 14 01 2021 100105
## 1 005930 100105.0000 20210114100105.8438 89600.0000 ... 14 01 2021 100105
## 2 005930 100106.0000 20210114100106.0195 89600.0000 ... 14 01 2021 100106
## 3 005930 100106.0000 20210114100106.0195 89500.0000 ... 14 01 2021 100106
## 4 005930 100106.0000 20210114100106.1836 89600.0000 ... 14 01 2021 100106
##
## [5 rows x 19 columns]
- Transform to a Python datetime!
ss['time'] = ss.apply(lambda x: datetime(
year = int(x['year']),
month = int(x['month']),
day = int(x['day']),
hour = int(x['hour']),
minute = int(x['minute']),
second = int(x['second']),
microsecond = int(x['microsecond'])), #.strftime("%H:%M:%S.%f")[:-3],
axis=1)
ss.head(10)
## code trade_date ... kw_time time
## 0 005930 100105.0000 ... 100105 2021-01-14 10:01:05.793536
## 1 005930 100105.0000 ... 100105 2021-01-14 10:01:05.842688
## 2 005930 100106.0000 ... 100106 2021-01-14 10:01:06.018816
## 3 005930 100106.0000 ... 100106 2021-01-14 10:01:06.018816
## 4 005930 100106.0000 ... 100106 2021-01-14 10:01:06.182656
## 5 005930 100106.0000 ... 100106 2021-01-14 10:01:06.338304
## 6 005930 100106.0000 ... 100106 2021-01-14 10:01:06.366976
## 7 005930 100106.0000 ... 100106 2021-01-14 10:01:06.678272
## 8 005930 100106.0000 ... 100106 2021-01-14 10:01:06.805248
## 9 005930 100106.0000 ... 100106 2021-01-14 10:01:07.264000
##
## [10 rows x 20 columns]
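As an aside, the same conversion can be done in one vectorized call, assuming the float timestamps retain enough precision to round-trip; a sketch with two hypothetical values:

```python
import pandas as pd

# timestamps in the Kiwoom YYYYMMDDHHMMSS.ffffff float layout (hypothetical values)
ts = pd.Series([20210114100105.793536, 20210114100105.842688])

# render with six decimals, then let pandas parse the fixed-width format
parsed = pd.to_datetime(ts.map('{:.6f}'.format), format='%Y%m%d%H%M%S.%f')
print(parsed.iloc[0])
```

Note that float64 cannot hold full microsecond precision at this magnitude, so the sub-second digits may shift slightly; the same caveat applies to the `int(x*1000000)` approach above.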
- Calculate the latency
ss['kwtime'] = ss.apply(lambda x: datetime(year = int(x['year']),
month = int(x['month']),
day = int(x['day']),
hour = int(x['kw_time'][0:2]),
minute = int(x['kw_time'][2:4]),
second = int(x['kw_time'][4:6])),
axis = 1
)
ss['latency'] = ss.apply(lambda x: x['kwtime'].second - x['time'].second,
axis =1
)
ss1 = ss[['time', 'kwtime', 'hour', 'minute', 'second', 'price','size','ask1','bid1','latency']]
ss1.head(10)
## time kwtime ... bid1 latency
## 0 2021-01-14 10:01:05.793536 2021-01-14 10:01:05 ... 89500.0000 0
## 1 2021-01-14 10:01:05.842688 2021-01-14 10:01:05 ... 89500.0000 0
## 2 2021-01-14 10:01:06.018816 2021-01-14 10:01:06 ... 89500.0000 0
## 3 2021-01-14 10:01:06.018816 2021-01-14 10:01:06 ... 89500.0000 0
## 4 2021-01-14 10:01:06.182656 2021-01-14 10:01:06 ... 89500.0000 0
## 5 2021-01-14 10:01:06.338304 2021-01-14 10:01:06 ... 89500.0000 0
## 6 2021-01-14 10:01:06.366976 2021-01-14 10:01:06 ... 89500.0000 0
## 7 2021-01-14 10:01:06.678272 2021-01-14 10:01:06 ... 89500.0000 0
## 8 2021-01-14 10:01:06.805248 2021-01-14 10:01:06 ... 89500.0000 0
## 9 2021-01-14 10:01:07.264000 2021-01-14 10:01:06 ... 89500.0000 -1
##
## [10 rows x 10 columns]
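Worth noting: the latency column above compares only the seconds components, so a sub-second gap that straddles a minute boundary shows up as -59. Subtracting the full datetimes avoids this; a sketch with made-up times:

```python
from datetime import datetime

# Kiwoom time truncated to the second; self-recorded time just before the minute rolls over
kw  = datetime(2021, 1, 14, 10, 8, 0)
rec = datetime(2021, 1, 14, 10, 7, 59, 500000)

print(kw.second - rec.second)      # -59: seconds-only difference wraps around
print((kw - rec).total_seconds())  # 0.5: the true signed gap in seconds
```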
3. Calculate latency and clean it
*Calculate the latency by subtracting the self-recorded time, taken at the moment the data delivered through the Kiwoom API was processed, from the time provided in the Kiwoom dataset.
# Task 2 & 3
print(ss1.latency.value_counts())
## 0 85028
## 1 22814
## -1 861
## -2 707
## -3 421
## ...
## -55 5
## 57 4
## 54 3
## -52 2
## 38 2
## Name: latency, Length: 113, dtype: int64
print(ss1.latency.nsmallest(10))
## 2518 -59
## 3233 -59
## 3868 -59
## 13691 -59
## 13692 -59
## 13693 -59
## 13694 -59
## 13695 -59
## 17413 -59
## 22419 -59
## Name: latency, dtype: int64
print(ss1.latency.nlargest(10))
## 703 59
## 704 59
## 705 59
## 706 59
## 707 59
## 708 59
## 709 59
## 18443 59
## 20431 59
## 20432 59
## Name: latency, dtype: int64
plt.plot(ss1.latency)
plt.title('Latency Over Time Before Cleaning')
- Check the times where a large discrepancy occurred!
- It turns out that when the seconds component equals 59, the latency is large. This might not be the case on other days, which remains to be checked.
When the latency is -59, the actual difference is only 1 second
ss1[ss1.latency == -59][['kwtime','hour','minute','second','latency']]
## kwtime hour minute second latency
## 2518 2021-01-14 10:08:00 10 07 59 -59
## 3233 2021-01-14 10:10:00 10 09 59 -59
## 3868 2021-01-14 10:12:00 10 11 59 -59
## 13691 2021-01-14 10:38:00 10 38 59 -59
## 13692 2021-01-14 10:38:00 10 38 59 -59
## ... ... ... ... ... ...
## 118298 2021-01-14 15:18:00 15 17 59 -59
## 118299 2021-01-14 15:18:00 15 17 59 -59
## 119234 2021-01-14 15:19:00 15 18 59 -59
## 119235 2021-01-14 15:19:00 15 18 59 -59
## 119236 2021-01-14 15:19:00 15 18 59 -59
##
## [384 rows x 5 columns]
When the latency is 59, the actual difference is only 1 second
ss1[ss1.latency == 59][['kwtime','hour','minute','second','latency']]
## kwtime hour minute second latency
## 703 2021-01-14 10:02:59 10 03 00 59
## 704 2021-01-14 10:02:59 10 03 00 59
## 705 2021-01-14 10:02:59 10 03 00 59
## 706 2021-01-14 10:02:59 10 03 00 59
## 707 2021-01-14 10:02:59 10 03 00 59
## 708 2021-01-14 10:02:59 10 03 00 59
## 709 2021-01-14 10:02:59 10 03 00 59
## 18443 2021-01-14 10:47:59 10 48 00 59
## 20431 2021-01-14 10:52:59 10 53 00 59
## 20432 2021-01-14 10:52:59 10 53 00 59
## 20433 2021-01-14 10:52:59 10 53 00 59
## 20434 2021-01-14 10:52:59 10 53 00 59
## 20435 2021-01-14 10:52:59 10 53 00 59
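Since these ±59 values are minute-boundary wraparounds rather than genuine one-minute delays, an alternative to dropping such rows is to unwrap them with modular arithmetic. A sketch (not part of the pipeline, which filters instead):

```python
def unwrap_latency(lat):
    # map a seconds-only difference into the range [-30, 29],
    # so a wrapped -59 becomes +1 and a wrapped +59 becomes -1
    return (lat + 30) % 60 - 30

print(unwrap_latency(-59))  # 1
print(unwrap_latency(59))   # -1
print(unwrap_latency(0))    # 0
```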
- Delete rows where the time difference between the Kiwoom API time and the self-recorded time exceeds five seconds. In other words, we allow a latency of +/- five seconds at maximum.
ss2 = ss1[abs(ss1.latency) <=5]
print(ss2.latency.value_counts())
## 0 85028
## 1 22814
## -1 861
## -2 707
## -3 421
## -4 369
## -5 271
## 4 157
## 2 148
## 5 146
## 3 131
## Name: latency, dtype: int64
plt.plot(ss2.latency)
plt.title('Latency Over Time After Cleaning')
4. Get Trade Direction Indicator
If the price is greater than the midprice, the trade is buyer-initiated. If the price is less than the midprice, it is seller-initiated. The tricky case is when the price equals the midprice. In that case, we use the following rule.
The tick rule is the most commonly used level-1 algorithm. This rule is rather simple and classifies a trade as buyer-initiated if the trade price is above the preceding trade price (an uptick trade) and as seller-initiated if the trade price is below the preceding trade price (a downtick trade). If the trade price is the same as the previous trade price (a zero-tick trade), the rule looks for the closest prior price that differs from the current trade price. Zero-uptick trades are classified as buys, and zero-downtick trades are classified as sells.
The rule above can be simply implemented through the code below.
ss = ss.assign(spread = ss['ask1'] - ss['bid1'],
mid = (ss['bid1'] + ss['ask1'])*.5,
ind = 0,
price1 = ss['price'].shift(1, fill_value=0),
price2 = ss['price'].shift(2, fill_value=0)
)
buy = ((ss['price'] > ss['mid']) |
((ss['price'] == ss['mid']) & (ss['price'] > ss['price1'])) |
((ss['price'] == ss['mid']) & (ss['price'] == ss['price1']) & (ss['price'] > ss['price2'])))
ss.loc[buy, 'ind'] = 1
ss.loc[~buy, 'ind'] = -1
ss1 = ss[['time', 'price','size','ask1','bid1','ind','spread']]
ss1.head(5)
## time price size ... bid1 ind spread
## 0 2021-01-14 10:01:05.793536 89500.0000 7.0000 ... 89500.0000 -1 100.0000
## 1 2021-01-14 10:01:05.842688 89600.0000 6.0000 ... 89500.0000 1 100.0000
## 2 2021-01-14 10:01:06.018816 89600.0000 10.0000 ... 89500.0000 1 100.0000
## 3 2021-01-14 10:01:06.018816 89500.0000 44.0000 ... 89500.0000 -1 100.0000
## 4 2021-01-14 10:01:06.182656 89600.0000 10.0000 ... 89500.0000 1 100.0000
##
## [5 rows x 7 columns]
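Note that the snippet above only looks back two ticks when the price sits on the midprice, while the tick rule as quoted searches for the closest prior differing price, arbitrarily far back. That full zero-tick handling can be sketched in a vectorized way (variable names here are illustrative, not from the pipeline above):

```python
import numpy as np
import pandas as pd

# toy price path: uptick, two zero ticks, downtick, zero tick
price = pd.Series([100.0, 101.0, 101.0, 101.0, 100.0, 100.0])

# sign of each price change: +1 uptick, -1 downtick, 0 zero tick
step = np.sign(price.diff())

# zero ticks inherit the sign of the closest prior nonzero change
tick = step.mask(step == 0).ffill()
print(tick.tolist())  # first entry is NaN (no prior price)
```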
We will discuss further what can be done with the trade direction indicator, together with other variables extracted from the tick data.
Source
- Data provided by the 00 Team
- http://dee.uib.es/digitalAssets/234/234006_Pascual.pdf