Note: This page walks through cleaning real-time tick data received through the Kiwoom API. Please email errors and other inquiries to metrics@kakao.com.
For questions about the data itself or about data provision, please email 000.
R code blocks and Python code blocks are distinguished by color as shown below. Output blocks are always shown in ivory.
# "이것은 R 코드 입니다."
# "이것은 Python 코드 입니다."
Packages
library(dplyr)
library(tidyverse)   # readr and purrr are used below
library(DT)
library(reticulate)  # run Python chunks from R
#py_install(packages = "matplotlib")
#py_install(packages = "pandas")
#py_install(packages = 'dfply')
options(scipen = 999)           # no scientific notation
options(max.print = 99999999)
options(digits = 10)
Import Data
- Check NA
wd = "G:/공유 드라이브/Project_TBD/Stock_Data/real_time/kiwoom_stocks/2021-01-14"
wd2 = "/Volumes/GoogleDrive/공유 드라이브/Project_TBD/Stock_Data/real_time/kiwoom_stocks/2021-01-18"
fn = list.files(path = wd2,
pattern = '.*stocks_trade.*\\.csv') # 마지막 30분 틱데이터 가지고오기
path = paste(wd2,fn[1:4],sep = '/')
# data = readr::read_csv(file = path,
# col_names = c('code','trade_date','timestamp','price','open','high','low',
# 'size','cum_size','ask1','bid1'))
tbl =
list.files(path = wd2, pattern = '.*stocks_trade.*\\.csv') %>%
map_df(~readr::read_csv(paste(wd2,.,sep = '/'),
col_names = c('code','trade_date','timestamp','price','open','high','low',
'size','cum_size','ask1','bid1') ))
sum(is.na(tbl))
## [1] 0
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime, timedelta
#py_install(packages = "matplotlib")
#py_install(packages = "pandas")
pd.options.display.float_format = '{:.4f}'.format
pd.set_option('display.max_rows', 100)
df_py = r.tbl   # the R data frame `tbl`, exposed to Python by reticulate
df_py.head()
## code trade_date timestamp ... cum_size ask1 bid1
## 0 000990 100031.0000 20210118100030.5742 ... 2389083.0000 64300.0000 64200.0000
## 1 001360 100031.0000 20210118100030.7305 ... 3365481.0000 9180.0000 9170.0000
## 2 001510 100031.0000 20210118100030.7461 ... 4455526.0000 907.0000 906.0000
## 3 003490 100031.0000 20210118100030.7773 ... 1538419.0000 31650.0000 31600.0000
## 4 002630 100031.0000 20210118100030.8320 ... 23354062.0000 1860.0000 1855.0000
##
## [5 rows x 11 columns]
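For a Python-only workflow, the same data frame can also be built without reticulate by reading the trade files directly with pandas. A minimal sketch, assuming the directory layout and file naming shown above:
import glob
import pandas as pd

wd2 = "/Volumes/GoogleDrive/공유 드라이브/Project_TBD/Stock_Data/real_time/kiwoom_stocks/2021-01-18"
cols = ['code', 'trade_date', 'timestamp', 'price', 'open', 'high', 'low',
        'size', 'cum_size', 'ask1', 'bid1']
files = sorted(glob.glob(wd2 + '/*stocks_trade*.csv'))
# names=cols mirrors col_names in readr::read_csv (files have no header row);
# dtype={'code': str} keeps the leading zeros in stock codes such as '005930'
df_py = pd.concat((pd.read_csv(f, names=cols, dtype={'code': str}) for f in files),
                  ignore_index=True)
print(df_py.isna().sum().sum())   # same check as sum(is.na(tbl)) in R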
Contents
Samsung Stock
- The KOSPI code for Samsung Electronics is ‘005930’
ss = df_py[df_py.code == '005930'].reset_index(drop=True)
ss.shape
## (255036, 11)
1. Use datetime objects in Python
Delete rows where the "second" component does not lie between 0 and 59.
Substring the timestamp to extract each time component.
# 'timestamp' is a float like 20210118100031.0391; multiplying by 1e6 and casting to int
# gives a 20-digit string 'YYYYMMDDHHMMSSffffff' that can be sliced by position
ss = ss[ss['timestamp'].apply(lambda x: int(str(int(x*1000000))[12:14])) < 60 ]
ss = ss.assign( microsecond = ss['timestamp'].apply(lambda x: str(int(x*1000000))[14:20]),
second = ss['timestamp'].apply(lambda x: str(int(x*1000000))[12:14]),
minute = ss['timestamp'].apply(lambda x: str(int(x*1000000))[10:12]),
hour = ss['timestamp'].apply(lambda x: str(int(x*1000000))[8:10]),
day = ss['timestamp'].apply(lambda x: str(int(x*1000000))[6:8]),
month = ss['timestamp'].apply(lambda x: str(int(x*1000000))[4:6]),
year = ss['timestamp'].apply(lambda x: str(int(x*1000000))[0:4]),
kw_time = ss['trade_date'].apply(lambda x: str(int(x))))
ss.head()
## code trade_date timestamp price ... day month year kw_time
## 0 005930 100031.0000 20210118100031.0391 87200.0000 ... 18 01 2021 100031
## 1 005930 100032.0000 20210118100031.3945 87100.0000 ... 18 01 2021 100032
## 2 005930 100032.0000 20210118100031.4062 87100.0000 ... 18 01 2021 100032
## 3 005930 100032.0000 20210118100031.4570 87100.0000 ... 18 01 2021 100032
## 4 005930 100032.0000 20210118100031.4648 87100.0000 ... 18 01 2021 100032
##
## [5 rows x 19 columns]
- Combine the string components into a single datetime object, i.e. the format “%Y-%m-%d %H:%M:%S.%f”.
- Later we calculate the latency by subtracting ‘trade_date’ from ‘timestamp’.
ss['time'] = ss.apply(lambda x: datetime(
year = int(x['year']),
month = int(x['month']),
day = int(x['day']),
hour = int(x['hour']),
minute = int(x['minute']),
second = int(x['second']),
microsecond = int(x['microsecond'])), #.strftime("%H:%M:%S.%f")[:-3],
axis=1)
ss.head(10)
## code trade_date ... kw_time time
## 0 005930 100031.0000 ... 100031 2021-01-18 10:00:31.037440
## 1 005930 100032.0000 ... 100032 2021-01-18 10:00:31.393792
## 2 005930 100032.0000 ... 100032 2021-01-18 10:00:31.406080
## 3 005930 100032.0000 ... 100032 2021-01-18 10:00:31.455232
## 4 005930 100032.0000 ... 100032 2021-01-18 10:00:31.463424
## 5 005930 100032.0000 ... 100032 2021-01-18 10:00:31.803392
## 6 005930 100032.0000 ... 100032 2021-01-18 10:00:31.950848
## 7 005930 100032.0000 ... 100032 2021-01-18 10:00:32.462848
## 8 005930 100032.0000 ... 100032 2021-01-18 10:00:32.536576
## 9 005930 100032.0000 ... 100032 2021-01-18 10:00:32.548864
##
## [10 rows x 20 columns]
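The row-wise .apply calls above are easy to read but slow on 250k+ rows. Below is a vectorized sketch that builds the same datetime column in one pass, assuming every ‘timestamp’ is a float of the form YYYYMMDDHHMMSS.ffffff; the result is kept in a separate frame ss_vec so the step-by-step columns above are untouched.
ts = ss['timestamp'].map('{:.6f}'.format)      # e.g. '20210118100031.037440'
valid = ts.str[12:14].astype(int) < 60         # same 0-59 seconds filter as above
ss_vec = ss.loc[valid].copy()
ss_vec['time'] = pd.to_datetime(ts[valid], format='%Y%m%d%H%M%S.%f')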
2. Calculate and clean the latency
- The latency is obtained by subtracting the time provided in the Kiwoom data (‘trade_date’, rebuilt below as ‘kwtime’) from the time we recorded ourselves while processing the tick delivered through the Kiwoom API (‘timestamp’, already rebuilt as ‘time’).
ss['kwtime'] = ss.apply(lambda x: datetime(year = int(x['year']),
month = int(x['month']),
day = int(x['day']),
hour = int(x['kw_time'][0:2]),
minute = int(x['kw_time'][2:4]),
second = int(x['kw_time'][4:6])),
axis = 1
)
# NOTE: only the seconds components are compared, so a -1 second latency across a
# minute boundary (:59 vs :00) shows up as 59 -- this is cleaned up below
ss['latency'] = ss.apply(lambda x: x['time'].second - x['kwtime'].second,
                         axis = 1
                         )
ss1 = ss[['time', 'kwtime', 'hour', 'minute', 'second', 'price','size','ask1','bid1','latency']]
ss1.head(10)
## time kwtime ... bid1 latency
## 0 2021-01-18 10:00:31.037440 2021-01-18 10:00:31 ... 87100.0000 0
## 1 2021-01-18 10:00:31.393792 2021-01-18 10:00:32 ... 87000.0000 -1
## 2 2021-01-18 10:00:31.406080 2021-01-18 10:00:32 ... 87000.0000 -1
## 3 2021-01-18 10:00:31.455232 2021-01-18 10:00:32 ... 87000.0000 -1
## 4 2021-01-18 10:00:31.463424 2021-01-18 10:00:32 ... 87000.0000 -1
## 5 2021-01-18 10:00:31.803392 2021-01-18 10:00:32 ... 87000.0000 -1
## 6 2021-01-18 10:00:31.950848 2021-01-18 10:00:32 ... 87000.0000 -1
## 7 2021-01-18 10:00:32.462848 2021-01-18 10:00:32 ... 87000.0000 0
## 8 2021-01-18 10:00:32.536576 2021-01-18 10:00:32 ... 87000.0000 0
## 9 2021-01-18 10:00:32.548864 2021-01-18 10:00:32 ... 87000.0000 0
##
## [10 rows x 10 columns]
plt.scatter(ss1.reset_index().index, ss1.latency)
plt.title('Latency Over Time Before Cleaning')
print(ss1.latency.value_counts())
## -1 194052
## 0 57420
## 59 3447
## 1 99
## 2 15
## Name: latency, dtype: int64
print(ss1.latency.nsmallest(10))
## 1 -1
## 2 -1
## 3 -1
## 4 -1
## 5 -1
## 6 -1
## 21 -1
## 22 -1
## 23 -1
## 24 -1
## Name: latency, dtype: int64
print(ss1.latency.nlargest(10))
## 705 59
## 706 59
## 707 59
## 708 59
## 709 59
## 710 59
## 711 59
## 716 59
## 717 59
## 724 59
## Name: latency, dtype: int64
- Check the times where the large discrepancies happened.
When the latency is 59, the true difference is only -1 second: the self-recorded time is still at second 59 while the Kiwoom time has rolled over to the next minute, and subtracting only the seconds components yields 59. We can verify this below.
ss1[ss1.latency == 59][['kwtime','hour','minute','second','latency']]
## kwtime hour minute second latency
## 705 2021-01-18 10:02:00 10 01 59 59
## 706 2021-01-18 10:02:00 10 01 59 59
## 707 2021-01-18 10:02:00 10 01 59 59
## 708 2021-01-18 10:02:00 10 01 59 59
## 709 2021-01-18 10:02:00 10 01 59 59
## ... ... ... ... ... ...
## 252351 2021-01-18 15:19:00 15 18 59 59
## 252352 2021-01-18 15:19:00 15 18 59 59
## 252353 2021-01-18 15:19:00 15 18 59 59
## 252354 2021-01-18 15:19:00 15 18 59 59
## 252355 2021-01-18 15:19:00 15 18 59 59
##
## [3447 rows x 5 columns]
ss1[ss1.latency == 2][['kwtime','hour','minute','second','latency']]
## kwtime hour minute second latency
## 74 2021-01-18 10:01:17 10 01 19 2
## 99 2021-01-18 10:01:17 10 01 19 2
## 100 2021-01-18 10:01:17 10 01 19 2
## 102 2021-01-18 10:01:18 10 01 20 2
## 103 2021-01-18 10:01:17 10 01 19 2
## 104 2021-01-18 10:01:17 10 01 19 2
## 105 2021-01-18 10:01:18 10 01 20 2
## 114 2021-01-18 10:01:18 10 01 20 2
## 115 2021-01-18 10:01:18 10 01 20 2
## 116 2021-01-18 10:01:18 10 01 20 2
## 122 2021-01-18 10:01:19 10 01 21 2
## 1629 2021-01-18 10:06:06 10 06 08 2
## 1630 2021-01-18 10:06:06 10 06 08 2
## 1639 2021-01-18 10:06:06 10 06 08 2
## 1650 2021-01-18 10:06:07 10 06 09 2
Change a latency of 59 to -1
As we saw above, a latency of 59 actually corresponds to -1 second, so we remap it.
ss2 = ss1.copy()
ss2.loc[ss1['latency']==59, 'latency'] = -1
plt.scatter(ss2.reset_index().index, ss2.latency)
plt.title('Latency Over Time After Cleaning')
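The 59 wrap-around can also be avoided altogether by differencing the full timestamps rather than only their seconds components. A minimal sketch, assuming ss1['time'] and ss1['kwtime'] can be parsed by pandas as datetimes:
# latency measured on whole timestamps (truncated to the second); crossing a
# minute boundary then yields -1 directly instead of 59
t_full  = pd.to_datetime(ss1['time']).dt.floor('s')
kw_full = pd.to_datetime(ss1['kwtime'])
latency_full = (t_full - kw_full).dt.total_seconds().astype(int)
print(latency_full.value_counts())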
Hyundai Motor Stock
- The KOSPI code for Hyundai Motor is ‘005380’
ss = df_py[df_py.code == '005380'].reset_index(drop=True)
ss.shape
## (49963, 11)
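The steps that follow repeat the Samsung Electronics workflow for Hyundai Motor. As a compact alternative, here is a sketch that wraps the whole pipeline (seconds filter, datetime construction, latency, 59 → -1 remap) into one reusable function. The helper name clean_ticks and its vectorized implementation are illustrative, not part of the original workflow.
def clean_ticks(df, code):
    """Reproduce the cleaning steps used above for one stock code (a sketch)."""
    out = df[df.code == code].reset_index(drop=True)
    ts = out['timestamp'].map('{:.6f}'.format)             # 'YYYYMMDDHHMMSS.ffffff'
    valid = ts.str[12:14].astype(int) < 60                 # drop malformed seconds
    out, ts = out.loc[valid].copy(), ts[valid]
    out['time'] = pd.to_datetime(ts, format='%Y%m%d%H%M%S.%f')
    kw = out['trade_date'].astype(int).astype(str).str.zfill(6)          # 'HHMMSS'
    out['kwtime'] = pd.to_datetime(ts.str[0:8] + kw, format='%Y%m%d%H%M%S')
    out['latency'] = out['time'].dt.second - out['kwtime'].dt.second
    out.loc[out['latency'] == 59, 'latency'] = -1          # minute-boundary wrap-around
    return out[['time', 'kwtime', 'price', 'size', 'ask1', 'bid1', 'latency']]

hc = clean_ticks(df_py, '005380')   # Hyundai Motor; should match the step-by-step result below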
1. Use datetime objects in Python
Delete rows where the "second" component does not lie between 0 and 59.
Substring the timestamp to extract each time component.
ss = ss[ss['timestamp'].apply(lambda x: int(str(int(x*1000000))[12:14])) < 60 ]
ss = ss.assign( microsecond = ss['timestamp'].apply(lambda x: str(int(x*1000000))[14:20]),
second = ss['timestamp'].apply(lambda x: str(int(x*1000000))[12:14]),
minute = ss['timestamp'].apply(lambda x: str(int(x*1000000))[10:12]),
hour = ss['timestamp'].apply(lambda x: str(int(x*1000000))[8:10]),
day = ss['timestamp'].apply(lambda x: str(int(x*1000000))[6:8]),
month = ss['timestamp'].apply(lambda x: str(int(x*1000000))[4:6]),
year = ss['timestamp'].apply(lambda x: str(int(x*1000000))[0:4]),
kw_time = ss['trade_date'].apply(lambda x: str(int(x))))
ss.head()
## code trade_date timestamp price ... day month year kw_time
## 0 005380 100031.0000 20210118100031.3086 244500.0000 ... 18 01 2021 100031
## 1 005380 100033.0000 20210118100034.1055 245000.0000 ... 18 01 2021 100033
## 2 005380 100031.0000 20210118100031.0000 245000.0000 ... 18 01 2021 100031
## 3 005380 100032.0000 20210118100032.1445 245000.0000 ... 18 01 2021 100032
## 4 005380 100032.0000 20210118100032.2031 245000.0000 ... 18 01 2021 100032
##
## [5 rows x 19 columns]
- Combine the string components into a single datetime object, i.e. the format “%Y-%m-%d %H:%M:%S.%f”.
- Later we calculate the latency by subtracting ‘trade_date’ from ‘timestamp’.
ss['time'] = ss.apply(lambda x: datetime(
year = int(x['year']),
month = int(x['month']),
day = int(x['day']),
hour = int(x['hour']),
minute = int(x['minute']),
second = int(x['second']),
microsecond = int(x['microsecond'])), #.strftime("%H:%M:%S.%f")[:-3],
axis=1)
ss.head(10)
## code trade_date ... kw_time time
## 0 005380 100031.0000 ... 100031 2021-01-18 10:00:31.307776
## 1 005380 100033.0000 ... 100033 2021-01-18 10:00:34.105344
## 2 005380 100031.0000 ... 100031 2021-01-18 10:00:31.000576
## 3 005380 100032.0000 ... 100032 2021-01-18 10:00:32.143360
## 4 005380 100032.0000 ... 100032 2021-01-18 10:00:32.204800
## 5 005380 100033.0000 ... 100033 2021-01-18 10:00:33.564672
## 6 005380 100115.0000 ... 100115 2021-01-18 10:01:14.587648
## 7 005380 100115.0000 ... 100115 2021-01-18 10:01:15.775488
## 8 005380 100116.0000 ... 100116 2021-01-18 10:01:17.385216
## 9 005380 100115.0000 ... 100115 2021-01-18 10:01:14.595840
##
## [10 rows x 20 columns]
2. Calculate and clean the latency
- The latency is obtained by subtracting the time provided in the Kiwoom data (‘trade_date’, rebuilt below as ‘kwtime’) from the time we recorded ourselves while processing the tick delivered through the Kiwoom API (‘timestamp’, already rebuilt as ‘time’).
ss['kwtime'] = ss.apply(lambda x: datetime(year = int(x['year']),
month = int(x['month']),
day = int(x['day']),
hour = int(x['kw_time'][0:2]),
minute = int(x['kw_time'][2:4]),
second = int(x['kw_time'][4:6])),
axis = 1
)
ss['latency'] = ss.apply(lambda x: x['time'].second - x['kwtime'].second,
axis =1
)
ss1 = ss[['time', 'kwtime', 'hour', 'minute', 'second', 'price','size','ask1','bid1','latency']]
ss1.head(10)
## time kwtime ... bid1 latency
## 0 2021-01-18 10:00:31.307776 2021-01-18 10:00:31 ... 244500.0000 0
## 1 2021-01-18 10:00:34.105344 2021-01-18 10:00:33 ... 244500.0000 1
## 2 2021-01-18 10:00:31.000576 2021-01-18 10:00:31 ... 244500.0000 0
## 3 2021-01-18 10:00:32.143360 2021-01-18 10:00:32 ... 244500.0000 0
## 4 2021-01-18 10:00:32.204800 2021-01-18 10:00:32 ... 244500.0000 0
## 5 2021-01-18 10:00:33.564672 2021-01-18 10:00:33 ... 244500.0000 0
## 6 2021-01-18 10:01:14.587648 2021-01-18 10:01:15 ... 244500.0000 -1
## 7 2021-01-18 10:01:15.775488 2021-01-18 10:01:15 ... 244500.0000 0
## 8 2021-01-18 10:01:17.385216 2021-01-18 10:01:16 ... 244500.0000 1
## 9 2021-01-18 10:01:14.595840 2021-01-18 10:01:15 ... 244500.0000 -1
##
## [10 rows x 10 columns]
plt.scatter(ss1.reset_index().index, ss1.latency)
plt.title('Latency Over Time Before Cleaning')
print(ss1.latency.value_counts())
## -1 36803
## 0 12435
## 59 703
## 1 18
## 2 4
## Name: latency, dtype: int64
print(ss1.latency.nsmallest(10))
## 6 -1
## 9 -1
## 10 -1
## 11 -1
## 12 -1
## 39 -1
## 40 -1
## 41 -1
## 46 -1
## 51 -1
## Name: latency, dtype: int64
print(ss1.latency.nlargest(10))
## 146 59
## 156 59
## 157 59
## 542 59
## 934 59
## 935 59
## 1114 59
## 1287 59
## 1585 59
## 1586 59
## Name: latency, dtype: int64
- Check the times where the large discrepancies happened.
As with Samsung Electronics above, a latency of 59 is really a -1 second difference across a minute boundary. We can verify this below.
ss1[ss1.latency == 59][['kwtime','hour','minute','second','latency']]
## kwtime hour minute second latency
## 146 2021-01-18 10:02:00 10 01 59 59
## 156 2021-01-18 10:02:00 10 01 59 59
## 157 2021-01-18 10:02:00 10 01 59 59
## 542 2021-01-18 10:07:00 10 06 59 59
## 934 2021-01-18 10:09:00 10 08 59 59
## ... ... ... ... ... ...
## 49543 2021-01-18 15:18:00 15 17 59 59
## 49544 2021-01-18 15:18:00 15 17 59 59
## 49736 2021-01-18 15:19:00 15 18 59 59
## 49739 2021-01-18 15:19:00 15 18 59 59
## 49741 2021-01-18 15:19:00 15 18 59 59
##
## [703 rows x 5 columns]
ss1[ss1.latency == 2][['kwtime','hour','minute','second','latency']]
## kwtime hour minute second latency
## 19 2021-01-18 10:01:17 10 01 19 2
## 21 2021-01-18 10:01:18 10 01 20 2
## 22 2021-01-18 10:01:17 10 01 19 2
## 24 2021-01-18 10:01:18 10 01 20 2
Change a latency of 59 to -1
As we saw above, a latency of 59 actually corresponds to -1 second, so we remap it.
ss2 = ss1.copy()
ss2.loc[ss1['latency']==59, 'latency'] = -1
plt.scatter(ss2.reset_index().index, ss2.latency)
plt.title('Latency Over Time After Cleaning')
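As a closing check, the hypothetical clean_ticks helper sketched at the start of this section makes it easy to compare the cleaned latency distributions of the two stocks:
for code, name in [('005930', 'Samsung Electronics'), ('005380', 'Hyundai Motor')]:
    cleaned = clean_ticks(df_py, code)
    print(name)
    print(cleaned['latency'].value_counts().sort_index())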
Source
- Data provided by the 00 Team