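Three months of the five-month 스마트인재개발원 training program have already passed, and we are in the middle-to-late part of the machine learning with Python course. A team competition, "E-commerce Shipping Prediction (Classification)", has now started.
Four teams of six people each are taking part, and the competition runs for 10 days.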
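Competition rules
1. Competition period: 10 days (November 30 to December 9)
2. Each team may submit up to 10 predictions per day.
3. A day does not roll over at midnight; it runs from 9 a.m. to 9 a.m. the next day.
4. Once all 10 Submit Predictions slots are used, 10 new ones become available at 9 a.m. the next day.
5. When a prediction is submitted, only 60% of the full data is scored and shown as the score; the full data is scored once the competition ends.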
This is a page where you can include rules that participants must accept before joining. You may wish to include rules like:
- Don't cheat!
- Apply yourself!
- Have fun!
Data
Click the Data menu to see the competition data and information about it.
- Train.csv (training data): 12 columns, 6999 Valid
- sampleSubmission.csv (prediction submission template): 2 columns, 2 Valid
- test.csv (test data): 11 columns, 4000 Valid
Leaderboard
Click Leaderboard in the menu to see the live team rankings, with the team name, Score, number of submissions, and time of the last submission.
Activity
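Click Team in the menu and then click an ID to see the prediction submission history for yourself and your teammates.
The grid is seven cells tall, one per day of the week; days with a submission are colored mint and show the submission date and count.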
Submit Predictions
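Open Submit Predictions, enter the model name, hyperparameters, and other settings in the comment field, and upload the prediction file; the score appears within a few seconds.
At the moment all 10 of today's submissions have been used, so the page shows how many hours remain until the next submission is possible.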
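The remaining time shown on that page follows from the 9 a.m. reset rule above. A minimal sketch of the arithmetic, assuming local time and using hypothetical helper names (`next_reset`, `hours_until_reset`) that are not part of the competition site:

```python
from datetime import datetime, time, timedelta

RESET_HOUR = 9  # per the rules above, the daily quota resets at 9 a.m.

def next_reset(now: datetime) -> datetime:
    """Return the next 9 a.m. boundary strictly after `now`."""
    today_reset = datetime.combine(now.date(), time(hour=RESET_HOUR))
    if now < today_reset:
        return today_reset
    return today_reset + timedelta(days=1)

def hours_until_reset(now: datetime) -> float:
    """Hours left until the submission quota is refilled."""
    return (next_reset(now) - now).total_seconds() / 3600

# Example: 11:30 p.m. on December 1 (the year is chosen only for illustration)
example = datetime(2021, 12, 1, 23, 30)
print(next_reset(example))         # 2021-12-02 09:00:00
print(hours_until_reset(example))  # 9.5
```

A submission made at 8 a.m. still counts toward the previous day's quota, which is worth remembering when splitting the 10 daily submissions among six teammates.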
Python code
!pip install missingno
import missingno as msno  # needed for msno.matrix() below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neighbors import KNeighborsClassifier
# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
# Check missing values and data types
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6999 entries, 0 to 6998
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 6999 non-null int64
1 Warehouse_block 6999 non-null object
2 Mode_of_Shipment 6999 non-null object
3 Customer_care_calls 5423 non-null float64
4 Customer_rating 6999 non-null int64
5 Cost_of_the_Product 6999 non-null int64
6 Prior_purchases 6049 non-null float64
7 Product_importance 6999 non-null object
8 Gender 6999 non-null object
9 Discount_offered 3468 non-null float64
10 Weight_in_gms 6999 non-null object
11 Reached.on.Time_Y.N 6999 non-null int64
dtypes: float64(3), int64(4), object(5)
memory usage: 656.3+ KB
# Count missing values per column
train.isnull().sum(axis=0)
ID 0
Warehouse_block 0
Mode_of_Shipment 0
Customer_care_calls 1576
Customer_rating 0
Cost_of_the_Product 0
Prior_purchases 950
Product_importance 0
Gender 0
Discount_offered 3531
Weight_in_gms 0
Reached.on.Time_Y.N 0
dtype: int64
# Visualize the missing values
msno.matrix(train)
plt.show()
With missingno, the missing values in each column can be checked visually.
White areas in each column mark missing values, so the distribution and amount of missing data are easy to see.
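The plot can be backed up with numbers. In the counts above, Discount_offered is missing in 3531 of 6999 rows, roughly half. A minimal sketch of computing that ratio per column, using a made-up toy frame (`toy`) in place of `train`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; column names mirror the post, values are made up
toy = pd.DataFrame({
    "Customer_care_calls": [4.0, np.nan, 3.0, 5.0],
    "Prior_purchases": [3.0, 2.0, np.nan, np.nan],
    "Customer_rating": [2, 3, 1, 5],
})

# isnull() gives True/False per cell; the mean of booleans is the missing fraction
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

Columns with a very high ratio are candidates for dropping rather than filling.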
# Check each column's median and mean to decide how to fill the missing values
print(train['Customer_care_calls'].median())
print(train['Customer_care_calls'].mean())
print(train['Prior_purchases'].median())
print(train['Prior_purchases'].mean())
print(train['Discount_offered'].median())
print(train['Discount_offered'].mean())
4.0
4.054582334501198
3.0
3.5762936022483056
7.0
13.269031141868512
# Rename the column (remove the trailing space)
train.rename(columns={'Warehouse_block ':'Warehouse_block'}, inplace=True)
test.rename(columns={'Warehouse_block ':'Warehouse_block'}, inplace=True)
# Fill missing Customer_care_calls values with the median (4)
train['Customer_care_calls'] = train['Customer_care_calls'].fillna(4)
test['Customer_care_calls'] = test['Customer_care_calls'].fillna(4)
# Fill missing Prior_purchases values with the median (3)
train['Prior_purchases'] = train['Prior_purchases'].fillna(3)
test['Prior_purchases'] = test['Prior_purchases'].fillna(3)
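The fill values 4 and 3 match the medians printed earlier. Rather than typing them in by hand, the medians can be computed from the training data and applied to both frames. A minimal sketch with made-up toy data (`toy_train`, `toy_test` are stand-ins, not the competition files):

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({"Customer_care_calls": [4.0, np.nan, 3.0, 5.0, 4.0],
                          "Prior_purchases": [3.0, 2.0, np.nan, 3.0, 6.0]})
toy_test = pd.DataFrame({"Customer_care_calls": [np.nan, 2.0],
                         "Prior_purchases": [np.nan, np.nan]})

cols = ["Customer_care_calls", "Prior_purchases"]
# Medians come from the training data only, so no test information leaks in
medians = toy_train[cols].median()

# fillna with a Series matches fill values to columns by name
toy_train[cols] = toy_train[cols].fillna(medians)
toy_test[cols] = toy_test[cols].fillna(medians)
print(medians)
```

This keeps the fill values in sync if rows are removed or the data changes before imputation.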
train['Cost_of_the_Product'].unique()
array([ 135, 225, 229, 228, 195, 171, 282, 161, 274, 222, 264,
196, 232, 194, 207, 146, 221, 134, 254, 160, 273, 142,
224, 253, 209, 247, 239, 215, 177, 189, 139, 184, 121,
133, 158, 286, 281, 261, 214, 180, 169, 280, 166, 172,
212, 246, 154, 203, 296, 185, 249, 269, 213, 263, 267,
164, 178, 268, 278, 141, 140, 241, 105, 197, 193, 231,
237, 305, 210, 138, 226, 151, 309, 255, 152, 186, 136,
244, 252, 111, 248, 9999, 202, 174, 272, 182, 181, 173,
242, 294, 198, 208, 301, 270, 130, 259, 236, 250, 223,
183, 148, 243, 262, 201, 199, 156, 145, 150, 132, 137,
276, 256, 290, 200, 258, 170, 227, 240, 157, 165, 175,
233, 289, 191, 277, 275, 190, 163, 266, 206, 217, 220,
219, 218, 187, 298, 162, 295, 234, 176, 245, 238, 143,
265, 112, 125, 128, 102, 97, 204, 211, 123, 307, 144,
271, 149, 159, 230, 257, 167, 98, 287, 192, 216, 205,
188, 103, 147, 104, 310, 304, 292, 179, 124, 260, 168,
109, 107, 235, 308, 114, 153, 300, 116, 279, 285, 291,
306, 251, 117, 115, 155, 126, 119, 101, 283, 110, 131,
113, 118, 284, 120, 96, 297, 303, 299, 100, 293, 288,
302, 127, 99, 129, 122, 108, 106], dtype=int64)
# Remove outliers (rows where Cost_of_the_Product is 9999)
idx_nm_1 = train[train['Cost_of_the_Product'] == 9999].index
train = train.drop(idx_nm_1, axis=0)
train['Mode_of_Shipment'].unique()
array([' Ship', ' Flight', ' Road', '?', ' Shipzk', ' Flightzk',
' Roadzk'], dtype=object)
# Fix whitespace and typos in the data ('?' is mapped to 'Ship')
shipment_fix = {' Ship': 'Ship', ' Flight': 'Flight', ' Road': 'Road', '?': 'Ship',
                ' Shipzk': 'Ship', ' Flightzk': 'Flight', ' Roadzk': 'Road'}
# Assign the result back instead of calling replace(..., inplace=True) on a column
train['Mode_of_Shipment'] = train['Mode_of_Shipment'].replace(shipment_fix)
test['Mode_of_Shipment'] = test['Mode_of_Shipment'].replace(shipment_fix)
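The dirty values follow a pattern: a leading space and, in some rows, a stray 'zk' suffix. A pattern-based cleanup is a possible alternative to listing every variant. This is only a sketch on the values printed above; it assumes 'zk' never appears at the end of a legitimate category:

```python
import pandas as pd

# The unique raw values shown in the output above
raw = pd.Series([' Ship', ' Flight', ' Road', '?', ' Shipzk', ' Flightzk', ' Roadzk'])

cleaned = (
    raw.str.strip()                          # drop leading/trailing spaces
       .str.replace(r'zk$', '', regex=True)  # drop the stray 'zk' suffix
       .replace('?', 'Ship')                 # same choice as the post for '?'
)
print(sorted(cleaned.unique()))  # ['Flight', 'Road', 'Ship']
```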
print(train['Customer_rating'].unique())
print(test['Customer_rating'].unique())
[ 2 3 1 5 4 99]
[ 1 4 5 2 3 99]
print(train['Customer_rating'].median())
print(train['Customer_rating'].mean())
3.0
3.0238707833047456
# Replace the outlier value 99 with the median (3)
train['Customer_rating'] = train['Customer_rating'].replace(99, 3)
test['Customer_rating'] = test['Customer_rating'].replace(99, 3)
train['Product_importance'].value_counts()
low 3344
medium 2979
high 573
? 97
loww 1
mediumm 1
highh 1
Name: Product_importance, dtype: int64
# Fix typos in Product_importance ('?' is mapped to 'medium'); the same mapping is applied to test
importance_fix = {'mediumm': 'medium', 'loww': 'low', 'highh': 'high', '?': 'medium'}
train['Product_importance'] = train['Product_importance'].replace(importance_fix)
test['Product_importance'] = test['Product_importance'].replace(importance_fix)
train['Weight_in_gms'].value_counts()
? 446
1817 8
1367 8
4541 7
5709 7
...
2205 1
3713 1
1713 1
1574 1
5542 1
Name: Weight_in_gms, Length: 3332, dtype: int64
# Weight_in_gms: replace '?' with 0 as a temporary placeholder
train['Weight_in_gms'] = train['Weight_in_gms'].replace('?', 0)
test['Weight_in_gms'] = test['Weight_in_gms'].replace('?', 0)
# Weight_in_gms: convert object -> int64
train['Weight_in_gms'] = pd.to_numeric(train['Weight_in_gms'])
test['Weight_in_gms'] = pd.to_numeric(test['Weight_in_gms'])
# Check the mean (note: the placeholder zeros are still included in it)
train['Weight_in_gms'].mean()
# Replace the placeholder 0 with 3423
train['Weight_in_gms'] = train['Weight_in_gms'].replace(0, 3423)
test['Weight_in_gms'] = test['Weight_in_gms'].replace(0, 3423)
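One caveat with the 0 placeholder: the mean is computed while the placeholder zeros are still in the column, which pulls it down. A possible alternative, sketched on made-up toy data, is to let `pd.to_numeric(..., errors='coerce')` turn '?' into NaN so that `mean()` skips those rows:

```python
import pandas as pd

# Toy stand-ins for train/test; values are made up
toy_train = pd.DataFrame({"Weight_in_gms": ["1000", "?", "3000", "5000"]})
toy_test = pd.DataFrame({"Weight_in_gms": ["?", "2000"]})

# '?' cannot be parsed as a number, so errors='coerce' turns it into NaN
toy_train["Weight_in_gms"] = pd.to_numeric(toy_train["Weight_in_gms"], errors="coerce")
toy_test["Weight_in_gms"] = pd.to_numeric(toy_test["Weight_in_gms"], errors="coerce")

# mean() skips NaN by default, so only the known weights are averaged
weight_mean = toy_train["Weight_in_gms"].mean()
toy_train["Weight_in_gms"] = toy_train["Weight_in_gms"].fillna(weight_mean)
toy_test["Weight_in_gms"] = toy_test["Weight_in_gms"].fillna(weight_mean)
print(weight_mean)  # 3000.0
```

With the 0 placeholder, the same toy column would average to 2250 instead of 3000.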
# Drop the ID column
train.drop("ID",axis=1,inplace=True)
test.drop("ID",axis=1,inplace=True)
# Drop the Discount_offered column
train.drop('Discount_offered', axis=1, inplace=True)
test.drop('Discount_offered', axis=1, inplace=True)
# Drop the Gender column
train.drop("Gender",axis=1,inplace=True)
test.drop("Gender",axis=1,inplace=True)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6996 entries, 0 to 6998
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Warehouse_block 6996 non-null object
1 Mode_of_Shipment 6996 non-null object
2 Customer_care_calls 6996 non-null float64
3 Customer_rating 6996 non-null int64
4 Cost_of_the_Product 6996 non-null int64
5 Prior_purchases 6996 non-null float64
6 Product_importance 6996 non-null object
7 Weight_in_gms 6996 non-null int64
8 Reached.on.Time_Y.N 6996 non-null int64
dtypes: float64(2), int64(4), object(3)
memory usage: 546.6+ KB
# One-hot encode the categorical columns
category=['Warehouse_block','Mode_of_Shipment','Product_importance']
one_hot_train = pd.get_dummies(train[category])
one_hot_test = pd.get_dummies(test[category])
# Drop the original columns after one-hot encoding
train.drop(category, axis=1, inplace=True)
test.drop(category, axis=1, inplace=True)
# Concatenate the one-hot encoded columns
train=pd.concat([train,one_hot_train], axis=1)
test=pd.concat([test,one_hot_test], axis=1)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6996 entries, 0 to 6998
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Customer_care_calls 6996 non-null float64
1 Customer_rating 6996 non-null int64
2 Cost_of_the_Product 6996 non-null int64
3 Prior_purchases 6996 non-null float64
4 Weight_in_gms 6996 non-null int64
5 Reached.on.Time_Y.N 6996 non-null int64
6 Warehouse_block_A 6996 non-null uint8
7 Warehouse_block_B 6996 non-null uint8
8 Warehouse_block_C 6996 non-null uint8
9 Warehouse_block_D 6996 non-null uint8
10 Warehouse_block_F 6996 non-null uint8
11 Mode_of_Shipment_Flight 6996 non-null uint8
12 Mode_of_Shipment_Road 6996 non-null uint8
13 Mode_of_Shipment_Ship 6996 non-null uint8
14 Product_importance_high 6996 non-null uint8
15 Product_importance_low 6996 non-null uint8
16 Product_importance_medium 6996 non-null uint8
dtypes: float64(2), int64(4), uint8(11)
memory usage: 457.7 KB
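Because `pd.get_dummies` is called on train and test separately, the two frames only end up with the same columns if every category appears in both. A minimal sketch of guarding against a mismatch with `reindex`, on made-up toy frames (in the post, train also holds the target column `Reached.on.Time_Y.N`, so the alignment would be done on the feature columns only):

```python
import pandas as pd

# Toy frames: category 'B' and 'F' appear in train but not in test
toy_train = pd.DataFrame({"Warehouse_block": ["A", "B", "F"], "Cost": [135, 225, 229]})
toy_test = pd.DataFrame({"Warehouse_block": ["A", "A"], "Cost": [150, 160]})

train_enc = pd.get_dummies(toy_train, columns=["Warehouse_block"])
test_enc = pd.get_dummies(toy_test, columns=["Warehouse_block"])

# Reindex test to the training columns; dummy columns absent from test become 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_enc.columns))
```

Without this step, `predict` would fail or silently misalign features whenever a category is missing from one of the files.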
# Decision tree
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier()
# Split the features (X) from the target (y)
X_train = train.drop('Reached.on.Time_Y.N', axis=1)
y_train = train['Reached.on.Time_Y.N']
# Train the model
tree_model.fit(X_train, y_train)
DecisionTreeClassifier()
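# Evaluate the model (accuracy on the training data)
tree_model.score(X_train,y_train)
# Cross-validation, using a tree with limited size (max_leaf_nodes, max_depth)
tree_model1 = DecisionTreeClassifier(max_leaf_nodes=500,max_depth=10)
# Train the model
tree_model1.fit(X_train, y_train)
# Evaluate the model (accuracy on the training data)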
tree_model1.score(X_train, y_train)
0.7339908519153803
from sklearn.model_selection import cross_val_score
score = cross_val_score(tree_model1, X_train, y_train, cv=5)
np.mean(score)
0.6560895537628919
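The training accuracy (about 0.734) is noticeably higher than the cross-validated accuracy (about 0.656), which suggests the tree still overfits at max_depth=10. One way to explore this is to compare both scores across several depths. The sketch below uses synthetic data from `make_classification` as a stand-in; with the real data, `X_train` and `y_train` from above would take the place of `X` and `y`, and the depth values are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels (flip_y), not the competition data
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)

results = {}
for depth in (3, 5, 10, None):  # None lets the tree grow without a depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Cross-validated accuracy: each fold is scored on data the tree did not see
    cv_acc = cross_val_score(model, X, y, cv=5).mean()
    # Training accuracy: scored on the same data the tree was fitted on
    train_acc = model.fit(X, y).score(X, y)
    results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A depth where the two numbers are close and the cross-validated score is highest is usually a safer pick for the leaderboard than the depth with the best training score.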
# Feed the test data to the model and predict the results
pre = tree_model1.predict(test)
# Load the submission template
result = pd.read_csv('sampleSubmission.csv')
result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 4000 non-null int64
1 Reached.on.Time_Y.N 4000 non-null int64
dtypes: int64(2)
memory usage: 62.6 KB
# Save the predictions into the Kaggle submission file
result['Reached.on.Time_Y.N']=pre
result.to_csv('cggle_result.csv', index=False)
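With only 10 submissions per day, it can be worth checking a file before uploading it. The sketch below uses a hypothetical helper (`check_submission`) and made-up stand-ins for `result` and `pre`; it assumes the target labels are 0 and 1, which the post does not state explicitly:

```python
import numpy as np
import pandas as pd

# Stand-ins for the post's `result` template and `pre` predictions (values made up)
result = pd.DataFrame({"ID": [1, 2, 3, 4], "Reached.on.Time_Y.N": [0, 0, 0, 0]})
pre = np.array([1, 0, 1, 1])

def check_submission(template: pd.DataFrame, predictions: np.ndarray) -> pd.DataFrame:
    """Hypothetical helper: verify row count and labels, then fill in the template."""
    if len(predictions) != len(template):
        raise ValueError(f"expected {len(template)} predictions, got {len(predictions)}")
    # Assumption: the only valid labels are 0 and 1
    if not set(np.unique(predictions)) <= {0, 1}:
        raise ValueError("predictions must be 0 or 1")
    out = template.copy()
    out["Reached.on.Time_Y.N"] = predictions
    return out

submission = check_submission(result, pre)
print(submission["Reached.on.Time_Y.N"].value_counts().to_dict())
```

Looking at the label counts before uploading also catches the case where a model predicts a single class for every row.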
스마트인재개발원 : https://www.smhrd.or.kr/