
스마트인재개발원 - E-commerce Shipment Delivery Prediction Competition

by 북극성매니아 2021. 12. 6.

E-commerce Shipment Delivery Prediction (Classification)

 

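Three months of the five-month program at 스마트인재개발원 have already passed, and we are in the middle-to-late part of the Python machine learning course. A team competition, "E-commerce Shipment Delivery Prediction (Classification)", has now started.

Four teams of six people each are taking part, and the competition runs for 10 days.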

 

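Competition rules

1. Competition period: 10 days (November 30 to December 9).

2. Each team may submit up to 10 predictions per day.

3. A day does not roll over at midnight; it runs from 9 a.m. to 9 a.m. the next day.

4. Once all 10 Submit Prediction slots are used, 10 new ones become available at 9 a.m. the next morning.

5. When a prediction is submitted, only 60% of the full data is scored and shown; the full data is scored once the competition ends.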
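
Rules 3 and 4 mean the quota is restored at 9 a.m. rather than at midnight. A minimal sketch of that arithmetic, assuming local time and a hypothetical helper name `next_reset` (neither comes from the competition platform):

```python
from datetime import datetime, timedelta

RESET_HOUR = 9  # per rule 3, the competition day starts at 09:00


def next_reset(now: datetime) -> datetime:
    """Return the next 09:00 at which the daily submission quota is restored."""
    today_reset = now.replace(hour=RESET_HOUR, minute=0, second=0, microsecond=0)
    # Before 09:00 the quota resets later today; otherwise it resets tomorrow.
    if now < today_reset:
        return today_reset
    return today_reset + timedelta(days=1)


late_night = datetime(2021, 12, 1, 23, 30)
early_morning = datetime(2021, 12, 2, 8, 15)
print(next_reset(late_night) - late_night)        # time left after a late-night submission
print(next_reset(early_morning) - early_morning)  # time left just before the reset
```

Both example times fall in the same competition day, so they share the same reset time.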

 

This is a page where you can include rules that participants must accept before joining. You may wish to include rules like:

  • Don't cheat!
  • Apply yourself!
  • Have fun!

 

Data

Clicking the Data menu shows the competition data and information about it.

- Train.csv (training data): 12 columns, 6999 Valid

- sampleSubmission.csv (submission template): 2 columns, 2 Valid

- test.csv (test data): 11 columns, 4000 Valid

 

[Image: Data]

 

[Image: Column]

 

Leaderboard

Clicking Leaderboard in the menu shows the live team rankings, with each team's name, score, number of submissions, and time of last submission.

[Image: Leaderboard]

 

Activity

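Click Team in the menu, then click an ID, to see the prediction submission history for yourself and your teammates.

The grid has seven rows, one for each day of the week. Days with a submission are shown in mint, along with the submission date and the number of submissions.

[Image: Individual activity history]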

 

Submit Predictions

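Open Submit Predictions, enter the model name, hyperparameters, and other settings for the prediction in the comment field, and upload the prediction file. The score appears within a few seconds.

At the moment all 10 of today's submissions have been used, so the page shows how many hours remain until the next submission is allowed.

[Image: Submit Predictions]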

 

Python code
!pip install missingno
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import missingno as msno  # needed for msno.matrix() below
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Check missing values and data types
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6999 entries, 0 to 6998
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   6999 non-null   int64  
 1   Warehouse_block      6999 non-null   object 
 2   Mode_of_Shipment     6999 non-null   object 
 3   Customer_care_calls  5423 non-null   float64
 4   Customer_rating      6999 non-null   int64  
 5   Cost_of_the_Product  6999 non-null   int64  
 6   Prior_purchases      6049 non-null   float64
 7   Product_importance   6999 non-null   object 
 8   Gender               6999 non-null   object 
 9   Discount_offered     3468 non-null   float64
 10  Weight_in_gms        6999 non-null   object 
 11  Reached.on.Time_Y.N  6999 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 656.3+ KB

# Count missing values per column
train.isnull().sum(axis=0)
ID                        0
Warehouse_block           0
Mode_of_Shipment          0
Customer_care_calls    1576
Customer_rating           0
Cost_of_the_Product       0
Prior_purchases         950
Product_importance        0
Gender                    0
Discount_offered       3531
Weight_in_gms             0
Reached.on.Time_Y.N       0
dtype: int64

# Visualize the missing values
msno.matrix(train)
plt.show()

With missingno, the missing values in each column can be checked visually.

White areas in each column mark missing values, which makes the distribution and amount of missing data easy to grasp.
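
The same information can also be read as a number. A minimal sketch, using a made-up toy frame in place of train.csv, that turns the missing counts into a per-column ratio (in the counts above, Discount_offered is missing in 3531 of 6999 rows, roughly half):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Customer_care_calls": [4.0, np.nan, 3.0, np.nan],
    "Prior_purchases": [3.0, 2.0, np.nan, 5.0],
    "Customer_rating": [2, 3, 1, 5],
})

# isnull() gives booleans, so the column mean is the fraction of missing rows.
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

Sorting puts the columns that need the most attention first.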

[Image: missingno]

 

# Check each column's median and mean before filling the missing values
print(train['Customer_care_calls'].median())
print(train['Customer_care_calls'].mean())

print(train['Prior_purchases'].median())
print(train['Prior_purchases'].mean())

print(train['Discount_offered'].median())
print(train['Discount_offered'].mean())
4.0
4.054582334501198
3.0
3.5762936022483056
7.0
13.269031141868512

# Rename the column (remove the trailing space)
train.rename(columns={'Warehouse_block ':'Warehouse_block'}, inplace=True)
test.rename(columns={'Warehouse_block ':'Warehouse_block'}, inplace=True)

# Fill missing Customer_care_calls values with the median (4)
train['Customer_care_calls'] = train['Customer_care_calls'].fillna(4)
test['Customer_care_calls'] = test['Customer_care_calls'].fillna(4)

# Fill missing Prior_purchases values with the median (3)
train['Prior_purchases'] = train['Prior_purchases'].fillna(3)
test['Prior_purchases'] = test['Prior_purchases'].fillna(3)
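
The fill values 4 and 3 match the medians printed earlier. As an alternative to typing them in by hand, the sketch below computes the medians from the training rows and reuses them for the test rows. The toy frames and their values are made up for illustration:

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    "Customer_care_calls": [4.0, np.nan, 3.0, 5.0, 4.0],
    "Prior_purchases": [3.0, 2.0, np.nan, 3.0, 10.0],
})
toy_test = pd.DataFrame({
    "Customer_care_calls": [np.nan, 6.0],
    "Prior_purchases": [4.0, np.nan],
})

cols = ["Customer_care_calls", "Prior_purchases"]
# Medians are learned from the training rows only, then reused for the test rows.
medians = toy_train[cols].median()
toy_train[cols] = toy_train[cols].fillna(medians)
toy_test[cols] = toy_test[cols].fillna(medians)
print(medians)
```

Passing a Series to fillna fills each column with the value stored under that column's name, so the numbers stay in sync with the data if the training file changes.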

train['Cost_of_the_Product'].unique()
array([ 135,  225,  229,  228,  195,  171,  282,  161,  274,  222,  264,
        196,  232,  194,  207,  146,  221,  134,  254,  160,  273,  142,
        224,  253,  209,  247,  239,  215,  177,  189,  139,  184,  121,
        133,  158,  286,  281,  261,  214,  180,  169,  280,  166,  172,
        212,  246,  154,  203,  296,  185,  249,  269,  213,  263,  267,
        164,  178,  268,  278,  141,  140,  241,  105,  197,  193,  231,
        237,  305,  210,  138,  226,  151,  309,  255,  152,  186,  136,
        244,  252,  111,  248, 9999,  202,  174,  272,  182,  181,  173,
        242,  294,  198,  208,  301,  270,  130,  259,  236,  250,  223,
        183,  148,  243,  262,  201,  199,  156,  145,  150,  132,  137,
        276,  256,  290,  200,  258,  170,  227,  240,  157,  165,  175,
        233,  289,  191,  277,  275,  190,  163,  266,  206,  217,  220,
        219,  218,  187,  298,  162,  295,  234,  176,  245,  238,  143,
        265,  112,  125,  128,  102,   97,  204,  211,  123,  307,  144,
        271,  149,  159,  230,  257,  167,   98,  287,  192,  216,  205,
        188,  103,  147,  104,  310,  304,  292,  179,  124,  260,  168,
        109,  107,  235,  308,  114,  153,  300,  116,  279,  285,  291,
        306,  251,  117,  115,  155,  126,  119,  101,  283,  110,  131,
        113,  118,  284,  120,   96,  297,  303,  299,  100,  293,  288,
        302,  127,   99,  129,  122,  108,  106], dtype=int64)
        
# Remove outliers (rows where Cost_of_the_Product is 9999)
idx_nm_1 = train[train['Cost_of_the_Product'] == 9999].index
train = train.drop(idx_nm_1, axis=0)

train['Mode_of_Shipment'].unique()
array([' Ship', ' Flight', ' Road', '?', ' Shipzk', ' Flightzk',
       ' Roadzk'], dtype=object)
       
# Fix stray whitespace and erroneous values
shipment_fixes = {
    ' Ship': 'Ship', ' Flight': 'Flight', ' Road': 'Road',
    '?': 'Ship',
    ' Shipzk': 'Ship', ' Flightzk': 'Flight', ' Roadzk': 'Road',
}
# Assign the result back instead of calling replace(..., inplace=True) on the column
train['Mode_of_Shipment'] = train['Mode_of_Shipment'].replace(shipment_fixes)
test['Mode_of_Shipment'] = test['Mode_of_Shipment'].replace(shipment_fixes)
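
The values above differ from the clean labels only by a leading space and a trailing "zk". A small sketch, on a made-up Series holding the same seven values, of fixing them with string methods instead of listing every variant:

```python
import pandas as pd

raw = pd.Series([" Ship", " Flight", " Road", "?", " Shipzk", " Flightzk", " Roadzk"])

cleaned = (
    raw.str.strip()                          # drop leading/trailing spaces
       .str.replace(r"zk$", "", regex=True)  # drop the stray "zk" suffix
       .replace("?", "Ship")                 # same fallback used above for "?"
)
print(cleaned.unique())
```

This only covers the noise patterns seen in this column; any other kind of typo would still need its own rule.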

print(train['Customer_rating'].unique())
print(test['Customer_rating'].unique())
[ 2  3  1  5  4 99]
[ 1  4  5  2  3 99]

print(train['Customer_rating'].median())
print(train['Customer_rating'].mean())
3.0
3.0238707833047456

# Replace the outlier value 99 with the median (3)
train['Customer_rating'] = train['Customer_rating'].replace(99, 3)
test['Customer_rating'] = test['Customer_rating'].replace(99, 3)

train['Product_importance'].value_counts()
low        3344
medium     2979
high        573
?            97
loww          1
mediumm       1
highh         1
Name: Product_importance, dtype: int64

# Fix misspelled categories and map '?' to 'medium'
importance_fixes = {'mediumm': 'medium', 'loww': 'low', 'highh': 'high', '?': 'medium'}
train['Product_importance'] = train['Product_importance'].replace(importance_fixes)
# Only the 'mediumm' and '?' fixes are applied to test
test['Product_importance'] = test['Product_importance'].replace({'mediumm': 'medium', '?': 'medium'})

train['Weight_in_gms'].value_counts()
?       446
1817      8
1367      8
4541      7
5709      7
       ... 
2205      1
3713      1
1713      1
1574      1
5542      1
Name: Weight_in_gms, Length: 3332, dtype: int64

# Weight_in_gms: replace '?' with 0
train['Weight_in_gms'] = train['Weight_in_gms'].replace('?', 0)
test['Weight_in_gms'] = test['Weight_in_gms'].replace('?', 0)

# Weight_in_gms: convert object -> int64
train['Weight_in_gms'] = pd.to_numeric(train['Weight_in_gms'])
test['Weight_in_gms'] = pd.to_numeric(test['Weight_in_gms'])

# Check the mean, then replace the placeholder 0 with 3423
train['Weight_in_gms'].mean()
train['Weight_in_gms'] = train['Weight_in_gms'].replace(0, 3423)
test['Weight_in_gms'] = test['Weight_in_gms'].replace(0, 3423)
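
One thing to watch with this approach: if the mean is taken while the placeholder zeros are still in the column, they pull it down. A sketch of an alternative, on made-up weights, that converts '?' to NaN so the mean is computed from the valid rows only:

```python
import pandas as pd

weights = pd.Series(["1817", "?", "4541", "5709", "?"], name="Weight_in_gms")

# errors="coerce" turns anything non-numeric (here "?") into NaN.
numeric = pd.to_numeric(weights, errors="coerce")
# mean() skips NaN, so the unknown rows do not affect the result.
mean_weight = numeric.mean()
filled = numeric.fillna(mean_weight)
print(mean_weight)
```

The resulting column is float rather than int64, which decision trees handle without any further change.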

# Drop the ID column
train.drop("ID",axis=1,inplace=True)
test.drop("ID",axis=1,inplace=True)
# Drop the Discount_offered column
train.drop('Discount_offered', axis=1, inplace=True)
test.drop('Discount_offered', axis=1, inplace=True)
# Drop the Gender column
train.drop("Gender",axis=1,inplace=True)
test.drop("Gender",axis=1,inplace=True)

train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6996 entries, 0 to 6998
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Warehouse_block      6996 non-null   object 
 1   Mode_of_Shipment     6996 non-null   object 
 2   Customer_care_calls  6996 non-null   float64
 3   Customer_rating      6996 non-null   int64  
 4   Cost_of_the_Product  6996 non-null   int64  
 5   Prior_purchases      6996 non-null   float64
 6   Product_importance   6996 non-null   object 
 7   Weight_in_gms        6996 non-null   int64  
 8   Reached.on.Time_Y.N  6996 non-null   int64  
dtypes: float64(2), int64(4), object(3)
memory usage: 546.6+ KB

# One-hot encode the categorical columns
category=['Warehouse_block','Mode_of_Shipment','Product_importance']
one_hot_train = pd.get_dummies(train[category])
one_hot_test = pd.get_dummies(test[category])

# Drop the original columns after one-hot encoding
train.drop(category, axis=1, inplace=True)
test.drop(category, axis=1, inplace=True)

# Concatenate the one-hot encoded columns
train=pd.concat([train,one_hot_train], axis=1)
test=pd.concat([test,one_hot_test], axis=1)
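
Because train and test are encoded separately, their dummy columns can differ if a category (or an uncorrected typo) appears in only one of them, and the model would then reject the test frame. A minimal sketch of a safeguard, with made-up toy frames, that forces the test frame onto the training columns:

```python
import pandas as pd

toy_train = pd.DataFrame({
    "Warehouse_block": ["A", "B", "F"],
    "Mode_of_Shipment": ["Ship", "Road", "Flight"],
})
toy_test = pd.DataFrame({
    "Warehouse_block": ["A", "A"],
    "Mode_of_Shipment": ["Ship", "Ship"],
})

train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

# Give the test frame exactly the training columns, in the same order;
# categories missing from the test split become all-zero columns.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(list(test_aligned.columns))
```

In the real pipeline the reference would be the feature columns only (the training frame without Reached.on.Time_Y.N), since the test data has no target column.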

train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6996 entries, 0 to 6998
Data columns (total 17 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Customer_care_calls        6996 non-null   float64
 1   Customer_rating            6996 non-null   int64  
 2   Cost_of_the_Product        6996 non-null   int64  
 3   Prior_purchases            6996 non-null   float64
 4   Weight_in_gms              6996 non-null   int64  
 5   Reached.on.Time_Y.N        6996 non-null   int64  
 6   Warehouse_block_A          6996 non-null   uint8  
 7   Warehouse_block_B          6996 non-null   uint8  
 8   Warehouse_block_C          6996 non-null   uint8  
 9   Warehouse_block_D          6996 non-null   uint8  
 10  Warehouse_block_F          6996 non-null   uint8  
 11  Mode_of_Shipment_Flight    6996 non-null   uint8  
 12  Mode_of_Shipment_Road      6996 non-null   uint8  
 13  Mode_of_Shipment_Ship      6996 non-null   uint8  
 14  Product_importance_high    6996 non-null   uint8  
 15  Product_importance_low     6996 non-null   uint8  
 16  Product_importance_medium  6996 non-null   uint8  
dtypes: float64(2), int64(4), uint8(11)
memory usage: 457.7 KB

# Decision tree
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier()

# Split features (X) and target (y)
X_train = train.drop('Reached.on.Time_Y.N', axis=1)
y_train = train['Reached.on.Time_Y.N']

# Train the model
tree_model.fit(X_train, y_train)
DecisionTreeClassifier()

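# Evaluate the model (accuracy on the training data)
tree_model.score(X_train,y_train)

# Cross-validation: first build a tree with limited leaf nodes and depth
tree_model1 = DecisionTreeClassifier(max_leaf_nodes=500,max_depth=10)
# Train the model
tree_model1.fit(X_train, y_train)
# Evaluate the model (accuracy on the training data)
tree_model1.score(X_train, y_train)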
0.7339908519153803

from sklearn.model_selection import cross_val_score
score = cross_val_score(tree_model1, X_train, y_train, cv=5)
np.mean(score)
0.6560895537628919
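
The training accuracy (about 0.734) is noticeably higher than the 5-fold cross-validation mean (about 0.656), which suggests the tree still overfits somewhat. A rough sketch of comparing several depths by cross-validation score; it uses synthetic data from make_classification as a stand-in for X_train and y_train, and the depth grid is an arbitrary assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

cv_means = {}
for depth in [2, 4, 6, 8, 10]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Mean accuracy over 5 folds, the same metric as np.mean(score) above.
    cv_means[depth] = np.mean(cross_val_score(model, X, y, cv=5))

best_depth = max(cv_means, key=cv_means.get)
print(best_depth, round(cv_means[best_depth], 3))
```

Choosing hyperparameters by the cross-validation score rather than the leaderboard score also saves the limited daily submissions.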

# Predict on the test data with the trained model
pre = tree_model1.predict(test)

# Load the submission template
result = pd.read_csv('sampleSubmission.csv')

result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   ID                   4000 non-null   int64
 1   Reached.on.Time_Y.N  4000 non-null   int64
dtypes: int64(2)
memory usage: 62.6 KB

# Store the predictions in the Kaggle submission file
result['Reached.on.Time_Y.N']=pre
result.to_csv('cggle_result.csv', index=False)
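
With only 10 submissions a day, it can be worth checking the file before uploading it. A small sketch of such a check follows; the helper name `check_submission` is hypothetical and the toy frame is made up:

```python
import numpy as np
import pandas as pd


def check_submission(result: pd.DataFrame, predictions, expected_rows: int) -> pd.DataFrame:
    """Attach predictions and fail early if the frame does not look like a valid submission."""
    predictions = np.asarray(predictions)
    if len(predictions) != expected_rows or len(result) != expected_rows:
        raise ValueError("row count does not match the test set")
    if not set(predictions.tolist()) <= {0, 1}:
        raise ValueError("predictions must be 0 or 1")
    out = result.copy()
    out["Reached.on.Time_Y.N"] = predictions
    return out


# Toy stand-in for sampleSubmission.csv.
toy_result = pd.DataFrame({"ID": [1, 2, 3], "Reached.on.Time_Y.N": [0, 0, 0]})
checked = check_submission(toy_result, [1, 0, 1], expected_rows=3)
print(checked)
```

For this competition expected_rows would be 4000, the number of rows reported by result.info() above.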

 

 

스마트인재개발원 : https://www.smhrd.or.kr/

 


 
