Python Data Analysis

Exploratory Data Analysis

- Categorical: teens, twenties, male, female

- Continuous (numeric):

Categorical

Frequency analysis

<code>

# tips is assumed to be the seaborn "tips" dataset, e.g. tips = sns.load_dataset('tips')

tips['sex'].value_counts() # Male: 157, Female: 87, Name: sex, dtype: int64

</code>

Cross-tabulation

pd.crosstab(tips['sex'],tips['day'],margins=True) # margins=True adds the "All" row and column totals

'''

day Thur Fri Sat Sun All

sex

Male 30 10 59 58 157

Female 32 9 28 18 87

All 62 19 87 76 244

'''

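Proportion of total frequency

pd.crosstab(tips['sex'],tips['day']).apply(lambda r: r/len(tips), axis=1) # axis=1: the lambda receives one row at a time; each count is divided by the total number of rows in tips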

'''

day Thur Fri Sat Sun

sex

Male 0.122951 0.040984 0.241803 0.237705 # 30 / 244, 10 / 244, 59 / 244, 58 / 244

Female 0.131148 0.036885 0.114754 0.073770 # 32 / 244, 9 / 244, 28 / 244, 18 / 244

'''
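The same proportions can also be requested from crosstab directly through its normalize parameter. The sketch below uses a small made-up DataFrame (a hypothetical stand-in, not the tips data) to compare normalizing by the grand total with normalizing by each row total.

```python
import pandas as pd

# Hypothetical stand-in data; not the tips dataset used above
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Male", "Female", "Female", "Male", "Male"],
    "day": ["Sat", "Sun", "Sat", "Sat", "Thur", "Sun", "Thur", "Sun"],
})

counts = pd.crosstab(df["sex"], df["day"])

# Divide every cell by the grand total (same idea as r / len(tips))
by_total = pd.crosstab(df["sex"], df["day"], normalize="all")

# Divide every cell by its row total, so each row sums to 1
by_row = pd.crosstab(df["sex"], df["day"], normalize="index")

print(by_total)
print(by_row)
```

normalize="all" answers "what share of all records falls in this cell", while normalize="index" answers "within this row's group, how are the records spread across columns".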

apply()

import pandas as pd

import numpy as np

df = pd.DataFrame(np.arange(12).reshape(3,4))

'''

0 1 2 3

0 0 1 2 3

1 4 5 6 7

2 8 9 10 11

'''

df.apply(lambda r: print(r), axis=0) # axis=0: r is one column at a time (the values of every row in that column)

'''

0, 4, 8

1, 5, 9

2, 6, 10

3, 7, 11

'''

df.apply(lambda r: print(r), axis=1) # axis=1: r is one row at a time (the values of every column in that row)

'''

0, 1, 2, 3

4, 5, 6, 7

8, 9, 10, 11

'''

df.apply(lambda r: r/len(df), axis=0)

'''

0 1 2 3

0 0.000000 0.333333 0.666667 1.000000

1 1.333333 1.666667 2.000000 2.333333

2 2.666667 3.000000 3.333333 3.666667

'''

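# Extract the column labeled 3 from df

# Map even values to 0 and odd values to 1

# Run a frequency analysis on the resulting Series of 0s and 1s

# df.apply(): the lambda receives a Series, which may be a row or a column depending on axis

# series.apply(): the lambda receives a single scalar value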

df.apply(lambda r : r%2, axis=1)

'''

0 1 2 3

0 0 1 0 1

1 0 1 0 1

2 0 1 0 1

'''

df[3].apply(lambda r : r%2).value_counts()

'''

1 3

Name: 3, dtype: int64

'''
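Because arithmetic operators work element-wise on a Series, the even/odd mapping above does not strictly need apply(). A minimal sketch, rebuilding the same 3x4 frame, that compares the two approaches:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4))

# Series.apply calls the lambda once per scalar value
via_apply = df[3].apply(lambda v: v % 2).value_counts()

# The % operator is applied element-wise to the whole Series at once
via_vector = (df[3] % 2).value_counts()

print(via_apply)
print(via_vector)
```

Column 3 holds 3, 7, and 11, so both versions report three odd values and no even ones.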

Vector operations

arr = np.arange(5) # array([0, 1, 2, 3, 4])

type(arr) # numpy.ndarray

arr2 = arr/2 # vector / scalar: every element is divided by 2

arr2 # array([0. , 0.5, 1. , 1.5, 2. ])

Normal Distribution

arr = np.random.normal(5,2,1000).astype(int) # normal: normal distribution; 5: mean; 2: standard deviation; 1000: number of values

pd.DataFrame(arr).value_counts()

'''

5 199

4 193

6 138

3 135

7 108

2 90

1 52

8 42

9 18

0 16

10 7

-1 2

dtype: int64

'''

import matplotlib.pyplot as plt

plt.hist(arr)

plt.show()
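To check that the two parameters really are the mean and the standard deviation, the sample statistics can be compared with them. This is a rough sketch; the seed and the sample size of 10,000 are assumptions chosen only to make the result reproducible and stable.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
samples = rng.normal(loc=5, scale=2, size=10_000)

sample_mean = samples.mean()
sample_std = samples.std(ddof=1)

# Fraction of values within one standard deviation of the mean
# (roughly 68% for a normal distribution)
within_one_std = np.mean(np.abs(samples - 5) <= 2)

print(sample_mean, sample_std, within_one_std)
```

Note that .astype(int) in the snippet above truncates toward zero, so the integer counts there describe truncated values rather than rounded ones.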

Cross-tabulation on a CSV file

# Load employees.csv into emp_df

# Cross-tabulate gender and job_title

import pandas as pd

emp_df = pd.read_csv('employees.csv',header=None,index_col=0,names=['id','gender','first_name','phone','job_title'])

display(emp_df)

'''

gender first_name phone job_title

1 male Jose 1-971-533-4552x1542 Machine Learning Engineer

2 male Douglas 881.633.0107 DevOps Engineer

3 female Sherry 001-966-861-0065x493 Project Manager

4 male Charles 001-574-564-4648 Project Manager

5 female Sharon 5838355842 HR Manager

... ... ... ... ...

316 female Michelle 401.041.6802 Web Developer

317 male Steven 1-584-489-5663x896 Web Developer

318 male Kevin 001-511-226-4416x83410 Web Developer

319 male Adam 1-426-702-6363x565 Web Developer

320 male Danny (871)098-2647x80448 Web Developer

'''

display(pd.crosstab(emp_df['gender'],emp_df['job_title']))

'''

job_title Designer DevOps Engineer HR Manager Machine Learning Engineer Mobile Developer Project Manager Tester Web Developer

gender

female 7 16 7 14 40 3 8 51

male 8 20 3 10 50 2 12 69

'''

display(pd.crosstab(emp_df['gender'],emp_df['job_title']).apply(lambda r : r/len(emp_df), axis=1))

'''

job_title Designer DevOps Engineer HR Manager Machine Learning Engineer Mobile Developer Project Manager Tester Web Developer

gender

female 0.021875 0.05 0.021875 0.04375 0.125 0.009375 0.025 0.159375

male 0.025 0.0625 0.009375 0.03125 0.15625 0.00625 0.0375 0.215625

'''

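Uniform Distribution

Chi-Square

- Uses the contingency table returned by crosstab()

- Chi-square test

- Tests whether two categorical variables are independent of each other

- p-value: if it is below 0.05, reject the null hypothesis in favor of the alternative hypothesis

- Null hypothesis: the two categorical variables are independent

- Alternative hypothesis: the two categorical variables are not independent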

contingency = pd.crosstab(emp_df['gender'],emp_df['job_title'])

# contingency table

display(contingency)

'''

job_title Designer DevOps Engineer HR Manager Machine Learning Engineer Mobile Developer Project Manager Tester Web Developer

gender

female 7 16 7 14 40 3 8 51

male 8 20 3 10 50 2 12 69

'''

from scipy.stats import chi2_contingency

c, p, dof, expected = chi2_contingency(contingency)

display(p) # 0.6381856628

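The p-value (about 0.638) is not below 0.05, so the null hypothesis cannot be rejected: the data give no evidence that gender and job_title are related.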
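To see what the test compares, the expected counts under independence can be computed by hand as row total * column total / grand total. The sketch below copies the counts from the gender x job_title table above into a plain NumPy array (an assumption made so the example runs without employees.csv) and checks the hand calculation against chi2_contingency.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Counts copied from the gender x job_title table above (rows: female, male)
observed = np.array([
    [7, 16, 7, 14, 40, 3, 8, 51],
    [8, 20, 3, 10, 50, 2, 12, 69],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals * col_totals / grand_total

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2, p, dof, expected = chi2_contingency(observed)

print(chi2_manual, chi2, p, dof)
```

The degrees of freedom are (rows - 1) * (columns - 1) = 7 here. The closer the observed counts are to the expected ones, the smaller the statistic and the larger the p-value.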

Continuous

Mean

np.arange(10).mean() # 4.5

Median

np.random.seed(5) # fix the seed so the same random numbers are generated each run

ser = pd.Series(np.random.randint(1, 20, 10)).sort_values()

list(ser) # [4, 5, 7, 8, 9, 10, 15, 16, 17, 17]

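ser.median() # 9.5: the middle value of the sorted numbers; with an even count, the mean of the two middle values

Variance: the sum of (mean - each element)^2, divided by n

# Variance (sample statistic)

ser.var() # 25.28888888888889

# Variance computed by hand (population statistic)

((ser.mean() - ser)**2).sum() / len(ser) # 22.76

# Variance computed by hand (sample statistic)

# degrees of freedom: n-1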

((ser.mean() - ser)**2).sum() / (len(ser) - 1) # 25.28888888888889
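The choice between dividing by n and by n - 1 is exposed as the ddof argument. A short sketch using the ten values listed above; note that pandas and NumPy have different defaults:

```python
import numpy as np
import pandas as pd

ser = pd.Series([4, 5, 7, 8, 9, 10, 15, 16, 17, 17])

sample_var = ser.var()              # pandas default: ddof=1, divides by n - 1
population_var = ser.var(ddof=0)    # divides by n
numpy_var = np.var(ser.to_numpy())  # NumPy default: ddof=0, divides by n

print(sample_var, population_var, numpy_var)
```

When results from pandas and NumPy disagree slightly, a mismatched ddof is a common cause.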

Standard deviation: √variance

ser.std() # 5.028805910838963

# Standard deviation from the variance

np.sqrt(ser.var()) # 5.028805910838963

# Standard deviation computed by hand

variance = ((ser.mean() - ser)**2).sum() / (len(ser) - 1) # 25.28888888888889

standardDeviation = variance**(1/2) # 5.028805910838963

Descriptive statistics

# descriptive statistics

ser.describe()

'''

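count 10.000000 # number of values

mean 10.800000 # mean

std 5.028806 # standard deviation

min 4.000000 # minimum

25% 7.250000 # first quartile: the value at the 25% position

50% 9.500000 # second quartile: the median, at the 50% position

75% 15.750000 # third quartile: the value at the 75% position

max 17.000000 # maximum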

dtype: float64

'''

ser.describe().index # Index(['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max'], dtype='object')

ser.describe()['25%'] # 7.25

Quartiles

# median

ser.quantile() # 9.5

ser.quantile(.75) # third quartile

ser.quantile([0.25,0.5,0.75]) # first, second, and third quartiles

'''

0.25 7.25 # 25%, first quartile, Q1

0.50 9.50 # 50%, second quartile, Q2, median

0.75 15.75 # 75%, third quartile, Q3

dtype: float64

'''

25% Q1

50% Q2

75% Q3

Interquartile range (IQR): the distance Q3 - Q1

Normal range: Q1 - IQR * 1.5 to Q3 + IQR * 1.5

Outliers: values that fall outside the normal range

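Computing quartiles

# The positions and values below match a six-value series (assumed from the worked numbers and the list used with simplestatistics)

ser = pd.Series([0, 4, 9, 15, 16, 17])

p1 = (len(ser)-1) / 4 # 1.25: the first quartile sits at position 1.25

p2 = ((len(ser)-1) / 4)*2 # 2.5: the second quartile (median) sits at position 2.5

p3 = ((len(ser)-1) / 4)*3 # 3.75: the third quartile sits at position 3.75

# Linear interpolation: lower value + (upper value - lower value) * fractional part of the position

# position 1.25

Q1 = 4 + (9-4)*(0.25) # 5.25

# position 2.5

Q2 = 9 + (15-9)*(0.5) # 12.0

# position 3.75

Q3 = 15 + (16-15)*(0.75) # 15.75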
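The interpolation steps above can be wrapped in a small function. This is a minimal sketch: linear_quantile is a hypothetical helper name, and the input is assumed to be the six-value list [0, 4, 9, 15, 16, 17] whose neighbours (4, 9, 15, 16) appear in the worked numbers.

```python
import math
import pandas as pd

def linear_quantile(sorted_values, q):
    """Quantile by linear interpolation between the two neighbouring values."""
    position = (len(sorted_values) - 1) * q  # fractional index of the quantile
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction

values = [0, 4, 9, 15, 16, 17]
quartiles = [linear_quantile(values, q) for q in (0.25, 0.5, 0.75)]
pandas_quartiles = pd.Series(values).quantile([0.25, 0.5, 0.75]).tolist()

print(quartiles)         # → [5.25, 12.0, 15.75]
print(pandas_quartiles)
```

The function expects values that are already sorted; pandas sorts internally, which is why the two results agree.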

Computing the IQR (interquartile range)

IQR = Q3 - Q1 # 10.5

Computing the normal range

max_limit = Q3 + IQR*1.5

min_limit = Q1 - IQR*1.5

max_limit, min_limit # (31.5, -10.5)
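The limit calculation can be reused for any Series by putting it in a function. A sketch under stated assumptions: iqr_bounds is a hypothetical helper name, and the value 40 is added purely to illustrate a point outside the limits.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower limit, upper limit) of the normal range Q1 - k*IQR .. Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = pd.Series([0, 4, 9, 15, 16, 17, 40])  # 40 is an illustrative extreme value
lower, upper = iqr_bounds(data)
outliers = data[(data < lower) | (data > upper)]

print(lower, upper)
print(outliers.tolist())  # → [40]
```

Calling iqr_bounds(pd.Series([0, 4, 9, 15, 16, 17])) gives the same limits as the hand calculation, -10.5 and 31.5.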

Quartiles with simplestatistics

!pip install simplestatistics

import simplestatistics as ss

print(ss.quantile([0,4,9,15,16,17],[0.25,0.50,0.75])) # [4, 9, 16]

Different ways to compute quartiles

i + (j-i)* fraction

ser.quantile(0.75, interpolation="linear") # 15.75

ser.quantile(0.75, interpolation="lower") # 15

ser.quantile(0.75, interpolation="higher") # 16

i or j whichever is nearest

ser.quantile(0.75, interpolation="nearest") # 16

(i+j)/2

ser.quantile(0.75, interpolation="midpoint") # 15.5

Visualization

# Create an array of 100 random floats

nums = np.random.random(100)

import matplotlib.pyplot as plt

plt.boxplot(nums)

Example

# Replace the detected outliers with the mean of the current data

# Find the values above the upper limit and the values below the lower limit

# Replace each outlier with the mean

import pandas as pd

import numpy as np

nums = np.random.random(100)

nums[0] = -2

nums[99] = 3

sr = pd.Series(nums)

describe = sr.describe()

Q1 = describe['25%']

Q2 = describe['50%']

Q3 = describe['75%']

IQR = Q3 - Q1

min_limit = Q1 - IQR*1.5

max_limit = Q3 + IQR*1.5

mean = sr.mean()

# outliers

sr[sr<min_limit] = mean

sr[sr>max_limit] = mean

import matplotlib.pyplot as plt

plt.boxplot(sr)

plt.show()
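The replacement step can also be written without modifying the original Series in place, using Series.mask. A minimal sketch; the seeded generator is an assumption added so the run is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
nums = rng.random(100)
nums[0] = -2                    # planted low outlier
nums[99] = 3                    # planted high outlier
sr = pd.Series(nums)

q1, q3 = sr.quantile(0.25), sr.quantile(0.75)
iqr = q3 - q1
min_limit, max_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (sr < min_limit) | (sr > max_limit)

# mask() returns a new Series with the outliers replaced; sr itself is unchanged
cleaned = sr.mask(is_outlier, sr.mean())

print(is_outlier.sum())
```

Keeping sr untouched makes it easy to compare the box plots before and after the replacement. As in the example above, the mean used for replacement is computed while the outliers are still in the data.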

Data classification

# numeric -> categorical

# numeric values (1 to 100)

# 1 to 33: 1, 34 to 66: 2, 67 to 100: 3

# A system that predicts a numeric value: regression

# A system that predicts a category: classification

Example

rg = np.arange(1,11)

ct = pd.cut(rg,3)

'''

[(0.991, 4.0], (0.991, 4.0], (0.991, 4.0], (0.991, 4.0], (4.0, 7.0], (4.0, 7.0], (4.0, 7.0], (7.0, 10.0], (7.0, 10.0], (7.0, 10.0]]

Categories (3, interval[float64, right]): [(0.991, 4.0] < (4.0, 7.0] < (7.0, 10.0]]

# (0.991, 4.0]

# the range that excludes 0.991 and includes 4.0

# (4.0, 7.0]

# the range that excludes 4.0 and includes 7.0

# (7.0, 10.0]

# the range that excludes 7.0 and includes 10.0

'''
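pd.cut also accepts explicit bin edges and labels, which matches the 1 to 33 / 34 to 66 / 67 to 100 scheme described in the comments above. A short sketch with a handful of boundary values chosen for illustration:

```python
import pandas as pd

scores = pd.Series([1, 33, 34, 66, 67, 100])

# Right-inclusive bins: (0, 33], (33, 66], (66, 100]
categories = pd.cut(scores, bins=[0, 33, 66, 100], labels=[1, 2, 3])

print(categories.tolist())  # → [1, 1, 2, 2, 3, 3]
print(categories.value_counts())
```

Passing an integer such as 3 lets pandas pick equal-width bins from the data, while passing a list of edges fixes the boundaries regardless of the data.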

