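[Machine Learning] Linear Regression

Regression and classification in machine learning

Regression: predicting a continuous value (for example, a float)

Classification: predicting which category (class) an input belongs to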

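Supervised, unsupervised, and reinforcement learning

Supervised learning: training the model on data that comes with the correct answers (labels)

Unsupervised learning: training without the answers, so the model has to find structure on its own, for example by clustering

Reinforcement learning: learning by repeated trial and error instead of from a given dataset (e.g. AlphaGo)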

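Linear Regression

Look at a plot of the data and set up a hypothesis.

$H(x) = Wx + b$ (an arbitrary straight line, the hypothesis)

This line should end up as close as possible to the data points (the correct answers). The distance is measured with the mean squared error:

$Cost = \frac{1}{N}\sum_{i=1}^{N}(H(x_i) - y_i)^2$

This Cost is called the loss function, and the smaller its value, the better the model has been trained.

→ Hypothesis, loss function (Cost or Loss function)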
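To make the formula concrete, here is a minimal sketch in plain Python that evaluates the hypothesis and the mean squared error by hand. The function names `hypothesis` and `mse_cost` are made up for this illustration, and the data is the same toy set (y = 10x) that the Keras practice further down uses.

```python
def hypothesis(x, W, b):
    """Linear hypothesis H(x) = W * x + b for a single feature."""
    return W * x + b


def mse_cost(xs, ys, W, b):
    """Mean squared error between the predictions H(x_i) and the targets y_i."""
    n = len(xs)
    return sum((hypothesis(x, W, b) - y) ** 2 for x, y in zip(xs, ys)) / n


# Toy data following y = 10 * x
xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

cost_perfect = mse_cost(xs, ys, W=10.0, b=0.0)  # line passes through every point
cost_poor = mse_cost(xs, ys, W=8.0, b=0.0)      # slope is too small

print(cost_perfect)
print(cost_poor)
```

The line that passes through every point has a cost of 0, and the cost grows as the line moves away from the points, which is why training tries to push this value down.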

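Gradient descent method

A way to minimize (optimize) the loss function.

Start from a randomly chosen point.

Move a small step at a time in the direction that lowers the loss (the slope of the curve tells you which way) and check that the loss is smaller than before. The size of each step is the learning rate. → A suitable learning rate has to be found by hand! → If it is too large, the loss can diverge (overshooting).

When the minimum of the curve is reached, training stops.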
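The steps above can be sketched in plain Python roughly as follows. This is an illustration under a few assumptions, not what TensorFlow or Keras do internally: the helper name `gradient_descent` is invented here, the start point is fixed at W = 0, b = 0 instead of random so the run is reproducible, and the loop stops after a fixed number of steps rather than detecting the minimum.

```python
def gradient_descent(xs, ys, learning_rate, steps, W=0.0, b=0.0):
    """Minimize the MSE of H(x) = W*x + b with plain gradient descent.

    Returns the final W and b plus the cost recorded before each update.
    """
    n = len(xs)
    costs = []
    for _ in range(steps):
        errors = [W * x + b - y for x, y in zip(xs, ys)]
        costs.append(sum(e * e for e in errors) / n)
        # Partial derivatives of the MSE with respect to W and b
        grad_W = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient; the learning rate sets the step size
        W -= learning_rate * grad_W
        b -= learning_rate * grad_b
    return W, b, costs


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

# A small learning rate: the cost shrinks toward 0
W_ok, b_ok, costs_ok = gradient_descent(xs, ys, learning_rate=0.05, steps=2000)
print('lr=0.05 -> W=%.3f, b=%.3f, final cost=%.6f' % (W_ok, b_ok, costs_ok[-1]))

# A learning rate that is too large: every step overshoots and the cost grows
_, _, costs_big = gradient_descent(xs, ys, learning_rate=0.5, steps=20)
print('lr=0.5  -> cost went from %.2f to %.2e' % (costs_big[0], costs_big[-1]))
```

On this toy data the first run settles near W = 10, b = 0, while the second run shows the overshooting described above: the cost gets larger at every step instead of smaller.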

We have to design a good hypothesis and a good loss function so that the machine can learn well!

Splitting the dataset

Training set: used to fit the model, roughly 80% of the whole dataset

Validation set: used to check the model while changing the loss function, optimizer, and so on

Test set: used for the final evaluation
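A three-way split could look roughly like the sketch below, written with only the standard library. The 80% training share comes from the text above, but the 10% / 10% shares for validation and test, the helper name `split_dataset`, and the seed are assumptions made for this example.

```python
import random


def split_dataset(samples, train_ratio=0.8, val_ratio=0.1, seed=2021):
    """Shuffle the samples and cut them into train / validation / test lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


samples = list(range(200))  # stand-in for 200 rows of data
train, val, test = split_dataset(samples)
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the rows are stored in some order, because otherwise the validation and test sets would only see one end of the data.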

Linear regression practice

TensorFlow

Import TensorFlow and set up the data and variables

import tensorflow as tf

tf.compat.v1.disable_eager_execution()

x_data = [[1, 1], [2, 2], [3, 3]]
y_data = [[10], [20], [30]]

X = tf.compat.v1.placeholder(tf.float32, shape=[None, 2])
Y = tf.compat.v1.placeholder(tf.float32, shape=[None, 1])

W = tf.Variable(tf.random.normal(shape=(2, 1)), name='W')
b = tf.Variable(tf.random.normal(shape=(1,)), name='b')

Define the hypothesis, cost function, and optimizer

hypothesis = tf.matmul(X, W) + b
cost = tf.reduce_mean(tf.square(hypothesis - Y))
optimizer = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.01).minimize(cost)

Check that the cost decreases

with tf.compat.v1.Session() as sess:
  sess.run(tf.compat.v1.global_variables_initializer())

  for step in range(50):
    c, W_, b_, _ = sess.run([cost, W, b, optimizer], feed_dict={X: x_data, Y: y_data})
    print('Step: %2d\t loss: %.2f\t' % (step, c))

  print(sess.run(hypothesis, feed_dict={X: [[4, 4]]}))

Keras (much simpler)

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD

x_data = np.array([[1], [2], [3]])
y_data = np.array([[10], [20], [30]])

model = Sequential([
  Dense(1)
])

model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.1))

model.fit(x_data, y_data, epochs=100)

y_pred = model.predict([[4]])
print(y_pred)

Using Kaggle

Sign up for Kaggle, log in, click your profile, then open the Account tab

Under API, click Create New API Token to download the kaggle.json file

Open the file in a browser and copy the username and key

Set the environment variables

import os
os.environ['KAGGLE_USERNAME'] = '[my_kaggle_username]' # username
os.environ['KAGGLE_KEY'] = '[my_kaggle_key]' # key
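Instead of copying the two values by hand, the downloaded file can also be read directly. The sketch below assumes kaggle.json holds a `username` field and a `key` field, which are the two values copied in the step above. The helper name `load_kaggle_credentials` and the file location are hypothetical, so adjust the path to wherever the file was saved.

```python
import json
import os


def load_kaggle_credentials(path):
    """Read username and key from a kaggle.json file into environment variables."""
    with open(path, encoding='utf-8') as f:
        creds = json.load(f)
    os.environ['KAGGLE_USERNAME'] = creds['username']
    os.environ['KAGGLE_KEY'] = creds['key']
    return creds['username']


# Hypothetical location: change this to where kaggle.json was downloaded
KAGGLE_JSON_PATH = 'kaggle.json'
if os.path.exists(KAGGLE_JSON_PATH):
    print('Loaded credentials for', load_kaggle_credentials(KAGGLE_JSON_PATH))
```

This also keeps the key out of the notebook itself, which helps if the notebook is shared later.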

Copy the API command for the dataset you want and run it
```
!kaggle datasets download -d ashydv/advertising-dataset
```

Unzip the dataset
```
!unzip /content/advertising-dataset.zip
```

Start the analysis!

Import the required libraries

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split

Load the dataset and check its shape

df = pd.read_csv('advertising.csv')
df.head(5)

print(df.shape)

Take a quick look at the dataset

sns.pairplot(df, x_vars=['TV', 'Newspaper', 'Radio'], y_vars=['Sales'], height=4)

Preprocess the dataset

x_data = np.array(df[['TV']], dtype=np.float32)
y_data = np.array(df['Sales'], dtype=np.float32)

print(x_data.shape)
print(y_data.shape)

x_data = x_data.reshape((-1, 1))
y_data = y_data.reshape((-1, 1))

print(x_data.shape)
print(y_data.shape)

Split the dataset into training and validation data

x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2, random_state=2021)

print(x_train.shape, x_val.shape)
print(y_train.shape, y_val.shape)

Train the model

model = Sequential([
  Dense(1)
])

model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.1))

model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val), # with validation data given, validation runs automatically at the end of every epoch
    epochs=100 # note the plural: epochs, not epoch
)

Predict on the validation data

y_pred = model.predict(x_val)

plt.scatter(x_val, y_val)
plt.scatter(x_val, y_pred, color='r')
plt.show()

Predicting sales from multiple X values

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam, SGD
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split

df = pd.read_csv('advertising.csv')

x_data = np.array(df[['TV', 'Newspaper', 'Radio']], dtype=np.float32)
y_data = np.array(df['Sales'], dtype=np.float32)

x_data = x_data.reshape((-1, 3))
y_data = y_data.reshape((-1, 1))

print(x_data.shape)
print(y_data.shape)

x_train, x_val, y_train, y_val = train_test_split(x_data, y_data, test_size=0.2, random_state=2021)

print(x_train.shape, x_val.shape)
print(y_train.shape, y_val.shape)

model = Sequential([
  Dense(1)
])

model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.1))

model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val), # with validation data given, validation runs automatically at the end of every epoch
    epochs=100 # note the plural: epochs, not epoch
)

Try running it with different learning rates
Try running it with different optimizers (Adam, SGD)
Try running it with different loss functions (mean_absolute_error, mean_squared_error)
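Before swapping the loss function, it may help to see how the two candidates score the same predictions. The sketch below is plain Python with hand-written versions of both losses. The function names mirror the Keras loss strings but these are not the Keras implementations, and the numbers are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 20.0, 30.0, 40.0]
y_close = [11.0, 19.0, 31.0, 39.0]    # every prediction is off by 1
y_outlier = [10.0, 20.0, 30.0, 44.0]  # three exact predictions, one off by 4

for name, y_pred in [('close', y_close), ('outlier', y_outlier)]:
    print(name,
          'MAE = %.1f' % mean_absolute_error(y_true, y_pred),
          'MSE = %.1f' % mean_squared_error(y_true, y_pred))
```

Both sets of predictions have the same mean absolute error, but the mean squared error is four times larger for the set with one big miss. Squaring punishes large errors more heavily, so the choice of loss changes which mistakes the model works hardest to avoid.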
