cs224n-Lecture 1(Introduction and Word Vectors)

2019-10-05

machine learning / natural language processing

Lecture 1(Introduction and Word Vectors)

Traditional NLP

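Words are treated as discrete symbols (one-hot vectors)
Localist Representation
Because English can form new words without limit, the vectors must be very large (about 500,000 dimensions)
Also, “motel” and “hotel” are similar words, but their one-hot vectors are orthogonal
- No natural notion of similarity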
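The orthogonality problem can be seen with a few lines of Python. This is a minimal sketch with a made-up three-word vocabulary (`vocab`, `one_hot`, and `dot` are illustrative names, not from the lecture): the dot product between any two distinct one-hot vectors is zero, whether or not the words are related.

```python
# Toy vocabulary (an assumption); a real one would have hundreds of thousands of entries.
vocab = ["hotel", "motel", "banana"]

def one_hot(word, vocab):
    """Return a list with 1 at the word's index and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

hotel = one_hot("hotel", vocab)
motel = one_hot("motel", vocab)
banana = one_hot("banana", vocab)

# Related and unrelated pairs look identical: both dot products are 0.
hotel_motel = dot(hotel, motel)
hotel_banana = dot(hotel, banana)
```

Because both values are 0, nothing in the representation says that "hotel" is closer to "motel" than to "banana".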

SVD based methods

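Brief description
- Build a co-occurrence matrix that counts, using a fixed-size window, how often each word appears near a given word
- Reduce its dimensionality with SVD to obtain word vectors
Drawbacks
- The dimensions of the matrix change every time a word is added
- The matrix is extremely sparse (most words never co-occur)
- The dimensionality is very high (10^6 × 10^6)
- Training (computing the SVD) has a quadratic cost, O(mn^2) for an m × n matrix, among other issues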
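The two steps above can be sketched with NumPy. The corpus, the window size of 1, and the choice of keeping k = 2 dimensions are all assumptions made for illustration; scaling the left singular vectors by the singular values is one common convention, not the only one.

```python
import numpy as np

# Toy corpus and window size (assumptions for illustration).
corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 1

sentences = [s.split() for s in corpus]
vocab = sorted({w for sent in sentences for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# X[i, j] counts how often word j appears within `window` positions of word i.
X = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for t, center in enumerate(sent):
        lo, hi = max(0, t - window), min(len(sent), t + window + 1)
        for pos in range(lo, hi):
            if pos != t:
                X[index[center], index[sent[pos]]] += 1

# Full SVD, then keep only the top-k singular directions as word vectors.
U, S, Vt = np.linalg.svd(X)
k = 2
word_vectors = U[:, :k] * S[:k]   # one k-dimensional row per vocabulary word
```

Even in this tiny example most entries of `X` are zero, and adding a new word would change the shape of `X` and require recomputing the SVD.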

Recent NLP

Distributional Semantics
- The only difference between localist and distributed representation is whether individual units have “meaning and interpretation” or not.
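A word's meaning is given by the words that appear in its context (to its left and right)
Word vector = [0.285, -0.332, …, 0.271]
- Dense vector (at most about 4,000 dimensions)
- = word embedding, word representation
- Words with similar meanings are close together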
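"Close together" is often measured with cosine similarity. The sketch below uses hand-picked 3-dimensional vectors that are purely hypothetical (real embeddings are learned and have far more dimensions); it only shows how the comparison is made.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dense vectors, chosen by hand for illustration only.
hotel = [0.9, 0.1, 0.3]
motel = [0.8, 0.2, 0.35]
banana = [-0.2, 0.9, -0.5]

sim_related = cosine(hotel, motel)     # close to 1
sim_unrelated = cosine(hotel, banana)  # negative for these toy values
```

Unlike one-hot vectors, dense vectors can give a graded similarity score between any two words.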

Word2Vec(2013)

Start with a very large corpus
Every word is represented by a vector
Every position t in the text has a center word c and context (outside) words o

Main Concept
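- Every word has its own word vector
- Here each word has two vectors, one for when it is the center word and one for when it is a context word (v -> center, u -> context)
- The word vectors are estimated in the process of predicting words from their surrounding context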

Likelihood

$\text{Likelihood} = L(\theta) = \prod_{t=1}^{T}\prod_{\substack{-m\leq j\leq m \\ j\neq 0}} P(w_{t+j}|w_{t};\theta)$

m => window size

t => each position

Find the theta that maximizes the likelihood.

$P(x|y)$ is the probability of $x$ given that $y$ has occurred, so $P(w_{t+j}|w_t)$ is the probability that the context word at position $t+j$ is $w_{t+j}$ when the center word is $w_t$.
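To make the double product concrete, the sketch below lists the (center, context) pairs that contribute one factor each to the likelihood. The function name `window_pairs` and the example sentence are illustrative, and skipping positions that fall outside the text is an assumption about boundary handling that the formula itself does not specify.

```python
def window_pairs(tokens, m):
    """Yield (center, context) for every position t and offset j with 0 < |j| <= m."""
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the center itself and offsets that fall outside the text.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            yield center, tokens[t + j]

tokens = "problems turning into banking crises".split()
pairs = list(window_pairs(tokens, m=2))
```

With 5 tokens and m = 2 this yields 14 pairs, so the likelihood for this sentence is a product of 14 conditional probabilities.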

Objective Function
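- The function used to optimize theta
- = Loss Function
  $J(\theta) = -\frac{1}{T}\log L(\theta)$
  - The sign is negative so that we minimize rather than maximize
  - The log turns the product in L(theta) into a sum (-> easier to compute)
  - The log also prevents numerical underflow
  - Dividing by T takes the average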
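The underflow point is easy to check numerically. In this sketch the probabilities are invented (2000 factors of 0.001 each), and for simplicity it assumes one factor per position so that T equals the number of factors.

```python
import math

# Hypothetical per-position probabilities: 2000 factors of 1e-3 each.
probs = [1e-3] * 2000

# Multiplying directly: the true value 1e-6000 is far below what float64 can hold.
product = 1.0
for p in probs:
    product *= p

# Summing logs instead stays well within range.
log_sum = sum(math.log(p) for p in probs)

# Average negative log-likelihood, i.e. J = -(1/T) * log L with T = len(probs).
J = -log_sum / len(probs)
```

Here `product` collapses to 0.0 while `log_sum` remains finite, and `J` comes out as the negative log of the per-factor probability.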

P(o|c)
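$P(o|c) = \frac{\exp(\vec{u}_o^\top \vec{v}_c)}{\sum_{w\in V}\exp(\vec{u}_w^\top \vec{v}_c)}$
- Softmax Function
  - max => amplifies the probability of the largest of the x_i
  - soft => still assigns some probability to the smaller x_i
- Conclusion: P(o|c) is the softmax of the similarity (dot product) between o and c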
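A small sketch of this computation follows. The 2-dimensional vectors and the three-word vocabulary are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the formula (it leaves the result unchanged).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn real-valued scores into probabilities that sum to 1."""
    top = max(scores)                       # stability shift (an added assumption)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vectors: one center vector v_c and a context vector u_w per word.
v_c = [1.0, 0.5]
U = {"bank": [2.0, 1.0], "river": [0.5, 0.0], "tree": [-1.0, -1.0]}

words = list(U)
scores = [dot(U[w], v_c) for w in words]    # u_w . v_c for every w in V
probs = dict(zip(words, softmax(scores)))   # P(w | c) for every w in V
```

The word with the largest dot product ("bank" here) receives most of the probability mass, while the others still get a small positive share.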

Optimization
- theta is the set of all model parameters
- With d-dimensional vectors and a vocabulary of V words, theta is a vector of length 2dV (each word has two vectors, context and center)
- Derivative of Objective Function
  $J(\theta) = -\frac{1}{T}\log L(\theta)$
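For reference, the gradient of one term of the objective with respect to the center vector can be worked out from the softmax definition of P(o|c). This is the standard derivation, written with the same symbols (u for context vectors, v for center vectors); x is just a summation index introduced here.

$\log P(o|c) = \vec{u}_o^\top \vec{v}_c - \log \sum_{w\in V} \exp(\vec{u}_w^\top \vec{v}_c)$

$\frac{\partial}{\partial \vec{v}_c} \log P(o|c) = \vec{u}_o - \sum_{x\in V} \frac{\exp(\vec{u}_x^\top \vec{v}_c)}{\sum_{w\in V}\exp(\vec{u}_w^\top \vec{v}_c)}\,\vec{u}_x = \vec{u}_o - \sum_{x\in V} P(x|c)\,\vec{u}_x$

The result is the observed context vector minus the expected context vector under the current model, so the gradient vanishes when the model's expectation matches what is observed.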

Vector Composition
- Word vectors built this way support vector composition (vector arithmetic)
- ex) Queen - Woman + Man ≈ King
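The arithmetic can be illustrated with hand-built toy vectors. Everything below is an assumption for demonstration: the two axes are given an artificial meaning (royalty, gender), whereas learned word2vec vectors have no such interpretable axes and the analogy only holds approximately.

```python
# Hand-built toy vectors with two artificial axes: (royalty, gender).
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# queen - woman + man, computed component by component.
target = [q - w + m for q, w, m in
          zip(vectors["queen"], vectors["woman"], vectors["man"])]

# Nearest remaining word, excluding the three words used in the query.
candidates = [w for w in vectors if w not in {"queen", "woman", "man"}]
nearest = min(candidates, key=lambda w: sq_dist(vectors[w], target))
```

Excluding the query words from the candidates mirrors how analogy lookups are usually done, since the nearest vector is otherwise often one of the inputs.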

Two model variants
- Skip-gram: predicts the context words from the center word -> the method covered here
- Continuous Bag of Words (CBOW): predicts the center word from the context words; works well even on small datasets.
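The difference between the two variants shows up in how training examples are formed. The sketch below only builds the (input, target) examples and does no training; the function names, the sentence, and the window size m = 1 are illustrative assumptions.

```python
def skipgram_examples(tokens, m):
    """Skip-gram: one (center, context) example per context word in the window."""
    out = []
    for t, center in enumerate(tokens):
        for pos in range(max(0, t - m), min(len(tokens), t + m + 1)):
            if pos != t:
                out.append((center, tokens[pos]))
    return out

def cbow_examples(tokens, m):
    """CBOW: one (context words, center) example per position."""
    out = []
    for t, center in enumerate(tokens):
        context = [tokens[pos]
                   for pos in range(max(0, t - m), min(len(tokens), t + m + 1))
                   if pos != t]
        out.append((context, center))
    return out

tokens = ["the", "quick", "brown", "fox"]
sg = skipgram_examples(tokens, m=1)   # input: center word, target: one context word
cb = cbow_examples(tokens, m=1)       # input: all context words, target: center word
```

For this sentence skip-gram produces 6 examples and CBOW produces 4, one per position, because CBOW pools the whole window into a single input.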

Purumir's Blog
