In this post you will find a K means clustering example with word2vec in Python code. Word2Vec is one of the popular methods for language modeling and feature learning in natural language processing (NLP). It is used to create word embeddings whenever we need a vector representation of words.
For example, in data clustering algorithms we can use Word2Vec instead of the bag of words (BOW) model. The advantage of using Word2Vec is that it can capture the similarity between individual words as the distance between their vectors.
The example in this post will demonstrate how to use the results of Word2Vec word embeddings in clustering algorithms. For this, the Word2Vec model will be fed into several K means clustering algorithms from the NLTK and Scikit-learn libraries.
Here we will do clustering at the word level. Our clusters will be groups of words. In case we need to cluster at the sentence or paragraph level, here is a link that shows how to move from word level to sentence/paragraph level:
Word2vec is a technique for natural language processing. The word2vec algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. As the name implies, word2vec represents each distinct word with a particular list of numbers called a vector.
There is also the doc2vec embedding model, which is based on word2vec. doc2vec was created for embedding sentences, paragraphs, and documents. Here is a link on how to use doc2vec embeddings in machine learning:
Text Clustering with doc2vec Word Embedding Machine Learning Model
Getting Word2vec
Using word2vec from the Python library gensim is simple and well described in tutorials and on the web [3], [4], [5]. Here we just look at a basic example. For the input we use a sequence of sentences hard-coded in the script.
Now we have a model with the words embedded. We can query the model for similar words, as below, or ask it to represent a word as a vector:
To get the vocabulary or the number of words in the vocabulary:
This will produce: ['good', 'this', 'post', 'another', 'learning', 'last', 'the', 'and', 'more', 'new', 'is', 'one', 'about', 'machine', 'book']
Now we will feed the word embeddings into a clustering algorithm such as k means, which is one of the most popular unsupervised learning algorithms for finding interesting segments in the data. It can be used for separating customers into groups, combining documents into topics, and for many other applications.
![Word2vec Python](/uploads/1/3/7/5/137564690/544982139.jpg)
You will find below two k means clustering examples.
K Means Clustering with NLTK Library
Our first example uses the k means algorithm from the NLTK library.
To use word2vec word embeddings in machine learning clustering algorithms, we initialize X as below:
Now we can plug our X data into clustering algorithms.
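The NLTK step can be sketched as follows. To keep the example independent of gensim, X here is a small hand-made matrix standing in for the word embeddings (an assumption), and `NUM_CLUSTERS`, the repeat count, and the seeded `rng` are illustrative choices.

```python
import random

import numpy as np
from nltk.cluster import KMeansClusterer
from nltk.cluster.util import cosine_distance

NUM_CLUSTERS = 3

# Stand-in for the embedding matrix: six vectors in three rough directions.
X = np.array([
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1],
    [0.1, 1.0, 0.0], [0.2, 0.9, 0.1],
    [0.0, 0.1, 1.0], [0.1, 0.2, 0.9],
])

clusterer = KMeansClusterer(
    NUM_CLUSTERS,
    distance=cosine_distance,   # euclidean_distance is the other option
    repeats=25,                 # rerun with different random starting means
    rng=random.Random(0),       # seeded only for repeatable runs
    avoid_empty_clusters=True,
)

# With assign_clusters=True the call returns one cluster index per vector.
assigned_clusters = clusterer.cluster(list(X), assign_clusters=True)
print(assigned_clusters)
```

The cluster indices themselves are arbitrary; only the grouping they induce is meaningful.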
NLTK offers several options for the distance function used by the clusterer, as below:
nltk.cluster.util.cosine_distance(u, v)
Returns 1 minus the cosine of the angle between vectors v and u. This is equal to 1 – (u.v / |u||v|).
nltk.cluster.util.euclidean_distance(u, v)
Returns the euclidean distance between vectors u and v. This is equivalent to the length of the vector (u – v).
Here we use cosine distance to cluster our data.
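To make the difference between the two distances concrete, here is a small NumPy re-implementation of both formulas. These are illustrative stand-ins written for this explanation, not the NLTK source.

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean_distance(u, v):
    """Length of the vector (u - v)."""
    return np.linalg.norm(u - v)

u = np.array([1.0, 0.0])
v = np.array([0.0, 2.0])   # perpendicular to u
w = np.array([3.0, 0.0])   # same direction as u, but three times longer

print(cosine_distance(u, v))      # perpendicular vectors: distance 1
print(cosine_distance(u, w))      # same direction: distance 0
print(euclidean_distance(u, w))   # lengths differ by 2
```

Cosine distance ignores vector length and compares only direction, which is one reason it is a common choice for word embeddings.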
After we get the cluster results, we can associate each word with the cluster it was assigned to:
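A sketch of that association in plain Python. The `words` and `assigned_clusters` values are copied from the output reported in this post and serve only as stand-ins; in a real run they would come from the model vocabulary and the clusterer, and the labels can change between runs because k means starts from random means.

```python
from collections import defaultdict

# Stand-in values taken from the output reported in this post.
words = ["good", "this", "post", "another", "learning", "last", "the",
         "and", "more", "new", "is", "one", "about", "machine", "book"]
assigned_clusters = [0, 2, 1, 2, 2, 1, 2, 2, 0, 1, 0, 1, 2, 1, 2]

# Print each word next to its cluster index.
for word, cluster_id in zip(words, assigned_clusters):
    print(f"{word}:{cluster_id}")

# Optionally group the words by cluster for easier reading.
groups = defaultdict(list)
for word, cluster_id in zip(words, assigned_clusters):
    groups[cluster_id].append(word)
```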
Here is the output for the above:
good:0
this:2
post:1
another:2
learning:2
last:1
the:2
and:2
more:0
new:1
is:0
one:1
about:2
machine:1
book:2
K Means Clustering with Scikit-learn Library
This example is based on k means from the scikit-learn library.
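A minimal sketch of the scikit-learn version. To keep it self-contained, X is a small synthetic matrix standing in for the word embeddings (an assumption), and `NUM_CLUSTERS`, `n_init`, and `random_state` are illustrative settings.

```python
import numpy as np
from sklearn.cluster import KMeans

NUM_CLUSTERS = 3

# Synthetic stand-in for the embedding matrix: three tight groups of five
# 4-dimensional vectors each.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0, 0.0],
                    [0.0, 5.0, 0.0, 0.0],
                    [0.0, 0.0, 5.0, 0.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((5, 4)) for c in centers])

kmeans = KMeans(n_clusters=NUM_CLUSTERS, n_init=10, random_state=0)
kmeans.fit(X)

labels = kmeans.labels_               # cluster index for each row of X
centroids = kmeans.cluster_centers_   # one centroid per cluster

print(labels)
print(centroids.shape)
```

Unlike the NLTK clusterer, scikit-learn's `KMeans` always uses Euclidean distance.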
In this example we also get some useful metrics to estimate clustering performance.
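Two metrics that scikit-learn makes readily available are sketched below: `KMeans.score`, which is the negative of the k means objective (sum of squared distances of samples to their closest centroid), and the silhouette coefficient, which ranges from -1 to 1 with higher values meaning better separated clusters. Whether these are the metrics the original script reported is an assumption. X is the same kind of synthetic stand-in matrix.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

# Synthetic stand-in for the embedding matrix: three tight, well separated groups.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0, 0.0, 0.0],
                    [0.0, 5.0, 0.0, 0.0],
                    [0.0, 0.0, 5.0, 0.0]])
X = np.vstack([c + 0.1 * rng.standard_normal((5, 4)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# Negative k means objective: values closer to 0 mean tighter clusters.
score = kmeans.score(X)
print("Score:", score)

# Silhouette coefficient averaged over all samples.
silhouette = metrics.silhouette_score(X, labels, metric="euclidean")
print("Silhouette score:", silhouette)
```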
Output:
Word2vec Python Code
Here is the full Python code of the script.
References
1. Word embedding
2. Comparative study of word embedding methods in topic segmentation
3. models.word2vec – Deep learning with word2vec
4. Word2vec Tutorial
5. How to Develop Word Embeddings in Python with Gensim
6. nltk.cluster package