As you all know, machine learning models can never process raw text directly. To use text, you first need to convert it into numeric vectors. Word embedding is simply a technique for converting text into numeric form, and there are several ways to do it. This article will brief you on word embedding in Python through various approaches.
Word Embedding in Python: Different Approaches-
In broad terms, there are two different approaches –
1. Frequency based Embedding
2. Prediction based Embedding
Let’s understand Frequency based Embedding here; Prediction based Embedding will be covered in a different article.
1. Frequency based Embedding –
There are three sub-approaches under Frequency based Embedding. Let’s go through them –
1.1 Count Vectors-
This is one of the simplest techniques in word embedding. Here the complete vocabulary is converted into tokens; each token becomes a column, and each document becomes a row. The value of each cell is the count of that word in the corresponding document.
1.2 TF-IDF-
Count vectors have some drawbacks. To understand them, think about articles ("a", "the") and punctuation in a sentence. They always occur with high frequency even though they are not relevant. TF-IDF addresses this issue: we first calculate the term frequency of a word in a particular document, and then count how many documents contain that term. Let’s understand the formula here –
TF-IDF score = TF (term frequency in a document) * IDF (inverse document frequency)
TF = (number of occurrences of a word in a document) / (total number of words in that document)
IDF = log [(total number of documents) / (number of documents where that term occurs)]
Now let’s think about the word “THE”. Suppose it appears 20 times in Doc1, where the total number of words in Doc1 is 200. Suppose we have 30 documents in total and “THE” appears in almost every one of them. Do you really think it will make any difference while solving machine learning problems? Obviously, it should get lower priority than other words. A count vectorizer would give that word a higher weight, but TF-IDF solves the problem more efficiently. Let’s see how –
TF = 20 / 200 = 0.10
IDF = log(30 / 30) = 0
TF-IDF = 0.10 * 0 = 0
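The arithmetic above can be verified with a few lines of plain Python, using the raw formulas from this article:

```python
import math

# Worked example for the word "THE":
# it occurs 20 times in Doc1, which has 200 words in total,
# and it appears in all 30 documents of the corpus.
tf = 20 / 200            # term frequency within Doc1
idf = math.log(30 / 30)  # log(total documents / documents containing the term)
score = tf * idf

print(tf, idf, score)  # 0.1 0.0 0.0
```

Because the word occurs in every document, its IDF is log(1) = 0, which zeroes out the whole score.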
We took “THE” as the example, which would usually be removed as a stop word during NLP preprocessing, but other words in the vocabulary show the same behavior.
1.3 Co-Occurrence Matrix-
The approaches above are good. Still, they are not capable of maintaining semantic relationships. It is true in language that similar words tend to appear in the same context. The co-occurrence matrix is built on exactly this idea. We slide a fixed-size window over the text, taking the words to the left and right of each position together as its context. We create a matrix where the tokens of the vocabulary become both the rows and the columns, and each cell holds the count of how often the two words appear within the same sliding window. Note that a smaller window size loses some information, while a larger one holds more relationships.
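The sliding-window counting described above can be sketched in plain Python (the example sentence, window size, and function name are my assumptions):

```python
from collections import defaultdict

def co_occurrence(tokens, window=2):
    """Count how often each ordered pair of words appears within
    `window` positions of each other (symmetric context window)."""
    counts = defaultdict(int)
    for i, word in enumerate(tokens):
        # Look at neighbors up to `window` positions left and right.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "he is a good boy she is a good girl".split()
counts = co_occurrence(tokens, window=2)
print(counts[("a", "good")])  # 2: "a" precedes "good" twice in the text
```

The resulting pair counts are exactly the cells of the co-occurrence matrix: row "a", column "good" holds 2, and by symmetry of the window, row "good", column "a" holds 2 as well.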
As we mentioned at the start of the article, Prediction based Word Embedding will be discussed in a different article.
Data Science Learner Team