Hi guys! I understand how hard it can be to choose a word embedding, especially when your data is specific to a domain. I have some interesting findings on word embedding techniques with domain data, and I am excited to share my experience with different kinds of data. So without any delay, let's start.
Word Embedding Technique with Domain Data –
Let's start with the general understanding of existing word embedding techniques. We tend to believe that predictive word embedding techniques like Word2Vec, FastText, and GloVe are far better than frequency-based embedding techniques like TF-IDF or Count Vectorizer. (Strictly speaking, GloVe is fitted to co-occurrence counts, but it is usually grouped with the learned dense embeddings.) That is often true, but not in all cases. Let's understand this with some examples.
Case 1 : Domain Data and Data Volume is Low –
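In this case you should apply a frequency-based embedding technique (TF-IDF or Count Vectorizer). If you look at the underlying implementation of Word2Vec and FastText, you will find the concept of subsampling (downsampling), in which some occurrences of words in each sentence are randomly dropped (not considered) during training, based on a threshold value and the word's frequency in the corpus. The more frequent a word is relative to the threshold, the more likely it is to be dropped. In a small domain corpus, a key domain term can easily be one of the most frequent words, so you may lose some of its already scarce training examples. In my experience, TF-IDF performs better than the others on small domain data.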
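To make the subsampling idea concrete, here is a minimal sketch in plain Python. It uses the discard formula from the original Word2Vec paper, P(discard) = 1 - sqrt(t / f), where f is the relative frequency of the word and t is the threshold. The corpus size, the word counts, and the function name `discard_probability` are made up for illustration, and the threshold of 1e-3 is an assumption (it matches the default `sample` value in gensim). Real implementations use a slightly different variant of this formula, so treat the numbers as indicative only.

```python
import math


def discard_probability(word_count, total_words, threshold=1e-3):
    """Probability of dropping one occurrence of a word during subsampling.

    Formula from the original Word2Vec paper:
        P(discard) = 1 - sqrt(threshold / f)
    where f is the word's relative frequency in the corpus.
    """
    frequency = word_count / total_words
    if frequency <= threshold:
        return 0.0  # words at or below the threshold are always kept
    return 1.0 - math.sqrt(threshold / frequency)


# Hypothetical small finance corpus: 50,000 tokens in total.
total = 50_000
counts = {"the": 3000, "loan": 800, "collateral": 40}

for word, count in counts.items():
    p = discard_probability(count, total)
    print(f"{word:<12} discard probability = {p:.2f}")
```

In this toy corpus the domain word "loan" is frequent enough that most of its occurrences would be dropped, which is the effect described above. A frequency-based method like TF-IDF has no such step, so every occurrence counts.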
Case 2 : Domain Data and Data Volume is High –
If you have domain data, say from finance, it will contain key terms that are completely different from general language. A pretrained model may miss these key terms, because such models are trained on general-purpose corpora such as news articles, Wikipedia, or web crawl data. In this case you should train your own embeddings with Word2Vec or FastText on your own data. In short, the algorithm is still predictive, but it is trained on your in-house data.
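One quick way to see whether a pretrained model misses your key terms is to compare your domain vocabulary with the model's vocabulary. The sketch below is plain Python and only illustrates the idea: the helper names `missing_terms` and `coverage`, the word lists, and the choice to compare in lowercase are all assumptions, not part of any library. In practice `pretrained_vocab` would be the list of words from whichever pretrained model you load.

```python
def missing_terms(domain_terms, pretrained_vocab):
    """Return the domain terms that the pretrained vocabulary does not contain."""
    # Assumption: compare in lowercase, since many pretrained vocabularies are lowercased.
    vocab = {word.lower() for word in pretrained_vocab}
    return sorted(term for term in domain_terms if term.lower() not in vocab)


def coverage(domain_terms, pretrained_vocab):
    """Fraction of domain terms that are present in the pretrained vocabulary."""
    terms = list(domain_terms)
    if not terms:
        return 1.0  # nothing to cover
    missing = missing_terms(terms, pretrained_vocab)
    return 1.0 - len(missing) / len(terms)


# Hypothetical data, standing in for a real model vocabulary and a real term list.
pretrained_vocab = ["bank", "loan", "interest", "movie", "actor", "news"]
finance_terms = ["loan", "interest", "ebitda", "amortization"]

print(missing_terms(finance_terms, pretrained_vocab))
print(coverage(finance_terms, pretrained_vocab))
```

If a large share of your key terms is missing, that is a good sign you are in Case 2 and should train on your own data.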
Case 3 : General and Large Data –
All you need here is a pretrained predictive embedding model (GloVe, FastText, etc.). It gives you good coverage of general vocabulary out of the box, without any training on your side. This really helps in chatbot or general conversation implementations.
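Pretrained vectors are commonly distributed as plain text, one word per line followed by its numbers (GloVe uses this layout; fastText `.vec` files add a header line with the word count and dimension, which you would need to skip first). Here is a minimal sketch of reading such a file and comparing two words, using only the standard library. The three tiny vectors are made up and stand in for a real file, and `load_vectors` and `cosine_similarity` are hypothetical helper names.

```python
import io
import math


def load_vectors(lines):
    """Parse vectors from lines of the form: word v1 v2 ... vN."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Tiny made-up sample standing in for a real pretrained vector file.
sample_file = io.StringIO(
    "hello 0.9 0.1 0.0\n"
    "hi 0.8 0.2 0.1\n"
    "invoice 0.0 0.1 0.9\n"
)
vectors = load_vectors(sample_file)

print(cosine_similarity(vectors["hello"], vectors["hi"]))
print(cosine_similarity(vectors["hello"], vectors["invoice"]))
```

With a real file you would pass an open file object instead of `sample_file`. For anything beyond a quick experiment, a library loader is likely the better choice, since real files contain hundreds of thousands of words.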
Choosing the best word embedding technique for domain data is crucial during model development. We have tried to address this problem in three different scenarios, and I hope it will help you choose the best approach for your case. Obviously, I am not claiming that my findings and suggestions will work for every type of data. Data is king in data science: as a rough rule of thumb, algorithms play only about 30 percent of the role, while 70 percent is all about the data. Data preprocessing is also a game changer, since it helps to extract the meaningful information. So what I mean to say is that the findings in this post should help you in most cases, but there can be exceptions as well.
I hope you have found this article useful and interesting. To get more posts like this on Data Science, NLP, and Text Analytics, please subscribe.
Data Science Learner Team