Inspired by the following git repo and video, I'm trying to create a conceptual search for my domain, using word2vec
as a synonyms filter for my queries.
Given the following document structure:
{
  "_index": "conversations",
  "_type": "conversation",
  "_id": "103130",
  "_score": 0.97602403,
  "_source": {
    "context": "Welcome to our service, how can I help? do you offer a free trial",
    "answer": "Yes we do. Here is a link for our trial account."
  }
}
I would like to iterate through the entire index and extract the words with the highest significance (tf-idf?).
Once I have the list of the top 100 words, I'll create a synonyms filter using word2vec.
My question is: How can this be done using the ES Node JS client?
Tf-idf of documents is typically used to find the similarity of documents (using cosine similarity, Euclidean distance, etc.).
Tf, or term frequency, indicates the frequency of a word in a document: the higher the frequency of the word, the higher its importance.
Idf, or inverse document frequency, indicates the number of documents (in the input collection) that contain the word: the rarer the word, the higher its importance.
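One standard formulation combines the two per term t and document d, over a collection of N documents:
tf-idf(t, d) = tf(t, d) * log(N / df(t))
where df(t) is the number of documents containing t. Rare terms get a large idf boost, while terms that appear everywhere score near zero.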
If we just use tf to build the document vector, we are prone to spam, because common words (for example pronouns, conjunctions, etc.) gain more importance. Hence, the combination tf-idf gives better meaning and indicates the real significance of a word. In other words, to rank the words of a document by significance, it is not advisable to calculate just the tf of each word; instead, compute tf-idf over the entire input collection and rank by the tf-idf value, which reflects the real significance of the keywords.
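To make that concrete, here is a minimal tf-idf sketch in TypeScript. The two sample documents and the naive regex tokenizer are assumptions for illustration only; substitute texts pulled from your own index:

// Minimal tf-idf sketch (hypothetical sample docs; replace with texts from your index).
const docs: string[] = [
  "welcome to our service, how can I help? do you offer a free trial",
  "yes we do. here is a link for our trial account.",
];

// Naive tokenizer: lowercase alphabetic words only (an assumption; use a real analyzer in production).
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z]+/g) ?? [];

// Document frequency df(t): how many docs contain each term.
const df = new Map<string, number>();
for (const doc of docs) {
  for (const term of new Set(tokenize(doc))) {
    df.set(term, (df.get(term) ?? 0) + 1);
  }
}

// tf-idf(t, d) = tf(t, d) * log(N / df(t))
function tfidf(doc: string): Map<string, number> {
  const tokens = tokenize(doc);
  const tf = new Map<string, number>();
  for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);
  const scores = new Map<string, number>();
  for (const [term, count] of tf) {
    scores.set(term, (count / tokens.length) * Math.log(docs.length / (df.get(term) ?? 1)));
  }
  return scores;
}

// Rank the first document's terms by significance, highest first.
const ranked = [...tfidf(docs[0])].sort((a, b) => b[1] - a[1]);
console.log(ranked.slice(0, 10));

Note that with only two documents every shared term scores zero, which is exactly the spam resistance described above; over a real corpus, the top of this ranking is a reasonable candidate list for the word2vec synonyms step.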
Have a look at a sample Python solution for calculating tf-idf values for a list of JSON tweets and finding similar tweets:
Github Sample
Elasticsearch provides a very specific aggregation which allows you to extract "significant keywords" for a subset of your index [1].
To elaborate on what is significant: you need a foreground (the subset of docs you want to analyse) and a background (the entire corpus).
As you may realize, to identify a term as significant you need to compare how it appears in your corpus versus something else (for example, a generic corpus). You may find an archive that contains a sort of general IDF score for terms (the Reuters corpus, the Brown corpus, Wikipedia, etc.). Then you can use:
Foreground document set -> your corpus
Background document set -> generic corpus
A minimal Node JS example follows below.
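Since the question asks specifically about the Node JS client, here is a minimal sketch with the official @elastic/elasticsearch package (a v8-style API is assumed; the match query defining the foreground and the sampler shard_size are placeholders to adapt):

import { Client } from "@elastic/elasticsearch";

const client = new Client({ node: "http://localhost:9200" });

async function significantKeywords(): Promise<void> {
  const result = await client.search({
    index: "conversations",
    size: 0, // we only want the aggregation, not the hits
    // Foreground: the subset of docs to analyse (here, docs matching a hypothetical query).
    query: { match: { context: "trial" } },
    aggs: {
      sampled: {
        // Sample the best-matching docs per shard to keep the text analysis cheap.
        sampler: { shard_size: 200 },
        aggs: {
          keywords: {
            // Terms unusually frequent in the foreground vs. the index-wide background.
            significant_text: { field: "context", size: 100 },
          },
        },
      },
    },
  });
  console.log(result.aggregations);
}

significantKeywords().catch(console.error);

The size: 100 on the aggregation matches the question's "top 100 words" goal; each returned bucket carries a significance score comparing foreground and background frequencies.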
[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html