javascript - Extracting most important words from Elasticsearch index, using Node JS client - Stack Overflow


Inspired by the following Git repository and video, I'm trying to create a conceptual search for my domain, using word2vec as a synonyms filter for my queries.

Given the following document structure:

{
        "_index": "conversations",
        "_type": "conversation",
        "_id": "103130",
        "_score": 0.97602403,
        "_source": {
          "context": "Wele to our service, how can I help? do you offer a free trial",
          "answer": "Yes we do. Here is a link for our trial account."
        }
      }

I would like to iterate through the entire index and extract the words with the highest significance (tf-idf?).
Once I have the top 100 words, I'll create a synonyms filter using word2vec.

My question is: how can this be done using the ES Node JS client?


asked Nov 14, 2016 at 14:10 by Shlomi Schwartz; edited Jul 6, 2017 at 8:04 by David Lemon
  • tf-idf is not defined for a collection; it is defined for a document. You would end up with just the idf part, and it is very doubtful that is what you are looking for. – S van Balen Commented Dec 29, 2016 at 10:19
  • Thanks for the reply, can you suggest a better approach to extract significant words out of the index? – Shlomi Schwartz Commented Dec 29, 2016 at 10:23
  • Given a query, we could calculate tf-idf for terms in the query results as compared to the entire document space. I take it from your question that you want to do so prior to receiving a query. You could try either comparing your document space against another, more general one (e.g. the Internet, or Wikipedia), or you could calculate the information gain of all terms (or any other feature selection method). – S van Balen Commented Dec 29, 2016 at 10:46
  • The significance of a term normally depends on something. In feature selection you evaluate a feature (a term in this case) on how well it separates a target class. Since we are lacking a target, we would basically be evaluating how well it divides the search space, which brings us back to document frequency. But somehow I doubt that is what you are looking for. I suggest you look into extracting terms that are significant for your search space (high frequency in the search space, low document frequency on the internet) or terms that have been searched often in your application. – S van Balen Commented Dec 29, 2016 at 11:03
  • You could iterate over all documents, for each doc retrieve the terms and calculate the tf-idf value of each term, then count how many times the highest-value terms appear in your documents and take the top n terms (see the sketch after these comments). – Roni Gadot Commented Dec 31, 2016 at 14:52
 |  Show 5 more comments
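A rough sketch of that comment's approach, assuming the official @elastic/elasticsearch Node client (v8.x), a local node, and the index/field names from the question; everything else here is illustrative. It fetches documents, requests per-term statistics from the Term Vectors API, scores each term with tf-idf, and keeps the best-scoring terms.

// Sketch only: for brevity a single search page stands in for a full scroll.
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

async function topTerms(n = 100) {
  const best = new Map(); // term -> best tf-idf score seen so far

  const result = await client.search({
    index: 'conversations',
    size: 1000,                  // assumption: small index; otherwise use the scroll helper
    query: { match_all: {} },
    _source: false
  });

  for (const hit of result.hits.hits) {
    const tv = await client.termvectors({
      index: 'conversations',
      id: hit._id,
      fields: ['context'],
      term_statistics: true      // include doc_freq for each term
    });

    const field = tv.term_vectors.context;
    if (!field) continue;
    const docCount = field.field_statistics.doc_count;

    for (const [term, stats] of Object.entries(field.terms)) {
      // tf-idf per the comment above: term_freq * log(N / doc_freq)
      const tfidf = stats.term_freq * Math.log(docCount / stats.doc_freq);
      if (!best.has(term) || tfidf > best.get(term)) best.set(term, tfidf);
    }
  }

  return [...best.entries()].sort((a, b) => b[1] - a[1]).slice(0, n);
}

topTerms().then(console.log).catch(console.error);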

2 Answers


Tf-idf of documents is typically used to find the similarity of documents (using cosine similarity, Euclidean distance, etc.).

Tf, or term frequency, indicates the frequency of a word in a document. The higher the frequency of a word, the higher its importance.

Idf, or inverse document frequency, indicates the number of documents (in the input collection) that contain the word. The rarer the word, the higher its importance.

If we just use tf to build document vectors, we are prone to spam because common words (e.g. pronouns, conjunctions) gain more importance. Hence, the combination tf-idf gives better meaning and indicates the real significance of a word. In other words, to rank the words of a document by significance, it is not advisable to calculate just the tf of each word; instead, compute tf-idf over the entire input collection and rank by the tf-idf value, which shows the real significance of the keywords.
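For illustration, here is a minimal tf-idf sketch in JavaScript over a small in-memory collection; it is not tied to Elasticsearch, and the documents array is hypothetical.

// Minimal tf-idf sketch: score each term of each document against the whole collection.
function tfIdf(docs) {
  const tokenized = docs.map(d => d.toLowerCase().match(/[a-z]+/g) || []);

  // document frequency: in how many documents does each term appear?
  const df = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      df.set(term, (df.get(term) || 0) + 1);
    }
  }

  // tf-idf per document: tf(term, doc) * log(N / df(term))
  return tokenized.map(tokens => {
    const counts = new Map();
    for (const term of tokens) counts.set(term, (counts.get(term) || 0) + 1);
    const scores = {};
    for (const [term, count] of counts) {
      scores[term] = (count / tokens.length) * Math.log(docs.length / df.get(term));
    }
    return scores;
  });
}

console.log(tfIdf([
  'do you offer a free trial',
  'yes we do here is a link for our trial account'
]));

Common words that occur in every document get an idf of log(1) = 0, so they drop out of the ranking, which is exactly the spam-resistance described above.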

Have a look at this sample Python solution for calculating tf-idf values for a JSON list of tweets and finding similar tweets.

Github Sample

Elasticsearch provides a very specific aggregation which allows you to extract "significant keywords" for a subset of your index [1].

To work out what is significant, you need a foreground (the subset of docs you want to analyse) and a background (the entire corpus).

As you may realise, to identify a term as significant you need to compare how it appears in your corpus with how it appears in something else (for example a generic corpus). You may find an archive that contains a kind of general IDF score for terms (the Reuters corpus, the Brown corpus, Wikipedia, etc.). Then you can use: foreground document set -> your corpus; background document set -> the generic corpus.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
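With the Node JS client the question asks about, running such an aggregation could look roughly like the sketch below. It assumes the official @elastic/elasticsearch client (v8.x), a local node, and the index/field names from the question; significant_text is used here because context is an analysed text field (significant_terms would need a keyword sub-field or fielddata).

// Sketch only: extract "significant keywords" for the documents matching a query,
// compared against the whole conversations index as the background.
const { Client } = require('@elastic/elasticsearch');

const client = new Client({ node: 'http://localhost:9200' });

async function significantKeywords(queryText) {
  const response = await client.search({
    index: 'conversations',
    size: 0,                                  // only the aggregation is needed
    query: { match: { context: queryText } }, // foreground document set
    aggs: {
      keywords: {
        significant_text: { field: 'context', size: 100 }
      }
    }
  });

  // Each bucket carries the term plus its foreground/background doc counts and a score.
  return response.aggregations.keywords.buckets.map(b => b.key);
}

significantKeywords('free trial').then(console.log).catch(console.error);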
