- Class material for February 26, 2015
After TF-IDF scoring of 100 documents…
- All words appearing in the 100 docs form the corpus vocabulary.
- Separately, each word in each doc has its own TF-IDF score. That is, each doc is represented as a vector of the TF-IDF scores of all words in the vocabulary.
- e.g.) "It was awesome!" -> [0, .2345, 0, 0, …, 1.23, 3.4] (if the vocabulary is ordered as ["you", "it", "sucks", "cold", …, "was", "awesome"])
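The doc-as-vector idea above can be sketched in a few lines of Python. This is a minimal sketch over a hypothetical three-document corpus; the formulas used (raw-count TF, `idf = log(N / df)`) are one common choice among several TF-IDF variants.

```python
import math
from collections import Counter

# Hypothetical toy corpus (assumption: lowercase, whitespace-tokenized).
docs = [
    "it was awesome",
    "it was cold",
    "you sucks",
]

# The vocabulary is every word that appears in any document.
vocab = sorted({w for d in docs for w in d.split()})

N = len(docs)
# Document frequency: in how many docs does each word appear?
df = {w: sum(w in d.split() for d in docs) for w in vocab}

def tfidf_vector(doc):
    """Represent one document as a TF-IDF vector over the full vocabulary."""
    tf = Counter(doc.split())
    return [tf[w] * math.log(N / df[w]) for w in vocab]

vec = tfidf_vector("it was awesome")
# Words absent from the doc get 0; words shared by all docs also get 0,
# because idf = log(N/N) = 0.
```

Note that the vector has one slot per vocabulary word, so most entries are 0 for any single short document, just as in the example vector above.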
What is this TF-IDF for?
We’ve learned a lot about the TF-IDF method: how to calculate the TF score and the IDF score, the conceptual assumption behind the method (Bag of Words), and so on. So what is it for? What can we use it for?
- Sit with your group members.
- Discuss how this score could be used, in general or for your project. (10 min)
Is TF-IDF better than just counting hits?
One of the easiest ways to find documents relevant to a specific query is to find the documents that contain the query words many times. In what situations does TF-IDF work better than this?
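As a starting point for the discussion, here is a toy sketch (hypothetical documents) of one case where raw hit counting can mislead: a word that appears in every document gets an IDF of log(N/N) = 0, so TF-IDF discounts it entirely, no matter how many hits it scores in any single document.

```python
import math

# Hypothetical docs: "the" is common everywhere, and doc 0 repeats it.
docs = [
    "the the the cat",   # most raw hits for "the"
    "the dog",
    "the bird",
]
N = len(docs)

def hits(word, doc):
    """Raw count of a query word in one document."""
    return doc.split().count(word)

def tfidf(word, doc):
    """TF-IDF with raw-count TF and idf = log(N / df)."""
    df = sum(word in d.split() for d in docs)
    return hits(word, doc) * math.log(N / df)

# Counting hits ranks doc 0 far above the others for the query "the",
# but TF-IDF scores every doc 0 because "the" appears in all of them.
raw = [hits("the", d) for d in docs]        # [3, 1, 1]
weighted = [tfidf("the", d) for d in docs]  # [0.0, 0.0, 0.0]
```

The same down-weighting is what keeps queries containing common words ("the cat") from being dominated by the common word instead of the informative one.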