[2015 Spring AI: Mining the Social Web] TF-IDF (2) & Sentiment Analysis

1. TF-IDF

– Open the Data File, then Build the Words Corpus and the Tweet Dictionary

In [2]:
# Mostly similar to Example 4-9, "Querying Google+ data with TF-IDF",
# in our textbook "Mining the Social Web", 4.4.2 Applying TF-IDF to Human Languages

data = "oscar_tweets.txt"
tweet_dictionary = {}
words_corpus = []
i = 0
for line in open(data):
    if len(line.strip().split())!=0:
        tweet_dictionary[i] = line.lower()
        words_corpus.append(line.lower().split())
        i += 1
print tweet_dictionary[1]
print words_corpus[1]
rt @dory: when you're washing the dishes at 7:15 but you remember you gotta be at the oscars by 7:30 http://t.co/27faqodhpm

['rt', '@dory:', 'when', "you're", 'washing', 'the', 'dishes', 'at', '7:15', 'but', 'you', 'remember', 'you', 'gotta', 'be', 'at', 'the', 'oscars', 'by', '7:30', 'http://t.co/27faqodhpm']

– Set Your Query Terms and Score Each Document (Tweet)

In [3]:
# Set your query terms for TF-IDF scoring
QUERY_TERMS = ['lego']

# TextCollection provides tf, idf, and tf_idf abstractions so
# that we don't have to maintain/compute them ourselves
import nltk
tc = nltk.TextCollection(words_corpus)

relevant_tweets = []

for idx in range(len(words_corpus)):
    score = 0
    for term in [t.lower() for t in QUERY_TERMS]:
        score += tc.tf_idf(term, words_corpus[idx])
    if score > 0:
        relevant_tweets.append({'score':score, 'tweet':tweet_dictionary[idx]})
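
For reference, tc.tf_idf(term, text) is just the product of two simpler quantities that we could compute ourselves. A minimal sketch of that computation over the words_corpus built above (an approximation of what NLTK's TextCollection does: TF as a term's relative frequency within one document, IDF as the log of the total number of documents over the number of documents containing the term; caching and edge cases omitted):

import math

def tf(term, document):
    # relative frequency of the term within one tokenized tweet
    return document.count(term) / float(len(document))

def idf(term, corpus):
    # log of (number of tweets) / (number of tweets containing the term)
    matches = len([doc for doc in corpus if term in doc])
    return math.log(len(corpus) / float(matches)) if matches else 0.0

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# this should closely track tc.tf_idf('lego', words_corpus[idx]) for any idx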

– Sort by Score and Display Results

In [5]:
relevant_tweets = sorted(relevant_tweets, key=lambda p: p['score'], reverse=True)
for tweet in relevant_tweets[:5]:
    print tweet['tweet']
    print '\tScore: %s' % (tweet['score'],)
    print 
how the lego oscars were built http://t.co/glbdphfyn9

    Score: 0.867215250635

http://t.co/lghymlygns - is getting a lego oscar bet

    Score: 0.758813344306

see how the awesome lego oscars were made https://t.co/lheategesj

    Score: 0.674500750494

how the lego oscars were built - gif on imgur

    Score: 0.607050675445

rt @thingswork: this is how the lego oscars were built http://t.co/kzuabkuy1u

    Score: 0.551864250404


2. Sentiment Analysis

– Scoring Positivity (or Negativity) of Tweets

In [7]:
# source: http://textblob.readthedocs.org/en/dev/quickstart.html#sentiment-analysis
# The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). 
# The polarity score is a float within the range [-1.0, 1.0]. 
# The subjectivity is a float within the range [0.0, 1.0] 
# where 0.0 is very objective and 1.0 is very subjective.

from textblob import TextBlob

positive_tweets = []
for idx in range(len(words_corpus)):
    positivity = TextBlob(tweet_dictionary[idx]).sentiment.polarity
    # despite the list's name, this filter keeps the most NEGATIVE tweets
    # (polarity <= -0.9); flip the inequality to collect positive ones instead
    if positivity <= -0.9:
        positive_tweets.append({'positivity':positivity, 'tweet':tweet_dictionary[idx]})

positive_tweets = sorted(positive_tweets, key=lambda p: p['positivity'], reverse=True)
for tweet in positive_tweets[:5]:
    print tweet['tweet']
    print '\tScore: %s' % (tweet['positivity'],)
    print 
zendaya defends oscars dreadlocks after 'outrageously offensive' remark via @abc7ny http://t.co/jirc40gy8p

    Score: -1.0

@mrbradgoreski travolta was the worst dressed wax figure at the oscars.

    Score: -1.0

the amount of pics of scarlett johansson &amp; john travolta at the oscars people texted me is obscene. i hate u all! (and u know me so well.)

    Score: -1.0

rt @mygeektime: just getting over an awful stomach virus...

    Score: -1.0

behati's style at the oscars was the worst ive ever seen omg

    Score: -1.0
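
The cell above surfaces the most negative tweets. Flipping the threshold would surface the most positive ones; a sketch of that variant (the 0.9 cutoff is an arbitrary choice, and this was not run on the dataset, so no output is shown):

happy_tweets = []
for idx in range(len(words_corpus)):
    polarity = TextBlob(tweet_dictionary[idx]).sentiment.polarity
    if polarity >= 0.9:   # keep only strongly positive tweets
        happy_tweets.append({'positivity': polarity, 'tweet': tweet_dictionary[idx]})

happy_tweets = sorted(happy_tweets, key=lambda p: p['positivity'], reverse=True)
for tweet in happy_tweets[:5]:
    print tweet['tweet']
    print '\tScore: %s' % (tweet['positivity'],)
    print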


– Scoring Subjectivity (or Objectivity) of Tweets

In [8]:
subjective_tweets = []
for idx in range(len(words_corpus)):
    subjectivity = TextBlob(tweet_dictionary[idx]).sentiment.subjectivity
    if subjectivity >= 1:   # keep only tweets TextBlob rates as fully subjective (1.0)
        subjective_tweets.append({'subjectivity':subjectivity, 'tweet':tweet_dictionary[idx]})

subjective_tweets = sorted(subjective_tweets, key=lambda p: p['subjectivity'], reverse=True)
for tweet in subjective_tweets[:5]:
    print tweet['tweet']
    print '\tScore: %s' % (tweet['subjectivity'],)
    print 
rt @9gag: remember the greatest oscars ever? http://t.co/qw3xdbmne9

    Score: 1.0

rt @logotv: confirmed. @actuallynph's bulge at the #oscars was indeed padded. watch: http://t.co/a8iaxitxcu

    Score: 1.0

rt @girlposts: remember the greatest oscars ever? http://t.co/ij9fm4cdhm

    Score: 1.0

rt @ryanabe: another oscars, another sad leo.

    Score: 1.0

rt @9gag: remember the greatest oscars ever? http://t.co/qw3xdbmne9

    Score: 1.0



[2015 Spring AI: Mining the Social Web] TF-IDF (1)

  • Class material for February 26, 2015

After TF-IDF scoring of 100 documents…

  1. All words in the 100 docs are in the corpus.
  2. Separately, each word in each doc has its own TF-IDF score. That is, each doc is represented as a vector of TF-IDF scores over all words in the corpus.
  • e.g.) It was awesome! -> [0, .2345, 0, 0, …, 1.23, 3.4] (if the corpus is ordered as [“you”, “it”, “sucks”, “cold”, …, “was”, “awesome”])

[Figure 9-1]
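
A minimal, self-contained sketch of this vector representation in code (the three toy documents below stand in for the 100 docs and are purely illustrative):

import nltk

# toy corpus standing in for the 100 scored documents
docs = [['it', 'was', 'awesome'],
        ['it', 'sucks'],
        ['you', 'know', 'it', 'was', 'cold']]
tc = nltk.TextCollection(docs)

# fix a vocabulary ordering so every doc maps to a vector of the same length
vocabulary = sorted(set(word for doc in docs for word in doc))

def to_tfidf_vector(document):
    # one TF-IDF score per vocabulary word; 0.0 for words absent from the doc
    return [tc.tf_idf(word, document) for word in vocabulary]

print vocabulary
print to_tfidf_vector(docs[0])   # "it was awesome" as a TF-IDF vector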

What is this TF-IDF for?

We’ve learned a lot about the TF-IDF method: how to calculate the TF score and the IDF score, the conceptual assumption behind the method (Bag of Words), and so on. So what is it for? How, and for what, can we use it?

  1. Have a seat with your group members.
  2. Discuss how to use this score in general or for your project. (10 min)

Is TF-IDF better than just counting hits?

One of the easiest ways to find documents relevant to a specific query is to find the documents that contain the query words many times. In what situations does TF-IDF work better than this?
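
One way to make the contrast concrete is to rank the same tweets both ways. A sketch reusing tc, words_corpus, and tweet_dictionary from the notebook above (the query word is an arbitrary choice, presumably very common in this dataset):

# rank tweets for a query two ways: raw hit counts vs. TF-IDF
query = ['oscars']

def hit_count(document):
    return sum(document.count(term) for term in query)

def tfidf_score(document):
    return sum(tc.tf_idf(term, document) for term in query)

by_hits = sorted(range(len(words_corpus)), key=lambda i: hit_count(words_corpus[i]), reverse=True)
by_tfidf = sorted(range(len(words_corpus)), key=lambda i: tfidf_score(words_corpus[i]), reverse=True)

# raw counts favor tweets that simply repeat the word; TF-IDF also
# normalizes by tweet length and discounts words found in many tweets
for i in by_hits[:5]:
    print 'hits=%d\t%s' % (hit_count(words_corpus[i]), tweet_dictionary[i].strip())
for i in by_tfidf[:5]:
    print 'tf-idf=%.3f\t%s' % (tfidf_score(words_corpus[i]), tweet_dictionary[i].strip())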
