[2015 Spring AI: Mining the Social Web] TF-IDF (2) & Sentiment Analysis

1. TF-IDF

– Open the Data File, then make Words Corpus and Tweet Dictionary

In [2]:
# Mostly simliar to Example 4-9. Querying Google+ data with TF-IDF 
# in our textbook "Mining the Social Web" 4.4.2 Applying TF-IDF to Human Languages

data = "oscar_tweets.txt"
tweet_dictionary = {}
words_corpus = []
i = 0
for line in open(data):
    if len(line.strip().split())!=0:
        tweet_dictionary[i] = line.lower()
        words_corpus.append(line.lower().split())
        i += 1
print tweet_dictionary[1]
print words_corpus[1]
rt @dory: when you're washing the dishes at 7:15 but you remember you gotta be at the oscars by 7:30 http://t.co/27faqodhpm

['rt', '@dory:', 'when', "you're", 'washing', 'the', 'dishes', 'at', '7:15', 'but', 'you', 'remember', 'you', 'gotta', 'be', 'at', 'the', 'oscars', 'by', '7:30', 'http://t.co/27faqodhpm']

– Set Your Query Terms and Scoring Each Document (Tweet)

In [3]:
# Set your query with tf-idf method
QUERY_TERMS = ['lego']

# TextCollection provides tf, idf, and tf_idf abstractions so
# that we don't have to maintain/compute them ourselves
import nltk
tc = nltk.TextCollection(words_corpus)

relevant_tweets = []

for idx in range(len(words_corpus)):
    score = 0
    for term in [t.lower() for t in QUERY_TERMS]:
        score += tc.tf_idf(term, words_corpus[idx])
    if score > 0:
        relevant_tweets.append({'score':score, 'tweet':tweet_dictionary[idx]})

– Sort by Score and Display Results

In [5]:
relevant_tweets = sorted(relevant_tweets, key=lambda p: p['score'], reverse=True)
for tweet in relevant_tweets[:5]:
    print tweet['tweet']
    print '\tScore: %s' % (tweet['score'],)
    print 
how the lego oscars were built http://t.co/glbdphfyn9

    Score: 0.867215250635

http://t.co/lghymlygns - is getting a lego oscar bet

    Score: 0.758813344306

see how the awesome lego oscars were made https://t.co/lheategesj

    Score: 0.674500750494

how the lego oscars were built - gif on imgur

    Score: 0.607050675445

rt @thingswork: this is how the lego oscars were built http://t.co/kzuabkuy1u

    Score: 0.551864250404


2. Sentiment Analysis

– Scoring Positivity (or Negativity) of Tweets

In [7]:
# source: http://textblob.readthedocs.org/en/dev/quickstart.html#sentiment-analysis
# The sentiment property returns a namedtuple of the form Sentiment(polarity, subjectivity). 
# The polarity score is a float within the range [-1.0, 1.0]. 
# The subjectivity is a float within the range [0.0, 1.0] 
# where 0.0 is very objective and 1.0 is very subjective.

from textblob import TextBlob

positive_tweets = []
for idx in range(len(words_corpus)):
    positivity = TextBlob(tweet_dictionary[idx]).sentiment.polarity
    subjectivity = TextBlob(tweet_dictionary[idx]).sentiment.subjectivity
    if positivity <= -0.9:
        positive_tweets.append({'positivity':positivity, 'tweet':tweet_dictionary[idx]})

positive_tweets = sorted(positive_tweets, key=lambda p: p['positivity'], reverse=True)
for tweet in positive_tweets[:5]:
    print tweet['tweet']
    print '\tScore: %s' % (tweet['positivity'],)
    print 
zendaya defends oscars dreadlocks after 'outrageously offensive' remark via @abc7ny http://t.co/jirc40gy8p

    Score: -1.0

@mrbradgoreski travolta was the worst dressed wax figure at the oscars.

    Score: -1.0

the amount of pics of scarlett johansson &amp; john travolta at the oscars people texted me is obscene. i hate u all! (and u know me so well.)

    Score: -1.0

rt @mygeektime: just getting over an awful stomach virus...

    Score: -1.0

behati's style at the oscars was the worst ive ever seen omg

    Score: -1.0


– Scoring Subjectivity (or Objectivity) of Tweets

In [8]:
subjective_tweets = []
for idx in range(len(words_corpus)):
    subjectivity = TextBlob(tweet_dictionary[idx]).sentiment.subjectivity
    if subjectivity >= 1:
        subjective_tweets.append({'subjectivity':subjectivity, 'tweet':tweet_dictionary[idx]})

subjective_tweets = sorted(subjective_tweets, key=lambda p: p['subjectivity'], reverse=True)
for tweet in subjective_tweets[:5]:
    print tweet['tweet']
    print '\tScore: %s' % (tweet['subjectivity'],)
    print 
rt @9gag: remember the greatest oscars ever? http://t.co/qw3xdbmne9

    Score: 1.0

rt @logotv: confirmed. @actuallynph's bulge at the #oscars was indeed padded. watch: http://t.co/a8iaxitxcu

    Score: 1.0

rt @girlposts: remember the greatest oscars ever? http://t.co/ij9fm4cdhm

    Score: 1.0

rt @ryanabe: another oscars, another sad leo.

    Score: 1.0

rt @9gag: remember the greatest oscars ever? http://t.co/qw3xdbmne9

    Score: 1.0


Advertisements
[2015 Spring AI: Mining the Social Web] TF-IDF (2) & Sentiment Analysis

[2015 Spring AI: Mining the Social Web] TF-IDF (1)

  • Class material for February 26, 2015

After TF-IDF scoring of 100 documents…

  1. All words in 100 docs are in the corpus.
  2. Separately, each word in each doc has its own TF-IDF score. That is, each doc is represented as the vector of TF-IDF scores of all words in the corpus.
  • e.g.) It was awesome! -> [0, .2345, 0, 0, …, 1.23, 3.4] (if the corpus is ordered as [“you”, “it”, “sucks” , “cold”, …, “was”, “awesome”])

Figure_9_1

What is this TF-IDF for?

We’ve learned much about TF-IDF method; how to calculate TF score and IDF score, how the conceptual assumption in this method (Bag of Words) and so on. Then, what is this for? How can we use this for what?

  1. Having a seat with your group members.
  2. Discuss how to use this score generally or for your project. (10 min)

Is TF-IDF better than just counting hits?

One of the easiest way to find relevant documents about a specific query is finding the documents which contain the query words many times. In what situation, does TF-IDF work better than this?

[2015 Spring AI: Mining the Social Web] TF-IDF (1)

How to be Effective in the Classroom: From Communicating Difficult Concepts to Storytelling

Part of AI classes for 2015 spring semester.

Time & Brief Explanation

  • Friday, February 20th, 10:00 am to 11:30 am
  • Professor Siegel, who gets many kudos from students for his engaging teaching methods, will lead this spring interactive workshop for AI’s. Come be inspired by his session on communication in the classroom and making a difference.
    (Professor Marty Siegel)

Contents Summary

  • Not to try covering the material, but to try “uncovering” the material

  • Playing the whole game

  • Math class should make students understand ongoing games on mathematics field.
  • Ex> Learning soccer:
  • playing from the very first day of learning
  • doesn’t exist “Soccer 101: Kick”, “Soccer 201: Pass” and so on.
  • Then, what’s the game of my field, or the field of the course?
  • Show students the big picture. Make them imagine it.
  • Communicating with students about these topics provides them A HA moments.

  • Storytelling

  • Telling as a story helps students to remember very well.
  • Personal story makes you more human, approachable, communicatable.

For I400 (the class for which currently I’m doing AI )

  • Possible questions for an applicatoin
  • Why are you taking this course? What made you to decide to take this?
  • Why is social media data important? What are the differences of social media data compared to previous other data sources? Why do you want to collect and use “social media data” rather than other sources?
  • What kinds of commands (or set of commands) do we need to do what you want? Can you explain why each part is necessary?
  • (I missed later part of the class (because of another group meeting)).
How to be Effective in the Classroom: From Communicating Difficult Concepts to Storytelling