[2015 Spring AI: Mining the Social Web] TF-IDF (1)

  • Class material for February 26, 2015

After TF-IDF scoring of 100 documents…

  1. All words across the 100 docs make up the corpus vocabulary.
  2. Each word in each doc has its own TF-IDF score. That is, each doc is represented as the vector of TF-IDF scores over all words in the corpus.
  • e.g.) “It was awesome!” -> [0, .2345, 0, 0, …, 1.23, 3.4] (if the corpus vocabulary is ordered as [“you”, “it”, “sucks”, “cold”, …, “was”, “awesome”])
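As a toy illustration, the sketch below builds such vectors with scikit-learn’s TfidfVectorizer. The three-document corpus is made up for this example; it is not the class data, and the exact numbers above (.2345, 1.23, …) will not be reproduced.

```python
# A minimal sketch of representing docs as TF-IDF vectors with
# scikit-learn (toy corpus; scores are illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It was awesome!",
    "It sucks. It was cold.",
    "You were awesome.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)      # one row per document

print(vectorizer.get_feature_names_out())   # corpus vocabulary (column order)
print(tfidf.toarray()[0])                   # TF-IDF vector for "It was awesome!"
```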

[Figure 9.1]

What is this TF-IDF for?

We’ve learned a lot about the TF-IDF method: how to calculate the TF score and the IDF score, the conceptual assumption behind the method (Bag of Words), and so on. Then, what is it for? What can we use it for, and how?

  1. Have a seat with your group members.
  2. Discuss how to use this score, either in general or for your project. (10 min)

Is TF-IDF better than just counting hits?

One of the easiest ways to find documents relevant to a specific query is to find the documents that contain the query words many times. In what situations does TF-IDF work better than that?
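One hint, sketched below with a toy corpus: the IDF term down-weights words that occur in many documents, so a document that merely repeats a ubiquitous word (like “the”) gains little relevance from those repetitions.

```python
# Toy sketch: IDF down-weights corpus-wide common words, so raw
# repetition of "the" buys little TF-IDF relevance (illustrative data).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the the the the cat",
    "the cat sat on the mat",
    "the dog barked at the mailman",
]

vec = TfidfVectorizer().fit(docs)
# "the" appears in every document, so it receives the smallest IDF.
print(dict(zip(vec.get_feature_names_out(), vec.idf_.round(3))))
```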


[2015 Spring, Complex System Seminar] Creativity and Innovation in Scaling of Cities

Brief Review

Basically, the power-law relationship between an organism’s size and the amount of its interactions or interaction results was first addressed in biology, where it is known as the allometric scaling problem (1, 2). After several explanatory models for this allometric scaling phenomenon (2, 3), social science researchers tried to apply these biological characteristics to the urbanization process and its results (4, 5, 6, 7).

The scaling results show that diverse urban features, even ones unrelated to each other, also exhibit power-law scaling with city size across various cities worldwide. In particular, the scaling exponents of urban features, which represent the degree of (non)linearity, turn out to differ depending on the characteristics of the features. For example, quantities related to material infrastructure, such as the number of gas stations or the length of electrical cables, scale sublinearly (exponent smaller than 1), while those related to social currencies, such as information, innovation, or wealth, scale superlinearly (exponent bigger than 1) (6).

To explain the process by which this urban scaling emerges, Luís M. A. Bettencourt constructed a model assuming four characteristics of urban behavior: mixing populations, incremental growth of infrastructure networks, bounded human effort, and socio-economic output proportional to local interactions (7).
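For reference, these exponents are usually estimated by ordinary least squares on log-transformed values, since Y = Y0 · N^β implies log Y = log Y0 + β · log N. A minimal sketch with synthetic data (all numbers below are made up):

```python
# Sketch: estimating a scaling exponent beta in Y = Y0 * N**beta by
# OLS on log-log values (synthetic data; beta = 1.15 is arbitrary).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
N = rng.uniform(1e4, 1e7, size=300)              # hypothetical city populations
Y = 2.0 * N**1.15 * rng.lognormal(0, 0.3, 300)   # superlinear urban quantity

X = sm.add_constant(np.log(N))
fit = sm.OLS(np.log(Y), X).fit()
print(fit.params[1])   # estimated beta: > 1 superlinear, < 1 sublinear
```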

Arguments

The Linearity of Residuals

[Fig. 1: linearity of residuals]
As shown in Fig. 1, the size of the residuals also has a relationship with the number of crimes (and thus also with the size of cities). That means the data may have a problem of heteroscedasticity. According to Wikipedia,

“(…) regression analysis using heteroscedastic data will still provide an unbiased estimate for the relationship between the predictor variable and the outcome, but standard errors and therefore inferences obtained from data analysis are suspect. Biased standard errors lead to biased inference, so results of hypothesis tests are possibly wrong.” (8)

If that is the case, can we say anything more than “the bigger the city, the more crime there is”? Furthermore, what should we be careful about when applying existing regression methods to log-scaled values?
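One way to probe this concern is a formal heteroscedasticity test on the log-log fit, e.g. the Breusch-Pagan test in StatsModels. The sketch below uses synthetic data constructed to be heteroscedastic; it is an illustration, not the paper’s actual analysis.

```python
# Sketch: Breusch-Pagan test for heteroscedastic residuals in a
# log-log regression (synthetic, deliberately heteroscedastic data).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
log_N = np.log(rng.uniform(1e4, 1e7, size=300))        # hypothetical city sizes
noise_sd = 0.1 + 0.05 * (log_N - log_N.min())          # noise grows with size
log_Y = 0.7 + 1.15 * log_N + rng.normal(0, noise_sd)

X = sm.add_constant(log_N)
fit = sm.OLS(log_Y, X).fit()
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)   # a small p-value rejects homoscedasticity
```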

Overemphasis on Population

Assume that there are three quantities named A, B, and C. If A has a linear relationship with B, and B has a linear relationship with C, then A is highly likely to have a linear relationship with C as well. In this situation, the three pairwise linear relationships among A, B, and C cannot, by themselves, explain any underlying causality among them.

However, as far as I understand, some explanations of urban scaling implicitly claim that city size – generally measured by population – is the main factor causing the quantitative differences. For example, Luis Bettencourt mentioned that

“On average, as city size increases, per capita socio-economic quantities such as wages, GDP, number of patents produced and number of educational and research institutions all increase by approximately 15% more than the expected linear growth.”(5)

However, doesn’t it sound just as natural that the higher average wages of a particular city make people move into it, so the city gets bigger? Similarly, we could say that a higher level of educational and informational environments, or the higher profits of companies, made the city bigger by gravitating nearby people to move in. Furthermore, multiple positive feedback loops among combinations of various urban features may together drive the growth of a city.

Reference

1 – West, Geoffrey B., James H. Brown, and Brian J. Enquist. “A general model for the origin of allometric scaling laws in biology.” Science 276.5309 (1997): 122-126.

2 – West, Geoffrey B., James H. Brown, and Brian J. Enquist. “The fourth dimension of life: fractal geometry and allometric scaling of organisms.” Science 284.5420 (1999): 1677-1679.

3 – West, Geoffrey B., James H. Brown, and Brian J. Enquist. “A general model for the structure and allometry of plant vascular systems.” Nature 400.6745 (1999): 664-667.

4 – Batty, Michael. “The size, scale, and shape of cities.” Science 319.5864 (2008): 769-771.

5 – Bettencourt, Luis, and Geoffrey West. “A unified theory of urban living.” Nature 467.7318 (2010): 912-913.

6 – Bettencourt, Luís MA, et al. “Growth, innovation, scaling, and the pace of life in cities.” Proceedings of the National Academy of Sciences 104.17 (2007): 7301-7306.

7 – Bettencourt, Luís MA. “The origins of scaling in cities.” Science 340.6139 (2013): 1438-1441.

8 – “Heteroscedasticity”, Wikipedia.org


How to be Effective in the Classroom: From Communicating Difficult Concepts to Storytelling

Part of the AI (associate instructor) classes for the 2015 spring semester.

Time & Brief Explanation

  • Friday, February 20th, 10:00 am to 11:30 am
  • Professor Siegel, who gets many kudos from students for his engaging teaching methods, will lead this spring’s interactive workshop for AIs. Come be inspired by his session on communication in the classroom and making a difference.
    (Professor Marty Siegel)

Contents Summary

  • Not to try covering the material, but to try “uncovering” the material

  • Playing the whole game

  • A math class should help students understand the ongoing “games” played in the field of mathematics.
  • Ex> Learning soccer:
  • you play from the very first day of learning
  • there is no “Soccer 101: Kick”, “Soccer 201: Pass”, and so on.
  • Then, what’s the game of my field, or of the field of the course?
  • Show students the big picture. Make them imagine it.
  • Communicating with students about these topics gives them “a-ha” moments.

  • Storytelling

  • Telling material as a story helps students remember it very well.
  • Personal stories make you more human, approachable, and easier to communicate with.

For I400 (the class for which I’m currently an AI)

  • Possible questions for an application
  • Why are you taking this course? What made you decide to take it?
  • Why is social media data important? How does social media data differ from other, existing data sources? Why do you want to collect and use “social media data” rather than other sources?
  • What kinds of commands (or sets of commands) do we need to do what you want? Can you explain why each part is necessary?
  • (I missed the later part of the class because of another group meeting.)

Poisson Distribution and Regression (+ Zero Inflation)

Poisson Distribution (from Wikipedia)

  • “… is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.”
  • Wow, it fits the dependent variable in my research exactly:
    • fixed interval of time and space
    • time intervals between events have an average but are independent of each other
  • If a discrete random variable X has a Poisson distribution with parameter lambda > 0,
    • lambda = E(X) = Var(X), i.e., the mean equals the variance
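A quick numerical sanity check of that mean-equals-variance property (lambda = 3.5 is an arbitrary choice):

```python
# Quick check: a Poisson sample's mean and variance both
# approximate lambda (lambda = 3.5 is arbitrary).
import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=100_000)
print(x.mean(), x.var())   # both close to 3.5
```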

Poisson Regression (from Wikipedia)

  • Poisson regression assumes,
    • the response variable Y has a Poisson distribution
    • the logarithm of its expected value can be modeled by a linear combination of unknown parameters.
  • Also called log-linear model
  • Zero inflation?
    • “Another common problem with Poisson regression is excess zeros: if there are two processes at work, one determining whether there are zero events or any events, and a Poisson process determining how many events there are, there will be more zeros than a Poisson regression would predict. An example would be the distribution of cigarettes smoked in an hour by members of a group where some individuals are non-smokers.”
    • Zero-inflated model
      • “… a zero-inflated model is a statistical model based on a zero-inflated probability distribution, i.e. a distribution that allows for frequent zero-valued observations.”
      • “The zero-inflated Poisson model concerns a random event containing excess zero-count data in unit time.” : fit to my dataset!

In practice
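As a starting point, here is a minimal sketch of fitting a Poisson regression with StatsModels on synthetic data (recent StatsModels releases also provide statsmodels.discrete.count_model.ZeroInflatedPoisson for the excess-zeros case):

```python
# A minimal sketch of Poisson regression with StatsModels
# (synthetic data; fitted coefficients are on the log scale).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=500)
y = rng.poisson(np.exp(0.5 + 1.2 * x))   # log E[Y] = 0.5 + 1.2 * x

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)   # should recover roughly [0.5, 1.2]
```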


Multicollinearity (mainly from Wikipedia)


Meaning and Effect

  • Multicollinearity (also collinearity) is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.
  • Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors.

Detection

  • Large changes in the estimated regression coefficients when a predictor variable is added or deleted.
  • If one variable is statistically significant in a simple linear regression model, but not in a multiple regression model.
  • If the condition number is above 30, the regression is said to have significant multicollinearity. The condition number appears in the regression result summary of Python’s StatsModels (see the sketch after this list).
  • Use correlation matrix. Correlation values (off-diagonal elements) of at least .4 are sometimes interpreted as indicating a multicollinearity problem.
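A hedged sketch of two of these checks on synthetic, deliberately collinear predictors (the column names are made up):

```python
# Sketch: correlation matrix and condition number as multicollinearity
# checks (synthetic data; x2 is built to nearly duplicate x1).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),   # nearly collinear with x1
    "x3": rng.normal(size=200),
})

print(df.corr().round(2))          # large off-diagonal values are a red flag
X = sm.add_constant(df)
print(np.linalg.cond(X.values))    # far above 30 for this data
```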

Remedies

  • Avoid the dummy variable trap: including a dummy variable for every category together with the regression constant produces perfect multicollinearity, so leave one category out.
  • Drop one of the variables. However, you lose information (because you’ve dropped a variable). Omission of a relevant variable results in biased coefficient estimates for the remaining explanatory variables that are correlated with the dropped variable.
  • Standardize your independent variables. This may help reduce a false flagging of a condition number above 30. (Note that the standardization function “preprocessing.scale” belongs to scikit-learn, not NumPy.) Of course, the disadvantages of standardization should also be considered when interpreting the regression results. A sketch follows this list.
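A minimal sketch of that standardization step (the two-column DataFrame is hypothetical):

```python
# Sketch: standardizing predictors to zero mean and unit variance
# with scikit-learn's preprocessing.scale (hypothetical DataFrame).
import pandas as pd
from sklearn.preprocessing import scale

df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [10.0, 20.0, 40.0]})
df_std = pd.DataFrame(scale(df), columns=df.columns)
print(df_std.mean().round(6))        # ~0 for each column
print(df_std.std(ddof=0).round(6))   # 1 for each column
```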

Heteroscedasticity (mainly from Wikipedia)

Meaning and Effect

  • In statistics, a collection of random variables is heteroscedastic if there are sub-populations that have different variabilities from others. That is, if the variance of the dependent values in a dataset differs (significantly) depending on the related independent values, we can say the dataset is heteroscedastic.
  • Example 1: A classic example of heteroscedasticity is that of income versus expenditure on meals. As one’s income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes display a greater variability of food consumption. (Higher income – the independent variable – causes higher variance in expenditure on meals – the dependent variable.)

Effect

Heteroscedasticity does not cause ordinary least squares coefficient estimates to be biased, although it can cause ordinary least squares estimates of the variance (and, thus, standard errors) of the coefficients to be biased, possibly above or below the true or population variance. Thus, regression analysis using heteroscedastic data will still provide an unbiased estimate for the relationship between the predictor variable and the outcome, but standard errors and therefore inferences obtained from data analysis are suspect. Biased standard errors lead to biased inference, so results of hypothesis tests are possibly wrong. For example, if OLS is performed on a heteroscedastic data set, yielding biased standard error estimation, a researcher might fail to reject a null hypothesis at a given significance level, when that null hypothesis was actually uncharacteristic of the actual population (making a type II error).
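In practice, a common response to this problem is to keep the OLS point estimates but compute heteroscedasticity-robust (White/HC) standard errors, which StatsModels supports directly; a sketch on synthetic heteroscedastic data:

```python
# Sketch: OLS with heteroscedasticity-robust (HC3) standard errors
# in StatsModels (synthetic data whose noise grows with x).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=300)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 + 0.3 * x)   # noise grows with x

X = sm.add_constant(x)
plain = sm.OLS(y, X).fit()                  # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC3")   # robust standard errors
print(plain.bse)
print(robust.bse)   # same coefficients, different standard errors
```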
