Question 1. Naive Bayes and Laplace smoothing. We are given a set of keywords that are presumed to be informative about whether a news site article is about sports, business or another subject. These words are {NFL, inflation, Trump, pitcher, stocks, England, Florida, storm}. The following table shows whether each keyword appeared in each article of a training set of news site articles, along with the article's category. The table also includes a column with the length of each article (in number of words).
Category | NFL | Inflation | Trump | Pitcher | Stocks | England | Florida | Storm | Length |
Sports | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 220 |
Sports | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 245 |
Sports | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 191 |
Business | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 321 |
Business | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 290 |
Business | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 437 |
Other | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 93 |
Other | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 96 |
Other | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 486 |
Other | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 302 |
a. For each of the 0-1 variables associated with the words, calculate the probability of the word occurring in each category, as in the table below.
Category | NFL | Inflation | Trump | Pitcher | Stocks | England | Florida | Storm |
Sports | | | | | | | | |
Business | | | | | | | | |
Other | | | | | | | | |
You should fill this table by writing the conditional probability of each keyword (column header) given the category (row title). For instance, in the cell with row Business and column Trump you should write Pr(Trump | Business). Write your answer in fraction form (e.g. 3/7) instead of decimal form.
b. Suppose a new article comes in that contains the words “NFL”, “inflation” and “Trump”, and none of the other keywords. Determine which category has the largest conditional probability given this set of words. You can skip computing the denominator in Bayes formula and only calculate the numerators, since you just want to know which category is most likely. (Ignore the article length for now.)
c. Apply Laplace smoothing with α = 1 and β = 10 for all 0-1 variables, recalculate the probabilities, and write them in the following table. Remember to use fraction form, not decimal notation.
Category | NFL | Inflation | Trump | Pitcher | Stocks | England | Florida | Storm |
Sports | | | | | | | | |
Business | | | | | | | | |
Other | | | | | | | | |
d. Now repeat question 1b), but this time use the Laplace smoothed probabilities.
e. Now assume that the length of the article follows a normal distribution, with a different mean and standard deviation for each category. For each category, calculate the sample mean and sample standard deviation (divide by n, not n − 1).
f. Suppose the article with the words “NFL”, “inflation” and “Trump” has 150 words. Repeat part 1d), but also include the density function of the normal distribution for the length of the article. Does this change the verdict of Naive Bayes as to which category is most likely?
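For parts e and f, the length-based quantities can be sketched as follows. This is a minimal illustration, assuming the lengths are copied correctly from the training table; the names `LENGTHS`, `normal_pdf`, `stats` and `density_at_150` are hypothetical. It uses the population standard deviation (divide by n), as the question asks, and evaluates each category's normal density at a length of 150 words.

```python
import math
from statistics import fmean, pstdev

# Article lengths per category, copied from the training table.
LENGTHS = {
    "Sports": [220, 245, 191],
    "Business": [321, 290, 437],
    "Other": [93, 96, 486, 302],
}

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# pstdev divides by n (population form), not n - 1.
stats = {cat: (fmean(v), pstdev(v)) for cat, v in LENGTHS.items()}

# Density of a 150-word article under each category's fitted normal.
density_at_150 = {cat: normal_pdf(150, mu, sd) for cat, (mu, sd) in stats.items()}

for cat, (mu, sd) in stats.items():
    print(f"{cat}: mean={mu:.3f}, sd={sd:.3f}, density(150)={density_at_150[cat]:.6g}")
```

In part f, each density would multiply the corresponding smoothed numerator from part d before the categories are compared.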
a)
Category | NFL | Inflation | Trump | Pitcher | Stocks | England | Florida | Storm |
Sports | 2/3 | 0 | 1/3 | 1/3 | 0 | 1/3 | 1/3 | 1/3 |
Business | 0 | 1/3 | 1/3 | 0 | 2/3 | 1/3 | 1/3 | 1/3 |
Other | 1/4 | 1/4 | 2/4 | 1/4 | 1/4 | 2/4 | 2/4 | 2/4 |
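The table can be checked with a short script. This is a minimal sketch, assuming the 0-1 indicators are copied correctly from the training table; `KEYWORDS`, `ROWS` and `conditional_table` are hypothetical names. Each entry is the number of articles in the category containing the keyword, divided by the number of articles in the category.

```python
from fractions import Fraction

KEYWORDS = ["NFL", "Inflation", "Trump", "Pitcher",
            "Stocks", "England", "Florida", "Storm"]

# (category, keyword indicators) copied from the training table.
ROWS = [
    ("Sports",   [1, 0, 0, 0, 0, 0, 1, 0]),
    ("Sports",   [1, 0, 0, 0, 0, 1, 0, 1]),
    ("Sports",   [0, 0, 1, 1, 0, 0, 0, 0]),
    ("Business", [0, 1, 0, 0, 1, 0, 0, 1]),
    ("Business", [0, 0, 0, 0, 1, 0, 0, 0]),
    ("Business", [0, 0, 1, 0, 0, 1, 1, 0]),
    ("Other",    [0, 1, 0, 0, 0, 1, 0, 1]),
    ("Other",    [0, 0, 1, 0, 0, 0, 1, 0]),
    ("Other",    [0, 0, 0, 1, 0, 1, 1, 0]),
    ("Other",    [1, 0, 1, 0, 1, 0, 0, 1]),
]

def conditional_table(rows):
    """Return {category: {keyword: Pr(keyword | category)}} as exact fractions."""
    by_cat = {}
    for cat, flags in rows:
        by_cat.setdefault(cat, []).append(flags)
    table = {}
    for cat, group in by_cat.items():
        n = len(group)  # number of training articles in this category
        table[cat] = {kw: Fraction(sum(f[i] for f in group), n)
                      for i, kw in enumerate(KEYWORDS)}
    return table

table = conditional_table(ROWS)
for cat, probs in table.items():
    print(cat, "|", " | ".join(str(probs[kw]) for kw in KEYWORDS))
```

Note that `Fraction` reduces automatically, so the entries written as 2/4 above print as 1/2.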
b) The keywords are NFL, inflation and Trump.
Counting occurrences of these three keywords in the training set: there are 3 in the sports category, 2 in the business category, and 4 in the other category.
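Define the events
A1: [article belongs to the sports category]
A2: [article belongs to the business category]
A3: [article belongs to the other category]
B: [an article contains all 3 words and none of the other keywords]
Bayes formula is given by
P(Ai | B) = P(B | Ai) P(Ai) / [P(B | A1) P(A1) + P(B | A2) P(A2) + P(B | A3) P(A3)], for i = 1, 2, 3.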
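Under the Naive Bayes assumption, P(B|Ai) is the product of the per-keyword probabilities from part a), using p for each of the three keywords the article contains and 1 − p for each of the five it does not. P(B|A1) = 0 because P(inflation|Sports) = 0, and P(B|A2) = 0 because P(NFL|Business) = 0. For the other category, P(B|A3) = (1/4)(1/4)(2/4)(3/4)(3/4)(2/4)(2/4)(2/4) = 9/4096, so the numerator is P(B|A3) P(A3) = (9/4096)(4/10) = 9/10240. Only the other category has a nonzero numerator, so without smoothing Naive Bayes assigns the article to Other.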
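The Naive Bayes numerators for part b can be sketched in a few lines. This is a minimal illustration, assuming the Bernoulli form of Naive Bayes, in which an absent keyword contributes a factor of 1 − p and the priors are the category shares of the 10 training articles; `PRIOR`, `COND`, `PRESENT` and `numerator` are hypothetical names. Under the independence assumption the likelihood is built keyword by keyword, not from the count of training articles that contain all three words at once.

```python
from fractions import Fraction as F

KEYWORDS = ["NFL", "Inflation", "Trump", "Pitcher",
            "Stocks", "England", "Florida", "Storm"]

# Priors: share of the 10 training articles in each category.
PRIOR = {"Sports": F(3, 10), "Business": F(3, 10), "Other": F(4, 10)}

# Unsmoothed Pr(keyword | category), copied from the table in part a).
COND = {
    "Sports":   [F(2, 3), F(0), F(1, 3), F(1, 3), F(0), F(1, 3), F(1, 3), F(1, 3)],
    "Business": [F(0), F(1, 3), F(1, 3), F(0), F(2, 3), F(1, 3), F(1, 3), F(1, 3)],
    "Other":    [F(1, 4), F(1, 4), F(2, 4), F(1, 4), F(1, 4), F(2, 4), F(2, 4), F(2, 4)],
}

# Keywords present in the new article; all others are absent.
PRESENT = {"NFL", "Inflation", "Trump"}

def numerator(cat):
    """Prior times the product of p (present) or 1 - p (absent) over keywords."""
    score = PRIOR[cat]
    for kw, p in zip(KEYWORDS, COND[cat]):
        score *= p if kw in PRESENT else 1 - p
    return score

scores = {cat: numerator(cat) for cat in PRIOR}
best = max(scores, key=scores.get)
for cat, s in scores.items():
    print(cat, s)
print("largest numerator:", best)
```

A single zero probability wipes out a category's whole product, which is the motivation for the Laplace smoothing in parts c and d.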