Question

Question 1. Naive Bayes and Laplace smoothing. We are given a set of keywords that are presumed to be informative about whether a news site article is about sports, business or another subject. These words are {NFL, inflation, Trump, pitcher, stocks, England, Florida, storm}. The following table records whether each keyword appeared in each article of a training set of news site articles, along with the article's category. The table also includes a column with the length of each article (in number of words).

Category NFL Inflation Trump Pitcher Stocks England Florida Storm Length
Sports 1 0 0 0 0 0 1 0 220
Sports 1 0 0 0 0 1 0 1 245
Sports 0 0 1 1 0 0 0 0 191
Business 0 1 0 0 1 0 0 1 321
Business 0 0 0 0 1 0 0 0 290
Business 0 0 1 0 0 1 1 0 437
Other 0 1 0 0 0 1 0 1 93
Other 0 0 1 0 0 0 1 0 96
Other 0 0 0 1 0 1 1 0 486
Other 1 0 1 0 1 0 0 1 302

a. For each of the 0-1 variables associated with the words, calculate its probability of occurring in each category, as in the table below.

Category NFL Inflation Trump Pitcher Stocks England Florida Storm
Sports
Business
Other

You should fill in this table by writing the conditional probability of each keyword (column header) given the category (row title). For instance, in the cell with row Business and column Trump you should write Pr(Trump | Business). Write your answer in fraction form (e.g. 3/7) instead of decimal form.

b. Suppose a new article comes in that contains the words “NFL”, “inflation” and “Trump”, and none of the other keywords. Determine which category has the largest conditional probability given these words. You can skip computing the denominator and calculate only the numerators of Bayes' formula, since you just want to know which category is most probable given that the article contains this set of words. (Ignore the article length for now.)

c. Apply Laplace smoothing with α = 1 and β = 10 to all 0-1 variables, recalculate the probabilities, and write them in the following table. Remember to use fraction form, not decimal notation.

Category NFL Inflation Trump Pitcher Stocks England Florida Storm
Sports
Business
Other
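
The question does not spell out the smoothing formula. A minimal sketch, assuming the common convention Pr(word | category) = (count + α) / (n + β), where n is the number of training articles in the category; if your course defines α and β differently, adjust smoothed_probs accordingly. The names KEYWORDS, ROWS and smoothed_probs are illustrative only.

```python
from fractions import Fraction

KEYWORDS = ["NFL", "Inflation", "Trump", "Pitcher",
            "Stocks", "England", "Florida", "Storm"]

# Training rows copied from the question's table: (category, keyword flags).
ROWS = [
    ("Sports",   [1, 0, 0, 0, 0, 0, 1, 0]),
    ("Sports",   [1, 0, 0, 0, 0, 1, 0, 1]),
    ("Sports",   [0, 0, 1, 1, 0, 0, 0, 0]),
    ("Business", [0, 1, 0, 0, 1, 0, 0, 1]),
    ("Business", [0, 0, 0, 0, 1, 0, 0, 0]),
    ("Business", [0, 0, 1, 0, 0, 1, 1, 0]),
    ("Other",    [0, 1, 0, 0, 0, 1, 0, 1]),
    ("Other",    [0, 0, 1, 0, 0, 0, 1, 0]),
    ("Other",    [0, 0, 0, 1, 0, 1, 1, 0]),
    ("Other",    [1, 0, 1, 0, 1, 0, 0, 1]),
]


def smoothed_probs(rows, alpha=1, beta=10):
    """ASSUMED form: (count + alpha) / (n + beta) for each keyword and category."""
    totals = {}   # category -> number of articles
    counts = {}   # category -> per-keyword occurrence counts
    for cat, flags in rows:
        totals[cat] = totals.get(cat, 0) + 1
        acc = counts.setdefault(cat, [0] * len(KEYWORDS))
        for i, flag in enumerate(flags):
            acc[i] += flag
    return {
        cat: {kw: Fraction(counts[cat][i] + alpha, totals[cat] + beta)
              for i, kw in enumerate(KEYWORDS)}
        for cat in totals
    }


smoothed = smoothed_probs(ROWS)
for cat, row in smoothed.items():
    print(cat, " ".join(str(row[kw]) for kw in KEYWORDS))
```

Fraction reduces automatically, so a value such as 2/14 prints as 1/7. Under this convention no keyword keeps a probability of exactly 0, which is the point of smoothing.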

d. Now repeat question 1b), but this time use the Laplace smoothed probabilities.

e. Now assume that the length of the article follows a normal distribution, with a different mean and standard deviation for each category. For each category, calculate the sample mean and sample standard deviation (divide by n, not n − 1).
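
The per-category mean and standard deviation can be checked with Python's statistics module. A short sketch: statistics.pstdev is the population standard deviation, which divides by n as part e requests. The dictionary name LENGTHS is illustrative; its values are the Length column of the training table.

```python
import statistics

# Article lengths (in words) grouped by category, copied from the training table.
LENGTHS = {
    "Sports":   [220, 245, 191],
    "Business": [321, 290, 437],
    "Other":    [93, 96, 486, 302],
}

# pstdev divides by n (population form), not n - 1.
stats = {
    cat: (statistics.fmean(values), statistics.pstdev(values))
    for cat, values in LENGTHS.items()
}

for cat, (mean, std) in stats.items():
    print(f"{cat}: mean = {mean:.3f}, std = {std:.3f}")
```

Using statistics.stdev instead would divide by n − 1 and give larger values, which is not what the question asks for.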

f. Suppose the article with the words “NFL”, “inflation” and “Trump” has 150 words. Repeat part 1d), but also include the density function of the normal distribution for the length of the article. Does this change the verdict of Naive Bayes as to which category is most likely?
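
Parts d and f can be explored together with a short script. This is a sketch under two assumptions that the question does not state: the smoothing form (count + α) / (n + β), and unsmoothed priors (category count divided by 10). Each category's score is the prior, times the smoothed keyword probabilities (p for present words, 1 − p for absent ones), optionally times the normal density of the length. All function and variable names are illustrative.

```python
import math
from fractions import Fraction

KEYWORDS = ["NFL", "Inflation", "Trump", "Pitcher",
            "Stocks", "England", "Florida", "Storm"]

# Training rows from the question's table: (category, keyword flags, length).
ROWS = [
    ("Sports",   [1, 0, 0, 0, 0, 0, 1, 0], 220),
    ("Sports",   [1, 0, 0, 0, 0, 1, 0, 1], 245),
    ("Sports",   [0, 0, 1, 1, 0, 0, 0, 0], 191),
    ("Business", [0, 1, 0, 0, 1, 0, 0, 1], 321),
    ("Business", [0, 0, 0, 0, 1, 0, 0, 0], 290),
    ("Business", [0, 0, 1, 0, 0, 1, 1, 0], 437),
    ("Other",    [0, 1, 0, 0, 0, 1, 0, 1], 93),
    ("Other",    [0, 0, 1, 0, 0, 0, 1, 0], 96),
    ("Other",    [0, 0, 0, 1, 0, 1, 1, 0], 486),
    ("Other",    [1, 0, 1, 0, 1, 0, 0, 1], 302),
]
ALPHA, BETA = 1, 10  # ASSUMED smoothing form: (count + ALPHA) / (n + BETA)


def fit(rows):
    """Estimate prior, smoothed keyword probabilities, and length mean/std per category."""
    by_cat = {}
    for cat, flags, length in rows:
        by_cat.setdefault(cat, []).append((flags, length))
    model = {}
    for cat, items in by_cat.items():
        n = len(items)
        word_p = [Fraction(sum(flags[i] for flags, _ in items) + ALPHA, n + BETA)
                  for i in range(len(KEYWORDS))]
        lengths = [length for _, length in items]
        mean = sum(lengths) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n)  # divide by n
        model[cat] = {"prior": Fraction(n, len(rows)), "word_p": word_p,
                      "mean": mean, "std": std}
    return model


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def scores(model, present, length=None):
    """Bayes numerators; multiplies in the length density when length is given."""
    out = {}
    for cat, m in model.items():
        s = m["prior"]
        for kw, p in zip(KEYWORDS, m["word_p"]):
            s *= p if kw in present else 1 - p
        s = float(s)
        if length is not None:
            s *= normal_pdf(length, m["mean"], m["std"])
        out[cat] = s
    return out


model = fit(ROWS)
present = {"NFL", "Inflation", "Trump"}
words_only = scores(model, present)               # part d
with_length = scores(model, present, length=150)  # part f
for cat in model:
    print(f"{cat}: words only {words_only[cat]:.3e}, with length {with_length[cat]:.3e}")
print("largest with length:", max(with_length, key=with_length.get))
```

Comparing the two columns shows whether including the length changes the ranking. The scores are unnormalized numerators, so only their relative order matters.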

Homework Answers

Answer #1

a)

Category NFL Inflation Trump Pitcher Stocks England Florida Storm
Sports 2/3 0 1/3 1/3 0 1/3 1/3 1/3
Business 0 1/3 1/3 0 2/3 1/3 1/3 1/3
Other 1/4 1/4 2/4 1/4 1/4 2/4 2/4 2/4
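
The table in a) can be reproduced by counting, for each category, the fraction of its training articles that contain each keyword. A minimal sketch using exact fractions; the names KEYWORDS, ROWS and conditional_probs are illustrative only.

```python
from fractions import Fraction

KEYWORDS = ["NFL", "Inflation", "Trump", "Pitcher",
            "Stocks", "England", "Florida", "Storm"]

# Training rows copied from the question's table: (category, keyword flags).
ROWS = [
    ("Sports",   [1, 0, 0, 0, 0, 0, 1, 0]),
    ("Sports",   [1, 0, 0, 0, 0, 1, 0, 1]),
    ("Sports",   [0, 0, 1, 1, 0, 0, 0, 0]),
    ("Business", [0, 1, 0, 0, 1, 0, 0, 1]),
    ("Business", [0, 0, 0, 0, 1, 0, 0, 0]),
    ("Business", [0, 0, 1, 0, 0, 1, 1, 0]),
    ("Other",    [0, 1, 0, 0, 0, 1, 0, 1]),
    ("Other",    [0, 0, 1, 0, 0, 0, 1, 0]),
    ("Other",    [0, 0, 0, 1, 0, 1, 1, 0]),
    ("Other",    [1, 0, 1, 0, 1, 0, 0, 1]),
]


def conditional_probs(rows):
    """Return {category: {keyword: Pr(keyword | category)}} as exact fractions."""
    totals = {}   # category -> number of articles
    counts = {}   # category -> per-keyword occurrence counts
    for cat, flags in rows:
        totals[cat] = totals.get(cat, 0) + 1
        acc = counts.setdefault(cat, [0] * len(KEYWORDS))
        for i, flag in enumerate(flags):
            acc[i] += flag
    return {
        cat: {kw: Fraction(counts[cat][i], totals[cat])
              for i, kw in enumerate(KEYWORDS)}
        for cat in totals
    }


probs = conditional_probs(ROWS)
for cat, row in probs.items():
    print(cat, " ".join(str(row[kw]) for kw in KEYWORDS))
```

Fraction reduces automatically, so the entries written as 2/4 in the table print as 1/2.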

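b) The keywords present are NFL, inflation and Trump; the other five keywords are absent.

In the Sports category 3 articles contain at least one of these words, in the Business category 2 articles do, and in the Other category 3 articles do. No training article contains all three words.

Bayes' formula is given by

P(Ai|B) = P(B|Ai) P(Ai) / [P(B|A1) P(A1) + P(B|A2) P(A2) + P(B|A3) P(A3)]

A1: [article belongs to the Sports category]

A2: [article belongs to the Business category]

A3: [article belongs to the Other category]

B: [an article contains NFL, inflation and Trump, and none of the other keywords]

The priors are P(A1) = 3/10, P(A2) = 3/10 and P(A3) = 4/10. Under the Naive Bayes assumption, P(B|Ai) is the product of the per-keyword probabilities from part a), using Pr(word|Ai) for the three present words and 1 − Pr(word|Ai) for the five absent ones.

P(B|A1) = 0, because Pr(Inflation|Sports) = 0.

P(B|A2) = 0, because Pr(NFL|Business) = 0.

P(B|A3) = (1/4)(1/4)(2/4) × (3/4)(3/4)(2/4)(2/4)(2/4) = 9/4096, so the numerator is P(B|A3) P(A3) = (9/4096)(4/10) = 9/10240.

Only the Other category has a nonzero numerator, so Naive Bayes assigns the article to Other.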
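
Under the Naive Bayes factorization that the question asks for, the numerator for each category is the prior times a product over all eight keywords: Pr(word | category) for each present word and 1 − Pr(word | category) for each absent one. A minimal sketch that computes these numerators from the training table without smoothing; fit and numerator are illustrative names.

```python
from fractions import Fraction

KEYWORDS = ["NFL", "Inflation", "Trump", "Pitcher",
            "Stocks", "England", "Florida", "Storm"]

# Training rows copied from the question's table: (category, keyword flags).
ROWS = [
    ("Sports",   [1, 0, 0, 0, 0, 0, 1, 0]),
    ("Sports",   [1, 0, 0, 0, 0, 1, 0, 1]),
    ("Sports",   [0, 0, 1, 1, 0, 0, 0, 0]),
    ("Business", [0, 1, 0, 0, 1, 0, 0, 1]),
    ("Business", [0, 0, 0, 0, 1, 0, 0, 0]),
    ("Business", [0, 0, 1, 0, 0, 1, 1, 0]),
    ("Other",    [0, 1, 0, 0, 0, 1, 0, 1]),
    ("Other",    [0, 0, 1, 0, 0, 0, 1, 0]),
    ("Other",    [0, 0, 0, 1, 0, 1, 1, 0]),
    ("Other",    [1, 0, 1, 0, 1, 0, 0, 1]),
]


def fit(rows):
    """Return (priors, per-keyword probabilities) as exact fractions, unsmoothed."""
    totals, counts = {}, {}
    for cat, flags in rows:
        totals[cat] = totals.get(cat, 0) + 1
        acc = counts.setdefault(cat, [0] * len(KEYWORDS))
        for i, flag in enumerate(flags):
            acc[i] += flag
    priors = {cat: Fraction(n, len(rows)) for cat, n in totals.items()}
    probs = {cat: [Fraction(c, totals[cat]) for c in counts[cat]] for cat in totals}
    return priors, probs


def numerator(prior, word_probs, present):
    """Prior times p for present keywords and (1 - p) for absent ones."""
    score = prior
    for kw, p in zip(KEYWORDS, word_probs):
        score *= p if kw in present else 1 - p
    return score


priors, probs = fit(ROWS)
present = {"NFL", "Inflation", "Trump"}
nums = {cat: numerator(priors[cat], probs[cat], present) for cat in priors}
for cat, value in nums.items():
    print(cat, value)
```

A single zero factor wipes out a category's whole product, which is what happens to Sports and Business here and is the motivation for the smoothing in part c.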
