You will use your Python environment to derive structure from unstructured data, working with the given data set.
Using this data set, you will create a text analytics Python application that extracts themes from each comment using term frequency–inverse document frequency (TF–IDF) or simple word counts. For the deliverable, provide your Python file and a .csv with your results added as a column to the original data set.
* I am unable to provide the data set itself, as I can't post a link or a .csv file, so a generic Python script is what I am looking for.
Response to comment: What do you mean by search?
Answer: The Python program below was originally shared as a screenshot so the indentation could be verified; comments explain each step of the code. Screenshots of the program's output and of the generated "brown.csv" accompanied the original answer but are not reproduced here. Below is the code to copy:
#CODE STARTS HERE----------------
import nltk
from nltk.corpus import brown                    # Brown corpus is used as demo data for testing
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer  # sklearn is used to calculate TF-IDF

nltk.download('brown')                           # download the corpus if it is not already available

# Build the demo dataset: join the first 100 tokenised news sentences back into plain strings
demo_document = brown.sents(categories=['news'])
dataset = []
for demo in demo_document[:100]:
    dataset.append(" ".join(demo))

# Print the first 15 documents of the dataset
print("DEMO DATASET:")
for i in range(15):
    print(i, dataset[i])

# fit_transform learns the vocabulary and calculates the TF-IDF scores
model = TfidfVectorizer(use_idf=True)
tfIdf = model.fit_transform(dataset)

# Print the TF-IDF values of the first 15 documents
print("\nTF-IDF VALUES:")
df = pd.DataFrame(tfIdf[:15].todense(), columns=model.get_feature_names_out())  # get_feature_names() was removed in recent scikit-learn
print(df)

# Export the full TF-IDF matrix to .csv
export_df = pd.DataFrame(tfIdf.todense(), columns=model.get_feature_names_out())
export_df.to_csv("brown.csv")
#CODE ENDS HERE------------------
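Since the actual data set cannot be shared, here is a minimal sketch of how the same approach can be adapted to a generic comments file so that the extracted themes are added as a column to the original data set, as the assignment requires. The file name "comments.csv", the column name "comment", and the choice of the top 3 terms per comment are assumptions; adjust them to match the real data.

# Generic version: assumes the original data set is "comments.csv" with a text
# column named "comment" (both names are placeholders).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

data = pd.read_csv("comments.csv")
comments = data["comment"].fillna("").astype(str)

# Learn TF-IDF weights over all comments; English stop words are removed so
# common filler words do not dominate the extracted themes.
vectorizer = TfidfVectorizer(stop_words="english", use_idf=True)
tfidf = vectorizer.fit_transform(comments)
terms = vectorizer.get_feature_names_out()

# For each comment, keep the 3 highest-weighted terms as its "themes".
themes = []
for row in tfidf:
    weights = row.toarray().ravel()
    top = weights.argsort()[::-1][:3]
    themes.append(", ".join(terms[i] for i in top if weights[i] > 0))

# Add the themes as a new column and write the augmented data set back out.
data["themes"] = themes
data.to_csv("comments_with_themes.csv", index=False)

Swapping TfidfVectorizer for sklearn's CountVectorizer in this sketch would give the simple word-count alternative mentioned in the assignment.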