In Python (using pandas and NumPy), I am trying to clean CSV data so that it adheres to a strict coding system instead of free-response text. More specifically, how would I code a simple rule-based system to handle the various spellings and word choices that represent the following statuses:
# Here is how I converted a pandas DataFrame column of strings so that it strictly adheres to my categories
import pandas as pd  # Import the pandas library
import numpy as np  # Import the NumPy library
# Sample DataFrame with a single string column (named 0 by default)
df_sample = pd.DataFrame(pd.Series(['Never married', 'Separated', 'Widowed', 'Divorced', 'Married', 'kjskd', 'Married', 'Never married', 'uguyd']))
print(df_sample)
print(type(df_sample[0]))  # The column is a pandas Series
# Convert the string column to an ordered categorical with pd.Categorical, using the category order listed below
ordered_marstat = ['Never married', 'Divorced', 'Married', 'Widowed', 'Separated']
df = pd.DataFrame(pd.Categorical(df_sample[0], ordered=True, categories=ordered_marstat))
print(df)
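One thing to watch with the conversion above: any value that is not in the category list (here 'kjskd' and 'uguyd') is turned into NaN rather than kept as text. A minimal sketch for checking which raw values were dropped, rebuilding the same sample data so it runs on its own (the names raw, coded and unmatched are just illustrative):

```python
import pandas as pd

raw = pd.Series(['Never married', 'Separated', 'Widowed', 'Divorced',
                 'Married', 'kjskd', 'Married', 'Never married', 'uguyd'])
ordered_marstat = ['Never married', 'Divorced', 'Married', 'Widowed', 'Separated']

coded = pd.Series(pd.Categorical(raw, categories=ordered_marstat, ordered=True))

# Values outside the category list become NaN, so list the raw strings that were lost
unmatched = raw[coded.isna()]
print(unmatched.tolist())  # ['kjskd', 'uguyd']

# Integer codes follow the order of ordered_marstat; NaN is coded as -1
print(coded.cat.codes.tolist())  # [0, 4, 3, 1, 2, -1, 2, 0, -1]
```

Reviewing the unmatched list is a quick way to find spellings that still need a rule before they are coded.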
This was from my original project, where I had to store the string column in a new DataFrame, but you can also overwrite the existing column using the syntax below:
df_sample[0] = pd.Categorical(df_sample[0], ordered=True, categories=ordered_marstat)
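For the "various spellings and word choices" part of the question, one possible approach is to normalize the text and look it up in a dictionary of known variants before converting to a categorical. The sketch below is only an illustration: the variant_map entries and the normalize_status helper are hypothetical and would need to be replaced with the spellings that actually occur in your CSV.

```python
import pandas as pd

ordered_marstat = ['Never married', 'Divorced', 'Married', 'Widowed', 'Separated']

# Hypothetical variants (assumed for illustration); keys are lowercase, trimmed text
variant_map = {
    'never married': 'Never married',
    'single': 'Never married',
    'divorced': 'Divorced',
    'married': 'Married',
    'maried': 'Married',
    'widowed': 'Widowed',
    'widow': 'Widowed',
    'separated': 'Separated',
    'seperated': 'Separated',
}

def normalize_status(series):
    """Trim, lowercase and collapse whitespace, then map to a canonical label."""
    cleaned = (series.astype(str)
                     .str.strip()
                     .str.lower()
                     .str.replace(r'\s+', ' ', regex=True))
    mapped = cleaned.map(variant_map)  # anything without a rule becomes NaN
    return pd.Categorical(mapped, categories=ordered_marstat, ordered=True)

raw = pd.Series([' never  married', 'SEPERATED', 'Widow', 'divorced ', 'Married', 'kjskd'])
result = pd.Series(normalize_status(raw))
print(result)
```

Entries with no matching rule still end up as NaN, so they can be reviewed and added to the dictionary as new variants turn up.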