from a URL and does the following.
Write a Python code that reads a csv record 1. Identify the columns which have values as "?" in them.
2. Replace these with the mean value of the respective attribute.
2. Modify the data type of the attribute to float after replacement.
Gen AI - Data Repository Maintenance
Data repository management and administration is a crucial aspect of data engineering. It consists of all aspects that ensure the maintenance of the quality of the data, integrity of the data, and completeness of the data. In this lab, you will explore the different aspects of data repository administration and management in the context of a real-world data set.
Objectives
In this section, we will learn how to use generative AI to generate codes that will do the following:
- Handle missing entries in the data
- Remove duplicate entries from the data
- Outlier detection in the data
Data set
The data set being used is a modified subset that maps the price of laptops with different attributes. The data set is a filtered and modified version of the “Laptop Price Prediction using specifications” data set, available under the Database Contents License (DbCL) v1.0 on the Kaggle website.
The data set has been made available at the following URL = “https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m2/data/laptop_pricing_dataset_base.csv”
Handling Missing Values
The first thing to do with any data would be to handle any missing data. Missing data can be addressed in multiple ways.
- If the attribute for which the value is missing is categorical or discrete, replace it with most frequent value.
- If the attribute for which the value is missing is continuous, replace it with either the mean or the median value.
- If the attribute for which the value is missing is the target attribute, its best to drop that entire row of data.
In this case, the data points missing are in continuous variables, and missing data is represented by the character ?. You can use the Generative AI model to write a code that handles these missing values for you.
Input 1
Response 1
import pandas as pd
# Read CSV record from URL
= 'https://example.com/data.csv'
url = pd.read_csv(url)
data print(data.dtypes)
# Identify columns with "?" values
= data.columns[data.isin(['?']).any()]
cols_with_question_mark print(cols_with_question_mark)
# Replace "?" values with the mean value of the respective attribute
for col in cols_with_question_mark:
= pd.to_numeric(data[col], errors='coerce').mean()
mean_val = pd.to_numeric(data[col].replace('?', mean_val))
data[col]
# Modify the data type of the attribute to float after replacement
= data[cols_with_question_mark].astype(float)
data[cols_with_question_mark]
# Print the modified data
print(data.dtypes)
Modify Code
Update the code with the URL provided in the Data Set section.
- The result after executing the python file will be similar to image below
- Note that
Screen Size
andWeight
attributes are the ones identified to have values missing and hence, their data type earlier wasobject
. Upon modification and filling with attribute mean values, the data type has been modified tofloat64
.
Remove Duplicates
Input 2
in a python dataframe, write a code that removes all duplicate entries from the data. Print the total number of rows in the data before and after removal, to prove the working of the code. For a data
Response 2
import pandas as pd
# Assuming 'data' is the pandas DataFrame containing the data
# Print the total number of rows before removal
print("Total number of rows before removal:", len(data))
# Remove duplicate entries
=True)
data.drop_duplicates(inplace
# Print the total number of rows after removal
print("Total number of rows after removal:", len(data))
# Print the modified data
print(data)
This code snippet will remove all duplicate entries from the DataFrame ‘data’ and print the total number of rows before and after removal to demonstrate the effectiveness of the code.
You can append the relevant part of this code to your existing file on the testing interface. Make sure to comment the print statements of the previous step in order to have clear visibility of the output for this one.
The contents of your testing code should now be somewhat as shown below.
Updated Code
import pandas as pd
# Read CSV record from URL
= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m2/data/laptop_pricing_dataset_base.csv"
url = pd.read_csv(url)
data #print(data.dtypes)
# Identify columns with "?" values
= data.columns[data.isin(['?']).any()]
cols_with_question_mark #print(cols_with_question_mark)
# Replace "?" values with the mean value of the respective attribute
for col in cols_with_question_mark:
= pd.to_numeric(data[col], errors='coerce').mean()
mean_val = pd.to_numeric(data[col].replace('?', mean_val))
data[col]
# Modify the data type of the attribute to float after replacement
= data[cols_with_question_mark].astype(float)
data[cols_with_question_mark]
# Print the modified data
#print(data.dtypes)
print("Total number of rows before removal:", len(data))
# Remove duplicate entries
=True)
data.drop_duplicates(inplace
# Print the total number of rows after removal
print("Total number of rows after removal:", len(data))
Outlier Identification
You should now test the data for quality and identify if there are any anomalous outliers in target attribute ‘Price’. You can simply use the following prompt on the generative AI platform to create a code that will extract the required entries.
Input 3
'Price' has outliers that might be anomalies in comparison to the other data. Write a python code to extract entries of a dataframe where the attribute
Response 3
import pandas as pd
import numpy as np
# Assuming 'data' is the pandas DataFrame containing the data
# Calculate the mean and standard deviation of the prices
= data['Price'].mean()
mean_price = data['Price'].std()
std_price
# Define a threshold for outliers (e.g., 3 standard deviations away from the mean)
= 3
outlier_threshold
# Identify outliers using the threshold
= data[(np.abs(data['Price'] - mean_price) > outlier_threshold * std_price)]
outliers
# Extract entries with price outliers
= data[data['Price'].isin(outliers['Price'])]
entries_with_outliers
# Print the entries with price outliers
print("Entries with price outliers:")
print(entries_with_outliers)
This code snippet calculates the mean and standard deviation of the prices in the data set, identifies outliers based on a specified threshold, and then extracts entries where the ‘Price’ attribute has outliers. This will help you isolate and analyze entries with anomalous price values in comparison to the rest of the data.
You can use the relevant part of this code and add it to the existing file on the testing interface. Make sure to comment out the previous print statements to effectively see the output of this code snippet properly.
Updated Code
import pandas as pd
# Read CSV record from URL
= "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m2/data/laptop_pricing_dataset_base.csv"
url = pd.read_csv(url)
data #print(data.dtypes)
# Identify columns with "?" values
= data.columns[data.isin(['?']).any()]
cols_with_question_mark #print(cols_with_question_mark)
# Replace "?" values with the mean value of the respective attribute
for col in cols_with_question_mark:
= pd.to_numeric(data[col], errors='coerce').mean()
mean_val = pd.to_numeric(data[col].replace('?', mean_val))
data[col]
# Modify the data type of the attribute to float after replacement
= data[cols_with_question_mark].astype(float)
data[cols_with_question_mark]
# Print the modified data
#print(data.dtypes)
#print("Total number of rows before removal:", len(data))
# Remove duplicate entries
=True)
data.drop_duplicates(inplace
# Print the total number of rows after removal
#print("Total number of rows after removal:", len(data))
import numpy as np
# Assuming 'data' is the pandas DataFrame containing the data
# Calculate the mean and standard deviation of the prices
= data['Price'].mean()
mean_price = data['Price'].std()
std_price
# Define a threshold for outliers (e.g., 3 standard deviations away from the mean)
= 3
outlier_threshold
# Identify outliers using the threshold
= data[(np.abs(data['Price'] - mean_price) > outlier_threshold * std_price)]
outliers
# Extract entries with price outliers
= data[data['Price'].isin(outliers['Price'])]
entries_with_outliers
# Print the entries with price outliers
print("Entries with price outliers:")
print(entries_with_outliers)