Gen AI Data Anonymization
Other than creating synthetic data, which by its nature takes care of anonymization requirements, you can use generative AI platforms to create scripts that modify the personally identifiable information (PII) in a given data set. In this section, we will use a generative AI platform to create Python scripts that apply different anonymization strategies to the PII attributes in the data.
Objectives
We will introduce anonymization in data using techniques like:
- Pseudonymization
- Redaction
- Generalization
- Noise addition
Testing Environment Setup
You will need the following libraries to test the code:
python3 -m pip install pandas
python3 -m pip install numpy
python3 -m pip install scikit-learn
python3 -m pip install mlxtend
python3 -m pip install seaborn
Data Set
The data set used in this lab was synthetically created using the faker library in Python. Please note that none of the entries in the data set have any bearing on any individual whatsoever.
The data set is available on the link shared below.
Synthetic data set
Attributes of the data set are:
Attribute | Description |
---|---|
Name | Name of the candidate |
Email | Email address of the candidate |
Age | Age of the candidate |
Contact Number | Contact number of the candidate |
Pseudonymization
Pseudonymization involves replacing original entries in data with appropriate pseudonyms. This assignment completely removes the original entry and protects the privacy of the user. We can use this technique to convert the names of the candidates into pseudonyms.
- You can simply use the GPT model to create code that will do that for you.
- Assume that you want to replace the names in the data set with a pseudonym User_i, where i represents the i-th entry in the data.
- Consider the following prompt to generate code that can achieve this. Enter this text into the AlmostAI chatbot.
Input 1
Replace the entries under 'Name' attribute of a dataset into pseudonyms like "User_i" using Python
Response 1
import pandas as pd
# Read the dataset into a pandas DataFrame
df = pd.read_csv('your_dataset.csv')
# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]
# Save the modified DataFrame back to a new CSV file
df.to_csv('modified_dataset.csv', index=False)
Test 1
import pandas as pd
# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())
# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]
# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
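The list comprehension above numbers rows by position, so a name that appears on two rows would receive two different pseudonyms. If repeated names should map to the same pseudonym, one option is to build a lookup table first. The following is a minimal sketch under that assumption; the helper name build_pseudonym_map and the sample names are hypothetical and are not part of the lab.

```python
def build_pseudonym_map(names):
    """Map each distinct name to 'User_i', numbered in order of first appearance."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = 'User_' + str(len(mapping) + 1)
    return mapping

# Hypothetical sample values, for illustration only
names = ['John', 'Alice', 'John', 'Bob']
mapping = build_pseudonym_map(names)
pseudonyms = [mapping[name] for name in names]
print(pseudonyms)  # ['User_1', 'User_2', 'User_1', 'User_3']
```

With the dataframe from the test code, this could be applied as df['Name'] = df['Name'].map(build_pseudonym_map(df['Name'])). Keep in mind that anyone holding the lookup table can reverse the pseudonyms, so it should be stored securely or discarded.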
Redaction
A common method of anonymization is redaction, which hides parts of the information so as to protect the personal information in the data.
- You can test this out by redacting the email addresses of the candidates in the record such that only the first and last characters of the username and of the service provider are visible.
- The rest of the characters are replaced with the character ‘*’.
Consider the following prompt to achieve this.
Input 1
Write a python code to redact the entries under the attribute 'Email' in a dataframe such that only the first and last characters of the username and the service providers are visible. Rest all characters are replaced with the character '*'.
Response 1
import pandas as pd
# Create a sample dataframe
data = {'Name': ['John', 'Alice', 'Bob'],
        'Email': ['john.doe@example.com', 'alice.smith@example.com', 'bob.jones@example.com']}
df = pd.DataFrame(data)
# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain
# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)
# Display the redacted dataframe
print(df)
Test 1
You can modify this code and use it in the testing environment to confirm that it in fact redacts the email addresses in the dataframe. A necessary modification would be using only the function and the function call, and ignoring the import command and dataframe creation, since both these steps have already been completed in the previous task. You can append this code to your existing code and see the result of both the processes in a single go.
import pandas as pd
# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())
# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]
# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain
# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)
# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
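The generated redact_email function assumes every entry is a string with exactly one '@' and at least two characters on each side. A one-character username such as 'a' would come out as 'aa', and an entry without '@' would raise an error. The following is a minimal sketch of a more defensive variant; the names mask_middle and redact_email_safe are hypothetical, and leaving malformed entries unchanged is an assumption you may want to replace with a placeholder value.

```python
def mask_middle(text):
    """Keep the first and last characters and replace the rest with '*'."""
    if len(text) <= 2:
        # Too short to mask anything without changing its length
        return text
    return text[0] + '*' * (len(text) - 2) + text[-1]

def redact_email_safe(email):
    """Redact the username and domain; return malformed values unchanged."""
    if not isinstance(email, str) or email.count('@') != 1:
        return email
    username, domain = email.split('@')
    return mask_middle(username) + '@' + mask_middle(domain)

print(redact_email_safe('john.doe@example.com'))  # j******e@e*********m
print(redact_email_safe('a@example.com'))         # a@e*********m
```

With the dataframe from the test code, this could be applied as df['Email'] = df['Email'].apply(redact_email_safe).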
Generalization
Generalization involves putting specific entries, which may be possible identifiers, into generic groups, such that the personal details in the records are protected. You can apply the generalization logic to the Age attribute of the said data set, and convert the specific age of the candidates into generic categories. For example, 28 can become 20s, 36 can become 30s, and so on.
You can create code for this using the following prompt.
Input 1
Write a python code to generalize the entries under the attribute 'Age' of a data frame such that exact number is converted into a generic range. For example, 28 becomes '20s', 36 becomes '30s', etc.
Response 1
import pandas as pd
# Create a sample dataframe
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [28, 36, 42]}
df = pd.DataFrame(data)
# Function to generalize age
def generalize_age(age):
    age_range = str(age)[0] + '0s'
    return age_range
# Generalize 'Age' column in the dataframe
df['Age'] = df['Age'].apply(generalize_age)
# Display the generalized dataframe
print(df)
Test 1
You can modify this code and use it in the testing environment to confirm that it in fact generalizes the age in the dataframe. A necessary modification would be using only the function and the function call, and ignoring the import command and dataframe creation, since both these steps have already been completed in the first task. You can append this code to your existing code and see the result of all the processes in a single go.
import pandas as pd
# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())
# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]
# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain
# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)
# Function to generalize age
def generalize_age(age):
    age_range = str(age)[0] + '0s'
    return age_range
# Generalize 'Age' column in the dataframe
df['Age'] = df['Age'].apply(generalize_age)
# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
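The generated generalize_age function takes the first digit of the age as text, which only gives the intended label for two-digit ages: 7 would become '70s' and 105 would become '10s'. If the data may contain such values, the decade can be computed arithmetically instead. The following is a minimal sketch; the name generalize_age_by_decade is hypothetical and not part of the lab.

```python
def generalize_age_by_decade(age):
    """Convert an exact age into a decade label such as '20s'."""
    decade = (int(age) // 10) * 10  # e.g. 28 -> 20, 105 -> 100
    return str(decade) + 's'

for age in [7, 28, 36, 105]:
    print(age, '->', generalize_age_by_decade(age))
# 7 -> 0s
# 28 -> 20s
# 36 -> 30s
# 105 -> 100s
```

With the dataframe from the test code, this could be applied as df['Age'] = df['Age'].apply(generalize_age_by_decade). Wider groups, such as 20-year bands, would only need a different divisor.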
Noise addition
Another way to anonymize the data is to add random noise to it. This distorts the original values so that they no longer match the candidate's real details, which protects the privacy of the candidate. You can apply such random noise to the attribute ‘Contact Number’ in the data set.
Assuming that all contact numbers are 10-digit numerical values, you need to add random noise that is five digits long. To create Python code that can do this, you can use the following prompt on the GPT system.
Input 1
Write a python code to add random noise of 5 digit length to a numerical attribute 'Contact Number' in a data frame which had all values of length 10 digits.
Response 1
import pandas as pd
import random
# Create a sample dataframe
data = {'Name': ['John', 'Alice', 'Bob'],
        'Contact Number': [1234567890, 9876543210, 2468101214]}
df = pd.DataFrame(data)
# Function to add random noise
def add_random_noise(contact_number):
    noise = str(random.randint(10000, 99999))
    return str(contact_number)[:-5] + noise
# Add random noise to 'Contact Number' column in the dataframe
df['Contact Number'] = df['Contact Number'].apply(add_random_noise)
# Display the dataframe with added noise
print(df)
Test 1
You can modify this code and use it in the testing environment to confirm that it adds noise to the contact number in the dataframe. A necessary modification would be using only the function and the function call, and ignoring the pandas import command and dataframe creation, since both these steps have already been completed in the first task. The import command for random will still be needed for the function to work. You can append this code to your existing code and see the result of all the processes in a single go.
import pandas as pd
import random
# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())
# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]
# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain
# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)
# Function to generalize age
def generalize_age(age):
    age_range = str(age)[0] + '0s'
    return age_range
# Generalize 'Age' column in the dataframe
df['Age'] = df['Age'].apply(generalize_age)
def add_random_noise(contact_number):
    noise = str(random.randint(10000, 99999))
    return str(contact_number)[:-5] + noise
# Add random noise to 'Contact Number' column in the dataframe
df['Contact Number'] = df['Contact Number'].apply(add_random_noise)
# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
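Note that the generated add_random_noise function replaces the last five digits of the number rather than adding to its value, and random.randint(10000, 99999) never produces a replacement that starts with 0. The following is a minimal sketch of a variant that draws each digit independently and accepts its own random generator so that a run can be repeated while testing. The name replace_last_digits, its parameters, and the seed value are hypothetical and not part of the lab.

```python
import random

def replace_last_digits(contact_number, rng, n_digits=5):
    """Replace the last n_digits (at least 1) with random digits, keeping the length."""
    digits = str(contact_number)
    noise = ''.join(rng.choice('0123456789') for _ in range(n_digits))
    return digits[:-n_digits] + noise

# A fixed seed is used here only so that the example is repeatable
rng = random.Random(42)
noisy = replace_last_digits(1234567890, rng)
print(noisy)
```

A fixed seed makes the noise predictable, so it is only suitable for testing; for actual anonymization you would create the generator without a seed. With the dataframe from the test code, this could be applied as df['Contact Number'] = df['Contact Number'].apply(lambda n: replace_last_digits(n, rng)).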