Data Set

Gen AI Data Anonymization

Besides creating synthetic data, which by its very nature satisfies anonymization requirements, you can use generative AI platforms to create scripts that modify the personally identifiable information (PII) in a given data set. In this section, we will use a generative AI platform to create Python scripts that apply different anonymization strategies to the PII attributes in the data.

Objectives


We will anonymize the data using the following techniques:

  1. Pseudonymization
  2. Redaction
  3. Generalization
  4. Noise addition

Testing Environment Setup


You will need the following libraries to test the code:

python3 -m pip install pandas
python3 -m pip install numpy
python3 -m pip install scikit-learn
python3 -m pip install mlxtend
python3 -m pip install seaborn

The data set used in this lab has been synthetically created using the Faker library in Python. Please note that none of the entries in the data set bear any relation to any real individual.

The data set is available at the link shared below.
Synthetic data set

Attributes of the data set are:

Attribute        Description
Name             Name of the candidate
Email            Email address of the candidate
Age              Age of the candidate
Contact Number   Contact number of the candidate

Pseudonymization


Pseudonymization involves replacing original entries in the data with appropriate pseudonyms. This replacement removes the original entry from the data set and so protects the privacy of the user. We can use this technique to convert the names of the candidates into pseudonyms.

  • You can use the GPT model to generate code that does this for you.
  • Assume that you want to replace the names in the data set with the pseudonym User_i, where i is the position of the i-th entry in the data.
  • Consider the following prompt to generate code that achieves this. Enter this text into the AlmostAI chatbot.

Input 1

Replace the entries under 'Name' attribute of a dataset into pseudonyms like "User_i" using Python

Response 1

import pandas as pd

# Read the dataset into a pandas DataFrame
df = pd.read_csv('your_dataset.csv')

# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]

# Save the modified DataFrame back to a new CSV file
df.to_csv('modified_dataset.csv', index=False)

Test 1

import pandas as pd

# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())

# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]

# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
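
Note that the row-index approach above gives every row its own pseudonym, even when the same name occurs more than once. If repeated names should map to the same pseudonym, one option is to build a lookup table first. The following is a minimal sketch under that assumption; the helper name build_pseudonym_map and the sample names are hypothetical and are not part of the lab's data set.

```python
def build_pseudonym_map(names, prefix="User_"):
    """Map each distinct name to a pseudonym, numbered in order of first appearance."""
    mapping = {}
    for name in names:
        if name not in mapping:
            mapping[name] = f"{prefix}{len(mapping) + 1}"
    return mapping

# Hypothetical sample values, for illustration only
names = ["Alice Smith", "Bob Jones", "Alice Smith"]
mapping = build_pseudonym_map(names)
pseudonyms = [mapping[name] for name in names]
print(pseudonyms)  # ['User_1', 'User_2', 'User_1']
```

With pandas, the same lookup table could be applied to the column with df['Name'].map(mapping). Keep in mind that anyone holding the mapping can reverse the pseudonyms, so it should be stored separately from the data or discarded.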

Redaction


A common method of anonymization is redacting parts of the information so as to protect the personal information in a data set.

  • You can test this out by redacting the email addresses of the candidates in the record such that only the first and last characters of the username and the service providers are visible.
  • The rest of the characters are replaced with the character ‘*’.

Consider the following prompt to achieve this.

Input 1

Write a python code to redact the entries under the attribute 'Email' in a dataframe such that only the first and last characters of the username and the service providers are visible. Rest all characters are replaced with the character '*'.

Response 1

import pandas as pd

# Create a sample dataframe
data = {'Name': ['John', 'Alice', 'Bob'],
        'Email': ['john.doe@example.com', 'alice.smith@example.com', 'bob.jones@example.com']}
df = pd.DataFrame(data)

# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain

# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)

# Display the redacted dataframe
print(df)

Test 1

You can modify this code and use it in the testing environment to confirm that it in fact redacts the email addresses in the dataframe. A necessary modification would be using only the function and the function call, and ignoring the import command and dataframe creation, since both these steps have already been completed in the previous task. You can append this code to your existing code and see the result of both the processes in a single go.

import pandas as pd

# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())

# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]

# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain

# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)

# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
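
The generated redact_email function assumes that both the username and the domain have at least two characters. For a one-character username such as 'a@example.com' it returns 'aa@...', which repeats the character instead of hiding it, and it raises an error for values without exactly one '@'. The following is a hedged sketch of a more defensive variant; the names redact_part and redact_email_safe, and the '[REDACTED]' placeholder for malformed values, are assumptions made for illustration.

```python
def redact_part(text):
    """Keep the first and last characters; mask everything in between."""
    if len(text) <= 2:
        # Too short to keep both ends without revealing the whole value
        return '*' * len(text)
    return text[0] + '*' * (len(text) - 2) + text[-1]

def redact_email_safe(email):
    """Redact username and domain; replace malformed values entirely."""
    if not isinstance(email, str) or email.count('@') != 1:
        return '[REDACTED]'  # assumed placeholder for missing or malformed entries
    username, domain = email.split('@')
    return redact_part(username) + '@' + redact_part(domain)

print(redact_email_safe('john.doe@example.com'))  # j******e@e*********m
print(redact_email_safe('a@example.com'))         # *@e*********m
print(redact_email_safe('not-an-email'))          # [REDACTED]
```

It can be applied in the same way as the original function, for example df['Email'] = df['Email'].apply(redact_email_safe).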

Generalization


Generalization involves putting specific entries, which may be possible identifiers, into generic groups, such that the personal details in the records are protected. You can apply the generalization logic to the Age attribute of the said data set, and convert the specific age of the candidates into generic categories. For example, 28 can become 20s, 36 can become 30s, and so on.

You can generate the code for this using the following prompt.

Input 1

Write a python code to generalize the entries under the attribute 'Age' of a data frame such that exact number is converted into a generic range. For example, 28 becomes '20s', 36 becomes '30s', etc.

Response 1

import pandas as pd

# Create a sample dataframe
data = {'Name': ['John', 'Alice', 'Bob'],
        'Age': [28, 36, 42]}
df = pd.DataFrame(data)

# Function to generalize age
def generalize_age(age):
    age_range = str(age)[0] + '0s'
    return age_range

# Generalize 'Age' column in the dataframe
df['Age'] = df['Age'].apply(generalize_age)

# Display the generalized dataframe
print(df)

Test 1

You can modify this code and use it in the testing environment to confirm that it in fact generalizes the age in the dataframe. A necessary modification would be using only the function and the function call, and ignoring the import command and dataframe creation, since both these steps have already been completed in the first task. You can append this code to your existing code and see the result of all the processes in a single go.

import pandas as pd

# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())

# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]

# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain

# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)

# Function to generalize age
def generalize_age(age):
    age_range = str(age)[0] + '0s'
    return age_range

# Generalize 'Age' column in the dataframe
df['Age'] = df['Age'].apply(generalize_age)

# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
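
The generated generalize_age function takes the first character of the age, so it is only correct for two-digit ages: an age of 7 becomes '70s' and an age of 105 becomes '10s'. A minimal sketch that uses integer division instead is shown below. It assumes that ages are non-negative whole numbers with no missing values; the name generalize_age_safe and the 'Under 10' label are hypothetical choices.

```python
def generalize_age_safe(age):
    """Convert an exact age into its decade bucket, e.g. 28 -> '20s'."""
    decade = (int(age) // 10) * 10
    if decade == 0:
        return 'Under 10'  # assumed label for single-digit ages
    return f"{decade}s"

for age in [28, 36, 7, 105]:
    print(age, '->', generalize_age_safe(age))
# 28 -> 20s
# 36 -> 30s
# 7 -> Under 10
# 105 -> 100s
```

It can replace the original function in the call df['Age'] = df['Age'].apply(generalize_age_safe).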

Noise addition


Another way to anonymize the data is to add random noise to it. This perturbs the original values so that they no longer correspond to the real records, which helps protect the privacy of the candidate. You can apply such random noise to the attribute ‘Contact Number’ in the data set.

Assuming that all contact numbers are 10-digit numerical values, you need to add random noise of five digits to each of them. To create Python code that can do this, you can use the following prompt on the GPT system.

Input 1

Write a python code to add random noise of 5 digit length to a numerical attribute 'Contact Number' in a data frame which had all values of length 10 digits.

Response 1

import pandas as pd
import random

# Create a sample dataframe
data = {'Name': ['John', 'Alice', 'Bob'],
        'Contact Number': [1234567890, 9876543210, 2468101214]}
df = pd.DataFrame(data)

# Function to add random noise
def add_random_noise(contact_number):
    noise = str(random.randint(10000, 99999))
    return str(contact_number)[:-5] + noise

# Add random noise to 'Contact Number' column in the dataframe
df['Contact Number'] = df['Contact Number'].apply(add_random_noise)

# Display the dataframe with added noise
print(df)

Test 1

You can modify this code and use it in the testing environment to confirm that it adds noise to the contact number in the dataframe. A necessary modification would be using only the function and the function call, and ignoring the pandas import command and dataframe creation, since both these steps have already been completed in the first task. The import command for random will still be needed for the function to work. You can append this code to your existing code and see the result of all the processes in a single go.

import pandas as pd
import random

# Read the dataset into a pandas DataFrame
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-AI0273EN-SkillsNetwork/labs/v1/m1/data/synthetic_dataset.csv')
print(df.head())

# Replace the entries under the 'Name' attribute with pseudonyms like "User_i"
df['Name'] = ['User_' + str(i) for i in range(1, len(df) + 1)]

# Function to redact email addresses
def redact_email(email):
    username, domain = email.split('@')
    redacted_username = username[0] + '*'*(len(username)-2) + username[-1]
    redacted_domain = domain[0] + '*'*(len(domain)-2) + domain[-1]
    return redacted_username + '@' + redacted_domain

# Redact 'Email' column in the dataframe
df['Email'] = df['Email'].apply(redact_email)

# Function to generalize age
def generalize_age(age):
    age_range = str(age)[0] + '0s'
    return age_range

# Generalize 'Age' column in the dataframe
df['Age'] = df['Age'].apply(generalize_age)

# Function to add random noise
def add_random_noise(contact_number):
    noise = str(random.randint(10000, 99999))
    return str(contact_number)[:-5] + noise

# Add random noise to 'Contact Number' column in the dataframe
df['Contact Number'] = df['Contact Number'].apply(add_random_noise)

# Print the first 5 entries of the modified dataframe
print('Modified dataset')
print(df.head())
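
Two details of the generated add_random_noise function are worth noting. It replaces the last five digits of the number rather than adding to it, and because it draws the noise from random.randint(10000, 99999), the first noise digit is never 0. The sketch below is one possible variant, not part of the generated response: it draws from the full range 00000 to 99999 with zero padding and accepts an optional seeded generator so that test runs can be repeated. The function name add_random_noise_padded and the rng parameter are hypothetical.

```python
import random

def add_random_noise_padded(contact_number, rng=random):
    """Replace the last five digits with five random digits (00000-99999)."""
    digits = str(contact_number)
    noise = f"{rng.randrange(100000):05d}"  # zero-padded to exactly five digits
    return digits[:-5] + noise

# A seeded generator gives the same noise on every run, which helps when testing
rng = random.Random(42)
noisy = add_random_noise_padded(1234567890, rng)
print(noisy)       # the first five digits, 12345, are unchanged
print(len(noisy))  # 10
```

For actual anonymization you would normally omit the seed, for example df['Contact Number'] = df['Contact Number'].apply(add_random_noise_padded), so that the noise cannot be regenerated.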