Gen AI Data Generation & Augmentation with Mostly.ai

In this section, we will use generative AI to learn the patterns in a given data set and generate synthetic data samples from it.

Objectives


We will use a popular tool, Mostly.ai, to create synthetic data samples to augment a CSV data set.

Data Set


You will use a data set that includes insurance records. The data set is available at the following link:
Insurance Dataset

This data set is a cleaned-up version of the Medical Insurance Price Prediction data set, available under the CC0 1.0 Universal License on the Kaggle website.

Steps

Download the data set

The first step is to download the data set to your machine. Select the link provided in the Data Set section to download it. You will need to upload this file to the interface in a subsequent step.
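Before uploading, it can help to confirm that the downloaded file is a well-formed CSV. The following is a minimal Python sketch and is not part of the Mostly.ai workflow. The function name summarize_csv, the filename insurance.csv, and the sample rows are assumptions made for illustration only.

```python
import csv
import io


def summarize_csv(handle):
    """Return (column_names, row_count) for CSV text read from an open file handle."""
    reader = csv.reader(handle)
    header = next(reader, [])                    # first row holds the column names
    row_count = sum(1 for row in reader if row)  # blank lines come back as empty lists
    return header, row_count


# Made-up rows with hypothetical column names; not taken from the insurance data set.
sample = io.StringIO("age,smoker,charges\n19,yes,100.5\n33,no,200.0\n")
columns, rows = summarize_csv(sample)
print(columns, rows)

# With the real download (hypothetical filename), the call would look like:
#     with open("insurance.csv", newline="") as f:
#         print(summarize_csv(f))
```

If the column list or row count looks wrong here, the file is worth re-downloading before you upload it.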

Open the website

Select the following link to open the Mostly.ai website and interface: https://mostly.ai/

Create an account

You can create an account on this website free of charge, or you can simply log in using your Gmail ID. After you log in, you’ll see the following interface.

Upload the data set

Upload the CSV file of the data set to the interface by using the upload option available on the console. After you upload the data set, you will see its filename on the console. Then select Proceed.

Data configuration settings

On this page, you can modify the category of an attribute, or choose whether an attribute is included in the augmentation process. For the purposes of this section, do not change these settings. Simply select Configure models to go to the model configuration settings.
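To get a feel for what an attribute's category means, the sketch below labels each CSV column as numeric or categorical. This is a simple heuristic written for illustration, not the detection logic Mostly.ai uses. The function names and the two labels are assumptions.

```python
import csv
import io


def is_number(text):
    """Return True if the text parses as a float."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def guess_column_kinds(handle):
    """Label a column 'numeric' if every non-empty value parses as a number,
    otherwise 'categorical'. Hypothetical labels, for illustration only."""
    reader = csv.DictReader(handle)
    kinds = {name: "numeric" for name in (reader.fieldnames or [])}
    for row in reader:
        for name, value in row.items():
            # Skip empty cells and any stray cells beyond the header width.
            if name in kinds and value and not is_number(value):
                kinds[name] = "categorical"
    return kinds
```

A heuristic like this labels integer-coded categories as numeric, which is the kind of case where you might override an attribute's category by hand.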

Model configuration settings

You can modify the maximum training time, number of epochs, sample size, and other settings to generate the best possible model based on your requirements. For the purpose of this section, use the default settings. When you are done with the settings, select Start Training.

Model training

You can watch the progress while the model trains. After the training completes, you will see an onscreen result similar to the following screen capture.

Model Report


Click the Model hyperlink to open the Quality Assurance Report in a separate tab. The page looks similar to the following screen capture.

Create Synthetic Data


Return to the main page that appeared when model training completed. In the upper-right corner of that page, select Generate Data.

Generate Data

You can select the number of samples you want to generate, as well as modify the statistical nature of the data created by choosing the appropriate parameters. For the purpose of this section, keep all the settings at their default values, and select Start generation to create the required synthetic data.

Start Generation

After you select Start Generation, you will see a progress report as shown below.

When generation finishes, a pop-up notifies you that the data is ready.

Download Synthetic Data


In the upper-right corner, select Download synthetic data and choose the file format you want.
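Once the synthetic file is on your machine, augmenting the original data set means appending the synthetic rows to it. The following is a minimal Python sketch, assuming you downloaded the synthetic data as CSV and that both files have the same columns. The function names and the added source column are hypothetical choices made for illustration.

```python
import csv
import io


def augment_rows(original_handle, synthetic_handle):
    """Return (header, rows) with the synthetic rows appended to the original rows.

    Raises ValueError if the two files do not share the same set of columns.
    """
    original = csv.DictReader(original_handle)
    synthetic = csv.DictReader(synthetic_handle)
    if set(original.fieldnames or []) != set(synthetic.fieldnames or []):
        raise ValueError("original and synthetic files have different columns")
    # Extra "source" column (an assumption) records where each row came from.
    header = list(original.fieldnames or []) + ["source"]
    rows = [dict(row, source="original") for row in original]
    rows += [dict(row, source="synthetic") for row in synthetic]
    return header, rows


def write_rows(handle, header, rows):
    """Write the combined rows as CSV text to an open file handle."""
    writer = csv.DictWriter(handle, fieldnames=header)
    writer.writeheader()
    writer.writerows(rows)
```

Keeping a source column makes it easy to filter the synthetic rows back out later, for example to evaluate a model on original records only.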