Gen AI Data Generation & Augmentation with Mostly.ai
In this section, we will use generative AI to learn the patterns in a given data set and generate synthetic data samples from it.
Objectives
We will use a popular tool, Mostly.ai, to create synthetic data samples to augment a CSV data set.
Data Set
You will use a data set that includes insurance records. The data set is available at the following link:
Insurance Dataset
This data set is a cleaned-up version of the Medical Insurance Price Prediction data set, available under the CC0 1.0 Universal License on the Kaggle website.
Steps
Download the data set
The first step is to download the data set to your machine. You will need to upload this file to the interface in a subsequent step. Select the link provided in the Data Set section to download the data set.
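Before uploading, it can help to confirm that the file downloaded intact. The following is a minimal sketch using only the Python standard library. The column names and values in the inline sample are assumptions for illustration (the tutorial does not list the columns), and the helper name summarize_csv is hypothetical.

```python
import csv
import io


def summarize_csv(handle):
    """Return the row count, column names, and number of empty cells per column."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # DictReader yields None for cells that are absent from a short row.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": rows, "columns": columns, "missing": missing}


# Hypothetical sample standing in for the downloaded file (assumed columns).
SAMPLE = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.92
18,male,33.77,1,no,southeast,1725.55
28,male,,3,no,southeast,4449.46
"""

summary = summarize_csv(io.StringIO(SAMPLE))
print(summary["rows"], "rows,", len(summary["columns"]), "columns")
print("empty cells per column:", summary["missing"])
```

To run the same check on the real download, open the file with open(path, newline="") and pass the handle to summarize_csv, where path is wherever you saved the CSV.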
Open the website
Select the following link to open the mostly.ai website and interface. https://mostly.ai/
Create an account
You can create an account on this website free of charge, or you can simply log in using your Gmail ID. After you log in, you’ll see the following interface.
Upload the data set
Upload the CSV file of the data set to the interface by using the upload option available on the console. After you upload the data set, you will see its filename on the console. Then select Proceed.
Data configuration settings
With these settings, you can modify the category of an attribute or choose whether to include a parameter in the augmentation process. For the purposes of this section, do not change these settings. Simply select Configure models to go to the model configuration settings.
Model configuration settings
You can modify the maximum training time, number of epochs, sample size, and other settings to generate the best possible model for your requirements. For the purposes of this section, use the default settings. When you are done with the settings, select Start Training.
Model training
After the model training completes (you can follow its progress on screen while it runs), you will see a result similar to the following screen capture.
Model Report
Click the Model hyperlink to open the Quality Assurance Report in a separate tab. The page looks similar to the following screen capture.
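The report judges how closely synthetic data follows the original. As a rough, hand-rolled illustration of that idea, the sketch below compares the mean of one numeric column across two tables. It is not how Mostly.ai computes its report. The column names, the inline values, and the helper names are all assumptions for illustration. You can run a check like this on the real files once you have downloaded the synthetic data in the final step.

```python
import csv
import io
from statistics import mean


def numeric_column(handle, name):
    """Read one column from a CSV handle as floats, skipping blank cells."""
    return [float(row[name]) for row in csv.DictReader(handle) if row[name].strip()]


def relative_mean_gap(original, synthetic):
    """Absolute difference of the two means, relative to the original mean.

    Assumes the original mean is non-zero.
    """
    base = mean(original)
    return abs(mean(synthetic) - base) / abs(base)


# Hypothetical stand-ins for the original and synthetic CSV files.
ORIGINAL = "age,charges\n20,1000\n40,3000\n60,5000\n"
SYNTHETIC = "age,charges\n22,1100\n38,2900\n63,5300\n"

gap = relative_mean_gap(
    numeric_column(io.StringIO(ORIGINAL), "charges"),
    numeric_column(io.StringIO(SYNTHETIC), "charges"),
)
print(f"relative gap in mean charges: {gap:.1%}")
```

A small gap in one column's mean is only a weak signal. It says nothing about the shape of the distribution or about relationships between columns, so treat it as a quick sanity check rather than a replacement for the report.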
Create Synthetic Data
Return to the main page shown in the Model training step. In the upper right corner, select Generate Data.
Generate Data
You can select the number of samples you want to generate, as well as modify the statistical nature of the data created by choosing the appropriate parameters. For the purposes of this section, keep all the settings at their default values, and select Start Generation to create the required synthetic data.
Start Generation
After you select Start Generation, you will see a progress report as shown below.
When generation finishes, a pop-up notifies you that the synthetic data is ready.
Download Synthetic Data
In the upper right corner, select Download Synthetic Data and choose the file format you want.
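With the synthetic file downloaded, the augmentation itself amounts to appending the synthetic rows to the original ones. The following is a minimal sketch of that step, assuming you chose CSV as the download format and that both files share the same columns. The inline data, the added source column, and the helper name augment are hypothetical.

```python
import csv
import io


def augment(original_handle, synthetic_handle, out_handle):
    """Write the original rows followed by the synthetic rows, tagging each with its source."""
    original = csv.DictReader(original_handle)
    synthetic = csv.DictReader(synthetic_handle)
    if not original.fieldnames or original.fieldnames != synthetic.fieldnames:
        raise ValueError("original and synthetic files must share the same header")
    writer = csv.DictWriter(out_handle, fieldnames=list(original.fieldnames) + ["source"])
    writer.writeheader()
    counts = {"original": 0, "synthetic": 0}
    for label, reader in (("original", original), ("synthetic", synthetic)):
        for row in reader:
            row["source"] = label  # keep track of where each row came from
            writer.writerow(row)
            counts[label] += 1
    return counts


# Hypothetical stand-ins for the original and downloaded synthetic files.
ORIGINAL = "age,charges\n20,1000\n40,3000\n"
SYNTHETIC = "age,charges\n22,1100\n38,2900\n63,5300\n"

out = io.StringIO()
counts = augment(io.StringIO(ORIGINAL), io.StringIO(SYNTHETIC), out)
lines = out.getvalue().splitlines()
print(counts)
print(lines[0])
```

Keeping a source column makes it easy to filter the synthetic rows back out later, for example to evaluate a model on original records only.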