Synthetic Data Generation using Python

Several techniques exist for generating synthetic data, and the right one depends on the use case. This section implements some of the most common approaches using Python libraries.

1. Generating synthetic data using the Faker Python library

The Faker library in Python is used to generate realistic, randomized fake data for various purposes such as testing, populating databases, or creating sample datasets. It provides a simple and customizable way to generate fake names, addresses, emails, and other types of data to simulate real-world scenarios in a controlled and privacy-preserving manner.

Step 1: Install the Faker library using the command:

!pip install faker

Step 2: Import the Faker library and generate artificial personal information about people.

Python3

import pandas as pd
from faker import Faker
 
fake = Faker()
 
# Generate a synthetic DataFrame with columns like name, email, and job title of people
data = {'Name': [fake.name() for _ in range(100)],
        'Email': [fake.email() for _ in range(100)],
        'Job': [fake.job() for _ in range(100)]}
 
df = pd.DataFrame(data)
print(df.head())

Output:

           Name                    Email                          Job
0  Jill Morales     kemptina@example.org          Animal nutritionist
1   Jimmy Lynch     nicole46@example.com  Nature conservation officer
2   Rachel Dean    kenneth32@example.org            Buyer, industrial
3    Corey Reid   brandilong@example.net         Electronics engineer
4  Mark Ramirez  lisaperkins@example.com               Science writer

2. Generating synthetic data using the Scikit-Learn library

Scikit-learn is a powerful machine learning library. It can also generate synthetic datasets with specific characteristics, especially for classification problems.

Step 1: Install the scikit-learn library using the command:

!pip install scikit-learn

Step 2: Import the scikit-learn library and generate artificial data for a classification problem.

Python3

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
 
# Generate synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, n_classes=2, random_state=42)
 
# Create a DataFrame from the NumPy arrays
columns = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(data=X, columns=columns)
df['target'] = y
 
# Print the first few rows of the DataFrame
print(df.head())

Output:

   feature_0  feature_1  feature_2  feature_3  feature_4  target
0  -0.065300  -0.717214   0.393952  -0.934473   1.681514       0
1   0.567015  -0.044606   1.612851  -1.350174   2.488878       0
2  -0.247215  -0.650569  -0.743500  -1.214190   0.841110       0
3   1.145870   0.974224   1.562506  -2.277010   2.276521       1
4   0.599605  -0.427545   2.374472  -1.503510   3.604959       0
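One of the characteristics that `make_classification` lets you control is class balance. The sketch below assumes a 90/10 split is wanted; the `weights` values are chosen purely for illustration, and `flip_y=0.0` turns off random label flipping so the class counts stay close to the requested proportions.

```python
import numpy as np
from sklearn.datasets import make_classification

# Generate an imbalanced binary classification dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_classes=2,
    weights=[0.9, 0.1],  # assumed class proportions, for illustration only
    flip_y=0.0,          # no random label noise
    random_state=42,
)

# Count how many samples fall into each class
counts = np.bincount(y_imb)
print(counts)
```

A dataset like this can be handy for trying out techniques that target imbalanced classification before applying them to real data.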

3. Generating synthetic data using bootstrap sampling with the NumPy library

Bootstrap sampling is a resampling technique that randomly draws samples with replacement from a dataset. It can be implemented with the NumPy library in Python.

Step 1: Install the NumPy library using the command:

!pip install numpy

Step 2: Import the NumPy library and generate artificial integer data.

Python3

import numpy as np
import pandas as pd
 
# Original dataset
original_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
 
# Bootstrap sampling function
def bootstrap_sample(data, num_samples=1000):
    synthetic_data = np.random.choice(data, size=(num_samples, len(data)), replace=True)
    return synthetic_data
 
# Generate synthetic data
synthetic_data = bootstrap_sample(original_data)
 
# Create a DataFrame from the synthetic data
num_columns = synthetic_data.shape[1]
column_names = [f'Column_{i+1}' for i in range(num_columns)]
synthetic_df = pd.DataFrame(synthetic_data, columns=column_names)
 
# Print the first few rows of the synthetic DataFrame
print(synthetic_df.head())

Output:

   Column_1  Column_2  Column_3  Column_4  Column_5  Column_6  Column_7  \
0         9         8         6         9         5         2        10   
1         9         7         6         5        10         9         8   
2         6         3         1         3         1        10         1   
3         8        10         4         3         4         3         9   
4         4         6         6         6        10         9         2   
   Column_8  Column_9  Column_10  
0         2         6          2  
1         7         6          2  
2         4         1          5  
3         8         1          4  
4         4         4          6
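Each row above is one bootstrap resample of the original ten values. A common use of such resamples is to estimate the variability of a statistic. The sketch below is an illustration rather than part of the original example: it uses NumPy's `default_rng` generator with an arbitrary seed to estimate the standard error and a 95% percentile interval for the mean.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility
original_data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Draw 1000 resamples, each the same size as the original data, with replacement
resamples = rng.choice(original_data, size=(1000, original_data.size), replace=True)

# Compute the mean of every resample
boot_means = resamples.mean(axis=1)

# The spread of the resample means estimates the standard error of the mean
std_error = boot_means.std(ddof=1)
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean: {original_data.mean():.2f}")
print(f"bootstrap standard error: {std_error:.2f}")
print(f"95% percentile interval: ({low:.2f}, {high:.2f})")
```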

4. Generating synthetic data using a Gaussian statistical model with the NumPy library

Gaussian statistical models generate synthetic data by assuming a normal distribution, defined by its mean and standard deviation. This can be done with the NumPy library in Python.

Step 1: Install the NumPy library using the command:

!pip install numpy

Step 2: Import the NumPy library and generate artificial Gaussian-distributed data.

Python3

import numpy as np
import pandas as pd
 
# Parameters for the normal distribution (mean and standard deviation)
means = [5, 10, 15, 20, 25]  
std_devs = [2, 2, 2, 2, 2]  
 
# Generate synthetic data points using random sampling from a gaussian distribution for each input variable
num_samples = 1000
synthetic_data = np.random.normal(means, std_devs, size=(num_samples, len(means)))
 
# Create a DataFrame from the synthetic data; 'Output' is the row-wise sum of the inputs
column_names = [f'Input_{i+1}' for i in range(len(means))] + ['Output']
synthetic_df = pd.DataFrame(data=np.hstack([synthetic_data, synthetic_data.sum(axis=1, keepdims=True)]), 
                            columns=column_names)
 
# Print the first few rows of the synthetic DataFrame
print(synthetic_df.head())

Output:

    Input_1    Input_2    Input_3    Input_4    Input_5     Output
0  4.780425  13.162034  16.231576  21.169664  26.686617  82.030316
1  7.178791   8.808152  14.307585  14.125770  25.779506  70.199805
2  5.036429   9.160681  13.157213  20.185858  25.250525  72.790706
3  6.982617   8.553333  12.821710  20.714012  24.872681  73.944353
4  3.843489   9.966772  14.924961  20.377723  29.385274  78.498219
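In the example above the means and standard deviations are set by hand. When some real observations are available, the same parameters can instead be estimated from them. The sketch below assumes a small, made-up `observed` array and treats each column as an independent normal variable, which ignores any correlation between columns.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Hypothetical observed data: 5 rows, 2 numeric columns (values invented for illustration)
observed = np.array([
    [5.1, 9.8],
    [4.7, 10.4],
    [5.6, 9.9],
    [4.9, 10.1],
    [5.2, 10.3],
])

# Estimate per-column mean and sample standard deviation
mu = observed.mean(axis=0)
sigma = observed.std(axis=0, ddof=1)

# Sample synthetic rows; mu and sigma broadcast across the columns
synthetic = rng.normal(loc=mu, scale=sigma, size=(1000, observed.shape[1]))

print("estimated means:", mu)
print("synthetic means:", synthetic.mean(axis=0))
```

With enough samples, the column means of the synthetic data should land close to the estimated means.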

Applications of synthetic data in machine learning

Synthetic data plays a crucial role in several aspects of the machine learning process, listed below:

Synthetic data supports data augmentation: adding diverse synthetic examples to an existing dataset can help improve model performance.
Synthetic data aids anomaly detection by creating realistic outliers, allowing machine learning models to better identify and handle unexpected patterns in real-world data.
Synthetic data can address biases present in real data, promoting fairness and reducing algorithmic bias in machine learning models.
Synthetic data supplements real-world datasets, providing additional examples for model training, especially in scenarios where obtaining sufficient authentic data is challenging.
Synthetic data is valuable for simulating rare or extreme events, allowing machine learning models to be trained effectively on scenarios that may have limited occurrences in real-world data.
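As a rough illustration of the augmentation and rare-event points above, the sketch below combines resampling with replacement and small Gaussian noise to create extra examples of a rare class. The dataset, the class sizes, and the jitter scale of 0.1 are all assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

# Hypothetical imbalanced dataset: 95 "normal" rows and 5 "rare" rows, 2 features each
X_normal = rng.normal(loc=0.0, scale=1.0, size=(95, 2))
X_rare = rng.normal(loc=4.0, scale=0.5, size=(5, 2))

# Resample the rare rows with replacement, then add small Gaussian jitter
n_new = 90
idx = rng.integers(0, len(X_rare), size=n_new)
jitter = rng.normal(loc=0.0, scale=0.1, size=(n_new, X_rare.shape[1]))
X_rare_synth = X_rare[idx] + jitter

# Combine real and synthetic rows into one balanced training set
X_aug = np.vstack([X_normal, X_rare, X_rare_synth])
y_aug = np.concatenate([np.zeros(95, dtype=int), np.ones(5 + n_new, dtype=int)])

print(X_aug.shape)
print(np.bincount(y_aug))  # [95 95]
```

The synthetic rows stay very close to the five real rare examples, so this kind of augmentation balances the classes but does not add genuinely new information.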

Limitations of synthetic data

While synthetic data offers numerous advantages in machine learning, it also has limitations that need to be considered:

Synthetic data may lead to overfitting, where the model performs well on the synthetic data but poorly on real-world data. This is because the synthetic data may not adequately represent the full range of real-world data variations.
Creating accurate and realistic synthetic data often requires domain expertise to understand the underlying patterns and relationships in the real data.
Generating high-quality synthetic data can be computationally expensive, especially for complex data types like images or natural language.
While synthetic data can protect privacy by not using real-world data, there is still a risk of re-identifying individuals based on the synthetic data, especially if it contains sensitive attributes.

Conclusion

In conclusion, synthetic data is a valuable asset in machine learning, offering solutions to data scarcity, privacy concerns, and biased datasets. Through generation methods such as random sampling, bootstrapping, and advanced techniques like GANs, synthetic data replicates the statistical characteristics of real-world data, enabling researchers and practitioners to conduct experiments, test algorithms, and develop models without compromising sensitive information. While synthetic data offers cost-effectiveness, bias mitigation, and privacy preservation, it is not without limitations. Its potential for overfitting, the need for domain expertise in generation, computational costs, and re-identification risks all require careful consideration.

What is synthetic data?

In data science, synthetic data refers to artificially generated data that replicates the statistical characteristics and patterns of real-world data. It serves various purposes in data analysis, machine learning, and deep learning. It enables machine learning researchers and data scientists to conduct experiments, test algorithms, and develop models without exposing sensitive or private information. Synthetic data is created using algorithms and mathematical models that simulate the complexities found in real datasets. It can also be used to augment existing datasets, especially when the existing data is limited or biased. Furthermore, it facilitates the assessment of model robustness, generalization, and performance under various scenarios.
