Synthetic Data Generation using Python
Several techniques can be used to generate synthetic data, depending on the use case. Here we implement some of the most common ones using Python libraries.
1. Generating synthetic data using the Faker Python library
The Faker library in Python is used to generate realistic, randomized fake data for various purposes such as testing, populating databases, or creating sample datasets. It provides a simple and customizable way to generate fake names, addresses, emails, and other types of data to simulate real-world scenarios in a controlled and privacy-preserving manner.
Step 1: Install the Faker library using the command:
!pip install faker
Step 2: Load the Faker library and generate synthetic personal information about people.
Python3
import pandas as pd
from faker import Faker

fake = Faker()

# Generate a synthetic DataFrame with columns like name, email, and job title of people
data = {
    'Name': [fake.name() for _ in range(100)],
    'Email': [fake.email() for _ in range(100)],
    'Job': [fake.job() for _ in range(100)],
}
df = pd.DataFrame(data)
print(df.head())
Output:
Name Email Job
0 Jill Morales kemptina@example.org Animal nutritionist
1 Jimmy Lynch nicole46@example.com Nature conservation officer
2 Rachel Dean kenneth32@example.org Buyer, industrial
3 Corey Reid brandilong@example.net Electronics engineer
4 Mark Ramirez lisaperkins@example.com Science writer
2. Generating synthetic data using the Scikit-Learn library
Scikit-learn is a powerful machine learning library. It can also generate synthetic datasets with specific characteristics, especially for classification problems.
Step 1: Install the scikit-learn library using the command:
!pip install scikit-learn
Step 2: Load the scikit-learn library and generate synthetic data for a classification problem.
Python3
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification

# Generate synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, n_classes=2, random_state=42)

# Create a DataFrame from the NumPy arrays
columns = [f"feature_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(data=X, columns=columns)
df['target'] = y

# Display the first few rows of the DataFrame
df.head()
Output:
feature_0 feature_1 feature_2 feature_3 feature_4 target
0 -0.065300 -0.717214 0.393952 -0.934473 1.681514 0
1 0.567015 -0.044606 1.612851 -1.350174 2.488878 0
2 -0.247215 -0.650569 -0.743500 -1.214190 0.841110 0
3 1.145870 0.974224 1.562506 -2.277010 2.276521 1
4 0.599605 -0.427545 2.374472 -1.503510 3.604959 0
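The "specific characteristics" mentioned above include the class balance. As a hedged sketch, the `weights` parameter of `make_classification` can be used to produce an imbalanced dataset. The 90/10 split and `flip_y=0.0` (no label noise) are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Generate an imbalanced binary dataset: roughly 90% class 0, 10% class 1.
# The weights and flip_y values are illustrative choices, not from the article.
X, y = make_classification(
    n_samples=1000,
    n_features=5,
    n_informative=3,
    n_classes=2,
    weights=[0.9, 0.1],
    flip_y=0.0,
    random_state=42,
)

# Count how many samples fall into each class
counts = np.bincount(y)
print(counts)
```

A dataset like this can be handy for trying out resampling strategies or metrics other than accuracy before real imbalanced data is available.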
3. Generating synthetic data using bootstrap sampling with the NumPy library
Bootstrap sampling is a resampling technique that randomly draws samples with replacement from a dataset. It can be implemented with the NumPy library in Python.
Step 1: Install the NumPy library using the command:
!pip install numpy
Step 2: Load the NumPy library and generate synthetic integer data.
Python3
import numpy as np
import pandas as pd

# Original dataset
original_data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Bootstrap sampling function
def bootstrap_sample(data, num_samples=1000):
    synthetic_data = np.random.choice(data, size=(num_samples, len(data)), replace=True)
    return synthetic_data

# Generate synthetic data
synthetic_data = bootstrap_sample(original_data)

# Create a DataFrame from the synthetic data
num_columns = synthetic_data.shape[1]
column_names = [f'Column_{i+1}' for i in range(num_columns)]
synthetic_df = pd.DataFrame(synthetic_data, columns=column_names)

# Display the first few rows of the synthetic DataFrame
synthetic_df.head()
Output:
Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 \
0 9 8 6 9 5 2 10
1 9 7 6 5 10 9 8
2 6 3 1 3 1 10 1
3 8 10 4 3 4 3 9
4 4 6 6 6 10 9 2
Column_8 Column_9 Column_10
0 2 6 2
1 7 6 2
2 4 1 5
3 8 1 4
4 4 4 6
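Each row above is one bootstrap resample of the original ten values. A common use of such resamples is to estimate the uncertainty of a statistic. The sketch below assumes the same toy data and uses the percentile method to build a 95% interval for the mean. The seed and the choice of the percentile method are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded so the sketch is repeatable
original_data = np.arange(1, 11)  # the same toy values, 1 through 10

# Draw 1000 resamples with replacement, each as large as the original data
resamples = rng.choice(original_data, size=(1000, original_data.size), replace=True)

# The mean of each resample approximates the sampling distribution of the mean
boot_means = resamples.mean(axis=1)

# Percentile bootstrap: take the middle 95% of the resampled means
lower, upper = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean: {original_data.mean():.2f}")
print(f"95% bootstrap interval for the mean: [{lower:.2f}, {upper:.2f}]")
```

Note that bootstrapping only recombines values already present in the data, so it cannot produce observations outside the original range.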
4. Generating synthetic data using a Gaussian statistical model with the NumPy library
Gaussian statistical models generate synthetic data by assuming a normal distribution, defined by its mean and standard deviation. This can be done with the NumPy library in Python.
Step 1: Install the NumPy library using the command:
!pip install numpy
Step 2: Load the NumPy library and generate Gaussian-distributed synthetic data.
Python3
import numpy as np
import pandas as pd

# Parameters for the normal distribution (mean and standard deviation)
means = [5, 10, 15, 20, 25]
std_devs = [2, 2, 2, 2, 2]

# Generate synthetic data points by sampling from a Gaussian distribution for each input variable
num_samples = 1000
synthetic_data = np.random.normal(means, std_devs, size=(num_samples, len(means)))

# Create a DataFrame from the synthetic data; 'Output' is the row-wise sum of the inputs
column_names = [f'Input_{i+1}' for i in range(len(means))] + ['Output']
synthetic_df = pd.DataFrame(
    data=np.hstack([synthetic_data, synthetic_data.sum(axis=1, keepdims=True)]),
    columns=column_names,
)

# Display the first few rows of the synthetic DataFrame
synthetic_df.head()
Output:
Input_1 Input_2 Input_3 Input_4 Input_5 Output
0 4.780425 13.162034 16.231576 21.169664 26.686617 82.030316
1 7.178791 8.808152 14.307585 14.125770 25.779506 70.199805
2 5.036429 9.160681 13.157213 20.185858 25.250525 72.790706
3 6.982617 8.553333 12.821710 20.714012 24.872681 73.944353
4 3.843489 9.966772 14.924961 20.377723 29.385274 78.498219
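The example above hard-codes the means and standard deviations. In practice these parameters are often estimated from observed data first. The following is a minimal sketch under that assumption. The `observed` values are made up for illustration and stand in for a real numeric column.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# Hypothetical observed values standing in for a real numeric column
observed = np.array([4.1, 5.3, 6.0, 4.8, 5.5, 5.1, 6.2, 4.6, 5.0, 5.4])

# Fit the Gaussian model: estimate its two parameters from the data
mu = observed.mean()
sigma = observed.std(ddof=1)  # sample standard deviation

# Sample synthetic values from the fitted distribution
synthetic = rng.normal(loc=mu, scale=sigma, size=1000)

print(f"observed:  mean={mu:.2f}, std={sigma:.2f}")
print(f"synthetic: mean={synthetic.mean():.2f}, std={synthetic.std(ddof=1):.2f}")
```

Comparing the two printed lines is a quick sanity check that the synthetic sample preserves the summary statistics. It does not show that a normal distribution is the right model for the data.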
Applications of synthetic data in machine learning
Synthetic data plays a crucial role in several aspects of the machine learning process, listed below:
- It supports data augmentation by adding diverse examples to existing datasets, which can improve model performance.
- Synthetic data aids anomaly detection by creating realistic outliers and anomalies, allowing machine learning models to better identify and handle unexpected patterns or anomalies in real-world data.
- Synthetic data can be used to address biases present in real data, promoting fairness and reducing algorithmic bias in machine learning models.
- Synthetic data supplements real-world datasets, providing additional examples for model training, especially in scenarios where obtaining sufficient authentic data is challenging.
- Synthetic data is valuable for simulating rare or extreme events, allowing machine learning models to be trained effectively on scenarios that may have limited occurrences in real-world data.
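As an illustration of the anomaly detection and rare event points in the list above, the sketch below injects a few synthetic outliers into otherwise normal data and checks whether a simple z-score rule finds them. The distributions, the counts, and the threshold of 3 are all assumptions chosen for this example. A real detector and real anomalies would be more subtle.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the sketch is repeatable

# 990 "normal" readings plus 10 injected synthetic anomalies (illustrative values)
normal_points = rng.normal(loc=50.0, scale=5.0, size=990)
outliers = rng.uniform(low=90.0, high=120.0, size=10)

values = np.concatenate([normal_points, outliers])
labels = np.concatenate([np.zeros(990, dtype=int), np.ones(10, dtype=int)])

# Simple z-score rule: flag points more than 3 standard deviations from the mean
z_scores = np.abs((values - values.mean()) / values.std())
flagged = z_scores > 3

recovered = int((flagged & (labels == 1)).sum())
print(f"injected: {labels.sum()}, flagged: {flagged.sum()}, recovered: {recovered}")
```

Because the injected points carry known labels, the detector can be scored directly, which is rarely possible with real anomalies.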
Limitations of synthetic data
While synthetic data offers numerous advantages in machine learning, it also has limitations that need to be considered:
- Synthetic data may lead to overfitting, where the model performs well on the synthetic data but poorly on real-world data. This is because the synthetic data may not adequately represent the full range of real-world data variations.
- Creating accurate and realistic synthetic data often requires domain expertise to understand the underlying patterns and relationships in the real data.
- Generating high-quality synthetic data can be computationally expensive, especially for complex data types like images or natural language.
- While synthetic data can protect privacy by not using real-world data, there is still a risk of re-identifying individuals based on the synthetic data, especially if it contains sensitive attributes.
Conclusion
In conclusion, synthetic data proves to be a valuable asset in the field of machine learning, offering solutions to data scarcity, privacy concerns, and biased datasets. Through various generation methods such as random sampling, bootstrapping, and advanced techniques like GANs, synthetic data replicates the statistical characteristics of real-world data, enabling researchers and practitioners to conduct experiments, test algorithms, and develop models without compromising sensitive information. While synthetic data offers cost-effectiveness, bias mitigation, and privacy preservation, it is not without limitations. Its potential for overfitting, the need for domain expertise in generation, computational costs, and re-identification risks all require careful consideration.
What is synthetic data?
In data science, synthetic data is artificially generated data that replicates the statistical characteristics and patterns of real-world data. It serves various purposes in data analysis, machine learning, and deep learning. It enables machine learning researchers and data scientists to conduct experiments, test algorithms, and develop models without exposing sensitive or private information. Synthetic data is created using algorithms and mathematical models that simulate the complexities found in real datasets. It can also be used to augment existing datasets, especially when the existing data is limited or biased. Furthermore, it facilitates the assessment of model robustness, generalization, and performance under various scenarios.