
Every Day Prompt Engineering Part 7: Data Analysis

A somewhat overlooked feature of ChatGPT is its ability to analyse data: it can read uploaded files, analyse them, and return visualizations, generated files, and written findings. In this blog post, we will take a look at how to use ChatGPT for data analysis. To demonstrate the power of Advanced Data Analysis, we will use data from Free Data Sets.
For illustrative purposes we will utilise a dataset about global fossil fuel emissions from 1750 to 2021. The data set is quite complex, with more than 50,000 rows and more than 70 columns. You will notice that you can upload multiple files at once; ChatGPT tries to understand how they relate to each other. At the time of writing, you can upload up to 750 MB in total, with a per-file limit of 25 MB. After making the files available to ChatGPT, we explain what they are and what we initially want to do with the data. In our case, it’s about global fossil fuel emissions, and we want a plot of how the emissions developed over time:
The CSV file contains data about global fossil fuel emissions from 1750 to 2021. The XLS file contains a description of the columns. Can you create a graph showing how the emissions developed by country over the observation period?
When prompted, ChatGPT reads the content of the files and starts generating Python code that it executes to analyse them. It describes what it finds and explains how it will draw the graph. In our case, we got this:
import pandas as pd
import matplotlib.pyplot as plt

# emissions_data was loaded from the uploaded CSV in an earlier step, e.g.:
# emissions_data = pd.read_csv('/mnt/data/<uploaded file>.csv')

# Filter the dataset for relevant columns
emissions_filtered = emissions_data[['country', 'year', 'co2']].dropna()
# Since visualizing all countries over such a long period can be complex,
# let's focus on the top emitting countries in the latest available year to simplify the visualization.
latest_year = emissions_filtered['year'].max()
top_emitting_countries = emissions_filtered[emissions_filtered['year'] == latest_year].nlargest(10, 'co2')['country']
# Filter the dataset to include only the top emitting countries
emissions_top_countries = emissions_filtered[emissions_filtered['country'].isin(top_emitting_countries)]
# Pivot the data for visualization
emissions_pivot = emissions_top_countries.pivot(index='year', columns='country', values='co2')
# Plotting
plt.figure(figsize=(14, 8))
plt.plot(emissions_pivot, marker='', linewidth=2)
plt.title('Development of CO2 Emissions by Country (Top 10 in 2021)', fontsize=16)
plt.xlabel('Year', fontsize=14)
plt.ylabel('CO2 Emissions (million tonnes)', fontsize=14)
plt.legend(emissions_pivot.columns, loc='upper left', title='Country')
plt.grid(True)
plt.yscale('log')
plt.tight_layout()
# Save the figure
plt_path = '/mnt/data/co2_emissions_development_by_country.png'
plt.savefig(plt_path)
plt_path
After our initial analysis, we can dive deeper into the data. Let’s assume we want to know how the countries of the European Union fared over the years:
Can you create a pie chart for the CO2 emissions of the European Union Countries from 2010 to 2021?
Note that we didn’t tell ChatGPT which countries are in the European Union. It figures this out by itself and even accounts for Brexit:
# List of European Union (EU) countries based on common membership around 2021 (excluding UK due to Brexit)
eu_countries = [
   'Austria', 'Belgium', 'Bulgaria', 'Croatia', 'Cyprus', 'Czechia', 'Denmark',
   'Estonia', 'Finland', 'France', 'Germany', 'Greece', 'Hungary', 'Ireland',
   'Italy', 'Latvia', 'Lithuania', 'Luxembourg', 'Malta', 'Netherlands',
   'Poland', 'Portugal', 'Romania', 'Slovakia', 'Slovenia', 'Spain', 'Sweden'
]
# Filter dataset for EU countries and the years 2010 to 2021
emissions_eu = emissions_filtered[
   (emissions_filtered['country'].isin(eu_countries)) &
   (emissions_filtered['year'] >= 2010) &
   (emissions_filtered['year'] <= 2021)
]
# Aggregate CO2 emissions by country
eu_co2_emissions = emissions_eu.groupby('country')['co2'].sum()
# Plotting the pie chart
plt.figure(figsize=(10, 8))
plt.pie(eu_co2_emissions, labels=eu_co2_emissions.index, autopct='%1.1f%%', startangle=140)
plt.title('CO2 Emissions of EU Countries (2010-2021)', fontsize=16)
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
# Save the figure
pie_chart_path = '/mnt/data/eu_co2_emissions_pie_chart.png'
plt.savefig(pie_chart_path)
pie_chart_path
Now if we take a look at the graph, it looks almost good, but not great: some of the labels are too close to each other and difficult to read. You can, for example, ask ChatGPT to improve this by adding a coloured legend. For now, however, we’ll continue to dive into the data and take a look at how Germany’s emissions changed over the years:
Can you create a bar chart that depicts the development of the fraction of CO2 emissions for Germany?
Notice that since we didn’t tell ChatGPT we wanted to see Germany’s contribution to the emissions of the European Union, it automatically uses the whole data set:
To create a bar chart depicting the development of the fraction of CO2 emissions for Germany, I'll focus on illustrating the annual CO2 emissions as a fraction of Germany's total emissions over the specified time range. This will involve calculating the fraction of each year's emissions relative to the total emissions for all observed years in the dataset for Germany.
Let's proceed with extracting Germany's CO2 emissions data, calculating the yearly fractions, and then creating the bar chart. In the end, we got the bar chart with Germany’s contribution to the worldwide emissions. Now, we want to get some understanding, when Germany will eventually reach a point with no emissions at all:
Do a trend analysis for the Germany data from the last bar chart. When can we expect to reach zero?
ChatGPT starts with a linear regression model and calculates the point in time when to reach zero:
To analyze the trend of CO2 emissions for Germany and estimate when emissions might reach zero, we can use a linear regression model on the annual CO2 emissions data. This approach will allow us to project the trend into the future and estimate when the emissions might approach zero based on the historical data.
Unfortunately, we won’t reach zero before 2100. Again, see how ChatGPT makes assumptions when not told otherwise:
Based on the linear regression model fitted to Germany's CO2 emissions data, the trend analysis does not predict a year when CO2 emissions will reach zero within the projected range up to the year 2100. This outcome suggests that the linear model, based on historical data, does not anticipate emissions decreasing to zero under the current trend. It's important to note that a linear regression model may not accurately capture the complexities of CO2 emissions trends, especially considering potential changes in policies, technologies, and economic factors over time. More sophisticated models or analyses, possibly incorporating these factors, might provide a different outlook on when Germany could achieve zero emissions.
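The calculation ChatGPT describes is easy to reproduce by hand: fit a straight line to the annual emissions and extrapolate to where it crosses zero. Here is a minimal sketch with synthetic stand-in data (the real analysis ran on the uploaded dataset, so the numbers are illustrative only):

```python
import numpy as np

# Synthetic stand-in for Germany's annual CO2 emissions (million tonnes),
# declining roughly linearly over time
years = np.arange(1990, 2022)
rng = np.random.default_rng(0)
emissions = 1050 - 8.5 * (years - 1990) + rng.normal(0, 20, len(years))

# Fit a straight line: emissions ~ slope * year + intercept
slope, intercept = np.polyfit(years, emissions, 1)

# The fitted line reaches zero at year = -intercept / slope;
# this is only meaningful when the slope is negative (a downward trend)
zero_year = -intercept / slope
print(f"Slope: {slope:.2f} Mt per year, projected zero year: {zero_year:.0f}")
```

With this toy series the projected zero year lands far in the future, mirroring the behaviour ChatGPT reported for the real data.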
Since ChatGPT tells us that a linear regression model might not accurately capture the trends, we can opt for other models. If you are familiar with other models, you can simply ask ChatGPT to use a specific one. Or you can ask which models are available:
What other models besides the linear regression model can you apply to the data? Give me a list and rank the items by their strength and weaknesses.
ChatGPT provides a list of models that can be applied to the data. We opt for ARIMA (Autoregressive Integrated Moving Average) and let ChatGPT do its work. If you are not familiar with a model (as I am not with ARIMA), remember to double-check ChatGPT’s analysis with people who know how to use the model. You can also ask ChatGPT itself to explain the calculations and to double-check the results. Here you can find the conversation, with the exception of the uploaded files.

Data cleaning, transformation and aggregation

In the examples we have used the data as provided. Depending on the quality of the available data, you might have to clean it first. This requires knowledge about the data, in particular when it is poorly structured (“dirty”). However, you can ask ChatGPT to go through it, restructure it, reformat field types, and prepare the data so that you can work with it.
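The kind of cleanup described above is also easy to express directly in pandas, which is what ChatGPT generates behind the scenes. A sketch on a small, deliberately dirty example:

```python
import pandas as pd

# A deliberately "dirty" sample: stray whitespace, wrong types, missing values
raw = pd.DataFrame({
    'country': [' Germany', 'France ', None, 'Spain'],
    'year': ['2019', '2020', '2021', 'n/a'],
    'co2': ['799.0', '', '675.0', '244.0'],
})

cleaned = raw.copy()
cleaned['country'] = cleaned['country'].str.strip()                # trim whitespace
cleaned['year'] = pd.to_numeric(cleaned['year'], errors='coerce')  # 'n/a' -> NaN
cleaned['co2'] = pd.to_numeric(cleaned['co2'], errors='coerce')    # '' -> NaN
cleaned = cleaned.dropna()                                         # drop incomplete rows
cleaned['year'] = cleaned['year'].astype(int)
print(cleaned)
```

Only the fully valid row survives; the others are dropped because at least one field could not be recovered.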
The main limitation is the size of the data sets: you can’t really process huge files with millions of rows. As mentioned before, you can upload files of about 25 MB and process 100,000 rows of data quite easily. There is also a 60-second processing time limit: if ChatGPT takes longer to process the data, the step will fail.

Additional Capabilities of Advanced Data Analysis

It's important to highlight a few other remarkable capabilities that ChatGPT provides out of the box.
  1. Statistical Analysis: This involves carrying out various statistical tests and analyses, such as hypothesis testing, analyzing correlations, conducting regression analysis, and engaging in more intricate statistical modeling.
  2. Machine Learning: This area allows for the creation and assessment of machine learning models. Activities encompass data preparation, feature engineering, choosing models, training them, evaluating their performance, and making predictions. This topic is comprehensive enough to warrant its own dedicated discussion.
  3. Text Analysis: When dealing with text data, numerous tasks can be undertaken, including text preprocessing, analyzing sentiments, modeling topics, and applying other natural language processing (NLP) methods.
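As a small taste of the first capability, here is the kind of hypothesis test ChatGPT might run, sketched with scipy on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic annual emissions for two hypothetical decades
decade_1 = rng.normal(1000, 30, 10)  # earlier decade, higher mean
decade_2 = rng.normal(900, 30, 10)   # later decade, lower mean

# Two-sample t-test: did the mean emissions differ between the decades?
t_stat, p_value = stats.ttest_ind(decade_1, decade_2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```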

Conclusion

Our exploration shows the robust and often under-appreciated capabilities of ChatGPT in the realm of data analysis. From handling complex datasets like the global fossil fuel emissions from 1750 to 2021 to executing sophisticated statistical, machine learning, and text analysis tasks, ChatGPT demonstrates a remarkable versatility that can be harnessed with simple prompting.

FAQs

  1. How does ChatGPT handle missing or incomplete data in the datasets it analyzes? ChatGPT typically handles missing or incomplete data by either omitting it from the analysis or using imputation techniques to estimate missing values based on available data.
  2. Can ChatGPT provide insights into the accuracy of the models it uses for prediction or trend analysis? The accuracy of predictive models like linear regression or ARIMA is usually evaluated using metrics such as R-squared, mean squared error, or similar statistical measures, although specific methods might vary based on the context and data characteristics.
  3. How does ChatGPT determine the significance of correlations found in data analysis? To determine the significance of correlations, ChatGPT typically uses statistical tests such as the Pearson or Spearman correlation coefficient and assesses p-values to judge the statistical significance of the relationships found in the data.
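To illustrate that last point, the Pearson test mentioned above looks like this in scipy (synthetic, deliberately correlated data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gdp = rng.normal(1000, 200, 100)          # hypothetical GDP figures
co2 = 0.5 * gdp + rng.normal(0, 50, 100)  # emissions loosely tied to GDP

# Pearson correlation coefficient and its two-sided p-value
r, p = stats.pearsonr(gdp, co2)
print(f"r = {r:.2f}, p = {p:.2g}")
```

A small p-value indicates the correlation is unlikely to be due to chance alone.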

