Grocery Store Transaction Data Database Design and Creation
Database diagramming
From a grocery store transaction dataset that was available in one Excel spreadsheet, I created a diagram of the columns from the dataset and how they would relate to each other in a database. From there, I separated each of the tables into .CSV files by saving each of the columns needed from the transactions.exl file to .CSV files with the same names as the tables in the diagram on the previous page. No rows were excluded.
Database creation in Python and Jupyter Notebook
To create the database, the following commands were completed in Jupyter Notebook.
Import SQL:
<import sqlite3
con = sqlite3.connect('Sky_Stage_Project2.sqlite') cur = con.cursor()
Create the tables in the diagram above:
# Create table - order_item cur.execute('''CREATE TABLE order_item ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [units_sold] integer)''') con.commit()
# Create table - orders cur.execute('''CREATE TABLE orders ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [customer_id] integer, [purchase_date] integer)''') con.commit()
# Create table - customers cur.execute('''CREATE TABLE customers ([generated_id] INTEGER PRIMARY KEY, [customer_id] integer, [gender] text, [marital_status] text, [homeowner] text, [children] text, [annual_income] text)''') con.commit()
# Create table - products cur.execute('''CREATE TABLE products ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [product_family] text, [purchase_department] text, [product_category] text, [revenue] integer, [purchase_date] integer)''') con.commit()
# Create table - locations cur.execute('''CREATE TABLE locations ([generated_id] INTEGER PRIMARY KEY, [transaction] integer, [customer_id] integer, [city] text, [state] text, [country] text)''') con.commit()
Import Pandas library:
import pandas as pd
Insert the values from the table .CSV files into the tables made in the database:
con = sqlite3.connect('Sky_Stage_Project2.sqlite') cur = con.cursor() read_order_item = pd.read_csv ('order_items.csv') read_order_item.to_sql('order_item', con, if_exists='append', index = False) # Insert the values from the csv file into the table 'order_items' read_orders = pd.read_csv ('orders.csv') read_orders.to_sql('orders', con, if_exists='replace', index = False) # Replace the values from the csv file into the table 'orders' read_customers = pd.read_csv ('customers.csv') read_customers.to_sql('customers', con, if_exists='replace', index = False) # Replace the values from the csv file into the table 'customers' read_products = pd.read_csv ('products.csv') read_products.to_sql('products', con, if_exists='replace', index = False) # Replace the values from the csv file into the table 'products' read_locations = pd.read_csv ('locations.csv') read_locations.to_sql('locations', con, if_exists='replace', index = False) # Replace the values from the csv file into the table 'locations'
View database through running SQL queries in Python
Look to see if the tables were imported correctly by running some SQL commands:
con=sqlite3.connect('Sky_Stage_Project2.sqlite') def x(q): return pd.read_sql_query(q,con) q='''SELECT * FROM order_item ''' df=x(q) df
con=sqlite3.connect('Sky_Stage_Project2.sqlite') def x(q): return pd.read_sql_query(q,con) q='''SELECT * FROM orders ''' df=x(q) df
con=sqlite3.connect('Sky_Stage_Project2.sqlite') def x(q): return pd.read_sql_query(q,con) q='''SELECT * FROM customers ''' df=x(q) df
con=sqlite3.connect('Sky_Stage_Project2.sqlite') def x(q): return pd.read_sql_query(q,con) q='''SELECT * FROM products ''' df=x(q) df
con=sqlite3.connect('Sky_Stage_Project2.sqlite') def x(q): return pd.read_sql_query(q,con) q='''SELECT * FROM locations ''' df=x(q) df
Conduct some simple analyses with SQL
con=sqlite3.connect('Sky_Stage_Project2.sqlite') def x(q): return pd.read_sql_query(q,con) q='''SELECT * FROM products ORDER BY revenue DESC LIMIT 10; ''' df=x(q) df
q='''SELECT product_category, COUNT(*) FROM orders NATURAL JOIN customers NATURAL JOIN products NATURAL JOIN order_item GROUP BY product_category ORDER BY COUNT(*) ASC LIMIT 30; ''' df=x(q) df
q='''SELECT product_category, COUNT(*) FROM orders NATURAL JOIN customers NATURAL JOIN products NATURAL JOIN order_item GROUP BY product_category ORDER BY COUNT(*) DESC LIMIT 30; ''' df=x(q) df