Pandas Introduction

Part II

If you're continuing in the same notebook as in part one, you won't need to import anything extra. If this is a new notebook, you'll need to import the numpy, pandas, matplotlib, and seaborn libraries, as well as the Series and DataFrame objects from the pandas library, like so.

In [53]:
import numpy as np
import pandas as pd
import seaborn as sb

from numpy.random import randn
import matplotlib.pyplot as plt
from matplotlib import rcParams
from pandas import Series, DataFrame

Don't forget to enable inline matplotlib plotting and configure your base figure size and grid style.

In [54]:
%matplotlib inline

# Figsize is represented in inches.
rcParams['figure.figsize'] = 5, 4
sb.set_style('whitegrid')

In the previous demonstration, we covered some of the basics of working with DataFrame objects, basic data munging, and some visualization. In part two, we'll start putting the data to use and generating new, useful values from what we have.

Profit Calculator

This takes two dataframes and joins them on a common key. This is handy if you have sales data on a platform like Google Analytics while keeping the details of your profit margins offline. In this way you can merge them and do your own offline analytics while keeping the details of your specific margins private.

This makes it easy to maintain a single list of your products and bump it against a sales report from any time period. Below we read in the test analytics data, set the sku data type to string (so it matches the string SKUs in the product table we'll merge against), and specify which columns to use.

In [55]:
Sales = pd.read_csv("./Test.csv", parse_dates=True, dtype={"sku": str}, usecols=["date", "price", "sku"])
Sales
Out[55]:
date price sku
0 22 Aug 2013 72 445
1 22 Aug 2013 56 203
2 22 Aug 2013 56 203
3 22 Aug 2013 115 661
4 04 Sep 2013 115 661
5 11 Sep 2013 72 445
6 28 Sep 2013 72 445
7 28 Sep 2013 117 661
8 02 Oct 2013 72 445
9 02 Oct 2013 56 203
10 02 Oct 2013 115 661

Here we create our product database, which maps each product SKU to its cost. I added some spacing in the cost column to make it easier to see which cost maps to which SKU, but that's not necessary.

In [56]:
# Cost of goods sold (COGS)
COGS = DataFrame({'sku':['203','445', '661'], 
                  'cost':[  46,   53,   89]})

COGS
Out[56]:
cost sku
0 46 203
1 53 445
2 89 661

Once we have our two dataframes, we merge them by calling the merge method and passing in both. We merge using "left", which keeps the Sales DataFrame as the primary source, and we look for matches "on" the sku column. This produces a new DF that contains all of the data in one spot.

In [57]:
# Create a new frame by joining the original two, matching sku numbers onto the main DF.
C = pd.merge(Sales, COGS, how='left', on='sku')
print(C)
           date  price  sku  cost
0   22 Aug 2013     72  445    53
1   22 Aug 2013     56  203    46
2   22 Aug 2013     56  203    46
3   22 Aug 2013    115  661    89
4   04 Sep 2013    115  661    89
5   11 Sep 2013     72  445    53
6   28 Sep 2013     72  445    53
7   28 Sep 2013    117  661    89
8   02 Oct 2013     72  445    53
9   02 Oct 2013     56  203    46
10  02 Oct 2013    115  661    89
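
A note on the how='left' argument: it keeps every row from Sales even when a SKU has no match in COGS, filling the missing cost with NaN. A minimal sketch of that behavior (the SKU '999' is hypothetical):

# Hypothetical row with a SKU that does not exist in COGS.
extra = DataFrame({'date': ['05 Oct 2013'], 'price': [30], 'sku': ['999']})
merged = pd.merge(pd.concat([Sales, extra], ignore_index=True), COGS, how='left', on='sku')

# The unmatched row survives the merge with cost set to NaN.
print(merged.tail(1))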

If you want to change the order of the columns to keep related data next to each other, you can simply set the order in a list and pass it as the columns argument when constructing the DF.

In [58]:
Col_Order = ['date', 'sku', 'cost', 'price']
C1 = DataFrame(C, columns=Col_Order)

C1
Out[58]:
date sku cost price
0 22 Aug 2013 445 53 72
1 22 Aug 2013 203 46 56
2 22 Aug 2013 203 46 56
3 22 Aug 2013 661 89 115
4 04 Sep 2013 661 89 115
5 11 Sep 2013 445 53 72
6 28 Sep 2013 445 53 72
7 28 Sep 2013 661 89 117
8 02 Oct 2013 445 53 72
9 02 Oct 2013 203 46 56
10 02 Oct 2013 661 89 115
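
Equivalently, indexing the DF with the list selects the columns in that order, which is a common shorthand:

# Same reordering via bracket selection.
C1 = C[Col_Order]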

So now that we have the cost and the price listed in one DF, how do we calculate the profit and start making use of this data? Simple: index the DF by column name and set up an equation like you might if you were working it out on paper.

In [59]:
# Creates a profit column as the difference between the
# price and the cost.
C['profit'] = C['price'] - C['cost']

# Sets the column order including new column. 
cols = ['date', 'sku', 'cost', 'price', 'profit']

# Creates a new DF and passes in the source DF and desired column order.
Profit = DataFrame(C, columns=cols)
Profit
Out[59]:
date sku cost price profit
0 22 Aug 2013 445 53 72 19
1 22 Aug 2013 203 46 56 10
2 22 Aug 2013 203 46 56 10
3 22 Aug 2013 661 89 115 26
4 04 Sep 2013 661 89 115 26
5 11 Sep 2013 445 53 72 19
6 28 Sep 2013 445 53 72 19
7 28 Sep 2013 661 89 117 28
8 02 Oct 2013 445 53 72 19
9 02 Oct 2013 203 46 56 10
10 02 Oct 2013 661 89 115 26

Now we can take the combined data and start looking a little deeper. Here I want to find what it costs me to do business every day, how much I charged, and how much profit I made. You can do that by grouping your data by a column, in this case the date, and using the sum() method.

In [60]:
sales_by_date = Profit.groupby(Profit['date'])
sales_by_date.sum()
Out[60]:
cost price profit
date
02 Oct 2013 188 243 55
04 Sep 2013 89 115 26
11 Sep 2013 53 72 19
22 Aug 2013 234 299 65
28 Sep 2013 142 189 47
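
Note that the dates above sort lexicographically ("02 Oct 2013" lands before "22 Aug 2013") because the date column is still a string. A minimal sketch of parsing the dates so the groups come out in chronological order, done on a copy to leave Profit untouched:

# Parse the date strings so grouping sorts chronologically.
chrono = Profit.copy()
chrono['date'] = pd.to_datetime(chrono['date'], format='%d %b %Y')
chrono.groupby('date').sum()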

You may be interested in visualizing that as a line graph, which makes it a little easier to see your bottom line. Here you can see the relationship between the price of your product, the cost of goods, and the difference, represented as profit, on the bottom.

In [64]:
sales_by_date.sum().plot()
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x116b239b0>

You can break it down even further. If you're only interested in your profit, you can simply take the grouped data, sum the profit series, and plot that.

In [63]:
sales_by_date['profit'].sum().plot()
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x116ab3e48>

Here we group our sales by item number and find the total cost of goods by item and the associated profit.

In [65]:
sales_by_sku = Profit.groupby(Profit['sku'])
sales_by_sku.sum()
Out[65]:
cost price profit
sku
203 138 168 30
445 212 288 76
661 356 462 106

You could then display the visualization of your costs and profit by item type. Here we can see where our margins are tightest and where we can focus our marketing efforts. The return on item 445 is pretty close to that of 661; however, the cost per item is much lower. Perhaps it would make more sense to focus on selling more of 445: I could keep my overhead down, keep more in stock, and still do really well.

In [66]:
sales_by_sku.sum().plot(kind='bar')
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x116cbae80>
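
To put a number on that intuition, you can compute the return per dollar of cost for each SKU; the return_on_cost column below is an illustrative addition, not part of the original data:

# Profit earned per dollar of cost, by SKU.
by_sku = Profit.groupby('sku')[['cost', 'price', 'profit']].sum()
by_sku['return_on_cost'] = by_sku['profit'] / by_sku['cost']
by_sku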

Vehicle Comparison

Here we take a closer look at our vehicle data again.

In [12]:
# Reading in the data
cars = pd.read_csv("/Users/Steglitz/jupyter/mtcars.csv")

# Naming the columns.
# NOTE: Assigning to .columns renames the columns positionally; it does not reorder the data.
cars.columns = ['car_names', 'mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
cars.index = cars.car_names
# Returns the first five rows.
cars.head()
Out[12]:
car_names mpg cyl disp hp drat wt qsec vs am gear carb
car_names
Mazda RX4 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

Here we plot the efficiency of the vehicles. On the x-axis we have the list of our vehicles. On the y-axis we display where they fall with respect to their miles per gallon.

In [13]:
mpg = cars['mpg']

fig = plt.figure()
ax = fig.add_axes([.1,.1,1,1])
mpg.plot()

ax.set_xticks(range(32))

ax.set_xticklabels(cars.car_names, rotation=60, fontsize='medium')
ax.set_title('Miles per Gallon of Cars in mtcars')

ax.set_xlabel('car names')
ax.set_ylabel('miles/gal')

ax.legend(loc='best')
Out[13]:
<matplotlib.legend.Legend at 0x112fd1240>

You can also grab quick details about the dataset like:

In [15]:
# Maximum value
mpg.max()
Out[15]:
33.899999999999999
In [16]:
# Minimum value
mpg.min()
Out[16]:
10.4
In [17]:
# Record count
mpg.count()
Out[17]:
32
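
Because we set the index to the car names earlier, you can also ask which car holds an extreme value; a small sketch:

# Index label of the maximum value -- the most fuel-efficient car.
mpg.idxmax()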

Or you could use the describe method to give a quick all-in-one summary. Describe generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

In [18]:
mpg.describe()
Out[18]:
count    32.000000
mean     20.090625
std       6.026948
min      10.400000
25%      15.425000
50%      19.200000
75%      22.800000
max      33.900000
Name: mpg, dtype: float64

There are many useful ways to visualize your data and spot relationships within it. Here we have a histogram plot that shows you the general distribution of the data.

In [19]:
mpg.plot(kind='hist')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x112f2a828>

You can also use seaborn to plot a histogram with a distribution curve. This can be helpful when determining if your data meets the prerequisites for certain analytics techniques.

In [20]:
sb.distplot(mpg)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x113144f60>
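
Note that in seaborn 0.11 and later, distplot is deprecated; a rough modern equivalent is:

# Histogram with a density curve in newer seaborn releases.
sb.histplot(mpg, kde=True)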

Here we create a scatterplot from the two column series, "hp" and "mpg". The "s" parameter sets the dot size.

In [70]:
cars.plot(kind='scatter', x='hp', y='mpg', c=['lightblue'], s=50)
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x11701c908>

Here we have the same scatterplot with a trendline and confidence interval region. The interval was arbitrarily chosen for demonstration purposes, not calculated from the data.

In [76]:
sb.regplot(x='hp', y='mpg', data=cars, scatter=True, ci=80)
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x1176869e8>

Here you can create a grid matrix that evaluates the relationships between all the listed variables. We can see the relationship between automatic and manual cars with MPG, displacement, HP, and weight.

In [82]:
cars_df = pd.DataFrame(cars.iloc[:, [1, 3, 4, 6]].values, columns=['mpg', 'disp', 'hp', 'wt'])
cars_target = cars.iloc[:, 9].values
target_names = [0,1]

cars_df['group'] = pd.Series(cars_target, dtype="category")
sb.pairplot(cars_df, hue='group', palette='hls')
Out[82]:
<seaborn.axisgrid.PairGrid at 0x1186f7ba8>

Math and Statistics with Pandas

You can set your desired display precision for NumPy output by calling the set_printoptions function with the precision parameter.

In [27]:
np.set_printoptions(precision=2)

Creating Arrays

You can create arrays by passing lists to the np.array function.

In [83]:
a = np.array([1,2,3,4,5,6])
a
Out[83]:
array([1, 2, 3, 4, 5, 6])

You can create a multidimensional array by passing in a list of lists.

In [84]:
b = np.array([[10,20,30], [40,50,60]])
b
Out[84]:
array([[10, 20, 30],
       [40, 50, 60]])

Creating arrays from generated values

In [85]:
np.random.seed(25)
c = 36*np.random.randn(6)
c
Out[85]:
array([  8.22,  36.97, -30.23, -21.28, -34.45,  -8.  ])
In [86]:
d = np.arange(1,35)
d
Out[86]:
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34])

Arithmetic with arrays

Using the arrays we just created, you can do arithmetic operations on entire arrays.

You can do multiplication:

In [87]:
a * 10
Out[87]:
array([10, 20, 30, 40, 50, 60])

Addition:

In [88]:
print('C: ', c);print('A: ', a)

c + a
C:  [  8.22  36.97 -30.23 -21.28 -34.45  -8.  ]
A:  [1 2 3 4 5 6]
Out[88]:
array([  9.22,  38.97, -27.23, -17.28, -29.45,  -2.  ])

Subtraction:

In [90]:
print('C: ', c);print('A: ', a)

c - a
C:  [  8.22  36.97 -30.23 -21.28 -34.45  -8.  ]
A:  [1 2 3 4 5 6]
Out[90]:
array([  7.22,  34.97, -33.23, -25.28, -39.45, -14.  ])

Array multiplication:

In [91]:
print('C: ', c);print('A: ', a)

c * a
C:  [  8.22  36.97 -30.23 -21.28 -34.45  -8.  ]
A:  [1 2 3 4 5 6]
Out[91]:
array([   8.22,   73.94,  -90.68,  -85.13, -172.24,  -48.02])

Division:

In [92]:
print('C: ', c);print('A: ', a)

c / a
C:  [  8.22  36.97 -30.23 -21.28 -34.45  -8.  ]
A:  [1 2 3 4 5 6]
Out[92]:
array([  8.22,  18.48, -10.08,  -5.32,  -6.89,  -1.33])

Multiplying matrices and basic linear algebra

Using a trailing . after a number makes it a float literal as opposed to an integer. This helps avoid problems associated with integer math in Python.
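
A quick illustration of the difference (the variable names here are just for demonstration):

# The trailing dot produces a float array instead of an integer array.
ints = np.array([1, 2, 3])
floats = np.array([1., 2., 3.])
print(ints.dtype, floats.dtype)  # e.g. int64 float64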

In [93]:
# Setting up the aa matrix.
aa = np.array([[2.,4.,6.],[1.,3.,5.],[10.,20.,30.]])
aa
Out[93]:
array([[  2.,   4.,   6.],
       [  1.,   3.,   5.],
       [ 10.,  20.,  30.]])
In [94]:
# Setting up the bb matrix.
bb = np.array([[0.,1.,2.],[3.,4.,5.],[6.,7.,8.]])
bb
Out[94]:
array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.]])

Matrix addition:

In [95]:
print('Matrix aa: \n', aa);print('Matrix bb: \n', bb)

aa+bb
Matrix aa: 
 [[  2.   4.   6.]
 [  1.   3.   5.]
 [ 10.  20.  30.]]
Matrix bb: 
 [[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]
Out[95]:
array([[  2.,   5.,   8.],
       [  4.,   7.,  10.],
       [ 16.,  27.,  38.]])

Subtraction:

In [96]:
print('Matrix aa: \n', aa);print('Matrix bb: \n', bb)

aa-bb
Matrix aa: 
 [[  2.   4.   6.]
 [  1.   3.   5.]
 [ 10.  20.  30.]]
Matrix bb: 
 [[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]
Out[96]:
array([[  2.,   3.,   4.],
       [ -2.,  -1.,   0.],
       [  4.,  13.,  22.]])

Element-wise multiplication:

In [97]:
print('Matrix aa: \n', aa);print('Matrix bb: \n', bb)

aa*bb
Matrix aa: 
 [[  2.   4.   6.]
 [  1.   3.   5.]
 [ 10.  20.  30.]]
Matrix bb: 
 [[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]
Out[97]:
array([[   0.,    4.,   12.],
       [   3.,   12.,   25.],
       [  60.,  140.,  240.]])
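
Note that aa*bb above multiplies element by element. True matrix multiplication (rows dotted with columns) uses np.dot, or the @ operator in Python 3.5+; a minimal sketch:

# True matrix product of aa and bb.
np.dot(aa, bb)

# Equivalent using the @ operator (Python 3.5+):
aa @ bb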

Element-wise division:

In [46]:
print('Matrix aa: \n', aa);print('Matrix bb: \n', bb)

aa / bb
Matrix aa: 
 [[  2.   4.   6.]
 [  1.   3.   5.]
 [ 10.  20.  30.]]
Matrix bb: 
 [[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]]
/Users/Steglitz/Envs/DataAnalysis/lib/python3.6/site-packages/ipykernel_launcher.py:3: RuntimeWarning: divide by zero encountered in true_divide
  This is separate from the ipykernel package so we can avoid doing imports until
Out[46]:
array([[  inf,  4.  ,  3.  ],
       [ 0.33,  0.75,  1.  ],
       [ 1.67,  2.86,  3.75]])

You can see that even with a division by zero, NumPy is able to complete the calculation, emit a warning, and fill in inf for the offending element.
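
If you'd rather not see the warning, NumPy's errstate context manager can silence it locally; a minimal sketch:

# Suppress the divide-by-zero warning for just this block.
with np.errstate(divide='ignore'):
    result = aa / bb
result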

This concludes part two of our Pandas series. The datasets can be found on the PythonTrier GitHub. Give it a try and see what you can come up with.