Why Pandas?¶

It's good at:

  • Cleaning,
  • Preparing,
  • Analyzing,
  • and Visualizing data

Pandas is built on top of NumPy so it's easy to work with arrays and matrices.

Terms¶

Let's begin by defining a couple of terms and concepts and then build on them. We'll cover:

  • Series Objects
  • Selecting and Filtering data
  • DataFrame Objects
  • Data Munging
  • Concatenating and Transforming Data
  • Grouping and Data Aggregation
  • Graphs and Data Visualization

Series Objects¶

  • A Series object is a single, one-dimensional ndarray with axis labels (including time series). In short, it's a row or a column, and it's always indexed.
  • Labels don't need to be unique, but must be a hashable type.
  • The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.

NOTE: Statistical methods from ndarrays have been overridden to automatically exclude missing data (currently represented as NaN).
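For instance, this minimal sketch (toy data, not from the original notebook) shows a Series mean skipping the NaN while the raw NumPy array propagates it:

```python
import numpy as np
from pandas import Series

s = Series([1.0, 2.0, np.nan, 4.0])

# Series.mean() excludes the NaN: (1 + 2 + 4) / 3
print(s.mean())

# The plain ndarray mean propagates the NaN instead.
print(np.mean(s.values))
```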

Creating a Series Object¶

You'll need to import the numpy and pandas libraries, as well as the Series and DataFrame classes from the pandas library, like so.

In [26]:
import numpy as np
import pandas as pd

from pandas import Series, DataFrame

We'll start by creating a single Series object with 8 values by calling np.arange and passing an index to the Series() constructor, assigning the result to our series_obj variable. In this case, we're choosing to assign labels to our rows.

In [27]:
# np.arange creates an array of 8 values, from 0 to 7.

series_obj = Series(np.arange(8), index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6', 'row 7', 'row 8'])
print(series_obj)
row 1    0
row 2    1
row 3    2
row 4    3
row 5    4
row 6    5
row 7    6
row 8    7
dtype: int64

Selecting and Filtering Data¶

Label Indexing¶

You can select a row using a label index. Looking at the output above, you can see that row 3 is indeed 2.

In [28]:
series_obj['row 3']
Out[28]:
2

Integer Indexing¶

You can also select rows using integer indexing. Here we select rows 6 and 8 by passing a list with the positions 5 and 7.

In [29]:
series_obj[[5,7]]
Out[29]:
row 6    5
row 8    7
dtype: int64
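A note for newer pandas versions: selecting with bare integers on a label-indexed Series has since been deprecated; the explicit .iloc indexer does the same thing unambiguously. A small sketch:

```python
import numpy as np
from pandas import Series

series_obj = Series(np.arange(8),
                    index=['row 1', 'row 2', 'row 3', 'row 4',
                           'row 5', 'row 6', 'row 7', 'row 8'])

# .iloc always selects by position, regardless of what the index labels are.
print(series_obj.iloc[[5, 7]])
```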

Now let's build on Series objects and take a look at DataFrames.

DataFrame objects¶

DataFrames are collections of Series objects that form a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).

They can be thought of as a dict-like container for Series objects.

Basically, they look and feel like spreadsheets that you might see in something like Excel or Numbers.

Arithmetic operations align on both row and column labels.

They're indexable.
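To make the "dict-like container" and label-alignment points concrete, here's a minimal sketch with toy data (not from the notebook):

```python
from pandas import Series, DataFrame

a = Series([1, 2, 3], index=['x', 'y', 'z'])
b = Series([10, 20], index=['y', 'z'])

# Construct a DataFrame from a dict of Series; rows align on their labels,
# and 'b' gets NaN at 'x' because that label is missing from it.
df = DataFrame({'a': a, 'b': b})
print(df)

# Arithmetic aligns on labels too: 'x' has no partner in b, so it becomes NaN.
print(a + b)
```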

Here we generate a random array with a size of 36 and then reshape it to a 6 by 6 using the reshape method. We again pass in custom index and column names.

In [30]:
DF = DataFrame(np.random.rand(36).reshape((6,6)), 
                   index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'],
                   columns=['column 1', 'column 2', 'column 3', 'column 4', 'column 5', 'column 6'])
print(DF)
       column 1  column 2  column 3  column 4  column 5  column 6
row 1  0.077140  0.644862  0.309258  0.524254  0.958092  0.883201
row 2  0.295432  0.512376  0.088702  0.641717  0.132421  0.766486
row 3  0.076742  0.331044  0.679852  0.509213  0.655146  0.602120
row 4  0.719055  0.415219  0.396542  0.825139  0.712552  0.097937
row 5  0.842154  0.440821  0.373989  0.913676  0.547778  0.251937
row 6  0.027474  0.206257  0.590885  0.163652  0.836928  0.775203

Special Indexer¶

When you call the .ix[ ] special indexer and pass in a set of row and column indexes, you're telling pandas to select and retrieve only those specific rows and columns. The format is as follows:

In [31]:
# object_name.ix[[row indexes], [column indexes]]

DF.ix[['row 2', 'row 5'], ['column 5', 'column 2']]
Out[31]:
       column 5  column 2
row 2  0.132421  0.512376
row 5  0.547778  0.440821
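Note that .ix was deprecated in pandas 0.20 and removed in 1.0; in current pandas the same label-based selection is done with .loc (and positional selection with .iloc). A sketch with freshly generated random data, so the numbers will differ from the table above:

```python
import numpy as np
from pandas import DataFrame

DF = DataFrame(np.random.rand(36).reshape(6, 6),
               index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'],
               columns=['column 1', 'column 2', 'column 3',
                        'column 4', 'column 5', 'column 6'])

# .loc selects by label, with the same call shape as .ix above.
subset = DF.loc[['row 2', 'row 5'], ['column 5', 'column 2']]
print(subset)
```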

Data slicing allows you to select and retrieve all records from the starting label-index, to the ending label-index, and every record in between.

In [32]:
# object_name['starting label-index':'ending label-index'] 

DF.ix['row 3':'row 6']
Out[32]:
       column 1  column 2  column 3  column 4  column 5  column 6
row 3  0.076742  0.331044  0.679852  0.509213  0.655146  0.602120
row 4  0.719055  0.415219  0.396542  0.825139  0.712552  0.097937
row 5  0.842154  0.440821  0.373989  0.913676  0.547778  0.251937
row 6  0.027474  0.206257  0.590885  0.163652  0.836928  0.775203

Scalar comparisons¶

You can use comparison operators (like greater than or less than) to return True / False values for all records, to indicate how each element compares to a scalar value using the following format:

object_name < scalar value¶

Here we want to check whether the returned value is less than 0.2.

In [33]:
DF < .2
Out[33]:
       column 1  column 2  column 3  column 4  column 5  column 6
row 1      True     False     False     False     False     False
row 2     False     False      True     False      True     False
row 3      True     False     False     False     False     False
row 4     False     False     False     False     False      True
row 5     False     False     False     False     False     False
row 6      True     False     False      True     False     False

You can also use comparison operators and scalar values for indexing, to return only the records that satisfy the comparison expression you write.

In [34]:
# object_name[object_name > scalar value] 

series_obj[series_obj > 6]
Out[34]:
row 8    7
dtype: int64

Data Munging¶

Data munging is cleaning up messy data. Here are a couple of ways you can use Pandas to take care of problems you may run into with your data.

Setting values¶

Using object_name['label-index', 'label-index', 'label-index'] = scalar value, you can set one or many elements at once to a scalar value by their label indexes. You could use this to set approximate values or throw-away numbers for specific cases.

In [35]:
series_obj['row 1', 'row 5', 'row 8'] = 8

print(series_obj)
row 1    8
row 2    1
row 3    2
row 4    3
row 5    8
row 6    5
row 7    6
row 8    8
dtype: int64
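In recent pandas versions, passing the labels as a list through .loc is the explicit way to do the same assignment:

```python
import numpy as np
from pandas import Series

series_obj = Series(np.arange(8),
                    index=['row 1', 'row 2', 'row 3', 'row 4',
                           'row 5', 'row 6', 'row 7', 'row 8'])

# Assign one scalar to several rows at once, by label.
series_obj.loc[['row 1', 'row 5', 'row 8']] = 8
print(series_obj)
```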

How to Treat Missing Values¶

There are different ways to handle missing values. Here we'll cover different ways to fill them, or filter them out.

Here we're going to set some missing values to simulate missing data in our dataset.

In [36]:
missing = np.nan

series_obj = Series(['row 1', 'row 2', missing, 'row 4','row 5', 'row 6', missing, 'row 8'])
series_obj
Out[36]:
0    row 1
1    row 2
2      NaN
3    row 4
4    row 5
5    row 6
6      NaN
7    row 8
dtype: object

object_name.isnull()¶

You can use the .isnull() method to return a Boolean value for each element of a Pandas object, describing whether that element is a null value.

In [37]:
series_obj.isnull()
Out[37]:
0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
dtype: bool
In [38]:
# Setting the seed so our numbers stay consistent for this demonstration.
np.random.seed(25)
DF = DataFrame(np.random.randn(36).reshape(6,6))

# Setting rows three through five in column zero and rows one through four in 
# column five to missing.
DF.ix[3:5, 0] = missing
DF.ix[1:4, 5] = missing

DF
Out[38]:
          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734       NaN
2  2.152957 -1.334661  0.076380 -1.246089  1.202272       NaN
3       NaN -0.419678  2.294842 -2.594487  2.822756       NaN
4       NaN -1.976254  0.533340 -0.290870 -0.513520       NaN
5       NaN -1.839905  1.607671  0.388292  0.399732  0.405477

object_name.fillna(numeric value)¶

The .fillna() method finds each missing value within a Pandas object and fills it with the numeric value that you've passed in. Here, we'll set the missing values to zero.

In [39]:
filled_DF = DF.fillna(0)

filled_DF
Out[39]:
          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734  0.000000
2  2.152957 -1.334661  0.076380 -1.246089  1.202272  0.000000
3  0.000000 -0.419678  2.294842 -2.594487  2.822756  0.000000
4  0.000000 -1.976254  0.533340 -0.290870 -0.513520  0.000000
5  0.000000 -1.839905  1.607671  0.388292  0.399732  0.405477

object_name.fillna(dict)¶

You can also pass a dictionary into the .fillna() method. The method will then fill in missing values from each column Series (as designated by the dictionary key) with its own unique value (as specified in the corresponding dictionary value).

Here we'll fill missing values in column zero with 0.1 and in column five we'll use 1.25. This allows you to get more granular instead of treating the entire dataset as one entity.

In [40]:
filled_DF = DF.fillna({0: 0.1, 5: 1.25})

filled_DF
Out[40]:
          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734  1.250000
2  2.152957 -1.334661  0.076380 -1.246089  1.202272  1.250000
3  0.100000 -0.419678  2.294842 -2.594487  2.822756  1.250000
4  0.100000 -1.976254  0.533340 -0.290870 -0.513520  1.250000
5  0.100000 -1.839905  1.607671  0.388292  0.399732  0.405477

You can also pass in method='ffill' as an argument, and the .fillna() method will fill forward any missing values with the value from the last non-null element in the column Series. Note rows 3 to 5 in column 0 and rows 1 to 4 in column 5.

In [41]:
fill_DF = DF.fillna(method='ffill')

fill_DF
Out[41]:
          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734 -0.222326
2  2.152957 -1.334661  0.076380 -1.246089  1.202272 -0.222326
3  2.152957 -0.419678  2.294842 -2.594487  2.822756 -0.222326
4  2.152957 -1.976254  0.533340 -0.290870 -0.513520 -0.222326
5  2.152957 -1.839905  1.607671  0.388292  0.399732  0.405477
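In newer pandas releases the method= argument to .fillna() has been deprecated in favor of the dedicated .ffill() and .bfill() methods, which do the same thing. A minimal sketch with toy data:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame({'a': [1.0, np.nan, np.nan, 4.0]})

# .ffill() carries the last valid value forward ...
print(df.ffill())

# ... while .bfill() pulls the next valid value backward.
print(df.bfill())
```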
In [42]:
# Here we're setting up another DataFrame object with missing values so we can
# continue with our example.
np.random.seed(25)

DF1 = DataFrame(np.random.randn(36).reshape(6,6))
DF1.ix[3:5, 0] = missing
DF1.ix[1:4, 5] = missing

DF1
Out[42]:
          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734       NaN
2  2.152957 -1.334661  0.076380 -1.246089  1.202272       NaN
3       NaN -0.419678  2.294842 -2.594487  2.822756       NaN
4       NaN -1.976254  0.533340 -0.290870 -0.513520       NaN
5       NaN -1.839905  1.607671  0.388292  0.399732  0.405477

object_name.isnull().sum()¶

You can generate a True/False table that identifies the NaNs by calling the .isnull() method. Then you can chain the .sum() method to count how many missing values you have in each column. Here you can see that columns zero and five have missing values. I've divided the calls up to illustrate what each one produces.

In [43]:
DF1.isnull()
Out[43]:
       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False  False  False  False  False   True
2  False  False  False  False  False   True
3   True  False  False  False  False   True
4   True  False  False  False  False   True
5   True  False  False  False  False  False
In [44]:
DF1.isnull().sum()
Out[44]:
0    3
1    0
2    0
3    0
4    0
5    4
dtype: int64

Filtering data¶

Filtering out missing values¶

object_name.dropna()¶

To identify and drop all rows from a DF that contain ANY missing values, simply call the .dropna() method off of the DF object.

In [45]:
DF_no_NaN = DF1.dropna()

DF_no_NaN
Out[45]:
          0        1         2         3         4         5
0  0.228273  1.02689 -0.839585 -0.591182 -0.956888 -0.222326

Filtering out missing values¶

If you wanted to drop columns that contain any missing values, you'd just pass in the axis=1 argument to select and search the DF by columns, instead of by row.

In [46]:
DF_no_NaN = DF1.dropna(axis=1)

DF_no_NaN
Out[46]:
          1         2         3         4
0  1.026890 -0.839585 -0.591182 -0.956888
1  1.837905 -2.053231  0.868583 -0.920734
2 -1.334661  0.076380 -1.246089  1.202272
3 -0.419678  2.294842 -2.594487  2.822756
4 -1.976254  0.533340 -0.290870 -0.513520
5 -1.839905  1.607671  0.388292  0.399732
In [47]:
# Here we're setting up another DataFrame object with missing values so we can
# continue with our example.

np.random.seed(25)
DF2 = DataFrame(np.random.randn(36).reshape(6,6))
DF2.ix[3:5, 0] = missing
DF2.ix[3, 1] = missing
DF2.ix[3, 2] = missing
DF2.ix[3, 3] = missing
DF2.ix[3, 4] = missing
DF2.ix[1:4, 5] = missing

DF2
Out[47]:
          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734       NaN
2  2.152957 -1.334661  0.076380 -1.246089  1.202272       NaN
3       NaN       NaN       NaN       NaN       NaN       NaN
4       NaN -1.976254  0.533340 -0.290870 -0.513520       NaN
5       NaN -1.839905  1.607671  0.388292  0.399732  0.405477

object_name.dropna(how='all')¶

To identify and drop only the rows from a DF that contain ALL missing values, simply call the .dropna() method off of the DF object, and pass in the how='all' argument.

In [48]:
DF2.dropna(how='all')
Out[48]:
          0         1         2         3         4         5
0  0.228273  1.026890 -0.839585 -0.591182 -0.956888 -0.222326
1 -0.619915  1.837905 -2.053231  0.868583 -0.920734       NaN
2  2.152957 -1.334661  0.076380 -1.246089  1.202272       NaN
4       NaN -1.976254  0.533340 -0.290870 -0.513520       NaN
5       NaN -1.839905  1.607671  0.388292  0.399732  0.405477
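The .dropna() method also accepts a thresh parameter, which keeps only the rows with at least that many non-null values. A small sketch with toy data:

```python
import numpy as np
from pandas import DataFrame

df = DataFrame([[1.0, 2.0, 3.0],
                [np.nan, 5.0, 6.0],
                [np.nan, np.nan, 9.0],
                [np.nan, np.nan, np.nan]])

# thresh=2 keeps only the rows that have at least two non-null values.
kept = df.dropna(thresh=2)
print(kept)
```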

Removing duplicates¶

In [49]:
# Here we set our demonstration object

DF3 = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
                  'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                  'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})
DF3
Out[49]:
   column 1 column 2 column 3
0         1        a        A
1         1        a        A
2         2        b        B
3         2        b        B
4         3        c        C
5         3        c        C
6         3        c        C

object_name.duplicated()¶

The .duplicated() method searches each row in the DF, and returns a True or False value to indicate whether it is a duplicate of another row found earlier in the DF.

In [50]:
DF3.duplicated()
Out[50]:
0    False
1     True
2    False
3     True
4    False
5     True
6     True
dtype: bool

object_name.drop_duplicates()¶

To drop all duplicate rows, just call the drop_duplicates() method off of the DF.

In [51]:
DF3.drop_duplicates()
Out[51]:
   column 1 column 2 column 3
0         1        a        A
2         2        b        B
4         3        c        C
In [52]:
# Now we reset our demonstration object to look at column-wise filtering.
DF4 = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 4],
                  'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
                  'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})
DF4
Out[52]:
   column 1 column 2 column 3
0         1        a        A
1         1        a        A
2         2        b        B
3         2        b        B
4         3        c        C
5         3        c        D
6         4        c        C

object_name.drop_duplicates(['column_name'])¶

To drop the rows that have duplicates found in a column Series, just call the drop_duplicates() method and pass in the label-index of the column. This method will drop all rows that have duplicates in the column you specify. As you can see from the output, it doesn't inspect the other columns; we still have a duplicate in column 2.

In [53]:
DF4.drop_duplicates(['column 3'])
Out[53]:
   column 1 column 2 column 3
0         1        a        A
2         2        b        B
4         3        c        C
5         3        c        D
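drop_duplicates() also takes a keep parameter controlling which occurrence survives. A minimal sketch with toy data:

```python
from pandas import DataFrame

df = DataFrame({'col': ['a', 'a', 'b']})

# keep='first' (the default) retains the first occurrence of each duplicate,
# keep='last' retains the last, and keep=False drops every duplicated row.
print(df.drop_duplicates(keep='last'))
print(df.drop_duplicates(keep=False))
```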

Concatenating and Transforming Data¶

In [54]:
# Setting up the first Data Frame.

DF5_1 = pd.DataFrame(np.arange(36).reshape(6,6))

DF5_1
Out[54]:
    0   1   2   3   4   5
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23
4  24  25  26  27  28  29
5  30  31  32  33  34  35
In [55]:
# Setting up the second Data Frame.

DF5_2 = pd.DataFrame(np.arange(15).reshape(5,3))

DF5_2
Out[55]:
    0   1   2
0   0   1   2
1   3   4   5
2   6   7   8
3   9  10  11
4  12  13  14

Concatenating data¶

pd.concat([left_object, right_object], axis=1)¶

The concat() method joins data from separate sources into one combined data table. If you want to join objects based on their row index values, just call the pd.concat() method on the objects you want joined, and then pass in the axis=1 argument. The axis=1 argument tells pandas to concatenate the DFs by adding columns (in other words, joining on the row index values).

In [56]:
pd.concat([DF5_1, DF5_2], axis =1)
Out[56]:
    0   1   2   3   4   5     0     1     2
0   0   1   2   3   4   5   0.0   1.0   2.0
1   6   7   8   9  10  11   3.0   4.0   5.0
2  12  13  14  15  16  17   6.0   7.0   8.0
3  18  19  20  21  22  23   9.0  10.0  11.0
4  24  25  26  27  28  29  12.0  13.0  14.0
5  30  31  32  33  34  35   NaN   NaN   NaN

pd.concat([left_object, right_object])¶

If you simply pass in the left and right DataFrames, Pandas stacks them row-wise, keeping the width of the widest DataFrame while filling the narrower one's missing columns with NaN cells to keep its shape.

In [57]:
pd.concat([DF5_1, DF5_2])
Out[57]:
    0   1   2     3     4     5
0   0   1   2   3.0   4.0   5.0
1   6   7   8   9.0  10.0  11.0
2  12  13  14  15.0  16.0  17.0
3  18  19  20  21.0  22.0  23.0
4  24  25  26  27.0  28.0  29.0
5  30  31  32  33.0  34.0  35.0
0   0   1   2   NaN   NaN   NaN
1   3   4   5   NaN   NaN   NaN
2   6   7   8   NaN   NaN   NaN
3   9  10  11   NaN   NaN   NaN
4  12  13  14   NaN   NaN   NaN

Adding Data¶

In [58]:
# Setting up a series object to add to another Data Frame.

series_obj = Series(np.arange(6))
series_obj.name = "added_variable"

series_obj
Out[58]:
0    0
1    1
2    2
3    3
4    4
5    5
Name: added_variable, dtype: int64

DataFrame.join(left_object, right_object)¶

You can use the .join() method to join two data sources into one. The .join() method works by joining the two sources on their row index values.

In [59]:
variable_added = DataFrame.join(DF5_1, series_obj)

variable_added
Out[59]:
    0   1   2   3   4   5  added_variable
0   0   1   2   3   4   5               0
1   6   7   8   9  10  11               1
2  12  13  14  15  16  17               2
3  18  19  20  21  22  23               3
4  24  25  26  27  28  29               4
5  30  31  32  33  34  35               5

With the ignore_index parameter set to False, you can append a DF to another DF (or itself) while maintaining the original indexes.

In [60]:
added_datatable = variable_added.append(variable_added, ignore_index=False)

added_datatable
Out[60]:
    0   1   2   3   4   5  added_variable
0   0   1   2   3   4   5               0
1   6   7   8   9  10  11               1
2  12  13  14  15  16  17               2
3  18  19  20  21  22  23               3
4  24  25  26  27  28  29               4
5  30  31  32  33  34  35               5
0   0   1   2   3   4   5               0
1   6   7   8   9  10  11               1
2  12  13  14  15  16  17               2
3  18  19  20  21  22  23               3
4  24  25  26  27  28  29               4
5  30  31  32  33  34  35               5

If you set the ignore_index parameter to True, then Pandas reindexes the final product and provides you with a DF with a single index.

In [94]:
added_datatable = variable_added.append(variable_added, ignore_index=True)

added_datatable
Out[94]:
     0   1   2   3   4   5  added_variable
0    0   1   2   3   4   5               0
1    6   7   8   9  10  11               1
2   12  13  14  15  16  17               2
3   18  19  20  21  22  23               3
4   24  25  26  27  28  29               4
5   30  31  32  33  34  35               5
6    0   1   2   3   4   5               0
7    6   7   8   9  10  11               1
8   12  13  14  15  16  17               2
9   18  19  20  21  22  23               3
10  24  25  26  27  28  29               4
11  30  31  32  33  34  35               5
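Note that DataFrame.append was deprecated and then removed in pandas 2.0; pd.concat is the modern equivalent. A minimal sketch with toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3))

# pd.concat with ignore_index=True is the modern replacement for
# df.append(df, ignore_index=True).
stacked = pd.concat([df, df], ignore_index=True)
print(stacked)
```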

Transforming data¶

Dropping data¶

object_name.drop([row indexes])¶

You can easily drop rows from a DF by calling the .drop() method and passing in the index values for the rows you want dropped.

In [62]:
DF5_1.drop([0,2])
Out[62]:
    0   1   2   3   4   5
1   6   7   8   9  10  11
3  18  19  20  21  22  23
4  24  25  26  27  28  29
5  30  31  32  33  34  35

If you're looking to drop a column, simply pass in the axis parameter set to 1.

In [95]:
DF5_1.drop([0,2], axis=1)
Out[95]:
    1   3   4   5
0   1   3   4   5
1   7   9  10  11
2  13  15  16  17
3  19  21  22  23
4  25  27  28  29
5  31  33  34  35

Sorting data¶

object_name.sort_values(by=[index value], ascending=[False])¶

To sort rows in a DF, either in ascending or descending order, call the .sort_values() method off of the DF, and pass in the "by" parameter to specify the column index you want to use to sort your Data Frame.

In [ ]:
DF_sorted = DF5_1.sort_values(by=[5], ascending=[False])

DF_sorted

Grouping and data aggregation¶

Reading in the data

In [103]:
cars = pd.read_csv("/Users/Steglitz/jupyter/mtcars.csv")

Setting the column names you want to use, then making the car_names column the row index. NOTE: the list assigns names to the columns in positional order.

In [104]:
cars.columns = ['car_names','mpg','cyl','disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
cars.index = cars.car_names

Returns the first five rows.

In [105]:
cars.head()
Out[105]:
                           car_names   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
car_names
Mazda RX4                  Mazda RX4  21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag          Mazda RX4 Wag  21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710                Datsun 710  22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive        Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout  Hornet Sportabout  18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2

object_name.groupby('Series_name')¶

To group a DF by its values in a particular column, call the .groupby() method, and then pass in the column Series you want the DF to be grouped by. Here we want to group the listed cars by their number of cylinders.

In [96]:
cars_groups = cars.groupby(cars['cyl'])

Then you can call the mean() method to calculate the mean values of the cars in each cylinder category.

In [97]:
cars_groups.mean()
Out[97]:
           mpg        disp          hp      drat        wt       qsec        vs        am      gear      carb
cyl
4    26.663636  105.136364   82.636364  4.070909  2.285727  19.137273  0.909091  0.727273  4.090909  1.545455
6    19.742857  183.314286  122.285714  3.585714  3.117143  17.977143  0.571429  0.428571  3.857143  3.428571
8    15.100000  353.100000  209.214286  3.229286  3.999214  16.772143  0.000000  0.142857  3.285714  3.500000
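If you want more than one statistic per group, the .agg method computes several at once. A minimal sketch using a small, hypothetical stand-in for the mtcars data above:

```python
import pandas as pd

# A small stand-in for the mtcars data (hypothetical values).
df = pd.DataFrame({'cyl': [4, 4, 6, 6, 8],
                   'mpg': [30.0, 26.0, 21.0, 19.0, 15.0]})

# .agg computes several summary statistics per group in one call.
summary = df.groupby('cyl')['mpg'].agg(['mean', 'max', 'count'])
print(summary)
```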

Graphs and Data Visualization¶

You only need to import what you're adding to a notebook. If this were your first import, you'd also have to add:

In [67]:
# import numpy as np
# import pandas as pd
# from pandas import Series, DataFrame

# We're adding the following imports for the next part. 
from numpy.random import randn
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sb

When you add "%matplotlib inline", it tells matplotlib to print the data visualization within the Python notebook instead of opening it in an external graphical user interface.

In [68]:
%matplotlib inline

# Figsize is represented in inches.
rcParams['figure.figsize']= 5,4
sb.set_style('whitegrid')
In [69]:
# Setting the range and step size for the x-axis. 
x = range(1,10, 1)
# Sets the points to be plotted on the y-axis in the order of plotting from left to right.
y = [1,2,3,4,0,4,3,2,1]
# Plots the points. x and y must contain the same number of values.
plt.plot(x,y)
Out[69]:
[<matplotlib.lines.Line2D at 0x10ab20898>]

You can also render the data as a bar chart.

In [98]:
plt.bar(x,y)
Out[98]:
<Container object of 9 artists>

Here we select the mpg Series from the cars DF, assign it to the variable "mpg", and plot it.

In [99]:
mpg = cars['mpg']

mpg.plot()
Out[99]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c60f550>

You can represent this same data in bar form by adding the kind argument to the plot method.

In [100]:
mpg.plot(kind='bar')
Out[100]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c6f0a20>

You can render your bar chart horizontally by changing the kind attribute from 'bar' to 'barh'.

In [101]:
mpg.plot(kind='barh')
Out[101]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b3605c0>

You can plot several data series at once by calling the axis labels from the DF.

In [102]:
DF6 = cars[['cyl', 'wt', 'mpg']]

DF6.plot()
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x10bfedc18>

A pie chart represents the data as percentages of the whole. For example, x=[1,1,1] will be rendered the same way as x=[9,9,9], since the slices have equal proportions. You could also represent a 25%/75% chart as x=[25,75] or as x=[1,3]. This relationship is rendered below to demonstrate that point.

Additionally, you can save your figures to your working directory by using the savefig method.

savefig(fname, dpi=None, facecolor='w', edgecolor='w', orientation='portrait', papertype=None, format=None, transparent=False, bbox_inches=None, pad_inches=0.1, frameon=None)

The save needs to come before .show(), because .show() clears the figure and you'd end up with an empty image.

In [123]:
x = [1,3]
plt.pie(x)

plt.savefig('pie_chart1.png', transparent=True, dpi=72)
plt.show()

Object-Oriented Plotting¶

  • Create a blank figure object.
  • Add axes to the figure.
  • Generate plots within the figure.
  • Specify plotting and layout parameters for the plots
In [76]:
x = range(1,10)
y = [1,2,3,4,0,4,3,2,1]

# Blank Figure Object
fig = plt.figure()
# Figure axes. [left side, bottom, width, height]
# Blank figure with axes added
ax = fig.add_axes([.1, .1, 1, 1])
# Pass in the variables you want to plot. 
ax.plot(x,y)
Out[76]:
[<matplotlib.lines.Line2D at 0x10af0a9b0>]
In [77]:
# Setting the axes limits and tic marks. Each time you create a plot you need 
# to start with a blank figure and add axes again.
fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1]) 

# Sets x and y axis limits
ax.set_xlim([1,9])
ax.set_ylim([0,5])

# Sets x and y axis tick marks
# You'll notice that 3 and 7 are removed from the chart below. 
ax.set_xticks([0,1,2,4,5,6,8,9,10])
ax.set_yticks([0,1,2,3,4,5])

ax.plot(x,y)
Out[77]:
[<matplotlib.lines.Line2D at 0x10b52dc18>]
In [78]:
# Creates blank figure object
fig = plt.figure()
# The figure will have two axes at once, ax1 and ax2
# Subplot 1 row with two columns
fig, (ax1, ax2) = plt.subplots(1,2)
# Plots x in axis 1
ax1.plot(x)
# Plots x and y in axis 2
ax2.plot(x,y)
Out[78]:
[<matplotlib.lines.Line2D at 0x10b452a20>]
<matplotlib.figure.Figure at 0x10b2cc668>

Plot Formatting¶

You can set colors by using the name or passing in the hex code.

Line Styles¶

  • Line style argument: ls='...'
  • Line width argument: lw=...
    • '--' = dashed line
    • ':' = dotted line
    • '-' = solid line
    • '-.' = dash-dot line
In [116]:
sb.set_style('whitegrid')
x = range(1, 10)
y = [1,2,3,4,0.5,4,3,2,1]

plt.bar(x, y)
Out[116]:
<Container object of 9 artists>

You can adjust the width of individual bars

In [117]:
wide = [0.5, 0.5, 0.5, 0.9, 0.9, 0.9, 0.5, 0.5, 0.5]

and set the color

In [118]:
color = ['salmon']

by passing in those variables as arguments to your bar method.

In [119]:
plt.bar(x, y, width=wide, color=color, align='center')
Out[119]:
<Container object of 9 artists>

You can plot multiple lines of data at once by passing in a list of which columns you would like to plot.

In [120]:
DF6 = cars[['cyl', 'mpg','wt']]

DF6.plot()
Out[120]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d4377f0>

Again, you can select the desired colors of your chart.

In [121]:
color_theme = ['darkgray', 'lightsalmon', 'powderblue']

DF6.plot(color=color_theme)
Out[121]:
<matplotlib.axes._subplots.AxesSubplot at 0x10cf0cda0>
In [83]:
# Resetting the pie graph.
z = [1,2,3,4,0.5]
plt.pie(z)
plt.show()

You can choose hex values or named colors. See the documentation for the list of named colors.

In [125]:
color_theme = ['#A9A9A9', '#FFA07A', '#B0E0E6', '#FFE4C4', '#BDB76B']
plt.pie(z, colors = color_theme)

plt.show()
In [85]:
# Line Styles - Default style

x1 = range(0,10)
y1 = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]

plt.plot(x, y)
plt.plot(x1,y1)
Out[85]:
[<matplotlib.lines.Line2D at 0x10bbc39b0>]
In [86]:
# Setting the line width and graph style.
plt.plot(x, y, ls = 'steps', lw=5)

# Dashed line style with a width of 10.
plt.plot(x1,y1, ls='--', lw=10)
Out[86]:
[<matplotlib.lines.Line2D at 0x10bbca128>]
In [87]:
# Marker styles

# You can choose from a few different marker styles.
plt.plot(x, y, marker = '1', mew=20)
plt.plot(x1,y1, marker = '+', mew=15)
Out[87]:
[<matplotlib.lines.Line2D at 0x10b9089e8>]
In [88]:
rcParams['figure.figsize'] = 8,4
sb.set_style('whitegrid')
x = range(1,10)
y = [1,2,3,4,0.5,4,3,2,1]
plt.bar(x,y)

# You can set custom x- and y-axis labels.
plt.xlabel('your x-axis label goes here')
plt.ylabel('your y-axis label goes here')
Out[88]:
<matplotlib.text.Text at 0x10bf2d128>
In [89]:
# Choose the values to represent in your pie chart.
z = [1 , 2, 3, 4, 0.5]

# Assign labels to that data by passing them as a list in the same order.
veh_type = ['bicycle', 'motorbike','car', 'van', 'stroller']

# Plot values and labels as a pie chart.
plt.pie(z, labels= veh_type)
plt.show()
In [90]:
# Adding a legend
# You can also represent the labels with a legend and let matplotlib choose the
# "best" display location.
plt.pie(z)
plt.legend(veh_type, loc='best')
plt.show()
In [91]:
mpg = cars.mpg

fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1])

mpg.plot()

# Sets x-axis ticks
ax.set_xticks(range(32))

# Sets the labels, label rotation, and font size for x-axis labels.
ax.set_xticklabels(cars.car_names, rotation=60, fontsize='medium')

# Title
ax.set_title('Miles per Gallon of Cars in mtcars')

# Axes Labels
ax.set_xlabel('car names')
ax.set_ylabel('miles/gal')
Out[91]:
<matplotlib.text.Text at 0x10baaed68>
In [92]:
fig = plt.figure()
ax = fig.add_axes([.1,.1,1,1])
mpg.plot()

ax.set_xticks(range(32))

ax.set_xticklabels(cars.car_names, rotation=60, fontsize='medium')
ax.set_title('Miles per Gallon of Cars in mtcars')

ax.set_xlabel('car names')
ax.set_ylabel('miles/gal')

# Adding a legend.
ax.legend(loc='best')
Out[92]:
<matplotlib.legend.Legend at 0x10c257128>
In [93]:
fig = plt.figure()
ax = fig.add_axes([.1,.1,1,1])
mpg.plot()
ax.set_title('Miles per Gallon of Cars in mtcars')
ax.set_ylabel('miles/gal')

ax.set_ylim([0,45])

# Adds an in graph annotation. The value of the xy attribute sets the location 
# of the tip of the arrow. The xytext value sets the location of the text. The
# arrow will adjust between the two declared points. 
ax.annotate('Toyota Corolla', xy=(19,33.9), xytext = (21,35),
           arrowprops=dict(facecolor='black', shrink=0.05))
Out[93]:
<matplotlib.text.Annotation at 0x10bbd7588>

This concludes part one of our intro to Pandas.