Pandas is built on top of NumPy, so it's easy to work with arrays and matrices.
Terms
Let's begin by defining a couple of terms and concepts, and then slowly build on them.
NOTE: Statistical methods from ndarrays have been overridden to automatically exclude missing data (currently represented as NaN); there's a short illustration after the imports below.
You'll need to import the numpy and pandas libraries, as well as the Series and DataFrame objects from the pandas library, like so.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
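As a quick illustration of the NOTE above (a minimal sketch, not part of the original walkthrough): NumPy's statistics propagate NaN, while the equivalent pandas methods exclude it.
# NumPy propagates the missing value; pandas skips it.
arr = np.array([1.0, np.nan, 3.0])
print(arr.mean())          # nan
print(Series(arr).mean())  # 2.0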
We'll start by creating a single Series object with 8 values by calling np.arange(8), passing an index to the Series() constructor, and assigning the result to our series_obj variable. In this case, we're choosing to assign labels to our rows.
# np.arange(8) creates a series of 8 values, from 0 to 7.
series_obj = Series(np.arange(8), index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6', 'row 7', 'row 8'])
print(series_obj)
row 1    0
row 2    1
row 3    2
row 4    3
row 5    4
row 6    5
row 7    6
row 8    7
dtype: int64
You can select a row using a label-index. Looking at the output above, you can see that row 3 is indeed 2.
series_obj['row 3']
2
You can also select rows by their integer position using the .iloc indexer. Here we select rows 6 and 8 by passing a list with the positions 5 and 7.
series_obj.iloc[[5,7]]
row 6    5
row 8    7
dtype: int64
Now let's build on Series objects and take a look at DataFrames.
DataFrames are collections of Series objects that form a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
They can be thought of as dict-like containers for Series objects.
Basically, they look and feel like the spreadsheets you might see in something like Excel or Numbers.
Arithmetic operations align on both row and column labels.
They're indexable.
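To make those properties concrete, here's a minimal sketch (the names s1, df_a, and df_b are just for illustration) of the dict-of-Series construction and of label alignment during arithmetic:
# Build a DataFrame from a dict of Series; the keys become column names.
s1 = Series([1, 2, 3], index=['a', 'b', 'c'])
df_a = DataFrame({'col 1': s1, 'col 2': s1 * 10})
# A second frame with the same row labels in a different order.
df_b = DataFrame({'col 1': Series([100, 200, 300], index=['c', 'a', 'b'])})
# Addition matches on labels, not positions; 'col 2' exists only in df_a,
# so that column comes back as NaN in the result.
print(df_a + df_b)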
Here we generate a random array with a size of 36 and then reshape it to a 6 by 6 using the reshape method. We again pass in custom index and column names.
DF = DataFrame(np.random.rand(36).reshape((6,6)),
index=['row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'],
columns=['column 1', 'column 2', 'column 3', 'column 4', 'column 5', 'column 6'])
print(DF)
       column 1  column 2  column 3  column 4  column 5  column 6
row 1  0.077140  0.644862  0.309258  0.524254  0.958092  0.883201
row 2  0.295432  0.512376  0.088702  0.641717  0.132421  0.766486
row 3  0.076742  0.331044  0.679852  0.509213  0.655146  0.602120
row 4  0.719055  0.415219  0.396542  0.825139  0.712552  0.097937
row 5  0.842154  0.440821  0.373989  0.913676  0.547778  0.251937
row 6  0.027474  0.206257  0.590885  0.163652  0.836928  0.775203
When you call the .loc[ ] indexer and pass in a set of row and column labels, you're telling Python to select and retrieve only those specific rows and columns. The format is as follows:
# object_name.loc[[row labels], [column labels]]
DF.loc[['row 2', 'row 5'], ['column 5', 'column 2']]
| | column 5 | column 2 |
|---|---|---|
row 2 | 0.132421 | 0.512376 |
row 5 | 0.547778 | 0.440821 |
Data slicing allows you to select and retrieve all records from the starting label-index to the ending label-index, and every record in between.
# object_name.loc['starting label-index':'ending label-index']
DF.loc['row 3':'row 6']
| | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 |
|---|---|---|---|---|---|---|
row 3 | 0.076742 | 0.331044 | 0.679852 | 0.509213 | 0.655146 | 0.602120 |
row 4 | 0.719055 | 0.415219 | 0.396542 | 0.825139 | 0.712552 | 0.097937 |
row 5 | 0.842154 | 0.440821 | 0.373989 | 0.913676 | 0.547778 | 0.251937 |
row 6 | 0.027474 | 0.206257 | 0.590885 | 0.163652 | 0.836928 | 0.775203 |
You can use comparison operators (like greater than or less than) to return True / False values for all records, indicating how each element compares to a scalar value. Here we check whether each value is less than 0.2.
DF < .2
| | column 1 | column 2 | column 3 | column 4 | column 5 | column 6 |
|---|---|---|---|---|---|---|
row 1 | True | False | False | False | False | False |
row 2 | False | False | True | False | True | False |
row 3 | True | False | False | False | False | False |
row 4 | False | False | False | False | False | True |
row 5 | False | False | False | False | False | False |
row 6 | True | False | False | True | False | False |
You can also use comparison operators and scalar values for indexing, to return only the records that satisfy the comparison expression you write.
# object_name[object_name > scalar value]
series_obj[series_obj > 6]
row 8    7
dtype: int64
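The same comparison-based indexing works on a DataFrame; here's a brief sketch reusing the DF from above. Because a two-dimensional table has to keep its shape, elements that fail the comparison come back as NaN rather than being dropped.
# Elements that are not < .2 are replaced with NaN; the 6x6 shape is kept.
DF[DF < .2]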
One thing Pandas is really good at is cleaning up messy data. Here are a couple of ways you can use it to take care of some problems you may run into with your data.
Using object_name[['label-index', 'label-index', 'label-index']] = scalar value, you can set the value of one or many elements at once by passing a list of label-indexes. You could use this to set approximate values or placeholder numbers for specific cases.
series_obj[['row 1', 'row 5', 'row 8']] = 8
print(series_obj)
row 1    8
row 2    1
row 3    2
row 4    3
row 5    8
row 6    5
row 7    6
row 8    8
dtype: int64
There are different ways to handle missing values. Here we'll cover different ways to fill them, or filter them out.
Here we're going to set some missing values to simulate missing data in our dataset.
missing = np.nan
series_obj = Series(['row 1', 'row 2', missing, 'row 4','row 5', 'row 6', missing, 'row 8'])
series_obj
0    row 1
1    row 2
2      NaN
3    row 4
4    row 5
5    row 6
6      NaN
7    row 8
dtype: object
You can use the .isnull() method to return a Boolean value for each element, describing whether that element in a Pandas object is a null value.
series_obj.isnull()
0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
dtype: bool
# Setting the seed so our numbers stay consistent for this demonstration.
np.random.seed(25)
DF = DataFrame(np.random.randn(36).reshape(6,6))
# Setting rows three through five in column zero and rows one through four in
# column five to missing.
DF.loc[3:5, 0] = missing
DF.loc[1:4, 5] = missing
DF
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.026890 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
1 | -0.619915 | 1.837905 | -2.053231 | 0.868583 | -0.920734 | NaN |
2 | 2.152957 | -1.334661 | 0.076380 | -1.246089 | 1.202272 | NaN |
3 | NaN | -0.419678 | 2.294842 | -2.594487 | 2.822756 | NaN |
4 | NaN | -1.976254 | 0.533340 | -0.290870 | -0.513520 | NaN |
5 | NaN | -1.839905 | 1.607671 | 0.388292 | 0.399732 | 0.405477 |
The .fillna() method finds each missing value within a Pandas object and fills it with the value that you've passed in. Here, we'll set the missing values to zero.
filled_DF = DF.fillna(0)
filled_DF
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.026890 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
1 | -0.619915 | 1.837905 | -2.053231 | 0.868583 | -0.920734 | 0.000000 |
2 | 2.152957 | -1.334661 | 0.076380 | -1.246089 | 1.202272 | 0.000000 |
3 | 0.000000 | -0.419678 | 2.294842 | -2.594487 | 2.822756 | 0.000000 |
4 | 0.000000 | -1.976254 | 0.533340 | -0.290870 | -0.513520 | 0.000000 |
5 | 0.000000 | -1.839905 | 1.607671 | 0.388292 | 0.399732 | 0.405477 |
You can also pass a dictionary into the .fillna() method. The method will then fill in missing values from each column Series (as designated by the dictionary key) with its own unique value (as specified in the corresponding dictionary value).
Here we'll fill missing values in column zero with 0.1 and in column five we'll use 1.25. This allows you to get more granular instead of treating the entire dataset as one entity.
filled_DF = DF.fillna({0: 0.1, 5: 1.25})
filled_DF
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.026890 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
1 | -0.619915 | 1.837905 | -2.053231 | 0.868583 | -0.920734 | 1.250000 |
2 | 2.152957 | -1.334661 | 0.076380 | -1.246089 | 1.202272 | 1.250000 |
3 | 0.100000 | -0.419678 | 2.294842 | -2.594487 | 2.822756 | 1.250000 |
4 | 0.100000 | -1.976254 | 0.533340 | -0.290870 | -0.513520 | 1.250000 |
5 | 0.100000 | -1.839905 | 1.607671 | 0.388292 | 0.399732 | 0.405477 |
You can also fill forward with the .ffill() method, which fills each missing value with the last non-null element above it in the column Series. Note rows 3 to 5 in column 0 and rows 1 to 4 in column 5.
fill_DF = DF.ffill()
fill_DF
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.026890 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
1 | -0.619915 | 1.837905 | -2.053231 | 0.868583 | -0.920734 | -0.222326 |
2 | 2.152957 | -1.334661 | 0.076380 | -1.246089 | 1.202272 | -0.222326 |
3 | 2.152957 | -0.419678 | 2.294842 | -2.594487 | 2.822756 | -0.222326 |
4 | 2.152957 | -1.976254 | 0.533340 | -0.290870 | -0.513520 | -0.222326 |
5 | 2.152957 | -1.839905 | 1.607671 | 0.388292 | 0.399732 | 0.405477 |
# Here we're setting up another DataFrame with missing values so we can
# continue with our example.
np.random.seed(25)
DF1 = DataFrame(np.random.randn(36).reshape(6,6))
DF1.loc[3:5, 0] = missing
DF1.loc[1:4, 5] = missing
DF1
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.026890 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
1 | -0.619915 | 1.837905 | -2.053231 | 0.868583 | -0.920734 | NaN |
2 | 2.152957 | -1.334661 | 0.076380 | -1.246089 | 1.202272 | NaN |
3 | NaN | -0.419678 | 2.294842 | -2.594487 | 2.822756 | NaN |
4 | NaN | -1.976254 | 0.533340 | -0.290870 | -0.513520 | NaN |
5 | NaN | -1.839905 | 1.607671 | 0.388292 | 0.399732 | 0.405477 |
You can generate a True/False table that identifies the NaNs by calling the .isnull() method. Then you can chain on the .sum() method to count how many missing values you have in each column. Here you can see that columns zero and five have missing values. I've split the calls up to illustrate what each step produces.
DF1.isnull()
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False |
1 | False | False | False | False | False | True |
2 | False | False | False | False | False | True |
3 | True | False | False | False | False | True |
4 | True | False | False | False | False | True |
5 | True | False | False | False | False | False |
DF1.isnull().sum()
0    3
1    0
2    0
3    0
4    0
5    4
dtype: int64
To drop every row that contains any missing values, call the .dropna() method off of the DF object.
DF_no_NaN = DF1.dropna()
DF_no_NaN
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.02689 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
If you wanted to drop columns that contain any missing values, you'd just pass in the axis=1 argument to select and search the DF by columns, instead of by row.
DF_no_NaN = DF1.dropna(axis=1)
DF_no_NaN
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
0 | 1.026890 | -0.839585 | -0.591182 | -0.956888 |
1 | 1.837905 | -2.053231 | 0.868583 | -0.920734 |
2 | -1.334661 | 0.076380 | -1.246089 | 1.202272 |
3 | -0.419678 | 2.294842 | -2.594487 | 2.822756 |
4 | -1.976254 | 0.533340 | -0.290870 | -0.513520 |
5 | -1.839905 | 1.607671 | 0.388292 | 0.399732 |
# Here we're setting up another DataFrame with missing values so we can
# continue with our example.
np.random.seed(25)
DF2 = DataFrame(np.random.randn(36).reshape(6,6))
DF2.loc[3:5, 0] = missing
DF2.loc[3, 1] = missing
DF2.loc[3, 2] = missing
DF2.loc[3, 3] = missing
DF2.loc[3, 4] = missing
DF2.loc[1:4, 5] = missing
DF2
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.026890 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
1 | -0.619915 | 1.837905 | -2.053231 | 0.868583 | -0.920734 | NaN |
2 | 2.152957 | -1.334661 | 0.076380 | -1.246089 | 1.202272 | NaN |
3 | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | -1.976254 | 0.533340 | -0.290870 | -0.513520 | NaN |
5 | NaN | -1.839905 | 1.607671 | 0.388292 | 0.399732 | 0.405477 |
To identify and drop only the rows from a DF that contain ALL missing values, simply call the .dropna() method off of the DF object, and pass in the how='all' argument.
DF2.dropna(how='all')
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0.228273 | 1.026890 | -0.839585 | -0.591182 | -0.956888 | -0.222326 |
1 | -0.619915 | 1.837905 | -2.053231 | 0.868583 | -0.920734 | NaN |
2 | 2.152957 | -1.334661 | 0.076380 | -1.246089 | 1.202272 | NaN |
4 | NaN | -1.976254 | 0.533340 | -0.290870 | -0.513520 | NaN |
5 | NaN | -1.839905 | 1.607671 | 0.388292 | 0.399732 | 0.405477 |
# Here we set up our demonstration object.
DF3 = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 3],
'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
'column 3': ['A', 'A', 'B', 'B', 'C', 'C', 'C']})
DF3
| | column 1 | column 2 | column 3 |
|---|---|---|---|
0 | 1 | a | A |
1 | 1 | a | A |
2 | 2 | b | B |
3 | 2 | b | B |
4 | 3 | c | C |
5 | 3 | c | C |
6 | 3 | c | C |
The .duplicated() method searches each row in the DF, and returns a True or False value to indicate whether it is a duplicate of another row found earlier in the DF.
DF3.duplicated()
0    False
1     True
2    False
3     True
4    False
5     True
6     True
dtype: bool
To drop all duplicate rows, just call the drop_duplicates() method off of the DF.
DF3.drop_duplicates()
| | column 1 | column 2 | column 3 |
|---|---|---|---|
0 | 1 | a | A |
2 | 2 | b | B |
4 | 3 | c | C |
# Now we reset our demonstration object to look at column-wise filtering.
DF4 = DataFrame({'column 1': [1, 1, 2, 2, 3, 3, 4],
'column 2': ['a', 'a', 'b', 'b', 'c', 'c', 'c'],
'column 3': ['A', 'A', 'B', 'B', 'C', 'D', 'C']})
DF4
| | column 1 | column 2 | column 3 |
|---|---|---|---|
0 | 1 | a | A |
1 | 1 | a | A |
2 | 2 | b | B |
3 | 2 | b | B |
4 | 3 | c | C |
5 | 3 | c | D |
6 | 4 | c | C |
To drop rows that have duplicate values in a particular column Series, call the .drop_duplicates() method and pass in that column's label-index. This drops every row whose value in the specified column duplicates an earlier row's; as you can see from the previous chart, it doesn't inspect the other columns, since we still have a duplicate in column 2.
DF4.drop_duplicates(['column 3'])
| | column 1 | column 2 | column 3 |
|---|---|---|---|
0 | 1 | a | A |
2 | 2 | b | B |
4 | 3 | c | C |
5 | 3 | c | D |
# Setting up the first DataFrame.
DF5_1 = pd.DataFrame(np.arange(36).reshape(6,6))
DF5_1
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 |
1 | 6 | 7 | 8 | 9 | 10 | 11 |
2 | 12 | 13 | 14 | 15 | 16 | 17 |
3 | 18 | 19 | 20 | 21 | 22 | 23 |
4 | 24 | 25 | 26 | 27 | 28 | 29 |
5 | 30 | 31 | 32 | 33 | 34 | 35 |
# Setting up the second DataFrame.
DF5_2 = pd.DataFrame(np.arange(15).reshape(5,3))
DF5_2
| | 0 | 1 | 2 |
|---|---|---|---|
0 | 0 | 1 | 2 |
1 | 3 | 4 | 5 |
2 | 6 | 7 | 8 |
3 | 9 | 10 | 11 |
4 | 12 | 13 | 14 |
The concat() method joins data from separate sources into one combined data table. If you want to join objects based on their row index values, call the pd.concat() method with a list of the objects you want joined, and pass in the axis=1 argument. The axis=1 argument tells Python to concatenate the DFs by adding columns (in other words, joining on the row index values).
pd.concat([DF5_1, DF5_2], axis =1)
| | 0 | 1 | 2 | 3 | 4 | 5 | 0 | 1 | 2 |
|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 0.0 | 1.0 | 2.0 |
1 | 6 | 7 | 8 | 9 | 10 | 11 | 3.0 | 4.0 | 5.0 |
2 | 12 | 13 | 14 | 15 | 16 | 17 | 6.0 | 7.0 | 8.0 |
3 | 18 | 19 | 20 | 21 | 22 | 23 | 9.0 | 10.0 | 11.0 |
4 | 24 | 25 | 26 | 27 | 28 | 29 | 12.0 | 13.0 | 14.0 |
5 | 30 | 31 | 32 | 33 | 34 | 35 | NaN | NaN | NaN |
If you simply pass in the two DFs without axis=1, Pandas concatenates them row-wise instead: the second DF is stacked below the first, the result keeps the width of the wider DF, and the missing cells in the narrower one are filled with NaN to preserve the shape.
pd.concat([DF5_1, DF5_2])
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3.0 | 4.0 | 5.0 |
1 | 6 | 7 | 8 | 9.0 | 10.0 | 11.0 |
2 | 12 | 13 | 14 | 15.0 | 16.0 | 17.0 |
3 | 18 | 19 | 20 | 21.0 | 22.0 | 23.0 |
4 | 24 | 25 | 26 | 27.0 | 28.0 | 29.0 |
5 | 30 | 31 | 32 | 33.0 | 34.0 | 35.0 |
0 | 0 | 1 | 2 | NaN | NaN | NaN |
1 | 3 | 4 | 5 | NaN | NaN | NaN |
2 | 6 | 7 | 8 | NaN | NaN | NaN |
3 | 9 | 10 | 11 | NaN | NaN | NaN |
4 | 12 | 13 | 14 | NaN | NaN | NaN |
# Setting up a Series object to add to another DataFrame.
series_obj = Series(np.arange(6))
series_obj.name = "added_variable"
series_obj
0    0
1    1
2    2
3    3
4    4
5    5
Name: added_variable, dtype: int64
You can use the .join() method to join two data sources into one. The .join() method works by joining the two sources on their row index values.
variable_added = DF5_1.join(series_obj)
variable_added
| | 0 | 1 | 2 | 3 | 4 | 5 | added_variable |
|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 0 |
1 | 6 | 7 | 8 | 9 | 10 | 11 | 1 |
2 | 12 | 13 | 14 | 15 | 16 | 17 | 2 |
3 | 18 | 19 | 20 | 21 | 22 | 23 | 3 |
4 | 24 | 25 | 26 | 27 | 28 | 29 | 4 |
5 | 30 | 31 | 32 | 33 | 34 | 35 | 5 |
You can stack a DF on top of another DF (or itself) with pd.concat(); with ignore_index=False, each piece maintains its original index values.
added_datatable = pd.concat([variable_added, variable_added], ignore_index=False)
added_datatable
| | 0 | 1 | 2 | 3 | 4 | 5 | added_variable |
|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 0 |
1 | 6 | 7 | 8 | 9 | 10 | 11 | 1 |
2 | 12 | 13 | 14 | 15 | 16 | 17 | 2 |
3 | 18 | 19 | 20 | 21 | 22 | 23 | 3 |
4 | 24 | 25 | 26 | 27 | 28 | 29 | 4 |
5 | 30 | 31 | 32 | 33 | 34 | 35 | 5 |
0 | 0 | 1 | 2 | 3 | 4 | 5 | 0 |
1 | 6 | 7 | 8 | 9 | 10 | 11 | 1 |
2 | 12 | 13 | 14 | 15 | 16 | 17 | 2 |
3 | 18 | 19 | 20 | 21 | 22 | 23 | 3 |
4 | 24 | 25 | 26 | 27 | 28 | 29 | 4 |
5 | 30 | 31 | 32 | 33 | 34 | 35 | 5 |
If you set the ignore_index parameter to True, then Pandas reindexes the final product and provides you with a DF with a single, continuous index.
added_datatable = pd.concat([variable_added, variable_added], ignore_index=True)
added_datatable
| | 0 | 1 | 2 | 3 | 4 | 5 | added_variable |
|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 0 |
1 | 6 | 7 | 8 | 9 | 10 | 11 | 1 |
2 | 12 | 13 | 14 | 15 | 16 | 17 | 2 |
3 | 18 | 19 | 20 | 21 | 22 | 23 | 3 |
4 | 24 | 25 | 26 | 27 | 28 | 29 | 4 |
5 | 30 | 31 | 32 | 33 | 34 | 35 | 5 |
6 | 0 | 1 | 2 | 3 | 4 | 5 | 0 |
7 | 6 | 7 | 8 | 9 | 10 | 11 | 1 |
8 | 12 | 13 | 14 | 15 | 16 | 17 | 2 |
9 | 18 | 19 | 20 | 21 | 22 | 23 | 3 |
10 | 24 | 25 | 26 | 27 | 28 | 29 | 4 |
11 | 30 | 31 | 32 | 33 | 34 | 35 | 5 |
You can easily drop rows from a DF by calling the .drop() method and passing in the index values for the rows you want dropped.
DF5_1.drop([0,2])
| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
1 | 6 | 7 | 8 | 9 | 10 | 11 |
3 | 18 | 19 | 20 | 21 | 22 | 23 |
4 | 24 | 25 | 26 | 27 | 28 | 29 |
5 | 30 | 31 | 32 | 33 | 34 | 35 |
If you're looking to drop a column, simply pass in the axis parameter set to 1.
DF5_1.drop([0,2], axis=1)
| | 1 | 3 | 4 | 5 |
|---|---|---|---|---|
0 | 1 | 3 | 4 | 5 |
1 | 7 | 9 | 10 | 11 |
2 | 13 | 15 | 16 | 17 |
3 | 19 | 21 | 22 | 23 |
4 | 25 | 27 | 28 | 29 |
5 | 31 | 33 | 34 | 35 |
To sort the rows of a DF, either in ascending or descending order, call the .sort_values() method off of the DF, and pass in the "by" parameter to specify the column you want to sort by. Here we sort by column 5 in descending order, which simply reverses the rows, since column 5 increases down the table.
DF_sorted = DF5_1.sort_values(by=[5], ascending=[False])
DF_sorted
Reading in the data
cars = pd.read_csv("/Users/Steglitz/jupyter/mtcars.csv")
Setting the column names. NOTE: Assigning to cars.columns renames the columns positionally, so list the names in the order they appear in the file; to choose which columns to display and in what order, select them as a list instead (see the sketch below).
cars.columns = ['car_names','mpg','cyl','disp', 'hp', 'drat', 'wt', 'qsec', 'vs', 'am', 'gear', 'carb']
cars.index = cars.car_names
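A minimal sketch of reordering by selection (the subset and the cars_view name are just for illustration):
# Selecting columns as a list returns them in exactly the order you ask for.
cars_view = cars[['mpg', 'hp', 'car_names']]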
Returns the first five rows.
cars.head()
| | car_names | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| car_names | | | | | | | | | | | | |
Mazda RX4 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
To group a DF by its values in a particular column, call the .groupby() method, and then pass in the column Series you want the DF to be grouped by. Here we want to group the listed cars by their number of cylinders.
cars_groups = cars.groupby(cars['cyl'])
Then you can call the .mean() method to calculate the mean values of the cars in each cylinder category.
cars_groups.mean()
| | mpg | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|
| cyl | | | | | | | | | | |
4 | 26.663636 | 105.136364 | 82.636364 | 4.070909 | 2.285727 | 19.137273 | 0.909091 | 0.727273 | 4.090909 | 1.545455 |
6 | 19.742857 | 183.314286 | 122.285714 | 3.585714 | 3.117143 | 17.977143 | 0.571429 | 0.428571 | 3.857143 | 3.428571 |
8 | 15.100000 | 353.100000 | 209.214286 | 3.229286 | 3.999214 | 16.772143 | 0.000000 | 0.142857 | 3.285714 | 3.500000 |
You only need to import what you're adding to a notebook. If this were your first import, you'd also have to add:
# import numpy as np
# import pandas as pd
# from pandas import Series, DataFrame
# We're adding the following imports for the next part.
from numpy.random import randn
import matplotlib.pyplot as plt
from matplotlib import rcParams
import seaborn as sb
When you add "%matplotlib inline", matplotlib renders the data visualizations inside the notebook instead of opening them in an external graphical user interface.
%matplotlib inline
# Figsize is represented in inches.
rcParams['figure.figsize']= 5,4
sb.set_style('whitegrid')
# Setting the range and step size for the x-axis.
x = range(1,10, 1)
# Sets the points to be plotted on the y-axis in the order of plotting from left to right.
y = [1,2,3,4,0,4,3,2,1]
# Plots the points. x and y must contain the same number of values.
plt.plot(x,y)
[<matplotlib.lines.Line2D at 0x10ab20898>]
You can render the same data as a bar chart:
plt.bar(x,y)
<Container object of 9 artists>
Here we select the mpg column Series from the cars DF and assign it to the variable "mpg".
mpg = cars['mpg']
mpg.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10c60f550>
You can represent this same data in bar form by passing the kind argument to the plot method.
mpg.plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x10c6f0a20>
You can render your bar chart horizontally by changing the kind argument from 'bar' to 'barh'.
mpg.plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x10b3605c0>
You can plot several data series at once by passing a list of the column labels you want to plot from the DF.
DF6 = cars[['cyl', 'wt', 'mpg']]
DF6.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10bfedc18>
The pie chart represents the data as a percentage of the whole. For example, x=[1,1,1] will be represented the same way as x=[9,9,9], since the slices share equal proportions. You could also represent a 25%/75% chart as x=[25,75] or as x=[1,3]. This relationship is rendered below to demonstrate that point.
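Here's a quick sketch of that claim (not one of the original plots): the two pies below render identically because only the ratios matter.
# Different absolute values, equal proportions -- identical pies.
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.pie([1, 1, 1])
ax2.pie([9, 9, 9])
plt.show()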
Additionally, you can save your figures to your working directory by using the savefig method.
savefig(fname, dpi=None, facecolor='w', edgecolor='w', orientation='portrait', papertype=None, format=None, transparent=False, bbox_inches=None, pad_inches=0.1, frameon=None)
Save needs to come before .show(), as .show() clears the figure and you'd end up with an empty image.
x = [1,3]
plt.pie(x)
plt.savefig('pie_chart1.png', transparent=True, dpi=72)
plt.show()
x = range(1,10)
y = [1,2,3,4,0,4,3,2,1]
# Blank Figure Object
fig = plt.figure()
# Figure axes. [left side, bottom, width, height]
# Blank figure with axes added
ax = fig.add_axes([.1, .1, 1, 1])
# Pass in the variables you want to plot.
ax.plot(x,y)
[<matplotlib.lines.Line2D at 0x10af0a9b0>]
# Setting the axes limits and tick marks. Each time you create a plot you need
# to start with a blank figure and add axes again.
fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1])
# Sets x and y axis limits
ax.set_xlim([1,9])
ax.set_ylim([0,5])
# Sets x and y axis tick marks
# You'll notice that 3 and 7 are removed from the chart below.
ax.set_xticks([0,1,2,4,5,6,8,9,10])
ax.set_yticks([0,1,2,3,4,5])
ax.plot(x,y)
[<matplotlib.lines.Line2D at 0x10b52dc18>]
# Create one figure with two axes at once, ax1 and ax2:
# a subplot grid of one row with two columns. (plt.subplots creates its
# own figure, so there's no need for a separate plt.figure() call first.)
fig, (ax1, ax2) = plt.subplots(1,2)
# Plots x in axis 1
ax1.plot(x)
# Plots x and y in axis 2
ax2.plot(x,y)
[<matplotlib.lines.Line2D at 0x10b452a20>]
You can set colors by using the name or passing in the hex code.
sb.set_style('whitegrid')
x = range(1, 10)
y = [1,2,3,4,0.5,4,3,2,1]
plt.bar(x, y)
<Container object of 9 artists>
You can adjust the width of individual columns
wide = [0.5, 0.5, 0.5, 0.9, 0.9, 0.9, 0.5, 0.5, 0.5]
and set the color
color = ['salmon']
by passing in those variables as arguments to your bar method.
plt.bar(x, y, width=wide, color=color, align='center')
<Container object of 9 artists>
You can plot multiple lines of data at once by passing in a list of which columns you would like to plot.
DF6 = cars[['cyl', 'mpg','wt']]
DF6.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x10d4377f0>
Again, you can select the desired colors of your chart.
color_theme = ['darkgray', 'lightsalmon', 'powderblue']
DF6.plot(color=color_theme)
<matplotlib.axes._subplots.AxesSubplot at 0x10cf0cda0>
# Resetting the pie graph.
z = [1,2,3,4,0.5]
plt.pie(z)
plt.show()
You can choose hex values or named colors. See the documentation for the list of named colors.
color_theme = ['#A9A9A9', '#FFA07A', '#B0E0E6', '#FFE4C4', '#BDB76B']
plt.pie(z, colors = color_theme)
plt.show()
# Line Styles - Default style
x1 = range(0,10)
y1 = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
plt.plot(x, y)
plt.plot(x1,y1)
[<matplotlib.lines.Line2D at 0x10bbc39b0>]
# Setting the line width and graph style.
plt.plot(x, y, drawstyle='steps', lw=5)
# Dashed line style with a width of 10.
plt.plot(x1,y1, ls='--', lw=10)
[<matplotlib.lines.Line2D at 0x10bbca128>]
# Marker styles
# You can choose from a few different marker styles.
plt.plot(x, y, marker = '1', mew=20)
plt.plot(x1,y1, marker = '+', mew=15)
[<matplotlib.lines.Line2D at 0x10b9089e8>]
rcParams['figure.figsize'] = 8,4
sb.set_style('whitegrid')
x = range(1,10)
y = [1,2,3,4,0.5,4,3,2,1]
plt.bar(x,y)
# You can set custom x- and y-axis labels.
plt.xlabel('your x-axis label goes here')
plt.ylabel('your y-axis label goes here')
<matplotlib.text.Text at 0x10bf2d128>
# Choose the values to represent in your pie chart.
z = [1 , 2, 3, 4, 0.5]
# Assign labels to that data by passing them as a list in the same order.
veh_type = ['bicycle', 'motorbike','car', 'van', 'stroller']
# Plot values and labels as a pie chart.
plt.pie(z, labels= veh_type)
plt.show()
# Adding a legend
# You can also represent the labels with a legend and let matplotlib choose
# the "best" display location.
plt.pie(z)
plt.legend(veh_type, loc='best')
plt.show()
mpg = cars.mpg
fig = plt.figure()
ax = fig.add_axes([.1, .1, 1, 1])
mpg.plot()
# Sets x-axis ticks
ax.set_xticks(range(32))
# Sets the labels, label rotation, and font size for x-axis labels.
ax.set_xticklabels(cars.car_names, rotation=60, fontsize='medium')
# Title
ax.set_title('Miles per Gallon of Cars in mtcars')
# Axes Labels
ax.set_xlabel('car names')
ax.set_ylabel('miles/gal')
<matplotlib.text.Text at 0x10baaed68>
fig = plt.figure()
ax = fig.add_axes([.1,.1,1,1])
mpg.plot()
ax.set_xticks(range(32))
ax.set_xticklabels(cars.car_names, rotation=60, fontsize='medium')
ax.set_title('Miles per Gallon of Cars in mtcars')
ax.set_xlabel('car names')
ax.set_ylabel('miles/gal')
# Adding a legend.
ax.legend(loc='best')
<matplotlib.legend.Legend at 0x10c257128>
fig = plt.figure()
ax = fig.add_axes([.1,.1,1,1])
mpg.plot()
ax.set_title('Miles per Gallon of Cars in mtcars')
ax.set_ylabel('miles/gal')
ax.set_ylim([0,45])
# Adds an in graph annotation. The value of the xy attribute sets the location
# of the tip of the arrow. The xytext value sets the location of the text. The
# arrow will adjust between the two declared points.
ax.annotate('Toyota Corolla', xy=(19,33.9), xytext = (21,35),
arrowprops=dict(facecolor='black', shrink=0.05))
<matplotlib.text.Annotation at 0x10bbd7588>