7/18/2018 - 5:21 AM

Pandas Basics


Series -----------------------------------------------------------

Create a pandas Series from a dict, an ndarray, or a scalar value

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. A Series can be created from a dict, an ndarray, or a scalar value. (tmp)

s = pd.Series(np.array([1, 2, 3]), index=['a', 'b', 'c']) # optional index must be the same length as data if provided
s = pd.Series({'a' : 1, 'b' : 2, 'c' : 3})
s = pd.Series(5., index=['a', 'b', 'c']) # scalar values will be repeated

Series behaves like an ndarray and a dict.

s.iloc[0]           # index by position
s[:3]               # slice
s[s > s.median()]   # index conditionally
s.iloc[[2, 1, 0]]   # index with an array of positions (s has 3 elements)
s['a']              # index by key
s['e'] = 12         # insert key/value
'e' in s            # inclusion
s.get('f', np.nan)  # index by key, return default if not found
s * 2               # vectorized multiplication
s + s               # vectorized addition

Note, slicing also slices the index.
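A minimal sketch of the array-like and dict-like behavior listed above, assuming a three-element Series. The names head, missing, and doubled are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

# Slicing returns a Series whose index is sliced along with the values.
head = s.iloc[:2]
print(list(head.index))        # ['a', 'b']

# Dict-style access: membership test and lookup with a default.
print('a' in s)                # True
missing = s.get('f', np.nan)   # label 'f' is absent, so the default comes back
print(missing)                 # nan

# Vectorized arithmetic keeps the labels.
doubled = s * 2
print(doubled['c'])            # 6
```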

Unlike ndarray, Series operations align data by label

Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.

>>> a = np.array([1,2,3])
>>> s = pd.Series(a)

>>> a[1:] + a[:-1]
array([3, 5])

>>> s[1:] + s[:-1]
0    NaN
1    4.0
2    NaN
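If the NaN values produced by alignment are unwanted, one option is Series.add() with fill_value. This is a small sketch, assuming that treating a missing label as 0 is acceptable for the data at hand.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
left, right = s.iloc[1:], s.iloc[:-1]   # labels [1, 2] and [0, 1]

# Plain + leaves NaN wherever a label is missing on one side.
plain = left + right
print(plain.isna().tolist())            # [True, False, True]

# add() with fill_value substitutes 0 for the missing side instead.
filled = left.add(right, fill_value=0)
print(filled.tolist())                  # [1.0, 4.0, 3.0]
```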

Dataframes --------------------------------------------------------

2-dimensional labeled data structure with columns of potentially different types. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If axis labels are not passed, they will be constructed from the input data based on common sense rules. (tmp)


Dataframe from dict of nd arrays/lists

Given feature vectors of the same length (represented as ndarrays or lists), create a dictionary where each key/value pair corresponds to a feature label and its feature vector (ndarray or list).

d = {'one' : [1., 2., 3., 4.],
     'two' : [4., 3., 2., 1.]}

df = pd.DataFrame(d)

   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

Dataframe from dict of Series/dicts

Given feature vectors of unequal length, cast each as a Series and assign it to an entry in a dictionary, where each key/value pair corresponds to a feature label/vector; indices are optionally passed as a list to the "index" keyword arg of each Series. If no index is passed, the index will be range(n), where n is the array length. A dataframe can then be created from the dictionary; its row index is the union of the Series indices.

d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
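The optional index argument mentioned earlier can be sketched with the same dictionary. The name subset is illustrative, and the row labels passed to index= are an arbitrary choice.

```python
import math
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Rows are the union of the Series indexes; 'one' has no label 'd', so it gets NaN.
df = pd.DataFrame(d)
print(df.loc['d', 'one'])       # nan

# Passing index= selects and orders the rows explicitly.
subset = pd.DataFrame(d, index=['d', 'b', 'a'])
print(list(subset.index))       # ['d', 'b', 'a']
print(subset.loc['b', 'two'])   # 2.0
```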

Python numpy


Array Creation

  1. Use Numpy for fixed-length homogeneous multidimensional arrays
  2. NumPy supports way more numerical types than Python does
  3. Specify type, with optional dtype argument, when creating arrays
  4. Prefer astype() method over static type casting functions.
  5. Create n-dimensional numpy array with np.array()
  6. Initialize array with zeros(), ones(), and empty()
  7. Create regular sequences with arange() and linspace()
  8. Create logarithmic sequences with logspace()
  9. Generate indices of a grid with indices()
  10. Partition a 1d vector into nd array with reshape()
  11. Inspect properties of a numpy array with shape, ndim, itemsize, and size attributes.

Use Numpy for fixed-length homogeneous multidimensional arrays

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy dimensions are called axes. The number of axes is rank. (quickstart tutorial)


(If you try to create heterogeneous numpy array, it will convert everything to numbers if possible, strings if not.)

np.array([True, 1, 2]) + np.array([3, 4, False])    # array([4,5,2])
np.array(['Cat', 1, 2])                             # array(['Cat', '1', '2']) (string dtype)
np.array(['Cat', 1, 2]) + np.array([3, 4, False])   # TypeError

Note, a np array of strings cannot be added element-wise to a numeric array.
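The coercion can be checked by inspecting dtype. A small sketch; the variable names are illustrative.

```python
import numpy as np

mixed_numeric = np.array([True, 1, 2.5])   # bool and int are upcast to float
print(mixed_numeric.dtype)                 # float64
print(mixed_numeric.tolist())              # [1.0, 1.0, 2.5]

mixed_text = np.array(['Cat', 1, 2])       # one string forces everything to string
print(mixed_text.dtype.kind)               # U (unicode string)
print(mixed_text.tolist())                 # ['Cat', '1', '2']
```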

NumPy supports way more numerical types than Python does

NumPy numerical types are instances of dtype (data-type) objects, available as np.bool_, np.float32, etc.

Specify type, with optional dtype argument, when creating arrays

Datatype is typically specified when creating arrays, but data-types can themselves be used as casting functions to convert python lists/numbers to np arrays/scalars. If not specified, the dtype is inferred from the data, typically int64 (int32 on some platforms) or float64.

a = np.array( [1,2,3] )
b = np.array( [1,2,3], dtype=np.float32 )
c = np.array( [1,2,3], dtype='f' )
d = np.float32(1.0)
e = np.int_([1,2,4])

Prefer astype() method over static type casting functions.

Can also cast existing np array with datatype function, but preferable to use astype().

y = np.int_([1,2,4])
z = y.astype(float)
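astype() returns a new array and leaves the original untouched. A short sketch, assuming a float-to-int conversion, where the fractional part is dropped:

```python
import numpy as np

y = np.array([1.7, 2.2, 3.9])
z = y.astype(np.int64)     # new array; float -> int truncates toward zero
print(z.tolist())          # [1, 2, 3]
print(y.dtype, z.dtype)    # float64 int64
```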

Array Creation

Create n-dimensional numpy array with np.array()

Can create a numpy array by converting python lists, lists of lists, etc., into an n-dimensional array. array() converts lists, tuples, or any object that supports the array protocol.

a = np.array( [1,2,3] )
b = np.array( [[1,2,3], [4,5,6]] )
c = np.array( [ [[1,2,3], [4,5,6]], [[7,8,9], [0,1,2]] ] )

Initialize array with zeros(), ones(), and empty()

Arrays are fixed-length, so initialize them instead of growing them as you would a list. zeros() and ones() create an array of zeros/ones, and empty() creates an array without initializing it, so its content is arbitrary.

np.zeros( (3,4) )
np.zeros( (3,4), dtype=np.int16)
np.ones( (2,3,4) )
np.empty( (2,3) )
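A sketch of the allocate-then-fill pattern described above. The squares computation is only a stand-in for whatever values need to be stored.

```python
import numpy as np

n = 5
squares = np.zeros(n, dtype=np.int64)   # allocate once, then fill in place
for i in range(n):
    squares[i] = i * i
print(squares.tolist())                 # [0, 1, 4, 9, 16]

scratch = np.empty((2, 3))              # contents are arbitrary until assigned
scratch[:] = 1.0
print(scratch.shape)                    # (2, 3)
```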

Create regular sequences with arange() and linspace()

Numpy's arange() functions just like python range(), but returns a numpy array and also accepts optional dtype argument.

np.arange(10)                       # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(2, 10, dtype=float)       # array([ 2., 3., 4., 5., 6., 7., 8., 9.])
np.arange(2, 2.5, 0.1)              # array([ 2. , 2.1, 2.2, 2.3, 2.4])

linspace() takes the first and last (inclusive) elements and the total number of elements, and calculates the spacing for you.

np.linspace(1., 4., 6)          # array([ 1. ,  1.6,  2.2,  2.8,  3.4,  4. ])

Create logarithmic sequences with logspace()

logspace() works like linspace(), but the first and last arguments are the exponents (the base 10 logs of the first/last values); an optional base argument changes the base.

np.logspace(1, 2, 10, base=10.0)
np.logspace(np.log10(10), np.log10(100), 10, base=10.0)
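One way to see what logspace() does is to compare it with raising the base to a linspace() of exponents. A sketch; the names by_hand and powers_of_two are illustrative.

```python
import numpy as np

exponents = np.linspace(1, 2, 10)
by_hand = 10.0 ** exponents              # base ** evenly spaced exponents
built_in = np.logspace(1, 2, 10)
print(np.allclose(by_hand, built_in))    # True

# With base=2.0 the same call produces powers of two.
powers_of_two = np.logspace(0, 4, 5, base=2.0)
print(powers_of_two.tolist())            # [1.0, 2.0, 4.0, 8.0, 16.0]
```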

Generate indices of a grid with indices()

np.indices() takes the shape of a n-dimensional grid and generates the indices!

grid = np.indices((3,4))

# equivalent
rows, cols = [], []
for r in range(3):
    for c in range(4):
        rows.append(r)   # row index of each grid cell
        cols.append(c)   # column index of each grid cell
rows = np.array(rows).reshape(3,4)
cols = np.array(cols).reshape(3,4)
grid = np.stack( (rows, cols), axis=0 )
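A quick look at what np.indices() returns, plus one possible use: building a table whose entries depend on their own position. The r * 10 + c formula is an arbitrary illustration.

```python
import numpy as np

grid = np.indices((3, 4))
print(grid.shape)           # (2, 3, 4): one (3, 4) array per axis

rows, cols = grid
print(rows[2].tolist())     # [2, 2, 2, 2]
print(cols[0].tolist())     # [0, 1, 2, 3]

# Each entry is computed from its own (row, column) position.
table = rows * 10 + cols
print(table[1, 3])          # 13
```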

Partition a 1d vector into nd array with reshape()

Create/initialize nd array with a numeric sequence (or any pattern) by first creating vector representation of unfolded nd array, then simply reshape it.

a = np.arange(15).reshape(3, 5)

Inspect properties of a numpy array with shape, ndim, itemsize, and size attributes.

type(a)         # <class 'numpy.ndarray'>
a.dtype.name    # 'int64' (platform dependent)
a.shape         # (3, 5)
a.ndim          # 2 (rank, i.e. number of dimensions)
a.itemsize      # 8 (size in bytes of each element, i.e. 64/8 = 8)
a.size          # 15 (total number of elements)

Print arrays

Basic Operations

Universal Functions


Indexing, Slicing and Iterating

Indexing with Arrays of Indices

Indexing with Boolean Arrays

Index np array like a list or better yet with comma indexing notation

np arrays are indexed like python lists, BUT multiple elements of a np array can be selected at once with a list or array of indices.

>>> b = np.array([1,2,3,4,5])
>>> b[[1,3]]
array([2, 4])

a = np.array([[1,2,3,4],
              [6,7,8,9]])

a[0][2]     # 3
a[0,2]      # 3
a[:,1:3]    # array([[2,3],[7,8]])
a[1,:]      # array([6,7,8,9])

np arrays can also be indexed by a boolean array (also called a logic array), where elements corresponding to True values in boolean array get indexed. Boolean arrays can also be constructed by applying a comparator to a np array.

>>> b[[False, True, True, False, False]]
>>> b[b > 3]
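A runnable sketch of the index-array and boolean-array selections above, extended with a combined condition and a masked assignment. Both extensions are illustrative additions.

```python
import numpy as np

b = np.array([1, 2, 3, 4, 5])

mask = b > 3                            # comparison yields a boolean array
print(mask.tolist())                    # [False, False, False, True, True]
print(b[mask].tolist())                 # [4, 5]
print(b[[1, 3]].tolist())               # [2, 4]

# Conditions combine with & and |; the parentheses are required.
print(b[(b > 1) & (b < 5)].tolist())    # [2, 3, 4]

# A boolean mask can also select elements for assignment.
b[b > 3] = 0
print(b.tolist())                       # [1, 2, 3, 0, 0]
```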

Shape Manipulation

Changing the shape of an array

Stacking together different arrays

Splitting one array into several smaller ones

Copies and Views

This should be moved to a new file. It's basically a cmd reference.

np.mean(a) np.std(a) np.corrcoef(a,b)
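A small usage sketch for the three functions, on made-up data where b is an exact linear function of a, so the correlation should come out as 1.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = 2 * a + 1

print(np.mean(a))          # 2.5
sd = np.std(a)             # population standard deviation (ddof=0 by default)
r = np.corrcoef(a, b)      # 2x2 correlation matrix; r[0, 1] is corr(a, b)
print(r.shape)             # (2, 2)
```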

Functions and Methods Overview

Broadcasting-Quick Start


The ix_() function

I/O with NumPy


Structured Arrays

Subclassing Arrays


np.argsort() returns the indices of array elements in sorted order. Recall you can index a numpy array with an array of indices, so it can be used to get the k smallest elements of an array in sorted order (e.g., to implement knn).

a = np.array([4,7,3,5,6,8,1,0,2])   # must be an ndarray; a list cannot be indexed with an array
i = np.argsort(a)           # [7, 6, 8, ...]
a[i[0:3]]                   # array([0, 1, 2])
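The knn use mentioned above could look roughly like this. The points, the query, and k are invented for illustration, and Euclidean distance is an assumption.

```python
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0], [4.0, 4.0]])
query = np.array([0.0, 0.0])

# Euclidean distance from the query to every point (broadcast over rows).
distances = np.sqrt(((points - query) ** 2).sum(axis=1))

k = 3
nearest = np.argsort(distances)[:k]     # indices of the k closest points
print(nearest.tolist())                 # [0, 3, 1]
print(points[nearest].tolist())         # [[0.0, 0.0], [0.5, 0.0], [1.0, 1.0]]
```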

gotchas overview

'+' operator concatenates python lists, but performs element-wise addition on np arrays

When you slice a python list you get a copy! When you slice a np array you get a VIEW! When you index a np array with a list or array of indices (fancy indexing) you get a copy! So fancy-indexing a np array is actually more similar to slicing a list.
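The copy/view difference can be demonstrated by writing through each result and checking whether the original changes. A sketch; np.shares_memory is used here only as a convenient check.

```python
import numpy as np

lst = [1, 2, 3, 4]
lst_slice = lst[:2]        # list slice: a copy
lst_slice[0] = 99
print(lst)                 # [1, 2, 3, 4]

arr = np.array([1, 2, 3, 4])
view = arr[:2]             # array slice: a view onto the same memory
view[0] = 99
print(arr.tolist())        # [99, 2, 3, 4]

picked = arr[[0, 1]]       # indexing with a list of indices: a copy
picked[0] = -1
print(arr.tolist())        # [99, 2, 3, 4]

safe = arr[:2].copy()      # explicit copy when a view is not wanted
print(np.shares_memory(arr, view), np.shares_memory(arr, safe))   # True False
```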

Use np.any(), np.all() and np.where() to test a condition across all elements of an array. any/all return True if any/all elements meet the condition. np.where(condition) returns a tuple of index arrays (one per dimension) for the elements that meet the condition. Compare that with the condition itself (a boolean array) and with indexing the array by the condition (returns the matching elements).
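Side by side on a small made-up array, the four variants look like this:

```python
import numpy as np

a = np.array([3, 8, 1, 9, 4])
cond = a > 3                       # the condition itself is a boolean array
print(cond.tolist())               # [False, True, False, True, True]

print(bool(np.any(cond)), bool(np.all(cond)))   # True False

idx = np.where(cond)               # tuple with one index array per dimension
print(idx[0].tolist())             # [1, 3, 4]

print(a[cond].tolist())            # [8, 9, 4]
```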

data munging

  1. Create dictionary from lists of keys & values with dict(zip(keys,values)) (unzip with the keys() & values() dictionary methods)
  2. Get list of tuples from dictionary with list(d.items()) (useful for iterating over dictionary, i.e., for k,v in d.items(): ...)
  3. Create dictionary with auto-generated keys with dict(enumerate(values))
  4. Implement an empty bag for counting (i.e., multiset) with d = collections.defaultdict(int) (e.g., word count)
  5. Generate full bag with collections.Counter(list_of_items)
  6. Count named tuples by type with collections.Counter(tpl.typ for tpl in named_tpls), e.g., collections.Counter( for medal in medals) (see Exploit Python Collections)
  7. Count number of non-zero elements in array, a, with np.count_nonzero(a).
  8. Count number of elements greater than x with np.count_nonzero(np.array(a) > x)
  9. Zipping uneven lists truncates to the length of the shortest, e.g. list(zip((1,2,3,4,5),('a','b','c')))
  10. Rotate a matrix m clockwise with list(zip(*reversed(m))) or [[row[i] for row in reversed(m)] for i in range(len(m[0]))]
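Several of the idioms in the list above, run on toy data. The word list and the 2 x 2 matrix are invented for illustration.

```python
import collections

keys, values = ['a', 'b', 'c'], [1, 2, 3]
d = dict(zip(keys, values))
print(d)                                   # {'a': 1, 'b': 2, 'c': 3}
print(dict(enumerate(keys)))               # {0: 'a', 1: 'b', 2: 'c'}

# Word count two ways: an empty bag filled by hand, and a full bag in one call.
words = 'the cat and the hat'.split()
bag = collections.defaultdict(int)
for w in words:
    bag[w] += 1
counts = collections.Counter(words)
print(bag['the'], counts['the'])           # 2 2

# zip stops at the end of the shortest input.
pairs = list(zip((1, 2, 3, 4, 5), ('a', 'b', 'c')))
print(pairs)                               # [(1, 'a'), (2, 'b'), (3, 'c')]

# Clockwise rotation: reverse the rows, then transpose.
m = [[1, 2],
     [3, 4]]
rotated = [list(row) for row in zip(*reversed(m))]
print(rotated)                             # [[3, 1], [4, 2]]
```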

generating synthetic data

Generate 5 x 2 matrix with values from standard normal distribution with mean 0 and standard deviation 1.

from scipy import stats as ss
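The generating line itself is missing from these notes. As a stand-in, here is a sketch that draws the 5 x 2 matrix with NumPy's random generator rather than scipy.stats; the seed is an arbitrary choice made only for repeatability.

```python
import numpy as np

rng = np.random.default_rng(seed=0)               # seed is an arbitrary assumption
x = rng.normal(loc=0.0, scale=1.0, size=(5, 2))   # mean 0, standard deviation 1
print(x.shape)                                    # (5, 2)
```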

Data Analysis



Acquiring Data

Download data with urllib.request.urlretrieve(<url>, <new file name>)

import urllib.request
urllib.request.urlretrieve('', 'stations.txt')

# now read/parse local file

open(...) returns iterable object, so peek at data with open(...).readlines()[:5] or simply cast to list list(open(...))[:5]




which is equiv to:

with open('stations.txt','r') as f:
    f.readlines()[:5]   # or list(f)[:5]

Read in text file data to dataframe with parsing functions, e.g., read_csv, read_table, read_json, etc.

Use read_csv if data is comma-separated, use read_table if tab-separated, or specify an arbitrary delimiter with the sep option.

df = pd.read_csv('examples/ex1.csv')
pd.read_table('examples/ex1.csv', sep='/')

Note, sep accepts regular expressions. e.g., \s+ for variable white space.
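A sketch of the regular-expression separator, reading from an in-memory string instead of a file. The text content is made up.

```python
from io import StringIO
import pandas as pd

# Columns separated by a varying number of spaces.
text = "a   b  c\n1  2     3\n4 5 6\n"
df = pd.read_csv(StringIO(text), sep=r'\s+')

print(list(df.columns))    # ['a', 'b', 'c']
print(df.shape)            # (2, 3)
print(df['c'].tolist())    # [3, 6]
```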

Inspect data with the info() and head()/tail() dataframe methods, and with unique() on a single column, e.g., df[column_name].unique().

If there is no header line in the file, specify column names with names=[...] or accept defaults with the header=None option.

pd.read_csv('examples/ex2.csv', header=None)
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])

Note, if the header AND names options are both omitted, the first line will be used as the header. If either option is passed as above, any header line in the file will NOT be overridden; it will become the first line of data (pass names together with header=0 to replace it).
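The header/names interaction is easy to get wrong, so here is a sketch of the four combinations on a tiny in-memory file that does contain a header line (the data is made up).

```python
from io import StringIO
import pandas as pd

text = "x,y\n1,2\n3,4\n"

default = pd.read_csv(StringIO(text))                    # first line becomes the header
print(list(default.columns), len(default))               # ['x', 'y'] 2

no_header = pd.read_csv(StringIO(text), header=None)     # header line is read as data
print(list(no_header.columns), len(no_header))           # [0, 1] 3

named = pd.read_csv(StringIO(text), names=['a', 'b'])    # header line is kept as data
print(named.iloc[0].tolist())                            # ['x', 'y']

replaced = pd.read_csv(StringIO(text), names=['a', 'b'], header=0)   # header line discarded
print(replaced['a'].tolist())                            # [1, 3]
```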

Specify index column with index_col or accept default row ids.

columns = ['a','b','c']
pd.read_csv('examples/ex2.csv', names=columns, index_col='c')
pd.read_csv('examples/ex2.csv', names=columns, index_col=['a','b']) # hierarchical index
pd.read_csv('examples/ex2.csv', names=columns) # default index

Note, by passing a list of columns names, i.e., index_col=['a','b'], you can specify a hierarchical index.

Skip n rows from the beginning/end with the skiprows/skipfooter options, or pass a list to skiprows to specify line numbers to skip.

Treat additional values as missing (NaN) with na_values (empty fields and strings such as "NA" are read as NaN by default).
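A sketch combining skiprows and na_values on an in-memory file. The leading comment line and the -999 sentinel are assumptions about what a messy export might contain.

```python
from io import StringIO
import pandas as pd

text = "# exported by some tool\nid,temp\n1,20.5\n2,-999\n3,\n"

# Skip the first line, and treat the sentinel -999 as missing.
df = pd.read_csv(StringIO(text), skiprows=1, na_values=['-999'])

print(list(df.columns))              # ['id', 'temp']
print(df['temp'].isna().tolist())    # [False, True, True]
```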

Cull rows with NaN values by indexing data with ~np.isnan(data[col]) for each numeric column (np.isnan raises TypeError on non-numeric columns)

for col in data.columns:
    data = data[~np.isnan(data[col])]
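An alternative to the loop above is the dataframe's own dropna() method, which also handles non-numeric columns. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                     'label': ['a', 'b', None]})

clean = data.dropna()                  # drop rows with a missing value in any column
print(clean.index.tolist())            # [0]

partial = data.dropna(subset=['x'])    # only consider column 'x'
print(partial.index.tolist())          # [0, 2]
```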

Specify encoding with encoding='...'

from io import BytesIO

data = b'word,length\nTr\xc3\xa4umen,7\nGr\xc3\xbc\xc3\x9fe,5'.decode('utf8').encode('latin-1')
df = pd.read_csv(BytesIO(data), encoding='latin-1')

(from Dealing with Unicode Data)

See Encodings and Unicode for full list of python encodings.

Adjust number of rows pandas displays to output with pd.options.display.max_rows setting.

Read in first n number of rows with nrows, e.g. nrows=5.

Read file chunks into an iterable by specifying number of rows per chunk with chunksize

chunks = pd.read_csv('in.csv', chunksize=1000)
for chunk in chunks:
    # ...
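What goes inside the chunk loop depends on the task. One common pattern is accumulating a running aggregate, sketched here on a 10-row in-memory file with a deliberately small chunksize.

```python
from io import StringIO
import pandas as pd

text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
sizes = []
for chunk in pd.read_csv(StringIO(text), chunksize=4):
    sizes.append(len(chunk))            # each chunk is an ordinary DataFrame
    total += chunk['value'].sum()

print(sizes)    # [4, 4, 2]
print(total)    # 45
```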

Write a dataframe (OR a series) to file with to_csv, specifying a filename and delimiter with sep (or accepting commas by default). Pass in sys.stdout instead of a filename to write to console.

data.to_csv('out.csv', sep='|')

Note, missing values are represented as empty strings unless you specify with na_rep, e.g., na_rep='NULL'

Omit column / row(index) names with header=False and index=False.

Write a subset of columns in a specified order with columns=['b','a',...]
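The to_csv options above can be combined. In this sketch no filename is passed, so to_csv returns the text as a string, which makes the result easy to inspect. The data is made up.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 2], 'b': [np.nan, 4.0]})

# '|' delimiter, NULL for missing values, no index column, columns reordered.
text = data.to_csv(sep='|', na_rep='NULL', index=False, columns=['b', 'a'])
lines = text.splitlines()
print(lines)    # ['b|a', 'NULL|1', '4.0|2']

# Dropping the header row as well.
body_only = data.to_csv(header=False, index=False).splitlines()
print(body_only)    # ['1,', '2,4.0']
```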

Get list of lines from file with list(csv.reader(open_file))

import csv

# get list of lines
with open('in.csv') as f:
    lines = list(csv.reader(f))

# wrangle/fix/etc. data manually here
# ...

# recreate dataframe
head, vals = lines[0], lines[1:]
data_dict = {h:v for h,v in zip(head,zip(*vals))}
dataframe = pd.DataFrame(data_dict)

Build feature vectors (columns) from rows, i.e., transpose rows to columns, with zip(*vals).

rows = [['1','2','3'],

# transpose rows to cols
cols = list(zip(*rows))     # list() so that len() works in Python 3
names = range(len(cols))

# create dictionary with entries of form: feature:feature_vector
data_dict = {h:v for h,v in zip(names, cols)}

# which is equiv to
data_dict = dict(enumerate(zip(*rows)))
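The whole read, transpose, rebuild sequence, end to end on an in-memory file. The names and scores are invented, and the final astype(int) is an extra step, needed because csv.reader yields strings.

```python
import csv
from io import StringIO
import pandas as pd

raw = "name,score\nann,1\nbob,2\n"

lines = list(csv.reader(StringIO(raw)))
head, vals = lines[0], lines[1:]

cols = list(zip(*vals))                 # transpose rows to columns
print(cols)                             # [('ann', 'bob'), ('1', '2')]

data_dict = dict(zip(head, cols))       # feature label -> feature vector
df = pd.DataFrame(data_dict)
df['score'] = df['score'].astype(int)   # convert the string column to integers
print(df['score'].tolist())             # [1, 2]
```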

Create a custom CSV format by defining a subclass of csv.Dialect

class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

reader = csv.reader(f, dialect=my_dialect)

(McKinney 177)
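A self-contained sketch of the dialect in use, reading from an in-memory string. The class name SemicolonDialect and the sample text are illustrative. Note the quoted field that contains the delimiter.

```python
import csv
from io import StringIO

class SemicolonDialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

f = StringIO('a;b;c\n1;"2;2";3\n')
rows = list(csv.reader(f, dialect=SemicolonDialect))
print(rows)    # [['a', 'b', 'c'], ['1', '2;2', '3']]

# The same dialect works for writing; the field containing ';' gets quoted.
out = StringIO()
csv.writer(out, dialect=SemicolonDialect).writerow(['x;y', 'z'])
print(out.getvalue())    # "x;y";z
```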

Write delimited files manually with csv.writer(open_file)

with open('out.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('a', 'b', 'c'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))

Convert to/from json object and python dictionary with json.loads() and json.dumps() from standard library.

Convert to/from json object and dataframe with pandas.read_json() and to_json() dataframe/series method.
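Both round trips in one sketch. The JSON text and the dataframe are made up, and StringIO is used so that read_json receives a file-like object rather than a literal string.

```python
import json
from io import StringIO
import pandas as pd

# JSON text <-> python dictionary
obj = json.loads('{"name": "example", "tags": ["x", "y"]}')
print(obj['tags'])                  # ['x', 'y']
text = json.dumps(obj)
print(json.loads(text) == obj)      # True

# dataframe <-> JSON text
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
as_json = df.to_json()
back = pd.read_json(StringIO(as_json))
print(back['a'].tolist())           # [1, 2]
```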

McKinney, Wes. Python for Data Analysis. 2nd ed., O’Reilly, 2018.

Data Cleaning and Preparation

Data Science Workflow


1_Acquiring Data - making, downloading, api requesting, html scraping

2_Data Analysis - parsing, cleaning, munging

3_Predictive Modeling - machine learning w/ Python

4_Static Visualizing w/ Python - matplotlib, seaborn, bokeh, etc.

5_Getting data into databases - mysql, nosql, graph

6_Creating/Deploying endpoints (API)

7_Interactive/Web Visualization w/ javascript - d3.js, p5.js

8_Analytic Apps/Dapps

Data Science Resources

ton of resources:


  1. Using Python for Research (Harvard/edX)
  2. pandas for datascience (lynda)
  3. Numpy Data Science Essential Training (lynda)
  4. Python for Data Science Essential Training (lynda)

see also:

Local Bootcamps (12wks/$16,000):


WEEK 1: Introduction to the Data Science Toolkit. Exploratory Data Analysis, Bash, Git & GitHub, Python, pandas, matplotlib, Seaborn

WEEK 2: Linear Regression and Machine Learning Intro. Web scraping via BeautifulSoup and Selenium, regression with statsmodels and scikit-learn, feature selection, overfitting and train/test splits, probability theory

WEEK 3: Linear Regression and Machine Learning Continued. Regularization, hypothesis testing, intro to Bayes Theorem

WEEK 4: Databases and Introduction to Machine Learning Concepts. Classification and regression algorithms (Knn, logistic regression, SVM, decision trees, and random forest), SQL concepts, cloud servers

WEEK 5: More supervised learning algorithms & web tools. Naive Bayes, stochastic gradient descent and intro to Deep Learning, full stack in a nutshell: Python Flask, Javascript and D3.js

WEEK 6: Statistical Fundamentals. MLE, GLM, distributions, databases (RESTful APIs, NoSQL databases, MongoDB, pymongo), Natural Language Processing techniques

WEEK 7: Unsupervised Machine Learning. Various clustering algorithms, including K-means and DBSCAN, dimension reduction techniques (PCA, SVD, LDA, NMF)

WEEK 8: More Deep Learning & Unsupervised Learning. Deep Learning via Keras, Recommender Systems

WEEK 9: Big Data. Hadoop, Hive & Spark, final project initiated

WEEK 10-12: Final Project


supplementary (tutorials, etc):



McKinney, Wes. Python for Data Analysis. 2nd ed., O’Reilly, 2018. Print.
Mitchell, Ryan. Web Scraping with Python. O’Reilly, 2015. Print.
VanderPlas, Jake. Python Data Science Handbook. O’Reilly, 2017. Print.
Grus, Joel. Data Science from Scratch. O’Reilly, 2015. Print.
Skiena, Steven S. The Data Science Design Manual. Springer, 2017. Print.
Cielen, Davy, Arno D. B. Meysman, and Mohamed Ali. Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools. Manning, 2016. Print.
Bruce, Peter C., and Andrew Bruce. Practical Statistics for Data Scientists: 50 Essential Concepts. O’Reilly, 2017. Internet resource.
Downey, Allen. Think Stats: Exploratory Data Analysis in Python. Version 2.0.35, O’Reilly, 2014. Internet resource.



Getting Started


Wes McKinney - author of "Python for Data Analysis"
Jake VanderPlas - author of "Python Data Science Handbook"

data sources

web scraping