[TOC]
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. A Series can be created from a dict, an ndarray, or a scalar value.
import numpy as np
import pandas as pd
s = pd.Series(np.array([1, 2, 3]), index=['a', 'b', 'c']) # optional index must be the same length as data if provided
s = pd.Series({'a' : 1, 'b' : 2, 'c' : 3})
s = pd.Series(5., index=['a', 'b', 'c']) # scalar values will be repeated
https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series
s[0] # index
s[:3] # slice
s[s > s.median()] # index conditionally
s[[4, 3, 1]] # index with array indices
s['a'] # index by key
s['e'] = 12 # insert key/value
'e' in s # inclusion
s.get('f', np.nan) # index by key, return default if not found
s * 2 # vectorized multiplication
s + s # vectorized addition
Note, slicing also slices the index.
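For example:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s[:2]
# a    1
# b    2
# dtype: int64   <- labels 'a' and 'b' come along with the slice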
https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series
Being able to write code without doing any explicit data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data alignment features of the pandas data structures set pandas apart from the majority of related tools for working with labeled data.
>>> a = np.array([1,2,3])
>>> s = pd.Series(a)
>>> a[1:] + a[:-1]
array([3, 5])
>>> s[1:] + s[:-1]
0 NaN
1 4.0
2 NaN
dtype: float64
https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If axis labels are not passed, they will be constructed from the input data based on common-sense rules.
Given feature vectors of the same length (represented as ndarrays or lists), create a dictionary where each key/value pair corresponds to a feature label and feature vector (ndarray or list).
d = {'one' : [1., 2., 3., 4.],
'two' : [4., 3., 2., 1.]}
df = pd.DataFrame(d)
---
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
Given feature vectors of unequal length, cast each as a Series and assign to an entry in a dictionary, where each key/value pair corresponds to a feature label/vector, and indices are optionally passed as a list to the "index" keyword arg. If no index is passed, the result will be range(n), where n is the array length. Then create a dataframe; values are aligned on the union of the indices, with NaN filling the gaps.
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
---
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
https://pandas.pydata.org/pandas-docs/stable/dsintro.html#from-dict-of-series-or-dicts
[TOC]
Array Creation
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of positive integers. In NumPy dimensions are called axes. The number of axes is rank. (quickstart tutorial)
(If you try to create a heterogeneous numpy array, it will convert everything to numbers if possible, strings if not.)
np.array([True, 1, 2]) + np.array([3, 4, False]) # array([4,5,2])
np.array(['Cat', 1, 2]) # array(['Cat', '1', '2'])
np.array(['Cat', 1, 2]) + np.array([3, 4, False]) # TypeError
Note, you cannot do element-wise addition of np arrays of strings.
NumPy numerical types are instances of dtype (data-type) objects, available as np.bool_, np.float32, etc.
Specify datatype with dtype argument when creating arrays
Datatype is typically specified when creating arrays, but data-types can be used themselves as casting functions to convert python lists/numbers to np arrays/scalars. If not specified, the default data type of a np array is inferred as an int type (int32/int64, platform-dependent) or float64.
a = np.array( [1,2,3] )
b = np.array( [1,2,3], dtype=np.float32 )
c = np.array( [1,2,3], dtype='f' )
d = np.float32(1.0)
e = np.int_([1,2,4])
Prefer the astype() method over static type casting functions
Can also cast an existing np array with a datatype function, but it is preferable to use astype().
y = np.int_([1,2,4])
z = y.astype(float)
np.array()
Create a numpy array by converting a python list, or list of lists, etc., into an n-dimensional array. array() converts lists, tuples, or any object that supports the array protocol.
a = np.array( [1,2,3] )
b = np.array( [[1,2,3], [4,5,6]] )
c = np.array( [ [[1,2,3], [4,5,6]], [[7,8,9], [0,1,2]] ] )
zeros(), ones(), and empty()
Arrays are fixed-length, so initialize them instead of growing them as you would a list. zeros() and ones() create an array of zeros/ones, and empty() creates an array of uninitialized (arbitrary) content.
np.zeros( (3,4) )
np.zeros( (3,4), dtype=np.int16)
np.ones( (2,3,4) )
np.empty( (2,3) )
arange() and linspace()
Numpy's arange() functions just like python's range(), but returns a numpy array, accepts float arguments, and also accepts an optional dtype argument.
np.arange(10) # array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
np.arange(2, 10, dtype=float) # array([ 2., 3., 4., 5., 6., 7., 8., 9.])
np.arange(2, 2.5, 0.1) # array([ 2. , 2.1, 2.2, 2.3, 2.4])
linspace() takes the first and last (inclusive) elements and the total number of elements, and calculates the spacing for you.
np.linspace(1., 4., 6) # array([ 1. , 1.6, 2.2, 2.8, 3.4, 4. ])
logspace()
logspace() works like linspace(), but the first and last arguments are the base-10 logs (exponents) of the desired start/stop values (optional base argument to change the base).
np.logspace(1, 2, 10, base=10.0)
np.logspace(np.log10(10), np.log10(100), 10, base=10.0)
indices()
np.indices() takes the shape of an n-dimensional grid and generates the indices!
grid = np.indices((3,4))
# equivalent
rows, cols = [], []
for r in range(3):
    for c in range(4):
        rows.append(r)
        cols.append(c)
rows = np.array(rows).reshape(3,4)
cols = np.array(cols).reshape(3,4)
grid = np.stack( (rows, cols), axis=0 )
reshape()
Create/initialize an nd array with a numeric sequence (or any pattern) by first creating the vector representation of the unfolded nd array, then simply reshaping it.
a = np.arange(15).reshape(3, 5)
shape, ndim, itemsize, and size attributes
type(a) # <type 'numpy.ndarray'>
a.dtype.name # 'int64'
a.shape # (3, 5)
a.ndim # 2 (rank, i.e. number of dimensions)
a.itemsize # 8 (size in bytes of each element, i.e. 64/8 = 8)
a.size # 15 (total number of elements)
Print arrays
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html#printing-arrays
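A few basics from the linked page:
import sys
import numpy as np
a = np.arange(6)                       # 1d arrays print as a row
print(a)                               # [0 1 2 3 4 5]
b = np.arange(12).reshape(4, 3)        # 2d arrays print as matrices
print(b)                               # nested-list layout, one row per line
np.set_printoptions(threshold=sys.maxsize)  # force full output for large arrays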
Universal Functions
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html#universal-functions
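Universal functions (ufuncs) operate elementwise on arrays; a few familiar ones:
import numpy as np
a = np.arange(4)             # [0 1 2 3]
np.exp(a)                    # elementwise e**x
np.sqrt(a)                   # elementwise square root
np.add(a, np.ones(4))        # same as a + np.ones(4)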
Indexing, Slicing and Iterating
Indexing with Arrays of Indices
np arrays are indexed like python lists, BUT multiple elements of a np array can be selected at once with a list/tuple/array of indices.
>>> b = np.array([1,2,3,4,5])
>>> b[[1,3]]
array([2,4])
a = np.array([[1,2,3,4],
[6,7,8,9]])
a[0][2] # 3
a[0,2] # 3
a[:,1:3] # array([[2,3],[7,8]])
a[1,:] # array([6,7,8,9])
np arrays can also be indexed by a boolean array (also called a logic array): elements corresponding to True values in the boolean array are selected. Boolean arrays can be constructed by applying a comparator to a np array.
>>> b[[False, True, True, False, False]]
array([2,3])
>>> b[b > 3]
array([4,5])
Shape Manipulation
Changing the shape of an array
Stacking together different arrays
Splitting one array into several smaller ones
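These three operations are covered at the quickstart link above; a minimal sketch of each (array names here are illustrative):
import numpy as np

a = np.floor(10 * np.random.random((3, 4)))

# changing the shape of an array
a.ravel()                        # flatten to 1d
a.reshape(6, 2)                  # same data, new shape
a.T                              # transpose, shape (4, 3)

# stacking together different arrays
b = np.ones((3, 4))
np.vstack((a, b))                # stack row-wise    -> shape (6, 4)
np.hstack((a, b))                # stack column-wise -> shape (3, 8)

# splitting one array into several smaller ones
np.hsplit(a, 2)                  # two (3, 2) arrays, split along columns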
This should be moved to a new file; it's basically a command reference.
np.mean(a), np.median(a), np.std(a), np.corrcoef(a,b)
Functions and Methods Overview
np.argsort() returns the indices of array elements in sorted order. Recall you can index a numpy array with an array of indices, so this can be used to get the k smallest elements of an array in sorted order (e.g., to implement kNN).
a = np.array([4,7,3,5,6,8,1,0,2])
i = np.argsort(a) # [7, 6, 8, ...]
a[i[0:3]] # [0, 1, 2]
'+' operator concatenates python lists, but performs element-wise addition on np arrays
When you slice a python list you get a COPY. When you slice a np array you get a VIEW. When you fancy-index a np array (with an array of indices) you get a copy. So fancy-indexing a np array is actually more similar to slicing a list.
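A quick demonstration (a contrived list/array pair):
import numpy as np

lst = [1, 2, 3, 4]
sub = lst[:2]
sub[0] = 99            # lst unchanged: slicing a list copies

a = np.array([1, 2, 3, 4])
v = a[:2]
v[0] = 99              # a IS changed: slicing an ndarray returns a view
print(a)               # [99  2  3  4]

c = a[[0, 1]]
c[0] = -1              # a unchanged: fancy indexing returns a copy
print(a)               # [99  2  3  4]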
np.any(), np.all(), and np.where() apply a condition to all elements in an array. any/all return True/False if any/all elements meet the condition. np.where() returns a tuple of index arrays for the positions that meet the condition. Compare with indexing an array with a condition (the condition itself returns a boolean array, used as a mask).
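For example:
import numpy as np

a = np.array([1, 5, 3, 8])
np.any(a > 7)          # True  -> at least one element > 7
np.all(a > 0)          # True  -> every element > 0
np.where(a > 2)        # (array([1, 2, 3]),) -> tuple of index arrays
a[a > 2]               # array([5, 3, 8])    -> indexing with a condition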
- dict(zip(keys, values)) (unzip with the keys() & values() dictionary methods)
- d.items() (useful for iterating over a dictionary, i.e., for k, v in d.items(): ...)
- dict(enumerate(values))
- d = collections.defaultdict(int) (e.g., word count; see the sketch below)
- collections.Counter(list_of_items); also works over generators, e.g., collections.Counter(medal.team for medal in medals) (see Exploit Python Collections)
- np.count_nonzero(a), np.count_nonzero(np.array(a) > x)
- rotate a matrix: list(zip(*reversed(m))) or [[row[i] for row in reversed(m)] for i in range(len(m[0]))]
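A minimal sketch of a few of these recipes (hypothetical word list):
import collections

words = ['a', 'rose', 'is', 'a', 'rose']

# word count with defaultdict
counts = collections.defaultdict(int)
for w in words:
    counts[w] += 1                # missing keys default to 0

# same thing with Counter
counts = collections.Counter(words)
counts.most_common(2)             # [('a', 2), ('rose', 2)]

# dict from parallel lists
d = dict(zip(['one', 'two'], [1, 2]))   # {'one': 1, 'two': 2}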
Generate 5 x 2 matrix with values from standard normal distribution with mean 0 and standard deviation 1.
from scipy import stats as ss
ss.norm(0,1).rvs((5,2))
[TOC]
urllib.request.urlretrieve(<url>, <new file name>)
import urllib.request
urllib.request.urlretrieve('ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt', 'stations.txt')
# now read/parse local file
open(...)
open(...) returns an iterable object, so peek at data with open(...).readlines()[:5] or simply cast to a list: list(open(...))[:5]
open('stations.txt','r').readlines()[:10]
or
list(open('stations.txt','r'))[:5]
which is equiv to:
with open('stations.txt','r') as f:
    f.readlines()[:5]
    # or list(f)[:5]
read_csv, read_table, read_json, etc.
Use read_csv if data is comma-separated, read_table if tab-separated, or specify an arbitrary delimiter with the sep option.
df = pd.read_csv('examples/ex1.csv')
pd.read_table('examples/ex1.csv', sep='/')
Note, sep accepts regular expressions, e.g., \s+ for variable whitespace.
Inspect data with the info() and head()/tail() dataframe methods, and unique() on a column.
Specify column names with names=[...] or accept defaults with the header=None option.
pd.read_csv('examples/ex2.csv', header=None)
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
Note, if the header AND names options are both omitted, the first line will be used as the header. If names is passed, any header line in the file will NOT be overridden; it will become the first row of data (pass header=0 to consume it).
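For instance, to supply your own names AND consume the file's header line (reusing the hypothetical examples/ex2.csv):
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], header=0)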
Specify the column to use as row labels with index_col, or accept default integer row ids.
columns = ['a','b','c']
pd.read_csv('examples/ex2.csv', names=columns, index_col='c')
pd.read_csv('examples/ex2.csv', names=columns, index_col=['a','b']) # hierarchical index
pd.read_csv('examples/ex2.csv', names=columns) # default index
Note, by passing a list of column names, i.e., index_col=['a','b'], you can specify a hierarchical index.
Skip lines with the skiprows/skipfooter options, or pass a list to skiprows to specify line numbers to skip.
Specify sentinel strings to treat as missing with na_values (sentinels like 'NA' and empty strings are replaced with NaN by default). To instead drop rows with missing values after loading:
for col in data.columns:
    data = data[~np.isnan(data[col])]
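For example (sentinel values here are hypothetical):
pd.read_csv('examples/ex2.csv', na_values=['NULL', -999])
# per-column sentinels via a dict:
pd.read_csv('examples/ex2.csv', na_values={'message': ['foo', 'NA']})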
Specify the file encoding with encoding='...'
from io import BytesIO
data = b'word,length\nTr\xc3\xa4umen,7\nGr\xc3\xbc\xc3\x9fe,5'.decode('utf8').encode('latin-1')
df = pd.read_csv(BytesIO(data), encoding='latin-1')
(from Dealing with Unicode Data)
See Encodings and Unicode for full list of python encodings.
Limit how many rows pandas displays with the pd.options.display.max_rows setting.
Read only the first few rows of a file with nrows, e.g. nrows=5.
Read a file in pieces with chunksize:
chunks = pd.read_csv('in.csv', chunksize=1000)
for chunk in chunks:
    ...  # process each chunk
Write data out with to_csv, specifying a filename and delimiter with sep (or accepting commas by default). Pass sys.stdout instead of a filename to write to the console.
data.to_csv(sys.stdout)
data.to_csv('out.csv')
data.to_csv('out.csv', sep='|')
Note, missing values are written as empty strings unless you specify a representation with na_rep, e.g., na_rep='NULL'.
Suppress the row and column labels with header=False and index=False.
Write a subset of the columns, in a chosen order, with columns=['b','a',...].
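Combining these options (assuming the dataframe data from above has columns 'a' and 'b'):
import sys
data.to_csv(sys.stdout, na_rep='NULL', index=False, header=False)
data.to_csv(sys.stdout, index=False, columns=['b', 'a'])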
For manual wrangling, get a list of parsed lines with list(csv.reader(open_file)):
# get list of lines
import csv
with open('in.csv') as f:
    lines = list(csv.reader(f))
# wrangle/fix/etc. data manually here
# ...
# recreate dataframe
head, vals = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(head, zip(*vals))}
dataframe = pd.DataFrame(data_dict)
Transpose rows to columns with zip(*vals).
rows = [['1','2','3'],
        ['1','2','3'],
        ['1','2','3']]
# transpose rows to cols
cols = list(zip(*rows))
names = range(len(cols))
# create dictionary with entries of form: feature:feature_vector
data_dict = {h: v for h, v in zip(names, cols)}
# which is equiv to
data_dict = {h: v for h, v in zip(range(len(cols)), zip(*rows))}
csv.Dialect
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL
reader = csv.reader(f, dialect=my_dialect)
(McKinney 177)
csv.writer(open_file)
with open('out.csv', 'w') as f:
    writer = csv.writer(f)
    writer.writerow(('a', 'b', 'c'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
json.loads() and json.dumps()
Parse/serialize JSON with json.loads() and json.dumps() from the standard library, or read/write JSON directly to/from a dataframe with pandas.read_json() and the to_json() dataframe/series method.
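A small sketch of both round trips (the JSON string here is made up):
import json
import pandas as pd

obj = json.loads('{"a": [1, 2], "b": [3, 4]}')   # JSON string -> dict
txt = json.dumps(obj)                            # dict -> JSON string

df = pd.DataFrame(obj)
js = df.to_json()        # '{"a":{"0":1,"1":2},"b":{"0":3,"1":4}}'
pd.read_json(js)         # back to a dataframe (newer pandas prefers a file/StringIO here)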
McKinney, Wes. Python for Data Analysis. 2nd ed., O'Reilly, 2018.
[TOC]
ton of resources: https://blog.peoplemaven.com/best-data-science-books-articles-f2fa755f2b9d
No. | Course | Institution | Effort | Status
--- | --- | --- | --- | ---
--- | Using Python for Research | Harvard/edX | --- | ---
--- | pandas for datascience | lynda | --- | ---
--- | Numpy Data Science Essential Training | lynda | --- | ---
--- | Python for Data Science Essential Training | lynda | --- | ---
--- | xxx | --- | --- | ---
--- | xxx | --- | --- | ---
see also: https://www.datacamp.com/tracks/data-scientist-with-python https://www.coursera.org/specializations/data-science-python https://www.coursera.org/specializations/data-science https://www.udemy.com/web-scraping-with-python-beautifulsoup/
Local Bootcamps (12wks/$16,000): https://www.thisismetis.com/data-science-bootcamps https://www.galvanize.com/seattle/data-science
WEEK 1: Introduction to the Data Science Toolkit Exploratory Data Analysis, Bash, Git & GitHub, Python, pandas, matplotlib, Seaborn
WEEK 2: Linear Regression and Machine Learning Intro Web scraping via BeautifulSoup and Selenium, regression with statsmodels and scikit-learn, feature selection overfitting and train/test splits, probability theory.
WEEK 3: Linear Regression and Machine Learning Continued Regularization, hypothesis testing, intro to Bayes' Theorem
WEEK 4: Databases and Introduction to Machine Learning Concepts Classification and regression algorithms (kNN, logistic regression, SVM, decision trees, and random forests), SQL concepts, cloud servers
WEEK 5: More supervised learning algorithms & web tools Naive Bayes, stochastic gradient descent and intro to Deep Learning, Full stack in a nutshell: Python Flask, Javascript and D3.js
WEEK 6: Statistical Fundamentals MLE, GLM, Distributions, Databases (RESTful APIs, NoSQL databases, MongoDB, pymongo), Natural Language Processing techniques
WEEK 7: Unsupervised Machine Learning Various clustering algorithms, including K-means and DBSCAN, dimension reduction techniques (PCA, SVD, LDA, NMF)
WEEK 8: More Deep Learning & Unsupervised Learning Deep Learning via Keras, Recommender Systems
WEEK 9: Big Data Hadoop, Hive & Spark, Final project initiated
WEEK 10-12: Final Project
from https://www.thisismetis.com/data-science-bootcamps
tbd...
McKinney, Wes. Python for Data Analysis. 2nd ed., O'Reilly, 2018. Print.
Mitchell, Ryan. Web Scraping with Python. O'Reilly, 2015. Print.
VanderPlas, Jake. Python Data Science Handbook. O'Reilly, 2017. Print.
Grus, Joel. Data Science from Scratch. O'Reilly, 2015. Print.
Skiena, Steven S. The Data Science Design Manual. Springer, 2017. Print.
Cielen, Davy, Arno D. B. Meysman, and Mohamed Ali. Introducing Data Science: Big Data, Machine Learning, and More, Using Python Tools. Manning, 2016. Print.
Bruce, Peter C., and Andrew Bruce. Practical Statistics for Data Scientists: 50 Essential Concepts. O'Reilly, 2017. Internet resource.
Downey, Allen. Think Stats: Exploratory Data Analysis in Python. Version 2.0.35, O'Reilly, 2014. Internet resource.
Getting Started https://www.kaggle.com/dansbecker/learning-materials-on-kaggle https://www.kaggle.com/rtatman/the-5-day-data-challenge/ https://www.kaggle.com/rtatman/fun-beginner-friendly-datasets
Wes McKinney - author of "Python for Data Analysis"
Jake VanderPlas - author of "Python Data Science Handbook"