7/21/2016 - 8:06 AM

2. explore_view_subset_selecting (python data science).md

2. explore_view_subset_selecting python data science.md

Rendered
Source

[TOC]

Selecting

Cells

Y = data['TV'] # column
Y = data.TV
df.ix['Arizona', 2]  # Select the third cell in the row named Arizona
df.ix[2, 'deaths']  # Select the third cell down in the column named deaths

df[:2] # first two rows
df.ix['Maricopa'] # view a row
df.ix[:, 'coverage'] # view a column
df.ix['Yuma', 'coverage'] # view the value based on a row and column

Rows

by index

df.iloc[:2] # rows by row number
df.iloc[1:2] # Select the second and third row
df.iloc[2:] # Select every row after the third row
df.iloc[3:6,0:3] #

by label

df.loc[:'Arizona'] # all rows by index label
df.ix[['Arizona', 'Texas']]   # .ix is the combination of both .loc and .iloc. Integers are first considered labels,if not found, falls back on pos indexing

conditional

df.query('A > C')
df.query('A > 0') 
df.query('A > 0 & A < 1')
df.query('A > B | A > C')
df[df['coverage'] > 50] # all rows where coverage is more than 50
df[(df['deaths'] > 500) | (df['deaths'] < 50)]
df[(df['score'] > 1) & (df['score'] < 5)]
df[~(df['regiment'] == 'Dragoons')] # Select all the regiments not named "Dragoons"

df[df['age'].notnull() & df['sex'].notnull()]  # ignore the missing data points

is in

df[df.name.isin(value_list)]   # value_list = ['Tina', 'Molly', 'Jason']
df[~df.name.isin(value_list)]

partial matching

df2[df2.E.str.contains("tw|ou")]

regex

df['raw'].str.contains('....-..-..', regex=True)  # regex

where cells are arrays

df[df['country'].map(lambda country: 'Syria' in country)]

random

df.take(np.random.permutation(len(df))[:2])


## Cell ranges

```python
df['x3'][:1]  # take column 'x3',

Columns

df.iloc[:,:2] # Select the first 2 columns
feature_cols = ['TV','Radio','Newspaper']
x = data[feature_cols]
data[['TV','Radio','Newspaper']]

by column labels

df.loc[:,['A','B']]  # syntax is: df.loc[rows_index, cols_index]

conditional

df.filter(like='data')
df['preTestScore'].where(df['postTestScore'] > 50) # Find where a value exists in a column

Sorting

df.sort_values(by='reports', ascending=0)  # Sort the dataframe's rows by reports, in descending order
df.sort_values(by=['coverage', 'reports'])  # Sort the dataframe's rows by coverage and then by reports, in ascending order

Exploring

general

data.head()
data.tail()
data.tail().transpose()

dimensions

data.shape()

List unique values in the df['name'] column

df.name.unique()

Stats

Descriptive statistics by group

data['preTestScore'].groupby(df['company']).describe()
data['preTestScore'].describe()

Count the number of non-NA values

data['preTestScore'].count()
data['preTestScore'].min()

Value counts

data.group.value_counts()

Correlation Matrix Of Values

df.corr()

pd.rolling_mean(df, 2)  # Calculate the moving average. That is, take
                        # the first two values, average them, 
                        # then drop the first and add the third, etc.

Mean preTestScores grouped by regiment and company

#regiment    company
#Dragoons    1st         3.5
#            2nd        27.5
#Nighthawks  1st        14.0
#            2nd        16.5
#Scouts      1st         2.5
#            2nd         2.5
#dtype: float64

data['preTestScore'].groupby([df['regiment'], data['company']]).mean()

Mean preTestScores grouped by regiment and company without heirarchical indexing

#company    1st 2nd
#regiment       
#Dragoons   3.5 27.5
#Nighthawks 14.0    16.5
#Scouts 2.5 2.5

data['preTestScore'].groupby([data['regiment'], data['company']]).mean().unstack()

Group the entire dataframe by regiment and company

#preTestScore   postTestScore
#regiment   company     
#Dragoons   1st 3.5 47.5
#2nd    27.5    75.5
#Nighthawks 1st 14.0    59.5
#2nd    16.5    59.5
#Scouts 1st 2.5 66.0
#2nd    2.5 66.0

df.groupby(['regiment', 'company']).mean()

Count the number of times each number of deaths occurs in each regiment

 Input
    guardCorps  corps1  corps2  corps3  corps4  corps5  corps6  corps7  corps8  corps9  corps10 corps11 corps14 corps15
 1875   0   0   0   0   0   0   0   1   1   0   0   0   1   0
 1876   2   0   0   0   1   0   0   0   0   0   0   0   1   1
 1877   2   0   0   0   0   0   1   1   0   0   1   0   2   0

result = horsekick.apply(pd.value_counts).fillna(0); result

Create a crosstab table by company and regiment

 regiment   company experience  name    preTestScore    postTestScore
 0  Nighthawks  infantry    veteran Miller  4   25
 1  Nighthawks  infantry    rookie  Jacobson    24  94
 2  Nighthawks  cavalry veteran Ali 31  57
 -->
 company    cavalry infantry    All
 regiment           
 Dragoons   2   2   4
 Nighthawks 2   2   4

Counting the number of observations by regiment and category

pd.crosstab(df.regiment, df.company, margins=True)

?

df['preTestScore'].idxmax() # row with max value

Cacher is the code snippet organizer for pro developers

We empower you and your team to get more done, faster

2. explore_view_subset_selecting (python data science).md

Selecting

Cells

Rows

by index

by label

conditional

is in

partial matching

regex

where cells are arrays

random

Columns

by column labels

conditional

Sorting

Exploring

general

dimensions

List unique values in the df['name'] column

Stats

Descriptive statistics by group

Count the number of non-NA values

Value counts

Correlation Matrix Of Values

Mean preTestScores grouped by regiment and company

Mean preTestScores grouped by regiment and company without heirarchical indexing

Group the entire dataframe by regiment and company

Count the number of times each number of deaths occurs in each regiment

Create a crosstab table by company and regiment

Counting the number of observations by regiment and category

?