Python.Modules.Pandas #python #Python #Modules #PythonModules #pandas #Pandas #datascience
PYTHON PANDAS MODULE
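The snippets below assume a small DataFrame along these lines (a sketch; the exact values are assumptions reconstructed from the printed output):
import pandas as pd
df = pd.DataFrame({"SCENARIO": ["AAA", "AAA", "AAA", "AAA", "BBB", "CCC"],
                   "STATUS":   [51, 51, 51, 53, 51, 51],
                   "TYPE":     [1, 1, 9, 228, 43, 187]})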
#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
#aggregate sums over the first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add a 3rd level to the MultiIndex
df1.index = [df1.index.get_level_values(0),
df1.index.get_level_values(1),
['Total'] * len(df1)]
print (df1)
SCENARIO STATUS
AAA 51 Total 3
53 Total 1
BBB 51 Total 1
CCC 51 Total 1
Name: TYPE, dtype: int64
#join together and sort
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
#############################################################################################################
Use GroupBy.count with the column specified in [] if you don't need to count NaNs:
table = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
print (table)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
Name: TYPE, dtype: int64
Or use GroupBy.size; specifying a column is not necessary, but the difference is that it also counts NaNs:
table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).size()
print (table)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
dtype: int64
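On pandas 1.1+ the same counts come from DataFrame.value_counts in one call (note it sorts by count unless sort=False, and, like count, drops NaN by default):
table = df.value_counts(subset=["SCENARIO", "STATUS", "TYPE"], sort=False)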
If you need regular columns from the MultiIndex Series:
table = (df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE']
.count()
.reset_index(name='COUNT(TYPE)'))
print (table)
SCENARIO STATUS TYPE COUNT(TYPE)
0 AAA 51 1 2
1 AAA 51 9 1
2 AAA 53 228 1
3 BBB 51 43 1
4 CCC 51 187 1
table = (df.groupby(["SCENARIO", "STATUS", "TYPE"])
.size()
.reset_index(name='COUNT(TYPE)') )
print (table)
SCENARIO STATUS TYPE COUNT(TYPE)
0 AAA 51 1 2
1 AAA 51 9 1
2 AAA 53 228 1
3 BBB 51 43 1
4 CCC 51 187 1
Finally, if you don't need to write the first column (the index) to Excel:
table.to_excel(writer, sheet_name='Sheet1', index=False)
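Here writer is a pandas ExcelWriter instance, e.g. (a minimal sketch; 'output.xlsx' is a placeholder name):
with pd.ExcelWriter('output.xlsx') as writer:
    table.to_excel(writer, sheet_name='Sheet1', index=False)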
1. import_pandas.py - Import Pandas
2. read_csv.py - Read a csv file
3. fix_import - Fix import by specifying the index column
4. print_whole - Print whole table
5. Print column - Print a single column (or more)
6. Print Row - Print a row (or more)
7. extract_df.py - Extract dataframe from dataframe
8. extract_mult.py - Extract Multiple columns/rows
9. add_col.py - Add columns to dataframe
10. edit_row.py - Edit Row / Column
11. Column_prod.py - Column as a product of others
12. manual_dec_df.py - Manually declare dataframe
13. condition_index.py - Conditional indexing / selecting
14. rename_col.py - Rename column
15. handle_missing.py - Handle missing values
16. map_apply.py - Map / Apply to all elements
17. vector_op.py - Vectorized operation on dataframe
18. groupby.py - Perform SQL like groupings
19. newfcolumn.py - One/Multiple column(s) as a function of one/multiple columns
20. Dataframe_stats.py - Produce stats for each column in dataframe
21. MergeJoin.py - Merge Tables, various types of joins
22. plotpanda.py - Plotting with pandas
23. Removespecificcolumns - Remove specific rows/columns
24. closeexcelfile.py - How to close excel file
25. copyexcelsheet.py - How to copy excel sheet to another workbook
26. renameexcelsheet.py - How to rename excel sheet
27. fillNaNinexcel.py - How to fill NaN values in excel
28. subtotalswithgroupby.py - Subtotals with group by in excel
I believe DataFrame.fillna() will do this for you.
Link to Docs for a dataframe and for a Series.
Example:
In [7]: df
Out[7]:
0 1
0 NaN NaN
1 -0.494375 0.570994
2 NaN NaN
3 1.876360 -0.229738
4 NaN NaN
In [8]: df.fillna(0)
Out[8]:
0 1
0 0.000000 0.000000
1 -0.494375 0.570994
2 0.000000 0.000000
3 1.876360 -0.229738
4 0.000000 0.000000
To fill the NaNs in only one column, select just that column. In this case I'm using inplace=True to actually change the contents of df.
In [12]: df[1].fillna(0, inplace=True)
Out[12]:
0 0.000000
1 0.570994
2 0.000000
3 -0.229738
4 0.000000
Name: 1
In [13]: df
Out[13]:
0 1
0 NaN 0.000000
1 -0.494375 0.570994
2 NaN 0.000000
3 1.876360 -0.229738
4 NaN 0.000000
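On current pandas, fillna with inplace=True returns None (the Out above is from an older version), and calling it on a column selection can raise chained-assignment warnings; assigning the result back is the safer idiom:
df[1] = df[1].fillna(0)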
You can do this by doing the following:
import openpyxl
ss=openpyxl.load_workbook("file.xlsx")
#get the sheet by name (get_sheet_by_name is deprecated in newer openpyxl;
#ss['Sheet'] is the current idiom)
ss_sheet = ss['Sheet']
ss_sheet.title = 'Fruit'
ss.save("file.xlsx")
This works for me. Hope this helps.
Thank you both! After taking your advice and trying for 5 minutes, the following worked!
import win32com.client as win32
import os
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Add()
for f in [os.path.join(os.getcwd(), "CM1.xlsx"), os.path.join(os.getcwd(), "CM2.xlsx")]:
    w = excel.Workbooks.Open(f)
    w.Sheets(1).Copy(wb.Sheets(1))
wb.SaveAs(os.path.join(os.getcwd(), "CM.xlsx"))
excel.Application.Quit()
This will paste sheet 1 of tocopy_wb into sheet 1 of empty_wb:
empty_wb = xl.Workbooks.Open(util.getTestRessourcePath("emptyWorkbook.xlsx"))
tocopy_wb = xl.Workbooks.Open(util.getTestRessourcePath("toCopy.xls"))
tocopy_wb.Sheets(1).Cells.Copy()
empty_wb.Sheets(1).Paste(empty_wb.Sheets(1).Range("A1"))
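Here xl is assumed to be an Excel COM Application object, e.g.:
import win32com.client as win32
xl = win32.gencache.EnsureDispatch('Excel.Application')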
I think you are forgetting to close the file. Once you are done writing data to the file, you need to close the file handle with excel.close(), as mentioned in the docs.
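For example, when writing with pandas' ExcelWriter (a sketch; 'report.xlsx' is a placeholder, and the with-block form closes the file automatically):
writer = pd.ExcelWriter('report.xlsx')
df.to_excel(writer, sheet_name='Sheet1')
writer.close()   # flush and release the file handle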
You can use drop(...) to remove rows, and drop(..., axis=1) to remove columns
data = [
['', '', '', ''],
['', 1, 2, 3],
['', '', '', ''],
['', 7, 8, 9],
]
import pandas as pd
df = pd.DataFrame(data)
# drop first column - [0]
df = df.drop(0, axis=1)
# drop first and third row - [0,2]
df = df.drop([0,2])
print(df)
Before:
0 1 2 3
0
1 1 2 3
2
3 7 8 9
After:
1 2 3
1 1 2 3
3 7 8 9
#####################################################################
You could use dropna:
In [13]: df2[df1 != df2].dropna(how='all')
Out[13]:
col1 col2
0 4 6
2 5 5
Note: dropna() is the same as dropna(how='any') by default, which drops any row containing a NaN; dropna(how='all') drops a row only if all of the row's values are NaN.
Another way to get the rows of df2 that are not in df1:
>>> df = df2[~df2.isin(df1).all(1)]
>>> df
col1 col2
0 4 6
2 5 5
##################################################################################################
Use iloc like you were using ix, but apply a different slice...
df2 = df1.iloc[3:] #edited since .ix is now deprecated.
will give you a new df without the first three rows.
I think a more explicit way of doing this is to use drop.
The syntax is:
df.drop(label)
And as pointed out by @tim and @ChaimG, this can be done in-place:
df.drop(label, inplace=True)
One way of implementing this could be:
df.drop(df.index[:3], inplace=True)
And another "in place" use:
df.drop(df.head(3).index, inplace=True)
df.drop(df.index[[0,2]])
Pandas uses zero based numbering, so 0 is the first row, 1 is the second row and 2 is the third row.
#Vectorized mathematical and string operations
#One can perform vectorized calculations using simple operators and numpy functions (numpy imported as np).
In [4]: df = pd.DataFrame(data={"A":[1,2], "B":[1.2,1.3]})
In [5]: df["C"] = df["A"]+df["B"]
In [6]: df
Out[6]:
A B C
0 1 1.2 2.2
1 2 1.3 3.3
In [7]: df["D"] = df["A"]*3
In [8]: df
Out[8]:
A B C D
0 1 1.2 2.2 3
1 2 1.3 3.3 6
In [9]: df["E"] = np.sqrt(df["A"])
In [10]: df
Out[10]:
A B C D E
0 1 1.2 2.2 3 1.000000
1 2 1.3 3.3 6 1.414214
#Also, vectorized string operations are easy to use.
In [11]: df = pd.DataFrame(data={"A":[1,2], "B":[1.2,1.3], "Z":["a","b"]})
In [12]: df
Out[12]:
A B Z
0 1 1.2 a
1 2 1.3 b
In [13]: df["F"] = df.Z.str.upper()
In [14]: df
Out[14]:
A B Z F
0 1 1.2 a A
1 2 1.3 b B
# Import pandas as pd
import pandas as pd
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")
# Print out cars
print (cars)
# Fix import by including index_col
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out cars
print(cars)
# Print out country column as Pandas Series
print (cars['country'])
# Print out country column as Pandas DataFrame
print (cars[['country']])
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out observation for Japan
print(cars.loc['JAP'])
# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])
# Print out drives_right value of Morocco
print (cars.loc['MOR', 'drives_right'])
# Print sub-DataFrame
print (cars.loc[['RU', 'MOR'], [ 'country', 'drives_right' ]])
# We need both the row and the column to extract a single element
# Just like an actual table, we need x,y
# ie
bricks.loc["CH","Capital"]
#or
bricks["Capital"].loc["CH"]
#or
bricks.loc["CH"]["Capital"]
#
# To select a row we require the row label
# ie
brics.loc["RowLabel"]
# Way to call on all data in a given column
bricks["columnlabel"]
#Use the rename method to rename columns. It returns a new DataFrame with the data copied over, leaving the original unchanged.
In [9]: df2 = df.rename(columns={'int_col' : 'some_other_name'})
In [10]: df2
Out[10]:
float_col some_other_name str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
#Set the inplace = True flag in case you want to modify the existing DataFrame (note: with inplace=True current pandas returns None; the Out below is from an older version).
In [11]: df2.rename(columns={'some_other_name' : 'int_col'}, inplace = True)
Out[11]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
#Called as tablename["columnlabel"]
a = brics["column"]
print(a)
# Multiple column subsetting using .loc (.ix is removed from modern pandas)
In [6]: df.loc[:,['float_col','int_col']]
Out[6]:
float_col int_col
0 0.1 1
1 0.2 2
2 0.2 6
3 10.1 8
4 NaN -1
# Multiple column subsetting using [[]]
In [9]: df[['float_col','int_col']]
Out[9]:
float_col int_col
0 0.1 1
1 0.2 2
2 0.2 6
3 10.1 8
4 NaN -1
#Plot
#I was thoroughly surprised by the plotting capabilities of the pandas library.
#There are several plotting methods available. I am highlighting a couple of
#simple plots that I use the most. Let’s start with a simple data frame to plot.
In [3]: plot_df = DataFrame(np.random.randn(1000,2),columns=['x','y'])
In [4]: plot_df['y'] = plot_df['y'].map(lambda x : x + 1)
#A simple plot command goes a long way.
In [5]: plot_df.plot()
Out[5]: <matplotlib.axes.AxesSubplot at 0x10e6cad10>
#Histograms
#I really enjoy histograms to get a quick idea about the distribution of the data.
In [6]: plot_df.hist()
Out[6]: array([[Axes(0.125,0.1;0.336957x0.8), Axes(0.563043,0.1;0.336957x0.8)]], dtype=object)
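Outside IPython the plotting snippets need explicit imports and a show() call; a minimal runnable sketch:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plot_df = pd.DataFrame(np.random.randn(1000, 2), columns=['x', 'y'])
plot_df['y'] = plot_df['y'] + 1   # same shift as the map() above
plot_df.plot()    # line plot of both columns
plot_df.hist()    # one histogram per column
plt.show()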
#New Columns = f(Existing Columns)
#Generating new columns from existing columns in a data frame is an integral part of my workflow. This was one of the hardest parts for me to figure out. I hope these examples will save time and effort for other people.
#I will try to illustrate it in a piecemeal manner: multiple columns as a function of a single column, a single column as a function of multiple columns, and finally multiple columns as a function of multiple columns.
#multiple columns as a function of a single column
#I often have to generate multiple columns of a DataFrame as a function of a single column. It comes in handy especially when methods return tuples. Related Stack Overflow question
In [43]: df4 = df.copy()
In [44]: def two_three_strings(x):
....: return x*2, x*3
....:
In [45]: df4['twice'],df4['thrice'] = zip(*df4['int_col'].map(two_three_strings))
In [46]: df4
Out[46]:
float_col int_col str_col twice thrice
0 0.1 1 a 2 3
1 0.2 2 b 4 6
2 0.2 6 None 12 18
3 10.1 8 c 16 24
4 NaN -1 a -2 -3
#single column as a function of multiple columns
#It’s sometimes useful to generate a single column of a DataFrame as a function of multiple existing columns. Related Stack Overflow question
In [47]: df5 = df.copy()
In [48]: def sum_two_cols(series):
....: return series['int_col'] + series['float_col']
....:
In [49]: df5['sum_col'] = df5.apply(sum_two_cols,axis=1)
In [50]: df5
Out[50]:
float_col int_col str_col sum_col
0 0.1 1 a 1.1
1 0.2 2 b 2.2
2 0.2 6 None 6.2
3 10.1 8 c 18.1
4 NaN -1 a NaN
#multiple columns as a function of multiple columns
#Finally, a way to generate a new DataFrame with multiple columns based on multiple columns in an existing DataFrame. Related Stack Overflow question
In [51]: import math
In [52]: def int_float_squares(series):
....: return pd.Series({'int_sq' : series['int_col']**2, 'flt_sq' : series['float_col']**2})
....:
In [53]: df.apply(int_float_squares, axis = 1)
Out[53]:
flt_sq int_sq
0 0.01 1
1 0.04 4
2 0.04 36
3 102.01 64
4 NaN 1
#Merge and Join
#Pandas supports database-like joins which makes it easy to link data frames.
#I will use the simple example to highlight the joins using the merge command.
In [57]: df
Out[57]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
In [58]: other = DataFrame({'str_col' : ['a','b'], 'some_val' : [1, 2]})
In [59]: other
Out[59]:
some_val str_col
0 1 a
1 2 b
#The inner, outer, left and right joins are shown below. The data frames are joined using the str_col keys.
In [60]: pd.merge(df,other,on='str_col',how='inner')
Out[60]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
In [61]: pd.merge(df,other,on='str_col',how='outer')
Out[61]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
3 0.2 6 None NaN
4 10.1 8 c NaN
In [62]: pd.merge(df,other,on='str_col',how='left')
Out[62]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
3 0.2 6 None NaN
4 10.1 8 c NaN
In [63]: pd.merge(df,other,on='str_col',how='right')
Out[63]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
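The same joins are also available as a DataFrame method:
df.merge(other, on='str_col', how='left')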
#Forget writing for loops while using pandas. One can do beautiful vectorized computation by applying function over rows and columns using the map, apply and applymap methods.
#map
#The map operation operates over each element of a Series
In [18]: df['str_col'].dropna().map(lambda x : 'map_' + x)
Out[18]:
0 map_a
1 map_b
3 map_c
4 map_a
Name: str_col
#apply
#The apply method is pretty flexible: as the name suggests, it applies a function along any axis of the DataFrame. The example shows applying np.sum over the columns. (Thanks to Mindey in the comments for suggesting np.sum instead of np.sqrt in the example)
In [19]: df.loc[:,['int_col','float_col']].apply(np.sum)
Out[19]:
int_col 16.0
float_col 10.6
dtype: float64
#apply per column (default) and per row (axis=1)
b = df[['int_col','float_col']].apply(sum)           # column sums (the builtin sum propagates NaN)
c = df[['int_col','float_col']].apply(sum, axis=1)   # row sums
#applymap
#The applymap operation can be used to apply the function to each element of the DataFrame.
In [39]: def some_fn(x):
....: if type(x) is str:
....: return 'applymap_' + x
....: elif x:
....: return 100 * x
....: else:
....: return
....:
In [40]: df.applymap(some_fn)
Out[40]:
float_col int_col str_col
0 10 100 applymap_a
1 20 200 applymap_b
2 20 600 None
3 1010 800 applymap_c
4 NaN -100 applymap_a
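Note: on pandas 2.1+ applymap is deprecated in favor of the element-wise DataFrame.map:
df.map(some_fn)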
# For when we want to manually declare a dataframe
import pandas as pd
df = pd.DataFrame({'int_col' : [1,2,6,8,-1], 'float_col' : [0.1, 0.2,0.2,10.1,None], 'str_col' : ['a','b',None,'c','a']})
#The dropna method can be used to drop rows or columns with missing data (NaN). By default, it drops all rows with any missing entry.
In [12]: df2
Out[12]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
In [13]: df2.dropna()
Out[13]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
3 10.1 8 c
#Fill missing values
#The fillna method on the other hand can be used to fill missing data (NaN). The example below shows a simple replacement using the mean of the available values.
In [14]: df3 = df.copy()
In [15]: mean = df3['float_col'].mean()
In [16]: df3
Out[16]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
In [17]: df3['float_col'].fillna(mean)
Out[17]:
0 0.10
1 0.20
2 0.20
3 10.10
4 2.65
Name: float_col
#Stats
#Pandas provides nifty methods to understand your data. I am highlighting the describe,
#covariance, and correlation methods that I use to quickly make sense of my data.
#describe
#The describe method provides quick stats on all suitable columns.
In [54]: df.describe()
Out[54]:
float_col int_col
count 4.00000 5.000000
mean 2.65000 3.200000
std 4.96689 3.701351
min 0.10000 -1.000000
25% 0.17500 1.000000
50% 0.20000 2.000000
75% 2.67500 6.000000
max 10.10000 8.000000
#covariance
#The cov method provides the covariance between suitable columns.
In [55]: df.cov()
Out[55]:
float_col int_col
float_col 24.670000 12.483333
int_col 12.483333 13.700000
#correlation
#The corr method provides the correlation between suitable columns.
In [56]: df.corr()
Out[56]:
float_col int_col
float_col 1.000000 0.760678
int_col 0.760678 1.000000
#One can index using boolean indexing
In [7]: df[df['float_col'] > 0.15]
Out[7]:
float_col int_col str_col
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
In [8]: df[df['float_col'] == 0.1]
Out[8]:
float_col int_col str_col
0 0.1 1 a
One can combine multiple boolean conditions using | (or), & (and), and ~ (not), grouping them with parentheses (suggestion by Roby Levy in the comments).
In [10]: df[(df['float_col'] > 0.1) & (df['int_col']>2)]
Out[10]:
float_col int_col str_col
2 0.2 6 None
3 10.1 8 c
In [11]: df[(df['float_col'] > 0.1) | (df['int_col']>2)]
Out[11]:
float_col int_col str_col
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
In [12]: df[~(df['float_col'] > 0.1)]
Out[12]:
float_col int_col str_col
0 0.1 1 a
4 NaN -1 a
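The same selections can also be written with DataFrame.query:
df.query('float_col > 0.1 and int_col > 2')
df.query('float_col > 0.1 or int_col > 2')
df.query('not (float_col > 0.1)')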
###########################################
import pandas as pd
df = pd.DataFrame({'int_col' : [1,2,6,8,-1], 'float_col' : [0.1, 0.2,0.2,10.1,None], 'str_col' : ['a','b',None,'c','a']})
print (df)
print (df[df['int_col'] > 3])
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")
# To extract a column and construct a new object from it
b = brics["Columnname"]    # single brackets return a Series
b = brics[["Columnname"]]  # double brackets return a one-column DataFrame
# Import pandas as pd
import pandas as pd
GroupBy
#The groupby method lets you perform SQL-like grouping operations. The example below shows a
#grouping operation performed with str_col column entries as keys. It is used to calculate
#the mean of the float_col for each key. For more details, please refer to the split-apply-combine description on the pandas website.
In [41]: grouped = df['float_col'].groupby(df['str_col'])
In [42]: grouped.mean()
Out[42]:
str_col
a 0.1
b 0.2
c 10.1
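The same grouping in the more common column-name form:
df.groupby('str_col')['float_col'].mean()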
# Construct a new column out of two others
# ie
brics["density"] = brics["pop"]/brics["area"] * 10
# Assign a new column from a list of elements: table["newcolumnname"] = [...]
brics["newcolname"] = [ 1, 2, 3, 4, 5 ]
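The list length must match the number of rows (5 here); to give every row the same value, assign a scalar instead:
brics["newcolname"] = 0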