Python.Modules.Pandas #python #Python #Modules #PythonModules #pandas #Pandas #datascience
PYTHON PANDAS MODULE
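The snippets below assume a small DataFrame along these lines (a sketch; the exact values are assumptions reconstructed from the printed output):
import pandas as pd
df = pd.DataFrame({"SCENARIO": ["AAA", "AAA", "AAA", "AAA", "BBB", "CCC"],
                   "STATUS":   [51, 51, 51, 53, 51, 51],
                   "TYPE":     [1, 1, 9, 228, 43, 187]})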
#if necessary convert TYPE column to string
df['TYPE'] = df['TYPE'].astype(str)
df = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
#aggregate sums over the first 2 levels
df1 = df.groupby(["SCENARIO", "STATUS"]).sum()
#add a 3rd level to the MultiIndex
df1.index = [df1.index.get_level_values(0),
df1.index.get_level_values(1),
['Total'] * len(df1)]
print (df1)
SCENARIO STATUS
AAA 51 Total 3
53 Total 1
BBB 51 Total 1
CCC 51 Total 1
Name: TYPE, dtype: int64
#join together and sort
df = pd.concat([df, df1]).sort_index(level=[0,1])
print (df)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
Total 3
53 228 1
Total 1
BBB 51 43 1
Total 1
CCC 51 187 1
Total 1
Name: TYPE, dtype: int64
#############################################################################################################
Use GroupBy.count with the column specified in [] if you don't need to count NaNs:
table = df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE'].count()
print (table)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
Name: TYPE, dtype: int64
Or use GroupBy.size; specifying a column is not necessary, but the difference is that it also counts NaNs:
table = df.groupby(["SCENARIO", "STATUS", "TYPE"]).size()
print (table)
SCENARIO STATUS TYPE
AAA 51 1 2
9 1
53 228 1
BBB 51 43 1
CCC 51 187 1
dtype: int64
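On pandas 1.1+ the same counts come from DataFrame.value_counts in one call (note it sorts by count unless sort=False, and, like count, drops NaN by default):
table = df.value_counts(subset=["SCENARIO", "STATUS", "TYPE"], sort=False)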
If you need regular columns from the MultiIndex Series:
table = (df.groupby(["SCENARIO", "STATUS", "TYPE"])['TYPE']
.count()
.reset_index(name='COUNT(TYPE)'))
print (table)
SCENARIO STATUS TYPE COUNT(TYPE)
0 AAA 51 1 2
1 AAA 51 9 1
2 AAA 53 228 1
3 BBB 51 43 1
4 CCC 51 187 1
table = (df.groupby(["SCENARIO", "STATUS", "TYPE"])
.size()
.reset_index(name='COUNT(TYPE)') )
print (table)
SCENARIO STATUS TYPE COUNT(TYPE)
0 AAA 51 1 2
1 AAA 51 9 1
2 AAA 53 228 1
3 BBB 51 43 1
4 CCC 51 187 1
Finally, if you don't need to write the first column (the index) to Excel:
table.to_excel(writer, sheet_name='Sheet1', index=False)
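Here writer is a pandas ExcelWriter instance, e.g. (a minimal sketch; 'output.xlsx' is a placeholder name):
with pd.ExcelWriter('output.xlsx') as writer:
    table.to_excel(writer, sheet_name='Sheet1', index=False)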
1. import_pandas.py - Import Pandas
2. read_csv.py - Read a csv file
3. fix_import - Fix import by specifying the index column
4. print_whole - Print whole table
5. Print column - Print a single column (or more)
6. Print Row - Print a row (or more)
7. extract_df.py - Extract dataframe from dataframe
8. extract_mult.py - Extract Multiple columns/rows
9. add_col.py - Add columns to dataframe
10. edit_row.py - Edit Row / Column
11. Column_prod.py - Column as a product of others
12. manual_dec_df.py - Manually declare dataframe
13. condition_index.py - Conditional indexing / selecting
14. rename_col.py - Rename column
15. handle_missing.py - Handle missing values
16. map_apply.py - Map / Apply to all elements
17. vector_op.py - Vectorized operation on dataframe
18. groupby.py - Perform SQL like groupings
19. newfcolumn.py - One/Multiple column(s) as a function of one/multiple columns
20. Dataframe_stats.py - Produce stats for each column in dataframe
21. MergeJoin.py - Merge Tables, various types of joins
22. plotpanda.py - Plotting with pandas
23. Removespecificcolumns - Remove specific rows/columns
24. closeexcelfile.py - How to close excel file
25. copyexcelsheet.py - How to copy excel sheet to another workbook
26. renameexcelsheet.py - How to rename excel sheet
27. fillNaNinexcel.py - How to fill NaN values in excel
28. subtotalswithgroupby.py - Subtotals with group by in excel
I believe DataFrame.fillna() will do this for you.
Link to Docs for a dataframe and for a Series.
Example:
In [7]: df
Out[7]:
0 1
0 NaN NaN
1 -0.494375 0.570994
2 NaN NaN
3 1.876360 -0.229738
4 NaN NaN
In [8]: df.fillna(0)
Out[8]:
0 1
0 0.000000 0.000000
1 -0.494375 0.570994
2 0.000000 0.000000
3 1.876360 -0.229738
4 0.000000 0.000000
To fill the NaNs in only one column, select just that column. In this case I'm using inplace=True to actually change the contents of df.
In [12]: df[1].fillna(0, inplace=True)
Out[12]:
0 0.000000
1 0.570994
2 0.000000
3 -0.229738
4 0.000000
Name: 1
In [13]: df
Out[13]:
0 1
0 NaN 0.000000
1 -0.494375 0.570994
2 NaN 0.000000
3 1.876360 -0.229738
4 NaN 0.000000
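On current pandas, fillna with inplace=True returns None (the Out above is from an older version), and calling it on a column selection can raise chained-assignment warnings; assigning the result back is the safer idiom:
df[1] = df[1].fillna(0)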
You can do this by doing the following:
import openpyxl
ss=openpyxl.load_workbook("file.xlsx")
#get the sheet by name (get_sheet_by_name is deprecated in newer openpyxl;
#ss['Sheet'] is the current idiom)
ss_sheet = ss['Sheet']
ss_sheet.title = 'Fruit'
ss.save("file.xlsx")
This works for me. Hope this helps.
Thank you both! After taking your advice and trying for 5 minutes, the following worked!
import win32com.client as win32
import os
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Add()
for f in [os.path.join(os.getcwd(), "CM1.xlsx"), os.path.join(os.getcwd(), "CM2.xlsx")]:
    w = excel.Workbooks.Open(f)
    w.Sheets(1).Copy(wb.Sheets(1))
wb.SaveAs(os.path.join(os.getcwd(), "CM.xlsx"))
excel.Application.Quit()
This will paste sheet 1 of tocopy_wb into sheet 1 of empty_wb:
empty_wb = xl.Workbooks.Open(util.getTestRessourcePath("emptyWorkbook.xlsx"))
tocopy_wb = xl.Workbooks.Open(util.getTestRessourcePath("toCopy.xls"))
tocopy_wb.Sheets(1).Cells.Copy()
empty_wb.Sheets(1).Paste(empty_wb.Sheets(1).Range("A1"))
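Here xl is assumed to be an Excel COM Application object, e.g.:
import win32com.client as win32
xl = win32.gencache.EnsureDispatch('Excel.Application')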
I think you are forgetting to close the file. Once you are done writing data to the file, you need to close the file handle with excel.close(), as mentioned in the docs.
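For example, when writing with pandas' ExcelWriter (a sketch; 'report.xlsx' is a placeholder, and the with-block form closes the file automatically):
writer = pd.ExcelWriter('report.xlsx')
df.to_excel(writer, sheet_name='Sheet1')
writer.close()   # flush and release the file handle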
You can use drop(...) to remove rows, and drop(..., axis=1) to remove columns
data = [
['', '', '', ''],
['', 1, 2, 3],
['', '', '', ''],
['', 7, 8, 9],
]
import pandas as pd
df = pd.DataFrame(data)
# drop first column - [0]
df = df.drop(0, axis=1)
# drop first and third row - [0,2]
df = df.drop([0,2])
print(df)
Before:
0 1 2 3
0
1 1 2 3
2
3 7 8 9
After:
1 2 3
1 1 2 3
3 7 8 9
#####################################################################
You could use dropna:
In [13]: df2[df1 != df2].dropna(how='all')
Out[13]:
col1 col2
0 4 6
2 5 5
Note: dropna() is the same as dropna(how='any') by default, which drops any row containing a NaN; dropna(how='all') drops a row only if all of the row's values are NaN.
Another way to get the rows of df2 that are not in df1:
>>> df = df2[~df2.isin(df1).all(1)]
>>> df
col1 col2
0 4 6
2 5 5
##################################################################################################
Use iloc like you were using ix, but apply a different slice...
df2 = df1.iloc[3:] #edited since .ix is now deprecated.
will give you a new df without the first three rows.
I think a more explicit way of doing this is to use drop.
The syntax is:
df.drop(label)
And as pointed out by @tim and @ChaimG, this can be done in-place:
df.drop(label, inplace=True)
One way of implementing this could be:
df.drop(df.index[:3], inplace=True)
And another "in place" use:
df.drop(df.head(3).index, inplace=True)
df.drop(df.index[[0,2]])
Pandas uses zero based numbering, so 0 is the first row, 1 is the second row and 2 is the third row.
#Vectorized mathematical and string operations
#One can perform vectorized calculations using simple operators and numpy functions (numpy imported as np).
In [4]: df = pd.DataFrame(data={"A":[1,2], "B":[1.2,1.3]})
In [5]: df["C"] = df["A"]+df["B"]
In [6]: df
Out[6]:
A B C
0 1 1.2 2.2
1 2 1.3 3.3
In [7]: df["D"] = df["A"]*3
In [8]: df
Out[8]:
A B C D
0 1 1.2 2.2 3
1 2 1.3 3.3 6
In [9]: df["E"] = np.sqrt(df["A"])
In [10]: df
Out[10]:
A B C D E
0 1 1.2 2.2 3 1.000000
1 2 1.3 3.3 6 1.414214
#Also, vectorized string operations are easy to use.
In [11]: df = pd.DataFrame(data={"A":[1,2], "B":[1.2,1.3], "Z":["a","b"]})
In [12]: df
Out[12]:
A B Z
0 1 1.2 a
1 2 1.3 b
In [13]: df["F"] = df.Z.str.upper()
In [14]: df
Out[14]:
A B Z F
0 1 1.2 a A
1 2 1.3 b B
# Import pandas as pd
import pandas as pd
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")
# Print out cars
print (cars)
# Fix import by including index_col
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out cars
print(cars)
# Print out country column as Pandas Series
print (cars['country'])
# Print out country column as Pandas DataFrame
print (cars[['country']])
cars = pd.read_csv('cars.csv', index_col = 0)
# Print out observation for Japan
print(cars.loc['JAP'])
# Print out observations for Australia and Egypt
print(cars.loc[['AUS', 'EG']])
# Print out drives_right value of Morocco
print (cars.loc['MOR', 'drives_right'])
# Print sub-DataFrame
print (cars.loc[['RU', 'MOR'], [ 'country', 'drives_right' ]])
# We need both the row and the column to extract a single element
# Just like an actual table, we need x,y
# ie
bricks.loc["CH","Capital"]
#or
bricks["Capital"].loc["CH"]
#or
bricks.loc["CH"]["Capital"]
#
# To select a row we require the row label
# ie
brics.loc["RowLabel"]
# Way to call on all data in a given column
bricks["columnlabel"]
#Use the rename method to rename columns. It returns a new DataFrame with the data copied over, leaving the original unchanged.
In [9]: df2 = df.rename(columns={'int_col' : 'some_other_name'})
In [10]: df2
Out[10]:
float_col some_other_name str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
#Set the inplace = True flag in case you want to modify the existing DataFrame (note: with inplace=True current pandas returns None; the Out below is from an older version).
In [11]: df2.rename(columns={'some_other_name' : 'int_col'}, inplace = True)
Out[11]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
#Called as tablename["columnlabel"]
a = brics["column"]
print(a)
# Multiple column subsetting using .loc (.ix is removed from modern pandas)
In [6]: df.loc[:,['float_col','int_col']]
Out[6]:
float_col int_col
0 0.1 1
1 0.2 2
2 0.2 6
3 10.1 8
4 NaN -1
# Multiple column subsetting using [[]]
In [9]: df[['float_col','int_col']]
Out[9]:
float_col int_col
0 0.1 1
1 0.2 2
2 0.2 6
3 10.1 8
4 NaN -1
#Plot
#I was thoroughly surprised by the plotting capabilities of the pandas library.
#There are several plotting methods available. I am highlighting a couple of
#simple plots that I use the most. Let’s start with a simple data frame to plot.
In [3]: plot_df = DataFrame(np.random.randn(1000,2),columns=['x','y'])
In [4]: plot_df['y'] = plot_df['y'].map(lambda x : x + 1)
#A simple plot command goes a long way.
In [5]: plot_df.plot()
Out[5]: <matplotlib.axes.AxesSubplot at 0x10e6cad10>
#Histograms
#I really enjoy histograms to get a quick idea about the distribution of the data.
In [6]: plot_df.hist()
Out[6]: array([[Axes(0.125,0.1;0.336957x0.8), Axes(0.563043,0.1;0.336957x0.8)]], dtype=object)
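Outside IPython the plotting snippets need explicit imports and a show() call; a minimal runnable sketch:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plot_df = pd.DataFrame(np.random.randn(1000, 2), columns=['x', 'y'])
plot_df['y'] = plot_df['y'] + 1   # same shift as the map() above
plot_df.plot()    # line plot of both columns
plot_df.hist()    # one histogram per column
plt.show()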
#New Columns = f(Existing Columns)
#Generating new columns from existing columns in a data frame is an integral part of my workflow. This was one of the hardest parts for me to figure out. I hope these examples will save time and effort for other people.
#I will try to illustrate it in a piecemeal manner: multiple columns as a function of a single column, a single column as a function of multiple columns, and finally multiple columns as a function of multiple columns.
#multiple columns as a function of a single column
#I often have to generate multiple columns of a DataFrame as a function of a single column. It comes in handy especially when methods return tuples. Related Stack Overflow question
In [43]: df4 = df.copy()
In [44]: def two_three_strings(x):
....: return x*2, x*3
....:
In [45]: df4['twice'],df4['thrice'] = zip(*df4['int_col'].map(two_three_strings))
In [46]: df4
Out[46]:
float_col int_col str_col twice thrice
0 0.1 1 a 2 3
1 0.2 2 b 4 6
2 0.2 6 None 12 18
3 10.1 8 c 16 24
4 NaN -1 a -2 -3
#single column as a function of multiple columns
#It’s sometimes useful to generate a single column of a DataFrame as a function of multiple existing columns. Related Stack Overflow question
In [47]: df5 = df.copy()
In [48]: def sum_two_cols(series):
....: return series['int_col'] + series['float_col']
....:
In [49]: df5['sum_col'] = df5.apply(sum_two_cols,axis=1)
In [50]: df5
Out[50]:
float_col int_col str_col sum_col
0 0.1 1 a 1.1
1 0.2 2 b 2.2
2 0.2 6 None 6.2
3 10.1 8 c 18.1
4 NaN -1 a NaN
#multiple columns as a function of multiple columns
#Finally, a way to generate a new DataFrame with multiple columns based on multiple columns in an existing DataFrame. Related Stack Overflow question
In [51]: import math
In [52]: def int_float_squares(series):
....: return pd.Series({'int_sq' : series['int_col']**2, 'flt_sq' : series['float_col']**2})
....:
In [53]: df.apply(int_float_squares, axis = 1)
Out[53]:
flt_sq int_sq
0 0.01 1
1 0.04 4
2 0.04 36
3 102.01 64
4 NaN 1
#Merge and Join
#Pandas supports database-like joins which makes it easy to link data frames.
#I will use the simple example to highlight the joins using the merge command.
In [57]: df
Out[57]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
In [58]: other = DataFrame({'str_col' : ['a','b'], 'some_val' : [1, 2]})
In [59]: other
Out[59]:
some_val str_col
0 1 a
1 2 b
#The inner, outer, left and right joins are shown below. The data frames are joined using the str_col keys.
In [60]: pd.merge(df,other,on='str_col',how='inner')
Out[60]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
In [61]: pd.merge(df,other,on='str_col',how='outer')
Out[61]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
3 0.2 6 None NaN
4 10.1 8 c NaN
In [62]: pd.merge(df,other,on='str_col',how='left')
Out[62]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
3 0.2 6 None NaN
4 10.1 8 c NaN
In [63]: pd.merge(df,other,on='str_col',how='right')
Out[63]:
float_col int_col str_col some_val
0 0.1 1 a 1
1 NaN -1 a 1
2 0.2 2 b 2
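The same joins are also available as a DataFrame method:
df.merge(other, on='str_col', how='left')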
#Forget writing for loops while using pandas. One can do beautiful vectorized computation by applying function over rows and columns using the map, apply and applymap methods.
#map
#The map operation operates over each element of a Series
In [18]: df['str_col'].dropna().map(lambda x : 'map_' + x)
Out[18]:
0 map_a
1 map_b
3 map_c
4 map_a
Name: str_col
#apply
#The apply method is pretty flexible: as the name suggests, it applies a function along any axis of the DataFrame. The example shows applying np.sum over the columns. (Thanks to Mindey in the comments for suggesting np.sum instead of np.sqrt in the example)
In [19]: df.loc[:,['int_col','float_col']].apply(np.sum)
Out[19]:
int_col 16.0
float_col 10.6
dtype: float64
#apply per column (default) and per row (axis=1)
b = df[['int_col','float_col']].apply(sum)           # column sums (the builtin sum propagates NaN)
c = df[['int_col','float_col']].apply(sum, axis=1)   # row sums
#applymap
#The applymap operation can be used to apply the function to each element of the DataFrame.
In [39]: def some_fn(x):
....: if type(x) is str:
....: return 'applymap_' + x
....: elif x:
....: return 100 * x
....: else:
....: return
....:
In [40]: df.applymap(some_fn)
Out[40]:
float_col int_col str_col
0 10 100 applymap_a
1 20 200 applymap_b
2 20 600 None
3 1010 800 applymap_c
4 NaN -100 applymap_a
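Note: on pandas 2.1+ applymap is deprecated in favor of the element-wise DataFrame.map:
df.map(some_fn)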
# For when we want to manually declare a dataframe
import pandas as pd
df = pd.DataFrame({'int_col' : [1,2,6,8,-1], 'float_col' : [0.1, 0.2,0.2,10.1,None], 'str_col' : ['a','b',None,'c','a']})
#The dropna method can be used to drop rows or columns with missing data (NaN). By default, it drops all rows with any missing entry.
In [12]: df2
Out[12]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
In [13]: df2.dropna()
Out[13]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
3 10.1 8 c
#Fill missing values
#The fillna method on the other hand can be used to fill missing data (NaN). The example below shows a simple replacement using the mean of the available values.
In [14]: df3 = df.copy()
In [15]: mean = df3['float_col'].mean()
In [16]: df3
Out[16]:
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
In [17]: df3['float_col'].fillna(mean)
Out[17]:
0 0.10
1 0.20
2 0.20
3 10.10
4 2.65
Name: float_col
#Stats
#Pandas provides nifty methods to understand your data. I am highlighting the describe,
#covariance, and correlation methods that I use to quickly make sense of my data.
#describe
#The describe method provides quick stats on all suitable columns.
In [54]: df.describe()
Out[54]:
float_col int_col
count 4.00000 5.000000
mean 2.65000 3.200000
std 4.96689 3.701351
min 0.10000 -1.000000
25% 0.17500 1.000000
50% 0.20000 2.000000
75% 2.67500 6.000000
max 10.10000 8.000000
#covariance
#The cov method provides the covariance between suitable columns.
In [55]: df.cov()
Out[55]:
float_col int_col
float_col 24.670000 12.483333
int_col 12.483333 13.700000
#correlation
#The corr method provides the correlation between suitable columns.
In [56]: df.corr()
Out[56]:
float_col int_col
float_col 1.000000 0.760678
int_col 0.760678 1.000000
#One can index using boolean indexing
In [7]: df[df['float_col'] > 0.15]
Out[7]:
float_col int_col str_col
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
In [8]: df[df['float_col'] == 0.1]
Out[8]:
float_col int_col str_col
0 0.1 1 a
One can combine multiple boolean conditions using | (or), & (and), and ~ (not), grouping them with parentheses (suggestion by Roby Levy in the comments).
In [10]: df[(df['float_col'] > 0.1) & (df['int_col']>2)]
Out[10]:
float_col int_col str_col
2 0.2 6 None
3 10.1 8 c
In [11]: df[(df['float_col'] > 0.1) | (df['int_col']>2)]
Out[11]:
float_col int_col str_col
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
In [12]: df[~(df['float_col'] > 0.1)]
Out[12]:
float_col int_col str_col
0 0.1 1 a
4 NaN -1 a
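The same selections can also be written with DataFrame.query:
df.query('float_col > 0.1 and int_col > 2')
df.query('float_col > 0.1 or int_col > 2')
df.query('not (float_col > 0.1)')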
###########################################
import pandas as pd
df = pd.DataFrame({'int_col' : [1,2,6,8,-1], 'float_col' : [0.1, 0.2,0.2,10.1,None], 'str_col' : ['a','b',None,'c','a']})
print (df)
print (df[df['int_col'] > 3])
# Import the cars.csv data: cars
cars = pd.read_csv("cars.csv")
# To extract a column and construct a new object from it
b = brics["Columnname"]    # single brackets return a Series
b = brics[["Columnname"]]  # double brackets return a one-column DataFrame
# Import pandas as pd
import pandas as pd
GroupBy
#The groupby method lets you perform SQL-like grouping operations. The example below shows a
#grouping operation performed with str_col column entries as keys. It is used to calculate
#the mean of the float_col for each key. For more details, please refer to the split-apply-combine description on the pandas website.
In [41]: grouped = df['float_col'].groupby(df['str_col'])
In [42]: grouped.mean()
Out[42]:
str_col
a 0.1
b 0.2
c 10.1
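The same grouping in the more common column-name form:
df.groupby('str_col')['float_col'].mean()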
# Construct a new column out of two others
# ie
brics["density"] = brics["pop"]/brics["area"] * 10
# Assign a new column from a list of elements: table["newcolumnname"] = [...]
brics["newcolname"] = [ 1, 2, 3, 4, 5 ]
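The list length must match the number of rows (5 here); to give every row the same value, assign a scalar instead:
brics["newcolname"] = 0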