BeautifulSoup - Scrape table data
# http://codereview.stackexchange.com/questions/60769/scrape-an-html-table-with-python
"""
Does't return the table rows, just mostly the header
2nd part shows correction
"""
import sys
from urllib2 import urlopen, URLError
from argparse import ArgumentParser
from bs4 import BeautifulSoup
def parse_arguments():
    """ Process command line arguments """
    parser = ArgumentParser(description='Grabs tables from html')
    parser.add_argument('-u', '--url', help='url to grab from',
                        required=True)
    args = parser.parse_args()
    return args

def parse_rows(rows):
    """ Get data from rows """
    results = []
    for row in rows:
        table_headers = row.find_all('th')
        if table_headers:
            results.append([headers.get_text() for headers in table_headers])
        table_data = row.find_all('td')
        if table_data:
            results.append([data.get_text() for data in table_data])
    return results

def main():
    # Get arguments
    args = parse_arguments()
    if args.url:
        url = args.url

    # Make soup
    try:
        resp = urlopen(url)
    except URLError as e:
        print 'An error occurred fetching %s \n %s' % (url, e.reason)
        return 1
    soup = BeautifulSoup(resp.read())

    # Get table
    try:
        table = soup.find('table')
    except AttributeError as e:
        print 'No tables found, exiting'
        return 1

    # Get rows
    try:
        rows = table.find_all('tr')
    except AttributeError as e:
        print 'No table rows found, exiting'
        return 1

    # Get data
    table_data = parse_rows(rows)

    # Print data
    for i in table_data:
        print '\t'.join(i)

if __name__ == '__main__':
    status = main()
    sys.exit(status)
http://codereview.stackexchange.com/questions/60769/scrape-an-html-table-with-python
In general, your code is very well written. PEP8 compliance is very good, names make sense, you have docstrings, and you guard the entry point with if __name__ == '__main__':.
Here's a few comments on what I would do differently:
# Get table
try:
    table = soup.find('table')
except AttributeError as e:
    print 'No tables found, exiting'
    return 1

# Get rows
try:
    rows = table.find_all('tr')
except AttributeError as e:
    print 'No table rows found, exiting'
    return 1
This is essentially the same code duplicated twice. The fact that you differentiate between not finding a table and not finding rows within it is nice, but ultimately useless, as a table with nothing in it might as well be no table at all.
Therefore, I propose changing it to detect a valid table, that is one with rows. This tidies up the code a bit like so:
try:
    table = soup.find('table')
    rows = table.find_all('tr')
except AttributeError as e:
    raise ValueError("No valid table found")
You'll notice I eliminated the print and the return 1. Errors should be errors; they shouldn't magically become print statements. The only exception is if this is intended for end users, in which case you will want to retain the print. Secondly, sys.exit() is unnecessary, as the script ends automatically when it reaches the end of the file. Therefore, all the return 1's are unnecessary and are extra fluff.
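To see why the explicit return-code plumbing is redundant, note that an uncaught exception already terminates the interpreter with a non-zero exit status. A small standalone Python 3 demonstration (not part of the reviewed script):

```python
import subprocess
import sys

# Run a tiny script that raises an exception without catching it.
# The interpreter converts the uncaught exception into exit status 1,
# so a raised ValueError gives the shell the same signal as 'return 1'.
script = "raise ValueError('No valid table found')"
result = subprocess.run([sys.executable, "-c", script])
print(result.returncode)
```

The traceback goes to stderr and the exit status is non-zero, which is exactly what callers and shell scripts look for.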
Lastly, let's talk about your comments. Most of them are useless.
# Make soup
Is that really the best description? You are retrieving a URL and reading the HTML. THAT should be your comment.
# Get table
# Get rows
As I said, I think these two can be wonderfully merged. Also, I'm pretty sure the names are more than sufficient to understand that you are retrieving the table and the rows of the table.
# Get data
table_data = parse_rows(rows)
Hmm... what else could table_data be? This is a comment that seems pretty useless and adds unneeded fluff.
# Print data
This one I actually think should be more descriptive, such as:
# Print table data
As indicated in the comments, it is suggested that you remove all of your comments, as they are largely plain fluff considering your naming is so well done.
One last thing to note:

if args.url:

is a completely useless check. It is guaranteed that args.url will always exist, as per how the argparse module functions. If you made the url an optional parameter, that would be a different story, but then your program would be useless.
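A quick standalone Python 3 illustration of that argparse guarantee: a required argument is either present after parsing, or argparse itself aborts with a usage error before your code ever runs.

```python
from argparse import ArgumentParser

parser = ArgumentParser(description='Grabs tables from html')
parser.add_argument('-u', '--url', help='url to grab from', required=True)

# When the argument is supplied, args.url is guaranteed to be set,
# so a later 'if args.url:' check can never be False.
args = parser.parse_args(['-u', 'http://example.com'])
print(args.url)

# When it is missing, argparse prints a usage message and exits
# (raising SystemExit) -- main() is never reached.
try:
    parser.parse_args([])
except SystemExit:
    print('argparse rejected the missing required argument')
```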
import urllib2
from bs4 import BeautifulSoup
# http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/

f = open('cricket-data.txt', 'w')
linksFile = open("linksSource.txt")
lines = list(linksFile.readlines())
for i in lines[12:108]:
    url = "http://www.gunnercricket.com/" + str(i)
    try:
        page = urllib2.urlopen(url)
    except:
        continue
    soup = BeautifulSoup(page)
    title = soup.title
    date = title.string[:4] + ','  # take first 4 characters from title
    try:
        table = soup.find('table')
        rows = table.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            text_data = []
            for td in cols:
                text = ''.join(td)
                utftext = str(text.encode('utf-8'))
                text_data.append(utftext)
            text = date + ','.join(text_data)
            f.write(text + '\n')
    except:
        pass
f.close()
http://codereview.stackexchange.com/questions/60769/scrape-an-html-table-with-python
I would prefer to see a parse_table(…) function rather than a parse_rows(…) function. That is, after all, a function that is closer to the essence of what this program accomplishes. Extracting the <tr> rows within the table is just a detail within that goal.
I don't believe that BeautifulSoup raises AttributeError the way you expect. When .find() finds no matching elements, it just returns None. When .find_all() finds no matching elements, it returns an empty list.
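A small check of that behaviour (assumes bs4 is installed; the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>no tables here</p>', 'html.parser')

# .find() returns None rather than raising when nothing matches,
# and .find_all() returns an empty list.
print(soup.find('table'))
print(soup.find_all('table'))

# The AttributeError only appears one step later, when the None
# result is used as if it were a Tag:
try:
    soup.find('table').find_all('tr')
except AttributeError as e:
    print('AttributeError:', e)
```

So the try/except in the original script works, but only by accident: it is catching the misuse of None, not a BeautifulSoup error.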
Error messages should be printed to sys.stderr to avoid contaminating sys.stdout, which should be reserved for outputting scraped data.
I recommend calling .strip() on the strings returned by .get_text(), so that leading/trailing whitespace in the HTML source of each table cell doesn't mess up the output format. Otherwise, the printout from a table like this would be a mess:
<table>
  <tr>
    <th>
      Word
    </th>
    <th>
      Definition
    </th>
  </tr>
  <!-- etc -->
</table>
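With a table like the one above, the difference is easy to see (assumes bs4 is installed):

```python
from bs4 import BeautifulSoup

html = '''<table><tr>
  <th>
    Word
  </th>
  <th>
    Definition
  </th>
</tr></table>'''

soup = BeautifulSoup(html, 'html.parser')
cells = soup.find_all('th')

raw = [c.get_text() for c in cells]            # keeps the surrounding newlines
clean = [c.get_text().strip() for c in cells]  # just the cell text
print(clean)
```

Without the .strip(), each cell carries its source-formatting newlines into the tab-joined output line.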
Consider adding some kind of quoting mechanism, so that the output can be in a useful tab-separated-values format even if the HTML source code happens to contain Tab or Newline characters.
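One way to get that quoting for free is the standard csv module's tab dialect; this is a sketch of the idea, not the only option:

```python
import csv
import io

# The excel-tab dialect quotes any field containing a tab or newline,
# so one physical output line always corresponds to one table row.
rows = [['Word', 'Definition'],
        ['tab', 'a value\twith an embedded tab'],
        ['newline', 'a value that\nspans two lines']]

out = io.StringIO()
writer = csv.writer(out, dialect='excel-tab')
writer.writerows(rows)
print(out.getvalue())
```

The awkward fields come out quoted, and csv.reader with the same dialect recovers the original cells exactly.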
In your parse_rows(), you find all <th> elements, then all <td> elements. This is problematic for two reasons:

1. <th> elements would be pulled to the beginning of the resulting list, before all <td> elements. While headers usually come before the data cells, you shouldn't make that assumption.
2. In my opinion, a <tr> that contains no <th> or <td> descendants should still produce an empty element in the result. (You skip such empty rows.)
Taking into account all of the other advice above as well, I would personally write the cell extraction function this way, using list comprehensions.
def parse_table(table):
    """ Get data from table """
    return [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]
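Fed the kind of table discussed above, the function behaves as promised (assumes bs4 is installed; the sample HTML is made up for illustration):

```python
from bs4 import BeautifulSoup

def parse_table(table):
    """ Get data from table """
    return [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]

html = '''<table>
  <tr><th> Word </th><th> Definition </th></tr>
  <tr><td>soup</td><td>parsed HTML tree</td></tr>
  <tr></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
result = parse_table(soup.find('table'))
print(result)
```

Headers and data cells stay in document order within each row, cell text is stripped, and the empty <tr> yields an empty list instead of being silently skipped.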