tkue
6/10/2017 - 2:54 AM

BeautifulSoup - Scrape table data
https://gist.github.com/hybridjosto/4573849

# http://codereview.stackexchange.com/questions/60769/scrape-an-html-table-with-python
"""
Does't return the table rows, just mostly the header 
2nd part shows correction 
"""
import sys
from urllib2 import urlopen, URLError
from argparse import ArgumentParser
from bs4 import BeautifulSoup


def parse_arguments():
    """ Process command line arguments """
    parser = ArgumentParser(description='Grabs tables from html')
    parser.add_argument('-u', '--url', help='url to grab from',
                        required=True)
    args = parser.parse_args()
    return args


def parse_rows(rows):
    """ Get data from rows """
    results = []
    for row in rows:
        table_headers = row.find_all('th')
        if table_headers:
            results.append([headers.get_text() for headers in table_headers])

        table_data = row.find_all('td')
        if table_data:
            results.append([data.get_text() for data in table_data])
    return results


def main():
    # Get arguments
    args = parse_arguments()
    if args.url:
        url = args.url

    # Make soup
    try:
        resp = urlopen(url)
    except URLError as e:
        print 'An error occurred fetching %s \n %s' % (url, e.reason)
        return 1
    soup = BeautifulSoup(resp.read())

    # Get table
    try:
        table = soup.find('table')
    except AttributeError as e:
        print 'No tables found, exiting'
        return 1

    # Get rows
    try:
        rows = table.find_all('tr')
    except AttributeError as e:
        print 'No table rows found, exiting'
        return 1

    # Get data
    table_data = parse_rows(rows)

    # Print data
    for i in table_data:
        print '\t'.join(i)


if __name__ == '__main__':
    status = main()
    sys.exit(status)

http://codereview.stackexchange.com/questions/60769/scrape-an-html-table-with-python

In general, your code is very well written. PEP8 compliance is very good, names make sense, you have docstrings, and you use if __name__ == '__main__':.

Here are a few comments on what I would do differently:

# Get table
try:
    table = soup.find('table')
except AttributeError as e:
    print 'No tables found, exiting'
    return 1

# Get rows
try:
    rows = table.find_all('tr')
except AttributeError as e:
    print 'No table rows found, exiting'
    return 1

This is essentially the same code written twice. The fact that you differentiate between not finding a table and not finding rows within it is nice, but ultimately useless, as a table with nothing in it might as well be no table at all.

Therefore, I propose changing it to detect a valid table, that is, one with rows. This tidies up the code like so:

try:
    table = soup.find('table')
    rows = table.find_all('tr')
except AttributeError as e:
    raise ValueError("No valid table found")

You'll notice I eliminated the print and the return 1. Errors should be errors; they shouldn't magically become print statements. The only exception is if this is meant for end users, in which case you will want to retain the print. Secondly, sys.exit() is unnecessary, as the script ends automatically when it reaches the end of the file. Therefore, all of the return 1 statements are unnecessary extra fluff.
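
Put together, a minimal sketch of how main() might look with those changes applied. This reuses parse_arguments() and parse_rows() as defined in the script above, and omits the URLError handling for brevity:

def main():
    args = parse_arguments()

    # Retrieve the URL and read the HTML
    resp = urlopen(args.url)
    soup = BeautifulSoup(resp.read())

    # A table with no rows might as well be no table at all
    try:
        table = soup.find('table')
        rows = table.find_all('tr')
    except AttributeError:
        raise ValueError("No valid table found")

    for row in parse_rows(rows):
        print '\t'.join(row)


if __name__ == '__main__':
    main()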

Lastly, let's talk about your comments. Most of them are useless.

# Make soup

Is that really the best description? You are retrieving a URL and reading the HTML. THAT should be your comment.

# Get table

# Get rows

As I said, I think these two can be wonderfully merged. Also, I'm pretty sure the names are more than sufficient to understand that you are retrieving the table and the rows of the table.

# Get data
table_data = parse_rows(rows)

Hmm... what else could table_data be? This is a comment that seems pretty useless and adds unneeded fluff.

# Print data

This one I actually think should be more descriptive, such as:

# Print table data

As indicated in the comments, the suggestion is to remove all of your comments, as they are largely fluff considering how well done your naming is.

One last thing to note is that

if args.url:

Is a completely useless check. It is guaranteed that args.url will always exist, because the argument is declared with required=True; argparse enforces that before main() ever runs. If you made the url an optional parameter, that would be a different story, but then your program would be useless.
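
To see why, here is the behaviour in isolation (a minimal sketch using the same parser definition as the script above):

from argparse import ArgumentParser

parser = ArgumentParser(description='Grabs tables from html')
parser.add_argument('-u', '--url', help='url to grab from', required=True)

# If -u/--url is missing, parse_args() prints a usage message to stderr
# and exits with status 2, so the next line is never reached.
args = parser.parse_args()

print args.url  # guaranteed to exist whenever we get here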


import urllib2
from bs4 import BeautifulSoup
# http://segfault.in/2010/07/parsing-html-table-in-python-with-beautifulsoup/

f = open('cricket-data.txt', 'w')
linksFile = open("linksSource.txt")
lines = list(linksFile.readlines())
for i in lines[12:108]:
    url = "http://www.gunnercricket.com/" + i.strip()  # drop the trailing newline
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        continue  # skip links that fail to load
    soup = BeautifulSoup(page)
    title = soup.title
    date = title.string[:4] + ','  # take first 4 characters from title

    try:
        table = soup.find('table')
        rows = table.findAll('tr')

        for tr in rows:
            cols = tr.findAll('td')
            text_data = []
            for td in cols:
                text = td.get_text()  # was ''.join(td), which breaks on nested tags
                utftext = text.encode('utf-8')
                text_data.append(utftext)  # EDIT
            text = date + ','.join(text_data)
            f.write(text + '\n')
    except AttributeError:
        pass  # soup.find('table') returned None: the page has no table
f.close()

http://codereview.stackexchange.com/questions/60769/scrape-an-html-table-with-python

Decomposition

I would prefer to see a parse_table(…) function rather than a parse_rows(…) function. That is, after all, a function that is closer to the essence of what this program accomplishes. Extracting the <tr> rows within the table is just a detail within that goal.

Error handling

I don't believe that BeautifulSoup raises AttributeError the way you expect. When .find() finds no matching elements, it just returns None. When .find_all() finds no matching elements, it returns an empty list.
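
That behaviour is easy to check directly (a minimal sketch on a throwaway HTML fragment):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<html><body><p>no tables here</p></body></html>')

print soup.find('table')      # None, no exception
print soup.find_all('table')  # [], an empty list, no exception

# The AttributeError only shows up one step later, when a method is
# called on the None that .find() returned:
soup.find('table').find_all('tr')  # AttributeError: 'NoneType' ...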

Error messages should be printed to sys.stderr to avoid contaminating sys.stdout, which should be reserved for the scraped data.
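
In Python 2, as this script is written, that looks like the following sketch (the url value is a placeholder; the >> form redirects print to another stream):

import sys

url = 'http://www.example.com/'  # placeholder

print >> sys.stderr, 'An error occurred fetching %s' % url  # diagnostics
print 'Word\tDefinition'  # stdout stays clean for the scraped data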

Output

I recommend calling .strip() on the strings returned by .get_text(), so that leading/trailing whitespace in the HTML source of each table cell doesn't mess up the output format. Otherwise, the printout from a table like this would be a mess:

<table>
  <tr>
    <th>
      Word
    </th>
    <th>
      Definition
    </th>
  </tr>
  <!-- etc -->
</table>
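
To make the problem concrete, here is roughly what one of those header cells yields with and without .strip() (a sketch on a minimal fragment):

from bs4 import BeautifulSoup

html = '<table><tr><th>\n      Word\n    </th></tr></table>'
cell = BeautifulSoup(html).find('th')

print repr(cell.get_text())          # u'\n      Word\n    '
print repr(cell.get_text().strip())  # u'Word'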

Consider adding some kind of quoting mechanism, so that the output can be in a useful tab-separated-values format even if the HTML source code happens to contain Tab or Newline characters.
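
The standard csv module already implements such a quoting mechanism (using it here is my suggestion, not part of the original code), and it accepts Tab as a delimiter:

import csv
import sys

table_data = [['Word', 'Definition'],
              ['tab\there', 'newline\nhere']]  # hypothetical awkward cells

writer = csv.writer(sys.stdout, delimiter='\t')
for row in table_data:
    writer.writerow(row)  # cells containing tabs or newlines get quoted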

Cell extraction

In your parse_rows(), you find all <th> elements, then all <td> elements. This is problematic for two reasons:

  1. Code duplication: both loops have essentially the same bodies.
  2. Element reordering: any <th> elements would be pulled to the beginning of the resulting list, before all <td> elements. While headers usually come before the data cells, you shouldn't make that assumption.

In my opinion, a <tr> that contains no <th> or <td> descendants should still produce an empty element in the result. (Your code skips such empty rows.)

Taking into account all of the other advice above as well, I would personally write the cell extraction function this way, using list comprehensions:

def parse_table(table):
    """ Get data from table """
    return [
        [cell.get_text().strip() for cell in row.find_all(['th', 'td'])]
        for row in table.find_all('tr')
    ]
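
Wired back into the script, the call site might then look like this (a sketch; the URL is a placeholder, and parse_table is the function just defined):

from urllib2 import urlopen
from bs4 import BeautifulSoup

resp = urlopen('http://www.example.com/table.html')  # placeholder URL
soup = BeautifulSoup(resp.read())

table = soup.find('table')
if table is None:  # .find() returns None rather than raising
    raise ValueError("No valid table found")

for row in parse_table(table):
    print '\t'.join(row)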
