
python turn giant function into Class

Tags:

python

I've been trying to learn classes for a while now. Is there anyone who could help me understand how to use classes? Any help is much appreciated.

The code below works when the function scraper is called directly. The class Scraper is my interpretation of classes. I believe I'm supposed to call instances now, but I'm not sure exactly how or why. The script does not produce any errors, so I'm very lost. Thanks, and I hope this was an appropriate question.

import requests
import re
from bs4 import BeautifulSoup as BS 



class Scraper():
    def __init__(self):
        self = self
    
    def requests(self):
        url_var = raw_input("Please input url: ")
        url = requests.get('%s' %url_var, auth=('',''))
        return BS(url.text)

    def file_(self):
        with open("Desktop/Scripts/%s.txt"%url_var,"w+") as links:
        return links.write(self)

    def parse(self):
        links = self .find_all('a')
        for tag in links:
            link = tag.get('href',None)
            if link != None:
                return link
            else:
                print "Error in parsing html"

if __name__ == "__main__":
    Scraper()
    

###
def scraper():
    url_var = raw_input("Please input url: ")
    url = requests.get('%s' %url_var, auth=('user','pass'))
    with open('Desktop/Scripts/formgrab.txt',"w+") as links:
        soup = BS(url.text)
        links.write(soup.prettify().encode('utf8'))
        links = soup.find_all('a')
        for tag in links:
            link = tag.get('href',None)
            if link != None:
                print link
###
orphansec asked Feb 23 '15 21:02


3 Answers

You've asked a very broad question here that is probably technically off-topic for the site, but it's bugging me so much that no one can explain to you a topic so fundamental and crucial to modern programming that I'm going to answer you anyway. This moment of zen you are trying to reach eluded me for a good long while, too, so I understand.

Warning: Long-Winded Tutorial Ahead. Not for the faint of heart.

Understanding Classes

This may be a bit jarring, but since you're trying to understand classes, you're already wrong. If you think carefully about where classes are found, you might already see why. Classes are part of object-oriented programming. You see? Programming that is oriented around objects. So we shouldn't be talking about classes, we should be talking about objects. We'll get to classes later.

OK, so... Understanding Objects

You said elsewhere:

I've read several tutorials... I understand that classes are used for storing data and to layout a blueprint, I've just never had a reason to use them so it's never sunk in. I've read a lot of tutorials that said how to create 'class dog', 'class cat', 'dog.name = "scruff"', but for the life of me I can't see how it applies to my situation. Thanks anyway guys, sorry for wasting your time.

Well, I'm sorry those tutorials wasted your time. I certainly don't think you're the only one in this situation.

What those tutorials are trying to encapsulate is one part of OOP, and it's damn useful: Objects model real-world data really well. But that's just one part of the big picture, so let's back up. Before we try to understand why objects are useful, what are they? I mean, if you read some of those tutorials you'll come away with the notion that an object can be anything (meaning really, it can model anything), but since we humans tend to understand things by breaking them down and bracketing them off, this definition is so general as to be almost meaningless.

There's a really great blog post by a fellow Stack Overflow user that I think puts it wonderfully succinctly (emphasis in original):

Object-oriented programming is about objects: bundles of state and behavior. The rest is optional fluff.

State, and behavior. That's what makes an object. (I really do recommend you give the post a read in full, but I digress.)

Having Objects Just to Have Objects

As you've seen in the aforementioned tutorials, part of what people get really excited about with OOP is the ability to model real-world things with programming objects. This goes back to state and behavior. However, since we don't have any dogs and cats in our programs, it's kind of difficult to see the connection. But everything says OOP is really good and also everywhere so you should be using it, right? This leads people to adopt parts of OOP without understanding objects, and leads to wrong-headed refactoring. The title of this very question shows this kind of thinking: I have a giant function, but everyone says I should be using classes, so I should turn it into a class, right? Wrong. At least, not necessarily.

For one thing, it should be noted that object-oriented programming, while popular, is only one paradigm. In functional programming, everything is based on functions rather than objects, with some really interesting effects. Python partially supports functional programming with the inclusion of things like map and reduce, and there exist many shellscript-like Python programs that are straightforwardly imperative. My point is, in general it can produce bad outcomes if you constrain yourself to only using OOP for the sake of using OOP, and if you do this in Python (a multi-paradigm language) you are restricting yourself unnecessarily to a subset of the language's power.
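To make the functional style mentioned above a little more concrete, here is a minimal sketch that finds the longest word with map and reduce instead of explicit loops. The function names and the sample sentence are made up for illustration. Note that reduce is a builtin in Python 2 but has to be imported from functools in Python 3.

```python
from functools import reduce  # builtin in Python 2, lives in functools in Python 3


def longer(a, b):
    """Return whichever of two words is longer (the first one wins ties)."""
    return b if len(b) > len(a) else a


def longest_word(text):
    """Find the longest word in a string without an explicit loop."""
    return reduce(longer, text.split(), '')


def word_lengths(text):
    """Map each word to its length."""
    return list(map(len, text.split()))


sample = 'a watermelon is not a freakshow'  # hypothetical sample text
print(longest_word(sample))
print(word_lengths(sample))
```

Nothing here keeps track of anything between calls: each function takes input and returns a result, which is the heart of the functional approach.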

So When Should I Use Objects? And What Are Classes vs Instances?

You should use objects when it helps you conceptualize a problem better and/or helps you organize your program. Objects are not just for code reuse! You can just as easily reuse code with only functions. For a great many decades, this was the only way to reuse code. The real reason to use objects is because, when you get used to doing it, it provides a natural framework for solving tough decisions that programmers face every day: Where should this function live? What should be in control of doing this or that? When everything is broken down into objects, the answers to questions like these start becoming more obvious.

Suppose you were writing a program that would tell you the longest word in a file. Forget libraries and hip one-liners. Let's do this simply:

longest_word = ''
longest_word_len = 0
with open('some_file.txt') as f:
    for line in f:
        for word in line.split():
            word_len = len(word)
            if word_len > longest_word_len:
                longest_word_len = word_len
                longest_word = word
print('The longest word is', longest_word, 'at', longest_word_len, 'characters.')

Hey! That's pretty neat, and it works just fine. It sure would be nice if I didn't have to change the file name any time I wanted to run it though... and if I put it in a function I could throw an __init__.py in its directory and import it!

def longest_word_in_file(filename):
    """Print a message about the longest word in the given file"""
    longest_word = ''
    longest_word_len = 0
    with open(filename) as f:
        for line in f:
            for word in line.split():
                word_len = len(word)
                if word_len > longest_word_len:
                    longest_word_len = word_len
                    longest_word = word
    print('The longest word is', longest_word, 'at',
          longest_word_len, 'characters.')

There, with only two simple modifications, we've powerfully refactored this code into something we can import into other places to be used, and something that can give us the desired result for any file, not just the one we originally hard-coded. (Incidentally, that made it reusable, didn't it?) We've also given it a useful name, indicating what it does (hint: behavior).

Suppose we're so proud of this function we're showing all our friends what we've made, and one friend comments, "Cool! What's the longest word it's ever seen?" Hmm, good question. Right now, our little function just does its job, then packs up and calls it a day. It doesn't keep track of anything between calls. That's right: it's stateless! The statistic your friend is asking about calls for another refactoring: this time, into an object.

Consider that now we have our pre-requisites: state and behavior. We know what it does and what it needs to keep up with. But there's a third thing we need to consider: what is it? When you write a function, you're only thinking of what it does. Now we need to know what it is. Since it finds the longest word in a file, I'll say it is a LongestWordFinder. Creative, I know.

class LongestWordFinder():
    """
    Find the longest word in a file and track statistics on words found.
    """
    longest_word = ''
    longest_word_len = 0
    longest_word_ever = ''
    longest_word_ever_len = 0

    def find_in_file(self, filename):
        self.longest_word = ''
        self.longest_word_len = 0
        with open(filename) as f:
            for line in f:
                for word in line.split():
                    word_len = len(word)
                    if word_len > self.longest_word_len:
                        self.longest_word_len = word_len
                        self.longest_word = word
                    if word_len > self.longest_word_ever_len:
                        self.longest_word_ever_len = word_len
                        self.longest_word_ever = word
        print('The longest word is', self.longest_word, 'at',
              self.longest_word_len, 'characters.')
        print('The longest word I have ever seen is',
              self.longest_word_ever, 'at', self.longest_word_ever_len,
              'characters.')

Whoa... a lot changed. Well, not really. Our function now lives in a class, and has a different name. Also, we're keeping track of two things now: both the longest word in the current file, and the longest word ever scanned. The only other thing that has really changed is using self in a few places to access the variables, which now live on the object instead of just being local to the function. This is how the state persists between calls!

So here's an interesting question to ask yourself if you're still reading this: how long is ever? You know, as in the longest word we've ever seen? This will also tell us the answer to another question: exactly how do we use this thing? If you copy that class into lwf.py then open the Python interpreter in that directory:

>>> from lwf import LongestWordFinder
>>> lwf = LongestWordFinder()
>>> lwf.find_in_file('test.txt')
The longest word is watermelon at 10 characters.
The longest word I have ever seen is watermelon at 10 characters.
>>> lwf.find_in_file('test2.txt')
The longest word is freakshow at 9 characters.
The longest word I have ever seen is watermelon at 10 characters.

In order to use our class, we have to instantiate it. That is, we create an instance of it. Thus, "the longest word it has ever seen" means the longest word the instance has seen in its lifetime. Since instances are bound to variables, that's essentially the scope of the variable. When I close the Python interpreter I copied that code from, the instance is gone -- its lifetime is over -- and all the state it was keeping up with is gone with it. Also, if we make another instance:

>>> lwf2 = LongestWordFinder()
>>> lwf2.find_in_file('test2.txt')
The longest word is freakshow at 9 characters.
The longest word I have ever seen is freakshow at 9 characters.

it has different memories. It has never seen test.txt, so it doesn't know about watermelon.

>>> lwf.find_in_file('test2.txt')
The longest word is freakshow at 9 characters.
The longest word I have ever seen is watermelon at 10 characters.

The first one still does, though.
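The separate memories come from how attribute lookup works, which can be shown with a small sketch. The `Tally` class below is hypothetical and exists only for this illustration: names defined in the class body act as shared defaults, and the first assignment through self stores a separate value on that one instance.

```python
class Tally(object):
    """Hypothetical example class, used only to show where state lives."""
    count = 0  # class attribute: a default shared by every instance

    def bump(self):
        # Reading self.count falls back to the class default the first time;
        # the assignment then stores the new value on this instance only.
        self.count = self.count + 1


first = Tally()
second = Tally()
first.bump()
first.bump()

print(first.count)   # value stored on the first instance
print(second.count)  # still the class default
print(Tally.count)   # the class default itself is untouched
```

This is the same reason lwf and lwf2 above do not share a longest word: each one writes its findings onto itself.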

Persistent State

What if we want it to remember the state after we close the Python interpreter? Well, if we wanted to be really fancy, we could use a database, but we're doing things simply. Let's write it down in a file. And while we're at it, let's refactor our class so we can pass it the file.

class LongestWordFinder():
    """
    Find the longest word in a file and track statistics on words found.
    """
    def __init__(self, memory_file=None):
        self.longest_word = ''
        self.longest_word_len = 0
        if memory_file is not None:
            with open(memory_file) as memory:
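                # Split on the last colon only, in case the word contains one
                word, word_len = memory.read().strip().rsplit(':', 1)
                self.longest_word_ever = word
                # Values read from a file are strings, so convert back to int
                self.longest_word_ever_len = int(word_len)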
        else:
            self.longest_word_ever = ''
            self.longest_word_ever_len = 0

    def find_in_file(self, filename):
        self.longest_word = ''
        self.longest_word_len = 0
        with open(filename) as f:
            for line in f:
                for word in line.split():
                    word_len = len(word)
                    if word_len > self.longest_word_len:
                        self.longest_word_len = word_len
                        self.longest_word = word
                    if word_len > self.longest_word_ever_len:
                        self.longest_word_ever_len = word_len
                        self.longest_word_ever = word
        print('The longest word is', self.longest_word, 'at',
              self.longest_word_len, 'characters.')
        print('The longest word I have ever seen is',
              self.longest_word_ever, 'at', self.longest_word_ever_len,
              'characters.')

    def save(self, memory_file):
        with open(memory_file, 'w') as memory:
            memory.write('%s:%s' % (self.longest_word_ever,
                                    self.longest_word_ever_len))

Two things have happened here. One, we have a new method, save, which writes in a file so it can remember the longest word it's ever seen. Two, we have an __init__ method now setting up our state for us, including reading from that file if we give it one. You have probably noticed by now that __something__ means it's a "magic" thing Python does that it allows us to override. In __init__'s case, it is the function that sets up your instances when you create them. Now, supposing we had already run this a time or two, called save, then exited our interpreter, when we fire it back up we can do this:

>>> lwf = LongestWordFinder('memory.txt')
>>> lwf.longest_word_ever
'watermelon'

and we're back where we started.
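As a rough sketch of that round trip, the snippet below saves a remembered word to a file and loads it back into a fresh object. To stay short it uses a stripped-down, hypothetical `WordMemory` class rather than the full LongestWordFinder, and a temporary file instead of memory.txt. The length is converted with int() on the way back in, because everything read from a file is a string.

```python
import os
import tempfile


class WordMemory(object):
    """Hypothetical, stripped-down stand-in for LongestWordFinder's memory."""

    def __init__(self, memory_file=None):
        self.word = ''
        self.word_len = 0
        if memory_file is not None:
            with open(memory_file) as memory:
                # Split on the last colon only, in case the word contains one.
                word, word_len = memory.read().strip().rsplit(':', 1)
            self.word = word
            self.word_len = int(word_len)  # file contents are text, not numbers

    def remember(self, word):
        """Keep the word only if it is longer than what we already have."""
        if len(word) > self.word_len:
            self.word = word
            self.word_len = len(word)

    def save(self, memory_file):
        with open(memory_file, 'w') as memory:
            memory.write('%s:%s' % (self.word, self.word_len))


# Use a throwaway file so the example cleans up after itself.
handle, path = tempfile.mkstemp(suffix='.txt')
os.close(handle)
try:
    before = WordMemory()
    before.remember('freakshow')
    before.remember('watermelon')
    before.save(path)

    after = WordMemory(path)  # a brand new object, restored from the file
    print(after.word, after.word_len)
finally:
    os.remove(path)
```

The object called after never saw either word directly; everything it knows came from the file that the earlier object wrote.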

It is too much... I will sum up.

As you can see, there are good reasons to want to refactor a function into a class. There are also bad ones. You may also have noticed that it is possible to explain the benefits of objects without ever once mentioning dogs, cats, constructors, encapsulation, or even inheritance!* Objects are great because they give you a way to keep track of state, while providing an intuitive way to model your program, and easily allow you to pass around code that you already got working without ending up with functions with parameter lists a mile long. Hopefully this helps you grok why objects are useful, and when and why you might want to turn a "giant function" into one. Happy coding. :)

* even though those are all wonderful things to have around

Two-Bit Alchemist answered Nov 15 '22 08:11


When you make a class, you usually do so because you want to reuse code in some way. This can be done in at least two ways. The first is that you can have different instances of the same class, each storing its own data based on how it was created. The other way to reuse a class is through inheritance. Using inheritance, you can define a base class that defines the general behavior and have subclasses refine that behavior to do something specific. For example:

import requests
import re
from bs4 import BeautifulSoup as BS

class Scraper(object):
    def __init__(self, url, output_dir):
        # store url string on instance
        self.url = url
        self.output_dir = output_dir

    def make_request(self):
        # make a request to get the url and save the response on the instance
        self.response = requests.get(self.url, auth=('',''))
        # name the parser explicitly so BeautifulSoup does not have to guess
        self.soup = BS(self.response.text, 'html.parser')

    def output(self):
        with open("%s" % self.output_dir, "w") as output_file:
            for element in self.get_elements():
                output_file.write(element + '\n')

    def get_elements(self):
        raise NotImplementedError


class HrefScraper(Scraper):
    def get_elements(self):
        elements = []

        links = self.soup.find_all('a')
        for tag in links:
            link = tag.get('href',None)
            if link is not None:
                elements.append(link)
            else:
                print("Error in parsing html")
        return elements

class ImageScraper(Scraper):
    def get_elements(self):
        elements = []

        links = self.soup.find_all('img')
        for tag in links:
            link = tag.get('src',None)
            if link is not None:
                elements.append(link)
            else:
                print("Error in parsing html")
        return elements


if __name__ == "__main__":
    amazon_href_scraper = HrefScraper('http://www.amazon.com', 'amazon_href.txt')
    amazon_href_scraper.make_request()
    amazon_href_scraper.output()

    google_href_scraper = HrefScraper('http://www.google.com', 'google_href.txt')
    google_href_scraper.make_request()
    google_href_scraper.output()

    google_image_scraper = ImageScraper('http://www.google.com', 'google_image.txt')
    google_image_scraper.make_request()
    google_image_scraper.output()

In the above code I defined a base class called Scraper, which inherits from object. This is good practice in Python 2, where it makes Scraper a new-style class; in Python 3 every class inherits from object automatically.

This scraper outlines a basic recipe for storing a URL and an output file. It then provides methods for making the request and outputting the results to a file. What you'll notice is that the method get_elements is not implemented: it just raises NotImplementedError. Python will still let you instantiate Scraper, but calling output on that instance fails, so the class is only useful as a base for subclasses.
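If you would rather have Python refuse to create an instance of an incomplete base class at all, one option is the standard library's abc module. This is an alternative to raising NotImplementedError, not what the code above does. The sketch below uses hypothetical names, `ElementSource` and `WordSource`, and works on plain strings so it runs without any network access. It is written for Python 3.

```python
from abc import ABC, abstractmethod


class ElementSource(ABC):
    """Hypothetical base class: knows how to output, not what to collect."""

    def __init__(self, text):
        self.text = text

    def output(self):
        # Shared behavior that relies on a method subclasses must supply.
        return '\n'.join(self.get_elements())

    @abstractmethod
    def get_elements(self):
        """Return a list of strings; subclasses must implement this."""


class WordSource(ElementSource):
    def get_elements(self):
        return self.text.split()


try:
    ElementSource('some text')
except TypeError as error:
    # Instantiating a class with unimplemented abstract methods fails up front.
    print('Refused:', error)

print(WordSource('two words').output())
```

The trade-off is that the mistake surfaces when the object is created rather than later, when the missing method is finally called.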

However, there is another class called HrefScraper, which inherits from Scraper and does define a get_elements method. This method is an href-specific version that collects only the href attributes from a page.

Additionally, I provided an image scraper which performs similarly but will scrape image sources.

In the main block you can see two examples of code reuse. The first is when we create an HrefScraper for Amazon and an HrefScraper for Google. Once the instances are created we don't need to pass in any additional parameters, because they are stored on the instance.

The second example of code reuse is how we can create an ImageScraper or HrefScraper by simply implementing a get_elements method. The code we are 'reusing' lives in the Scraper base class.

Mike Siconolfi answered Nov 15 '22 09:11


Looks like you can benefit from learning about classes. A class is basically a blueprint for reusable code. There are a lot of really good resources out there that can help you understand classes.

Python docs https://docs.python.org/3.4/tutorial/classes.html

Python classes tutorials
http://www.tutorialspoint.com/python/python_classes_objects.htm
http://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial/Classes

Alex answered Nov 15 '22 07:11