I am writing a Python script that checks every possible URL and logs the ones that respond to a request.
I found a post on Stack Overflow that suggested a method of generating the strings for the URLs, which works well:
for n in range(1, 4 + 1):
    for comb in product(chars, repeat=n):
        url = "http://" + ''.join(comb) + ".com"
        currentUrl = url
        checkUrl(url)
As you can imagine there are way too many URLs and it is going to take a very long time, so I am trying to find a way to save my script's progress and resume from where it left off.
My question is: how can I have the loop start from a specific place? Or does anyone have a working piece of code that does the same thing and will let me specify a starting point?
This is my script so far:
import urllib.request
from string import digits, ascii_uppercase, ascii_lowercase
from itertools import product

goodUrls = "Valid_urls.txt"
saveFile = "save.txt"
currentUrl = ''

def checkUrl(url):
    print("Trying - " + url)
    try:
        urllib.request.urlopen(url)
    except Exception as e:
        None
    else:
        log = open(goodUrls, 'a')
        log.write(url + '\n')

chars = digits + ascii_lowercase

try:
    while True:
        for n in range(1, 4 + 1):
            for comb in product(chars, repeat=n):
                url = "http://" + ''.join(comb) + ".com"
                currentUrl = url
                checkUrl(url)
except KeyboardInterrupt:
    print("Saving and Exiting")
    open(saveFile, 'w').write(currentUrl)
The return value of itertools.product is an iterator, and an iterator remembers its position. As such, all you'll have to do is:
products = product(...)
for foo in products:
    if bar(foo):
        spam(foo)
        break

# other stuff

for foo in products:
    # starts where you left off.
    ...
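To see that "starts where you left off" behaviour concretely with the same alphabet as your script, here is a minimal, self-contained sketch (repeat=2 and the "0z" stopping point are chosen only to keep the demonstration short):

from itertools import product
from string import digits, ascii_lowercase

chars = digits + ascii_lowercase
products = product(chars, repeat=2)

# First pass: stop part-way through.
for comb in products:
    if ''.join(comb) == "0z":       # arbitrary stopping point for illustration
        break

# Second pass: the iterator remembers its position,
# so the next combination out is the one after "0z".
print(''.join(next(products)))      # prints "10"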
In your case the time taken to iterate through the possibilities is pretty small, at least compared to the time it'll take to make all those network requests. You could either save all the possibilities to disk and dump a list of what's left after every run of the program, or you could just save which number you're on. Since product has deterministic output, that should do it.
try:
    with open("progress.txt") as f:
        first_up = int(f.read().strip())
except FileNotFoundError:
    first_up = -1  # no progress saved yet, so don't skip anything

try:
    for i, foo in enumerate(products):
        if i <= first_up:
            continue  # skip combinations already handled on a previous run
        # do stuff down here
except KeyboardInterrupt:
    # this is really rude to do, by the by....
    print("Saving and exiting")
    with open("progress.txt", "w") as f:
        f.write(str(i))
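Since the output of product is deterministic, you could also skip ahead with itertools.islice instead of the enumerate/continue pattern. This is only a sketch, reusing the progress.txt convention from above and assuming the saved index was the last one handled:

from itertools import islice, product
from string import digits, ascii_lowercase

chars = digits + ascii_lowercase
products = product(chars, repeat=4)

try:
    with open("progress.txt") as f:
        first_up = int(f.read().strip()) + 1  # resume after the saved index
except FileNotFoundError:
    first_up = 0

# islice silently consumes the first `first_up` combinations, then yields the rest.
for i, comb in enumerate(islice(products, first_up, None), start=first_up):
    url = "http://" + ''.join(comb) + ".com"
    # check the url here, and write str(i) back to progress.txt on interrupt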
If there's some reason you need a human-readable "progress" file, you can save the last string you tried (as you did above) and do:

for foo in itertools.dropwhile(lambda p: ''.join(p) != saved_string, products):
    # dropwhile discards combinations until it reaches the saved string,
    # then yields it and everything after it, so the saved one is retried.
    ...
Although the attempt to find all the URLs by this method is ridiculous, the general question posed is a very good one. The short answer is that you cannot pickle an iterator in a straightforward way, because the pickle mechanism can't save the iterator's internal state. However, you can pickle an object that implements both __iter__ and __next__. So if you create a class that has the desired functionality and also works as an iterator (by implementing those two methods), it can be pickled and reloaded. The reloaded object, when you make an iterator from it, will continue from where it left off.
#! python3.6
import pickle

class AllStrings:
    CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"

    def __init__(self):
        self.indices = [0]

    def __iter__(self):
        return self

    def __next__(self):
        # Build the current string, then advance the indices like an odometer.
        s = ''.join([self.CHARS[n] for n in self.indices])
        for m in range(len(self.indices)):
            self.indices[m] += 1
            if self.indices[m] < len(self.CHARS):
                break
            self.indices[m] = 0
        else:
            self.indices.append(0)
        return s

try:
    with open("bookmark.txt", "rb") as f:
        all_strings = pickle.load(f)
except IOError:
    all_strings = AllStrings()

try:
    for s in iter(all_strings):
        print(s)
except KeyboardInterrupt:
    with open("bookmark.txt", "wb") as f:
        pickle.dump(all_strings, f)
This solution also removes the limitation on the length of the string. The iterator will run forever, eventually generating all possible strings. Of course at some point the application will stop due to the increasing entropy of the universe.
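To convince yourself that a reloaded object really does continue where it left off, you can round-trip an AllStrings instance through pickle in memory. This is just a quick check using the class above; itertools.islice is only used to pull a handful of values:

import pickle
from itertools import islice

all_strings = AllStrings()
print(list(islice(all_strings, 5)))    # ['a', 'b', 'c', 'd', 'e']

# Serialize mid-iteration, then restore into a fresh object.
restored = pickle.loads(pickle.dumps(all_strings))
print(list(islice(restored, 5)))       # ['f', 'g', 'h', 'i', 'j']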