I have 400 million tweets (actually I think it's closer to 450 million, but never mind), in the form:
T "timestamp"
U "username"
W "actual tweet"
I want to write them to a file initially in the form "username \t tweet" and then load them into a DB. The problem is that before loading into a DB, there are a few things I need to do:
1. Preprocess the tweet to remove RT@[names] and URLs.
2. Take the username out of "http://twitter.com/username".
I am using Python, and this is the code. Please let me know how this can be made faster :)
'''The aim is to take all the tweets of a user and store them in a table.
Do this for all the users, and then let's see what we can do with it.
The idea is to get enough information about a user to profile them better.
So, let's get started.
'''
import sys, os, itertools, re

regRT = 'RT'
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')

def regexSub(line):
    line = re.sub(regRT, '', line)
    line = re.sub(regAt, '', line)
    line = line.lstrip(' ')
    line = re.sub(regHttp, '', line)
    return line

def userName(line):
    return line.split('http://twitter.com/')[1]

data = open(sys.argv[1], 'r')
processed = open(sys.argv[2], 'w')

for line1, line2, line3 in itertools.izip_longest(*[data] * 3):
    line1 = line1.split('\t')[1]
    line2 = line2.split('\t')[1]
    line3 = line3.split('\t')[1]
    try:
        tweet = regexSub(line3)
        user = userName(line2)
    except IndexError:
        print 'Line2 is ', line2
        print 'Line3 is ', line3
        continue  # skip the malformed record instead of writing stale values
    processed.write(user.strip('\n') + '\t' + tweet)
I ran the code in the following manner:
python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt
This is the output I get (warning: it's pretty big, and I did not filter out the small values, thinking they may also hold some significance):
Sat Jan 7 03:28:51 2012 profile_dump
3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
528840744 166.402 0.000 166.402 0.000 {method 'split' of 'str' objects}
396630560 81.300 0.000 81.300 0.000 {method 'get' of 'dict' objects}
396630560 326.349 0.000 439.737 0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558 255.662 0.000 1297.705 0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558 602.307 0.000 602.307 0.000 {built-in method sub}
264420442 32.087 0.000 32.087 0.000 {isinstance}
132210186 34.700 0.000 34.700 0.000 {method 'lstrip' of 'str' objects}
132210186 27.296 0.000 27.296 0.000 {method 'strip' of 'str' objects}
132210186 181.287 0.000 1513.691 0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186 79.950 0.000 79.950 0.000 {method 'write' of 'file' objects}
132210186 55.900 0.000 113.960 0.000 TwitterScripts/Preprocessing.py:10(userName)
313/304 0.000 0.000 0.000 0.000 {len}
I removed the entries with really low call counts (like 1, 3 and so on).
Please tell me what other changes can be made. Thanks!
If you want to remove every occurrence of the retweet label in a post, just remove count=1 from the code. It is necessary to use the mask 'RT @' because 'RT' may occur in the tweet body. Likewise, re.compile('\#') removes all hashtags from the tweet.
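The code this refers to isn't shown here; a minimal sketch of the idea, with an illustrative 'RT @' pattern and sample tweet:

```python
import re

# 'RT @' anchors on the retweet label; plain 'RT' could appear in the body.
rt_pattern = re.compile(r'RT @\w+:?\s*')
hashtag_pattern = re.compile(r'\#')

tweet = 'RT @alice: START your day right #morning'
# count=1 strips only the leading retweet label; drop count to strip them all.
cleaned = rt_pattern.sub('', tweet, count=1)
cleaned = hashtag_pattern.sub('', cleaned)
```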
This is what multiprocessing is for.
You have a pipeline that can be broken into a large number of small steps. Each step is a Process that gets an item from one pipe, does a small transformation, and puts an intermediate result onto the next pipe.
You'll have a Process which reads the raw file three lines at a time and puts the three lines into a Pipe. That's all.
You'll have a Process
which gets a (T,U,W) triple from the pipe, cleans up the user line, and puts it into the next pipe.
Etc., etc.
Don't build too many steps to start with. Read - transform - Write is a good beginning to be sure you understand the multiprocessing module. After that, it's an empirical study to find the optimum mix of processing steps.
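A minimal three-stage Read - transform - Write sketch of the above (the stage functions, queue wiring, and in-memory sample lines are all illustrative; multiprocessing.dummy, the thread-backed twin of the multiprocessing API, is used here only so the sketch runs self-contained, and you'd swap the import for `from multiprocessing import Process, Queue` to get real processes):

```python
from multiprocessing.dummy import Process, Queue

def reader(lines, out_q):
    # Stage 1: group the raw stream into (T, U, W) triples. That's all.
    for triple in zip(lines[0::3], lines[1::3], lines[2::3]):
        out_q.put(triple)
    out_q.put(None)  # sentinel: no more work

def transformer(in_q, out_q):
    # Stage 2: one small transformation -- pull the username out of the U line.
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        t, u, w = item
        user = u.split('http://twitter.com/')[1]
        out_q.put((user, w))

def writer(in_q, done_q):
    # Stage 3: a real run would write "user\ttweet" lines to a file;
    # here the results are handed back so they can be inspected.
    results = []
    while True:
        item = in_q.get()
        if item is None:
            break
        results.append(item)
    done_q.put(results)

lines = ['T 2012-01-07', 'U http://twitter.com/alice', 'W hello world',
         'T 2012-01-07', 'U http://twitter.com/bob', 'W good morning']
q1, q2, done = Queue(), Queue(), Queue()
stages = [Process(target=reader, args=(lines, q1)),
          Process(target=transformer, args=(q1, q2)),
          Process(target=writer, args=(q2, done))]
for p in stages:
    p.start()
collected = done.get()
for p in stages:
    p.join()
```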
When you fire this thing up, it will spawn a number of communicating sequential processes that will consume all of your CPU resources but process the file relatively quickly.
Generally, more processes working concurrently is faster. You eventually reach a limit because of OS overhead and memory limitations.
Until you run it through a profiler, it is difficult to know what needs to be changed. However, I would suggest that the most likely slowdowns occur where you are creating and running the regular expressions.
Since your file follows a specific format, you may see significant speed increases by using a lex+yacc combo. If you use python lex+yacc, you won't see as much of a speed increase, but you won't need to muck about with c code.
If this seems like overkill, try compiling the regular expressions before you start the loop. You can also have chunks of the file processed by independent worker threads/processes.
Again though, profiling will reveal what actually is causing the bottleneck. Find that out first, then see if these options will solve the problem.
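On the compile-before-the-loop point: even with a precompiled pattern, module-level re.sub() re-enters the dispatch shown as re.py _compile in the profile above on every call; calling .sub() on the pattern object skips it. A sketch (the simplified @-mention pattern here is illustrative, not the question's exact regex):

```python
import re

# Compiled once, outside the loop.
regAt = re.compile(r'@\w+')

def clean(line):
    # pattern.sub() avoids the per-call re.sub()/_compile dispatch entirely.
    return regAt.sub('', line)

print(clean('@bob thanks for the link'))
```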
str.lstrip is probably not doing what you were expecting:
>>> 'http://twitter.com/twitty'.lstrip('http://twitter.com/')
'y'
from the docs:
S.lstrip([chars]) -> string or unicode
Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
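In other words, lstrip treats its argument as a set of characters, not a prefix. To strip an actual prefix, one option is to check for it and slice it off (a small sketch):

```python
url = 'http://twitter.com/twitty'
prefix = 'http://twitter.com/'

# lstrip removes leading characters drawn from the set, so 'twitt' vanishes too:
assert url.lstrip(prefix) == 'y'

# To remove an actual prefix, test for it and slice:
if url.startswith(prefix):
    username = url[len(prefix):]
```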