
Preprocessing 400 million tweets in Python -- faster

Tags: python, twitter

I have 400 million tweets (actually, I think it's closer to 450 million, but never mind), in the form:

T    "timestamp"
U    "username"
W    "actual tweet"

I want to write them to a file in the form "username \t tweet" and then load them into a DB. The problem is that before loading into the DB, there are two things I need to do:

1. Preprocess the tweet to remove RT@[names] and URLs
2. Take the username out of "http://twitter.com/username"

I am using Python, and this is the code. Please let me know how it can be made faster :)

'''The aim is to take all the tweets of a user and store them in a table.
   Do this for all the users, and then let's see what we can do with it.
   What you want is to get enough information about a user so that you can
   profile them better. So, let's get started.
'''
import sys, os, itertools, re

# Patterns are set up once, before the main loop
regRT = 'RT'
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')

def regexSub(line):
    # Strip retweet markers, @mentions and URLs from the tweet body
    line = re.sub(regRT, '', line)
    line = re.sub(regAt, '', line)
    line = line.lstrip(' ')
    line = re.sub(regHttp, '', line)
    return line

def userName(line):
    # The U line is "http://twitter.com/username"
    return line.split('http://twitter.com/')[1]

data = open(sys.argv[1], 'r')
processed = open(sys.argv[2], 'w')

# Records come in groups of three lines: T (timestamp), U (user URL), W (tweet)
for line1, line2, line3 in itertools.izip_longest(*[data]*3):
    line1 = line1.split('\t')[1]   # timestamp, currently unused
    line2 = line2.split('\t')[1]
    line3 = line3.split('\t')[1]

    try:
        tweet = regexSub(line3)
        user = userName(line2)
    except:
        print 'Line2 is ', line2
        print 'Line3 is', line3
        continue  # skip malformed records instead of writing stale values

    processed.write(user.strip("\n") + "\t" + tweet)

I ran the code in the following manner:

python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt

This is the output I get (warning: it's pretty big, and I did not filter out the small values, thinking they might also hold some significance):

Sat Jan  7 03:28:51 2012    profile_dump

         3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds

   Ordered by: call count

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
528840744  166.402    0.000  166.402    0.000 {method 'split' of 'str' objects}
396630560   81.300    0.000   81.300    0.000 {method 'get' of 'dict' objects}
396630560  326.349    0.000  439.737    0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558  255.662    0.000 1297.705    0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558  602.307    0.000  602.307    0.000 {built-in method sub}
264420442   32.087    0.000   32.087    0.000 {isinstance}
132210186   34.700    0.000   34.700    0.000 {method 'lstrip' of 'str' objects}
132210186   27.296    0.000   27.296    0.000 {method 'strip' of 'str' objects}
132210186  181.287    0.000 1513.691    0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186   79.950    0.000   79.950    0.000 {method 'write' of 'file' objects}
132210186   55.900    0.000  113.960    0.000 TwitterScripts/Preprocessing.py:10(userName)
  313/304    0.000    0.000    0.000    0.000 {len}

I removed the entries whose call counts were really low (like 1, 3, and so on).

Please tell me what other changes can be made. Thanks!

asked Jan 06 '12 by crazyaboutliv



3 Answers

This is what multiprocessing is for.

You have a pipeline that can be broken into a large number of small steps. Each step is a Process which gets an item from one pipe, does a small transformation, and puts an intermediate result into the next pipe.

You'll have a Process which reads the raw file three lines at a time and puts the three lines into a Pipe. That's all.

You'll have a Process which gets a (T, U, W) triple from the pipe, cleans up the user line, and puts it into the next pipe.

Etc., etc.

Don't build too many steps to start with. Read - transform - write is a good beginning, to be sure you understand the multiprocessing module. After that, it's an empirical study to find out what the optimum mix of processing steps is.

When you fire this thing up, it will spawn a number of communicating sequential processes that will consume all of your CPU resources but process the file relatively quickly.

Generally, more processes working concurrently is faster. You eventually reach a limit because of OS overheads and memory limitations.
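Here is a minimal sketch of that read - transform - write pipeline, assuming the same three-line record format as in the question. The file names, queue sizes, worker count, and simplified cleanup regexes are illustrative placeholders, not tuned values:

import multiprocessing, re

# Simplified stand-ins for the question's patterns -- adjust as needed
regAt = re.compile(r'@\w+')
regHttp = re.compile(r'http://\S+')

def clean(tweet):
    return regHttp.sub('', regAt.sub('', tweet)).strip()

def reader(path, raw_q, n_workers):
    # Read the raw file three lines at a time and feed triples to the workers
    with open(path) as f:
        while True:
            record = [f.readline() for _ in range(3)]
            if not record[0]:          # EOF
                break
            raw_q.put(record)
    for _ in range(n_workers):
        raw_q.put(None)                # one stop sentinel per worker

def worker(raw_q, out_q):
    # Turn one (T, U, W) triple into a "username\ttweet" output line
    while True:
        record = raw_q.get()
        if record is None:
            out_q.put(None)
            break
        t_line, u_line, w_line = record
        user = u_line.split('http://twitter.com/')[1].strip()
        out_q.put(user + '\t' + clean(w_line.split('\t')[1]) + '\n')

def writer(path, out_q, n_workers):
    # A single process owns the output file; run until every worker is done
    finished = 0
    with open(path, 'w') as out:
        while finished < n_workers:
            item = out_q.get()
            if item is None:
                finished += 1
            else:
                out.write(item)

if __name__ == '__main__':
    n_workers = 4
    raw_q = multiprocessing.Queue(maxsize=1000)
    out_q = multiprocessing.Queue(maxsize=1000)
    procs = [multiprocessing.Process(target=reader,
                                     args=('tweets.txt', raw_q, n_workers)),
             multiprocessing.Process(target=writer,
                                     args=('cleaned.txt', out_q, n_workers))]
    procs += [multiprocessing.Process(target=worker, args=(raw_q, out_q))
              for _ in range(n_workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

The bounded queues keep the reader from racing ahead of the workers and filling memory; start with read - transform - write like this, then split the transform step further only if measurement says it helps.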

answered by S.Lott


Until you run it through a profiler, it is difficult to know what needs to be changed. However, I would suggest that the most likely slowdowns occur where you are creating and running the regular expressions.

Since your file follows a specific format, you may see significant speed increases by using a lex+yacc combo. If you use a Python lex+yacc, you won't see as much of a speed increase, but you won't need to muck about with C code.

If this seems like overkill, try compiling the regular expressions before you start the loop, as sketched below. You can also have chunks of the file processed by independent worker threads/processes.
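For instance, here is a sketch of that suggestion applied to the question's own patterns. Calling sub on the compiled pattern objects also skips the re.sub() -> re._compile() dispatch (and its dict cache lookup) that shows up roughly 396 million times in the profile above:

import re

# Compile once, before the loop
regRT = re.compile('RT')
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')

def regexSub(line):
    # pattern.sub() goes straight to the compiled object, bypassing
    # the module-level re.sub() wrapper on every call
    line = regRT.sub('', line)
    line = regAt.sub('', line)
    line = line.lstrip(' ')
    return regHttp.sub('', line)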

Again, though, profiling will reveal what is actually causing the bottleneck. Find that out first, then see if these options will solve the problem.

answered by Spencer Rathbun


str.lstrip is probably not doing what you were expecting:

>>> 'http://twitter.com/twitty'.lstrip('http://twitter.com/')
'y'

from the docs:

S.lstrip([chars]) -> string or unicode

Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
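If the intent is to drop a known prefix, a safer sketch is to test for the prefix and slice it off by length:

prefix = 'http://twitter.com/'
url = 'http://twitter.com/twitty'
if url.startswith(prefix):
    user = url[len(prefix):]    # 'twitty' -- unaffected by which characters
                                # the username happens to share with the prefix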
answered by Lie Ryan