I have 400 million tweets (actually I think it's closer to 450 million, but never mind), in the form:
T "timestamp"
U "username"
W "actual tweet"
I want to write them to a file initially in the form "username \t tweet" and then load them into a DB. The problem is that before loading into a DB, there are a few things I need to do:
1. Preprocess the tweet to remove RT@[names] and URLs.
2. Take the username out of "http://twitter.com/username".
I am using Python, and this is the code. Please let me know how this can be made faster :)
'''The aim is to take all the tweets of a user and store them in a table.
Do this for all the users, and then let's see what we can do with it.
The idea is to get enough information about a user to profile them better.
So, let's get started.
'''
import sys, os, itertools, re

regRT = 'RT'
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')

def regexSub(line):
    line = re.sub(regRT, '', line)
    line = re.sub(regAt, '', line)
    line = line.lstrip(' ')
    line = re.sub(regHttp, '', line)
    return line

def userName(line):
    return line.split('http://twitter.com/')[1]

data = open(sys.argv[1], 'r')
processed = open(sys.argv[2], 'w')

for line1, line2, line3 in itertools.izip_longest(*[data] * 3):
    line1 = line1.split('\t')[1]
    line2 = line2.split('\t')[1]
    line3 = line3.split('\t')[1]
    try:
        tweet = regexSub(line3)
        user = userName(line2)
    except IndexError:
        print 'Line2 is ', line2
        print 'Line3 is ', line3
        continue  # skip the malformed record instead of writing stale values
    processed.write(user.strip('\n') + '\t' + tweet)
I ran the code in the following manner:
python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt
This is the output I get (warning: it's pretty big, and I did not filter out the small values, thinking they may also hold some significance):
Sat Jan 7 03:28:51 2012 profile_dump
3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
528840744 166.402 0.000 166.402 0.000 {method 'split' of 'str' objects}
396630560 81.300 0.000 81.300 0.000 {method 'get' of 'dict' objects}
396630560 326.349 0.000 439.737 0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558 255.662 0.000 1297.705 0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558 602.307 0.000 602.307 0.000 {built-in method sub}
264420442 32.087 0.000 32.087 0.000 {isinstance}
132210186 34.700 0.000 34.700 0.000 {method 'lstrip' of 'str' objects}
132210186 27.296 0.000 27.296 0.000 {method 'strip' of 'str' objects}
132210186 181.287 0.000 1513.691 0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186 79.950 0.000 79.950 0.000 {method 'write' of 'file' objects}
132210186 55.900 0.000 113.960 0.000 TwitterScripts/Preprocessing.py:10(userName)
313/304 0.000 0.000 0.000 0.000 {len}
I removed the entries with really low call counts (like 1, 3 and so on).
Please tell me what other changes can be made. Thanks!
If you want to remove every occurrence of the retweet label in a post, just remove count=1 from the code. It is necessary to use the mask 'RT @' because 'RT' may occur in the tweet body. Likewise, re.compile('\#') removes all hashtags from the tweet.
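The code this refers to isn't shown here; a minimal sketch of the idea, with an illustrative 'RT @' pattern and sample tweet:

```python
import re

# 'RT @' anchors on the retweet label; plain 'RT' could appear in the body.
rt_pattern = re.compile(r'RT @\w+:?\s*')
hashtag_pattern = re.compile(r'\#')

tweet = 'RT @alice: START your day right #morning'
# count=1 strips only the leading retweet label; drop count to strip them all.
cleaned = rt_pattern.sub('', tweet, count=1)
cleaned = hashtag_pattern.sub('', cleaned)
```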
This is what multiprocessing is for.
You have a pipeline that can be broken into a large number of small steps. Each step is a Process that gets an item from one pipe, does a small transformation, and puts an intermediate result onto the next pipe.
You'll have a Process which reads the raw file three lines at a time and puts the three lines into a Pipe. That's all.
You'll have a Process
which gets a (T,U,W) triple from the pipe, cleans up the user line, and puts it into the next pipe.
Etc., etc.
Don't build too many steps to start with. Read - transform - Write is a good beginning to be sure you understand the multiprocessing module. After that, it's an empirical study to find the optimum mix of processing steps.
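A minimal three-stage Read - transform - Write sketch of the above (the stage functions, queue wiring, and in-memory sample lines are all illustrative; multiprocessing.dummy, the thread-backed twin of the multiprocessing API, is used here only so the sketch runs self-contained, and you'd swap the import for `from multiprocessing import Process, Queue` to get real processes):

```python
from multiprocessing.dummy import Process, Queue

def reader(lines, out_q):
    # Stage 1: group the raw stream into (T, U, W) triples. That's all.
    for triple in zip(lines[0::3], lines[1::3], lines[2::3]):
        out_q.put(triple)
    out_q.put(None)  # sentinel: no more work

def transformer(in_q, out_q):
    # Stage 2: one small transformation -- pull the username out of the U line.
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)
            break
        t, u, w = item
        user = u.split('http://twitter.com/')[1]
        out_q.put((user, w))

def writer(in_q, done_q):
    # Stage 3: a real run would write "user\ttweet" lines to a file;
    # here the results are handed back so they can be inspected.
    results = []
    while True:
        item = in_q.get()
        if item is None:
            break
        results.append(item)
    done_q.put(results)

lines = ['T 2012-01-07', 'U http://twitter.com/alice', 'W hello world',
         'T 2012-01-07', 'U http://twitter.com/bob', 'W good morning']
q1, q2, done = Queue(), Queue(), Queue()
stages = [Process(target=reader, args=(lines, q1)),
          Process(target=transformer, args=(q1, q2)),
          Process(target=writer, args=(q2, done))]
for p in stages:
    p.start()
collected = done.get()
for p in stages:
    p.join()
```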
When you fire this thing up, it will spawn a number of communicating sequential processes that will consume all of your CPU resources but process the file relatively quickly.
Generally, more processes working concurrently is faster. You eventually reach a limit because of OS overhead and memory limitations.
Until you run it through a profiler, it is difficult to know what needs to be changed. However, I would suggest that the most likely slowdowns occur where you are creating and running the regular expressions.
Since your file follows a specific format, you may see significant speed increases by using a lex+yacc combo. If you use python lex+yacc, you won't see as much of a speed increase, but you won't need to muck about with c code.
If this seems like overkill, try compiling the regular expressions before you start the loop. You can also have chunks of the file processed by independent worker threads/processes.
Again though, profiling will reveal what actually is causing the bottleneck. Find that out first, then see if these options will solve the problem.
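On the compile-before-the-loop point: even with a precompiled pattern, module-level re.sub() re-enters the dispatch shown as re.py _compile in the profile above on every call; calling .sub() on the pattern object skips it. A sketch (the simplified @-mention pattern here is illustrative, not the question's exact regex):

```python
import re

# Compiled once, outside the loop.
regAt = re.compile(r'@\w+')

def clean(line):
    # pattern.sub() avoids the per-call re.sub()/_compile dispatch entirely.
    return regAt.sub('', line)

print(clean('@bob thanks for the link'))
```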
str.lstrip is probably not doing what you were expecting:
>>> 'http://twitter.com/twitty'.lstrip('http://twitter.com/')
'y'
from the docs:
S.lstrip([chars]) -> string or unicode
Return a copy of the string S with leading whitespace removed.
If chars is given and not None, remove characters in chars instead.
If chars is unicode, S will be converted to unicode before stripping
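In other words, lstrip treats its argument as a set of characters, not a prefix. To strip an actual prefix, one option is to check for it and slice it off (a small sketch):

```python
url = 'http://twitter.com/twitty'
prefix = 'http://twitter.com/'

# lstrip removes leading characters drawn from the set, so 'twitt' vanishes too:
assert url.lstrip(prefix) == 'y'

# To remove an actual prefix, test for it and slice:
if url.startswith(prefix):
    username = url[len(prefix):]
```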