This is the code I have. Due to the content of the raw data to be parsed, I end up with the 'user list' and the 'tweet list' being of different lengths. When writing the lists as columns in a data frame, I get ValueError: arrays must all be same length. I realize why this happens, but I have been looking for a way to work around it, printing 0 or NaN in the right places of the shorter array. Any ideas?
import pandas
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('#raw.html'))
chunk = soup.find_all('div', class_='content')
userlist = []
tweetlist = []
for tweet in chunk:
    username = tweet.find_all(class_='username js-action-profile-name')
    for user in username:
        user2 = user.get_text()
        userlist.append(user2)
for text in chunk:
    tweets = text.find_all(class_='js-tweet-text tweet-text')
    for tweet in tweets:
        tweet2 = tweet.get_text().encode('utf-8')
        tweetlist.append('|'+tweet2)
print len(tweetlist)
print len(userlist)
#MAKE A DATAFRAME WITH THIS
data = {'tweet' : tweetlist, 'user' : userlist}
frame = pandas.DataFrame(data)
print frame
# Export dataframe to csv
frame.to_csv('#parsed.csv', index=False)
One way to create a DataFrame from a dict of uneven arrays is to wrap each array in a pandas.Series inside a list comprehension and join the Series column-wise with pandas.concat. The Series are aligned on their index, so the shorter ones are filled with NaN.
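The concat approach could look roughly like the sketch below. The sample lists are made-up stand-ins for the scraped data, and the code assumes Python 3 (it uses items() and the print function).

```python
import pandas

# Hypothetical sample data standing in for the scraped lists.
tweetlist = ["|first tweet", "|second tweet", "|third tweet"]
userlist = ["@alice", "@bob"]

d = {"tweet": tweetlist, "user": userlist}

# Build one named Series per list. concat with axis=1 aligns them on the
# default integer index and fills the gaps in the shorter Series with NaN.
frame = pandas.concat(
    [pandas.Series(values, name=key) for key, values in d.items()],
    axis=1,
)
print(frame)
```

Note that the missing value always ends up at the bottom of the shorter column, which may or may not be the "right place" for your data.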
DataFrames consist of rows, columns, and data. Every column in a DataFrame must have the same number of rows, so pandas cannot build a DataFrame directly from a dict of plain lists that differ in length.
A pandas Series is a one-dimensional labelled data structure that can hold strings, integers, and even other Python objects. It is built on top of a NumPy array and is the primary structure for one-dimensional data in pandas. In Python, a pandas Series can be created using the constructor pandas.
Fixing the error: The error can be fixed by padding the shorter list until it matches the longer one, or by trimming the longer list if its extra values are not needed. NaN, or any other placeholder that suits the remaining values, can be used as the padding.
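A minimal sketch of the padding idea, assuming Python 3. The helper pad_to_length is a hypothetical name introduced here for illustration, and the sample lists are made up.

```python
import numpy as np
import pandas

# Hypothetical sample data standing in for the scraped lists.
tweetlist = ["|first tweet", "|second tweet", "|third tweet"]
userlist = ["@alice", "@bob"]


def pad_to_length(values, length, fill=np.nan):
    """Return a copy of values extended with fill until it has length items."""
    return list(values) + [fill] * (length - len(values))


# Pad both lists to the length of the longer one, then build the frame.
longest = max(len(tweetlist), len(userlist))
data = {
    "tweet": pad_to_length(tweetlist, longest),
    "user": pad_to_length(userlist, longest),
}
frame = pandas.DataFrame(data)
print(frame)
```

Passing fill=0 instead would give the 0 placeholder mentioned in the question.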
I'm not sure that this is exactly what you want, but anyway:
d = dict(tweets=tweetlist, users=userlist)
pandas.DataFrame({k : pandas.Series(v) for k, v in d.items()})
Try this:
frame = pandas.DataFrame.from_dict(d, orient='index')
After that, you should transpose your frame with:
frame = frame.transpose()
Then you can export to csv:
frame.to_csv('#parsed.csv', index=False)
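Put together, the from_dict and transpose steps might look like the following sketch. It assumes Python 3 and uses made-up sample lists. To keep it self-contained, it calls to_csv without a path, which returns the CSV text instead of writing a file.

```python
import pandas

# Hypothetical sample data standing in for the scraped lists.
d = {
    "tweets": ["|first tweet", "|second tweet", "|third tweet"],
    "users": ["@alice", "@bob"],
}

# orient='index' turns each dict key into a row, and rows of different
# lengths are allowed: the shorter row is padded with missing values.
frame = pandas.DataFrame.from_dict(d, orient="index")

# Transposing turns the rows back into columns named after the keys.
frame = frame.transpose()
print(frame)

# With no path argument, to_csv returns the CSV as a string.
csv_text = frame.to_csv(index=False)
print(csv_text)
```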