Pandas with different length arrays

This is the code I have. Because of the content of the raw data being parsed, the user list and the tweet list end up with different lengths. When I write the lists as columns of a DataFrame, I get ValueError: arrays must all be same length. I understand why this happens, but I am looking for a way to work around it by putting 0 or NaN in the right places of the shorter list. Any ideas?

import pandas
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('#raw.html'))
chunk = soup.find_all('div', class_='content')

userlist = []
tweetlist = []

for tweet in chunk:
    username = tweet.find_all(class_='username js-action-profile-name')
    for user in username:
        user2 = user.get_text()
        userlist.append(user2)

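for text in chunk:
    tweets = text.find_all(class_='js-tweet-text tweet-text')
    # Nested inside the outer loop so tweets from every chunk are collected,
    # not only those from the last chunk
    for tweet in tweets:
        tweet2 = tweet.get_text().encode('utf-8')
        tweetlist.append('|'+tweet2)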

print(len(tweetlist))
print(len(userlist))

# Make a DataFrame from the two lists (raises ValueError if their lengths differ)
data = {'tweet' : tweetlist, 'user' : userlist}
frame = pandas.DataFrame(data)
print(frame)

# Export dataframe to csv
frame.to_csv('#parsed.csv', index=False)
DIGSUM asked Mar 01 '15 20:03



2 Answers

I'm not sure that this is exactly what you want, but anyway:

d = dict(tweets=tweetlist, users=userlist)
# items() works on both Python 2 and 3; iteritems() exists only on Python 2
pandas.DataFrame({k : pandas.Series(v) for k, v in d.items()})
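A self-contained sketch of this approach, written for Python 3 (hence d.items()). The sample lists are made up for illustration and only stand in for tweetlist and userlist from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped lists.
tweetlist = ["|first tweet", "|second tweet", "|third tweet"]
userlist = ["@alice", "@bob"]

d = dict(tweets=tweetlist, users=userlist)

# Wrapping each list in a Series lets pandas align the columns on the
# integer index and fill the positions missing from the shorter one with NaN.
frame = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})

print(frame.shape)                  # (3, 2)
print(frame["users"].isna().sum())  # 1
```

Note that the padding always lands at the end of the shorter column. If a user is missing somewhere in the middle of the page, the rows after that point will not line up tweet by tweet.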
Dmitriy Kuznetsov answered Sep 28 '22 10:09


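Try this, where d is a dict mapping column names to your lists, for example d = dict(tweets=tweetlist, users=userlist):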

frame = pandas.DataFrame.from_dict(d, orient='index')

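With orient='index' each list becomes a row, and the shorter rows are padded with missing values. After that, transpose the frame so the lists become columns again: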

frame = frame.transpose()

Then you can export to csv:

frame.to_csv('#parsed.csv', index=False)
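Put together, the steps above might look like the following sketch. The sample lists are invented for illustration, and to_csv is called without a path so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped lists.
tweetlist = ["|first tweet", "|second tweet", "|third tweet"]
userlist = ["@alice", "@bob"]

d = {"tweet": tweetlist, "user": userlist}

# Each dict key becomes a row; the shorter row is padded with missing values.
frame = pd.DataFrame.from_dict(d, orient="index")

# Transpose so that 'tweet' and 'user' become columns.
frame = frame.transpose()

# With no path argument, to_csv returns the CSV as a string.
csv_text = frame.to_csv(index=False)
print(csv_text)
```

In the CSV, the missing user shows up as an empty field at the end of the last row.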
Ekrem Gurdal answered Sep 28 '22 10:09