
Converting a list in a dict to a Series

I'm trying to read lines from an HTML input file and prepare Series / DataFrames so I can eventually create graphs. I'm using lxml's objectify to take lines of HTML data and convert them to a list. Whenever I try to take the list data and make a Series or DataFrame, I get a Series (or DataFrame) containing a number of elements equal to the number of items in my list, but the data for the elements is my list itself.

The easiest way I can show my problem is:

from lxml import etree
from lxml import objectify
from pandas import Series
line='<tr class="alt"><td>192.168.1.0</td><td>XXDHCP</td><td>Y</td><td>255</td><td>0</td><td>YYDHCP</td><td>Y</td><td>250</td><td>0</td><td>0%</td><td>505</td><td>505</td><td>0</td><td></td></tr>'
htmldata=(objectify.fromstring(line)).getchildren()
htmlseries=Series(htmldata)

htmlseries ends up looking like:

0     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
1     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
2     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
3     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
4     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
5     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
6     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
7     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
8     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
9     [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
10    [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
11    [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
12    [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...
13    [[[192.168.1.0, XXDHCP, Y, 255, 0, YYDHCP, Y, ...

type(htmldata[0]) is: lxml.objectify.StringElement
type(htmldata[3]) is: lxml.objectify.IntElement

While I'm looking for something like:

0     192.168.1.0
1          XXDHCP
2               Y
3             255
4               0
5          YYDHCP
6               Y
7             250
8               0
9              0%
10            505
11            505
12              0
13               

What am I doing wrong? I'm kind of mystified as to what's going on. When I try reading each column into a list:

data=objectify.fromstring(line).getchildren()
labdata[ip]['Scope'].append(data[0])
labdata[ip]['Cluster1'].append(data[1])
labdata[ip]['Active1'].append(data[2])
...etc...

My list ends up looking like:

labdata['192.168.1.0']['Utilization']
['100%',
 '96%',
 '96%',
 '90%',
 '81%',
 '96%',
 '90%',
 '97%',
 '98%',
 '92%',
 '99%',
 ...etc...
 ]

But for some reason:

Series(labdata['192.168.1.0']['Utilization'])
0     [[[192.168.1.0, XXDHCP, Y, 0, 383, YYDHCP, Y...
1     [[[192.168.1.0, XXDHCP, Y, 28, 355, YYDHCP, ...
2     [[[192.168.1.0, XXDHCP, Y, 28, 355, YYDHCP, ...
3     [[[192.168.1.0, XXDHCP, Y, 76, 307, YYDHCP, ...
4     [[[192.168.1.0, XXDHCP, Y, 104, 279, YYDHCP,...
5     [[[192.168.1.0, XXDHCP, Y, 27, 356, YYDHCP, ...
6     [[[192.168.1.0, XXDHCP, Y, 66, 317, YYDHCP, ...
7     [[[192.168.1.0, XXDHCP, Y, 15, 368, YYDHCP, ...
8     [[[192.168.1.0, XXDHCP, Y, 15, 368, YYDHCP, ...
9     [[[192.168.1.0, XXDHCP, Y, 54, 329, YYDHCP, ...
...etc...

type(labdata['192.168.1.0']['Utilization'][0]) is lxml.objectify.StringElement

Do I need to cast these elements to normal strings and ints?

dooz asked Apr 02 '13 16:04



2 Answers

The problem is that the elements in htmldata are not simple types, so np.isscalar is fooled here (it is how pandas determines whether it has a list of lists or a list of scalars). Just stringify the elements and this will work:

In [23]: print [ type(x) for x in htmldata ]
[<type 'lxml.objectify.StringElement'>, <type 'lxml.objectify.StringElement'>, <type 'lxml.objectify.StringElement'>, <type 'lxml.objectify.IntElement'>, <type 'lxml.objectify.IntElement'>, <type 'lxml.objectify.StringElement'>, <type 'lxml.objectify.StringElement'>, <type 'lxml.objectify.IntElement'>, <type 'lxml.objectify.IntElement'>, <type 'lxml.objectify.StringElement'>, <type 'lxml.objectify.IntElement'>, <type 'lxml.objectify.IntElement'>, <type 'lxml.objectify.IntElement'>, <type 'lxml.objectify.StringElement'>]

In [24]: Series([ str(x) for x in htmldata ])
Out[24]: 
0     192.168.1.0
1          XXDHCP
2               Y
3             255
4               0
5          YYDHCP
6               Y
7             250
8               0
9              0%
10            505
11            505
12              0
13               
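The same idea, getting plain strings before building the Series, can be sketched with the standard library alone. This is not the method used above: it swaps lxml.objectify for xml.etree.ElementTree, and it assumes the row is well-formed XML (true for the sample line). The helper name cell_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample row from the question, split across lines for readability.
line = ('<tr class="alt"><td>192.168.1.0</td><td>XXDHCP</td><td>Y</td>'
        '<td>255</td><td>0</td><td>YYDHCP</td><td>Y</td><td>250</td>'
        '<td>0</td><td>0%</td><td>505</td><td>505</td><td>0</td><td></td></tr>')


def cell_texts(row_markup):
    """Return the text of each child cell as a plain str (hypothetical helper)."""
    row = ET.fromstring(row_markup)
    # An empty <td></td> has text None, so fall back to an empty string.
    return [cell.text or "" for cell in row]


values = cell_texts(line)
print(values)
```

Because every item is an ordinary str, passing such a list to Series should give one value per row, as in the output above. With lxml.objectify itself, data elements also expose a pyval attribute that holds the plain Python value.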
Jeff answered Oct 13 '22 00:10


Nice question! This is weird behaviour.

The problem occurs because you're passing Series a list of lxml.objectify.StringElements. pandas is backed by NumPy arrays and therefore prefers to have its data stored in uniform arrays. It therefore wraps everything in the object dtype so that it can put the elements into an array. Indeed, if you look at the underlying array (Series.values) of your data, you'll see that it's been created fine, although it's a NumPy array of lxml.objectify.StringElements, which is probably not what you want.

The easy solution is of course to cast everything to string, which is probably what you want to do in this case.


But why is it printing funny, you ask? Well, if you drill through the code in pandas, you end up at the following function in pandas.core.common:

def _is_sequence(x):
    try:
        iter(x)
        len(x) # it has a length
        return not isinstance(x, basestring) and True
    except Exception:
        return False

In other words, pandas sees that the lxml objects support iter() and len() and are not basestrings, and hence assumes they're sequences. Pandas should probably check isinstance(x, collections.Sequence) instead (collections.abc.Sequence in Python 3).

Katriel answered Oct 13 '22 01:10