I'm new to pandas and this is my first question on Stack Overflow. I'm trying to do some analytics with pandas.
I have some text files with data records that I want to process. Each line of the file corresponds to one record, whose fields sit at fixed positions and have fixed lengths in characters. Different kinds of records appear in the same file; all of them share the first field, two characters that identify the record type. As an example:
Some file:

    01Jhon      Smith     555-1234
    03Cow Bos primigenius taurus 00401
    01Jannette  Jhonson           00100000000
    ...

    field    start  length
    type     1      2       *common to all records, example: 01 = person, 03 = animal
    name     3      10
    surname  13     10
    phone    23     8
    credit   31     11
    fill of spaces
I'm writing some code to convert one record to a dictionary:
    person1 = {'type': 1, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
    person2 = {'type': 1, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
    animal1 = {'type': 3, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1}
If a field is empty (filled with spaces), it will not be in the dictionary.
From all the records of one kind I want to create a pandas DataFrame with the dict keys as column names. I've tried pandas.DataFrame.from_dict() without success.
So here is my question: is there a way to do this with pandas so that the dict keys become column names? Is there any other standard method for dealing with this kind of file?
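The record-to-dictionary step described above could be sketched roughly as follows. This is only an illustration, not the asker's actual code: `PERSON_LAYOUT` and `parse_record` are hypothetical names, the layout is copied from the field table in the question, and every value is kept as a string (so the type stays `'01'` and the credit is not converted to a number).

```python
# Hypothetical layout for type "01" records: (field name, 1-based start, length),
# copied from the field table in the question.
PERSON_LAYOUT = [
    ("type", 1, 2),
    ("name", 3, 10),
    ("surname", 13, 10),
    ("phone", 23, 8),
    ("credit", 31, 11),
]


def parse_record(line, layout):
    """Slice one fixed-width line into a dict, leaving out blank fields."""
    record = {}
    for field, start, length in layout:
        # Convert the 1-based start column into a 0-based slice.
        value = line[start - 1:start - 1 + length].strip()
        if value:  # fields filled with spaces are skipped
            record[field] = value
    return record


# Sample line built to match the layout: 2 + 10 + 10 + 8 + 11 characters.
line = "01" + "Jhon".ljust(10) + "Smith".ljust(10) + "555-1234" + " " * 11
print(parse_record(line, PERSON_LAYOUT))
```

A separate layout list would be needed for each record type, chosen by looking at the first two characters of the line.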
To make a DataFrame from a dictionary, you can pass a list of dictionaries:
    >>> person1 = {'type': 1, 'name': 'Jhon', 'surname': 'Smith', 'phone': '555-1234'}
    >>> person2 = {'type': 1, 'name': 'Jannette', 'surname': 'Jhonson', 'credit': 1000000.00}
    >>> animal1 = {'type': 3, 'cname': 'cow', 'sciname': 'Bos....', 'legs': 4, 'tails': 1}
    >>> pd.DataFrame([person1])
       name     phone surname  type
    0  Jhon  555-1234   Smith     1
    >>> pd.DataFrame([person1, person2])
        credit      name     phone  surname  type
    0      NaN      Jhon  555-1234    Smith     1
    1  1000000  Jannette       NaN  Jhonson     1
    >>> pd.DataFrame.from_dict([person1, person2])
        credit      name     phone  surname  type
    0      NaN      Jhon  555-1234    Smith     1
    1  1000000  Jannette       NaN  Jhonson     1
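If the dictionaries for several record types end up in one list, one possible way to get a DataFrame per type is to group them first. This is a minimal sketch under that assumption; `records`, `by_type` and `frames` are names made up for the illustration, and the types are written as plain integers.

```python
from collections import defaultdict

import pandas as pd

# Mixed list of record dicts, as produced by some earlier parsing step.
records = [
    {"type": 1, "name": "Jhon", "surname": "Smith", "phone": "555-1234"},
    {"type": 3, "cname": "cow", "sciname": "Bos....", "legs": 4, "tails": 1},
    {"type": 1, "name": "Jannette", "surname": "Jhonson", "credit": 1000000.00},
]

# Group the dicts by record type so each DataFrame only gets the
# columns that belong to that kind of record.
by_type = defaultdict(list)
for record in records:
    by_type[record["type"]].append(record)

# One DataFrame per type; keys missing from a dict become NaN.
frames = {rtype: pd.DataFrame(rows) for rtype, rows in by_type.items()}

print(frames[1])
print(frames[3])
```

Grouping before building the frames avoids a single wide DataFrame in which the person rows carry empty animal columns and vice versa.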
For the more fundamental issue of differently formatted records intermixed in one file, and assuming the files aren't so big that we can't read them and store them in memory, I'd use StringIO to make an object which is sort of like a file but which only has the lines we want, and then use read_fwf (fixed-width file). For example:
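    from io import StringIO  # on Python 2: from StringIO import StringIO

    def get_filelike_object(filename, line_prefix):
        # Collect only the lines that start with the given record-type prefix.
        s = StringIO()
        with open(filename, "r") as fp:
            for line in fp:
                if line.startswith(line_prefix):
                    s.write(line)
        # Rewind so the caller reads from the beginning.
        s.seek(0)
        return s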
and then
    >>> type01 = get_filelike_object("animal.dat", "01")
    >>> df = pd.read_fwf(type01, names="type name surname phone credit".split(),
    ...                  widths=[2, 10, 10, 8, 11], header=None)
    >>> df
       type      name  surname     phone     credit
    0     1      Jhon    Smith  555-1234        NaN
    1     1  Jannette  Jhonson       NaN  100000000
should work. Of course, you could also separate the records into one file per type before pandas ever sees them, which might be easiest of all.
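Separating the records by type up front could look something like the sketch below. It uses only the standard library; `split_by_type` is a hypothetical helper, the two-character prefix length comes from the question, and the sample lines (including the shortened animal record) are invented stand-ins for a real file.

```python
from collections import defaultdict
from io import StringIO


def split_by_type(lines, prefix_length=2):
    """Group fixed-width lines by their leading record-type code."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():  # ignore blank lines
            groups[line[:prefix_length]].append(line)
    return groups


# Stand-in for an open file; a real file object can be passed the same way.
sample = StringIO(
    "01" + "Jhon".ljust(10) + "Smith".ljust(10) + "555-1234" + "\n"
    "03Cow\n"
    "01" + "Jannette".ljust(10) + "Jhonson".ljust(10) + " " * 8 + "00100000000\n"
)

groups = split_by_type(sample)
for code in sorted(groups):
    print(code, len(groups[code]))
```

Each group can then be written to its own file, or joined and wrapped in a StringIO and handed to read_fwf with the widths for that record type.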