
Python: how to do basic data manipulation like in R?

Tags: python, r

I have been working with R for several years, and R is very strong at data manipulation. I'm now learning Python and would like to know how to manipulate data with it. My data sets are organized as data frames (e.g. like an Excel sheet). Could someone show, by example, how the following basic data manipulation tasks can be done in Python?

1. Read csv file like the following

var1, var2, var3
1, 2, 3
4, 5, 6 
7, 8, 9

2. Subset data where var2 in ('5', '8') 
3. Make a new variable --> var4 = var3 * 3
4. Transpose this data
5. Write to csv file

Your help and examples are most appreciated!

jjoras asked Feb 04 '11 15:02



2 Answers

I disagree with Cpfohl's comment - perhaps because I've been through this same transition myself, and it's not obvious how a naive user would be able to formulate the problem more precisely. It is actually an active development problem right now with a number of projects that have all come up with non-overlapping functionality (e.g. in the financial timeseries world, in the brain imaging world, etc.).

The short answer is that Python's various libraries for dealing with tables and CSV files are not as beginner-friendly as those in R, which are the end result of many years of use by people of varying skill levels.

First, have a look at recarrays in numpy. This is probably the closest thing to R's data.frame that you'll find in a commonly used library. In particular, you'll probably like the numpy.recfromcsv function, though it is not as robust as e.g. read.csv in R (it will have trouble with non-standard line endings, for example).
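As a rough sketch of steps (1) and (2) from the question, here is one way to read the sample data into a recarray and subset it. It uses numpy.genfromtxt, the more general reader that recfromcsv builds on, and reads from an in-memory string (csv_text is just a stand-in for the real file) so the example is self-contained:

```python
import io
import numpy as np

# Inline copy of the question's data; a stand-in for a real file on disk.
csv_text = "var1, var2, var3\n1, 2, 3\n4, 5, 6\n7, 8, 9\n"

# names=True takes the field names from the header row, and autostrip=True
# removes the spaces that follow each comma.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True,
                     dtype=int, autostrip=True)

# View the structured array as a recarray to get attribute-style access.
records = data.view(np.recarray)

# Keep only the rows where var2 is 5 or 8.
subset = records[np.isin(records.var2, [5, 8])]
```

With a real file you would pass its path in place of the StringIO object.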

Subsetting a recarray is easy (though creating one can seem clunky):

import numpy as np
mydata = np.array([(1.0, 2), (3.0, 4)], dtype=[('x', float), ('y', int)])
mydata = mydata.view(np.recarray)
mydata[mydata.x > 2]

Modifying the nature of a numpy array is not generally as easy as in R, but there is a nice library of functions in numpy.lib.recfunctions (which must be imported separately - it doesn't come along with a simple import numpy). In particular, check out rec_append_fields and rec_join for adding columns.
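For step (3) of the question, a minimal sketch with rec_append_fields might look like the following. The table here is built by hand purely for illustration; note that the function returns a new recarray rather than modifying the original in place:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hand-built copy of the question's data, for illustration only.
table = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
                 dtype=[("var1", int), ("var2", int), ("var3", int)])

# Add var4 = var3 * 3 as a new field; the result is a new recarray.
table4 = rfn.rec_append_fields(table, "var4", table["var3"] * 3)
```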

Numpy has a function numpy.savetxt that will accept a simple delimiter argument to make a csv file, but it will not print column names sadly (at least, I don't see that it does). So, while I discourage adding unnecessary libraries (since it gives less portable code), you might just use matplotlib.mlab.rec2csv (you'll find some other similar functions in that neighborhood as well - the numpy community is trying to port generally useful numeric / data manip code to numpy proper. Who knows, maybe you'll do this?).
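For what it's worth, recent NumPy releases do give numpy.savetxt a header argument, so the column names can be written without an extra library. A small sketch, writing to an in-memory buffer instead of a file so it is easy to inspect (the table is again made up for illustration):

```python
import io
import numpy as np

# Hand-built structured array, for illustration only.
table = np.array([(1, 2, 3), (4, 5, 6)],
                 dtype=[("var1", int), ("var2", int), ("var3", int)])

buffer = io.StringIO()
# header supplies the column names; comments="" suppresses the "# " prefix
# that savetxt would otherwise put in front of the header line.
np.savetxt(buffer, table, fmt="%d", delimiter=",",
           header=",".join(table.dtype.names), comments="")
csv_out = buffer.getvalue()
```

Passing a filename instead of the buffer writes the same text to disk.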

You'll notice I didn't answer (4), because that doesn't make sense. Tables don't transpose in python or R. Arrays or matrices do. So, convert your data to an array with a uniform dtype, then just use myarray.T.
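As a sketch of that conversion, assuming every column holds the same kind of number, you could stack the fields into a plain 2-D array and transpose that:

```python
import numpy as np

# Hand-built structured array, for illustration only.
table = np.array([(1, 2, 3), (4, 5, 6)],
                 dtype=[("var1", int), ("var2", int), ("var3", int)])

# Stack the fields side by side into an ordinary 2-D array of one dtype.
matrix = np.column_stack([table[name] for name in table.dtype.names])

# Rows become columns and vice versa; the field names are not carried along.
transposed = matrix.T
```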

Other tools you might look at are pytables (and the related package carray), larry, datarray, pandas and tabular. In particular, datarray is looking to create a system for labelled data arrays which would serve as a foundation to other projects (and I think has developers from the larry and pandas projects as well).
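To give a feel for one of these, here is a hedged sketch of all five steps from the question in pandas. It reads from and writes to in-memory strings (csv_text and out are stand-ins for real files), and it is only one of several ways to do this:

```python
import io
import pandas as pd

# Inline copy of the question's data; a stand-in for a real file on disk.
csv_text = "var1, var2, var3\n1, 2, 3\n4, 5, 6\n7, 8, 9\n"

# 1. Read the CSV; skipinitialspace drops the space after each comma.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# 2. Subset the rows where var2 is 5 or 8 (copy so we can add a column).
subset = df[df["var2"].isin([5, 8])].copy()

# 3. Make a new variable.
subset["var4"] = subset["var3"] * 3

# 4. Transpose: the column names become the row labels.
transposed = subset.T

# 5. Write to CSV, keeping the row labels but not the old row numbers.
out = io.StringIO()
transposed.to_csv(out, header=False)
result = out.getvalue()
```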

Hope that helps! Dav

Dav Clark answered Oct 13 '22 00:10


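import csv

with open('source.csv', newline='') as f:
    # strip the spaces around each value
    rows = ([cell.strip() for cell in row] for row in csv.reader(f))
    # keep the header row and name the new column
    header = next(rows) + ['var4']
    # filter data
    data = (row for row in rows if row[1] in ('5', '8'))
    # make a new variable
    data = (row + [int(row[2]) * 3] for row in data)
    # transpose data (in Python 3 the built-in zip replaces itertools.izip)
    data = zip(header, *data)
    # write data to a new csv file
    with open('destination.csv', 'w', newline='') as fw:
        csv.writer(fw).writerows(data)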

nosklo answered Oct 12 '22 23:10