
How can I remove extra whitespace from strings when parsing a csv file in Pandas?

I have the following file named 'data.csv':

    1997,Ford,E350
    1997, Ford , E350
    1997,Ford,E350,"Super, luxurious truck"
    1997,Ford,E350,"Super ""luxurious"" truck"
    1997,Ford,E350," Super luxurious truck "
    "1997",Ford,E350
    1997,Ford,E350
    2000,Mercury,Cougar

And I would like to parse it into a pandas DataFrame so that the DataFrame looks as follows:

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

The best I could do was:

    pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"])

Which gets me:

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997    Ford     E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350   Super luxurious truck 
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

How can I get the DataFrame without the extra whitespace?

mpjan asked Nov 14 '12 19:11


4 Answers

You could use converters:

    import pandas as pd

    def strip(text):
        # Strip surrounding whitespace; pass non-string values through unchanged.
        try:
            return text.strip()
        except AttributeError:
            return text

    def make_int(text):
        # Remove any stray quotes or spaces before converting to int.
        return int(text.strip('" '))

    table = pd.read_table("data.csv", sep=',',
                          names=["Year", "Make", "Model", "Description"],
                          converters={'Description': strip,
                                      'Model': strip,
                                      'Make': strip,
                                      'Year': make_int})
    print(table)

yields

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

unutbu answered Oct 17 '22 04:10


Adding the parameter skipinitialspace=True to read_table worked for me.

So try:

    pd.read_table("data.csv",
                  sep=',',
                  names=["Year", "Make", "Model", "Description"],
                  skipinitialspace=True)

The same thing works in pd.read_csv().
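
One caveat worth checking on your own data: skipinitialspace only skips spaces that directly follow the delimiter, so trailing spaces stay in the value. The sketch below is a minimal illustration, assuming two rows from the question's file inlined through io.StringIO (the variable names raw and df are only for this example):

```python
import io
import pandas as pd

# Two rows taken from the question's data.csv, inlined so the sketch is self-contained.
raw = '1997, Ford , E350\n1997,Ford,E350," Super luxurious truck "\n'

df = pd.read_csv(
    io.StringIO(raw),
    names=["Year", "Make", "Model", "Description"],
    skipinitialspace=True,
)

# The space after the comma is skipped, but the space before the next comma is kept.
print(repr(df.loc[0, "Make"]))
print(repr(df.loc[0, "Model"]))

# Printed for inspection: how the quoted field with inner padding comes out.
print(repr(df.loc[1, "Description"]))
```

If trailing spaces matter, this option probably needs to be combined with a strip step after reading, or with converters.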

TheGrimmScientist answered Oct 17 '22 04:10


Well, the whitespace is in your data, so you can't read in the data without reading in the whitespace. However, after you've read it in, you could strip out the whitespace by doing, e.g., df["Make"] = df["Make"].map(str.strip) (where df is your dataframe).
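
As a rough sketch of that post-processing approach, the same map call can be applied to each text column in turn. The inlined sample data and the explicit column list are assumptions made for this example, and it only works when the columns contain no missing values:

```python
import io
import pandas as pd

# Two rows in the style of the question's file, inlined for a self-contained example.
raw = '1997, Ford , E350\n2000,Mercury,Cougar\n'
df = pd.read_csv(io.StringIO(raw), names=["Year", "Make", "Model"])

# Strip leading and trailing whitespace from each text column after reading.
for col in ["Make", "Model"]:
    df[col] = df[col].map(str.strip)

print(df)
```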

BrenBarn answered Oct 17 '22 04:10


I don't have enough reputation to leave a comment, but the answer above that suggests using map with str.strip won't work if the column contains NaN values, since str.strip only works on strings and NaN is a float.

There is a built-in pandas function for this, which I used: pd.core.strings.str_strip(df['Description']), where df is your dataframe. Note that this is an internal function and may not be available in every pandas version; the public accessor df['Description'].str.strip() does the same job and skips NaN values. In my case I used it on a dataframe with ~1.2 million rows and it was very fast.
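
For a column that mixes strings and missing values, a minimal sketch using the public Series.str.strip() accessor could look like this. The inlined sample rows are an assumption for illustration; the first row has no Description, so pandas reads it as a missing value:

```python
import io
import pandas as pd

# Row 0 has only three fields, so its Description is read as missing.
raw = '1997,Ford,E350\n1997,Ford,E350," Super luxurious truck "\n'
df = pd.read_csv(io.StringIO(raw), names=["Year", "Make", "Model", "Description"])

# The .str accessor strips real strings and leaves missing values as missing.
df["Description"] = df["Description"].str.strip()

print(df)
```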

RKD314 answered Oct 17 '22 05:10