I have the following file named 'data.csv':
    1997,Ford,E350
    1997, Ford , E350
    1997,Ford,E350,"Super, luxurious truck"
    1997,Ford,E350,"Super ""luxurious"" truck"
    1997,Ford,E350," Super luxurious truck "
    "1997",Ford,E350
    1997,Ford,E350
    2000,Mercury,Cougar
And I would like to parse it into a pandas DataFrame so that the DataFrame looks as follows:
       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None
The best I could do was:
    pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"])
Which gets me:
       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997    Ford     E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350   Super luxurious truck 
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None
How can I get the DataFrame without those whitespaces?
Create a class based on csv.DictReader and override its fieldnames property to strip the whitespace from each field name (that is, each column header or dictionary key).
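A minimal sketch of that idea, using only the standard library. The class name StrippedHeaderDictReader and the sample data are made up for illustration, and note that this cleans the header names only, not the values in each row:

```python
import csv
import io


class StrippedHeaderDictReader(csv.DictReader):
    """DictReader that strips surrounding whitespace from its field names."""

    @property
    def fieldnames(self):
        # Let the parent class read the header row, then clean each name.
        names = super().fieldnames
        if names is None:
            return None
        return [name.strip() for name in names]


# Hypothetical sample with padded header names.
sample = io.StringIO(" Year , Make ,Model\n1997,Ford,E350\n")
rows = list(StrippedHeaderDictReader(sample))
print(rows)
```

Since the file in the question has no header row, this only helps when the whitespace is in the column names themselves.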
pandas provides three string methods for handling whitespace (including newlines) in text data. As the names suggest, Series.str.lstrip() removes whitespace from the left side of each string, Series.str.rstrip() removes it from the right side, and Series.str.strip() removes it from both sides.
Python's built-in str.strip() likewise removes leading and trailing whitespace. If you want to remove only leading or only trailing whitespace, use lstrip() or rstrip() instead.
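To make the difference between the three methods concrete, here is a small sketch on a made-up Series (the values are invented for illustration, not taken from data.csv):

```python
import pandas as pd

# Hypothetical values with whitespace on different sides.
s = pd.Series(["  Ford  ", " E350", "Cougar "])

left = s.str.lstrip().tolist()   # leading whitespace removed
right = s.str.rstrip().tolist()  # trailing whitespace removed
both = s.str.strip().tolist()    # both sides removed

print(left)
print(right)
print(both)
```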
You could use converters:
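    import pandas as pd

    def strip(text):
        # Strip surrounding whitespace; pass non-string values through unchanged.
        try:
            return text.strip()
        except AttributeError:
            return text

    def make_int(text):
        # Remove surrounding quotes and spaces before converting to int.
        return int(text.strip('" '))

    # Apply a converter to each column while the file is parsed.
    table = pd.read_table("data.csv", sep=',',
                          names=["Year", "Make", "Model", "Description"],
                          converters={'Description': strip,
                                      'Model': strip,
                                      'Make': strip,
                                      'Year': make_int})
    print(table)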
yields
       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None
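Adding the parameter skipinitialspace=True to read_table worked for me.
So try:
    pd.read_table("data.csv",
                  sep=',',
                  names=["Year", "Make", "Model", "Description"],
                  skipinitialspace=True)
The same thing works in pd.read_csv(). Note that skipinitialspace only skips spaces immediately after the delimiter; trailing spaces before the next delimiter are kept.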
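A small sketch of what skipinitialspace does and does not remove, using an in-memory copy of a few rows instead of data.csv (the three-column subset and the variable names are assumptions made for illustration):

```python
import io

import pandas as pd

# A few rows in the style of the question's file, read from memory.
raw = "1997,Ford,E350\n1997, Ford , E350\n2000,Mercury,Cougar\n"

df = pd.read_csv(io.StringIO(raw),
                 names=["Year", "Make", "Model"],
                 skipinitialspace=True)

# Leading spaces after each comma are gone, but the trailing space
# in " Ford " is still part of the value.
makes_before = df["Make"].tolist()

# A follow-up strip on the column removes what is left.
df["Make"] = df["Make"].str.strip()
makes_after = df["Make"].tolist()

print(makes_before)
print(makes_after)
```

So skipinitialspace alone may be enough when the padding only follows the delimiter; otherwise it can be combined with a strip on the affected columns.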
Well, the whitespace is in your data, so you can't read in the data without reading in the whitespace.  However, after you've read it in, you could strip out the whitespace by doing, e.g., df["Make"] = df["Make"].map(str.strip) (where df is your dataframe).
I don't have enough reputation to leave a comment, but the answer above suggesting map with str.strip won't work if you have NaN values, since str.strip only works on strings and NaN is a float.
There is a built-in pandas function to do this, which I used:
    pd.core.strings.str_strip(df['Description'])
where df is your DataFrame. In my case I used it on a DataFrame with ~1.2 million rows and it was very fast. Note that pd.core.strings is an internal module, so this function may not be available in newer pandas versions; the public equivalent is df['Description'].str.strip(), which also leaves NaN values untouched.
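As a sketch of the NaN-safe route through the public API, the .str accessor can be applied to each text column in turn. The small DataFrame below is invented for illustration and only mimics the shape of the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with padded strings and missing descriptions.
df = pd.DataFrame({
    "Make": ["Ford", " Ford ", "Mercury"],
    "Description": [np.nan, " Super luxurious truck ", np.nan],
})

# .str.strip() skips missing values instead of raising on them.
for col in ["Make", "Description"]:
    df[col] = df[col].str.strip()

print(df)
```

Listing the columns explicitly keeps the sketch independent of how a given pandas version labels string dtypes.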