Pandas .loc and PEP8

I've tried to search this a number of times but I don't see it answered so here goes...

I often use pandas to clean up a dataframe and conform it to my needs. With this comes a lot of .loc accessing to query it and return values. Depending on what I am doing (and the length of the column names), the lines can get pretty long. Given that PEP 8 limits lines to 79 characters, are there any best practices? Some examples below (these are simplified and for explanatory purposes):

missing_address_df = address_df.loc[address_df['address'].notnull()].copy()

or multiple query points:

nc_drive_df = address_df.loc[(address_df['address'].str.contains('drive')) & (address_df['state'] == 'NC')]
Tom Watson asked Oct 24 '20 19:10



1 Answer

I'd advise two things:

  • Ignore PEP 8's 79-character limit, but try to keep lines to 120 or 150 characters
    Keeping some line-length limit makes sense as an aid to readability, but if you're trying to keep to 79 characters in (for example) a class method, it will lead to worse and less-readable code

    PEP 8 actually has a section on this, A Foolish Consistency is the Hobgoblin of Little Minds, which describes cases where you should deviate from its other advice, for example

    1. When applying the guideline would make the code less readable, even for someone who is used to reading code that follows this PEP
  • Split the .loc contents onto multiple lines

    nc_drive_df = address_df.loc[
        (address_df['address'].str.contains('drive')) &
        (address_df['state'] == 'NC')
    ]
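
As a minimal runnable sketch of this split, applied to both examples from the question: the sample rows below are made up for illustration, and `na=False` is an addition so that the missing address counts as "no match" instead of producing a missing value in the mask.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
address_df = pd.DataFrame({
    'address': ['12 oak drive', '9 elm street', '4 pine drive', None],
    'state': ['NC', 'NC', 'SC', 'NC'],
})

# First example from the question: rows whose address is present.
# The square brackets allow implicit line continuation, so no
# backslashes are needed.
has_address_df = address_df.loc[
    address_df['address'].notnull()
].copy()

# Second example: two conditions, one per line.
# na=False (an addition here) treats the missing address as a non-match.
nc_drive_df = address_df.loc[
    (address_df['address'].str.contains('drive', na=False)) &
    (address_df['state'] == 'NC')
]

print(has_address_df)
print(nc_drive_df)
```

Every line stays well under 79 characters, and each condition can be read on its own.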
    

It's hard to be objective about when code "looks bad" despite being valid syntax, but you will come to recognize it. In practice, PEP 8 and cyclomatic complexity checkers are tools that help you argue for or against a code style on measurable grounds rather than taste.


If you have a great many boolean conditions, you can (and often must) break them up with parentheses to make their order of evaluation explicit:

nc_drive_df = address_df.loc[
    (
        (address_df['address'].str.contains('drive')) &
        (address_df['state'] == 'NC')
    ) | (
        address_df['zip'] == "00000"
    )
]
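
A related option, which is not part of the advice above but is a common pattern, is to give each condition its own name before combining them. The sketch below assumes made-up sample rows, and the mask names (`is_drive`, `in_nc`, `placeholder_zip`) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
address_df = pd.DataFrame({
    'address': ['12 oak drive', '9 elm street', '4 pine drive', '1 main road'],
    'state': ['NC', 'NC', 'SC', 'VA'],
    'zip': ['27601', '27513', '29401', '00000'],
})

# Each condition becomes a named boolean Series (a "mask").
is_drive = address_df['address'].str.contains('drive')
in_nc = address_df['state'] == 'NC'
placeholder_zip = address_df['zip'] == '00000'

# The combined expression is short, and the parentheses make the
# grouping of & and | explicit.
nc_drive_df = address_df.loc[(is_drive & in_nc) | placeholder_zip]

print(nc_drive_df)
```

Naming the masks keeps the final `.loc` line short, at the cost of a few extra variables.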

Ending each line with the operator is somewhat in conflict with PEP 8, which suggests breaking before binary operators so that the operator starts the continuation line. I challenge this when forming a pandas boolean mask, because every condition must refer to the same dataframe to get a good result, and that is likely easier to check, especially when working with many dataframes, when the dataframe name comes first on each line.
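
For comparison, here is a small sketch of the two layouts side by side, using made-up sample rows. Both are valid Python inside the brackets and select the same rows; the variable names `trailing_style` and `leading_style` are only labels for this example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
address_df = pd.DataFrame({
    'address': ['12 oak drive', '9 elm street', '4 pine drive'],
    'state': ['NC', 'NC', 'SC'],
})

# Operator at the end of the line: the dataframe name starts each line.
trailing_style = address_df.loc[
    (address_df['address'].str.contains('drive')) &
    (address_df['state'] == 'NC')
]

# Operator at the start of the line: the layout PEP 8 suggests.
leading_style = address_df.loc[
    (address_df['address'].str.contains('drive'))
    & (address_df['state'] == 'NC')
]

print(trailing_style.equals(leading_style))
```

The choice between them is purely about readability, so it is worth settling on one per project.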

Finally, when doing scientific Python, you should try several approaches against both a partial and the full dataset, where possible, to draw sound performance conclusions. Treat readability as secondary to that, and favor excellent comments that describe and link to your research over any particular style.
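
As one hedged sketch of such a comparison, the snippet below times two ways of writing the same filter on a partial and a full copy of made-up data. The two candidates (`str.contains` with its default regex matching and with `regex=False`) are only examples chosen for illustration; the function names and the data are hypothetical, and the timings will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical data, repeated to get a frame large enough to time.
address_df = pd.DataFrame({
    'address': ['12 oak drive', '9 elm street',
                '4 pine drive', '1 main road'] * 25_000,
})
# A partial sample for quick experiments before running on everything.
sample_df = address_df.head(1_000)


def contains_regex(df):
    # Default behaviour: the pattern is treated as a regular expression.
    return df.loc[df['address'].str.contains('drive')]


def contains_literal(df):
    # regex=False treats the pattern as a plain substring.
    return df.loc[df['address'].str.contains('drive', regex=False)]


# Check that both candidates agree before comparing their speed.
assert contains_regex(address_df).equals(contains_literal(address_df))

for label, df in [('partial', sample_df), ('full', address_df)]:
    for func in (contains_regex, contains_literal):
        seconds = timeit.timeit(lambda: func(df), number=5)
        print(f'{label:8} {func.__name__:18} {seconds:.4f}s')
```

Whichever variant wins, a comment next to the chosen line noting what was measured makes the decision easier to revisit later.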

ti7 answered Oct 12 '22 21:10