Create Dataframe

data = {"col1": [1, 2], "col2": [3, 4]}
df = pd.DataFrame(data=data)

Load and Save CSV

  • save to CSV: df.to_csv("path_or_buffer")
  • save to CSV (without row names / index): df.to_csv("path_or_buffer", index=False)
  • load from CSV:
df = pd.read_csv(
  • load csv without header: df = pd.read_csv("path_or_buffer", names=["column_name_1", "column_name_2"], header=None)

Load and Save CSV Parquet

  • save to parquet: df.to_parquet("<file_name>.parquet.gz", compression="gzip")
  • load from parquet: df = read_parquet("<file_name>.parquet.gz")

Display Data

  • count values in column (without NaN values): df["col_name"].value_counts()
  • count values in column (with NaN values): df["col_name"].value_counts(dropna=False)
  • duplicates
    • display duplicate rows: df[df.duplicated(keep=False)]
    • display duplicate entries in column: df[df["column_name"].duplicated(keep=False)]

Delete Data

  • delete column inline
    • df.drop("column_name", axis=1, inplace=True)
    • column_name can also be a list of str
  • remove rows on condition: df.drop(df[df["col_name"] == condition].index, inplace=True)
  • remove duplicates
    • keep first (inplace): df.drop_duplicates(inplace=True, keep="first")
    • only consider certain columns to identify duplicates, keep first (inplace): df.drop_duplicates(list_of_cols, inplace=True, keep="first")

Modify Data

  • sort
    • low to high values: df.sort_values("column_name", inplace=True)
    • high to low values: df.sort_values("column_name", ascending=False, inplace=True)
    • high to low values & Nan values on top: df.sort_values("column_name", ascending=False, na_position="first")
  • shuffle: df = df.sample(frac=1).reset_index(drop=True)

Combine Data

Stack two Dataframes

Never forget to ignore_index or you have duplicate index values and bad things might happen later!

df = pd.concat([df_01, df_02], ignore_index=True)

Display Settings

Examples for display settings:

pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

# display long sentences in multiple rows (Jupyter)
pd.set_option("display.max_colwidth", None)

Filter nan Values

nan == nan is always false. That is why we can not use == to check for nan-values. Use pd.isnull(obj : scalar or array-like) instead or isnull(). Examples:



  • rename columns: df.rename(columns={"a": "x"}, inplace=True)
Last modified November 4, 2022: add show duplicates (1a12245)