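Inspecting Data with Pandas

When you receive a new dataset, the first step is usually to understand its basic properties. Pandas provides several methods for inspecting and exploring data.

Sample data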

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

head() and tail()

View the first and last few rows:

# First 5 rows (default)
print(df.head())

# First 3 rows
print(df.head(3))

# Last 5 rows (default)
print(df.tail())

# Last 2 rows
print(df.tail(2))

shape

View the dimensions of the DataFrame (number of rows and columns):

print(df.shape)
# (5, 4)  # 5 rows, 4 columns
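
Because shape is a plain tuple, it can be unpacked directly. The following is a minimal sketch that rebuilds the sample data so it runs on its own; the names n_rows and n_cols are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

# shape is a (rows, columns) tuple, so it unpacks like any other tuple
n_rows, n_cols = df.shape
print(n_rows, n_cols)   # 5 4

# len(df) returns the row count only
print(len(df))          # 5
```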

columns

View all column names:

print(df.columns)
# Index(['name', 'age', 'city', 'salary'], dtype='object')

# Convert to a list
print(df.columns.tolist())
# ['name', 'age', 'city', 'salary']

dtypes

View the data type of each column:

print(df.dtypes)

name       object
age       float64
city       object
salary      int64
dtype: object

object: usually strings
int64: integers
float64: floating-point numbers
bool: Boolean values
datetime64: dates and times
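
One detail in the output above is easy to miss: age holds whole numbers but is reported as float64. That happens because the None entry is stored as NaN, which is a float. The sketch below illustrates this with two standalone Series (the variable names are illustrative), and shows the nullable 'Int64' dtype as one possible way to keep integer semantics.

```python
import pandas as pd

ages_with_gap = pd.Series([25, 30, 35, 28, None])
ages_complete = pd.Series([25, 30, 35, 28])

# The missing value forces a float column; without it the column is int64
print(ages_with_gap.dtype)   # float64
print(ages_complete.dtype)   # int64

# The nullable integer dtype keeps integers and marks the gap as <NA>
nullable_ages = ages_with_gap.astype('Int64')
print(nullable_ages.dtype)   # Int64
```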

info()

Show a summary of the DataFrame, including the number of columns, data types, non-null counts, and memory usage:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    5 non-null      object 
 1   age     4 non-null      float64
 2   city    5 non-null      object 
 3   salary  5 non-null      int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 288.0+ bytes

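From this output you can quickly see:

There are 5 rows and 4 columns in total
The age column has a missing value (only 4 non-null values)
The data type of each column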
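
The missing-value reading can be cross-checked directly. As a small sketch, isna().sum() counts the missing entries per column; the sample data is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

# isna() marks missing cells as True; summing counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Only age has a gap, matching the "4 non-null" line from info()
print(missing_per_column['age'])   # 1
```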

describe()

Show summary statistics for the numeric columns:

print(df.describe())

             age        salary
count   4.000000      5.000000
mean   29.500000  60000.000000
std     4.203173   7905.694150
min    25.000000  50000.000000
25%    27.250000  55000.000000
50%    29.000000  60000.000000
75%    31.250000  65000.000000
max    35.000000  70000.000000

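What each statistic means:

count: number of non-null values
mean: arithmetic mean
std: sample standard deviation
min: minimum value
25%, 50%, 75%: quartiles (50% is the median)
max: maximum value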
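
Each row of the describe() table corresponds to a Series method, so the figures can be reproduced one at a time. A short sketch for the age column of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

age = df['age']

# Missing values are skipped by default in all of these
print(age.count())            # 4
print(age.mean())             # 29.5
print(round(age.std(), 6))    # 4.203173
print(age.quantile(0.25))     # 27.25
print(age.median())           # 29.0
```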

Including all columns

# Include non-numeric columns as well
print(df.describe(include='all'))
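
With include='all', the summary gains extra rows for non-numeric columns: unique (number of distinct values), top (most frequent value), and freq (how often that value occurs). Cells that do not apply to a column are filled with NaN. A minimal sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

summary = df.describe(include='all')

# Row labels now cover both text and numeric statistics
print(summary.index.tolist())

# city has 3 distinct values; its most frequent value appears twice
print(summary.loc['unique', 'city'])   # 3
print(summary.loc['freq', 'city'])     # 2

# Text-only statistics are NaN for numeric columns
print(pd.isna(summary.loc['unique', 'salary']))   # True
```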

Showing only specific statistics

# Show only the mean and standard deviation
print(df.describe().loc[['mean', 'std']])

value_counts()

Count how many times each value appears in a column:

print(df['city'].value_counts())

Taipei    2
Tokyo     2
Seoul     1
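Name: city, dtype: int64

In pandas 2.0 and later, the output starts with a city header line and ends with Name: count, dtype: int64 instead.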

# Show proportions instead of counts
print(df['city'].value_counts(normalize=True))

# Include missing values in the counts
print(df['age'].value_counts(dropna=False))
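
To make the effect of these two options concrete, here is a small sketch on the sample data. With normalize=True the counts are divided by the total, and with dropna=False the missing age gets its own entry.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

# 2 of 5 rows are Taipei, 1 of 5 is Seoul
proportions = df['city'].value_counts(normalize=True)
print(proportions['Taipei'])   # 0.4
print(proportions['Seoul'])    # 0.2

# By default the missing age is left out; dropna=False counts it too
default_counts = df['age'].value_counts()
all_counts = df['age'].value_counts(dropna=False)
print(default_counts.sum())    # 4
print(all_counts.sum())        # 5
```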

unique() and nunique()

# Get the distinct values
print(df['city'].unique())
# ['Taipei' 'Tokyo' 'Seoul']

# Count the distinct values
print(df['city'].nunique())
# 3
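
Note that nunique() ignores missing values unless told otherwise, and it also works on a whole DataFrame. A brief sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

# The missing age is not counted by default
print(df['age'].nunique())               # 4
print(df['age'].nunique(dropna=False))   # 5

# Called on the DataFrame, nunique() returns one count per column
distinct_per_column = df.nunique()
print(distinct_per_column)
```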

Checking memory usage

# Basic version
print(df.memory_usage())

# Detailed version (includes the actual size of object-dtype data)
print(df.memory_usage(deep=True))
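
The difference between the two calls only shows up for columns that hold Python objects such as strings; numeric columns report the same size either way. The sketch below compares the totals on the sample data. The exact byte counts depend on the pandas version and platform, so only the relationships are shown.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

shallow = df.memory_usage()
deep = df.memory_usage(deep=True)

# Numeric columns: 5 rows x 8 bytes, identical in both views
print(shallow['salary'], deep['salary'])   # 40 40

# The deep total is never smaller than the shallow one
print(deep.sum() >= shallow.sum())         # True

# index=False leaves out the entry for the row index
columns_only = df.memory_usage(index=False)
print(columns_only.index.tolist())
```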

sample()

View a random sample of the data:

# Show 2 random rows
print(df.sample(2))

# Set a random seed (for reproducible results)
print(df.sample(2, random_state=42))
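
As a quick illustration of what reproducible means here, two calls with the same random_state return identical rows. The sketch also shows the frac parameter, which samples a fraction of the rows rather than a fixed number.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

# Same seed, same rows
first = df.sample(2, random_state=42)
second = df.sample(2, random_state=42)
print(first.equals(second))   # True

# frac=0.4 of 5 rows gives 2 rows
subset = df.sample(frac=0.4, random_state=0)
print(len(subset))            # 2
```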

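A standard workflow for a quick first look

When you get new data, you will typically run these steps in order:

# 1. See what the data looks like
print(df.head())

# 2. Check the dimensions
print(df.shape)

# 3. Check column information and data types
df.info()

# 4. Check the numeric statistics
print(df.describe())

# 5. Check the distribution of categorical columns
print(df['city'].value_counts())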
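
If you run these steps often, they can be bundled into a small function. The sketch below is one possible way to do it, not an established pandas API: quick_overview and its max_categories parameter are hypothetical names, and treating a non-numeric column with few distinct values as categorical is an assumption made for this example.

```python
import pandas as pd

def quick_overview(frame, max_categories=10):
    """Hypothetical helper: print the usual first-look summaries."""
    print(frame.head())
    print(frame.shape)
    frame.info()
    print(frame.describe())
    # Assumption: non-numeric columns with few distinct values are categorical
    categorical = [
        col for col in frame.select_dtypes(exclude='number').columns
        if frame[col].nunique() <= max_categories
    ]
    for col in categorical:
        print(frame[col].value_counts())
    return categorical

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'age': [25, 30, 35, 28, None],
    'city': ['Taipei', 'Tokyo', 'Seoul', 'Taipei', 'Tokyo'],
    'salary': [50000, 60000, 70000, 55000, 65000]
})

# name has 5 distinct values, so with a limit of 3 only city qualifies
categorical_columns = quick_overview(df, max_categories=3)
print(categorical_columns)   # ['city']
```

The threshold is a judgment call: set it too high and identifier-like columns such as name get counted as categories, too low and genuine categories are skipped.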