Pandas join

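join() is a DataFrame method that combines two DataFrames by aligning them on their index. It is a simpler, index-oriented counterpart to merge(), intended mainly for index-on-index joins.

Basic usage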

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 2, 3]
}, index=['a', 'b', 'c'])

df2 = pd.DataFrame({
    'B': [4, 5, 6]
}, index=['a', 'b', 'd'])

# join() defaults to a left join; B becomes float because 'c' has no match (NaN)
result = df1.join(df2)
print(result)

   A    B
a  1  4.0
b  2  5.0
c  3  NaN

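Join types

# left join (default): keep every row of the left DataFrame
result = df1.join(df2, how='left')

# right join: keep every row of the right DataFrame
result = df1.join(df2, how='right')

# inner join: keep only index labels present on both sides
result = df1.join(df2, how='inner')

# outer join: keep every index label from either side
result = df1.join(df2, how='outer')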
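
To make the difference between the four options concrete, the following sketch rebuilds the two DataFrames from the basic-usage example and prints which index labels survive each join type. The dictionary name index_by_how is only an illustrative helper, not part of pandas.

```python
import pandas as pd

# Same data as the basic-usage example: 'c' exists only on the left, 'd' only on the right
df1 = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=['a', 'b', 'd'])

# Collect the resulting index labels for each join type
index_by_how = {}
for how in ['left', 'right', 'inner', 'outer']:
    joined = df1.join(df2, how=how)
    index_by_how[how] = joined.index.tolist()
    print(f"{how:>5}: {index_by_how[how]}")
```

Labels without a match on the other side are kept or dropped depending on how, and the missing values are filled with NaN.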

Using a column as the key

To join on a column rather than the index, one option is to call set_index() on both DataFrames first:

df1 = pd.DataFrame({
    'key': ['a', 'b', 'c'],
    'value1': [1, 2, 3]
})

df2 = pd.DataFrame({
    'key': ['a', 'b', 'd'],
    'value2': [4, 5, 6]
})

# Move 'key' into the index on both sides, then join
result = df1.set_index('key').join(df2.set_index('key'))
print(result)

     value1  value2
key                
a         1     4.0
b         2     5.0
c         3     NaN

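Alternatively, use the on parameter. The left DataFrame is matched on the named column, the right DataFrame on its index, and the left DataFrame's original index is kept in the result: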

df2_indexed = df2.set_index('key')
result = df1.join(df2_indexed, on='key')
print(result)

  key  value1  value2
0   a       1     4.0
1   b       2     5.0
2   c       3     NaN

Joining multiple DataFrames

df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['a', 'b'])
df3 = pd.DataFrame({'C': [5, 6]}, index=['a', 'b'])

# Join several DataFrames in one call by passing a list
result = df1.join([df2, df3])
print(result)

   A  B  C
a  1  3  5
b  2  4  6

Handling duplicate column names

df1 = pd.DataFrame({'value': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'value': [3, 4]}, index=['a', 'b'])

# Add suffixes so the two 'value' columns can be told apart
result = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(result)

   value_left  value_right
a           1            3
b           2            4
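
The suffixes matter because join() refuses to guess names for overlapping columns. The sketch below (variable names left, right and overlap_error are illustrative) shows the error raised when no suffix is given, and that supplying a suffix for one side only is enough to resolve the clash.

```python
import pandas as pd

left = pd.DataFrame({'value': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'value': [3, 4]}, index=['a', 'b'])

# Without lsuffix/rsuffix, overlapping column names raise a ValueError
overlap_error = None
try:
    left.join(right)
except ValueError as exc:
    overlap_error = exc
    print(f"join failed: {exc}")

# A suffix on one side is sufficient; the other side keeps its original name
result = left.join(right, rsuffix='_right')
print(result.columns.tolist())
```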

join vs merge

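Feature	join	merge
Join key	Index by default	Any columns or the index
Syntax	`df1.join(df2)`	`pd.merge(df1, df2)`
Default type	left join	inner join
Multiple DataFrames	Accepts a list	Requires chained calls
Flexibility	Lower	Higher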
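
As an illustration of the "Multiple DataFrames" row, here is a minimal sketch of what chaining merge() calls can look like next to a single join() call. Using functools.reduce is just one way to express the chain, and the names frames, via_join and via_merge are illustrative.

```python
from functools import reduce

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=['a', 'b'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['a', 'b'])
df3 = pd.DataFrame({'C': [5, 6]}, index=['a', 'b'])
frames = [df1, df2, df3]

# join(): one call with a list of the other DataFrames
via_join = df1.join([df2, df3])

# merge(): one call per pair, folded over the list from left to right
via_merge = reduce(
    lambda left, right: pd.merge(
        left, right, left_index=True, right_index=True, how='left'
    ),
    frames,
)
print(via_merge)
```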

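When to use join

Both DataFrames have a meaningful index and should be combined on it
Several DataFrames need to be joined in one call
A simple index-on-index join is all that is required

When to use merge

The join key is a column rather than the index
More complex join logic is needed (for example, joining on several columns)
The key columns have different names on each side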
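
The last case can be sketched as follows. The orders and customers tables and their column names are made up for illustration; left_on and right_on are the merge() parameters that name the key column on each side.

```python
import pandas as pd

# Hypothetical data: the key is 'customer_id' on one side and 'id' on the other
orders = pd.DataFrame({'customer_id': [1, 2, 3], 'amount': [250, 100, 75]})
customers = pd.DataFrame({'id': [1, 2, 4], 'name': ['Ann', 'Bob', 'Dee']})

# left_on/right_on match differently named key columns; both columns stay in the result
merged = pd.merge(
    orders, customers, left_on='customer_id', right_on='id', how='left'
)
print(merged)
```

Customer 3 has no matching row in customers, so its id and name come back as NaN.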

Equivalent merge call

# Using join
result = df1.join(df2)

# Equivalent call using merge
result = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
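
The equivalence can be checked directly. This sketch reuses the DataFrames from the basic-usage example (non-overlapping column names) and compares the two results with pd.testing.assert_frame_equal, which raises an AssertionError if they differ.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=['a', 'b', 'd'])

via_join = df1.join(df2)
via_merge = pd.merge(df1, df2, left_index=True, right_index=True, how='left')

# Raises AssertionError if values, dtypes, index, or columns differ
pd.testing.assert_frame_equal(via_join, via_merge)
print("join and merge produced identical results")
```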

Practical examples

Combining multiple data sources

# Suppose several datasets share the same date index
dates = pd.date_range('2024-01-01', periods=5)

sales = pd.DataFrame({'sales': [100, 150, 120, 180, 200]}, index=dates)
visitors = pd.DataFrame({'visitors': [1000, 1200, 1100, 1500, 1800]}, index=dates)
costs = pd.DataFrame({'costs': [50, 60, 55, 70, 80]}, index=dates)

# Join them all in one call
result = sales.join([visitors, costs])
result['conversion'] = result['sales'] / result['visitors']
print(result)

Joining time series data

# Stock prices
stock_a = pd.DataFrame({
    'price_a': [100, 102, 101]
}, index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']))

stock_b = pd.DataFrame({
    'price_b': [50, 51, 52]
}, index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']))

portfolio = stock_a.join(stock_b)
portfolio['total'] = portfolio['price_a'] + portfolio['price_b']
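
The two series above happen to share exactly the same dates. When they do not, a left join leaves gaps or drops dates. One possible approach, sketched here under the assumption that carrying the last known price forward is acceptable for the analysis, is an outer join followed by ffill(). The dates and prices are made up for illustration.

```python
import pandas as pd

# Hypothetical prices: stock_b has no quote on 2024-01-02 but has one on 2024-01-04
stock_a = pd.DataFrame(
    {'price_a': [100, 102, 101]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
)
stock_b = pd.DataFrame(
    {'price_b': [50, 52, 53]},
    index=pd.to_datetime(['2024-01-01', '2024-01-03', '2024-01-04']),
)

# The outer join keeps every date from either side; ffill() carries the last known price forward
portfolio = stock_a.join(stock_b, how='outer').ffill()
portfolio['total'] = portfolio['price_a'] + portfolio['price_b']
print(portfolio)
```

Whether forward-filling is appropriate depends on the data; for some analyses an inner join that keeps only common dates may be the safer choice.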