Pandas concat

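The concat() function joins multiple DataFrame or Series objects together. It can stack them vertically (adding rows) or place them side by side (adding columns).

Vertical concatenation (stacking rows)

Basic usage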

import pandas as pd

df1 = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30]
})

df2 = pd.DataFrame({
    'name': ['Charlie', 'David'],
    'age': [35, 28]
})

# Vertical concatenation
result = pd.concat([df1, df2])
print(result)

      name  age
0    Alice   25
1      Bob   30
0  Charlie   35
1    David   28

Note that each input keeps its original index labels, so the result contains duplicates (0, 1, 0, 1).
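
Because the original labels are kept, label-based lookups on the result can return more rows than expected. A minimal sketch, rebuilding the same `df1` and `df2` as above so it runs on its own:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Charlie', 'David'], 'age': [35, 28]})

result = pd.concat([df1, df2])

# The index is 0, 1, 0, 1, so it is no longer unique
print(result.index.is_unique)

# Label 0 now matches one row from each input
rows_at_zero = result.loc[0]
print(rows_at_zero)
```

If unique row labels matter downstream, passing ignore_index=True is usually the simplest fix.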

Resetting the index

# Use ignore_index to renumber the rows from 0
result = pd.concat([df1, df2], ignore_index=True)
print(result)

      name  age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   28

Tagging rows with their source

result = pd.concat([df1, df2], keys=['first', 'second'])
print(result)

              name  age
first  0    Alice   25
       1      Bob   30
second 0  Charlie   35
       1    David   28
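
The keys argument builds a MultiIndex whose outer level records which input each row came from. The sketch below shows one way that level might be used afterwards; the level names `source` and `row` are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [25, 30]})
df2 = pd.DataFrame({'name': ['Charlie', 'David'], 'age': [35, 28]})

# names= labels the two index levels ('source' and 'row' are illustrative names)
result = pd.concat([df1, df2], keys=['first', 'second'], names=['source', 'row'])

# Select every row that came from the second input
second_rows = result.loc['second']
print(second_rows)

# Turn the source label into an ordinary column
flat = result.reset_index(level='source')
print(flat.columns.tolist())
```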

Handling mismatched columns

Automatic column alignment

df1 = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

df2 = pd.DataFrame({
    'B': [5, 6],
    'C': [7, 8]
})

# By default all columns are kept; missing values are filled with NaN
result = pd.concat([df1, df2], ignore_index=True)
print(result)

     A  B    C
0  1.0  3  NaN
1  2.0  4  NaN
2  NaN  5  7.0
3  NaN  6  8.0
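
The output above also shows a side effect: columns that receive NaN are upcast from integer to float, while the complete column B keeps its integer type. A short sketch that inspects the dtypes and shows one possible way to restore integers, assuming zero is an acceptable fill value for the gaps:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

result = pd.concat([df1, df2], ignore_index=True)

# A and C gain NaN and become float64; B is complete and stays int64
print(result.dtypes)

# One option if the gaps should be zeros: fill, then cast back to integers
filled = result.fillna(0).astype('int64')
print(filled)
```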

Keeping only the shared columns

result = pd.concat([df1, df2], join='inner', ignore_index=True)
print(result)
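
To make the effect of join='inner' concrete, the following sketch compares the columns kept by the default outer join with those kept by the inner join, using the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

outer = pd.concat([df1, df2], ignore_index=True)
inner = pd.concat([df1, df2], join='inner', ignore_index=True)

print(outer.columns.tolist())  # ['A', 'B', 'C']
print(inner.columns.tolist())  # ['B']
print(inner)
```

Only B appears in both inputs, so it is the only column that survives, and no NaN values are introduced.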

Horizontal concatenation (side-by-side columns)

Use axis=1 to concatenate horizontally:

df1 = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30]
})

df2 = pd.DataFrame({
    'city': ['Taipei', 'Tokyo'],
    'salary': [50000, 60000]
})

# Horizontal concatenation
result = pd.concat([df1, df2], axis=1)
print(result)

    name  age    city  salary
0  Alice   25  Taipei   50000
1    Bob   30   Tokyo   60000

Index alignment

Horizontal concatenation aligns rows by index label, not by position:

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

result = pd.concat([df1, df2], axis=1)
print(result)

     A    B
0  1.0  NaN
1  2.0  4.0
2  3.0  5.0
3  NaN  6.0
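
If the NaN rows are unwanted, two common options are to keep only the labels present in both inputs, or to discard the labels and pair rows by position. A minimal sketch of both, using the same frames:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

# Keep only index labels present in both inputs (here 1 and 2)
inner = pd.concat([df1, df2], axis=1, join='inner')
print(inner)

# Pair rows by position instead: drop the old labels first
by_position = pd.concat(
    [df1.reset_index(drop=True), df2.reset_index(drop=True)],
    axis=1,
)
print(by_position)
```

Which option is appropriate depends on whether the index labels carry meaning; positional pairing silently assumes the rows are already in matching order.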

Concatenating Series

s1 = pd.Series([1, 2], name='A')
s2 = pd.Series([3, 4], name='B')

# Vertical concatenation
result = pd.concat([s1, s2])

# Horizontal concatenation into a DataFrame
result = pd.concat([s1, s2], axis=1)
print(result)

   A  B
0  1  3
1  2  4
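
The vertical result is not printed above, so here is a brief sketch of what it looks like, along with the keys argument as one way to choose column labels for the horizontal case. The labels `x` and `y` are arbitrary.

```python
import pandas as pd

s1 = pd.Series([1, 2], name='A')
s2 = pd.Series([3, 4], name='B')

# Vertical: the result is still a Series, and the index labels repeat
stacked = pd.concat([s1, s2])
print(type(stacked).__name__)
print(stacked.index.tolist())

# Horizontal: keys= overrides the Series names as column labels
wide = pd.concat([s1, s2], axis=1, keys=['x', 'y'])
print(wide.columns.tolist())
```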

Concatenating multiple DataFrames

dfs = [df1, df2, df3]  # a list of DataFrames; df3 is assumed to be defined elsewhere
result = pd.concat(dfs, ignore_index=True)

Practical applications

Merging multiple CSV files

import glob

# Read every CSV file in the directory
files = glob.glob('data/*.csv')
dfs = [pd.read_csv(f) for f in files]

# Combine into a single DataFrame
df = pd.concat(dfs, ignore_index=True)
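
Once the files are merged, it is no longer obvious which file a row came from. The sketch below records that in an extra column. It writes two small CSV files to a temporary directory so it runs anywhere; the file names and the `source_file` column name are hypothetical.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files so the sketch is self-contained
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({'value': [1, 2]}).to_csv(os.path.join(tmp_dir, 'a.csv'), index=False)
pd.DataFrame({'value': [3]}).to_csv(os.path.join(tmp_dir, 'b.csv'), index=False)

# glob.glob returns paths in arbitrary order, so sort for a reproducible result
files = sorted(glob.glob(os.path.join(tmp_dir, '*.csv')))

# 'source_file' is a hypothetical column name recording each row's origin
dfs = [pd.read_csv(f).assign(source_file=os.path.basename(f)) for f in files]

df = pd.concat(dfs, ignore_index=True)
print(df)
```

Note that pd.concat raises a ValueError when given an empty list, so it may be worth checking that the glob pattern matched at least one file.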

Building a DataFrame incrementally

# Start with an empty DataFrame
result = pd.DataFrame()

for i in range(3):
    new_data = pd.DataFrame({'value': [i * 10]})
    result = pd.concat([result, new_data], ignore_index=True)

This approach performs poorly, however. It is better to collect the pieces in a list and call concat once:

dfs = []
for i in range(3):
    dfs.append(pd.DataFrame({'value': [i * 10]}))
result = pd.concat(dfs, ignore_index=True)
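
When each iteration produces just one row, another option (not covered in the text above, and offered only as a possible alternative) is to collect plain dicts and construct the DataFrame once, skipping the one-row DataFrames entirely. A minimal sketch that also checks the result matches the concat version:

```python
import pandas as pd

# Collect one plain dict per row
rows = []
for i in range(3):
    rows.append({'value': i * 10})

# Build the DataFrame in a single step
result = pd.DataFrame(rows)

# Same data built with the list-then-concat pattern, for comparison
via_concat = pd.concat(
    [pd.DataFrame({'value': [i * 10]}) for i in range(3)],
    ignore_index=True,
)
print(result.equals(via_concat))
```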

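Checking for duplicate index labels

# verify_integrity checks the index of the result, not the columns
result = pd.concat([df1, df2], verify_integrity=True)
# Raises ValueError if the result would contain duplicate index labels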
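
The check applies to index labels along the concatenation axis. A short sketch of the failure case, and of why the check has no effect when combined with ignore_index=True (the result then gets a fresh index, so duplicates cannot occur):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})  # index 0, 1
df2 = pd.DataFrame({'A': [3, 4]})  # index 0, 1 again

# Overlapping labels make verify_integrity raise
try:
    pd.concat([df1, df2], verify_integrity=True)
    raised = False
except ValueError as exc:
    raised = True
    print(f"concat refused: {exc}")

# With ignore_index=True the rows are renumbered, so the check always passes
ok = pd.concat([df1, df2], verify_integrity=True, ignore_index=True)
print(ok.index.tolist())
```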

concat vs append

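Older versions of pandas provided an append() method. It was deprecated in pandas 1.4 and removed in pandas 2.0, so use concat() instead:

# Old style (removed in pandas 2.0)
# result = df1.append(df2)

# New style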
result = pd.concat([df1, df2], ignore_index=True)
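
A common use of the old append() was adding a single row given as a dict. One way to express that with concat, sketched here with made-up data, is to wrap the dict in a one-row DataFrame first:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice'], 'age': [25]})
new_row = {'name': 'Bob', 'age': 30}

# Wrap the dict in a one-row DataFrame, then concatenate
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print(df)
```

For many rows, the same performance advice applies: gather them first and concatenate once.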

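Performance notes

Avoid calling concat repeatedly in a loop: each call creates a new DataFrame and copies everything accumulated so far, so the total cost grows quickly with the number of pieces.
Collect the pieces in a list and call concat once: this is the recommended approach.
Preallocate: if the final size is known in advance, you can create the DataFrame at that size up front and fill it in.

# Not recommended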
result = pd.DataFrame()
for df in dfs:
    result = pd.concat([result, df])

# Recommended
result = pd.concat(dfs, ignore_index=True)