Pandas vs. Polars: A Complete Comparison of Syntax, Speed, and Memory

Repost information

Original article: https://www.kdnuggets.com/pandas-vs-polars-a-complete-comparison-of-syntax-speed-and-memory

Original author: Bala Priya C

# Introduction

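If you have been working with data in Python, you have almost certainly used pandas. For over a decade it has been the go-to library for data manipulation. Recently, though, Polars has been gaining serious traction. Polars promises to be faster, more memory-efficient, and more intuitive than pandas. But is it worth learning? And how different is it really?

In this article, we compare pandas and Polars side by side. You will see performance benchmarks and learn the syntax differences. By the end, you will be able to make an informed decision for your next data project.

You can find the code on GitHub.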

# Getting Started

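First, let's install both libraries:

    pip install pandas polars

Note: This article uses pandas 2.2.2 and Polars 1.31.0.

For the comparisons, we need a dataset large enough to show real performance differences. We will use Faker to generate the test data:

    pip install Faker

Now we are ready to start coding.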

# Measuring Speed By Reading Large CSV Files

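Let's start with one of the most common operations: reading a CSV file. We will create a dataset with 1 million rows to see real performance differences.

First, let's generate the sample data: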

    import pandas as pd
    from faker import Faker
    import random

    # Generate a large CSV file for testing
    fake = Faker()
    Faker.seed(42)
    random.seed(42)

    data = {
        'user_id': range(1000000),
        'name': [fake.name() for _ in range(1000000)],
        'email': [fake.email() for _ in range(1000000)],
        'age': [random.randint(18, 80) for _ in range(1000000)],
        'salary': [random.randint(30000, 150000) for _ in range(1000000)],
        'department': [random.choice(['Engineering', 'Sales', 'Marketing', 'HR', 'Finance'])
                       for _ in range(1000000)]
    }

    df_temp = pd.DataFrame(data)
    df_temp.to_csv('large_dataset.csv', index=False)
    print("✓ Generated large_dataset.csv with 1M rows")

This code creates a CSV file with realistic data. Now let's compare read speeds:

    import pandas as pd
    import polars as pl
    import time

    # pandas: Read CSV
    start = time.time()
    df_pandas = pd.read_csv('large_dataset.csv')
    pandas_time = time.time() - start

    # Polars: Read CSV
    start = time.time()
    df_polars = pl.read_csv('large_dataset.csv')
    polars_time = time.time() - start

    print(f"Pandas read time: {pandas_time:.2f} seconds")
    print(f"Polars read time: {polars_time:.2f} seconds")
    print(f"Polars is {pandas_time/polars_time:.1f}x faster")

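Output when reading from the sample CSV:

    Pandas read time: 1.92 seconds
    Polars read time: 0.23 seconds
    Polars is 8.2x faster

Here is what is happening: we time how long each library takes to read the same CSV file. pandas uses its traditional single-threaded CSV reader, while Polars automatically parallelizes the read across multiple CPU cores. We then calculate the speedup factor.

On most machines, you will see Polars read CSVs 2-5x faster, and the gap widens with larger files. The exact factor depends on your hardware and the file; the sample run above came out at 8.2x.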

# Measuring Memory Usage During Operations

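Speed is not the only consideration. Let's see how much memory each library uses. We will perform a series of operations and measure memory consumption. If psutil is not already installed in your working environment, run pip install psutil first: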

    import pandas as pd
    import polars as pl
    import psutil
    import os
    import gc  # Import garbage collector for better memory release attempts

    def get_memory_usage():
        """Get current process memory usage in MB"""
        process = psutil.Process(os.getpid())
        return process.memory_info().rss / 1024 / 1024

    # --- Test with Pandas ---
    gc.collect()
    initial_memory_pandas = get_memory_usage()

    df_pandas = pd.read_csv('large_dataset.csv')
    filtered_pandas = df_pandas[df_pandas['age'] > 30]
    grouped_pandas = filtered_pandas.groupby('department')['salary'].mean()

    pandas_memory = get_memory_usage() - initial_memory_pandas
    print(f"Pandas memory delta: {pandas_memory:.1f} MB")

    del df_pandas, filtered_pandas, grouped_pandas
    gc.collect()

    # --- Test with Polars (eager mode) ---
    gc.collect()
    initial_memory_polars = get_memory_usage()

    df_polars = pl.read_csv('large_dataset.csv')
    filtered_polars = df_polars.filter(pl.col('age') > 30)
    grouped_polars = filtered_polars.group_by('department').agg(pl.col('salary').mean())

    polars_memory = get_memory_usage() - initial_memory_polars
    print(f"Polars memory delta: {polars_memory:.1f} MB")

    del df_polars, filtered_polars, grouped_polars
    gc.collect()

    # --- Summary ---
    if pandas_memory > 0 and polars_memory > 0:
        print(f"Memory savings (Polars vs Pandas): {(1 - polars_memory/pandas_memory) * 100:.1f}%")
    elif pandas_memory == 0 and polars_memory > 0:
        print(f"Polars used {polars_memory:.1f} MB while Pandas used 0 MB.")
    elif polars_memory == 0 and pandas_memory > 0:
        print(f"Polars used 0 MB while Pandas used {pandas_memory:.1f} MB.")
    else:
        print("Cannot compute memory savings due to zero or negative memory usage delta in both frameworks.")

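This code measures the memory footprint:

We use the psutil library to track memory usage before and after the operations.
Both libraries read the same file and perform filtering and grouping.
We calculate the difference in memory consumption.

Sample output: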

    Pandas memory delta: 44.4 MB
    Polars memory delta: 1.3 MB
    Memory savings (Polars vs Pandas): 97.1%

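The results above show the difference in memory usage between pandas and Polars when filtering and aggregating large_dataset.csv.

pandas memory delta: the memory pandas consumed for these operations.
Polars memory delta: the memory Polars consumed for the same operations.
Memory savings (Polars vs pandas): the percentage by which Polars reduced memory use relative to pandas.

Polars is generally more memory-efficient thanks to its columnar data storage and optimized execution engine. Typically, you will see a 30% to 70% improvement with Polars.

Note: Sequential memory measurements taken in the same Python process with psutil.Process(...).memory_info().rss can be misleading. Python's memory allocator does not always release memory back to the operating system immediately, so the "cleaned" baseline for a later test may still be affected by earlier operations. For the most accurate comparison, run each test in a separate, isolated Python process.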
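One way to follow that advice is to launch each workload in a fresh interpreter and read the child's peak resident set size when it finishes. The following is a minimal sketch under stated assumptions: the helper name `peak_rss_mb` is hypothetical, and it relies on the standard-library `resource` module, which is only available on Unix-like systems (it swaps psutil for `resource` so the child process needs no extra packages).

```python
import subprocess
import sys


def peak_rss_mb(workload: str) -> float:
    """Run `workload` in a fresh interpreter and return its peak RSS in MB."""
    # The child runs the workload, then prints its own peak memory on the last line.
    child_code = (
        workload
        + "\nimport resource"
        + "\nprint(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)"
    )
    completed = subprocess.run(
        [sys.executable, "-c", child_code],
        capture_output=True,
        text=True,
        check=True,
    )
    raw = int(completed.stdout.strip().splitlines()[-1])
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return raw / divisor


# Stand-in workloads so the sketch runs without any data file.
baseline = peak_rss_mb("pass")
with_list = peak_rss_mb("data = list(range(2_000_000))")

print(f"Empty interpreter: {baseline:.1f} MB")
print(f"After building a 2M-element list: {with_list:.1f} MB")
```

To compare the two libraries, you could pass each pipeline as its own workload string, for example `"import pandas as pd\ndf = pd.read_csv('large_dataset.csv')"`. The peak then includes the cost of importing the library, so measuring an import-only workload as a baseline is a reasonable extra step.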

# Comparing Syntax For Basic Operations

Now let's look at how the syntax differs between the two libraries. We will cover the most common operations you are likely to use.

// Selecting Columns

Let's select a subset of columns. For this (and the examples that follow), we will create a much smaller DataFrame.

    import pandas as pd
    import polars as pl

    # Create sample data
    data = {
        'name': ['Anna', 'Betty', 'Cathy'],
        'age': [25, 30, 35],
        'salary': [50000, 60000, 70000]
    }

    # Pandas approach
    df_pandas = pd.DataFrame(data)
    result_pandas = df_pandas[['name', 'salary']]

    # Polars approach
    df_polars = pl.DataFrame(data)
    result_polars = df_polars.select(['name', 'salary'])

    # Alternative: More expressive
    result_polars_alt = df_polars.select([pl.col('name'), pl.col('salary')])

    print("Pandas result:")
    print(result_pandas)
    print("\nPolars result:")
    print(result_polars)

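The key differences here are:

pandas uses bracket notation: df[['col1', 'col2']]
Polars uses the .select() method
Polars also supports the more expressive pl.col() syntax, which becomes powerful in complex operations

Output: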

    Pandas result:
        name  salary
    0   Anna   50000
    1  Betty   60000
    2  Cathy   70000

    Polars result:
    shape: (3, 2)
    ┌───────┬────────┐
    │ name  ┆ salary │
    │ ---   ┆ ---    │
    │ str   ┆ i64    │
    ╞═══════╪════════╡
    │ Anna  ┆ 50000  │
    │ Betty ┆ 60000  │
    │ Cathy ┆ 70000  │
    └───────┴────────┘

Both produce the same result, but the Polars syntax is more explicit about what you are doing.

// Filtering Rows

Now let's filter rows:

    # pandas: Filter rows where age > 28
    filtered_pandas = df_pandas[df_pandas['age'] > 28]

    # Alternative Pandas syntax with query
    filtered_pandas_alt = df_pandas.query('age > 28')

    # Polars: Filter rows where age > 28
    filtered_polars = df_polars.filter(pl.col('age') > 28)

    print("Pandas filtered:")
    print(filtered_pandas)
    print("\nPolars filtered:")
    print(filtered_polars)

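Notice the differences:

In pandas, we use boolean indexing with bracket notation. You can also use the .query() method.
Polars uses the .filter() method with pl.col() expressions.
The Polars syntax reads more like SQL: "where column age is greater than 28".

Output: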

    Pandas filtered:
        name  age  salary
    1  Betty   30   60000
    2  Cathy   35   70000

    Polars filtered:
    shape: (2, 3)
    ┌───────┬─────┬────────┐
    │ name  ┆ age ┆ salary │
    │ ---   ┆ --- ┆ ---    │
    │ str   ┆ i64 ┆ i64    │
    ╞═══════╪═════╪════════╡
    │ Betty ┆ 30  ┆ 60000  │
    │ Cathy ┆ 35  ┆ 70000  │
    └───────┴─────┴────────┘

// Adding New Columns

Now let's add new columns to the DataFrame:

    # pandas: Add a new column
    df_pandas['bonus'] = df_pandas['salary'] * 0.1
    df_pandas['total_comp'] = df_pandas['salary'] + df_pandas['bonus']

    # Polars: Add new columns
    df_polars = df_polars.with_columns([
        (pl.col('salary') * 0.1).alias('bonus'),
        (pl.col('salary') * 1.1).alias('total_comp')
    ])

    print("Pandas with new columns:")
    print(df_pandas)
    print("\nPolars with new columns:")
    print(df_polars)

Output:

    Pandas with new columns:
        name  age  salary   bonus  total_comp
    0   Anna   25   50000  5000.0     55000.0
    1  Betty   30   60000  6000.0     66000.0
    2  Cathy   35   70000  7000.0     77000.0

    Polars with new columns:
    shape: (3, 5)
    ┌───────┬─────┬────────┬────────┬────────────┐
    │ name  ┆ age ┆ salary ┆ bonus  ┆ total_comp │
    │ ---   ┆ --- ┆ ---    ┆ ---    ┆ ---        │
    │ str   ┆ i64 ┆ i64    ┆ f64    ┆ f64        │
    ╞═══════╪═════╪════════╪════════╪════════════╡
    │ Anna  ┆ 25  ┆ 50000  ┆ 5000.0 ┆ 55000.0    │
    │ Betty ┆ 30  ┆ 60000  ┆ 6000.0 ┆ 66000.0    │
    │ Cathy ┆ 35  ┆ 70000  ┆ 7000.0 ┆ 77000.0    │
    └───────┴─────┴────────┴────────┴────────────┘

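Here is what is happening:

pandas uses direct column assignment, which modifies the DataFrame in place.
Polars uses .with_columns() and returns a new DataFrame (immutable by default).
In Polars, you use .alias() to name the new columns.

The Polars approach encourages immutability and makes data transformations easier to read.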

# Measuring Performance In Grouping And Aggregating

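Let's look at a more useful example: grouping data and computing multiple aggregations. This code groups the data by department, computes several statistics on different columns, and times both operations to show the performance difference: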

    import time

    import pandas as pd
    import polars as pl

    # Load our large dataset
    df_pandas = pd.read_csv('large_dataset.csv')
    df_polars = pl.read_csv('large_dataset.csv')

    # pandas: Group by department and calculate stats
    start = time.time()
    result_pandas = df_pandas.groupby('department').agg({
        'salary': ['mean', 'median', 'std'],
        'age': 'mean'
    }).reset_index()
    # Flatten the two-level column index produced by the dictionary aggregation
    result_pandas.columns = ['department', 'avg_salary', 'median_salary', 'std_salary', 'avg_age']
    pandas_time = time.time() - start

    # Polars: Same operation
    start = time.time()
    result_polars = df_polars.group_by('department').agg([
        pl.col('salary').mean().alias('avg_salary'),
        pl.col('salary').median().alias('median_salary'),
        pl.col('salary').std().alias('std_salary'),
        pl.col('age').mean().alias('avg_age')
    ])
    polars_time = time.time() - start

    print(f"Pandas time: {pandas_time:.3f}s")
    print(f"Polars time: {polars_time:.3f}s")
    print(f"Speedup: {pandas_time/polars_time:.1f}x")
    print("\nPandas result:")
    print(result_pandas)
    print("\nPolars result:")
    print(result_polars)

Output:

    Pandas time: 0.126s
    Polars time: 0.077s
    Speedup: 1.6x

    Pandas result:
        department    avg_salary  median_salary    std_salary    avg_age
    0  Engineering  89954.929266        89919.0  34595.585863  48.953405
    1      Finance  89898.829762        89817.0  34648.373383  49.006690
    2           HR  90080.629637        90177.0  34692.117761  48.979005
    3    Marketing  90071.721095        90154.0  34625.095386  49.085454
    4        Sales  89980.433386        90065.5  34634.974505  49.003168

    Polars result:
    shape: (5, 5)
    ┌─────────────┬──────────────┬───────────────┬──────────────┬───────────┐
    │ department  ┆ avg_salary   ┆ median_salary ┆ std_salary   ┆ avg_age   │
    │ ---         ┆ ---          ┆ ---           ┆ ---          ┆ ---       │
    │ str         ┆ f64          ┆ f64           ┆ f64          ┆ f64       │
    ╞═════════════╪══════════════╪═══════════════╪══════════════╪═══════════╡
    │ HR          ┆ 90080.629637 ┆ 90177.0       ┆ 34692.117761 ┆ 48.979005 │
    │ Sales       ┆ 89980.433386 ┆ 90065.5       ┆ 34634.974505 ┆ 49.003168 │
    │ Engineering ┆ 89954.929266 ┆ 89919.0       ┆ 34595.585863 ┆ 48.953405 │
    │ Marketing   ┆ 90071.721095 ┆ 90154.0       ┆ 34625.095386 ┆ 49.085454 │
    │ Finance     ┆ 89898.829762 ┆ 89817.0       ┆ 34648.373383 ┆ 49.00669  │
    └─────────────┴──────────────┴───────────────┴──────────────┴───────────┘

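Breaking down the syntax:

pandas uses a dictionary to specify the aggregations, which can get confusing for complex operations.
Polars uses method chaining: each operation is explicit and named.

The Polars syntax is more verbose, but also more readable. You can see at a glance which statistics are being computed.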
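For completeness, pandas also offers named aggregation, which names each output column directly and avoids the manual column rename that the dictionary form needs. The sketch below applies it to a tiny made-up dataset (the rows are illustrative, not taken from large_dataset.csv).

```python
import pandas as pd

# Tiny illustrative dataset (hypothetical values)
df = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Sales", "Sales"],
    "salary": [100000, 120000, 60000, 80000],
    "age": [30, 40, 25, 35],
})

# Named aggregation: output_name=(source_column, aggregation_function)
result = df.groupby("department", as_index=False).agg(
    avg_salary=("salary", "mean"),
    median_salary=("salary", "median"),
    std_salary=("salary", "std"),
    avg_age=("age", "mean"),
)

print(result)
```

This gives pandas a form closer to the Polars `.alias()` chain, although each aggregation is still described by a tuple of strings rather than an expression.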

# Understanding Lazy Evaluation In Polars

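Lazy evaluation is one of the most useful features of Polars. It means Polars does not execute your query immediately. Instead, it plans the whole operation and optimizes it before running.

Let's see this in action: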

    import time

    import polars as pl

    # Read in lazy mode
    df_lazy = pl.scan_csv('large_dataset.csv')

    # Build a complex query
    result = (
        df_lazy
        .filter(pl.col('age') > 30)
        .filter(pl.col('salary') > 50000)
        .group_by('department')
        .agg([
            pl.col('salary').mean().alias('avg_salary'),
            pl.len().alias('employee_count')
        ])
        .filter(pl.col('employee_count') > 1000)
        .sort('avg_salary', descending=True)
    )

    # Nothing has been executed yet!
    print("Query plan created, but not executed")

    # Now execute the optimized query
    start = time.time()
    result_df = result.collect()  # This runs the query
    execution_time = time.time() - start

    print(f"\nExecution time: {execution_time:.3f}s")
    print(result_df)

Output:

    Query plan created, but not executed

    Execution time: 0.177s
    shape: (5, 3)
    ┌─────────────┬───────────────┬────────────────┐
    │ department  ┆ avg_salary    ┆ employee_count │
    │ ---         ┆ ---           ┆ ---            │
    │ str         ┆ f64           ┆ u32            │
    ╞═════════════╪═══════════════╪════════════════╡
    │ HR          ┆ 100101.595816 ┆ 132212         │
    │ Marketing   ┆ 100054.012365 ┆ 132470         │
    │ Sales       ┆ 100041.01049  ┆ 132035         │
    │ Finance     ┆ 99956.527217  ┆ 132143         │
    │ Engineering ┆ 99946.725458  ┆ 132384         │
    └─────────────┴───────────────┴────────────────┘

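Here, scan_csv() does not load the file immediately; it only plans to read it. We chain several filters, a group-by, and a sort. Polars analyzes the whole query and optimizes it. For example, it may apply the filters while reading the file rather than after loading all of the data.

The actual computation happens only when we call .collect(). The optimized query can run considerably faster than executing each step separately.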

# Wrapping Up

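As we have seen, Polars is very useful for data processing in Python. It is faster and more memory-efficient than pandas, and it has a cleaner API. That said, pandas is not going anywhere. It has over a decade of development, a huge ecosystem, and millions of users. For many projects, pandas is still the right choice.

Learn Polars if you are considering large-scale analysis, such as data engineering projects. The syntax differences are not large, and the performance gains are real. But keep pandas in your toolbox for compatibility and quick exploratory work.

Start by trying Polars on a side project or on a data pipeline that runs slowly. You will quickly get a feel for whether it suits your use case. Happy data wrangling!

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she is learning and sharing her knowledge with the developer community by writing tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.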
