Harnessing Apache Arrow for High-Speed Data Transfer from SQL Server to Python with mssql-python

Overview

Fetching a million rows from SQL Server into a Polars DataFrame used to mean creating a million Python row objects for the garbage collector to track, only to throw them all away when building the DataFrame. That inefficient workflow is now a thing of the past: the mssql-python driver has introduced support for retrieving SQL Server data directly as Apache Arrow structures. This provides a faster, more memory-efficient path for anyone working with SQL Server data in Polars, Pandas, DuckDB, or any other Arrow-native library. The feature was contributed by community developer Felix Graßl (@ffelixg) and marks a significant step forward in data interoperability.

Source: devblogs.microsoft.com

Before diving into the technical details, let's clarify some key terms:

- Apache Arrow: a language-independent, columnar in-memory data format, designed so that different libraries can share the same data without serialization or copying.
- Record batch: a contiguous chunk of columnar data with a common schema; large result sets are delivered as a sequence of record batches.
- Arrow-native library: a library such as Polars, DuckDB, or Pandas (with ArrowDtype) that can operate directly on Arrow buffers.

Prerequisites

Before you begin, ensure you have the following:

- A recent version of Python
- Access to a SQL Server instance and credentials to query it
- An up-to-date mssql-python driver (the Arrow support is a recent addition)
- At least one Arrow-native consumer, such as pyarrow, Polars, Pandas, or DuckDB

Install the necessary packages using pip:

pip install mssql-python pyarrow polars pandas duckdb

Step-by-Step Instructions

1. Connecting to SQL Server

Start by establishing a connection to your SQL Server instance. The mssql-python driver uses a familiar connection string format:

from mssql_python import connect

# ODBC-style connection string; substitute your own server, database,
# and credentials
conn = connect(
    "Server=localhost;"
    "Database=your_database;"
    "UID=your_username;"
    "PWD=your_password;"
)

If you're using Windows integrated authentication, omit the username and password and specify Trusted_Connection=yes in the connection string instead. Ensure the server and database names are correct.

2. Fetching Data as Apache Arrow Structures

The key improvement in mssql-python is the ability to fetch query results directly into Arrow record batches. Use the cursor.execute() method as usual, but then call fetch_arrow() instead of fetchall():

cursor = conn.cursor()
cursor.execute("SELECT * FROM your_table")
batches = cursor.fetch_arrow()  # Returns a list of pyarrow.RecordBatch

The fetch_arrow() method returns a sequence of Arrow record batches. Each batch holds a contiguous chunk of columnar data. You can combine them into a single Arrow table using pyarrow.Table.from_batches():

import pyarrow as pa

arrow_table = pa.Table.from_batches(batches)

Now you have the entire result set in Arrow format, with zero Python object creation per row during fetching.

3. Using Arrow Data with Polars

Polars can consume Arrow data directly without any conversion overhead. Pass the Arrow table to Polars:

import polars as pl

df_polars = pl.from_arrow(arrow_table)

Alternatively, you can feed the record batches directly to Polars' from_arrow() method. Polars will use the underlying Arrow buffers, avoiding any additional copying. This is the recommended approach for high-performance pipelines.

4. Using Arrow Data with Pandas

Pandas supports Arrow-backed columns via the ArrowDtype. To create a Pandas DataFrame from an Arrow table, use:

import pandas as pd

df_pandas = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)

This ensures the underlying data is stored as Arrow arrays, not as Python objects. The performance benefits are most noticeable with large datasets and string columns.

5. Using Arrow Data with DuckDB

DuckDB has first-class Arrow support. You can register an Arrow table as a virtual table and query it directly:

import duckdb

con = duckdb.connect()
con.register('my_data', arrow_table)
result = con.execute("SELECT * FROM my_data WHERE column > 100").fetch_arrow_table()
print(result)

DuckDB can also read Arrow from memory without copies, making it an excellent companion for analytical workloads.

6. Advanced: Working with C Data Interface

For maximum performance, you can bypass Python-level wrappers and use the Arrow C Data Interface directly. This is useful when passing data between libraries without any intermediate representation. mssql-python exposes the underlying C pointer arrays; see the library documentation for advanced use cases. Most users will be fine with the high-level fetch_arrow() method.

Common Mistakes

- Calling fetchall() and then building a DataFrame row by row: this reintroduces exactly the per-row Python object overhead the Arrow path eliminates.
- Converting to Pandas without types_mapper=pd.ArrowDtype: the data falls back to NumPy-backed columns, and string columns become Python objects again.
- Forgetting to combine record batches: fetch_arrow() returns a sequence of batches, so pass them through pyarrow.Table.from_batches() before handing the result to code that expects a single table.

Summary

Apache Arrow support in mssql-python revolutionizes how Python applications transfer data from SQL Server. By fetching data as columnar Arrow records directly from the driver, you avoid the traditional overhead of constructing Python objects row by row. This results in faster loading, lower memory consumption, and seamless interoperability with modern DataFrame libraries like Polars, Pandas (with ArrowDtype), and DuckDB. The prerequisites are minimal: a recent Python, the updated driver, and an Arrow-native library. Follow the step-by-step instructions above to start benefiting today.
