如何有效地打开一个巨大的Excel文件

我有一个150MB的单页excel文件，需要大约7分钟，打开一个非常强大的机器上使用以下内容：

# using python import xlrd wb = xlrd.open_workbook(file) sh = wb.sheet_by_index(0)

有没有办法更快地打开Excel文件？我打开甚至非常古怪的build议（如hadoop，火花，c，java等）。理想情况下，我正在寻找一种方法来在30秒内打开文件，如果这不是一个梦想。另外，上面的例子是使用python，但它不一定是python。

注意：这是来自客户端的Excel文件。 在我们收到它之前，它不能被转换成任何其他的格式。 这不是我们的文件

更新：回答一个代码的工作示例，将在30秒内打开以下200MB的Excel文件将奖励赏金： https ： //drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view？ usp =共享。这个文件应该有string（col 1），date（col 9）和数字（col 11）。

那么，如果你的excel会像你的例子一样简单的CSV文件（ https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing ），你可以尝试打开文件为一个zip文件，并直接读取每个XML：

英特尔i5 4460,12GB内存，SSD三星EVO PRO。

如果你有很多的内存ram：这段代码需要很多内存，但需要20〜25秒。（您需要参数-Xmx7g）

 package com.devsaki.opensimpleexcel; import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintWriter; import java.nio.charset.Charset; import java.time.LocalDateTime; import java.time.format.DateTimeFormatter; import java.util.concurrent.ExecutionException; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.zip.ZipFile; public class Multithread { public static final char CHAR_END = (char) -1; public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx"; ZipFile zipFile = new ZipFile(excelFile); long init = System.currentTimeMillis(); ExecutorService executor = Executors.newFixedThreadPool(4); char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray(); Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor)); char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray(); Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString))); Object[][] sheet = futureSheet1.get(); String[] words = futureWords.get(); executor.shutdown(); long end = System.currentTimeMillis(); System.out.println("only read: " + (end - init) / 1000); ///Doing somethin with the file::Saving as csv init = System.currentTimeMillis(); try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) { for (Object[] rows : sheet) { for (Object cell : rows) { if (cell != null) { if (cell instanceof Integer) { writer.append(words[(Integer) cell]); } else if (cell instanceof String) { writer.append(toDate(Double.parseDouble(cell.toString()))); } else { writer.append(cell.toString()); //Probably a number } } writer.append(";"); } writer.append("\n"); } } end = System.currentTimeMillis(); System.out.println("Main saving to csv: " + (end - init) / 1000); } private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME; private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2); //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date public static String toDate(double s) { return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600)))); } public static String readEntry(ZipFile zipFile, String entry) throws IOException { System.out.println("Initialing readEntry " + entry); long init = System.currentTimeMillis(); String result = null; try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) { br.readLine(); result = br.readLine(); } long end = System.currentTimeMillis(); System.out.println("readEntry '" + entry + "': " + (end - init) / 1000); return result; } public static String[] processSharedStrings(CharReader br) throws IOException { System.out.println("Initialing processSharedStrings"); long init = System.currentTimeMillis(); String[] words = null; char[] wordCount = "Count=\"".toCharArray(); char[] token = "<t>".toCharArray(); String uniqueCount = extractNextValue(br, wordCount, '"'); words = new String[Integer.parseInt(uniqueCount)]; String nextWord; int currentIndex = 0; while ((nextWord = extractNextValue(br, token, '<')) != null) { words[currentIndex++] = nextWord; br.skip(11); //you can skip at least 11 chars "/t></si><si>" } long end = System.currentTimeMillis(); System.out.println("SharedStrings: " + (end - init) / 1000); return words; } public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException { System.out.println("Initialing processSheet1"); long init = System.currentTimeMillis(); char[] dimensionToken = "dimension ref=\"".toCharArray(); String dimension = extractNextValue(br, dimensionToken, '"'); int[] sizes = extractSizeFromDimention(dimension.split(":")[1]); br.skip(30); //Between dimension and next tag c exists more or less 30 chars Object[][] result = new Object[sizes[0]][sizes[1]]; int parallelProcess = 8; int currentIndex = br.currentIndex; CharReader[] charReaders = new CharReader[parallelProcess]; int totalChars = Math.round(br.chars.length / parallelProcess); for (int i = 0; i < parallelProcess; i++) { int endIndex = currentIndex + totalChars; charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i); currentIndex = endIndex; } Future[] futures = new Future[parallelProcess]; for (int i = charReaders.length - 1; i >= 0; i--) { final int j = i; futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result)); } for (Future future : futures) { future.get(); } long end = System.currentTimeMillis(); System.out.println("Sheet1: " + (end - init) / 1000); return result; } public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) { System.out.println("Initialing inParallelProcess : " + br.identifier); char[] tokenOpenC = "<cr=\"".toCharArray(); char[] tokenOpenV = "<v>".toCharArray(); char[] tokenAttributS = " s=\"".toCharArray(); char[] tokenAttributT = " t=\"".toCharArray(); String v; int firstCurrentIndex = br.currentIndex; boolean first = true; while ((v = extractNextValue(br, tokenOpenC, '"')) != null) { if (first && back != null) { int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1; first = false; System.out.println("Adding to : " + back.identifier + " From : " + br.identifier); back.plusLength(sum); } int[] indexes = extractSizeFromDimention(v); int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT); char type = 's'; //3 types: number (n), string (s) and date (d) if (s == 0) { // Token S = number or date char read = br.read(); if (read == '1') { type = 'n'; } else { type = 'd'; } } else if (s == -1) { type = 'n'; } String c = extractNextValue(br, tokenOpenV, '<'); Object value = null; switch (type) { case 'n': value = Double.parseDouble(c); break; case 's': try { value = Integer.parseInt(c); } catch (Exception ex) { System.out.println("Identifier Error : " + br.identifier); } break; case 'd': value = c.toString(); break; } result[indexes[0] - 1][indexes[1] - 1] = value; br.skip(7); ///v></c> } } static class CharReader { char[] chars; int currentIndex; int length; int identifier; public CharReader(char[] chars) { this.chars = chars; this.length = chars.length; } public CharReader(char[] chars, int currentIndex, int length, int identifier) { this.chars = chars; this.currentIndex = currentIndex; if (length > chars.length) { this.length = chars.length; } else { this.length = length; } this.identifier = identifier; } public void plusLength(int n) { if (this.length + n <= chars.length) { this.length += n; } } public char read() { if (currentIndex >= length) { return CHAR_END; } return chars[currentIndex++]; } public void skip(int n) { currentIndex += n; } } public static int[] extractSizeFromDimention(String dimention) { StringBuilder sb = new StringBuilder(); int columns = 0; int rows = 0; for (char c : dimention.toCharArray()) { if (columns == 0) { if (Character.isDigit(c)) { columns = convertExcelIndex(sb.toString()); sb = new StringBuilder(); } } sb.append(c); } rows = Integer.parseInt(sb.toString()); return new int[]{rows, columns}; } public static int foundNextTokens(CharReader br, char until, char[]... tokens) { char character; int[] indexes = new int[tokens.length]; while ((character = br.read()) != CHAR_END) { if (character == until) { break; } for (int i = 0; i < indexes.length; i++) { if (tokens[i][indexes[i]] == character) { indexes[i]++; if (indexes[i] == tokens[i].length) { return i; } } else { indexes[i] = 0; } } } return -1; } public static String extractNextValue(CharReader br, char[] token, char until) { char character; StringBuilder sb = new StringBuilder(); int index = 0; while ((character = br.read()) != CHAR_END) { if (index == token.length) { if (character == until) { return sb.toString(); } else { sb.append(character); } } else { if (token[index] == character) { index++; } else { index = 0; } } } return null; } public static int convertExcelIndex(String index) { int result = 0; for (char c : index.toCharArray()) { result = result * 26 + ((int) c - (int) 'A' + 1); } return result; } }

旧的答案（不需要参数Xms7g，所以占用较less的内存）：需要用HDD打开和读取约35秒（200MB）的示例文件，SDD需要less一点（30秒）。

这里的代码： https ： //github.com/csaki/OpenSimpleExcelFast.git

 import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; import java.io.PrintWriter; import java.nio.charset.Charset; import java.time.LocalDateTime; import java.time.format.DateTimeFormatter; import java.util.concurrent.ExecutionException; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; import java.util.concurrent.Future; import java.util.zip.ZipFile; public class Launcher { public static final char CHAR_END = (char) -1; public static void main(String[] args) throws IOException, ExecutionException, InterruptedException { long init = System.currentTimeMillis(); String excelFile = "D:/Downloads/BigSpreadsheet.xlsx"; ZipFile zipFile = new ZipFile(excelFile); ExecutorService executor = Executors.newFixedThreadPool(4); Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile)); Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile)); String[] words = futureWords.get(); Object[][] sheet1 = futureSheet1.get(); executor.shutdown(); long end = System.currentTimeMillis(); System.out.println("Main only open and read: " + (end - init) / 1000); ///Doing somethin with the file::Saving as csv init = System.currentTimeMillis(); try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) { for (Object[] rows : sheet1) { for (Object cell : rows) { if (cell != null) { if (cell instanceof Integer) { writer.append(words[(Integer) cell]); } else if (cell instanceof String) { writer.append(toDate(Double.parseDouble(cell.toString()))); } else { writer.append(cell.toString()); //Probably a number } } writer.append(";"); } writer.append("\n"); } } end = System.currentTimeMillis(); System.out.println("Main saving to csv: " + (end - init) / 1000); } private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME; private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2); //The number in excel is from 1900-jan-1, so every number time that you get, you have to sum to that date public static String toDate(double s) { return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600)))); } public static Object[][] processSheet1(ZipFile zipFile) throws IOException { String entry = "xl/worksheets/sheet1.xml"; Object[][] result = null; char[] dimensionToken = "dimension ref=\"".toCharArray(); char[] tokenOpenC = "<cr=\"".toCharArray(); char[] tokenOpenV = "<v>".toCharArray(); char[] tokenAttributS = " s=\"".toCharArray(); char[] tokenAttributT = " t=\"".toCharArray(); try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) { String dimension = extractNextValue(br, dimensionToken, '"'); int[] sizes = extractSizeFromDimention(dimension.split(":")[1]); br.skip(30); //Between dimension and next tag c exists more or less 30 chars result = new Object[sizes[0]][sizes[1]]; String v; while ((v = extractNextValue(br, tokenOpenC, '"')) != null) { int[] indexes = extractSizeFromDimention(v); int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT); char type = 's'; //3 types: number (n), string (s) and date (d) if (s == 0) { // Token S = number or date char read = (char) br.read(); if (read == '1') { type = 'n'; } else { type = 'd'; } } else if (s == -1) { type = 'n'; } String c = extractNextValue(br, tokenOpenV, '<'); Object value = null; switch (type) { case 'n': value = Double.parseDouble(c); break; case 's': value = Integer.parseInt(c); break; case 'd': value = c.toString(); break; } result[indexes[0] - 1][indexes[1] - 1] = value; br.skip(7); ///v></c> } } return result; } public static int[] extractSizeFromDimention(String dimention) { StringBuilder sb = new StringBuilder(); int columns = 0; int rows = 0; for (char c : dimention.toCharArray()) { if (columns == 0) { if (Character.isDigit(c)) { columns = convertExcelIndex(sb.toString()); sb = new StringBuilder(); } } sb.append(c); } rows = Integer.parseInt(sb.toString()); return new int[]{rows, columns}; } public static String[] processSharedStrings(ZipFile zipFile) throws IOException { String entry = "xl/sharedStrings.xml"; String[] words = null; char[] wordCount = "Count=\"".toCharArray(); char[] token = "<t>".toCharArray(); try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) { String uniqueCount = extractNextValue(br, wordCount, '"'); words = new String[Integer.parseInt(uniqueCount)]; String nextWord; int currentIndex = 0; while ((nextWord = extractNextValue(br, token, '<')) != null) { words[currentIndex++] = nextWord; br.skip(11); //you can skip at least 11 chars "/t></si><si>" } } return words; } public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException { char character; int[] indexes = new int[tokens.length]; while ((character = (char) br.read()) != CHAR_END) { if (character == until) { break; } for (int i = 0; i < indexes.length; i++) { if (tokens[i][indexes[i]] == character) { indexes[i]++; if (indexes[i] == tokens[i].length) { return i; } } else { indexes[i] = 0; } } } return -1; } public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException { char character; StringBuilder sb = new StringBuilder(); int index = 0; while ((character = (char) br.read()) != CHAR_END) { if (index == token.length) { if (character == until) { return sb.toString(); } else { sb.append(character); } } else { if (token[index] == character) { index++; } else { index = 0; } } } return null; } public static int convertExcelIndex(String index) { int result = 0; for (char c : index.toCharArray()) { result = result * 26 + ((int) c - (int) 'A' + 1); } return result; } }

大多数使用Office产品的编程语言都有一些中间层，这通常是瓶颈所在，使用PIA / Interop或Open XML SDK就是一个很好的例子。

一种让数据处于较低级别的方式（绕过中间层）是使用驱动程序。

150MB的单张excel文件，大约需要7分钟。

我能做的最好的是135秒内的130MB文件，大约快了3倍：

 Stopwatch sw = new Stopwatch(); sw.Start(); DataSet excelDataSet = new DataSet(); string filePath = @"c:\temp\BigBook.xlsx"; // For .XLSXs we use =Microsoft.ACE.OLEDB.12.0;, for .XLS we'd use Microsoft.Jet.OLEDB.4.0; with "';Extended Properties=\"Excel 8.0;HDR=YES;\""; string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\""; using (OleDbConnection conn = new OleDbConnection(connectionString)) { conn.Open(); OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter ("select * from [Sheet1$]", conn); objDA.Fill(excelDataSet); //dataGridView1.DataSource = excelDataSet.Tables[0]; } sw.Stop(); Debug.Print("Load XLSX tool: " + sw.ElapsedMilliseconds + " millisecs. Records = " + excelDataSet.Tables[0].Rows.Count);

在这里输入图像说明

赢得7×64，英特尔i5,2.3ghz，8GB内存，SSD250GB。

如果我可以推荐一个硬件解决scheme，如果您使用的是标准硬盘，请尝试使用SSD来解决这个问题。

_{注意：我无法下载您的Excel电子表格示例，因为我在公司防火墙后面。}

PS。请参阅MSDN – 最快的方法来导入200MB数据的xlsx文件，共识是OleDB是最快的。

PS 2.以下是如何使用python： http : //code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

我设法使用.NET核心和Open XML SDK在大约30秒内读取文件。

下面的示例返回包含所有行和单元格的对象列表，它们支持date，数字和文本单元格。该项目可在这里： https : //github.com/xferaa/BigSpreadSheetExample/ （应该在Windows，Linux和Mac OS上工作，不需要安装Excel或任何Excel组件）。

 public List<List<object>> ParseSpreadSheet() { List<List<object>> rows = new List<List<object>>(); using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false)) { WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart; WorksheetPart worksheetPart = workbookPart.WorksheetParts.First(); OpenXmlReader reader = OpenXmlReader.Create(worksheetPart); Dictionary<int, string> sharedStringCache = new Dictionary<int, string>(); int i = 0; foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements) { sharedStringCache.Add(i++, el.InnerText); } while (reader.Read()) { if(reader.ElementType == typeof(Row)) { reader.ReadFirstChild(); List<object> cells = new List<object>(); do { if (reader.ElementType == typeof(Cell)) { Cell c = (Cell)reader.LoadCurrentElement(); if (c == null || c.DataType == null || !c.DataType.HasValue) continue; object value; switch(c.DataType.Value) { case CellValues.Boolean: value = bool.Parse(c.CellValue.InnerText); break; case CellValues.Date: value = DateTime.Parse(c.CellValue.InnerText); break; case CellValues.Number: value = double.Parse(c.CellValue.InnerText); break; case CellValues.InlineString: case CellValues.String: value = c.CellValue.InnerText; break; case CellValues.SharedString: value = sharedStringCache[int.Parse(c.CellValue.InnerText)]; break; default: continue; } if (value != null) cells.Add(value); } } while (reader.ReadNextSibling()); if (cells.Any()) rows.Add(cells); } } } return rows; }

我在一台3年前的笔记本电脑上运行了这个程序，这个笔记本电脑有SSD驱动器，8GB内存和一个Windows 10 64位Intel Core i7-4710 CPU @ 2.50GHz（两个内核）。

请注意，尽pipe以stringforms打开和parsing整个文件需要的时间less于30秒，但在使用对象时（如上次编辑的示例），使用我的蹩脚笔记本电脑的时间将快到50秒。你的Linux服务器可能会接近30秒。

诀窍是使用SAX方法，如下所述：

https://msdn.microsoft.com/en-us/library/office/gg575571.aspx

Python的Pandas库可以用来保存和处理你的数据，但是使用它来直接加载.xlsx文件将会非常慢，例如使用read_excel() 。

一种方法是使用Python自动将文件转换为使用Excel本身的CSV文件，然后使用Pandas使用read_csv()加载生成的CSV文件。这会给你一个很好的加速，但不会低于30秒：

 import win32com.client as win32 import pandas as pd from datetime import datetime print ("Starting") start = datetime.now() # Use Excel to load the xlsx file and save it in csv format excel = win32.gencache.EnsureDispatch('Excel.Application') wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx') excel.DisplayAlerts = False wb.DoNotPromptForConvert = True wb.CheckCompatibility = False print('Saving') wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2) excel.Application.Quit() # Use Pandas to load the resulting CSV file print('Loading CSV') df = pd.read_csv(r'c:\full path\temp.csv', dtype=str) print(df.shape) print("Done", datetime.now() - start)

列types
您的列的types可以通过传递parse_dates和converters和parse_dates来指定：

 df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)

您还应该指定infer_datetime_format=True ，因为这将大大加速date转换。

nfer_datetime_format ：布尔值，默认为False

如果启用了True和parse_dates，pandas将尝试推断列中的date时间string的格式，如果可以推断，则切换到parsing它们的更快的方法。在某些情况下，这可以将parsing速度提高5-10倍。

如果date的格式为DD/MM/YYYY则还需要添加dayfirst=True 。

select性列
如果实际上只需要处理列1 9 11 ，那么可以通过指定usecols=[0, 8, 10]来进一步减less资源，如下所示：

 df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])

结果数据框将只包含这3列数据。

RAM驱动器
使用RAM驱动器来存储临时CSV文件将进一步加快加载时间。

注意：这确实假定您正在使用可用Excel的Windows PC。

我正在使用戴尔Precision T1700工作站，并使用c＃我可以打开文件，并在大约24秒内使用标准代码读取它的内容，以使用互操作服务打开工作簿。使用对Microsoft Excel 15.0对象库的引用是我的代码。

我使用的语句：

 using System.Runtime.InteropServices; using Excel = Microsoft.Office.Interop.Excel;

打开和阅读工作簿的代码：

 public partial class MainWindow : Window { public MainWindow() { InitializeComponent(); Excel.Application xlApp; Excel.Workbook wb; Excel.Worksheet ws; xlApp = new Excel.Application(); xlApp.Visible = false; xlApp.ScreenUpdating = false; wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx"); ws = wb.Sheets["Sheet1"]; //string rng = ws.get_Range("A1").Value; MessageBox.Show(ws.get_Range("A1").Value); Marshal.FinalReleaseComObject(ws); wb.Close(); Marshal.FinalReleaseComObject(wb); xlApp.Quit(); Marshal.FinalReleaseComObject(xlApp); GC.Collect(); GC.WaitForPendingFinalizers(); } }

我创build了一个示例Java程序，能够在我的笔记本电脑（Intel i7 4核心，16 GB RAM）〜40秒内加载文件。

https://github.com/skadyan/largefile

该程序使用Apache POI库使用XSSF SAX API加载.xlsx文件。

callback接口com.stackoverlfow.largefile.RecordHandler实现可用于处理从Excel中加载的数据。这个接口只定义了一个带三个参数的方法

sheetname：string，excel表名
行号：int，行数
和data map ：Map：excel单元格引用和excel格式化的单元格值

类com.stackoverlfow.largefile.Main演示了这个接口的一个基本实现，它只是在控制台上打印行号。

更新

woodstoxparsing器似乎比标准SAXReader有更好的性能。（代码更新回购）。

另外为了满足所需的性能要求，您可以考虑重新实现org.apache.poi...XSSFSheetXMLHandler 。在实现中，可以实现更优化的string/文本值处理，并且可以跳过不必要的文本格式化操作。

看起来在Python中几乎是不可能实现的。如果我们解压一个表单数据文件，那么只需要30秒就可以通过基于C的迭代SAXparsing器（使用lxml ，这是一个快速封装libxml2 ）：

 from __future__ import print_function from lxml import etree import time start_ts = time.time() for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',), collect_ids=False, resolve_entities=False, huge_tree=True): pass print(time.time() - start_ts)

示例输出：27.2134890556

顺便说一下，Excel本身需要大约40秒来加载工作簿。

c＃和ole解决scheme仍然有一些瓶颈。所以我用c ++和ado来testing它。

 _bstr_t connStr(makeConnStr(excelFile, header).c_str()); TESTHR(pRec.CreateInstance(__uuidof(Recordset))); TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText)); while(!pRec->adoEOF) { for(long i = 0; i < pRec->Fields->GetCount(); ++i) { _variant_t v = pRec->Fields->GetItem(i)->Value; if(v.vt == VT_R8) num[i] = v.dblVal; if(v.vt == VT_BSTR) str[i] = v.bstrVal; ++cellCount; } pRec->MoveNext(); }

在i5-4460和HDD机器上，我发现xls中的50万个单元格会占用1.5个字节，而在xlsx中的相同数据量则需要2.829个。因此，可以在30秒内处理您的数据。

如果你真的需要30岁以下，使用RAM驱动器来减less文件IO.It将显着改善你的过程。我无法下载你的数据来testing，所以请告诉我结果。

另一种应该大大改善负载/操作时间的方法是RAMDrive

创build一个RAMDrive足够的空间为您的文件和10％.. 20％的额外空间…
复制RAMDrive的文件…
从那里加载文件…取决于你的驱动器和文件系统的速度提高应该是巨大的…

我最喜欢的是IMDisk工具包
（ https://sourceforge.net/projects/imdisk-toolkit/ ）在这里你有一个强大的命令行脚本的一切…

我也推荐SoftPerfect ramdisk
（ http://www.majorgeeks.com/files/details/softperfect_ram_disk.html ）

但这也取决于您的操作系统…

我想有更多的信息关于您打开文件的系统…无论如何：

查看您的系统中是否有Windows更新
“Office的Office文件validation加载项…”

如果你有它…卸载它…
该文件应加载得更快
特别是如果从共享加载

保存您的Excel工作表到一个制表符分隔的文件，并打开它，因为你通常会读一个普通的TXT 🙂

编辑：然后，您可以逐行阅读文件，并在选项卡上拆分行。通过索引获取所需的数据列。

你有没有尝试加载工作表的需求，从xlrd 0.7.1版本可用？

为此，您需要将on_demand=True传递给open_workbook() 。

xlrd.open_workbook（filename = None，logfile = <_ io.TextIOWrapper name =''mode ='w'encoding ='UTF-8'>，verbosity = 0，use_mmap = 1，file_contents = None，encoding_override = None，formatting_info = False，on_demand = False，ragged_rows = False）

我发现读取xlsx文件的其他潜在的python解决scheme：

阅读'xl / sharedStrings.xml'和'xl / worksheets / sheet1.xml'中的原始xml

尝试openpyxl库的只读模式，该模式声明在大文件的内存使用情况下也被优化。

 from openpyxl import load_workbook wb = load_workbook(filename='large_file.xlsx', read_only=True) ws = wb['big_data'] for row in ws.rows: for cell in row: print(cell.value)

如果你在Windows上运行，你可以使用PyWin32和'Excel.Application'

 import time import win32com.client as win32 def excel(): xl = win32.gencache.EnsureDispatch('Excel.Application') ss = xl.Workbooks.Add() ...

如何有效地打开一个巨大的Excel文件

srand函数返回相同的值

当没有剩余内存时，.Net和Bitmap不会由GC自动处理

打开文件ReadOnly

如果我在C / C ++中定义一个0大小的数组会发生什么？

什么是逗号运算符的正确使用？

指针指针的指针

string和C＃中的string有什么区别？

对象数组初始化没有默认的构造函数

如何确定编译器使用的C ++标准的版本？

如何用g ++创build一个静态库？