Change the data type of columns in Pandas

I want to convert a table, represented as a list of lists, into a Pandas DataFrame. As an extremely simplified example:

    import pandas as pd

    a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
    df = pd.DataFrame(a)

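What is the best way to convert the columns to the appropriate types, in this case columns 2 and 3 into floats? Is there a way to specify the types while converting to a DataFrame? Or is it better to create the DataFrame first and then loop through the columns to change the type of each one? Ideally I would like to do this dynamically, because there can be hundreds of columns and I don't want to specify exactly which columns are of which type. All I can guarantee is that each column contains values of the same type.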

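You can use pd.to_numeric (introduced in version 0.17) to convert a column or a Series to a numeric type. The function can also be applied over multiple columns of a DataFrame.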

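Importantly, the function also takes an errors keyword argument that lets you force non-numeric values to be NaN, or simply ignore columns containing such values.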

Example uses are shown below.

Individual column / Series

Here is an example using a Series of strings, which has object dtype:

    >>> s = pd.Series(['1', '2', '4.7', 'pandas', '10'])
    >>> s
    0         1
    1         2
    2       4.7
    3    pandas
    4        10
    dtype: object

The function's default behaviour is to raise an error if it cannot convert a value. In this case, it cannot cope with the string 'pandas':

    >>> pd.to_numeric(s)  # or pd.to_numeric(s, errors='raise')
    ValueError: Unable to parse string

Rather than fail, we might want 'pandas' to be treated as a missing/bad value. We can coerce invalid values to NaN as follows:

    >>> pd.to_numeric(s, errors='coerce')
    0     1.0
    1     2.0
    2     4.7
    3     NaN
    4    10.0
    dtype: float64
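After coercing, it can be useful to know which entries were turned into NaN. The following is a minimal sketch of one way to find them; the variable names `converted` and `failed` are illustrative and not part of the original answer.

```python
import pandas as pd

s = pd.Series(['1', '2', '4.7', 'pandas', '10'])

# Unparseable entries become NaN instead of raising
converted = pd.to_numeric(s, errors='coerce')

# Entries that were present in the input but are NaN after conversion
failed = s[converted.isna() & s.notna()]
print(failed)
```

Checking `s.notna()` as well keeps values that were already missing in the input from being reported as conversion failures.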

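The third option is to skip the conversion altogether if an invalid value is encountered (note that errors='ignore' is deprecated as of pandas 2.2):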

    >>> pd.to_numeric(s, errors='ignore')
    # the original Series is returned untouched

Multiple columns / entire DataFrames

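We might want to apply this operation to multiple columns. Processing each column in turn is tedious, so we can use DataFrame.apply to have the function act on each column.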

Borrowing the DataFrame from the question:

    >>> a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
    >>> df = pd.DataFrame(a, columns=['col1','col2','col3'])
    >>> df
      col1 col2  col3
    0    a  1.2   4.2
    1    b   70  0.03
    2    x    5     0

Then we can write:

    df[['col2','col3']] = df[['col2','col3']].apply(pd.to_numeric)

Now 'col2' and 'col3' have dtype float64, as desired.

However, we might not know which of our columns can be converted reliably to a numeric type. In that case we can just write:

    df.apply(pd.to_numeric, errors='ignore')

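The function is then applied to the whole DataFrame. Columns that can be converted to a numeric type will be converted, while columns that cannot (e.g. they contain non-digit strings or dates) will be left alone.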
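Because errors='ignore' is deprecated in pandas 2.2 and later, the same "convert what can be converted" behaviour can be approximated by catching the exception for each column. This is only a sketch: `to_numeric_where_possible` is a hypothetical helper name, not a pandas function.

```python
import pandas as pd

def to_numeric_where_possible(df):
    """Hypothetical helper: convert every column that parses fully as numeric."""
    out = df.copy()  # work on a copy so the caller's DataFrame is untouched
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # At least one value could not be parsed; keep the column as it is
            pass
    return out

a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])

converted = to_numeric_where_possible(df)
print(converted.dtypes)
```

Unlike errors='coerce', this leaves a column completely unchanged as soon as one of its values fails to parse.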

There are also pd.to_datetime and pd.to_timedelta for conversion to datetimes and timedeltas.
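Both functions accept the same errors keyword as pd.to_numeric. A brief sketch with made-up sample values, where unparseable entries are coerced to NaT:

```python
import pandas as pd

# Illustrative data; the last entry of each Series cannot be parsed
dates = pd.Series(['2020-01-01', '2020-02-15', 'not a date'])
durations = pd.Series(['1 days', '2 hours', 'unknown'])

# Invalid entries become NaT (the datetime/timedelta counterpart of NaN)
parsed_dates = pd.to_datetime(dates, errors='coerce')
parsed_durations = pd.to_timedelta(durations, errors='coerce')

print(parsed_dates)
print(parsed_durations)
```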

Soft conversions

Version 0.21.0 introduced the method infer_objects() for converting columns of a DataFrame that have an object datatype to a more specific type.

For example, let's create a DataFrame with two columns of object type, one holding integers and the other holding strings of integers:

    >>> df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3','2','1']}, dtype='object')
    >>> df.dtypes
    a    object
    b    object
    dtype: object

Then, using infer_objects(), we can change the type of column 'a' to int64:

    >>> df = df.infer_objects()
    >>> df.dtypes
    a     int64
    b    object
    dtype: object

Column 'b' has been left alone because its values are strings, not integers. If we wanted to force both columns to an integer type, we could use df.astype(int) instead.
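For comparison, here is a short sketch of the forced conversion on the same data. It uses 'int64' rather than int so that the resulting dtype does not depend on the platform; that choice is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({'a': [7, 1, 5], 'b': ['3', '2', '1']}, dtype='object')

# astype parses the integer strings in 'b' as well as the integers in 'a'
forced = df.astype('int64')
print(forced.dtypes)
```

Unlike infer_objects(), astype is a hard conversion: a value that cannot be parsed as an integer, such as 'x', would be expected to raise a ValueError.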

How about this?

    a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
    df = pd.DataFrame(a, columns=['one', 'two', 'three'])
    df
    Out[16]:
      one  two three
    0   a  1.2   4.2
    1   b   70  0.03
    2   x    5     0

    df.dtypes
    Out[17]:
    one      object
    two      object
    three    object

    df[['two', 'three']] = df[['two', 'three']].astype(float)

    df.dtypes
    Out[19]:
    one       object
    two      float64
    three    float64

Here is a function that takes a DataFrame and a list of columns as arguments and coerces all data in those columns to numbers.

    # df is the DataFrame, and column_list is a list of columns as strings
    # (e.g. ["col1", "col2", "col3"])
    # dependencies: pandas
    def coerce_df_columns_to_numeric(df, column_list):
        df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

So, for example:

    import pandas as pd

    def coerce_df_columns_to_numeric(df, column_list):
        df[column_list] = df[column_list].apply(pd.to_numeric, errors='coerce')

    a = [['a', '1.2', '4.2'], ['b', '70', '0.03'], ['x', '5', '0']]
    df = pd.DataFrame(a, columns=['col1','col2','col3'])

    coerce_df_columns_to_numeric(df, ['col2','col3'])
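The function above modifies df in place and returns None. If the original DataFrame should stay untouched, a variant along these lines might be preferable. This is a sketch only: `coerced_to_numeric` is a hypothetical name, and the 'oops' value is made up to show the coercion.

```python
import pandas as pd

def coerced_to_numeric(df, column_list):
    """Hypothetical variant: return a converted copy and leave df unchanged."""
    out = df.copy()
    out[column_list] = out[column_list].apply(pd.to_numeric, errors='coerce')
    return out

# Same shape as the example data, with one deliberately invalid value
a = [['a', '1.2', '4.2'], ['b', 'oops', '0.03'], ['x', '5', '0']]
df = pd.DataFrame(a, columns=['col1', 'col2', 'col3'])

result = coerced_to_numeric(df, ['col2', 'col3'])
print(result.dtypes)
print(result)
```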

How about creating two dataframes, each with different data types for their columns, and then appending them together?

    # DataFrame.append was removed in pandas 2.0, so this needs an older version
    d1 = pd.DataFrame(columns=['float_column'], dtype=float)
    d1 = d1.append(pd.DataFrame(columns=['string_column'], dtype=str))

Results

    In [8]: d1.dtypes
    Out[8]:
    float_column     float64
    string_column     object
    dtype: object

After the dataframe is created, you can populate it with floating point values in the first column and strings (or any data type you desire) in the second column.
