Kaggle Tips: Memory Optimization Methods

Reposted from the WeChat official account Coggle数据科学; copyright belongs to the original author.

On Kaggle and in everyday work, memory is always limited. How can we get code to run within that limit? This article collects some practical approaches.

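Measuring memory usage

Before optimizing, use a function like the following to measure how much memory the current process is using.

import os

import numpy as np
import psutil

def cpu_stats():
    # Resident memory of the current Python process, converted from bytes to GB
    pid = os.getpid()
    py = psutil.Process(pid)
    memory_use = py.memory_info()[0] / 2.0 ** 30
    return "memory GB:" + str(np.round(memory_use, 2))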

For data loaded with pandas, the following calls report memory usage:

# Total memory usage
df.info(memory_usage="deep")

# Per-column memory usage
df.memory_usage()
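The difference between the default and the "deep" measurement matters most for string columns. The following is a minimal sketch on a made-up DataFrame (the column names and values are assumptions for illustration); it forces the object dtype so that the behavior does not depend on the pandas version's default string type.

```python
import pandas as pd

# Hypothetical sample data; the string column is forced to object dtype.
df = pd.DataFrame({
    "id": range(1000),
    "city": pd.Series(["Beijing", "Shanghai"] * 500, dtype=object),
})

shallow = df.memory_usage()         # object columns: counts only the 8-byte pointers
deep = df.memory_usage(deep=True)   # also counts the Python string objects themselves

print(shallow["city"], deep["city"])
```

For numeric columns both numbers agree, so the deep measurement is mainly useful for spotting expensive string columns.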

For a whole application, the filprofiler tool (Fil) can be used to find peak memory usage.

https://github.com/pythonspeed/filprofiler

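NumPy memory optimization

Convert data types

NumPy supports many data types, and their memory footprints differ widely. For example, a uint64 array uses four times the memory of a uint16 array of the same shape.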

>>> import numpy as np
>>> int64arr = np.ones((1024, 1024), dtype=np.uint64)
>>> int64arr.nbytes
8388608

>>> int16arr = np.ones((1024, 1024), dtype=np.uint16)
>>> int16arr.nbytes
2097152

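Choose the data type from the range of values in the array. For integers, refer to the ranges of the common types below and pick the narrowest one that fits.

Type	Range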
int8	(-128 to 127)
int16	(-32768 to 32767)
int32	(-2147483648 to 2147483647)
int64	(-9223372036854775808 to 9223372036854775807)
uint8	(0 to 255)
uint16	(0 to 65535)
uint32	(0 to 4294967295)
uint64	(0 to 18446744073709551615)
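The table lookup can be automated. The following is a minimal sketch, not part of the original article: `smallest_int_dtype` is a hypothetical helper that walks the candidate types in order and returns the first one whose range, as reported by `np.iinfo`, covers the data.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Return the narrowest NumPy integer dtype whose range covers [lo, hi]."""
    lo, hi = int(lo), int(hi)  # plain Python ints avoid overflow in comparisons
    if lo >= 0:
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    else:
        candidates = [np.int8, np.int16, np.int32, np.int64]
    for dtype in candidates:
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    raise ValueError("range does not fit in a 64-bit integer")

arr = np.array([0, 17, 300, 65535], dtype=np.int64)
small = arr.astype(smallest_int_dtype(arr.min(), arr.max()))
print(small.dtype, arr.nbytes, small.nbytes)
```

Leaving some headroom is often wise if later rows may contain larger values than the ones seen so far.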

For floating-point data, consider float16, float32, or float64. The full list of data types NumPy supports is in the documentation.

https://numpy.org/devdocs/user/basics.types.html
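Narrower floats save memory at the cost of range and precision. The sketch below uses made-up values to show both effects: float16 cannot represent numbers above 65504 and keeps only about three significant decimal digits.

```python
import numpy as np

values = np.array([0.1, 1000.123, 70000.0], dtype=np.float64)

as32 = values.astype(np.float32)
with np.errstate(over="ignore"):        # silence the expected overflow warning
    as16 = values.astype(np.float16)    # 70000 exceeds the float16 maximum of 65504

print(values.nbytes, as32.nbytes, as16.nbytes)
print(as16)   # the last element has overflowed to inf
```

In practice float32 is usually a safe default for features, while float16 needs a check of the value range first.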

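Use sparse matrices

If the matrix is mostly zeros, consider a sparse representation. Both LightGBM and XGBoost accept sparse matrices for training.

>>> import sparse
>>> import numpy as np
>>> arr = np.random.random((1024, 1024))
>>> arr[arr < 0.9] = 0
>>> sparse_arr = sparse.COO(arr)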

>>> arr.nbytes
8388608

>>> sparse_arr.nbytes
2514648
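The same idea can be sketched with scipy.sparse instead of the sparse package used above. This is a swapped-in alternative, chosen on the assumption that a SciPy CSR matrix is the form you want to hand to the training library. A SciPy matrix is measured here by summing the sizes of its three underlying arrays.

```python
import numpy as np
from scipy import sparse as sp

rng = np.random.default_rng(0)
dense = rng.random((1024, 1024))
dense[dense < 0.9] = 0          # keep roughly 10% of the entries

csr = sp.csr_matrix(dense)

# CSR stores the non-zero values, their column indices, and row pointers
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense.nbytes, csr_bytes)
```

The saving grows with sparsity; for a matrix that is mostly non-zero, the index arrays can make the sparse form larger than the dense one.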

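Pandas memory optimization

Read in chunks

If the data file is very large, read it in batches and control the batch size with chunksize.

reader = pd.read_csv(path, chunksize=1000000)

for chunk in reader:
    # Process one batch at a time
    pass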
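Chunked reading pays off when each batch can be reduced to a small result before the next one is loaded. Here is a minimal sketch with a tiny in-memory CSV standing in for a large file; the column names `user` and `amount` are assumptions for illustration.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for a file too large to load at once
csv_text = "user,amount\n" + "\n".join(f"u{i % 3},{i}" for i in range(10))

totals = {}
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Aggregate within the chunk, then merge into the running totals
    partial = chunk.groupby("user")["amount"].sum()
    for user, amount in partial.items():
        totals[user] = totals.get(user, 0) + int(amount)

print(totals)
```

Only one chunk and the small `totals` dictionary are in memory at any time.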

Read only selected columns

df = pd.read_csv(path, usecols=["a"])

Specify column dtypes up front

df = pd.read_csv(path, dtype={"a":"int8"})
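The two options can be combined in one call. The sketch below compares a default read with a slimmed-down read of the same made-up CSV (columns `a`, `b`, `c` are hypothetical).

```python
import io
import pandas as pd

csv_text = "a,b,c\n1,0.5,x\n2,1.5,y\n3,2.5,z\n"

default = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["a", "b"],                    # skip column "c" entirely
    dtype={"a": "int8", "b": "float32"},   # parse straight into narrow types
)

print(default.dtypes.to_dict())
print(slim.dtypes.to_dict())
print(slim.memory_usage(index=False).sum())
```

Setting the dtype at read time avoids ever materializing the wider int64 and float64 columns.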

Convert categorical columns to the category dtype

df['a'] = df['a'].astype('category')

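This is very effective for categorical columns and gives a high compression ratio. In addition, once a column has the category dtype, LightGBM can treat it as a categorical feature during training.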
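To see the effect, here is a minimal sketch on a made-up string column with only three distinct values. The category dtype stores each distinct string once and replaces the rows with small integer codes.

```python
import pandas as pd

# Hypothetical low-cardinality column: 100,000 rows, 3 distinct values
s = pd.Series(["red", "green", "blue", "green"] * 25_000, dtype=object)
cat = s.astype("category")

before = s.memory_usage(deep=True)
after = cat.memory_usage(deep=True)
print(before, after, cat.cat.codes.dtype)
```

The gain shrinks as the number of distinct values approaches the number of rows, so this suits low-cardinality columns best.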

Detect and convert dtypes automatically

import numpy as np

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            # Print current column type
            print("")
            print("Column: ", col)
            print("dtype before: ", props[col].dtype)

            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()

            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                props[col] = props[col].fillna(mn - 1)
                mn = mn - 1  # the fill value is now the column minimum

            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint).abs().sum()
            if result < 0.01:
                IsInt = True

            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx <= np.iinfo(np.uint8).max:
                        props[col] = props[col].astype(np.uint8)
                    elif mx <= np.iinfo(np.uint16).max:
                        props[col] = props[col].astype(np.uint16)
                    elif mx <= np.iinfo(np.uint32).max:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn >= np.iinfo(np.int8).min and mx <= np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn >= np.iinfo(np.int16).min and mx <= np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn >= np.iinfo(np.int32).min and mx <= np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    else:
                        props[col] = props[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)

            # Print new column type
            print("dtype after: ", props[col].dtype)
            print("")

    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024 ** 2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist

https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65

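Combine with numpy.memmap

numpy.memmap maps an array onto a file on disk: the space is reserved on disk up front, and the data does not have to be loaded into memory. It also supports repeated writes.

So each column can be processed and written to disk in turn, and the result read back into memory once all processing is finished.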

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56105

https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
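The column-by-column workflow described above can be sketched as follows. This is a minimal illustration under assumed names and sizes: the file name `features.dat`, the shape, and the placeholder "processing" step are all hypothetical.

```python
import os
import tempfile
import numpy as np

n_rows, n_cols = 1000, 3
path = os.path.join(tempfile.mkdtemp(), "features.dat")  # hypothetical file name

# mode="w+" creates the file and reserves n_rows * n_cols * 4 bytes on disk
out = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
for j in range(n_cols):
    # Placeholder for real per-column processing; only one column is built at a time
    out[:, j] = np.arange(n_rows, dtype=np.float32) * (j + 1)
out.flush()   # make sure everything is written to disk
del out

# Reopen read-only; pages are loaded lazily as they are accessed
data = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))
print(data[:2])
```

Note that the dtype and shape are not stored in the file, so they have to be supplied again when reopening it.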

Model memory optimization

XGBoost

Store the dataset in libsvm format and train with XGBoost's External Memory Version, or train from the command line.

https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html

LightGBM

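Letting LightGBM's own Dataset read the file directly for training works better than going through NumPy or Pandas data. Converting in-memory data to a Dataset also helps to some extent.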

https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

Set the histogram_pool_size parameter to limit memory usage; reducing num_leaves and max_bin also helps.

https://lightgbm.readthedocs.io/en/latest/FAQ.html?highlight=Multiple#when-running-lightgbm-on-a-large-dataset-my-computer-runs-out-of-ram
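As a rough sketch, the three parameters mentioned above could be collected in a parameter dictionary like the one below. The specific values and the `objective` entry are illustrative assumptions, not recommendations from the original article.

```python
# Illustrative values only; tune them for the dataset at hand.
params = {
    "objective": "binary",
    "histogram_pool_size": 1024,  # cap on the histogram cache in MB (default -1: no limit)
    "num_leaves": 15,             # fewer leaves than the default of 31
    "max_bin": 63,                # fewer feature bins than the default of 255
}
print(params)
```

A dictionary like this would be passed as the params argument when training. Smaller num_leaves and max_bin values can cost some accuracy, so they are worth validating.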

Deep learning models

For deep learning models, consider using a DataLoader to read the data into memory batch by batch.
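The idea behind batch loading can be shown without any deep learning framework. The generator below is a simplified stand-in for a framework DataLoader, not its actual implementation; `iter_batches` is a hypothetical helper that yields one slice at a time, so only the current batch needs to be materialized.

```python
import numpy as np

def iter_batches(features, labels, batch_size):
    """Yield (features, labels) slices of at most batch_size rows."""
    for start in range(0, len(features), batch_size):
        stop = start + batch_size
        # np.asarray copies a memmap slice into RAM only for this batch
        yield np.asarray(features[start:stop]), np.asarray(labels[start:stop])

X = np.arange(20, dtype=np.float32).reshape(10, 2)
y = np.arange(10)

shapes = [xb.shape for xb, yb in iter_batches(X, y, batch_size=4)]
print(shapes)
```

Because it only slices its inputs, the same generator works when `features` is a disk-backed numpy.memmap array instead of an in-memory one.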

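Summary

Inspect the columns and rows, and read only the data you need;
Inspect the data types and convert them to narrower ones;
Process the data in batches or with the help of the disk.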


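This article was published by Molly老师 on 国际竞赛联盟. When reposting, please keep the article intact and include the source (国际竞赛联盟) and a link to this page.
Original link: //www.paulakaye.com/zlk/324856.html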
