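A junior labmate of mine once tried using TDA to analyze a fairly large dataset (about 100,000 nodes). At the time I flatly doubted that TDA could handle it. I turned out to be wrong: after a day and a night the run did finish (and the computer his license was bound to was not a particularly good one).
After reading this article today, I realized that TDA does have mechanisms for handling large datasets. It seems I had underestimated how efficiently the software runs.
Although I will not be using TDA for data analysis myself, the following tips are still worth keeping in mind for any large-scale data analysis.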
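1. Use a 64-bit operating system and install as much RAM as possible;
2. Close any system programs unrelated to the analysis;
3. Import a small number of fields first, then add more fields step by step.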
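In addition, a note on the characteristics of fields.
Fields containing a large number of items tend to consume a lot of system resources. Such fields often include:
1. Fields containing "NLP" words or phrases;
2. Cited-reference fields and fields derived from cited references, including "Cited Authors" and "Cited Journals";
3. Authors, inventors, full organization names, or fields with uncontrolled vocabulary terms.
4. Whenever possible, avoid importing long-text data (e.g., patent claims, abstracts).
5. Some fields also have an interesting property: their record frequency distribution has a long tail, meaning that the vast majority of items occur in only one or two records. In that case, consider creating a new field that keeps only the items occurring in at least N records, and then deleting the original, much larger field.
6. Finally, the import order given in the list is very useful.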
Overview:
When you are working with large datasets in Thomson Data Analyzer (TDA), you may see an error message that you are running out of RAM. The following guidelines may help you to free up some system memory and continue to work. Which guidelines to apply will depend on your analytical needs and where you are in the workflow process. Strategies discussed in this document include:
Use a 64-bit operating system and install the maximum amount of RAM supported by your computer.
Close other programs that are not essential to your analysis.
Import a small number of fields at first; use “Import More Fields” to add other fields later.
Use a 64-bit OS and Install the Maximum RAM
Thomson Data Analyzer is a 32-bit application, and is subject to the per-process memory usage limits of the operating system. These limits exist regardless of how much physical memory the computer has installed.
If you are using Thomson Data Analyzer on a 32-bit version of Windows, the maximum amount of memory that TDA can use is 2 gigabytes. On a 64-bit Windows system, Thomson Data Analyzer can use up to 3 GB.
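To make the point concrete, here is a minimal Python sketch of the arithmetic implied above. The function names `tda_memory_ceiling_gb` and `usable_memory_gb` are hypothetical, and the 2 GB and 3 GB figures are simply the ones quoted in this document, not values queried from Windows or from TDA.

```python
def tda_memory_ceiling_gb(os_is_64bit: bool) -> int:
    """Per-process ceiling (in GB) quoted in this document for TDA."""
    # Assumption: 2 GB on 32-bit Windows, 3 GB on 64-bit Windows, as stated above.
    return 3 if os_is_64bit else 2


def usable_memory_gb(installed_ram_gb: float, os_is_64bit: bool) -> float:
    """Memory TDA could use at most: the smaller of installed RAM and the ceiling."""
    return min(installed_ram_gb, tda_memory_ceiling_gb(os_is_64bit))


# Installing 16 GB does not lift the ceiling; it only helps the rest of the system.
print(usable_memory_gb(16, os_is_64bit=True))
print(usable_memory_gb(16, os_is_64bit=False))
```

Under these assumptions, extra physical RAM beyond the ceiling does not raise what TDA itself can use, which is why the remaining guidelines focus on reducing what TDA has to hold in memory.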
Close Non-essential Programs and *.vpt Files.
If you have other applications running that are not essential to your workflow, close them to make more system memory available for TDA to use. If you have more than one Thomson Data Analyzer data file (*.vpt) open, close all open data files except the one in which you are currently working.
Maintain a Dataset with as Few Fields as Possible
When you maintain a dataset with only the essential fields, you also keep the size (in MB) of the *.vpt file on disk as small as possible. This is especially important when you import raw data files, and it is advisable to import only the “Title” field at first, so you do not run out of memory before you save the *.vpt file to disk. Once you have saved your dataset as a *.vpt file, exit and restart TDA (to free up as much memory as possible) and open your saved dataset. You can use “Import More Fields” (from TDA’s “Fields” menu) to add other fields you need after your data is imported and saved to a *.vpt file.
Here is a useful table to guide you in selecting a minimum set of fields:
Minimum Field Set | New Fields Added | Additional Fields Needed
Use discretion when choosing which fields to add. Whenever possible, avoid importing fields with Long Text (e.g. Patent Claims, Abstracts).
Fields with a very large number of items will also consume a lot of system resources. Examples of such fields include:
Fields with “NLP” Words or Phrases
“Cited References” fields and fields derived from Cited References (e.g. “Cited Authors” or “Cited Journals”).
Authors, Inventors, Full Organization Names, or fields with Uncontrolled vocabulary terms (see note below)
Note: Delete existing large fields that are not in use, but only if they can be readily imported again using “Import More Fields.” Take care not to delete fields that have Groups you want to keep, or “Cleaned” fields. “Cleaned” fields cannot be readily re-imported with “Import More Fields”[1] (but the originating field on which the cleaning was done can usually be safely deleted).
Fields that include a lot of items also tend to have long tails on their record frequency distributions. That is, a vast majority of the terms will occur in only one or two records. When this is the case, consider creating a group of all terms that occur in at least N records. You can then use “Create Field using Group Items” to make a new field with far fewer items, and delete the originating, much larger field.
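The long-tail trimming described above can be sketched outside TDA in a few lines of Python. This is only an illustration of the idea, not TDA's own implementation: the function names `frequent_terms` and `reduce_field` and the sample author names are hypothetical, and each record is assumed to be a plain list of terms.

```python
from collections import Counter


def frequent_terms(records, min_records):
    """Return the set of terms that occur in at least `min_records` records."""
    counts = Counter()
    for terms in records:
        counts.update(set(terms))  # count each term at most once per record
    return {term for term, n in counts.items() if n >= min_records}


def reduce_field(records, keep):
    """Build a smaller field that retains only the terms in `keep`."""
    return [[term for term in terms if term in keep] for terms in records]


# Hypothetical sample data: one list of author names per record.
records = [
    ["Smith J", "Lee K"],
    ["Smith J", "Garcia M"],
    ["Smith J", "Lee K", "Chen W"],
]

keep = frequent_terms(records, min_records=2)  # plays the role of the "group"
reduced = reduce_field(records, keep)          # plays the role of the new field
print(sorted(keep))
print(reduced)
```

Here the terms that appear in only one record ("Garcia M" and "Chen W") are dropped, so the new field has two distinct items instead of four. On a real long-tailed field, the reduction would typically be far larger.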