Facing the processor cache: some old performance tuning tricks are now obsolete

2011-05-31 · Source: original content · Category: CPP · Views: 196

Please note that this article does not explain what a processor cache is; if you are unfamiliar with the concept of the CPU cache, please look it up first.
Also, this article is about languages that compile to machine code, such as C and C++. For bytecode languages such as Java and .NET, the points made here may not apply; at least I have not studied them.

First, let me explain where these "old optimization tricks" come from.
The reason is simple: if, like me, you have spent years developing with technologies such as J2ME or Flash, you were unlikely to pay any attention to the processor cache, and some of the performance tricks you rely on become ineffective once the cache enters the picture.
And if your knowledge of CPUs, assembly, and optimization is, like mine, stuck in the 80386 era, then the optimization techniques you and I mastered are definitely out of date.

Expired technique 1: using a precomputed lookup table

Here is how a lookup table is used to count the number of 1 bits in a 32-bit integer.

static const unsigned char BitsSetTable256[256] =
{
//  256 precomputed entries: the number of 1 bits in each 8-bit value
};

int calculateBitsCount(unsigned int n)
{
    unsigned char * p = (unsigned char *)&n;
    return BitsSetTable256[p[0]] +
           BitsSetTable256[p[1]] +
           BitsSetTable256[p[2]] +
           BitsSetTable256[p[3]];
}
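The 256-entry table elided above does not have to be typed out by hand; it can be filled once at startup. A minimal sketch, assuming a 4-byte unsigned int (the initBitsSetTable name is mine):

```cpp
static unsigned char BitsSetTable256[256];

// Fill the table once at startup: bits(n) = bits(n/2) + (n & 1).
static void initBitsSetTable()
{
    BitsSetTable256[0] = 0;
    for (int i = 1; i < 256; ++i)
        BitsSetTable256[i] =
            BitsSetTable256[i / 2] + (unsigned char)(i & 1);
}

// Same lookup as above: sum the popcounts of the four bytes.
int calculateBitsCount(unsigned int n)
{
    unsigned char *p = (unsigned char *)&n;
    return BitsSetTable256[p[0]] + BitsSetTable256[p[1]] +
           BitsSetTable256[p[2]] + BitsSetTable256[p[3]];
}
```

Note that byte order does not matter here, since the four per-byte counts are simply summed.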

Cool, isn't it? With only four additions and no branches, we might assume this algorithm beats any bit-counting method that loops, multiplies, or divides.
Once the CPU's data cache enters the picture, though, the situation changes. The first time calculateBitsCount reads BitsSetTable256, it is likely to evict other data from the cache so that the table's memory can be loaded, wasting hundreds of instruction cycles; and hundreds of cycles are enough for an ordinary method to count the bits outright.
Take, for example, the following algorithm from
http://graphics.stanford.edu/~seander/bithacks.html

unsigned int v; // the value whose set bits we want to count
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
  v &= v - 1; // clear the least significant bit set
}
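For side-by-side testing with the table version, the loop above can be wrapped in a self-contained function (the name countBitsKernighan is mine; the technique is usually credited to Brian Kernighan):

```cpp
// Each v &= v - 1 clears the lowest set bit, so the loop body
// runs exactly once per set bit and touches no memory at all.
int countBitsKernighan(unsigned int v)
{
    int c;
    for (c = 0; v; ++c)
        v &= v - 1;
    return c;
}
```

Because it reads nothing from memory, its cost does not depend on the state of the data cache, which is exactly the point being made here.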

This algorithm appears to use far more instructions than the table lookup above, and it even loops. But remember: these instructions are vastly cheaper than the memory accesses they replace (unless the code grows beyond the instruction cache, which it does not here), so the trade is well worth it. Truly worth it: after I replaced the lookup table with this method, the code really did get faster.

Expired technique 2: caching an object's member in a local variable

Note that this technique is still effective in most cases; the point here is only that it can fail in some situations.
For example, consider a function like this:

void func(SomeObject * obj)
{
    int count = obj->getCount();

    for(int i = 0; i < 100; ++i) {
        for(int k = 0; k < 100; ++k) {
            for(int p = 0; p < count; ++p) {
                //  Process data in obj
            }
        }
    }
}

Assume getCount simply returns a number.
This looks fine, even ideal, but a closer look reveals a problem. If all the local variables fit in registers, there is no problem. But what if count cannot be kept in a register? Then every loop iteration must read count from the stack while also touching the data in obj, and those two locations quite likely do not share a cache line. That leads to constant data-cache traffic, which is slow!
If you drop the count variable and change the innermost loop to

for(int p = 0; p < obj->getCount(); ++p) {
    //  Process data in obj
}

then all reads stay within obj; if that data fits in the cache, the loop will be quite fast.
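Here is a compilable sketch of the rewritten function. SomeObject is hypothetical: the struct, its data member, and getCount are stand-ins I made up for illustration.

```cpp
#include <vector>

// Hypothetical stand-in for the article's SomeObject.
struct SomeObject {
    std::vector<int> data;
    int getCount() const { return (int)data.size(); }
};

// The loop bound is read through obj on every iteration, so the
// bound lives among the data we are already touching in obj.
long long func(SomeObject *obj)
{
    long long sum = 0;
    for (int i = 0; i < 100; ++i)
        for (int k = 0; k < 100; ++k)
            for (int p = 0; p < obj->getCount(); ++p)
                sum += obj->data[p];   // "process data in obj"
    return sum;
}
```

Whether this beats the cached-count version depends on register pressure and on whether getCount gets inlined; as always with cache effects, measure both.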

Expired technique 3: doing everything in a single loop

We tend to feel that loops are slow because of the jumps involved, so we would rather do everything inside one loop.

ObjectA * objA;
ObjectB * objB;
for(int i = 0; i < 100; ++i) {
    //  Do something with objA
    //  Do something else with objA
    //  Do something with objB
}

There are two problems here:
1. If the code in the loop is longer than the instruction cache, every iteration churns the instruction cache. No matter how many cache levels the CPU has, L1 gets evicted and reloaded, and reloading L1 is always slower than hitting it directly.
2. More troublesome, the loop operates on two separate blocks of data. Unless the two objects happen to be allocated close together, this inevitably thrashes the data cache, which is slow.
If we split the loop instead:

ObjectA * objA;
ObjectB * objB;
for(int i = 0; i < 100; ++i) {
    //  Do something with objA
}
for(int i = 0; i < 100; ++i) {
    //  Do something else with objA
}
for(int i = 0; i < 100; ++i) {
    //  Do something with objB
}

then the instruction cache and the data cache will both be much happier, and the code will naturally run a bit faster.
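Here is a compilable sketch of the split version. ObjectA, ObjectB, and their members are hypothetical placeholders for illustration:

```cpp
// Hypothetical objects with fixed-size payloads.
struct ObjectA { int a[100]; int b[100]; };
struct ObjectB { int c[100]; };

// Loop fission: each pass streams over one contiguous array, so the
// data cache holds one small working set at a time and each loop
// body easily fits in the instruction cache.
void process(ObjectA *objA, ObjectB *objB)
{
    for (int i = 0; i < 100; ++i) objA->a[i] += 1;  // "do something with objA"
    for (int i = 0; i < 100; ++i) objA->b[i] += 2;  // "do something else with objA"
    for (int i = 0; i < 100; ++i) objB->c[i] += 3;  // "do something with objB"
}
```

The split only pays off when the passes are independent; if the second pass needs results of the first for the same index, fission changes nothing about correctness here but may for other code, so check the dependencies first.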

Summary:
Although the techniques above can fail, that does not mean they are wrong; in many cases they still work. Optimization around the processor cache involves a great deal of uncertainty and has no fixed rules, so when it matters, test carefully on the concrete case to find out which method is actually faster.
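A minimal timing harness for such tests might look like this. This is only a sketch; a serious benchmark also needs warm-up runs, repetition, and care that the compiler does not optimize the measured work away:

```cpp
#include <chrono>

// Run f() 'iterations' times and return the elapsed wall time in ms.
template <typename F>
double timeMs(F f, int iterations)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Writing each candidate's result to a volatile variable inside f is one simple way to keep the work from being optimized away.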
