
Python Memory Footprint

This article applies to Python 2.7 64-bit (32-bit builds and Python 3 may differ).

Edit: I added simpler estimate formulas alongside the exact formulas, so the Python memory footprint can be visualized quickly.

Edit 2: 32-bit Python seems to use around half the memory, which appears to be due to the use of 32-bit pointers instead of 64-bit ones. That said, you are then limited to 2GB of addressable memory.

Some developers are unaware of Python's memory footprint and tend to hit walls, especially when they try to load big data into memory instead of using efficient cache-oblivious algorithms and data structures.

This post demonstrates the memory footprint of basic Python objects and data structures. You can use this data to estimate how much memory your program will need, or to lay your program out better when memory starts to run out. The data was collected using the Python heap profiler Guppy-PE.
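
If you want to reproduce these numbers yourself, the sketch below shows the basic setup (assuming Guppy-PE is installed, e.g. via pip install guppy); the standard library's sys.getsizeof is a handy cross-check, though the two tools can disagree by a few bytes of allocator rounding.

    import sys
    import guppy

    hp = guppy.hpy()

    def footprint(obj):
        # Guppy reports the total allocated size of the isolated object set.
        print hp.iso(obj)
        # sys.getsizeof reports the object's own size in bytes (Python >= 2.6).
        print sys.getsizeof(obj)

    footprint(1)       # int: 24 bytes
    footprint('abcd')  # str: 48 bytes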

  1. Boolean and Numerical Types
  2. Strings
  3. Data Structures (Lists, Tuples, Dicts)

Boolean and Numerical Types

  • Boolean (bool): 24 bytes
    import guppy
    hp = guppy.hpy()
    
    In : hp.iso(True)
    Out: Partition of a set of 1 object. Total size = 24 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       24 100        24 100 bool
    
  • Integers (int): 24 bytes
    In : hp.iso(1)
    Out: Partition of a set of 1 object. Total size = 24 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       24 100        24 100 int
    
    In : hp.iso(2**62)
    Out: Partition of a set of 1 object. Total size = 24 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       24 100        24 100 int
  • Long Integers (long): 32 + int(math.log(NUMBER, 2) / 60) * 8 bytes (long(0) is a special case at 24 bytes; see the verification sketch after this list)
    In : hp.iso(long(0))
    Out: Partition of a set of 1 object. Total size = 24 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       24 100        24 100 long
    
    In : hp.iso(long(2**60))
    Out: Partition of a set of 1 object. Total size = 40 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       40 100        40 100 long
    
    In : hp.iso(2**120)
    Out: Partition of a set of 1 object. Total size = 48 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       48 100        48 100 long
    
    In : hp.iso(2**180)
    Out: Partition of a set of 1 object. Total size = 56 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       56 100        56 100 long
    
    # size(n) ~= 32 + int(math.log(abs(n) or 1, 2) / 60) * 8
  • Float: 24 bytes
    In : hp.iso(1.0)
    Out: Partition of a set of 1 object. Total size = 24 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       24 100        24 100 float
    
    In : hp.iso(128301289308129083901231.09102783098192083091823089120839012)
    Out: Partition of a set of 1 object. Total size = 24 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       24 100        24 100 float
  • Decimal (decimal.Decimal): 80 bytes
    In : import decimal
    In : hp.iso(decimal.Decimal('1.0'))
    Out: Partition of a set of 1 object. Total size = 80 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       80 100        80 100 decimal.Decimal
    
    In : hp.iso(decimal.Decimal('128301289308129083901231.09102783098192083091823089120839012'))
    Out: Partition of a set of 1 object. Total size = 80 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       80 100        80 100 decimal.Decimal
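
As a quick sanity check on the long formula, here is a small sketch comparing the estimate against sys.getsizeof (absolute numbers may differ slightly from Guppy's, which rounds to allocator granularity, and long(0) falls below the formula's 32-byte base):

    import math
    import sys

    def long_size_estimate(n):
        # 32-byte base plus 8 bytes per additional 60-bit block;
        # abs(n) or 1 guards against math.log(0).
        return 32 + int(math.log(abs(n) or 1, 2) / 60) * 8

    for n in (0L, long(2**60), 2**120, 2**180):
        print n.bit_length(), long_size_estimate(n), sys.getsizeof(n)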

Strings

Note: in Python 2, concatenating strings with '+' (i.e. __add__) creates intermediate strings, which can grab much more memory than you actually need. The efficient way to build strings is the string join method or '%s' string formatting, as sketched below. Avoid building large strings with repeated '+'.
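
To illustrate the difference, a minimal sketch (the strings are illustrative):

    parts = ['chunk%d' % i for i in range(1000)]

    # Inefficient: every '+' allocates a new intermediate string.
    s = ''
    for p in parts:
        s = s + p

    # Efficient: join computes the final length and allocates once.
    s = ''.join(parts)

    # Also fine: '%s' formatting for a small, fixed number of pieces.
    s = '%s%s' % (parts[0], parts[1])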

A string starts at 40 bytes (which covers up to 3 characters); every additional 8 characters use another 8 bytes.

  • String: 40 + ((len(s) - 4) / 8 + 1) * 8 bytes ~= 40 + len(s)
    In : hp.iso('a'*3)
    Out: Partition of a set of 1 object. Total size = 40 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       40 100        40 100 str
    
    In : hp.iso('a'*4)
    Out: Partition of a set of 1 object. Total size = 48 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       48 100        48 100 str
    
    In : hp.iso('a'*12)
    Out: Partition of a set of 1 object. Total size = 56 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       56 100        56 100 str
    
    In : hp.iso('a'*20)
    Out: Partition of a set of 1 object. Total size = 64 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100       64 100        64 100 str
    
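A quick check of the string formula against the measured sizes above (a sketch; Python 2 floor division makes the formula hold for strings shorter than 4 characters as well):

    def str_size_estimate(s):
        # 40-byte header, payload rounded up to the next 8-byte block.
        return 40 + ((len(s) - 4) / 8 + 1) * 8

    for n in (3, 4, 12, 20):
        print n, str_size_estimate('a' * n)  # 40, 48, 56, 64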

Data Structures (Lists, Tuples, Dicts)

The following is just the structure's own memory usage, not that of its contents:

  • Tuple: 56 + 8 * len(t) bytes (verified in the sketch after these examples)
    In : hp.iso(tuple())
    Out: Partition of a set of 1 object. Total size = 56 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       56 100        56 100 tuple
    
    In : hp.iso(tuple(range(1)))
    Out: Partition of a set of 1 object. Total size = 64 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       64 100        64 100 tuple
    
    In : hp.iso(tuple(range(2)))
    Out: Partition of a set of 1 object. Total size = 72 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       72 100        72 100 tuple
    
    In : hp.iso(tuple(range(100)))
    Out: Partition of a set of 1 object. Total size = 856 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100      856 100       856 100 tuple
    
  • List: 72 + 64 * int(1 + (len(l) + 1) / 8) bytes ~= 72 + len(l) * 8 (an empty list is a special case at 72 bytes; see the sketch after these examples)
    In : hp.iso(list())
    Out: Partition of a set of 1 object. Total size = 72 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100       72 100        72 100 list
    
    In : hp.iso(list(range(1)))
    Out: Partition of a set of 1 object. Total size = 136 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100      136 100       136 100 list
    
    In : hp.iso(list(range(8)))
    Out: Partition of a set of 1 object. Total size = 200 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100      200 100       200 100 list
    
    In : hp.iso(list(range(16)))
    Out: Partition of a set of 1 object. Total size = 264 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100      264 100       264 100 list
    
  • Dictionary (dict): memory depends on the number of hash-table buckets; the details are below, and here is the pattern that seems to be exhibited. The first 5 elements fit in the initial 280 bytes. The next bucket holds up to 16 (2**4) more elements at about 48 bytes per element. The next holds 64 (2**6) more at 36 bytes per element, the next 256 (2**8) more at 36 bytes per element, the next 1024 (2**10) more at 36 bytes per element, and so on. I have not tried to come up with a formula for this one; feel free to solve it in the comments. An empirical sketch follows these examples.
    In : hp.iso(dict())
    In : hp.iso(dict([(x,None) for x in range(5)]))
    Out: Partition of a set of 1 object. Total size = 280 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100      280 100       280 100 dict (no owner)
    
    In : hp.iso(dict([(x,None) for x in range(6)]))
    In : hp.iso(dict([(x,None) for x in range(5 + 16)]))
    Out: Partition of a set of 1 object. Total size = 1048 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100     1048 100      1048 100 dict (no owner)
    
    In : hp.iso(dict([(x,None) for x in range(6 + 16)]))
    In : hp.iso(dict([(x,None) for x in range(5 + 16 + 64)]))
    Out: Partition of a set of 1 object. Total size = 3352 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100     3352 100      3352 100 dict (no owner)
    
    In : hp.iso(dict([(x,None) for x in range(6 + 16 + 64)]))
    In : hp.iso(dict([(x,None) for x in range(5 + 16 + 64 + 128)]))
    Out: Partition of a set of 1 object. Total size = 12568 bytes.
    Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
    0      1 100    12568 100     12568 100 dict (no owner)
    
    In : hp.iso(dict([(x,None) for x in range(6 + 16 + 64 + 128)]))
    In : hp.iso(dict([(x,None) for x in range(5 + 16 + 64 + 128 + 1024)]))
    Out: Partition of a set of 1 object. Total size = 49432 bytes.
     Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
         0      1 100    49432 100     49432 100 dict (no owner)
    
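To verify the tuple and list formulas above, a small sketch (estimates only; compare against hp.iso or sys.getsizeof on your build):

    def tuple_size_estimate(t):
        # Fixed 56-byte header plus one 8-byte pointer per element.
        return 56 + 8 * len(t)

    def list_size_estimate(l):
        # 72-byte header plus pointer storage allocated in 64-byte
        # (8-pointer) blocks; an empty list is a special case.
        if not l:
            return 72
        return 72 + 64 * int(1 + (len(l) + 1) / 8)

    for n in (0, 1, 2, 100):
        print 'tuple', n, tuple_size_estimate(tuple(range(n)))  # 56, 64, 72, 856

    for n in (0, 1, 8, 16):
        print 'list', n, list_size_estimate(list(range(n)))     # 72, 136, 200, 264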

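Since there is no closed-form dict formula above, here is an empirical sketch that maps out the resize steps by watching sys.getsizeof as a dict grows (absolute numbers may differ from Guppy's by allocator rounding):

    import sys

    prev = None
    for n in range(1500):
        d = dict((x, None) for x in range(n))
        size = sys.getsizeof(d)
        if size != prev:
            # Each printed line marks a point where the dict resized.
            print 'n=%d size=%d' % (n, size)
            prev = size
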
How much does each city Tweet?

At LeadSift, we do our own crawling. We're whitelisted by Twitter (yippee!), so we can download and access far more data for free than most others can. Yet, to create a globally scalable solution, we need to buy Twitter data. While setting a price point for our customers and estimating our funding requirements, we set out to find how many tweets are posted in some of the major cities we will index by the end of the year. Although we found a bunch of blogs comparing tweet rates, we were unable to find any actual numbers, which turned into a fun sub-project for us: estimating them by running our own experiments!


What we found was interesting: Toronto, with its larger population, has more tweets than San Francisco, well over half a million, and Halifax has a lot more tweets than Moncton (showing the Internet who's boss!), while New York is just graph-busting with about 3 million tweets posted every day. We also confirmed our suspicion that people post more tweets on workdays than on weekends. Here's what it looks like:


[Infogram: Tweets Per City]

Note: these are estimates only, based on users who have specified a location in their profile or have geo-location turned on in their tweets. Please feel free to contact us with any questions about this data or other data estimates you may want.


LeadSift – Twitter Report of Halifax and surrounding area

LeadSift has morphed a few times, each time into something better and more focused. The team that developed Twecan.com, an exploratory search engine (which definitely came before its time, over 3 years ago), now brings a new way of targeted marketing and lead generation, and it is completely automatic! That's not all that sets us apart: the SMART ranking algorithm, the result of months of hard research by the 4 nerdy co-founders with graduate degrees in Computer Science, ensures that the hottest leads get addressed first. Armed with invaluable advice from the best mentors in Atlantic Canada and an unsurpassable passion for an excellent product, LeadSift is the entrepreneurial venture to watch out for.

To give readers a little snippet of what we're up to, we compiled some analytics from the tweets we collected from Halifax and the surrounding area over a month…


From a sample of 11,000 Twitter accounts we downloaded, 2,300 were corporate, which means about 80% of all accounts are personal.


Of all the users, over 81.5% joined Twitter in the last 3 years, an increasing trend, with more users signing up every year!

Also, interestingly, the majority of users have posted around 1,000 messages, enough to implicitly extract profile information and preferences.


27.64% of all users use smartphones with geo-location enabled, and 44.92% have a website listed in their bio, which hints that most Twitter users are tech savvy and educated. Interestingly, only around 0.5% of users mark their posts as non-English.