Archive for July, 2018

fold / unfold URLs

Problem
When you visit a gallery, very often the URLs follow a pattern. For instance:
http://www.website.com/001.jpg, http://www.website.com/002.jpg, …, http://www.website.com/030.jpg. There is a sequence: [001-030]. Thus, these URLs can be represented in a compact way: http://www.website.com/ [001-030].jpg (without space). I call it a sequence URL.
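To make the notation concrete, here is a minimal sketch in plain Python of what the sequence [001-030] stands for. The host and the three-digit padding are simply the ones from the example above; nothing here depends on any library.

```python
# The sequence [001-030] means: the numbers 1..30, zero-padded to 3 digits.
urls = [f"http://www.website.com/{i:03d}.jpg" for i in range(1, 31)]

print(urls[0])    # first gallery image
print(urls[-1])   # last gallery image
print(len(urls))  # number of images in the sequence
```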

There are two challenges here:

  1. Given a sequence URL, restore all the URLs. We can call it unpacking / unfolding.
  2. The opposite: given a list of URLs (that follow a pattern), compress them into a sequence URL. We can call it folding.

I ran into this problem while working with URLs, but it generalizes to arbitrary strings.

Unfolding
I wrote an algorithm for this (see later), but afterwards I found a module that does it better. I asked about it on Reddit and got a very good answer, which suggested the ClusterShell project. ClusterShell was made for administering Linux clusters. We have nothing to do with Linux clusters here, but it contains an implementation of string folding / unfolding that we can reuse.

Installation is trivial: “pip install clustershell”.

Then, I made a wrapper function for unfolding:

from ClusterShell.NodeSet import NodeSet

def unfold_sequence_url(text):
    """
    From a sequence URL restore all the URLs (unpack, unfold).

    Input: "node[1-3]"
    Output: ["node1", "node2", "node3"]
    """
    # Create a new nodeset from string
    nodeset = NodeSet(text)
    res = [str(node) for node in nodeset]
    return res

Folding

Here is another wrapper function for folding:

from ClusterShell.NodeSet import NodeSet

def fold_urls(lst):
    """
    Now the input is a list of URLs
    that we want to compress (fold) to a sequence URL.

    Example:
    Input: ["node1", "node2", "node3"]
    Output: "node[1-3]"
    """
    res = NodeSet.fromlist(lst)    # it's a ClusterShell.NodeSet.NodeSet object
    return str(res)
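If ClusterShell is not available, folding can also be approximated with the standard library. The following is only a minimal sketch, not how ClusterShell does it: fold_urls_simple is a hypothetical helper, and it assumes that the URLs differ only in their last run of digits, that all numbers have the same width, and that they form one contiguous range. In every other case it gives up and returns None.

```python
import re

# prefix, last run of digits, digit-free suffix
_PARTS = re.compile(r"^(.*?)(\d+)(\D*)$")


def fold_urls_simple(urls):
    """Fold URLs into one sequence URL, or return None if that is not possible.

    Hypothetical helper (not part of ClusterShell). Assumes the URLs differ
    only in their last numeric run and that all numbers have equal width.
    """
    parsed = []
    for url in urls:
        m = _PARTS.match(url)
        if m is None:          # no digits at all in this URL
            return None
        parsed.append(m.groups())

    prefixes = {p for p, _, _ in parsed}
    suffixes = {s for _, _, s in parsed}
    widths = {len(n) for _, n, _ in parsed}
    if len(prefixes) != 1 or len(suffixes) != 1 or len(widths) != 1:
        return None            # also covers the empty input list

    numbers = sorted(int(n) for _, n, _ in parsed)
    if numbers != list(range(numbers[0], numbers[-1] + 1)):
        return None            # gaps or duplicates: not a single range

    prefix, suffix, width = prefixes.pop(), suffixes.pop(), widths.pop()
    first = str(numbers[0]).zfill(width)
    last = str(numbers[-1]).zfill(width)
    return f"{prefix}[{first}-{last}]{suffix}"
```

Note that a suffix containing digits (for example ".mp3") breaks the "last run of digits" assumption, so this sketch returns None for such URLs. That is one reason to prefer a tested library.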

My own implementation (old)
Before I knew about ClusterShell, I naively implemented the unfolding myself. I include it here, but I suggest using ClusterShell instead (see above).

#!/usr/bin/env python3

"""
Unpack a sequence URL.

How it works:

First Gallery Image: http://www.website.com/001.jpg
Last Gallery Image: http://www.website.com/030.jpg
Sequence: [001-030]
Sequence URL: http://www.website.com/[001-030].jpg

From the sequence URL we restore the complete list of URLs.
"""

import re

import logging as log    # stand-in for the author's own jive.mylogging module


def is_valid_sequence_url(url, verbose=True):
    lst = re.findall(r"\[(.+?)-(.+?)\]", url)    # raw string: avoids invalid-escape warnings
    # print(lst)
    if len(lst) == 0:
        if verbose: log.warning(f"no sequence was found in {url}")
        return False
    if len(lst) > 1:
        if verbose: log.warning(f"several sequences were found in {url}, which is not supported")
        return False
    # else, if len(lst) == 1
    return True
        

def get_urls_from_sequence_url(url, statusbar=None):
    res = []

    if not is_valid_sequence_url(url):
        return []

    m = re.search(r"\[(.+?)-(.+?)\]", url)    # raw string: avoids invalid-escape warnings
    if m:
        start = m.group(1)
        end = m.group(2)

        prefix = url[:url.find('[')]
        postfix = url[url.find(']')+1:]

        zfill = start.startswith('0') or end.startswith('0')

        # print(url)
        # print(prefix)
        # print(postfix)

        if zfill and (len(start) != len(end)):
            log.warning(f"start and end sequences in {url} must have the same lengths if they are zero-filled")
            return []
        # else
        length = len(start)
        if start.isdigit() and end.isdigit():
            start = int(start)
            end = int(end)
            for i in range(start, end+1):
                middle = i
                if zfill:
                    middle = str(i).zfill(length)
                curr = f"{prefix}{middle}{postfix}"
                res.append(curr)
            # endfor
        # endif
    # endif

    return res

##############################################################################

if __name__ == "__main__":
    url = "http://www.website.com/[001-030].jpg"    # for testing

    urls = get_urls_from_sequence_url(url)
    for url in urls:
        print(url)
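The implementation above rejects URLs that contain more than one sequence. As a sketch of one way that limit could be lifted, the hypothetical helper unfold_all below expands every [start-end] group and combines them with itertools.product. It is an illustration under simplifying assumptions, not the original code: it only accepts numeric bounds, and when either bound starts with a zero it pads to the width of the start value instead of insisting that both bounds have the same length.

```python
import itertools
import re

SEQ = re.compile(r"\[(\d+)-(\d+)\]")


def expand_range(start, end):
    """Expand "001", "003" to ["001", "002", "003"], keeping zero padding."""
    zero_filled = start.startswith("0") or end.startswith("0")
    width = len(start) if zero_filled else 0    # zfill(0) leaves the text as is
    return [str(i).zfill(width) for i in range(int(start), int(end) + 1)]


def unfold_all(url):
    """Expand every [start-end] group in url (hypothetical helper)."""
    # With two capture groups, split() returns
    # [text, start, end, text, start, end, ..., text]
    parts = SEQ.split(url)
    literals = parts[0::3]
    ranges = [expand_range(s, e) for s, e in zip(parts[1::3], parts[2::3])]

    urls = []
    for combo in itertools.product(*ranges):
        pieces = [literals[0]]
        for value, literal in zip(combo, literals[1:]):
            pieces.append(value)
            pieces.append(literal)
        urls.append("".join(pieces))
    return urls


if __name__ == "__main__":
    for u in unfold_all("http://www.website.com/[1-2]/[01-02].jpg"):
        print(u)
```

A URL without any sequence comes back unchanged as a one-element list, because the product over zero ranges yields exactly one empty combination.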

Update

It turned out that ClusterShell doesn’t install on Windows. However, I was able to extract the part of it that does the (un)folding. Read this ticket for more info. The extracted part works on Windows too.