Home > python > fold / unfold URLs

fold / unfold URLs

Problem
When you visit a gallery, very often the URLs follow a pattern. For instance:
http://www.website.com/001.jpg, http://www.website.com/002.jpg, …, http://www.website.com/030.jpg. There is a sequence: [001-030]. Thus, these URLs can be represented in a compact way: http://www.website.com/ [001-030].jpg (without space). I call it a sequence URL.

There are two challenges here:

  1. Having a sequence URL, restore all the URLs. We can call it unpacking / unfolding.
  2. The opposite of the previous: having a list of URLs (that follow a pattern), compress them to a sequence URL. We can call it folding.

I met this challenge when I was working with URLs but it can be generalized to strings.

Unfolding
I wrote an algorithm for this (see later) but later I found a module that could do it better. I posed my question on Reddit and got a very good answer (see here). It was suggested that I should use the ClusterShell project. This project was made for administrating Linux clusters. We have nothing to do with Linux clusters, but it contains an implementation of string folding / unfolding that we can re-use here.

Installation is trivial: “pip install clustershell“.

Then, I made a wrapper function for unfolding:

from ClusterShell.NodeSet import NodeSet

def unfold_sequence_url(text):
    """
    From a sequence URL restore all the URLs (unpack, unfold).

    Input: "node[1-3]"
    Output: ["node1", "node2", "node3"]
    """
    # Create a new nodeset from string
    nodeset = NodeSet(text)
    res = [str(node) for node in nodeset]
    return res

Folding

Here is another wrapper function for folding:

from ClusterShell.NodeSet import NodeSet

def fold_urls(lst):
    """
    Now the input is a list of URLs
    that we want to compress (fold) to a sequence URL.

    Example:
    Input: ["node1", "node2", "node3"]
    Output: "node[1-3]"
    """
    res = NodeSet.fromlist(lst)    # it's a ClusterShell.NodeSet.NodeSet object
    return str(res)

My own implementation (old)
Naively, I implemented the unfolding since I didn’t know about ClusterShell. I put it here, but I suggest you should use ClusterShell (see above).

#!/usr/bin/env python3

"""
Unpack a sequence URL.

How it works:

First Gallery Image: http://www.website.com/001.jpg
Last Gallery Image: http://www.website.com/030.jpg
Sequence: [001-030]
Sequence URL: http://www.website.com/[001-030].jpg

From the sequence URL we restore the complete list of URLs.
"""

import re

from jive import mylogging as log


def is_valid_sequence_url(url, verbose=True):
    lst = re.findall("\[(.+?)-(.+?)\]", url)
    # print(lst)
    if len(lst) == 0:
        if verbose: log.warning(f"no sequence was found in {url}")
        return False
    if len(lst) > 1:
        if verbose: log.warning(f"several sequences were found in {url} , which is not supported")
        return False
    # else, if len(lst) == 1
    return True
        

def get_urls_from_sequence_url(url, statusbar=None):
    res = []

    if not is_valid_sequence_url(url):
        return []

    m = re.search("\[(.+?)-(.+?)\]", url)
    if m:
        start = m.group(1)
        end = m.group(2)

        prefix = url[:url.find('[')]
        postfix = url[url.find(']')+1:]

        zfill = start.startswith('0') or end.startswith('0')

        # print(url)
        # print(prefix)
        # print(postfix)

        if zfill and (len(start) != len(end)):
            log.warning(f"start and end sequences in {url} must have the same lengths if they are zero-filled")
            return []
        # else
        length = len(start)
        if start.isdigit() and end.isdigit():
            start = int(start)
            end = int(end)
            for i in range(start, end+1):
                middle = i
                if zfill:
                    middle = str(i).zfill(length)
                curr = f"{prefix}{middle}{postfix}"
                res.append(curr)
            # endfor
        # endif
    # endif

    return res

##############################################################################

if __name__ == "__main__":
    url = "http://www.website.com/[001-030].jpg"    # for testing

    urls = get_urls_from_sequence_url(url)
    for url in urls:
        print(url)

Links

Update

It turned out that ClusterShell doesn’t install on Windows. However, I could extract that part of it which does the (un)folding. Read this ticket for more info. The extracted part works on Windows too.

Advertisements
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s

%d bloggers like this: