Your data is not really your data

I have been posting links on LinkedIn once a day for almost a year and a half. I share links that are interesting, informative, maybe educational in a way. I used this as a way to save a stash of information, with the intent of keeping the material and possibly getting back to it when I'm looking for it. The fantastic GDPR legislation, which so many denigrate, requires any provider of services such as websites and social networks of any form to give users a way to download, amend, and eventually delete their own data from such sites.

So, periodically, I downloaded my data from LinkedIn and retrieved a nice CSV with the entire history of my contextually titled list of links.

After some time, I noticed that this list of links was actually shown via some sort of URL shortener, or landing page of sorts. Basically, a list of links all looking like this:

https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A6809869424712065024
https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A6809625393314746368
https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A6809408799674306560
https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A6809123948245106688

I simply thought that was infuriating: where the hell are my links?
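The last path segment of those URLs is percent-encoded (%3A stands for a colon), and once decoded it is only an internal identifier, not the address I shared. A minimal sketch of that decoding, where decode_urn is a hypothetical helper and not part of the scripts in this post:

```bash
#!/bin/bash
# decode_urn is a hypothetical helper, introduced only for this example.
# It prints the identifier hidden in a LinkedIn update URL.
decode_urn()
{
  local last="${1##*/}"            # keep only the last path segment
  printf '%s\n' "${last//%3A/:}"   # %3A is the percent-encoding of ':'
}

decode_urn "https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A6809869424712065024"
# prints: urn:li:share:6809869424712065024
```

Nothing in that identifier leads back to the original article, which is why the pages themselves have to be fetched.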

Rather than complaining, I just wanted to download the pages containing my articles, seriously. Given the file with the "links", that's what I did:

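#! /bin/bash
# Download every page listed in the input file (one URL per line).

FILEIN=$1
archive=~/.archive/articles

test -d "${archive}" || mkdir -p "${archive}"

while read -r x; do
  [ -n "${x}" ] || continue   # skip empty lines
  # The file name is the SHA-1 of the URL (including echo's trailing newline)
  h=$(echo "${x}" | sha1sum | awk '{print $1}')
  echo "[$h]"

  if [ ! -f "${archive}/${h}" ]; then
    echo "processing $x"
    curl -s "${x}" > "${archive}/${h}"
  else
    echo "Skipping: ${x}"
  fi

done < "${FILEIN}"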
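Since the file names are SHA-1 hashes of the URLs, finding the saved page for a given link means recomputing the hash in exactly the same way. A small sketch, where archive_path is a hypothetical helper that is not part of the script above:

```bash
#!/bin/bash
# archive_path is a hypothetical helper, introduced only for this example.
# It prints where the download script stores the page for a given URL.
archive=~/.archive/articles

archive_path()
{
  local h
  # echo appends a newline, so the hash covers "URL\n", as in the download script
  h=$(echo "$1" | sha1sum | awk '{print $1}')
  printf '%s\n' "${archive}/${h}"
}

archive_path "https://www.linkedin.com/feed/update/urn%3Ali%3Ashare%3A6809869424712065024"
```

The trailing newline matters: hashing the URL with printf '%s' instead of echo gives a different digest, and the lookup would miss the file.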

This produced a directory in my home with a bunch of HTML pieces, each ultimately containing one of my links, one per file:

ls ~/.archive/articles/
303ad17b154a005428ec3d6d60c6a58068c9e59b  641d858e6e1a8579d9c6e8a1659205686919b8ae  984fe42642d945889fed730a7aa7b36481147106  cb3b203d213786bc2f588457f006ae699b09e646  fa8a07765b8681fb465df491ca4af272cd073c4a
303dc9a3e9ff2bb9df86899d402f1cb5eaf8eaa1  64707d545989e0499398c795886e056315689a76  986bfb6478b810c7bc3b9b92e2c057c79021583a  cb73f59931619380d74b721e301fb7769b0cf098  fa91eb78dd73169dccfd80563d9736d80dcaf698
3042e54118a69069ba905e8cad6184884e2681a2  64e3b987fb5a3bcd7f267f954ecfdb1eba8a2aa8  9884c31b2a8bb5a5b58c29fefe46b1a8dd528a79  cbc64f833c757b35dca17f54c8e30a31b7061c54  faee568a034640ee7d83355864e64e60357223e3
30687ee40edda03f419618207ed9909018d5e962

And so on... But what about having a link db, with a description, something that you can actually query, like:

CREATE TABLE link(id char(40) NOT NULL PRIMARY KEY, url text, description text);

Easily done:

#! /bin/bash

prog_name="archive"
archive_db=~/.${prog_name}/archive.db

# This script calls sqlite3 and sha1sum (uuidgen is not needed here)
DEPENDENCIES="sqlite3 sha1sum"

sqlite_cmd="sqlite3 ${archive_db} "
dependencies_check()
{
  for d in $DEPENDENCIES; do
    CHK=$(which $d)
    if [ "$CHK" = "" ]; then
      echo "missing deps: [$d]";
      exit 1
    fi
  done
}

function write_url()
{
  id_link=$1
  url=$2
  description=$3

  # Double any single quote so it cannot end the SQL string literal early
  q="'"
  sql="INSERT INTO link(id, url, description) VALUES ('${id_link}','${url//$q/$q$q}','${description//$q/$q$q}')"
  ${sqlite_cmd} "${sql}"
}

function create_db()
{
  sql="CREATE TABLE IF NOT EXISTS link(id char(40) NOT NULL PRIMARY KEY, url text, description text);"
  ${sqlite_cmd} "${sql}"
}

# ${archive} is not defined in this script: make sure the db directory exists instead
mkdir -p "$(dirname "${archive_db}")"

dependencies_check
test -f ${archive_db} || create_db

id=""
url=""
description=""

if [ "$#" = "1" ]; then
  url=$1
else
  echo "Insert URL:"
  read url
fi

id=$(echo ${url} | sha1sum | awk '{print $1}')
echo "Insert Description:"
read description
write_url "${id}" "${url}" "${description}"
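Descriptions typed by hand often contain an apostrophe, which would end an SQL string literal early unless it is doubled. A minimal sketch of that escaping in isolation; sql_quote is a hypothetical helper, and the id and URL are placeholders:

```bash
#!/bin/bash
# sql_quote is a hypothetical helper, introduced only for this example.
# It doubles single quotes so a value can sit inside an SQL string literal.
sql_quote()
{
  local q="'"
  printf '%s\n' "${1//$q/$q$q}"
}

description="Why it's not really your data"
sql="INSERT INTO link(id, url, description) VALUES ('id-1','https://example.com/','$(sql_quote "${description}")');"
echo "${sql}"
# prints: INSERT INTO link(id, url, description) VALUES ('id-1','https://example.com/','Why it''s not really your data');
```

The printed statement can then be handed to sqlite3 the same way the script does. This covers quotes only; it is not a general defence against hostile input.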

And maybe you want to add information related to the link, some metadata, like:

CREATE TABLE meta_link(id char(40) NOT NULL PRIMARY KEY, id_link char(40), channel char(12));

Well, let us do it:

#! /bin/bash

FILEIN=$1

prog_name="archive"
archive=~/.${prog_name}/articles
archive_db=~/.${prog_name}/archive.db

# write_data calls uuidgen, so it has to be checked as well
DEPENDENCIES="sqlite3 uuidgen"

dependencies_check()
{
  for d in $DEPENDENCIES; do
    CHK=$(which $d)
    if [ "$CHK" = "" ]; then
      echo "missing deps: [$d]";
      exit 1
    fi
  done
}

function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }


function debug_data()
{
  for h in $(ls -1 ${archive}); do
    encoded=$(cat ${archive}/${h} | grep mini-card__title-link | sed -e s/.*href=\"//g | sed -e 's/&amp;.*//g' | sed -e 's/.*url=//g')
    title=$(cat ${archive}/${h} | grep og:title | sed -e s/.*LinkedIn:// | sed -e 's/">$//g')
    url=$(urldecode ${encoded} )
    echo "[${h}] ${title} ${url}"
  done
}

function write_data()
{
  echo -n "Writing "
  for h in $(ls -1 ${archive}); do
    encoded=$(cat ${archive}/${h} | grep mini-card__title-link | sed -e s/.*href=\"//g | sed -e 's/&amp;.*//g' | sed -e 's/.*url=//g')
    title=$(cat ${archive}/${h} | grep og:title | sed -e s/.*LinkedIn:// | sed -e 's/">$//g')
    url=$(urldecode ${encoded} )
    q="'"   # double single quotes so a title like "it's" does not break the SQL
    sql="INSERT INTO link(id, url, description) VALUES ('${h}','${url//$q/$q$q}','${title//$q/$q$q}')"
    ${sqlite_cmd} "${sql}"
    
    id=$(uuidgen)
    channel="linkedin"
    sql="INSERT INTO meta_link(id, id_link, channel) VALUES ('${id}','${h}','${channel}')"
    ${sqlite_cmd} "${sql}"

    echo -n "."
  done
}

function create_db()
{
  sql="CREATE TABLE IF NOT EXISTS link(id char(40) NOT NULL PRIMARY KEY, url text, description text);"
  ${sqlite_cmd} "${sql}"
  
  sql="CREATE TABLE IF NOT EXISTS meta_link(id char(40) NOT NULL PRIMARY KEY, id_link char(40), channel char(12));"
  ${sqlite_cmd} "${sql}"
}

sqlite_cmd="sqlite3 ${archive_db} "
test -d ${archive} || mkdir -p ${archive}

dependencies_check
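# Always run create_db: CREATE TABLE IF NOT EXISTS is harmless on an existing
# database, and meta_link is missing if the db was created by the previous script
create_db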
write_data
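To see what the grep and sed chain does, it helps to run it on a single line. The sketch below feeds it a made-up anchor tag; the HTML shape is an assumption inferred from the patterns in the script, and extract_url is a hypothetical wrapper around the same steps:

```bash
#!/bin/bash
# Same urldecode as in the script: '+' becomes a space, %XX becomes the byte XX
function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

# extract_url is a hypothetical wrapper, introduced only for this example.
# It reads HTML on stdin and prints the decoded target of the title link.
extract_url()
{
  local encoded
  encoded=$(grep 'mini-card__title-link' \
    | sed -e 's/.*href="//' -e 's/&amp;.*//' -e 's/.*url=//')
  urldecode "${encoded}"
}

# Made-up sample line; the real markup may differ
sample='<a class="mini-card__title-link" href="https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fexample.com%2Farticle&amp;urlhash=abcd">Title</a>'

printf '%s\n' "${sample}" | extract_url
# prints: https://example.com/article
```

The three sed expressions drop everything up to href=", then everything from the first &amp; onwards, then everything up to url=, leaving only the encoded address.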

Now I have a database of articles that I can tag:

#! /bin/bash

prog_name="archive"
archive_db=~/.${prog_name}/archive.db

DEPENDENCIES="sqlite3 uuidgen"

sqlite_cmd="sqlite3 ${archive_db} "
dependencies_check()
{
  for d in $DEPENDENCIES; do
    CHK=$(which $d)
    if [ "$CHK" = "" ]; then
      echo "missing deps: [$d]";
      exit 1
    fi
  done
}

function tag_url()
{
  id_link=$1
  channel=$2
  id=$(uuidgen)
  
  sql="INSERT INTO meta_link(id, id_link, channel) VALUES ('${id}','${id_link}','${channel}')"
  ${sqlite_cmd} "${sql}"
}

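# Tagging needs the database (and its meta_link table) created by the import script
test -f "${archive_db}" || { echo "missing database: ${archive_db}"; exit 1; }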

dependencies_check

id=""
url=""
description=""

if [ "$#" = "1" ]; then
  url=$1
else
  echo "Insert URL:"
  read url
fi

# Plain awk, without the gawk-only -e option, so the hash matches the other scripts everywhere
id_link=$(echo ${url} | sha1sum | awk '{print $1}')
echo "Insert Channel:"
read channel
tag_url "${id_link}" "${channel}"
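Once links are tagged, the two tables can be joined through id_link to list everything published on a channel. A sketch on a throwaway database with made-up rows; links_by_channel is a hypothetical helper:

```bash
#!/bin/bash
# links_by_channel is a hypothetical helper, introduced only for this example.
# $1 = database file, $2 = channel name
links_by_channel()
{
  sqlite3 "$1" "SELECT l.url, l.description
                FROM link l JOIN meta_link m ON m.id_link = l.id
                WHERE m.channel = '$2';"
}

# Throwaway database, so the real archive is left untouched
db=$(mktemp)
sqlite3 "${db}" <<'EOF'
CREATE TABLE link(id char(40) NOT NULL PRIMARY KEY, url text, description text);
CREATE TABLE meta_link(id char(40) NOT NULL PRIMARY KEY, id_link char(40), channel char(12));
INSERT INTO link VALUES ('id-1','https://example.com/a','First article');
INSERT INTO link VALUES ('id-2','https://example.com/b','Second article');
INSERT INTO meta_link VALUES ('m-1','id-1','linkedin');
EOF

links_by_channel "${db}" "linkedin"
# prints: https://example.com/a|First article
rm -f "${db}"
```

A link can carry several tags, since each tag is its own row in meta_link with its own id.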

And I can retrieve the URLs that I saved for some reason but that are still untagged:

#! /bin/bash

prog_name="archive"
archive_db=~/.${prog_name}/archive.db

# Only sqlite3 is called by this script
DEPENDENCIES="sqlite3"
sqlite_cmd="sqlite3 ${archive_db} "

dependencies_check()
{
  for d in $DEPENDENCIES; do
    CHK=$(which $d)
    if [ "$CHK" = "" ]; then
      echo "missing deps: [$d]";
      exit 1
    fi
  done
}

function write_url()
{
  id_link=$1
  url=$2
  description=$3

  sql="INSERT INTO link(id, url, description) VALUES ('${id_link}','${url}','${description}')"
  ${sqlite_cmd} "${sql}"
}

function dump_unpublished()
{
  sql="select * from link where id not in (select id_link from meta_link);"
  ${sqlite_cmd} "${sql}"
}


dependencies_check
dump_unpublished
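One caveat with NOT IN: if any id_link in meta_link is NULL, the subquery form returns no rows at all. A LEFT JOIN gives the same list without that trap. A sketch on a throwaway database with made-up rows; untagged_urls is a hypothetical helper:

```bash
#!/bin/bash
# untagged_urls is a hypothetical helper, introduced only for this example.
# $1 = database file; prints the URL of every link that has no tag.
untagged_urls()
{
  sqlite3 "$1" "SELECT l.url FROM link l
                LEFT JOIN meta_link m ON m.id_link = l.id
                WHERE m.id_link IS NULL;"
}

# Throwaway database, so the real archive is left untouched
db=$(mktemp)
sqlite3 "${db}" <<'EOF'
CREATE TABLE link(id char(40) NOT NULL PRIMARY KEY, url text, description text);
CREATE TABLE meta_link(id char(40) NOT NULL PRIMARY KEY, id_link char(40), channel char(12));
INSERT INTO link VALUES ('id-1','https://example.com/a','First article');
INSERT INTO link VALUES ('id-2','https://example.com/b','Second article');
INSERT INTO meta_link VALUES ('m-1','id-1','linkedin');
INSERT INTO meta_link VALUES ('m-2',NULL,'orphan');
EOF

untagged_urls "${db}"
# prints: https://example.com/b
rm -f "${db}"
```

With the orphan row in place, the NOT IN version of the query prints nothing, while the join still reports the second link.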

Paolo Lulli 2021

[gdpr]