Suppose you are asked to generate substrings of a large string with a defined length and a defined overlap between consecutive substrings. The following Python 3 script does this.
Import the libraries
import numpy as np
import sys
Collect the user input as three command-line arguments separated by spaces. The first argument is the input string, the second is the length of each substring, and the third is the size of the overlap window (the number of characters shared by consecutive substrings).
string = str(sys.argv[1])
Len = int(sys.argv[2])
Wind = int(sys.argv[3])
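Reading `sys.argv` directly raises an `IndexError` or `ValueError` when an argument is missing or is not an integer. As an alternative, the same three arguments could be collected with the standard-library `argparse` module, which prints a usage message instead. This is only a minimal sketch; the function name `parse_args` and the argument names `string`, `length` and `overlap` are assumptions chosen for illustration and are not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the three positional arguments (names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Print overlapping substrings and their counts."
    )
    parser.add_argument("string", help="input string")
    parser.add_argument("length", type=int, help="length of each substring")
    parser.add_argument(
        "overlap", type=int,
        help="number of characters shared by consecutive substrings",
    )
    # With argv=None, argparse reads the real command line (sys.argv[1:]).
    return parser.parse_args(argv)


# Passing an explicit list makes the sketch runnable without a command line.
args = parse_args(["mystring", "4", "2"])
print(args.string, args.length, args.overlap)
```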
If the user enters an overlap window larger than or equal to the substring length, the step between substrings (Len - Wind) would be zero or negative, so the script stops.
if Wind >= Len:
    print("Substring length is smaller than or equal to the overlap window")
    sys.exit()
Now, let us generate the start and end positions of the substrings in the input string. Consecutive start positions are Len - Wind characters apart.
Start = list(range(0, len(string) + 1 - Len, Len - Wind))
End = list(range(Len, len(string) + 1, Len - Wind))
Store output in a list named “test”
test = [string[i: j] for i, j in zip(Start, End)]
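The start and end positions above can also be folded into a single helper, since every end position is simply the start position plus the substring length. The following is a minimal sketch under that observation; the function name `overlapping_substrings` and its parameter names are hypothetical and do not appear in the original script.

```python
def overlapping_substrings(text, length, overlap):
    """Return substrings of `length` characters, where consecutive
    substrings share `overlap` characters."""
    if overlap >= length:
        raise ValueError("overlap must be smaller than the substring length")
    step = length - overlap  # distance between consecutive start positions
    # The last valid start position is len(text) - length.
    return [text[i:i + length] for i in range(0, len(text) - length + 1, step)]


print(overlapping_substrings("mystring", 4, 2))  # ['myst', 'stri', 'ring']
```

As in the script, trailing characters that do not fill a complete substring are dropped: with a length of 5 and an overlap of 3, "mystring" yields only "mystr" and "strin", and the final "g" is not covered.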
Now, count the unique substrings in the list and print each substring with its number of occurrences.
values, counts = np.unique(test, return_counts=True)
for i, j in zip(values, counts):
    print(i, "\t", j)
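If NumPy is not available, the same counting step could be done with `collections.Counter` from the standard library. This is a sketch of an alternative, not the method used in the script; the list `substrings` below is a hypothetical sample that contains a repeated entry so the counts are visible.

```python
from collections import Counter

# Hypothetical sample list; in the script this would be the list named "test".
substrings = ["myst", "stri", "ring", "myst"]

counts = Counter(substrings)

# Sort the keys so the output is alphabetical, as it is with np.unique.
for substring in sorted(counts):
    print(substring, counts[substring], sep="\t")
```

With this sample list, "myst" is reported twice and the other two substrings once each.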
Save the script as **print_overlaps.py**.
Run it from the command line as in the following example:
python3 print_overlaps.py mystring 3 2
Note: the output is sorted alphabetically.
The output looks like this:
> python3 print_overlaps.py mystring 4 2
myst 1
ring 1
stri 1
Another example:
> python3 print_overlaps.py mystring 5 3
mystr 1
strin 1